Lost connection after reading 2147479552 bytes with sendfile

Hi!

I’ve bumped into a strange problem lately, and though I’ve found a
workaround I thought I’d post here to try and understand what the
underlying issue might be, in case it affects anyone else.

The short version is:
Running nginx 1.9.2 on Scientific Linux 6 with a
2.6.32-504.8.1.el6.x86_64 kernel, with nginx configured to serve a
large file (4GB) with sendfile enabled, the download hangs after
exactly 2147479552 bytes; after ~60s the connection is dropped
(presumably because of the default send_timeout).

Setting sendfile_max_chunk to 1G fixes the problem.
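The relevant bit of configuration ended up looking roughly like this
(just a sketch of the http block; 1024m being the 1G value mentioned
above):

http {
    sendfile on;
    sendfile_max_chunk 1024m;   # cap each sendfile() call at 1 GB
    ...
}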

Longer version:

Create a random 4G file:

mkdir /tmp/test
dd if=/dev/urandom of=/tmp/test/bigfile bs=1M count=4096

Basic NGINX configuration /tmp/nginx.conf:

worker_processes 1;
daemon off;

events {
    worker_connections 768;
}

http {
    sendfile on;

    access_log /tmp/access.log;
    error_log /tmp/error.log debug;

    server {
        listen 8080 default_server;
        location / {
            root /tmp/test/;
        }
    }
}

Nginx 1.9.2 was compiled by hand:

nginx -V
nginx version: nginx/1.9.2
built by gcc 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
built with OpenSSL 1.0.1e-fips 11 Feb 2013
TLS SNI support enabled
configure arguments: --prefix=/tmp/opt --conf-path=/etc/nginx/nginx.conf
--error-log-path=/tmp/error.log --http-log-path=/tmp/access.log
--pid-path=/tmp/nginx.pid
--http-client-body-temp-path=/var/cache/nginx/client_temp
--http-proxy-temp-path=/var/cache/nginx/proxy_temp
--http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp
--http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp
--http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx
--group=nginx --with-http_ssl_module --with-http_realip_module
--with-http_addition_module --with-http_sub_module
--with-http_dav_module --with-http_flv_module --with-http_mp4_module
--with-http_gunzip_module --with-http_gzip_static_module
--with-http_random_index_module --with-http_secure_link_module
--with-http_stub_status_module --with-http_auth_request_module
--with-mail --with-mail_ssl_module --with-file-aio --with-ipv6
--with-cc-opt='-O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'
--without-pcre --without-http_rewrite_module --with-debug

Run nginx with the above configuration (non-root user, hence all the
/tmp/ stuff above):

nginx -c /tmp/nginx.conf

On the same machine run a wget (auto-retry disabled):

$ wget http://localhost:8080/bigfile -O /dev/null -t 1
--2015-07-10 17:02:22--  http://localhost:8080/bigfile
Resolving localhost... 127.0.0.1, ::1
Connecting to localhost|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4294967296 (4.0G) [text/plain]
Saving to: “/dev/null”

49% [===================================================================================> ] 2,147,479,552 --.-K/s   in 96s

2015-07-10 17:03:58 (21.4 MB/s) - Connection closed at byte 2147479552.
Giving up.

The error log is visible at:

https://gist.githubusercontent.com/mathiasuk/9022d3458ef373db1775/raw/e483a69aae10805558fefc4eefb381e34bbbeaf9/error-sc6.5.log

I tried the same thing on Ubuntu 14.04.2 without any problem; in this
case, however, the sendfile system call always reads far fewer bytes at
a time (log at
https://gist.githubusercontent.com/mathiasuk/018e0462d8788288d0d3/raw/error-ubuntu-14.04.02.log):

$ grep sendfile /tmp/error-ubuntu-14.04.02.log|head
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: @0 2147479552
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: 3557719 of 2147479552 @0
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: @3557719 2147481257
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: 7792477 of 2147481257 @3557719
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: @11350196 2147483468
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: 1833524 of 2147483468 @11350196
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: @13183720 2147480856
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: 1899007 of 2147480856 @13183720
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: @15082727 2147482393
2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: 1964490 of 2147482393 @15082727

Whereas on the SL6 box sendfile tries to read 2147479552 in one go and
nginx then seems to stop there:

$ grep sendfile /tmp/error-sc6.5.log
2015/07/10 17:02:22 [debug] 22749#0: *1 sendfile: @0 2147479552
2015/07/10 17:02:22 [debug] 22749#0: *1 sendfile: 2147479552 of 2147479552 @0

I’m not sure if this is the right place to post this, please point me in
the right direction if not :)

Let me know if I can provide more information or details.

Cheers,
Mathias

Mathias Andre
Web & Advanced Projects Coordinator
ESO Education & Public Outreach Department
European Southern Observatory
Karl Schwarzschildstr. 2, D-85748 Garching, Germany
Phone: +49 (0)89 3200 6760, Office 246

“For every problem, there is a solution that is simple, neat, and
wrong.”
H. L. Mencken

Hello!

On Fri, Jul 10, 2015 at 06:01:30PM +0200, Mathias Andre wrote:

> Setting sendfile_max_chunk to 1G fixes the problem

[…]

> 2015/07/10 17:08:51 [debug] 15680#0: *1 sendfile: 1833524 of 2147483468 @11350196
> 2015/07/10 17:02:22 [debug] 22749#0: *1 sendfile: @0 2147479552
> 2015/07/10 17:02:22 [debug] 22749#0: *1 sendfile: 2147479552 of 2147479552 @0

The 2147479552 is a limit applied by default to allow sendfile()
to work with larger files on Linux up to 2.6.16 (see
src/os/unix/ngx_linux_sendfile_chain.c for some comments). You can see
the same limit on the first sendfile() call in the Ubuntu log as well.

The strange thing here is that on Scientific Linux 6 the call
pretends it sent all the bytes in a single non-blocking call.
This is not something nginx expects to ever happen, and this is what
causes the problem to appear. It would be interesting to dig further
to understand what causes this SL6 behaviour.

Using sendfile_max_chunk with some large value is a correct
workaround and expected to work fine.


Maxim D.
http://nginx.org/

Hi,

Thanks for the detailed reply!

> The 2147479552 is a limit applied by default to allow sendfile()
> to work with larger files on Linux up to 2.6.16 (see
> src/os/unix/ngx_linux_sendfile_chain.c for some comments). You can see
> the same limit on the first sendfile() call in the Ubuntu log as well.

Indeed, I had also seen a lot of references to this “magic” number
around, so I thought it might be related to it.
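(Incidentally, 2147479552 is 0x7ffff000, i.e. 2 GiB rounded down to a
4 KiB page boundary; a quick stand-alone check of the arithmetic, just
for illustration:)

#include <limits.h>
#include <stdio.h>

/* Illustration only: 2147479552 == INT_MAX rounded down to a 4096-byte
 * page boundary (0x7ffff000). */
int main(void)
{
    long page = 4096;   /* assumed page size */
    printf("%ld\n", (long) INT_MAX & ~(page - 1));   /* prints 2147479552 */
    return 0;
}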

> The strange thing here is that on Scientific Linux 6 the call
> pretends it sent all the bytes in a single non-blocking call.
> This is not something nginx expects to ever happen, and this is what
> causes the problem to appear. It would be interesting to dig further
> to understand what causes this SL6 behaviour.

OK, I did write a tiny test program to try and reproduce the problem on
the SL box: it tries to copy 4GB from an existing file in one sendfile
call:
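
Roughly, the test does something like the following (a sketch only; the
paths and the hard-coded size are placeholders, not the exact program):

/* Sketch of the test: copy the 4 GB file to an output file with a
 * single sendfile() call and print what it returns. */
#include <sys/sendfile.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int in = open("/tmp/test/bigfile", O_RDONLY);
    int out = open("/tmp/test/bigfile.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) {
        perror("open");
        return 1;
    }

    off_t offset = 0;
    size_t count = 4UL * 1024 * 1024 * 1024;   /* ask for the whole 4 GB at once */

    ssize_t sent = sendfile(out, in, &offset, count);
    if (sent < 0) {
        perror("sendfile");
        return 1;
    }

    /* on the SL6 box this prints 2147479552 */
    printf("sendfile returned %zd bytes\n", sent);

    close(in);
    close(out);
    return 0;
}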

As expected, the sendfile call returns 2147479552, and the output file
is indeed 2147479552 bytes long, so this seems to work.
Here’s the trace:

I wonder if this could be because my test uses an output file and not a
socket. I’ll try and investigate some more.

> Using sendfile_max_chunk with some large value is a correct
> workaround and expected to work fine.

Thanks!
Mathias

Hello!

On Wed, Jul 15, 2015 at 09:48:50AM +0200, Mathias Andre wrote:

> I wonder if this could be because my test uses an output file and not a
> socket. I'll try and investigate some more.

The question is “how this can legitimately happen on a
non-blocking socket”. The “socket” and “non-blocking” parts are
both important.

For sure this can happen on a file and/or blocking socket.
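
To exercise that case, the destination in the test has to be a
connected socket with O_NONBLOCK set, something along these lines
(a rough sketch; the path, address, and port are placeholders, and
something has to be listening on the other end, nc for example):

/* Rough sketch: point sendfile() at a non-blocking TCP socket instead
 * of a file. */
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int in = open("/tmp/test/bigfile", O_RDONLY);

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9090);                      /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (in < 0 || sock < 0
        || connect(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        perror("setup");
        return 1;
    }

    /* the important part: make the destination socket non-blocking */
    fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);

    off_t offset = 0;
    ssize_t sent = sendfile(sock, in, &offset, 2147479552UL);
    printf("sendfile returned %zd\n", sent);

    close(sock);
    close(in);
    return 0;
}

On a non-blocking socket this would normally report a partial write far
short of 2G, which is exactly what nginx expects.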


Maxim D.
http://nginx.org/