0.8.52 AIO problem

Brane_FSGraSSSSnar · October 7, 2010, 12:12pm

Hello!

I’m using nginx for serving static content. Nginx is serving from a
folder
containing millions of small files.

Becouse machine doesn’t have lots of ram (2G) and becouse access pattern
is
completely random VFS cache gets trashed.

That’s why i tried aio + directio.

Configuration:

location / {
aio on;
directio 1;
}

It seems that everything works fine, vmstat reports smaller cache size,
etc.
But number of opened filedescriptors on system grows and nginx status
module
reports more and more connections in state “writing”.

Nginx main error contains tons of familiar messages:

— snip —
2010/10/06 17:48:42 [alert] 1589#0: open socket #307 left in connection
128
2010/10/06 17:48:42 [alert] 1589#0: open socket #313 left in connection
131
2010/10/06 17:48:42 [alert] 1589#0: open socket #297 left in connection
132
2010/10/06 17:48:42 [alert] 1589#0: open socket #312 left in connection
133
2010/10/06 17:48:42 [alert] 1589#0: open socket #364 left in connection
134
2010/10/06 17:48:42 [alert] 1589#0: open socket #394 left in connection
136
2010/10/06 17:48:42 [alert] 1589#0: open socket #321 left in connection
138
2010/10/06 17:48:42 [alert] 1589#0: open socket #320 left in connection
139
2010/10/06 17:48:42 [alert] 1589#0: open socket #260 left in connection
141
2010/10/06 17:48:42 [alert] 1589#0: open socket #336 left in connection
145
2010/10/06 17:48:42 [alert] 1589#0: open socket #340 left in connection
148
2010/10/06 17:48:42 [alert] 1589#0: open socket #347 left in connection
149
— snip —

Am i doing something wrong?

nginx version: nginx/0.8.52
TLS SNI support enabled
configure arguments: --with-cc-opt=’-O2 -pipe -fomit-frame-pointer
-funroll-
loops’ --prefix=/usr --conf-path=/etc/nginx/nginx.conf --error-log-
path=/tmp/nginx.error --http-log-path=/tmp/nginx.access --pid-
path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --user=daemon –
group=daemon
–http-client-body-temp-path=/var/spool/nginx/client_body_temp –
http-proxy-temp-path=/var/spool/nginx/proxy_temp --http-fastcgi-temp-
path=/var/spool/nginx/fastcgi_temp --with-ipv6 --with-http_ssl_module
–with-
http_realip_module --with-http_addition_module --with-http_dav_module
–with-
http_flv_module --with-http_gzip_static_module
–with-http_random_index_module
–with-http_secure_link_module --with-http_stub_status_module
–with-mail –
with-mail_ssl_module --with-pcre --with-file-aio --add-
module=/export/tmp/pkgcompile/nginx/work/nginx-0.8.52/modules/nginx_mod_h264_streaming-2.2.7
–add-module=/export/tmp/pkgcompile/nginx/work/nginx-0.8.52/modules/ngx-
fancyindex --add-
module=/export/tmp/pkgcompile/nginx/work/nginx-0.8.52/modules/ngx_http_log_request_speed

Linux 2.6.35, x86_64, Ubuntu 10.04LTS.

Best regards, Brane

Brane_FSGraSSSSnar · October 7, 2010, 1:46pm

Hello!

On Thu, Oct 07, 2010 at 12:11:59PM +0200, Brane F. GraÄnar wrote:

Configuration:
Nginx main error contains tons of familiar messages:
2010/10/06 17:48:42 [alert] 1589#0: open socket #260 left in connection 141
loops’ --prefix=/usr --conf-path=/etc/nginx/nginx.conf --error-log-
–add-module=/export/tmp/pkgcompile/nginx/work/nginx-0.8.52/modules/ngx-
fancyindex --add-
module=/export/tmp/pkgcompile/nginx/work/nginx-0.8.52/modules/ngx_http_log_request_speed

Linux 2.6.35, x86_64, Ubuntu 10.04LTS.

You may want to try to reproduce the problem without third party
modules, but most likely it’s problem in nginx itself. AFAIK aio
support still has some socket leaks. Probably Igor knows more
details.

If you feel yourself brave enough and want to debug this problem -
you may recompile nginx --with-debug, configure debug logging,
enable core dumps and add “debug_points abort;” to your
nginx.conf. This will cause abort() to be called when nginx
detects open sockets on worker process shutdown (and hence core
dump), and will allow to carefully examine connections with open
sockets left.

See here for basic debugging instructions:

Maxim D.

Brane_FSGraSSSSnar · October 7, 2010, 2:22pm

On Thursday 07 of October 2010 13:44:58 Maxim D. wrote:

You may want to try to reproduce the problem without third party
modules, but most likely it’s problem in nginx itself. AFAIK aio
support still has some socket leaks. Probably Igor knows more
details.

Ok, deal, i will recompile nginx without 3rd party modules and with
debugging
option (i feel brave enough, hehe).

I’ll be back with results…

Thanks!

Best regards, Brane

Brane_FSGraSSSSnar · October 7, 2010, 4:29pm

On Thursday 07 of October 2010 13:44:58 Maxim D. wrote:

dump), and will allow to carefully examine connections with open
sockets left.

I recompiled nginx without 3rd party modules and with debug enabled.

nginx -V
— snip —
nginx version: nginx/0.8.52
TLS SNI support enabled
configure arguments: --with-cc-opt=’-O2 -pipe -fomit-frame-pointer
-funroll-loops -g’ --prefix=/usr --conf-path=/etc/nginx/nginx.conf
–error-log-
path=/tmp/nginx.error --http-log-path=/tmp/nginx.access
–pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock
–user=daemon --group=daemon --http-
client-body-temp-path=/var/spool/nginx/client_body_temp
–http-proxy-temp-path=/var/spool/nginx/proxy_temp --http-fastcgi-temp-
path=/var/spool/nginx/fastcgi_temp --with-ipv6 --with-http_ssl_module
–with-http_realip_module --with-http_addition_module
–with-http_dav_module --with-
http_flv_module --with-http_gzip_static_module
–with-http_random_index_module --with-http_secure_link_module
–with-http_stub_status_module --with-mail
–with-mail_ssl_module --with-pcre --with-file-aio --with-debug
— snip —

The problem persists. See attachments.

Nginx crashes/aborts with messages like “[alert] 15248#0: open socket
#491 left in connection 217” only if i reload nginx configuration (kill
-HUP).

Stacktrace:

— snip —
#0 0x00007f8687407a75 in raise () from /lib/libc.so.6
(gdb) bt
#0 0x00007f8687407a75 in raise () from /lib/libc.so.6
#1 0x00007f868740b5c0 in abort () from /lib/libc.so.6
#2 0x000000000042ae95 in ngx_debug_point () at
src/os/unix/ngx_process.c:564
#3 0x000000000042d07a in ngx_worker_process_exit (cycle=0x1265c40) at
src/os/unix/ngx_process_cycle.c:1039
#4 0x000000000042d15e in ngx_worker_process_cycle (cycle=0x1265c40,
data=) at src/os/unix/ngx_process_cycle.c:789
#5 0x000000000042b0a4 in ngx_spawn_process (cycle=0x1265c40,
proc=, data=, name=0x49b82e
“worker process”,
respawn=-3) at src/os/unix/ngx_process.c:196
#6 0x000000000042c40e in ngx_start_worker_processes (cycle=0x1265c40,
n=8, type=-3) at src/os/unix/ngx_process_cycle.c:355
#7 0x000000000042d6d5 in ngx_master_process_cycle (cycle=0x1265c40) at
src/os/unix/ngx_process_cycle.c:136
#8 0x00000000004070d7 in main (argc=, argv=) at src/core/nginx.c:401
— snip —

Btw, everything works ok if aio and directio are disabled.

Best regards, Brane

Brane_FSGraSSSSnar · October 7, 2010, 5:26pm

Hello!

On Thu, Oct 07, 2010 at 04:28:30PM +0200, Brane F. GraÄnar wrote:

detects open sockets on worker process shutdown (and hence core
path=/tmp/nginx.error --http-log-path=/tmp/nginx.access --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --user=daemon --group=daemon --http-
client-body-temp-path=/var/spool/nginx/client_body_temp --http-proxy-temp-path=/var/spool/nginx/proxy_temp --http-fastcgi-temp-
path=/var/spool/nginx/fastcgi_temp --with-ipv6 --with-http_ssl_module --with-http_realip_module --with-http_addition_module --with-http_dav_module --with-
http_flv_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_stub_status_module --with-mail
–with-mail_ssl_module --with-pcre --with-file-aio --with-debug
— snip —

The problem persists. See attachments.

Ok, that was expected.

Nginx crashes/aborts with messages like “[alert] 15248#0: open
socket #491 left in connection 217” only if i reload nginx
configuration (kill -HUP).

That’s expected too, as this check only happens on worker exit.

#5 0x000000000042b0a4 in ngx_spawn_process (cycle=0x1265c40, proc=, data=, name=0x49b82e “worker process”,
respawn=-3) at src/os/unix/ngx_process.c:196
#6 0x000000000042c40e in ngx_start_worker_processes (cycle=0x1265c40, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:355
#7 0x000000000042d6d5 in ngx_master_process_cycle (cycle=0x1265c40) at src/os/unix/ngx_process_cycle.c:136
#8 0x00000000004070d7 in main (argc=, argv=) at src/core/nginx.c:401
— snip —

Backtrace is mostly useless here as it shows trace to check, not
to the real problem. Though it shows that you may want to
recompile again with lower optimization to simplify debugging
(note this “” things). I usually debug
with -O0 as it minimizes compiler artifacts. For CPU-bound hosts
you may want to use -O1 instead.

As I already said, you have to examine connections with open
sockets left. Something like this in gdb should be usefull
(assuming your error message with connection 217 matches core):

p ngx_cycle->connections[217]
p ngx_cycle->connections[217].log->connection
p *((ngx_http_request_t *) ngx_cycle->connections[217].data)

In debug log records with with * should be for the
connection in question, where is one from 2nd gdb command
(log->connection).

And, BTW, you may want to show your config if you aren’t going to
debug it yourself but asking for help instead.

Maxim D.

Brane_FSGraSSSSnar · October 8, 2010, 8:22pm

Hello!

On Fri, Oct 08, 2010 at 06:45:08PM +0200, “Brane F. GraÄnar” wrote:

On 10/07/2010 05:25 PM, Maxim D. wrote:

Backtrace is mostly useless here as it shows trace to check, not
to the real problem. Though it shows that you may want to
recompile again with lower optimization to simplify debugging
(note this “” things). I usually debug
with -O0 as it minimizes compiler artifacts. For CPU-bound hosts
you may want to use -O1 instead.

Ok, nginx recompiled again

[…]

See attachments for all worker processes stacktraces.

As I already said, backtrace is useless. You have to provide
output of various gdb commands I wrote in previous message.

It should enough to examine only one core, but please keep it (and
coresponding binary) for future reference.

Main server error log (700KB):
http://frost.ath.cx/.tmp/main-err.log.gz

Vhost error log (2.3MB):
http://frost.ath.cx/.tmp/vhost-err.log.gz

It would be much better to have combined one to preserve order of
messages. I.e. just comment out error_log in vhost.

And, more importantly, I don’t see “open socket left” alerts in
these logs. Looks like they were omitted somehow.

Maxim D.

Brane_FSGraSSSSnar · October 8, 2010, 6:46pm

On 10/07/2010 05:25 PM, Maxim D. wrote:

Backtrace is mostly useless here as it shows trace to check, not
to the real problem. Though it shows that you may want to
recompile again with lower optimization to simplify debugging
(note this “” things). I usually debug
with -O0 as it minimizes compiler artifacts. For CPU-bound hosts
you may want to use -O1 instead.

Ok, nginx recompiled again

— snip —
nginx version: nginx/0.8.52
built by gcc 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
TLS SNI support enabled
configure arguments: --with-cc-opt=‘-O0 -g’ --prefix=/usr
–conf-path=/etc/nginx/nginx.conf --error-log-path=/tmp/nginx.error
–http-log-path=/tmp/nginx.access --pid-path=/var/run/nginx.pid
–lock-path=/var/run/nginx.lock --user=daemon --group=daemon
–http-client-body-temp-path=/var/spool/nginx/client_body_temp
–http-proxy-temp-path=/var/spool/nginx/proxy_temp
–http-fastcgi-temp-path=/var/spool/nginx/fastcgi_temp --with-ipv6
–with-http_ssl_module --with-http_realip_module
–with-http_addition_module --with-http_dav_module
–with-http_flv_module --with-http_gzip_static_module
–with-http_random_index_module --with-http_secure_link_module
–with-http_stub_status_module --with-mail --with-mail_ssl_module
–with-pcre --with-file-aio --with-debug
— snip —

Nginx configuration (only relevant parts, becouse only one vhost was
configured to use AIO):

— snip —
user www-data;
worker_processes 8;
worker_rlimit_nofile 100000;

error_log “/var/log/nginx/main-error.log” debug;
pid “/var/run/nginx.pid”;

debugging stuff…

#worker_rlimit_core 500M;
#working_directory “/tmp”;
#debug_points abort;

events {
worker_connections 8000;
}

server {
listen 80;

    server_name host.example.org;
    server_name_in_redirect off;

    # document root...
    root "/webroot/host.example.org";

    # gzip disabled. Nothing to compress here.
    gzip off;
    gzip_static off;

    location / {
            autoindex off;
            fancyindex off;
            fancyindex_exact_size off;

            # AIO
aio on;
directio 1;

# lots of rewrites...
# lots of rewrites...
# lots of rewrites...

# this is stupid, but webapp needs this
            error_page 404 =404 "/meta/blank.gif";

            # Expires headers...
            expires 2y;
    }

    log_not_found off;
    access_log "/logs/nginx/vhost-maps-host.example.com.access"

combined;
error_log “/logs/nginx/vhost-host.example.com.error” debug;
###################################################
}

— snip —

See attachments for all worker processes stacktraces.

Main server error log (700KB):
http://frost.ath.cx/.tmp/main-err.log.gz

Vhost error log (2.3MB):
http://frost.ath.cx/.tmp/vhost-err.log.gz

Hope this helps.

Best regards, Brane

Brane_FSGraSSSSnar · October 12, 2010, 10:15am

Hello!

On Mon, Oct 11, 2010 at 02:44:54PM +0200, Brane F. Gračnar wrote:

On Saturday 09 October 2010 04:46:41 Maxim D. wrote:

Forget, I was able to trace the problem (and reproduce it here)
from debug logs provided. Please try the attached patch.

Maxim D.

I can confirm that provided patch resolves issue.

Thank you for testing. You may want to apply other patches from
this patch series:

http://nginx.org/pipermail/nginx-devel/2010-October/thread.html#498

At least this one may affect your setup as well:

http://nginx.org/pipermail/nginx-devel/2010-October/000500.html

Others are mostly for FreeBSD with aio sendfile, not having them
shouldn’t cause any issues on Linux.

Maxim D.

Brane_FSGraSSSSnar · October 10, 2010, 3:43am

Hello!

On Fri, Oct 08, 2010 at 10:20:36PM +0400, Maxim D. wrote:

coresponding binary) for future reference.
And, more importantly, I don’t see “open socket left” alerts in
these logs. Looks like they were omitted somehow.

Forget, I was able to trace the problem (and reproduce it here)
from debug logs provided. Please try the attached patch.

Maxim D.