Lots of CLOSE_WAIT sockets, nginx+php (WordPress site)

Vicente_A · February 21, 2010, 11:20am

Hi

I have a WordPress-mu site (a couple of personal and friends’ blogs, very
light traffic) which I migrated some months ago from lighttpd+php-fcgi
to nginx+php-fcgi. Ever since the migration the site sometimes goes
down. I never had the time to look into it, so I just wrote a script
that monitors the site and restarts everything when it goes down.

We’re going to start using WP-mu at work so I’ve been looking into it
lately, and the problem seems to be browser-server connections stuck in
the CLOSE_WAIT state. With netstat -nap I get loads of these:

$ netstat -nap | grep CLOSE_WAIT
tcp 1 0 10.10.10.10:80 1.2.3.4:52132 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:52133 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:50857 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:51348 CLOSE_WAIT 27673/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:50846 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:52126 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:52354 CLOSE_WAIT 27672/nginx: worker
[…]

Where 10.10.10.10 is the web server and 1.2.3.4 the browser. Right now I
have 67 of these after restarting nginx and doing some admin stuff
on wp for a couple of minutes (CPU-intensive stuff: uploading, scaling
and watermarking images with the NextGEN Gallery plugin).
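
A tally like the "67 of these" above can be taken straight from the netstat text. This is a minimal Python sketch, assuming each socket is printed on a single line as netstat emits it (mail clients may wrap the lines); the function name and the sample data are illustrative only:

```python
from collections import Counter


def count_close_wait(netstat_output):
    """Count CLOSE_WAIT TCP sockets per owning program in `netstat -nap` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Columns: proto, recv-q, send-q, local, foreign, state, pid/program
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "CLOSE_WAIT":
            continue
        owner = " ".join(fields[6:]) or "-"
        counts[owner] += 1
    return counts


# Sample lines in the same shape as the output quoted above.
SAMPLE = """\
tcp 1 0 10.10.10.10:80 1.2.3.4:52132 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:51348 CLOSE_WAIT 27673/nginx: worker
tcp 0 0 127.0.0.1:9000 127.0.0.1:52917 TIME_WAIT -
"""

if __name__ == "__main__":
    for owner, n in count_close_wait(SAMPLE).most_common():
        print(n, owner)
```

Grouping by the PID/program column shows at a glance whether the stuck sockets are spread over all workers or concentrated in one.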

The connections between nginx and php don’t seem to get stuck; they go
from active to TIME_WAIT and disappear from netstat normally. They don’t
get stuck in the CLOSE_WAIT state:

$ netstat -nap | grep :9000
tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 27662/php5-fpm
tcp 0 0 127.0.0.1:9000 127.0.0.1:52917 TIME_WAIT -
[…]

On Friday I moved from spawn-fcgi+php-cgi to php-fpm, to no avail. I’ve
noticed some entries like these in php5-fpm.log at the moments when I’m
working with wp and CLOSE_WAIT connections start to pile up:

Feb 21 10:48:45.080836 [NOTICE] fpm_got_signal(), line 48: received SIGCHLD
Feb 21 10:48:45.080918 [NOTICE] fpm_children_bury(), line 217: child 27665 (pool default) exited with code 0 after 35512.611171 seconds from start
Feb 21 10:48:45.089499 [NOTICE] fpm_children_make(), line 354: child 30370 (pool default) started

So I guess there might be a connection between the two. Anyway, it is
not a 1:1 ratio: right now I have 5 of those php SIGCHLD entries and 67
sockets in CLOSE_WAIT with nginx. And the php SIGCHLD entries correspond
to moments when I got an error on wp (failed creating a thumbnail), while
the CLOSE_WAIT connections are not related to application or connectivity
errors.

I’m almost sure that although the CLOSE_WAIT sockets belong to the
browser-nginx connections, the problem lies in the nginx-php
connection. At work we have a farm of nginx+Tomcat servers (via
proxy_pass, not fastcgi_pass) and I haven’t seen this behavior. And I
think it has to do with PHP CPU use, as the site usually went down when
hit simultaneously by a couple of visits and some search engines’ spiders,
and now I’m able to reproduce it by scaling and watermarking pics.
But I don’t know where else to look.

Has anybody else seen this behaviour?

Thanks in advance

Regards

--
Vicente A. removed_email_address@domain.invalid | http://www.bisente.com

Vicente_A · February 21, 2010, 8:55pm

Hello!

On Sun, Feb 21, 2010 at 11:19:48AM +0100, Vicente A. wrote:

into it lately and the problem seems to be browser-server
connections stuck on the CLOSE_WAIT state. With netstat -nap I
get loads of these:

$ netstat -nap | grep CLOSE_WAIT
tcp 1 0 10.10.10.10:80 1.2.3.4:52132
CLOSE_WAIT 27672/nginx: worker

[…]

What does nginx -V show? What’s in config?

Maxim D.

Vicente_A · February 22, 2010, 7:47am

What does nginx -V show? What’s in config?

You’re right, should have started there. Sorry.

$ nginx -V
nginx version: nginx/0.7.65
TLS SNI support enabled
configure arguments: --conf-path=/etc/nginx/nginx.conf
--error-log-path=/var/log/nginx/error.log --pid-path=/var/run/nginx.pid
--lock-path=/var/lock/nginx.lock
--http-log-path=/var/log/nginx/access.log
--http-client-body-temp-path=/var/lib/nginx/body
--http-proxy-temp-path=/var/lib/nginx/proxy
--http-fastcgi-temp-path=/var/lib/nginx/fastcgi --with-debug
--with-http_stub_status_module --with-http_flv_module
--with-http_ssl_module --with-http_dav_module
--with-http_gzip_static_module --with-http_sub_module --with-mail
--with-mail_ssl_module --with-ipv6 --with-http_perl_module
--add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/nginx-upstream-fair
--add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/ngx_http_upstream_memcached_hash_module-0.04
--add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/ngx_http_secure_download
--with-http_proxy_s3_auth

some bits of nginx config:

worker_processes 4;

error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;

events {
worker_connections 1024;
multi_accept on;
}

http {
include /etc/nginx/mime.types;

access_log  /var/log/nginx/access.log;

sendfile        on;
tcp_nopush     on;

#keepalive_timeout  0;
keepalive_timeout  20;
keepalive_requests 50;
tcp_nodelay        on;

gzip  on;

client_max_body_size 32m;

gzip_static on;

gzip_http_version 1.1;
gzip_proxied expired no-cache no-store private auth;
gzip_disable "MSIE [1-6]\.";
gzip_vary on;

location ~ \.php$ {
    fastcgi_split_path_info ^(.+\.php)(.*)$;
    fastcgi_pass   127.0.0.1:9000;
    fastcgi_index  index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_param SERVER_NAME $http_host;
    fastcgi_param QUERY_STRING $query_string;
    fastcgi_param REQUEST_METHOD $request_method;
    fastcgi_param CONTENT_TYPE $content_type;
    fastcgi_param CONTENT_LENGTH $content_length;
    fastcgi_intercept_errors on;
    fastcgi_ignore_client_abort on;
    fastcgi_connect_timeout 60;
    fastcgi_send_timeout 180;
    fastcgi_read_timeout 180;
    fastcgi_buffer_size 128k;
    fastcgi_buffers 4 256k;
    fastcgi_busy_buffers_size 256k;
    fastcgi_temp_file_write_size 256k;
}

All the rest are the usual locations, rewrites, etc.

Have tried several different combinations of worker_processes,
sendfile, nopush, keepalive_* … to no avail. Sometimes it takes longer
to hang, but it always ends up not responding with > 100 connections in
CLOSE_WAIT. Killing nginx, waiting a couple of seconds for the
connections in CLOSE_WAIT to disappear from netstat and starting nginx
again fixes the issue. No need to restart the PHP processes.

Bye

--
Vicente A. removed_email_address@domain.invalid | http://www.bisente.com

Vicente_A · February 22, 2010, 11:57am

Hello!

On Mon, Feb 22, 2010 at 07:46:24AM +0100, Vicente A. wrote:

--pid-path=/var/run/nginx.pid --lock-path=/var/lock/nginx.lock
--add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/ngx_http_upstream_memcached_hash_module-0.04
--add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/ngx_http_secure_download
--with-http_proxy_s3_auth

Try compiling without third-party modules and patches and check if
you are able to reproduce the problem. But see below for a
simpler test to do before this one.

[…]

    fastcgi_ignore_client_abort on;
    fastcgi_connect_timeout 60;
    fastcgi_send_timeout 180;
    fastcgi_read_timeout 180;

You are ignoring client aborts, and have relatively large timeouts
set for fastcgi. Are you sure the connections in question don’t
disappear as soon as your fastcgi backend finishes preparing the
response? I.e. check whether any particular connection stays for at
least 5 minutes or so.
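
One way to run this check is to take two `netstat -nap` snapshots a few minutes apart and see which client connections are in CLOSE_WAIT in both. This is a minimal Python sketch, assuming one socket per line in the netstat text; the function names and the sample snapshots are made up for illustration:

```python
def close_wait_pairs(netstat_output):
    """Return the set of (local, remote) address pairs in CLOSE_WAIT."""
    pairs = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Columns: proto, recv-q, send-q, local, foreign, state, pid/program
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[5] == "CLOSE_WAIT"):
            pairs.add((fields[3], fields[4]))
    return pairs


def persistent_close_wait(earlier, later):
    """Connections that were in CLOSE_WAIT in both snapshots."""
    return close_wait_pairs(earlier) & close_wait_pairs(later)


# Hypothetical snapshots taken about five minutes apart.
EARLIER = """\
tcp 1 0 10.10.10.10:80 1.2.3.4:52132 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:52133 CLOSE_WAIT 27672/nginx: worker
"""
LATER = """\
tcp 1 0 10.10.10.10:80 1.2.3.4:52132 CLOSE_WAIT 27672/nginx: worker
tcp 0 0 10.10.10.10:80 1.2.3.4:60001 ESTABLISHED 27673/nginx: worker
"""

if __name__ == "__main__":
    for local, remote in sorted(persistent_close_wait(EARLIER, LATER)):
        print(local, remote)
```

A pair that shows up in both snapshots, taken further apart than the fastcgi timeouts, is probably not just waiting for the backend. A client port could in principle be reused between snapshots, so treat a match as a hint rather than proof.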

Additionally, check if you are able to reproduce the problem with
fastcgi_ignore_client_abort off.

[…]

Have tried several different combinations of
worker_processes, sendfile, nopush, keepalive_* … to no avail.
Sometimes it takes longer to hang, but it always ends up not
responding with > 100 connections in CLOSE_WAIT. Killing nginx,
waiting a couple of seconds for the connections in CLOSE_WAIT to
disappear from netstat and starting nginx again fixes the issue.
No need to restart the PHP processes.

Not responding just because of 100 connections seems strange for
nginx even with worker_connections 1024, so I suspect you just run
out of php processes and CLOSE_WAIT’s are because of
fastcgi_ignore_client_abort.

Maxim D.

Vicente_A · February 22, 2010, 12:21pm

Hi

Additionally, check if you are able to reproduce the problem with
fastcgi_ignore_client_abort off.

That was my current config which I copied from a site discussing
php-fpm. My initial fastcgi config was:

location ~ \.php$ {
    # By all means use a different server for the fcgi processes if you
    # need to
    # fastcgi_pass unix:/tmp/php-fastcgi.sock;
    fastcgi_pass   127.0.0.1:9000;
    fastcgi_index  index.php;
    fastcgi_param  SCRIPT_FILENAME /var/www/$host/$fastcgi_script_name;
    include /etc/nginx/fastcgi_params;
    fastcgi_intercept_errors on;
}

And also had the problem.

Not responding just because of 100 connections seems strange for
nginx even with worker_connections 1024, so I suspect you just run
out of php processes and CLOSE_WAIT’s are because of
fastcgi_ignore_client_abort.

That’s what I think too, but there are no stuck PHP connections in
netstat. Whenever a PHP page is loaded I get some nginx-PHP sockets but
they all close OK, none gets stuck. Only on the client-nginx end
can I see this behavior with netstat.

Strange.

Vicente_A · February 22, 2010, 8:26pm

Hi

Did you also have CLOSE_WAIT sockets with
fastcgi_ignore_client_abort off?

Have changed it to off after your previous mail and still haven’t seen a
single CLOSE_WAIT socket. Anyway, the site doesn’t really have that much
traffic and today I haven’t tried to stress it. Will keep an eye on it
and tomorrow will try to reproduce it again.

By “stuck” you mean sockets in CLOSE_WAIT state? It’s expected
that there are no CLOSE_WAIT sockets between nginx and php.

Well, I thought that if the CLOSE_WAIT sockets on the browser-nginx end
are really caused by PHP, there should be another somehow blocked
connection on the nginx-PHP end, but that’s not the case.

Regards

--
Vicente A. removed_email_address@domain.invalid | http://www.bisente.com

Vicente_A · February 22, 2010, 1:21pm

Hello!

On Mon, Feb 22, 2010 at 12:20:26PM +0100, Vicente A. wrote:

least 5 minutes or so.
fastcgi_pass 127.0.0.1:9000;
fastcgi_index index.php;
fastcgi_param SCRIPT_FILENAME /var/www/$host/$fastcgi_script_name;
include /etc/nginx/fastcgi_params;
fastcgi_intercept_errors on;
}

And also had the problem.

Did you also have CLOSE_WAIT sockets with
fastcgi_ignore_client_abort off?

Not responding just because of 100 connections seems strange
for nginx even with worker_connections 1024, so I suspect you
just run out of php processes and CLOSE_WAIT’s are because of
fastcgi_ignore_client_abort.

That’s what I think too, but there are no stuck PHP connections
in netstat. Whenever a PHP page is loaded I got some nginx-PHP
sockets but they all close OK, none gets stuck. Only on the
client-nginx end is where I can see this behavior with netstat.

By “stuck” you mean sockets in CLOSE_WAIT state? It’s expected
that there are no CLOSE_WAIT sockets between nginx and php.

Maxim D.

Vicente_A · February 22, 2010, 11:00pm

Hello!

On Mon, Feb 22, 2010 at 08:25:20PM +0100, Vicente A. wrote:

By “stuck” you mean sockets in CLOSE_WAIT state? It’s
expected that there are no CLOSE_WAIT sockets between nginx and
php.

Well, I thought that if the CLOSE_WAIT sockets on the
browser-nginx end are really caused by PHP, there should be
another somehow blocked connection on the nginx-PHP end, but
that’s not the case.

There will be another connection between nginx and php, but it
will be live (in ESTABLISHED state). It will wait for php to
finish processing the request and send the response back to nginx.
Once php is done, nginx will close both the nginx-php and
client-nginx connections.

Maxim D.

Vicente_A · February 25, 2010, 10:12am

Hi

There will be another connection between nginx and php, but it
will be live (in ESTABLISHED state). It will wait for php to
finish processing the request and send the response back to nginx.
Once php is done, nginx will close both the nginx-php and
client-nginx connections.

That’s what I thought, but I’ve got no nginx-php connections in the
ESTABLISHED state when a client-nginx connection goes to CLOSE_WAIT.

This is what I get now on a netstat -nap:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 13040/php5-fpm
tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN 6731/mysqld
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 16936/nginx
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 881/sshd
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 1416/exim4
tcp 324 0 10.10.10.10:80 66.249.68.247:36979 ESTABLISHED -
tcp 113 0 127.0.0.1:80 127.0.0.1:39229 CLOSE_WAIT -
tcp 335 0 10.10.10.10:80 65.55.207.102:47057 ESTABLISHED -
tcp 116 0 10.10.10.10:80 83.170.113.102:60149 ESTABLISHED -
tcp 413 0 10.10.10.10:80 81.52.143.26:51658 CLOSE_WAIT -
tcp 117 0 10.10.10.10:80 83.170.113.102:56117 CLOSE_WAIT -
tcp 1 0 10.10.10.10:80 66.249.68.247:62666 CLOSE_WAIT 16940/nginx: worker
tcp 1 0 10.10.10.10:80 66.249.68.247:37085 CLOSE_WAIT 16939/nginx: worker
tcp 1 0 10.10.10.10:80 66.249.68.247:51700 CLOSE_WAIT 16938/nginx: worker
tcp 1 0 10.10.10.10:80 67.195.115.83:53503 CLOSE_WAIT 16937/nginx: worker
tcp 117 0 10.10.10.10:80 74.52.50.50:52833 CLOSE_WAIT -
tcp 483 0 10.10.10.10:80 213.171.250.126:35648 ESTABLISHED -
tcp 0 288 10.10.10.10:22 193.145.230.6:50184 ESTABLISHED 17110/0
tcp 1 0 10.10.10.10:80 87.68.237.93:51575 CLOSE_WAIT 16939/nginx: worker
tcp 244 0 10.10.10.10:80 123.125.66.48:22675 CLOSE_WAIT -
tcp 1 0 10.10.10.10:80 193.145.230.6:50134 CLOSE_WAIT 16938/nginx: worker
tcp 116 0 10.10.10.10:80 74.52.50.50:55178 ESTABLISHED -
tcp6 0 0 :::22 :::* LISTEN 881/sshd
udp 0 0 10.10.10.10:53 0.0.0.0:* 1515/tinydns
udp 0 0 0.0.0.0:68 0.0.0.0:* 769/dhclient3
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 2 [ ACC ] STREAM LISTENING 481377 6731/mysqld /var/run/mysqld/mysqld.sock
unix 3 [ ] DGRAM 601444 867/rsyslogd /dev/log
unix 2 [ ] DGRAM 673 298/udevd @/org/kernel/udev/udevd
unix 2 [ ] DGRAM 611850 17110/0
unix 3 [ ] STREAM CONNECTED 609800 16936/nginx
unix 3 [ ] STREAM CONNECTED 609799 16936/nginx
unix 3 [ ] STREAM CONNECTED 609797 16936/nginx
unix 3 [ ] STREAM CONNECTED 609796 16936/nginx
unix 3 [ ] STREAM CONNECTED 609794 16936/nginx
unix 3 [ ] STREAM CONNECTED 609793 16936/nginx
unix 3 [ ] STREAM CONNECTED 609791 16936/nginx
unix 3 [ ] STREAM CONNECTED 609790 16936/nginx
unix 2 [ ] DGRAM 533124 769/dhclient3
unix 2 [ ] DGRAM 481379 6735/logger
unix 3 [ ] STREAM CONNECTED 324617 27662/php5-fpm
unix 3 [ ] STREAM CONNECTED 324616 27662/php5-fpm
unix 3 [ ] STREAM CONNECTED 324613 27662/php5-fpm
unix 3 [ ] STREAM CONNECTED 324612 27662/php5-fpm

11 client-nginx connections in CLOSE_WAIT, no nginx-php connections.
Unless they’re the last four unix domain sockets connections, but I’ve
configured nginx to use 127.0.0.1:9000 as the fcgi server.

How can I debug what these CLOSE_WAIT connections were doing and which
requests they were serving? Is there anything I can activate in the logs
or in nginx-status, a la Apache’s extended server-status?

Thanks

Vicente_A · February 25, 2010, 2:57pm

Hi

From the output you provided it looks like all nginx workers are
locked out, either doing something or waiting for some system
resources. As you can see - all connections accepted by nginx (6
connections which have nginx process listed in pid column) are in
CLOSE_WAIT state, and there are other connections to port 80 which
are sitting in listen queue. Am I right in the assumption that
nginx does not answer any requests?

Yes, that’s the issue. nginx becomes unresponsive at this point until I
restart it.

Note well: you haven’t posted full config you use, so please check
yourself for possible loops in it. I’ve recently posted some
patches which take care of several loops which aren’t automatically
resolved now, see here for patch and example loops:

[PATCH 5 of 5] Core: resolve various cycles with named locations and post_action

It should be trivial to find if it’s the cause though, as nginx
worker will eat 100% cpu once caught in such loop.

I have a monitoring script that detects these situations (wget can’t
download from localhost with a 20s timeout) and restarts nginx, but
before that it captures a netstat -nap, ps and other system metrics.
This is an example of what ps shows:

www-data 24610 0.0 0.1 7476 2452 ? S 07:44 0:00 nginx: worker process
www-data 24611 0.0 0.1 7668 2412 ? S 07:44 0:00 nginx: worker process
www-data 24612 0.0 0.1 7668 2416 ? S 07:44 0:00 nginx: worker process
www-data 24613 0.0 0.1 7736 2624 ? S 07:44 0:00 nginx: worker process

And vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 440 157012 181076 1180340 0 0 2 32 27 46 2 0 95 0
0 0 440 156904 181076 1180340 0 0 0 0 26 28 2 0 94 0
0 0 440 156888 181076 1180348 0 0 0 0 13 24 0 0 100 0
0 0 440 156888 181076 1180348 0 0 0 0 12 21 0 0 100 0
0 0 440 156888 181080 1180348 0 0 0 128 22 34 0 0 99 1

So the nginx processes don’t seem to be in a loop, CPU use is
negligible.
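
The kind of monitoring check described above (a download from localhost with a 20 s timeout, then a snapshot of system state before restarting) could be sketched in Python roughly as follows. This is not the actual script, which was not posted: the function names, the default URL and the exact `ps` and `vmstat` arguments are assumptions, and the restart step is left to the caller.

```python
import http.client
import subprocess
import urllib.error
import urllib.request


def site_responds(url="http://127.0.0.1/", timeout=20):
    """Return True if the server sends any HTTP response within `timeout` seconds."""
    # An empty ProxyHandler bypasses any proxy set in the environment,
    # so the check always talks to the local server directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (OSError, http.client.HTTPException):
        return False  # refused, timed out, or broken response


def snapshot(commands=(("netstat", "-nap"), ("ps", "aux"), ("vmstat", "1", "5"))):
    """Run each diagnostic command and return {command line: output}."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        try:
            done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            results[key] = done.stdout
        except (OSError, subprocess.SubprocessError) as exc:
            results[key] = "failed: %s" % exc
    return results


if __name__ == "__main__":
    if not site_responds():
        for name, output in snapshot().items():
            print("==== %s ====" % name)
            print(output)
```

Capturing the snapshot before the restart matters, because the CLOSE_WAIT sockets disappear once the workers are killed.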

Note well 2: I’ve already asked you to try compiling without third
party modules and patches and check if you are able to reproduce
the problem. It doesn’t really make sense to proceed any further
without doing this.

I have to admit I still haven’t tried this, sorry. Will try.

You have to enable the debug log (see
A debugging log). Then it will be
possible to map the fd number to the particular request (and its full
logs). Under Linux it should be possible to find out the fd number of
the particular connection via lsof -p .

Will look into this too and get that info on the monitoring script. Can
you think of any other system parameter that can be useful to monitor in
these cases?

Thanks a lot Maxim. You’re being really helpful.

Regards

Vicente_A · February 25, 2010, 1:59pm

Hello!

On Thu, Feb 25, 2010 at 10:11:23AM +0100, Vicente A. wrote:

tcp 113 0 127.0.0.1:80 127.0.0.1:39229 CLOSE_WAIT -
tcp 0 288 10.10.10.10:22 193.145.230.6:50184 ESTABLISHED 17110/0
unix 3 [ ] DGRAM 601444 867/rsyslogd /dev/log
unix 2 [ ] DGRAM 533124 769/dhclient3
the fcgi server.

From the output you provided it looks like all nginx workers are
locked out, either doing something or waiting for some system
resources. As you can see - all connections accepted by nginx (6
connections which have nginx process listed in pid column) are in
CLOSE_WAIT state, and there are other connections to port 80 which
are sitting in listen queue. Am I right in the assumption that
nginx does not answer any requests?

You have to examine the nginx workers to find out what they are doing.
Try starting with top, truss, gdb and examining your system logs.

Note well: you haven’t posted full config you use, so please check
yourself for possible loops in it. I’ve recently posted some
patches which take care of several loops which aren’t automatically
resolved now, see here for patch and example loops:

http://nginx.org/pipermail/nginx-devel/2010-January/000099.html

It should be trivial to find if it’s the cause though, as nginx
worker will eat 100% cpu once caught in such loop.

Note well 2: I’ve already asked you to try compiling without third
party modules and patches and check if you are able to reproduce
the problem. It doesn’t really make sense to proceed any further
without doing this.

How can I debug what these CLOSE_WAIT connections were doing,
which request were they serving? Anything I can activate on the
logs or on nginx-status, a la Apache’s extended server-status?

You have to enable the debug log (see
A debugging log). Then it will be
possible to map the fd number to the particular request (and its full
logs). Under Linux it should be possible to find out the fd number of
the particular connection via lsof -p .
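
As an alternative to lsof, the same socket-to-fd mapping can be read from /proc on Linux. This is a rough Python sketch, assuming IPv4 sockets and a little-endian host such as x86; the function names are illustrative. The state code 08 is what /proc/net/tcp uses for CLOSE_WAIT.

```python
import os
import socket
import struct

CLOSE_WAIT = "08"  # state code for CLOSE_WAIT in /proc/net/tcp


def decode_addr(hex_addr):
    """Turn '0A0A0A0A:0050' into '10.10.10.10:80' (IPv4, little-endian host)."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return "%s:%d" % (ip, int(port_hex, 16))


def close_wait_inodes(proc_net_tcp_text):
    """Map socket inode -> (local, remote) for every CLOSE_WAIT socket."""
    found = {}
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) > 9 and fields[3] == CLOSE_WAIT:
            found[fields[9]] = (decode_addr(fields[1]), decode_addr(fields[2]))
    return found


def close_wait_fds(pid, inodes):
    """List (fd, local, remote) for the fds of `pid` that are CLOSE_WAIT sockets."""
    fd_dir = "/proc/%d/fd" % pid
    result = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed while we were looking
        if target.startswith("socket:[") and target.endswith("]"):
            inode = target[len("socket:["):-1]
            if inode in inodes:
                local, remote = inodes[inode]
                result.append((int(fd), local, remote))
    return sorted(result)
```

To use it, read /proc/net/tcp into `close_wait_inodes()` and pass the result to `close_wait_fds()` together with a worker's PID. This has to run as root or as the user owning the worker. The fd numbers it returns are the ones to look for in the debug log.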

Maxim D.

Vicente_A · February 26, 2010, 8:29am

Hi

Note well 2: I’ve already asked you to try compiling without third
party modules and patches and check if you are able to reproduce
the problem. It doesn’t really make sense to proceed any further
without doing this.

I’ve gone back to the original Debian Lenny package (0.6.32-3+lenny3, I
was using a patched 0.7.65) and in ~14h have had no issues at all, 0
sockets in the CLOSE_WAIT state.

I’m going to leave it like this the whole weekend and see, but it seems
there was some issue with the 0.7 release or some of the patches I was
using. Funny thing is, I’m using that same binary at work but with
Tomcat instead of PHP and had no issues at all. Anyway, next week I’ll
upgrade patch by patch and try to guess which one was causing the
problem.

Not sure if this could be related to the event module as Benjamin
suggested. I’m using the default one (nothing on nginx.conf), I’ve tried
to change it but nginx always failed to start. I’m not sure which ones
are compiled in ATM, will try to find that out today.

I’ll tell you when I find out what the cause of the problem was, just
for the sake of having it documented and showing up on Google in case
somebody else hits the same issue.

Thanks again

Vicente_A · February 26, 2010, 9:25am

Hi

Not sure if this could be related to the event module as Benjamin suggested. I’m using the default one (nothing on nginx.conf), I’ve tried to change it but nginx always failed to start. I’m not sure which ones are compiled in ATM, will try to find that out today.

Using the epoll event module with both nginx 0.6.32 and 0.7.65. No other
event modules compiled in.

BTW the server is an Amazon EC2 instance, not sure if that might affect
things or if in this case some event module is better than another. :-?

Regards

Vicente_A · February 25, 2010, 3:50pm

Vicente A. wrote:

tcp 117 0 10.10.10.10:80 83.170.113.102:56117
CLOSE_WAIT -

Out of curiosity, did you try switching the event module to
anything but the default (epoll)?
I.e. something like:
events {
    use select;
}

The accumulating non-empty Recv-Qs and the pending CLOSE_WAITs (i.e.
close() never triggered on the server side) are typical
symptoms of race conditions when using an edge-triggered I/O
interface…
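
For readers unfamiliar with the state: a socket sits in CLOSE_WAIT from the moment the peer closes its end until the local process calls close(). The following minimal Python sketch shows that sequence on the loopback interface. It only illustrates the TCP behaviour and makes no claim about what nginx does internally; the function name is made up.

```python
import socket


def read_after_peer_close():
    """Accept one loopback connection, let the client close, and read once."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5)        # do not hang forever if something goes wrong

    client.close()                   # the client sends its FIN
    data = server_side.recv(1024)    # returns b"" once the FIN has arrived

    # Until the next line runs, the kernel keeps server_side in CLOSE_WAIT:
    # the peer has closed, but this process has not called close() yet.
    server_side.close()
    listener.close()
    return data


if __name__ == "__main__":
    print(repr(read_after_peer_close()))
```

A server that never reaches its close() call, for whatever reason, leaves the socket in CLOSE_WAIT for as long as the process lives. That matches the observation in this thread that the sockets only vanish when nginx is killed.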

Vicente_A · February 26, 2010, 8:37pm

Hi

I’ll tell you when I find out what the cause of the problem was, just for the sake of having it documented and showing up on Google in case somebody else hits the same issue.

I think I might have found what’s going on:

On my blog I have some sample scripts for running several servers with
daemontools, and some of them are browsable (autoindex on) with the full
daemontools directories structure. If you’ve worked with daemontools
you’ll know it uses some named pipes (fifos) …

I’ve tracked several different processes with sockets in CLOSE_WAIT in
the debug log, and the last line for each of them was an access to one of
these fifos. I’ve tried requesting those files on a freshly restarted
nginx and have reproduced the issue: each GET to one of the fifos always
produced one or sometimes two CLOSE_WAIT sockets. So in the end it seems
it had nothing to do with PHP.

I’ve removed all the fifos from my site and will keep running nginx
0.6.32 during the weekend. If I have no more issues (I’m pretty
confident now I won’t), on Monday I’ll go back to my patched 0.7.65.
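
For anyone wanting to audit their own document root for the same thing, here is a small Python sketch that lists named pipes under a directory tree. The function name is illustrative and the path in the example call is hypothetical:

```python
import os
import stat


def find_fifos(root):
    """Return the sorted paths of all named pipes (FIFOs) below `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk lists every non-directory entry, FIFOs included, in filenames
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            except OSError:
                continue                       # entry vanished or is unreadable
            if stat.S_ISFIFO(mode):
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical document root; adjust to the real one.
    for path in find_fifos("/var/www"):
        print(path)
```

The check uses lstat() only, so it never opens the pipes themselves.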

Configuring the debug log has been crucial here. Thanks for that tip,
Maxim.

On a side note: while I agree this was my fault (I shouldn’t have had
those “empty” pipes there in the first place), neither Apache nor
lighttpd had any problems with this. Or maybe they had, but some internal
process-cleaning was in place and these stuck processes were being
silently killed, I don’t know. In any case, as I’ve stated before, I
think the problem is not really nginx but the fifos that shouldn’t be
there.

Regards

--
Vicente A. removed_email_address@domain.invalid | http://www.bisente.com

Vicente_A · March 3, 2010, 2:29pm

Hi

I’ve tracked several different processes with sockets in CLOSE_WAIT in the debug log, and the last line for each of them was an access to one of these fifos. I’ve tried requesting those files on a freshly restarted nginx and have reproduced the issue: each GET to one of the fifos always produced one or sometimes two CLOSE_WAIT sockets. So in the end it seems it had nothing to do with PHP.

I can confirm the problem was caused by the FIFOs. I’ve had no
CLOSE_WAIT sockets at all since last Friday when I removed them: over the
weekend with nginx 0.6.32, and since Monday morning with my patched
0.7.65.

Thanks to everyone who helped me debug this, especially Maxim.

Regards