Hi
I have a WordPress-mu site (a couple of personal and friends’ blogs,
very light traffic) which I migrated some months ago from
lighttpd+php-fcgi to nginx+php-fcgi. Ever since the migration the site
sometimes goes down. I never had the time to look into it, so I just
wrote a script that monitors the site and restarts everything when it
goes down.
We’re going to start using WP-mu at work, so I’ve been looking into it
lately, and the problem seems to be browser-server connections stuck in
the CLOSE_WAIT state. With netstat -nap I get loads of these:
$ netstat -nap | grep CLOSE_WAIT
tcp 1 0 10.10.10.10:80 1.2.3.4:52132 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:52133 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:50857 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:51348 CLOSE_WAIT 27673/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:50846 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:52126 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 1.2.3.4:52354 CLOSE_WAIT 27672/nginx: worker
[…]
Where 10.10.10.10 is the web server and 1.2.3.4 the browser. Right now I
have 67 of these after restarting nginx and doing some admin stuff on wp
for a couple of minutes (CPU-intensive stuff: uploading, scaling and
watermarking images with the NextGEN Gallery plugin).
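In case it helps anybody compare numbers, here is a minimal sketch of how
the CLOSE_WAIT sockets can be counted per owning process. It assumes the
Linux netstat column layout shown above (state in column 6, PID/program
in column 7); the function name close_wait_by_process is just something
I made up for illustration.

```shell
# Reads `netstat -nap` output on stdin and prints "<count> <pid/program>"
# for every process that owns tcp sockets in CLOSE_WAIT, busiest first.
# Assumption: state is field 6 and PID/program is field 7, as in the
# output pasted above. A program name containing a space, such as
# "nginx: worker", is cut at the space, so it shows up as "27672/nginx:".
close_wait_by_process() {
    awk '$1 ~ /^tcp/ && $6 == "CLOSE_WAIT" { n[$7]++ }
         END { for (p in n) print n[p], p }' | sort -rn
}
```

It would be used as: netstat -nap | close_wait_by_process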
The connections between nginx and php don’t seem to get stuck: they go
from active to TIME_WAIT and disappear from netstat normally. They don’t
get stuck in the CLOSE_WAIT state:
$ netstat -nap | grep :9000
tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 27662/php5-fpm
tcp 0 0 127.0.0.1:9000 127.0.0.1:52917 TIME_WAIT -
[…]
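To watch the nginx-php side over time instead of eyeballing it, a rough
sketch like the following tallies socket states for one port. It assumes
the same netstat layout as above; states_for_port is a hypothetical
helper name, and the 9000 default only reflects my setup.

```shell
# Reads `netstat -nap` output on stdin and prints "<state> <count>" for
# tcp sockets whose local or remote address ends in ":PORT".
# PORT is the first argument and defaults to 9000 (assumption: the port
# php-fpm listens on here).
states_for_port() {
    port="${1:-9000}"
    awk -v suffix=":$port" '
        # True when string s ends with suf; avoids matching :19000 for :9000.
        function ends_with(s, suf) {
            return length(s) >= length(suf) &&
                   substr(s, length(s) - length(suf) + 1) == suf
        }
        $1 ~ /^tcp/ && (ends_with($4, suffix) || ends_with($5, suffix)) { n[$6]++ }
        END { for (s in n) print s, n[s] }' | sort
}
```

For example: netstat -nap | states_for_port 9000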
On Friday I moved from spawn-fcgi+php-cgi to php-fpm, to no avail. I’ve
noticed some entries like these in php5-fpm.log at the moments when I’m
working with wp and CLOSE_WAIT connections start to pile up:
Feb 21 10:48:45.080836 [NOTICE] fpm_got_signal(), line 48: received SIGCHLD
Feb 21 10:48:45.080918 [NOTICE] fpm_children_bury(), line 217: child 27665 (pool default) exited with code 0 after 35512.611171 seconds from start
Feb 21 10:48:45.089499 [NOTICE] fpm_children_make(), line 354: child 30370 (pool default) started
So I guess there might be a connection between the two. It is not a 1:1
ratio, though: right now I have 5 of those php SIGCHLD entries and 67
nginx sockets in CLOSE_WAIT. Also, the php SIGCHLD entries coincide with
moments when I got an error on wp (failed creating a thumbnail), while
the CLOSE_WAIT connections are not related to application or
connectivity errors.
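For the php side of that comparison, this is roughly how the child exits
could be tallied from the log. It is only a sketch that assumes every
fpm_children_bury entry is a single line worded like the one pasted above
("exited with code N"); entries worded differently are ignored, and
fpm_exit_summary is a made-up name.

```shell
# Reads a php-fpm log on stdin and prints "code <N>: <count>" for each
# exit code found in fpm_children_bury entries.
# Assumption: entries contain the literal text "exited with code <N>".
fpm_exit_summary() {
    awk '/fpm_children_bury/ && match($0, /exited with code [0-9]+/) {
             # "exited with code " is 17 characters; the rest is the number.
             code = substr($0, RSTART + 17, RLENGTH - 17)
             n[code]++
         }
         END { for (c in n) print "code " c ": " n[c] }' | sort
}
```

For example: fpm_exit_summary < php5-fpm.log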
I’m almost sure that, although the CLOSE_WAIT sockets belong to the
browser-nginx connections, the problem lies in the nginx-php
connection. At work we have a farm of nginx+Tomcat servers (via
proxy_pass, not fastcgi_pass) and I haven’t seen this behavior there. I
think it has to do with PHP CPU use, as the site usually went down when
hit simultaneously by a couple of visits and some search engines’
spiders, and now I’m able to reproduce it by scaling and watermarking
pics. But I don’t know where else to look.
Has anybody else seen this behaviour?
Thanks in advance
Regards
–
Vicente A. [email protected] | http://www.bisente.com