Howdy,
First, let me say how happy we are with Nginx. Yesterday was a
pretty big traffic day for us, about 36 million dynamic pageviews
peaking at about 15k requests/sec. Couldn’t have done it without Nginx.
http://wordpress.com/stats/traffic/
We have about 350 web servers behind Nginx so it is a semi-regular
occurrence that one of them fails for some reason (usually hardware).
Pound has a dedicated health check thread that performs the health
checks and marks servers up or down as appropriate. Nginx, however,
seems to use the response (or lack thereof) to a user-initiated
request as the health check. The problem is that some share of user
requests (governed by max_fails and fail_timeout) is slowed down while
Nginx discovers the failure. To minimize the impact we could set our
timeouts lower and
increase fail_timeout, but this could be dangerous – in the event of
a problem with a backend service (database, etc) which results in slow
responses from all web servers, nginx could mark every server as down
for an extended period of time. I was wondering if there has been any
thought about adding a dedicated health check thread to nginx to avoid
affecting user requests. This could also in theory allow for more
advanced and customizable health checks.
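
To make the idea concrete, here is a rough sketch of an out-of-band
checker, written in Python purely for illustration. Nothing in it is
Nginx code or an Nginx API: the names HealthChecker, tcp_probe, fall
and rise are hypothetical, and the probe is a plain TCP connect where a
real check would more likely be an HTTP request. A server is only
marked down after `fall` consecutive failed probes and only marked up
again after `rise` consecutive successes, so a single slow response
does not flip its state.

```python
import socket
from dataclasses import dataclass


@dataclass
class BackendState:
    up: bool = True    # assume healthy until probes say otherwise
    streak: int = 0    # consecutive probe results that disagree with `up`


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class HealthChecker:
    """Hypothetical out-of-band checker; not part of Nginx."""

    def __init__(self, backends, probe=tcp_probe, fall=3, rise=2):
        self.probe = probe
        self.fall = fall   # failed probes in a row needed to mark a server down
        self.rise = rise   # successful probes in a row needed to mark it up again
        self.states = {backend: BackendState() for backend in backends}

    def check_once(self):
        """Probe every backend once and update its up/down state."""
        for (host, port), state in self.states.items():
            ok = self.probe(host, port)
            if ok == state.up:
                state.streak = 0      # result agrees with current state
                continue
            state.streak += 1
            needed = self.rise if ok else self.fall
            if state.streak >= needed:
                state.up = ok
                state.streak = 0

    def healthy(self):
        """Return the backends currently considered up."""
        return [backend for backend, state in self.states.items() if state.up]
```

In practice check_once() would run in a loop on its own thread with a
short sleep between rounds, so no user request ever pays for a probe.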
Another option would be to have a separate program such as keepalived
perform the health checks and then modify the nginx config on the fly,
but that seems less than ideal.
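
For reference, the config-rewriting approach could look roughly like
the Python sketch below. It is only an illustration under assumptions:
render_upstream and write_if_changed are hypothetical helper names, the
upstream name "backend" and the addresses are made up, and the external
checker is assumed to hand over a map of "host:port" to up/down. Failed
servers are kept in the block but flagged with the `down` parameter of
the `server` directive.

```python
import os


def render_upstream(name, servers):
    """Build an nginx `upstream` block from {"host:port": is_up} pairs."""
    lines = [f"upstream {name} {{"]
    for addr, is_up in servers.items():
        suffix = "" if is_up else " down"   # `down` takes the server out of rotation
        lines.append(f"    server {addr}{suffix};")
    lines.append("}")
    return "\n".join(lines) + "\n"


def write_if_changed(path, text):
    """Replace the file at `path` atomically; return True only if the content changed."""
    try:
        with open(path) as f:
            if f.read() == text:
                return False
    except FileNotFoundError:
        pass
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)   # atomic rename, so nginx never reads a half-written file
    return True
```

Whenever write_if_changed() returns True, the caller would still have
to tell the nginx master process to reload its configuration, which is
part of why this route seems clumsy compared to a built-in checker.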
We are a company of PHP developers, so hacking on Nginx’s beautiful C
code is not our forte, but I would be open to sponsoring some
development in this area if someone is interested and the community
thinks it would be useful.
Thoughts?