Nginx health checks

Howdy,

First let me say how happy we are with Nginx :-) Yesterday was a
pretty big traffic day for us, about 36 million dynamic pageviews
peaking at about 15k requests/sec. Couldn’t have done it without Nginx.

http://wordpress.com/stats/traffic/

We have about 350 web servers behind Nginx so it is a semi-regular
occurrence that one of them fails for some reason (usually hardware).
Pound has a dedicated health check thread that performs the health
checks and marks servers up/down as appropriate. Nginx, however,
seems to use the response (or lack thereof) from the user-initiated
request as the health check. The problem with this is that a
percentage of user requests (up to max_fails per fail_timeout window)
will be slowed down by failing backends. To minimize the impact we
could set our timeouts lower and
increase fail_timeout, but this could be dangerous – in the event of
a problem with a backend service (database, etc) which results in slow
responses from all web servers, nginx could mark every server as down
for an extended period of time. I was wondering if there has been any
thought about adding a dedicated health check thread to nginx to avoid
affecting user requests. This could also in theory allow for more
advanced and customizable health checks.
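
To make the trade-off concrete, here is roughly the kind of tuning I
mean - a minimal sketch, with made-up addresses and values:

    upstream backend {
        # a failed server is only skipped after max_fails user
        # requests have already eaten the timeouts below
        server 10.0.0.1:80 max_fails=3 fail_timeout=30s;
        server 10.0.0.2:80 max_fails=3 fail_timeout=30s;
    }

    server {
        location / {
            proxy_pass http://backend;
            proxy_connect_timeout 2;   # seconds; fail over quickly
            proxy_read_timeout 30;
        }
    }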

Another option would be to use another program such as keepalived to
do the health checks and then modify the nginx config on the fly, but
it seems less than ideal.

We are a company of PHP developers, so hacking on Nginx’s beautiful C
code is not our forte, but I would be open to sponsoring some
development in this area if someone is interested and the community
thinks it would be useful.

Thoughts?

On Tue, Jun 10, 2008 at 10:21 AM, Barry A.
[email protected] wrote:

has a dedicated health check thread that performs the health checks
and marks servers up/down as appropriate.

Nginx, however, seems to use
the response (or lack thereof) from the user-initiated request as the health
check.

Can you elaborate on what you mean here? Things aren’t getting marked
as down from pound until a user makes a request?

The problem with this is that a percentage of user requests (up to
max_fails per fail_timeout window) will be slowed down by failing
backends. To minimize the impact we could set
our timeouts lower and increase fail_timeout, but this could be dangerous –
in the event of a problem with a backend service (database, etc) which
results in slow responses from all web servers, nginx could mark every
server as down for an extended period of time.

It seems like this should be tuned at the pound level. You should
definitely have sane thresholds for timeouts, but it sounds like your
load balancer software is letting you down here.

I was wondering if there has
been any thought about adding a dedicated health check thread to nginx to
avoid affecting user requests. This could also in theory allow for more
advanced and customizable health checks.

Another option would be to use another program such as keepalived to do the
health checks and then modify the nginx config on the fly, but it seems less
than ideal.

We use keepalived tcp checks and they work pretty well. One problem
we face is that a tcp check can pass even when the appserver we
reverse proxy to is down, so it’s not a spectacular health check. We
experimented with http checks going to a dynamically generated (but
really light) appserver page and we actually found them to be less
reliable than the tcp checks.
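
For reference, our keepalived check block looks roughly like this
(addresses made up):

    virtual_server 192.168.1.100 80 {
        delay_loop 5            # seconds between checks
        lb_algo wlc
        lb_kind DR
        protocol TCP

        real_server 192.168.1.10 80 {
            TCP_CHECK {
                connect_timeout 3
                connect_port 80
            }
        }
    }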

In your pound config, are you using tcp checks, dynamic http checks,
or a static page pull? Are you guys using fcgi or some other
appserver-type thing for PHP?

We are a company of PHP developers, so hacking on Nginx’s beautiful C code
is not our forte, but I would be open to sponsoring some development in this
area if someone is interested and the community thinks it would be useful.

Very cool. :-)

On Tue, Jun 10, 2008 at 6:21 PM, Barry A. [email protected]
wrote:

We have about 350 web servers behind Nginx so it is a semi-regular
occurrence that one of them fails for some reason (usually hardware). Pound
has a dedicated health check thread that performs the health checks
and marks servers up/down as appropriate.

HAProxy also offers this kind of periodic health check. You can set it
up to request a specific URL by HTTP, and the intervals can be
regulated according to whether the server is down, up or in between.

It can stagger the checks to avoid spikes, and supports dependencies,
where a server is only considered up if another server is up.

HAProxy can also be configured to listen to a port in a special
“health” mode that simply returns “200 OK” to every connection. This
can be used with Nagios or other tools for monitoring.
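
Roughly like this, for illustration (addresses and intervals made up):

    listen webfarm 0.0.0.0:80
        mode http
        balance roundrobin
        option httpchk GET /health
        server web1 10.0.0.1:80 check inter 2000 rise 2 fall 3
        server web2 10.0.0.2:80 check inter 2000 rise 2 fall 3

    # dedicated port that answers 200 OK to every connection
    listen health 0.0.0.0:60000
        mode health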

Alexander.

On 6/10/08, Barry A. [email protected] wrote:

Another option would be to use another program such as keepalived to do the
health checks and then modify the nginx config on the fly, but it seems less
than ideal.

I’ve had this same desire and thought about doing this too…

Right now I’ve gone back to using ipvsadm/ldirectord/LVS on my
frontend “load balancer” server and then using nginx on my web/app
servers (3 of them) - that way ldirectord does the health checks and
removes failed servers from the pool, and I don’t have to mess with
tweaking nginx timeouts/etc. at all.
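
The ldirectord side of it is only a few lines; from memory, something
like this (addresses made up):

    checktimeout=3
    checkinterval=5

    virtual=192.168.1.100:80
            real=192.168.1.10:80 gate
            real=192.168.1.11:80 gate
            real=192.168.1.12:80 gate
            service=http
            request="alive.html"
            receive="OK"
            checktype=negotiate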

The one major bonus about nginx is that if it determines one backend
has failed, it will try the next one. The problem is, if I were
rebooting one of my 3 backends, for instance, I’d have thousands of
log file entries about it. I guess I could use some parameters to say
that once one is dead, forget about it for 30 seconds or something.
But adding a little bit of load-balancer health check intelligence
could mean even more nginx adoption, as it would alleviate the need
for ldirectord, and you’d get the bonus of gzip, SSL and layer 7
capabilities on the frontend.
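
I believe the parameters would be something like this, though I
haven’t actually tried it (addresses made up):

    upstream backends {
        # after 1 failure, forget about the server for 30 seconds
        server 10.0.0.11:80 max_fails=1 fail_timeout=30s;
        server 10.0.0.12:80 max_fails=1 fail_timeout=30s;
        server 10.0.0.13:80 max_fails=1 fail_timeout=30s;
    }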

On Jun 10, 2008, at 2:12 PM, Alexander S. wrote:

up to request a specific URL by HTTP, and the intervals can be
regulated according to whether the server is down, up or in between.

Yeah, in my testing, the performance of HAProxy was less than stellar
and I couldn’t figure out how to get it to spawn multiple processes to
use multiple CPUs. Also, I would like to avoid adding another layer
if possible.

On Jun 10, 2008, at 1:36 PM, Corey D. wrote:

Nginx, however, seems to use the response (or lack thereof) from the
user-initiated request as the health check.

Can you elaborate on what you mean here? Things aren’t getting marked
as down from pound until a user makes a request?

We are using the reverse proxy functionality of nginx, not the web
server functionality. We switched from pound to nginx.

It seems like this should be tuned at the pound level.

Pound is no longer part of the equation.

You should
definitely have sane thresholds for timeouts, but it sounds like your
load balancer software is letting you down here.

Nginx is our load balancer in this case.

Hi Barry,

first of all, I like both nginx and haproxy very much ;-)

On Tue 10.06.2008 16:16, Barry A. wrote:

On Jun 10, 2008, at 2:12 PM, Alexander S. wrote:

HAProxy also offers this kind of periodic health check. You can set
it up to request a specific URL by HTTP, and the intervals can be
regulated according to whether the server is down, up or in between.

Yeah, in my testing, the performance of HAProxy was less than stellar
and I couldn’t figure out how to get it to spawn multiple processes to
use multiple CPUs. Also, I would like to avoid adding another layer
if possible.

If you really use only the reverse proxy feature, then I think
haproxy is the better choice, for the following reasons:

1.) HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer

In http://haproxy.1wt.eu/download/1.3/doc/haproxy-en.txt, search for:

2.) 1.5) Increasing the overall processing power
3.) 3.1) Server monitoring
4.) 3.4) Limiting the number of concurrent sessions on each server
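
For 2.), the relevant knob is nbproc in the global section; from
memory, roughly:

    global
        nbproc 4    # one haproxy process per CPU core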

Some of these features would also be very nice in nginx ;-)

What I don’t know is whether you use the fcgi backend or the http backend?

One of the coolest / best features of nginx is that it delivers
static content as fast as possible straight from the disk ;-))

What I don’t understand is why you don’t use this feature, given
that, as far as I understand, you deliver ALL the content through the
application, even the static content.

As you can see, I haven’t set up a blog with WordPress.

Cheers

Aleks

On Tue 10.06.2008 18:02, Barry A. wrote:

3.) 3.1) Server monitoring
4.) 3.4) Limiting the number of concurrent sessions on each server

Thanks for pointing these out. I was looking for #2 when I did my
testing, but couldn’t find it. That may change my mind about
performance.

I hope so.

Some of these features would also be very nice in nginx ;-)

I am mostly interested in “3.) 3.1) Server monitoring” at this point.
The first one is already supported, and although the last one, in
theory, would be nice, since we use round robin for most of our load
distribution and have enough backends, the law of averages says that
the probability of one backend becoming overloaded while the others
are under-utilized is relatively small. The fair proxy balancer patch
for nginx may deal with this theoretical problem better than counting
concurrent connections anyway – I don’t know though since I haven’t
tried it.

The last one is the Swiss army knife, with maxconn, minconn and
maxqueue ;-)
Yes, the fair module looks nice.
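
For illustration, maxconn/minconn/maxqueue all sit on the server
line; from memory, roughly (address made up):

    server web1 10.0.0.1:80 check minconn 20 maxconn 100 maxqueue 128
    # if I remember right: the effective limit scales between 20 and
    # 100 with the load on the proxy, and at most 128 requests get
    # queued for this server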

What I don’t know is whether you use the fcgi backend or the http backend?

Currently http, but may switch to fcgi at some point.

With which http server on the backend?

One of the coolest / best features of nginx is that it delivers
static content as fast as possible straight from the disk ;-))

What I don’t understand is why you don’t use this feature, given
that, as far as I understand, you deliver ALL the content through the
application, even the static content.

Most of our content is dynamic. Static content is not served through
the application.

Not even the css, js, images, movies, …?!

  3. Abundance of 3rd party modules which allow you to extend the
    capabilities of the software and also build a community.

Full Ack.

1+3 will be done in the future, I hope ;-)

As you can see, I haven’t set up a blog with WordPress.

You should :-)

Well, I don’t know what I should write ;-)

Thanks for the feedback, it is much appreciated!

You’re welcome, and thanks for trusting in nginx and bringing it to
#4 on Netcraft ;-)))))

Aleks

On Jun 10, 2008, at 6:15 PM, Aleksandar L. wrote:

With which http server on the backend?

A mixture of LiteSpeed, Apache, and lighttpd.

Most of our content is dynamic. Static content is not served through
the application.

Not even the css, js, images, movies, …?!

Most of the css/js is static and served via a CDN. The origin servers
for this stuff are currently running lighttpd, but I will probably
switch them to nginx at some point. The images and movies are
currently served through PHP with some caching in front courtesy of
Varnish.

Well, I don’t know what I should write ;-)

What about HAProxy and Nginx? It sounds like you have some
interesting experiences to share.

On Jun 10, 2008, at 4:56 PM, Aleksandar L. wrote:

If you really use only the reverse proxy feature, then I think
haproxy is the better choice, for the following reasons:

1.) HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer

In http://haproxy.1wt.eu/download/1.3/doc/haproxy-en.txt, search for:

2.) 1.5) Increasing the overall processing power
3.) 3.1) Server monitoring
4.) 3.4) Limiting the number of concurrent sessions on each server

Thanks for pointing these out. I was looking for #2 when I did my
testing, but couldn’t find it. That may change my mind about
performance.

Some of these features would also be very nice in nginx ;-)

I am mostly interested in “3.) 3.1) Server monitoring” at this point.
The first one is already supported, and although the last one, in
theory, would be nice, since we use round robin for most of our load
distribution and have enough backends, the law of averages says that
the probability of one backend becoming overloaded while the others
are under-utilized is relatively small. The fair proxy balancer patch
for nginx may deal with this theoretical problem better than counting
concurrent connections anyway – I don’t know though since I haven’t
tried it.

What I don’t know is whether you use the fcgi backend or the http backend?

Currently http, but may switch to fcgi at some point.

One of the coolest / best features of nginx is that it delivers
static content as fast as possible straight from the disk ;-))

What I don’t understand is why you don’t use this feature, given
that, as far as I understand, you deliver ALL the content through the
application, even the static content.

Most of our content is dynamic. Static content is not served through
the application.

There are a few things I really like about nginx that I don’t find in
HAProxy –

  1. SSL – no need for stunnel or something else to do the SSL
    negotiation. Our % of SSL traffic is relatively low, so for us it is
    not a bottleneck. Maybe one day it will be…

  2. Its ability to be used as a web server, which could allow us to
    standardize on a single software package.

  3. Abundance of 3rd party modules which allow you to extend the
    capabilities of the software and also build a community.

As you can see, I haven’t set up a blog with WordPress.

You should :-)

Thanks for the feedback, it is much appreciated!

On Tue, Jun 10, 2008 at 10:16 PM, Barry A.
[email protected] wrote:

Yeah, in my testing, the performance of HAProxy was less than stellar and I
couldn’t figure out how to get it to spawn multiple processes to use multiple
CPUs. Also, I would like to avoid adding another layer if possible.

That’s pretty much the opposite of my experience with HAProxy versus
Nginx, although my test case may be different from yours. I have
briefly blogged about this today:

Alexander.

On Wed, Jun 11, 2008 at 12:02 AM, Barry A.
[email protected] wrote:

I am mostly interested in “3.) 3.1) Server monitoring” at this point. The
first one is already supported, and although the last one, in theory, would
be nice, since we use round robin for most of our load distribution and have
enough backends, the law of averages says that the probability of one
backend becoming overloaded while the others are under-utilized is
relatively small. The fair proxy balancer patch for nginx may deal with
this theoretical problem better than counting concurrent connections anyway
– I don’t know though since I haven’t tried it.

And the probability approaches 1.0 as the number of concurrent
requests approaches the number of backends. This is why HAProxy’s
least-connections algorithm may work better on high-traffic sites.

In the case of Rails, which does not support parallelizing requests,
it makes sense to use HAProxy’s “maxconn” set to 1, which prevents the
proxy from ever sending more than one concurrent connection to a
backend; so if all backends are full, instead of queueing up
connections in the backend, you queue up connections in the proxy so
that it can immediately use the next backend that becomes idle.
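
A minimal sketch of that setup, with made-up names and ports:

    listen rails 0.0.0.0:80
        mode http
        balance roundrobin
        option httpchk GET /health
        server app1 127.0.0.1:8001 check maxconn 1
        server app2 127.0.0.1:8002 check maxconn 1
    # with maxconn 1, excess requests queue in haproxy rather than
    # behind a busy mongrel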

Alexander.

Hi Alexander. Your post is quite interesting. I would be interested in
seeing the configuration for HAProxy and nginx for your test case. I
also found this post on the same topic. Are you going nginx => haproxy
=> mongrels? From the following, it seems you are advocating putting
haproxy in front. Many thanks.

http://www.igvita.com/2008/05/13/load-balancing-qos-with-haproxy/

Regards,
David

On Wed, Jun 11, 2008 at 2:59 AM, David P. [email protected]
wrote:

Hi Alexander. Your post is quite interesting. I would be interested in
seeing the configuration for HAProxy and nginx for your test case.

Sure, I will post it later.

I also found this post on the same topic. Are you going nginx =>
haproxy => mongrels? From the following, it seems you are advocating
putting haproxy in front. Many thanks.

http://www.igvita.com/2008/05/13/load-balancing-qos-with-haproxy/

I’m putting HAProxy in front, with Nginx and Varnish behind it.

Alexander.

On Tue 10.06.2008 19:15, Barry A. wrote:

On Jun 10, 2008, at 6:15 PM, Aleksandar L. wrote:

Not even the css, js, images, movies, …?!

Most of the css/js is static and served via a CDN.

;-)

The origin servers for this stuff are currently running lighttpd, but
I will probably switch them to nginx at some point. The images and
movies are currently served through PHP with some caching in front
courtesy of Varnish.

Do you know about X-Accel-Redirect?

http://wiki.codemongers.com/NginxXSendfile
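
As a minimal sketch (paths made up): the backend answers with an
X-Accel-Redirect header instead of the file body, and nginx then
serves the file itself from an internal-only location:

    location /protected/ {
        internal;               # not reachable directly by clients
        alias /var/www/files/;
    }

    # the backend just sends a header like:
    #   X-Accel-Redirect: /protected/movie.mov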

Well, I don’t know what I should write ;-)

What about HAProxy and Nginx? It sounds like you have some
interesting experiences to share.

OK, I will try to make a setup, no promises ;-)

Aleks

On 6/11/08, Barry A. [email protected] wrote:

• Big one for us; a backend can instruct Perlbal to fetch the user’s
data from a completely separate server and port and URL, 100%
transparent to the user
• Can actually give Perlbal a list of URLs to try. Perlbal will find
one that’s alive. Again, the end user sees no redirects happening.

I believe that, based on nginx’s proxy-everything approach, this can
work. I was looking into it for MogileFS (so there’s no app-level
interaction required, purely nginx ↔ MogileFS based on the request
URI).

I think it can do reproxying on the fly, but I am not 100% sure; I
never got that far, but I am sure I saw examples. This would be huge
if we could get that into nginx: then I could use MogileFS without
having to dip into PHP/perl/whatever to hit the trackers, and nginx
could handle it and completely offload the need to talk to PHP for
the binaries/whatever is stored in Mogile…

On Jun 11, 2008, at 3:52 AM, Aleksandar L. wrote:

The origin servers for this stuff are currently running lighttpd, but
I will probably switch them to nginx at some point. The images and
movies are currently served through PHP with some caching in front
courtesy of Varnish.

Do you know about X-Accel-Redirect?

http://wiki.codemongers.com/NginxXSendfile

Yep, but this doesn’t work for us because the content is not local
(usually it is on S3, but sometimes it is elsewhere). What would be
neat is if Nginx supported Perlbal-style internal redirects. From the
Perlbal docs:

• Big one for us; a backend can instruct Perlbal to fetch the user’s
data from a completely separate server and port and URL, 100%
transparent to the user
• Can actually give Perlbal a list of URLs to try. Perlbal will find
one that’s alive. Again, the end user sees no redirects happening.
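
From what I understand, the mechanism is just a response header from
the backend, something like this (URLs made up):

    HTTP/1.0 200 OK
    X-REPROXY-URL: http://host1.example.com/f.jpg http://host2.example.com/f.jpg

Perlbal then fetches the file from the first URL that answers and
streams it to the client, who never sees a redirect.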

On Wed 11.06.2008 10:15, Barry A. wrote:

• Big one for us; a backend can instruct Perlbal to fetch the
user’s data from a completely separate server and port and URL, 100%
transparent to the user
• Can actually give Perlbal a list of URLs to try. Perlbal will
find one that’s alive. Again, the end user sees no redirects
happening.

Hm, sounds interesting, but how big is the delay while Perlbal
searches for the right content?!

Cheers

Aleks

On Wed 11.06.2008 11:46, mike wrote:

required, purely nginx <-> MogileFS based on the request URI)

I think it can do reproxying on the fly, but I am not 100% sure; I
never got that far, but I am sure I saw examples. This would be huge
if we could get that into nginx: then I could use MogileFS without
having to dip into PHP/perl/whatever to hit the trackers, and nginx
could handle it and completely offload the need to talk to PHP for
the binaries/whatever is stored in Mogile…

I’m quite interested in experiences with MogileFS and nginx ;-), do
you need help?

Aleks

On 6/11/08, Aleksandar L. [email protected] wrote:

I’m quite interested in experiences with MogileFS and nginx ;-), do
you need help?

Aleks

There might need to be a patch or two added to nginx for seamless
nginx/MogileFS integration. I am not sure.

There’s been interest and funding pledged from myself and Engine Y.:
http://marc.info/?l=nginx&m=120950783121201&w=2

I can’t find a good full thread list of the discussion/my ideas, but
here’s the place with ALL of the mentions of mogilefs:
http://marc.info/?l=nginx&w=2&r=1&s=mogilefs&q=b

Some specific posts:
http://marc.info/?l=nginx&m=120902903131858&w=2
http://marc.info/?l=nginx&m=120968070404391&w=2
http://marc.info/?l=nginx&m=120965618429262&w=2
http://marc.info/?l=nginx&m=120952941317717&w=2
http://marc.info/?l=nginx&m=120952315710511&w=2

Someone who knows C, working together with dormando from Danga to
understand the best interaction methods, could possibly whip up a
very low-overhead, easy way to skip interacting with a middle tier
(PHP, Perl, Python, etc.) and have nginx talk directly to MogileFS
instances - a damn scalable solution.

I already have nginx being used as the mogstored webdav server
(instead of using perlbal) and it seems to work well. I’m just looking
for the frontend mapping of

user GET → somehow resolves to mogilefs key → ask mogilefs tracker
→ get location → tell nginx to reproxy it to that URL
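
If someone built it, I imagine the config could look something like
this - purely hypothetical directives, nothing like this exists in
nginx today:

    location /media/ {
        # made-up directives, for illustration only
        mogilefs_tracker 10.0.0.5:7001;
        mogilefs_domain  media;
        # nginx would ask the tracker for the key's paths and then
        # reproxy the request to one of the returned URLs
    }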