In-flight HTTP requests fail during hot configuration reload (SIGHUP)

We have recently migrated from HAProxy to Nginx because it supports
true zero-downtime configuration reloads. However, our monitoring
systems occasionally report 502 and 504 errors during deployments.
Looking into this, I have been able to replicate the 502 and 504
errors consistently, as follows. I believe this is an error in how
Nginx handles in-flight requests, but I wanted to ask the community
in case I am missing something obvious.

The Nginx set-up is as follows:

  • Ubuntu 14.04

  • Nginx version 1.9.1

  • Configuration for the HTTP listener:

    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    server {
        listen 8080;

        # pass on the real client's IP
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        access_log /var/log/nginx/access.ws-8080.log combined;

        location / {
            proxy_pass http://server-ws-8080;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
        }
    }

    upstream server-ws-8080 {
        least_conn;
        server 172.17.0.51:8080 max_fails=0;
    }

  1. Telnet to the Nginx server on the HTTP port it is listening on.

  2. Send an HTTP/1.1 request to the upstream server (172.17.0.51):
    GET /health HTTP/1.1
    Host: localhost
    Connection: Keep-Alive

    This request succeeds and the response is valid.

  3. Start a new HTTP/1.1 request but don't finish it, i.e. send only
    the following line using telnet:
    GET /health HTTP/1.1

  4. While that request is effectively in flight (it is not finished
    and Nginx is waiting for it to be completed), reconfigure Nginx
    with a SIGHUP signal. The only difference in the config at the
    time of the SIGHUP is that the upstream server has changed, i.e.
    we intentionally want all new requests to go to the new upstream
    server.

  5. Terminate the old upstream server 172.17.0.51.

  6. Complete the in-flight HTTP/1.1 request started in step 3 above
    with:
    Host: localhost
    Connection: Keep-Alive

  7. Nginx consistently responds with a 502 if the old upstream server
    rejects the request, or a 504 if there is no response on that IP
    and port. (A scripted version of these steps is sketched below.)
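
For reference, here is a minimal sketch of the same steps using a raw
Python socket instead of telnet; the nginx address and the sleep that
serves as the window for steps 4 and 5 are assumptions on my part:

    # Minimal sketch of the steps above using a raw socket instead of telnet.
    # The nginx address and the length of the pause (the window in which the
    # SIGHUP and the upstream shutdown are performed by hand) are assumptions.
    import socket
    import time

    NGINX_ADDR = ("127.0.0.1", 8080)   # assumed address of the nginx listener

    sock = socket.create_connection(NGINX_ADDR)

    # Step 2: a complete keep-alive request; this one succeeds.
    sock.sendall(b"GET /health HTTP/1.1\r\n"
                 b"Host: localhost\r\n"
                 b"Connection: Keep-Alive\r\n\r\n")
    print(sock.recv(65536).decode(errors="replace"))

    # Step 3: start a second request on the same connection, but don't finish it.
    sock.sendall(b"GET /health HTTP/1.1\r\n")

    # Steps 4 and 5 happen externally: reload nginx (kill -HUP <master pid>)
    # with the new upstream configured, then terminate the old upstream server.
    time.sleep(30)   # assumed window for performing the reload and shutdown

    # Step 6: complete the in-flight request.
    sock.sendall(b"Host: localhost\r\nConnection: Keep-Alive\r\n\r\n")

    # Step 7: the response read here is the 502/504 described above.
    print(sock.recv(65536).decode(errors="replace"))
    sock.close()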

I believe this behaviour is incorrect: Nginx, once it receives the
complete request, should direct the request to the currently
available upstream server. However, it seems that Nginx is instead
deciding which upstream server to send the request to before the
request is completed, and as such is directing the request to a
server that no longer exists.

Any advice appreciated.

BTW, I tried to raise an issue on http://trac.nginx.com/, however it
seems that the authentication system is completely broken. I tried
logging in with Google, MyOpenID, WordPress and Yahoo, and none of
those OpenID providers work any more.

Thanks,
Matt

Hello!

On Mon, Jun 01, 2015 at 03:40:12PM +0100, Matthew O’Riordan wrote:


> I believe this behaviour is incorrect: Nginx, once it receives
> the complete request, should direct the request to the currently
> available upstream server. However, it seems that Nginx is
> instead deciding which upstream server to send the request to
> before the request is completed, and as such is directing the
> request to a server that no longer exists.

Your problem is in step (5). While you've started new nginx
workers to handle new requests in step (4), this doesn't guarantee
that old upstream servers are no longer needed. Only new
connections will be processed by new worker processes with the new
nginx configuration. Old workers continue to service requests
started before you reconfigured nginx, and will only terminate
once all previously started requests are finished. This includes
requests already sent to an upstream server (where nginx is
reading a response) as well as requests not yet read from a
client. For these requests the previous configuration applies, and
you shouldn't stop old upstream servers until the old worker
processes are shut down.
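
In practice this suggests gating the shutdown of old upstream servers
on the old worker processes having exited. Old workers that are still
draining requests normally carry the process title "nginx: worker
process is shutting down", so a deploy step can poll for that; below
is a rough sketch, with the polling interval and timeout as
assumptions rather than values from this thread:

    # Rough sketch: after sending SIGHUP, wait until no nginx worker is still
    # draining old requests before terminating the old upstream server. Old
    # workers normally carry the process title "nginx: worker process is
    # shutting down" while they finish previously started requests. The
    # polling interval and overall timeout below are assumptions.
    import subprocess
    import time

    def old_workers_still_draining():
        output = subprocess.check_output(["ps", "-eo", "args"]).decode()
        return "worker process is shutting down" in output

    def wait_for_old_workers(timeout=300, interval=5):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if not old_workers_still_draining():
                return True
            time.sleep(interval)
        return False   # gave up: old workers are still busy with long requests

    if __name__ == "__main__":
        if wait_for_old_workers():
            print("old workers gone; safe to stop the old upstream server")
        else:
            print("old workers still draining; keeping the old upstream up")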

Some details about the reconfiguration process can be found here:

http://nginx.org/en/docs/control.html#reconfiguration


Maxim D.
http://nginx.org/

Hi Maxim

Thanks for the reply.

A few comments below, with context for others reading this thread:


> While you've started new nginx workers to handle new requests in
> step (4), this doesn't guarantee that old upstream servers are no
> longer needed.

I realise that is the problem, but I am not quite sure what the best
strategy to correct it is. We are experiencing this problem in
production because Nginx sits behind an Amazon ELB. By default, ELB
maintains a connection to the client (a browser, for example) and to
a backend server (Nginx in this case). What we seem to be
experiencing is that because ELB has opened a connection to Nginx,
Nginx has automatically assigned this socket to an existing healthy
upstream server. So even if a SIGHUP is sent to Nginx, ELB's next
request will always be processed by the upstream server that was
configured at the time the connection to Nginx was opened. Therefore,
to do rolling deployments, we have to keep the old server running for
up to, say, 2 minutes to ensure that requests on existing connections
are completed. We have designed our upstream server so that it will
complete existing in-flight requests; however, it considers an
in-flight request to be one that is being responded to, not one where
the connection has merely been opened and no data has yet been sent
from the client on the socket.

> Only new connections will be processed by new worker processes
> with the new nginx configuration. Old workers continue to service
> requests started before you reconfigured nginx, and will only
> terminate once all previously started requests are finished. This
> includes requests already sent to an upstream server (where nginx
> is reading a response) as well as requests not yet read from a
> client. For these requests the previous configuration applies, and
> you shouldn't stop old upstream servers until the old worker
> processes are shut down.

Ok, however we do need a sensible timeout to ensure we do actually
shut down our old upstream servers too. This is the problem I am
finding with the strategy we currently have.

ELB, for example, pipelines requests using a single TCP connection in
accordance with the HTTP/1.1 spec. When a SIGHUP is sent to Nginx, how
does it then deal with pipelined requests? Will it process all received
requests and then issue a "Connection: Close" header, or will it
process the current request and then close the connection? If the
former, then it's quite possible that in the time those in-flight
requests are responded to, another X number of requests will have been
received in the pipeline as well.

> Some details about the reconfiguration process can be found here:
> http://nginx.org/en/docs/control.html#reconfiguration

I have read that page previously, but unfortunately I found it did not
reveal much about how keep-alive connections and pipelining are handled
during a reload.

Thanks again,
Matt

Hello!

On Mon, Jun 01, 2015 at 09:21:12PM +0100, Matthew O’Riordan wrote:

[…]

Ideally, you should keep old upstream servers running till all old
worker processes are terminated. This way you won't depend on
configuration and/or implementation details. This can take a bit
too long though, as old workers are usually busy sending big
responses to slow clients.

> Ok, however we do need a sensible timeout to ensure we do
> actually shut down our old upstream servers too. This is the
> problem I am finding with the strategy we currently have.
>
> ELB, for example, pipelines requests using a single TCP
> connection in accordance with the HTTP/1.1 spec. When a SIGHUP
> is sent to Nginx, how does it then deal with pipelined requests?
> Will it process all received requests and then issue a
> "Connection: Close" header, or will it process the current
> request and then close the connection? If the former, then it's
> quite possible that in the time those in-flight requests are
> responded to, another X number of requests will have been
> received in the pipeline as well.

Upon a SIGHUP, nginx will finish processing the requests it has
already started to process. No additional requests will be
processed, including pipelined ones. This is considered to be an
implementation detail though, not something guaranteed.
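
One way to observe this on a test box is to pipeline two requests on a
single connection, reload nginx while they are in flight, and count how
many responses come back before the connection is closed. A small
sketch follows; the nginx address and the timing of the manual reload
are assumptions, and it only illustrates the behaviour described above
rather than proving it is guaranteed:

    # Sketch: pipeline two requests on one connection, reload nginx while they
    # are in flight, and see how many responses arrive before the connection
    # is closed. The nginx address and the timing of the manual reload are
    # assumptions; this only illustrates the behaviour described above.
    import socket

    NGINX_ADDR = ("127.0.0.1", 8080)   # assumed nginx listener

    request = b"GET /health HTTP/1.1\r\nHost: localhost\r\n\r\n"

    sock = socket.create_connection(NGINX_ADDR)
    sock.sendall(request + request)    # two pipelined requests, back to back

    # Reload nginx now from another shell (kill -HUP <master pid>), then read
    # until the server closes the connection or the timeout is reached.
    sock.settimeout(120)               # assumed upper bound on the wait
    chunks = []
    try:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    except socket.timeout:
        pass
    sock.close()

    raw = b"".join(chunks).decode(errors="replace")
    print(raw)
    print("status lines received:", raw.count("HTTP/1.1 "))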


Maxim D.
http://nginx.org/