Debugging 504 Gateway Timeout: finding the actual cause and solution

Hi,

We are running the following stack on our web server: Varnish + Nginx +
FastCGI (php-fpm) on RHEL 6.6.

It's a dynamic website with different result sets every time, and it has
around 2 million URLs indexed with Google.

  • It's running nginx/1.5.12 and PHP 5.3.3 (will be upgraded to the
    latest nginx and PHP soon)
  • Nginx connects to php-fpm running locally on the same server on port 9000

We are getting 504 Gateway Timeout intermittently on some pages, and we have
been unable to resolve it. The URLs which give a 504 work fine again after
some time. We only learn about the 504s from our logs; we haven't been able
to replicate the problem because it happens randomly on any URL and then
works again after a while.

I have had a couple of discussions with the developer, but according to him
the underlying PHP script hardly does anything and should not take this long
(120 seconds), yet it is still giving a 504 Gateway Timeout.

We need to establish where exactly the issue occurs:

  • Is it a problem with Nginx?
  • Is it a problem with php-fpm?
  • Is it a problem with the underlying PHP scripts?
  • Is it possible that nginx is not able to connect to php-fpm?
  • Would it be resolved if we used a Unix socket instead of a TCP/IP
    connection to php-fpm? (A sketch of that change is shown below.)
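
To be concrete about the last point, this is roughly what the switch would
look like. It is only a sketch, assuming a socket path of
/var/run/php-fpm.sock and the default www pool; the names and paths are
examples, not our current config.

In the php-fpm pool config (e.g. /etc/php-fpm.d/www.conf):

; listen on a Unix socket instead of 127.0.0.1:9000
listen = /var/run/php-fpm.sock
listen.owner = nginx
listen.group = nginx
listen.mode = 0660

In the nginx vhost:

location ~ \.php$ {
    # pass requests to the socket instead of fastcgi://127.0.0.1:9000
    fastcgi_pass unix:/var/run/php-fpm.sock;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}

As far as I understand, this only removes the TCP handshake from the picture;
it would not help if the real problem is php-fpm not accepting connections
fast enough.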

The URL times out after 120 seconds with a 504.

Below is the error seen:
2016/01/04 17:29:20 [error] 1070#0: *196333149 upstream timed out (110: Connection timed out) while connecting to upstream, client: 66.249.74.95, server: x.x.x.x, request: "GET /Some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"

Earlier, with fastcgi_connect_timeout set to 150 seconds, it used to give a
502 status code after 63 seconds with the default net.ipv4.tcp_syn_retries = 5
on RHEL 6.6; after we set net.ipv4.tcp_syn_retries = 6, it started giving the
502 after 127 seconds.
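
If I understand the Linux SYN retransmission backoff correctly, those timings
line up with the kernel giving up on the SYN before nginx's 150-second connect
timeout ever fired (assuming a 1-second initial retransmission timeout, which
is what the observed numbers suggest):

tcp_syn_retries = 5: 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds
tcp_syn_retries = 6: 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds

In other words, during those windows the connect() to 127.0.0.1:9000 was not
being answered at all.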

Once I set fastcgi_connect_timeout = 120, it started giving a 504 status code
instead. I understand that such a high value for fastcgi_connect_timeout is
not good.

We need to find out why exactly we are getting the 504 (I know it's a timeout,
but the cause is unknown) and get to the root cause so it can be fixed
permanently.

How do I confirm where exactly the issue is?

If it's poorly written code, then I need to inform the developer that the 504
is happening due to an issue in the PHP code and not due to nginx or php-fpm;
and if it's due to nginx or php-fpm, then we need to fix that.

Thanks in Advance!

Posted at Nginx Forum:

Here are some of the timeouts already defined:

Under the server-wide nginx.conf:

keepalive_timeout 5;
send_timeout 150;

Under the specific vhost.conf:

proxy_send_timeout 100;
proxy_read_timeout 100;
proxy_connect_timeout 100;
fastcgi_connect_timeout 120;
fastcgi_send_timeout 300;
fastcgi_read_timeout 300;

Different values are used for the timeouts so that I can figure out which
timeout exactly was triggered.
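
I am also thinking of adding a separate access log with the upstream timing
variables, so that for each request I can see how long the whole request took
versus how long php-fpm took to respond. A rough sketch; the format name and
log path are just examples, and newer nginx versions additionally offer
$upstream_connect_time, which would pinpoint the connect phase directly:

log_format upstream_timing '$remote_addr "$request" status=$status '
                           'request_time=$request_time '
                           'upstream_response_time=$upstream_response_time '
                           'upstream_status=$upstream_status';

access_log /var/log/nginx/upstream_timing.log upstream_timing;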

Below are some of the settings from sysctl.conf:

net.ipv4.ip_local_port_range = 1024 65500
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_syn_retries = 6
net.core.netdev_max_backlog = 8192
net.ipv4.tcp_max_tw_buckets = 2000000
net.core.somaxconn = 4096
net.ipv4.tcp_no_metrics_save = 1
vm.max_map_count = 256000

Thanks!

Posted at Nginx Forum:

Keyur Wrote:

If it's poorly written code, then I need to inform the developer that the
504 is happening due to an issue in the PHP code and not due to nginx or
php-fpm; and if it's due to nginx or php-fpm, then we need to fix that.

How many PHP master/slave processes are generally running, and how many when
you get the 5xx errors?
It is best to define and use a pool (upstream) for PHP, as sketched below.
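
Something along these lines, as a sketch only (the upstream name is an
example; the Unix socket line just shows how you could switch later):

upstream php_backend {
    server 127.0.0.1:9000;
    # server unix:/var/run/php-fpm.sock;
}

location ~ \.php$ {
    fastcgi_pass php_backend;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}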

Posted at Nginx Forum:

Have you checked the php-fpm logs? It seems like your backend is overloaded
and not accepting connections fast enough.
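
One way to check that directly is php-fpm's status page, which reports (among
other things) whether pm.max_children has been reached; newer php-fpm versions
also show the listen queue there. A minimal sketch, assuming the default www
pool and an example status path:

In the pool config (e.g. /etc/php-fpm.d/www.conf):

pm.status_path = /fpm-status

In the nginx vhost, restricted to localhost:

location = /fpm-status {
    allow 127.0.0.1;
    deny all;
    include fastcgi_params;
    fastcgi_param SCRIPT_NAME $fastcgi_script_name;
    fastcgi_pass 127.0.0.1:9000;
}

Then watch "max children reached" (and the listen queue counters, if present)
in the output of curl http://127.0.0.1/fpm-status while the 504s are
happening.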

Thanks Richard & itpp2015 for your responses.

Further update:

There are 2 cases:

  1. 504 @ 120 seconds, with the below-mentioned error:

2016/01/05 03:50:54 [error] 1070#0: *201650845 upstream timed out (110: Connection timed out) while connecting to upstream, client: 66.249.74.99, server: x.x.x.x, request: "GET /some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"

  2. 504 @ 300 seconds, with the below-mentioned error:

2016/01/05 00:51:43 [error] 1067#0: *200656359 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 115.112.161.9, server: 192.168.12.101, request: "GET /some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"

  • No errors found in php-fpm logs.

  • The number of php-fpm processes was also normal (how I am checking this
    is shown after this list). The backend doesn't look overloaded, as other
    requests were being served fine at the same time.

  • Only one php-fpm pool is being used. There is one php-fpm master (parent)
    process, and the number of slave (child) processes is usually in the
    normal range even when the 5xx errors are observed. There is no
    significant growth in the number of php-fpm processes, and even if it
    grows, the server has enough capacity to fork new ones and serve the
    requests.
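
For reference, this is roughly how I am checking the process count against
the pool limits when the errors appear (the pool file path is the RHEL
default and may differ on other setups):

# number of running php-fpm processes (master + children)
ps -C php-fpm --no-headers | wc -l

# the pool's process-manager limits, for comparison
grep -E '^pm' /etc/php-fpm.d/www.conf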

Posted at Nginx Forum:

Thanks Maxim for the details.

Will monitor / debug as per your suggestion and let you know.

Regards,
Keyur

Posted at Nginx Forum:

Hello!

On Wed, Jan 06, 2016 at 02:56:43AM -0500, Keyur wrote:

server: x.x.x.x, request: "GET /some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"

This means that nginx failed to connect to your backend server in
time. This can happen in two basic cases:

  • network problems (unlikely for localhost, though); e.g., this can
    happen if you have a stateful firewall configured between nginx
    and the backend and there aren't enough states;

  • the backend is overloaded and doesn't accept connections fast
    enough.

The latter is more likely, and it usually happens on Linux. Try watching
your backend's listen socket queue (something like "ss -nlt" should work
on Linux) and/or try switching on the net.ipv4.tcp_abort_on_overflow
sysctl to see if that's the case.
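
For example, something along these lines (port 9000 is taken from your
config; the exact commands are just one way to do it):

# watch the accept queue of the php-fpm listen socket; for listening
# sockets, Recv-Q is the current queue length and Send-Q the backlog limit
watch -n1 'ss -nlt "( sport = :9000 )"'

# check whether the kernel has ever dropped or overflowed a listen queue
netstat -s | grep -i listen

# make overflows visible as immediate resets instead of silently dropped SYNs
sysctl -w net.ipv4.tcp_abort_on_overflow=1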

  2. 504 @ 300 seconds, with the below-mentioned error:

2016/01/05 00:51:43 [error] 1067#0: *200656359 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 115.112.161.9, server: 192.168.12.101, request: "GET /some/url HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "example.com"

The message suggests the backend failed to respond in time to a
particular request. Depending on the request this may be either
some generic problem (i.e., the backend is overloaded) or a
problem with handling of the particular request. Try debugging
what happens on the backend.
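
One thing that may help with that: php-fpm's slow log can record a backtrace
of any request that runs longer than a threshold, which should show whether
the time is being spent in the script itself (the path and threshold below
are only examples):

; in the pool configuration
slowlog = /var/log/php-fpm/www-slow.log
request_slowlog_timeout = 10s

Requests taking longer than 10 seconds will then be logged with a PHP
backtrace, making it easier to tell whether the delay is in the PHP code or
elsewhere.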


Maxim D.
http://nginx.org/