Random 502 Bad Gateway errors and read timeouts in a high-traffic Rails application

mike.jhons24 · March 12, 2020, 10:53am

Hi,

I have a Rails application with an API that receives roughly 600k to 1,000k requests per day. These requests primarily target two endpoints: a file list endpoint and a file fetch endpoint.

The file list endpoint fetches a list of 2,000 records containing file information from Elasticsearch. Each record has an id, which is then used to fetch the file through the API. Files are served with send_file (using the default X-Sendfile header with Nginx) or with send_data (if the file is compressed, it is decompressed on the fly and read out).
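To illustrate the send_data path, here is a minimal sketch of decompressing in fixed-size chunks instead of reading the whole decompressed file into memory. It assumes the files are gzip-compressed (the compression format is not stated above), and the helper name `each_decompressed_chunk` and the chunk size are hypothetical, not the application's actual code.

```ruby
require "zlib"
require "stringio"

# Assumed chunk size; tune to taste.
CHUNK_SIZE = 16 * 1024

# Hypothetical helper: yields the decompressed contents of a gzip stream
# in pieces of at most chunk_size bytes, so the full file is never held
# in memory at once. Returns an Enumerator when no block is given.
def each_decompressed_chunk(io, chunk_size = CHUNK_SIZE)
  return enum_for(:each_decompressed_chunk, io, chunk_size) unless block_given?

  gz = Zlib::GzipReader.new(io)
  begin
    # GzipReader#read(length) returns nil once the stream is exhausted.
    while (chunk = gz.read(chunk_size))
      yield chunk
    end
  ensure
    gz.close # also closes the underlying IO
  end
end

# Usage with an in-memory gzip stream standing in for a file on disk.
compressed = Zlib.gzip("example payload " * 1_000)
total = 0
each_decompressed_chunk(StringIO.new(compressed)) { |chunk| total += chunk.bytesize }
puts "streamed #{total} bytes"
```

In a controller, chunks like these could be written to a streaming response (for example via ActionController::Live and response.stream.write) rather than passed to send_data as one large string. Note that a streamed response still occupies a Puma thread for as long as the download runs.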

This is all hosted on a hypervisor running LXD containers, where we had issues with memory filling up even though we had 32 GB of it. We increased it to 48 GB, but that gradually filled up too.

We initially thought the 502s and read timeouts were caused by the memory filling up. After some profiling, we found that reading out files consumed memory, and there seemed to be an issue with the LXD container not releasing memory, which we fixed with a script that releases memory every 30 minutes.

Now that memory is stable, we still get some read timeouts and 502s at random, along with an SSL_connect error that starts appearing (randomly) after we get a 502 and stops if we restart the application.

We use the default Puma config and a connection pool of 5 for the Postgres database.
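One thing that may be worth ruling out is a mismatch between the Puma thread count and the database pool. Below is a rough sketch of the arithmetic. The helper name `pool_report` and the example numbers are hypothetical. In particular, the 16-thread figure is only an assumption about what "default Puma config" means here: Puma versions before 5.0 allow up to 16 threads when no thread setting is given, while the config/puma.rb generated by Rails defaults to 5.

```ruby
# Hypothetical helper: compares Puma threads per process with the
# ActiveRecord pool size. Each Puma worker process gets its own pool,
# so the pool only needs to cover the threads of one process.
def pool_report(max_threads:, pool_size:, workers: 1)
  processes = [workers, 1].max # workers 0 means single mode, one process
  {
    threads_per_process: max_threads,
    pool_per_process: pool_size,
    # Threads that could be left waiting if every thread needs a connection.
    starved_threads: [max_threads - pool_size, 0].max,
    # Upper bound on connections opened against Postgres.
    total_db_connections: processes * pool_size
  }
end

# Assumed numbers: 16 threads (see the caveat above) and the pool of 5.
p pool_report(max_threads: 16, pool_size: 5)
```

If starved_threads comes out above zero, threads can end up waiting for a database connection under load, and ActiveRecord raises ActiveRecord::ConnectionTimeoutError once its checkout timeout expires. Whether that applies here depends on the actual thread setting, which is not known from the information above.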

Any thoughts on optimization and debugging would be helpful.

Thanks.