Nginx in high concurrency setups

Hi,
I’m currently experimenting with how many concurrent connections nginx can
handle. The problem I’m running into is that for each request I send to the
server I get a connection in TIME_WAIT state. If I do this using
benchmarking tools like httperf or ab I quickly seem to hit a ceiling: once
the number of TIME_WAIT connections reaches about 16000 the benchmarking
tools just freeze and I have to wait until that number comes down again.

What is the reason for these TIME_WAIT connections, and how can I get rid
of them faster? I’m only serving small static files, and delivery is not
supposed to take longer than, say, 300 ms, so any connection that takes
longer than that can be aborted if necessary to make room for new incoming
connections.
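
For reference, the TIME_WAIT figure above is just a count of the sockets
on the server, along the lines of:

    # count sockets currently in TIME_WAIT
    netstat -tan | grep -c TIME_WAIT
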
Does anyone have experience with serving lots of small static requests
using nginx?

Regards,
Dennis

On Mon, Dec 14, 2009 at 11:16 AM, Dennis J. [email protected] wrote:

Hello,

You will need to tune your OS’s TCP and socket settings, I do believe.
Exactly what you must do depends on your OS.

Also, keep in mind that when you are doing these tests, ideally you should
be sending the test load from multiple machines that are not the machine
doing the serving. This rules out the benchmarking program fighting for
resources with nginx, and rules out any single machine’s ceilings.
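
On Linux, for instance, the usual suspects look something like this
(illustrative values only, so check what is right for your kernel and
workload):

    # illustrative Linux sysctls; names and defaults vary by OS and kernel
    net.ipv4.tcp_tw_reuse = 1              # allow reusing TIME_WAIT sockets
    net.ipv4.tcp_fin_timeout = 10          # shorten the FIN-WAIT-2 timeout
    net.ipv4.ip_local_port_range = 1024 65000   # widen ephemeral port range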

Thanks,
Merlin

Do you know of a single place that describes all the tunable parameters
for all the systems that nginx supports? It would probably be a good thing
to put together in one place on the wiki…

–timball

On Mon, Dec 14, 2009 at 3:16 PM, merlin corey [email protected] wrote:


Richard J. of Last.fm fame has written a bit about this kind of testing:

On Mon, Dec 14, 2009 at 21:16, merlin corey [email protected] wrote:



Rasmus Andersson

Thanks for the pointer. Right now the setup I’m testing with looks like
this (both the host and the VMs running CentOS 5.4):

Two virtual machines with 1 vCPU each and 1 GB of RAM. These get balanced
by LVS-DR running on the host system, using a weighted round-robin
scheduler with persistence disabled. The payload is 50,000 files of random
characters, each exactly 1 KB in size, distributed across 50 directories
with 1000 files each.
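
A payload like that can be generated with a quick shell loop, something
like this (the docroot path is just an example):

    # create 50 directories of 1000 random 1 KB files each
    # /var/www/html is an assumed docroot, adjust as needed
    for d in $(seq 1 50); do
        mkdir -p /var/www/html/$d
        for f in $(seq 1 1000); do
            head -c 1024 /dev/urandom > /var/www/html/$d/file$f
        done
    done
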
On the nginx side I’m pretty much running with the default config right
now, with access logging disabled (1 worker process and events
{ worker_connections 1024; }).
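
In config terms that is essentially the following (the server block here
is illustrative rather than my exact file):

    worker_processes  1;

    events {
        worker_connections  1024;
    }

    http {
        access_log  off;              # logging disabled, as mentioned above

        server {
            listen  80;
            root    /var/www/html;    # assumed docroot
        }
    }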

The machine I’m testing from is connected to the same GBit switch as the
host with the two VMs. The client machine runs the default setup, but the
load-balanced IP is excluded from connection tracking in the iptables
firewall. The VMs have their firewalls completely disabled and have been
modified in the following way:

net.ipv4.tcp_fin_timeout = 10
net.core.somaxconn = 10000
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_synack_retries = 3
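
(These live in /etc/sysctl.conf and are applied with “sysctl -p”;
individual values can also be flipped on the fly with “sysctl -w”, e.g.:)

    # apply one setting immediately
    sysctl -w net.ipv4.tcp_tw_reuse=1
    # reload everything from /etc/sysctl.conf
    sysctl -p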

The “tcp_tw_reuse” really helped with my initial TIME_WAIT problem.
My “siege” results now look like this:

[root@virt1 ~]# siege -b -c 250
** SIEGE 2.69
** Preparing 250 concurrent users for battle.
The server is now under siege…
Lifting the server siege… done.
Transactions:              420203 hits
Availability:              100.00 %
Elapsed time:               24.34 secs
Data transferred:          410.36 MB
Response time:               0.01 secs
Transaction rate:        17263.89 trans/sec
Throughput:                 16.86 MB/sec
Concurrency:               239.56
Successful transactions:   420204
Failed transactions:            0
Longest transaction:        21.00
Shortest transaction:        0.00

The nginx version I’m running is 0.7.64.

What I’m wondering about at the moment are these stray requests that take
much longer than 99% of the others. Here is a distribution with “ab”:

Percentage of the requests served within a certain time (ms)
  50%     11
  66%     15
  75%     17
  80%     18
  90%     21
  95%     23
  98%     26
  99%     29
 100%   3024 (longest request)
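
(The ab run behind these numbers was of roughly this shape; request count,
concurrency and URL here are illustrative:)

    ab -n 100000 -c 250 http://192.168.1.10/1/file1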

As you can see, 99% of the requests are delivered in 29 ms or less, but
most of the time there is at least one request that takes 3 s (and always
pretty much exactly 3 s, at least with “ab”).
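
A dropped SYN that gets retransmitted would fit: the initial TCP
retransmission timeout on these kernels is 3 s, which would explain the
suspiciously exact value. Watching for duplicate SYNs on the client during
a run should confirm it, something like this (the interface name is an
assumption):

    # repeated SYNs from the same client port indicate retransmits
    tcpdump -nn -i eth0 'tcp[tcpflags] & tcp-syn != 0 and dst port 80'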

Any ideas for further optimizations? Should I maybe choose a different
event model for this particular load (lots of small, short-lived
requests)? Also, when I tried configuring 2 worker processes it looked
like only one CPU was really under load, which is why I reduced the VMs to
1 vCPU: the second one didn’t get utilized well, so I instead created the
second VM and added the load balancing to get a more even load
distribution.
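
For completeness, the two-worker variant was simply “worker_processes 2;”
with everything else unchanged. Would pinning the workers to separate CPUs
be expected to help here? E.g.:

    # two workers pinned to separate CPUs via bitmasks
    worker_processes     2;
    worker_cpu_affinity  01 10;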

Regards,
Dennis