Scaling unicorn

Interested in some feedback on this (does it sound right?), or maybe
this might be of interest to others.

We are launching a new Facebook app in a couple of weeks, and we did
some load testing over the weekend on our unicorn web cluster. The
servers are 8-way Xeons with 24GB of RAM. Our app ended up being
primarily CPU bound. So far the sweet spot for the number of unicorns
seems to be around 40; this yielded the most requests per second
without overloading the server or hitting memory bandwidth issues. The
backlog is at the somaxconn default of 128, and I’m still not sure
whether we will bump that up.

Increasing the number of unicorns beyond a certain point resulted in a
noticeable drop in the requests per second the server could handle.
I’m pretty sure the cause is the box running out of memory bandwidth:
the load average and resource usage in general (except for memory)
would keep going down, but so did the requests per second. At 80
unicorns the requests per second dropped by more than half. I’m going
to disable hyperthreading and rerun some of the tests to see what
impact that has.
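
For reference, those knobs are set in the unicorn config file. A
minimal sketch along these lines, with placeholder paths and the
values from the test above (not a recommendation):

    # config/unicorn.rb -- placeholder paths, values from the test above
    worker_processes 40                  # current sweet spot on this box
    working_directory "/var/www/app"     # placeholder
    listen 8080, :backlog => 128         # matches the somaxconn default
    timeout 30
    preload_app true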

Chris

snacktime [email protected] wrote:

> Interested in some feedback on this (does it sound right?), or maybe
> this might be of interest to others.

Hi Chris,

I think you meant to post this to the [email protected]
list, not [email protected] :>

> We are launching a new Facebook app in a couple of weeks, and we did
> some load testing over the weekend on our unicorn web cluster. The
> servers are 8-way Xeons with 24GB of RAM. Our app ended up being
> primarily CPU bound. So far the sweet spot for the number of unicorns
> seems to be around 40; this yielded the most requests per second
> without overloading the server or hitting memory bandwidth issues.
> The backlog is at the somaxconn default of 128, and I’m still not
> sure whether we will bump that up.

The default backlog we try to specify is actually 1024 (same as
Mongrel). But it’s always a murky value anyways, as it’s
kernel/sysctl-dependent. With Unix domain sockets, some folks use
crazy values like 2048 to look better on synthetic benchmarks :)
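
If you do experiment with it, the backlog is set per listener in the
unicorn config. A hedged sketch with a hypothetical socket path (on
Linux the effective value is still capped by net.core.somaxconn unless
that sysctl is raised as well):

    # hypothetical unicorn config snippet, for benchmarking only
    listen "/tmp/unicorn.sock", :backlog => 2048   # Unix domain socket
    listen 8080, :backlog => 1024                  # TCP equivalent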

> Increasing the number of unicorns beyond a certain point resulted
> in a noticeable drop in the requests per second the server could
> handle. I’m pretty sure the cause is the box running out of memory
> bandwidth: the load average and resource usage in general (except
> for memory) would keep going down, but so did the requests per
> second. At 80 unicorns the requests per second dropped by more than
> half. I’m going to disable hyperthreading and rerun some of the
> tests to see what impact that has.

That’s “8-way Xeon” before hyperthreading, right? Which family of
Xeons are you using, the Pentium4-based crap or the awesome new ones?

How much memory is each Unicorn worker using for your app?

40 workers for 8 physical cores sounds reasonable. Depending on the
app, I think the reasonable range is anywhere from 2-8 workers per
physical core. More if you’re (unfortunately) limited by external
network calls, but since you claim to be CPU bound, less.
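
As a sketch of that rule of thumb (the multiplier here is
illustrative, not measured):

    # illustrative only: derive worker_processes from physical cores
    physical_cores   = 8    # this box, before hyperthreading
    workers_per_core = 5    # somewhere in the 2-8 range, app-dependent
    worker_processes physical_cores * workers_per_core   # => 40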

Do you have actual performance numbers you’re able to share?
Mean/median request times/rates would be very useful. If your requests
run very quickly, you may be limited by contention with the accept()
syscall on the listen socket, too.
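
Even a quick ab run against one representative endpoint would give
mean and percentile request times (host, counts and path below are
placeholders):

    # hypothetical benchmark run
    ab -n 10000 -c 100 http://10.0.0.5:8080/some/endpoint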

I assume you’re using nginx as the proxy; is this with Unix domain
sockets or TCP sockets? Unix domain sockets should give a small
performance win over TCP if it’s all on the same box.
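
For the Unix domain socket case, the nginx side is just an upstream
pointing at the socket. A minimal sketch with a hypothetical socket
path:

    # hypothetical nginx snippet: proxy to unicorn over a Unix socket
    upstream unicorn_app {
      server unix:/tmp/unicorn.sock fail_timeout=0;
    }
    server {
      listen 80;
      location / {
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://unicorn_app;
      }
    }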

With TCP, you should also check that you have enough local ports
available if you’re hitting extremely high (and probably unrealistic :)
request rates.
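
On Linux, the ephemeral port range and the connection churn during a
test run are easy to check (commands below assume a Linux box):

    # local ports available for outgoing connections
    sysctl net.ipv4.ip_local_port_range
    # rough count of sockets stuck in TIME_WAIT mid-test
    netstat -an | grep -c TIME_WAIT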

What was the request rate and total bandwidth flowing at your peak?

How far is that from your theoretical potential on the box?