Heap fragmentation in a long-running ruby process

Abstract
In a long-running ruby process with a highly dynamic object-space, we
encountered performance degradation and finally memory-allocation
failure due to heap fragmentation. The problem can be mitigated by
linking ruby against ptmalloc3.

Hi all! I’m writing this mail in the hope that my experiences may point
you in the right direction, if you ever encounter a similar problem.
Naturally I would be delighted to read your comments and advice on my
conclusions and the steps taken.

http://ch.oddb.org [1] provides information on the Swiss health-care
market. Behind an Apache/mod-ruby setup lies a single ruby-process,
which acts as a DRb-Server. Predating Ruby on Rails, the application is
based on self-baked libraries [2-4].

A couple of weeks ago we experienced a spike in user requests. Although
the application seemed to scale well most of the time, we began
experiencing outages after a couple of hours. Whenever that happened,
CPU load rose to 100% and DRb requests were hanging, sometimes for
several minutes. At the same time, memory usage started rising
considerably. If left running long enough, the application would
crash with a NoMemoryError: ‘failed to allocate memory’ - even though
there was still plenty of memory available in the system.

Thanks to Jamis B. [5] and Mauricio F. [6] I was able to
determine that the application was stuck for several seconds in glibc’s
realloc, which may be called (via ruby_xrealloc) from basically anywhere
within ruby where a new or enlarged chunk of memory might be required.

With the diagnosis established as heap fragmentation [7], there were a couple
of things I could try to improve the performance of our application, all
revolving around the principle of creating fewer objects, and in
particular fewer Strings, Arrays and Hashes. By eliminating a number of
obvious suspects (mainly to do with the on-demand sorting of values
stored in a large Hash), I was able to raise the life-expectancy of our
application considerably - close, but no cigar.
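
To illustrate the kind of change I mean, here is a minimal sketch. The class
and method names are made up for this mail and are not taken from the oddb.org
code. It contrasts sorting the values of a Hash on every request, which
allocates fresh Arrays each time, with keeping one sorted Array that is rebuilt
only after a write.

```ruby
# Hypothetical sketch: names are illustrative, not from the real application.
class SortedIndex
  def initialize
    @items  = {}   # stands in for the large Hash
    @sorted = nil  # cached sorted Array, rebuilt only after a write
  end

  def store(key, value)
    @items[key] = value
    @sorted = nil  # invalidate the cache
  end

  # Allocates a values Array and a sorted Array on every call.
  def sorted_on_demand
    @items.values.sort
  end

  # Reuses one frozen Array until the next store.
  def sorted_cached
    @sorted ||= @items.values.sort.freeze
  end
end

index = SortedIndex.new
%w[paracetamol aspirin ibuprofen].each_with_index do |name, i|
  index.store(i, name)
end

puts index.sorted_cached.equal?(index.sorted_cached)       # true: same object
puts index.sorted_on_demand.equal?(index.sorted_on_demand) # false: two Arrays
```

The cached variant trades a little bookkeeping on writes for far fewer
short-lived Arrays on reads, which is the pattern that helped in our case.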

And then - all praise bugzilla - I found a bug report [8] describing
my problem almost exactly and leading me to ptmalloc3 [9]. Glibc’s
malloc implementation is based on ptmalloc2, and may be replaced by
simply linking ruby against ptmalloc3.

As far as I understand, ptmalloc3 does not eliminate heap fragmentation.
However, due to the bit-wise tree employed in the newer version, it
finds free chunks of the right size several orders of magnitude
faster. Additionally, it seems that glibc 2.5 abandons its attempts
to find a best-fit chunk after a while (possibly after 10000 tries),
instead expanding the heap as long as possible and finally failing to
allocate memory - causing first the fast rise in memory usage and later
the observed NoMemoryError.
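
A toy model may make this easier to picture. It is only a sketch under my own
assumptions and does not reflect how glibc or ptmalloc3 are actually
implemented: the free list is a plain Array of chunk sizes, the linear scan
stands in for a slow best-fit search, and a binary search over a sorted Array
stands in for any ordered index of free chunks.

```ruby
# Toy model only: not the real glibc or ptmalloc3 data structures.

# Best fit by scanning every free chunk; returns [size_or_nil, chunks_examined].
def best_fit_linear(free_chunks, request)
  best  = nil
  steps = 0
  free_chunks.each do |size|
    steps += 1
    best = size if size >= request && (best.nil? || size < best)
  end
  [best, steps]
end

# Best fit over sizes kept in ascending order (requires Ruby 2.0+ for bsearch).
def best_fit_sorted(sorted_chunks, request)
  sorted_chunks.bsearch { |size| size >= request }
end

# 10_000 small holes (16 to 128 bytes) left behind by short-lived objects.
free_chunks = Array.new(10_000) { |i| 16 + (i % 8) * 16 }
total_free  = free_chunks.inject(0) { |sum, size| sum + size }

puts "total free bytes: #{total_free}"                    # 720000
puts best_fit_linear(free_chunks, 4096).inspect           # [nil, 10000]
puts best_fit_linear(free_chunks, 100).inspect            # [112, 10000]
puts best_fit_sorted(free_chunks.sort, 100).inspect       # 112
```

Although 720000 bytes are free in total, no single hole can satisfy a 4096-byte
request, so the heap has to grow; and every request, successful or not, pays
for a walk over all 10000 holes unless the free chunks are indexed by size.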

At this time, http://ch.oddb.org has run - powered by ruby and ptmalloc3
- for a little more than 24 hours without displaying any of the signs I
have come to associate with heap fragmentation. Significantly less time
is spent allocating memory - and consequently in GC - and the overall
memory-footprint has decreased by about 30%.
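
For anyone who wants to check the GC share in their own process, here is one
possible way to do it. It assumes Ruby 1.9 or later, where GC::Profiler is
available; the churn method is a made-up workload, and this is not the
procedure I used for the figures above.

```ruby
# Sketch assuming Ruby 1.9+ (GC::Profiler); the workload is hypothetical.
GC::Profiler.enable

# Creates many short-lived Strings, then returns how many were built.
def churn(n)
  n.times.map { |i| "row #{i}" * 4 }.size
end

started = Time.now
churn(200_000)
GC.start                        # force at least one collection
elapsed = Time.now - started

gc_seconds = GC::Profiler.total_time   # seconds spent in GC while enabled
puts "GC runs so far: #{GC.count}"
puts format("wall time %.3fs, of which GC %.3fs", elapsed, gc_seconds)

GC::Profiler.disable
```

Comparing these numbers before and after a change of allocator gives a rough
idea of how much of the time goes into allocation and collection.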

I hope this is of use - thanks in advance for any thoughts you want to
share.

Hannes

[1] Open Drug Database
http://scm.ywesee.com/?p=oddb.org;a=summary
[2] Object-Database Access and Object Cache
http://scm.ywesee.com/?p=odba;a=summary
[3] State-Based Session Management
http://scm.ywesee.com/?p=sbsm;a=summary
[4] Component-Based Html generator
http://scm.ywesee.com/?p=htmlgrid;a=summary
[5] Inspecting a live ruby process, Jamis B.
Buckblog: Inspecting a live Ruby process
[6] Ruby live process introspection, Mauricio F.
eigenclass.org
[7] Heap fragmentation, Bruno R. Preiss
brpreiss.com
[8] Glibc bugzilla report 4349, Mingzhou Sun, Tomash Brechko
http://sourceware.org/bugzilla/show_bug.cgi?id=4349
[9] Ptmalloc home, Wolfram Gloger
Wolfram Gloger's malloc homepage