M. Edward (Ed) Borasky wrote:
I’ve never used it, but I believe “valgrind” is the tool one uses to
check for memory leaks in C code – like MRI, for example.
In many cases, yes — but I tried it for the 1.8.6-p230 segfault
problem, and regrettably, MRI’s GC confused the hell out of it.
FWIW, I’m the guy who produced the “Smartleaf patch” which eliminates the immediate segfault problems (though not, apparently, this reported leak). What this amounted to was identifying the particular patch (of many that went into p230 as released) that was causing the trouble, and reverting that one patch (so, the “smartleaf patch” comes straight from a reversed “svn diff”). The strategy I used was a little time-consuming, but it has the advantage of not requiring expertise in much of anything; it was bisection search (i.e., binary search among the 1.8.6 patch stream).
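If anyone wants to roll that sort of reverse patch themselves, it’s mechanical enough to script. Here’s a rough sketch -- the repository URL, the branch, the source-tree name, and the choice of r17222 (the revision where the segfaults show up, per below) are my assumptions, so adjust to whatever you’re actually chasing:

  # Produce a "revert this one revision" patch with svn, then apply it
  # backwards to a source tree.
  REPO = "http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8_6"
  REV  = 17222   # assumed culprit revision

  # "svn diff -c REV" shows exactly what that one commit changed
  system("svn diff -c #{REV} #{REPO} > revert-#{REV}.patch") or abort "svn diff failed"

  # "patch -R" applies the diff in reverse, i.e. undoes the change in the tree
  Dir.chdir("ruby-1.8.6-p230") do
    system("patch -R -p0 < ../revert-#{REV}.patch") or abort "patch failed"
  end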
The idea of a bisection search is basically this: There’s an ordered
series of svn revisions that turned 1.8.6-p111 into 1.8.6-p230 (or
whatever), and among those, there’s going to be a first revision that
exhibits the problem. If we assume that nothing else broke in the
course of the stream of revisions, then we can find that first revision
by binary search: Check out the release midway between the two we’re
trying to evaluate, and see if that release leaks storage the way p238
does. If the memory leak problem is already there, then it must have
been introduced somewhere in the series of revisions between 1.8.6-p111
and [midpoint]. If not, then it was introduced between [midpoint] and
1.8.6-p230. Either way, we’ve now bracketed it to within an interval
half the original size. If we skip to the midpoint of that interval,
we can cut it in half again, and proceed to the one patch that introduced the problem. (The git suite of tools can actually automate a lot of this, if you happen to have the relevant history in a local git repo; see “git bisect”.)
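For concreteness, here’s roughly what a test script for “git bisect run” might look like. This is a sketch only: the build commands, the hypothetical leak.rb workload (something boiled down from Igal’s recipe, presumably), the timings, and the “memory doubled” threshold are all placeholders, not things I’ve verified:

  #!/usr/bin/env ruby
  # Sketch of a "git bisect run" test script: exit 0 if this revision looks
  # good, 1 if it leaks, 125 if it can't be tested (bisect then skips it).
  def sh(cmd)
    system(cmd) or exit 125     # couldn't even build -- tell bisect to skip
  end

  sh "autoconf"                 # checkouts of the 1.8 tree ship no configure
  sh "./configure --prefix=#{Dir.pwd}/inst > /dev/null 2>&1"
  sh "make > /dev/null 2>&1 && make install > /dev/null 2>&1"

  # Run a leak-provoking workload (leak.rb is a placeholder) and sample the
  # interpreter's resident set size as it goes.
  pid = fork { exec("#{Dir.pwd}/inst/bin/ruby", "leak.rb") }
  rss = []
  20.times do
    sleep 30
    rss << `ps -o rss= -p #{pid}`.to_i    # RSS in kilobytes
  end
  Process.kill("TERM", pid)
  Process.wait(pid)

  # Call it a leak if memory roughly doubled over the run -- crude, but it
  # separates "grows without bound" from "settles at a plateau".
  exit(rss.last > 2 * rss.first ? 1 : 0)

You’d then say something like “git bisect start <bad-rev> <good-rev>” followed by “git bisect run ./leak-test.rb” (the script name is mine) and let it grind -- modulo the complication below.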
In this case, of course, there’s one complication: something else did break in the middle -- in particular, the segfaults that start happening at revision 17222. So, if you get to (or past) that point in the history, you’d have to revert that particular change to look for the storage leak.
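In a bisect-run script like the sketch above, one way to handle that (assuming the history was imported with git-svn, so each commit message carries its original svn revision number, and that you’ve already generated the revert patch from the earlier sketch) would be something along these lines:

  SEGV_REV = 17222   # where the segfaults start, per the above

  # git-svn appends a "git-svn-id: ...@NNNN ..." line to each commit message;
  # fish the svn revision number of the current checkout out of it.
  svn_rev = `git log -1`[/git-svn-id: \S+@(\d+)/, 1].to_i

  if svn_rev >= SEGV_REV
    # This checkout already contains the segfault change; undo it before
    # building, so crashes don't get mistaken for (or mask) the leak.
    system("patch -R -p0 < ../revert-#{SEGV_REV}.patch") or exit 125
  end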
But if anyone out there can keep track of that complication, and has time to evaluate a dozen or so ruby-1.8.6 revisions to see if they have this problem, that’s enough to identify the particular change which first introduced the problem. It doesn’t need to be that many, I believe, but I’m allowing for screwups.
(BTW, it’s conceivable that there’s a subsequent patch which also introduced a leak, or depends in some other way on the one that did -- in which case, you’ll have to root around in the subsequent history to get a more complete picture. But the most likely thing is that if you take the patch you find, and you revert it against 1.8.6-p238, you’ll find the problem goes away.)
One last note: this quite likely gives you something that you can apply to p238 to eliminate the problem. It does not give the Japanese maintenance team a simple, self-contained piece of code which replicates the problem -- and if they’re reluctant to deal with bug reports that can only be replicated by installing tens of thousands of lines of other people’s code, some with native code extensions, I do have a certain degree of sympathy. For the segfault problem, I was eventually able to produce a short, self-contained script which also blew up -- see the separate thread here:
http://www.ruby-forum.com/topic/157617
No rocket science there either. I’d identified the revision 17222
patch as a likely culprit, and also figured out what operations were
affected by that patch. So, I wrote a simple script with a long loop,
kept stuffing it with those operations until I saw segfaults, and
then eliminated irrelevant lines from that script until I had the smallest thing I could get which still blew up. (Again, there are no
absolute guarantees here: the suspect patch could conceivably
introduce more than one leak, and a minimal script would replicate
only one. Which is why I said it “may be the simplest demonstration” in the post linked above. But while this isn’t guaranteed to find the problem, it’s generally likely to.)
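To give a flavor of it -- and this is emphatically not the script from the thread above; the operations in the loop are placeholders for whatever the suspect patch actually touches -- the shape is something like:

  # Illustrative skeleton only.  Start with a long loop stuffed full of
  # candidate operations, then keep deleting lines until removing anything
  # more makes the crash (or leak) stop reproducing.
  100_000.times do |i|
    s = "some string #{i}"      # placeholder: allocate objects of the suspect kind
    t = s.dup                   # placeholder: exercise operations the patch touched
    t << s
    GC.start if i % 1000 == 0   # optional: lean on the collector to surface trouble sooner
  end
  puts "survived"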
So, ahem… why am I not doing this? Well, for one thing, I don’t
use mongrel, so I can’t follow Igal’s replication recipe exactly —
and for another, I’m really pressed for time this week. I may
have time soon, but I’d be happier if someone else gets there
first ;-).
HTH,
rst@{ai,alum}.mit.edu