Multi-threading benchmarks in JRuby

I had a tough time finding recent performance benchmarks for JRuby, so I
put together an analysis of multi-threaded performance (JRuby vs 1.9 vs
1.8) here:

Exercise:
Count to 1 million on 4 separate threads (running simultaneously); repeat 20
times and take the average.

JRuby code:

Ruby 1.8/1.9 code:
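
A minimal sketch along these lines (plain Thread API, so the same script
should run on JRuby, 1.9.3, and 1.8.7; a sketch, not necessarily the exact
code used):

require 'benchmark'

THREADS = 4
COUNT   = 1_000_000
REPEATS = 20

# One run: start the threads, let each count to a million, then join them all.
def run_once
  threads = Array.new(THREADS) do
    Thread.new do
      i = 0
      i += 1 while i < COUNT
    end
  end
  threads.each(&:join)
end

# Repeat and report the average wall-clock time in milliseconds.
total = 0.0
REPEATS.times { total += Benchmark.realtime { run_once } }
printf("average: %.1fms\n", total / REPEATS * 1000)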

Results (average of 20 runs):
JRuby(1.7.1): 199.3ms
Ruby(1.9.3p286): 610.0ms
Ruby(ree-1.8.7-2012.02): 748.6ms
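
(That works out to roughly a 3.1x speed-up over 1.9.3 and 3.8x over REE 1.8.7.)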

Does anyone else have some pointers to other benchmarks for JRuby >1.7?

Philip -

I’m not an expert on HotSpot optimization, but I think I remember
hearing that in cases like yours it’s possible for the optimizer to
notice that removing the loop body, and even the loop itself, would have
no effect on the running program, and to remove them.

So for benchmarks like these, I try to make it less likely for HotSpot
to be confident about that. One way is to introduce a side effect
(such as outputting something somewhere), but that affects the timing
measurements. Another way is to call a function from inside the loop.
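
For example, a sketch of the second approach (the helper method and the
final print are illustrative, not code from the bench above):

# Do the increment through a method call and consume the result afterwards,
# so it is harder for the JIT to treat the loop as dead code.
def bump(n)
  n + 1
end

def count_to(limit)
  i = 0
  i = bump(i) while i < limit
  i
end

result = count_to(1_000_000)
puts result   # side effect; keep it outside any timed region so it doesn't skew the numbers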

Anyone have any wisdom about this?

  • Keith

Keith R. Bennett

Correction: “as separate threads” should read “as separate graphs”.

On Wed, Feb 6, 2013 at 10:45 AM, Thomas E Enebo [email protected]
wrote:


At this point in time we do not inline that times block, so HotSpot
cannot realize it is a simple loop incrementing an unused local
variable (and thus potentially eliminate the loop and reduce the
variable assignment to a no-op). I think your advice about making sure
there is a side effect is sage, though.

The minor nit I have with this benchmark (and I think I am really just
expounding on Keith’s point) is that the actual work being done is not
as important as showing that the work is happening in parallel. If the
work performed per thread can be optimized a lot, then the result may
end up including other optimizations and not show the native-threading
benefit. Lay readers perhaps don’t care to see that performance benefit
isolated, but I think it may muddle the picture a bit.

[Sidebar: I was going to stop there, but I remembered that MRI 1.9.3
uses tagged pointers for Fixnum, so it does not box fixnums like we do.
If I change the fixnum to a float (MRI 2.0 has flonums, but 1.9.3 does
not), both implementations then allocate a full Ruby object, and MRI
performance drops by a factor of 2 on this bench while JRuby
performance stays the same. This is just an example of how optimization
of the per-thread work can improperly influence the apparent benefit of
parallelism.]
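
For illustration, the kind of change described (a sketch, not the exact
modification):

# Fixnum counter: tagged (unboxed) on MRI 1.9.3, boxed on JRuby.
i = 0
i += 1 while i < 1_000_000

# Float counter: 1.9.3 has no flonums, so both MRI and JRuby now allocate
# a full Ruby object per value, and MRI's edge on this bench shrinks.
f = 0.0
f += 1.0 while f < 1_000_000.0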

In an unrealistically perfect world (ignoring Amdahl’s law, which at 4
cores is not a big player anyway) we should see roughly linear scaling
up to the number of usable cores. Your bench showed nearly a 4x
speed-up on a 4-core machine, which is interesting since the JVM also
starts GC and a few other threads (maybe hyperthreading is helping
here?). If I were to suggest something, I would consider showing the
graphs of 1-4 threads as separate threads. If your bench is showing an
actual benefit from parallel execution, you should see mostly the same
result on JRuby and a linear slowdown on MRI. At least I think showing
the slowdown on the GIL side is more compelling (IMHO).
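
Something along these lines, for example (a sketch; the loop over
thread counts and the timing approach are assumptions, not the original
code):

require 'benchmark'

COUNT   = 1_000_000   # per-thread work, as in the original bench
REPEATS = 20

(1..4).each do |nthreads|
  total = 0.0
  REPEATS.times do
    total += Benchmark.realtime do
      threads = Array.new(nthreads) do
        Thread.new { i = 0; i += 1 while i < COUNT }
      end
      threads.each(&:join)
    end
  end
  # With a GIL the average should grow roughly linearly with nthreads;
  # on JRuby (up to the core count) it should stay roughly flat.
  printf("%d thread(s): %.1fms\n", nthreads, total / REPEATS * 1000)
end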

-Tom

blog: http://blog.enebo.com twitter: tom_enebo
mail: [email protected]