JRuby 1.7.0 with invokedynamic.all=true on HotSpot not faster than 1.6.8?

Hi,

I’m testing JRuby 1.7.0 and comparing some performance results with
JRuby 1.6.8. I cannot see any performance improvement with
JRuby 1.7.0, even with invokedynamic.all=true enabled.

I’m probably doing something wrong, because invokedynamic.all=true should
give a performance boost. Does anybody have an idea where to check for
possible problems? Do I need any special flags for HotSpot Java 7?

Here is my setup:
$ uname -a
Linux Ubuntu-1204-precise-64-minimal 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04 LTS
Release: 12.04
Codename: precise

$ cat ~/.jrubyrc
compat.version=1.8
backtrace.mask=true
backtrace.style=mri
invokedynamic.all=true

$ java -version
java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

$ jruby -v
jruby 1.7.0.RC2 (ruby-1.8.7p370) 2012-10-13 3b90805 on Java HotSpot(TM) 64-Bit Server VM 1.7.0_07-b10 [linux-amd64]

I also tried adding the following command-line option, with no luck:
-J-XX:CompileCommand=dontinline,org.jruby.runtime.invokedynamic.InvokeDynamicSupport::invocationFallback

Do I need any special flags for HotSpot Java 7?

Thanks,
kbranko

Hello!

On Mon, Oct 22, 2012 at 2:39 PM, Kristy Branko wrote:

I’m testing jruby 1.7.0 and comparing some performance results with jruby on
1.6.8. It seems like I cannot see any performance improvements with jruby
1.7.0 even if I enable invokedynamic.all=true.

The flag you are looking for is compile.invokedynamic=true. On Java 7,
invokedynamic is disabled by default because of bugs in the implementation
currently available in OpenJDK/Oracle JDK 7. -Xcompile.invokedynamic=true (or
the equivalent line in .jrubyrc) will turn it on.

Please do let us know how the performance looks once you have it
enabled, and feel free to file issues if anything is slower with
invokedynamic enabled.

- Charlie

Hi,

On Darwin Kernel Version 12.2.1: root:xnu-2050.20.9~2/RELEASE_X86_64
x86_64, I see that JRuby with -Xcompile.invokedynamic=false is
significantly faster than with -Xcompile.invokedynamic=true for my
trivial test program. Why is this so?

I am a ruby/jruby noob but I have a decent understanding of JVM
internals. Please excuse me if this slowdown is due to a noob
error/oversight.

jruby build

[Karthik:~/test/jruby]jruby --version
jruby 1.7.2 (1.9.3p327) 2013-01-04 302c706 on Java HotSpot(TM) 64-Bit Server VM 1.7.0_11-b21 [darwin-x86_64]

Simple program

[Karthik:~/test/jruby]cat dispatch.rb

def twoArg(a,b)
  a + b
end

$i = 0
$limit = 100000
$sum = 0

while $i < $limit
  $sum = $sum + twoArg(5,6)
  $i = $i + 1
end
puts("done #{$sum}")

Results

[Karthik:~/test/jruby]time jruby -Xcompile.invokedynamic=false dispatch.rb
done 1100000

real 0m2.042s
user 0m3.560s
sys 0m0.164s

[Karthik:~/test/jruby]time jruby -Xcompile.invokedynamic=true dispatch.rb
done 1100000

real 0m9.721s
user 0m16.963s
sys 0m0.542s

Karthik -

I have no idea, but is it possible invokedynamic involves some startup
cost that is responsible for this? You could try with just one
iteration and see what the difference is.

One thing I learned today is that 100,000 iterations is not enough to
accurately test JVM performance. I’d say if it doesn’t take minutes, a
test isn’t long enough. ;)

Also, some Ruby style tips:

  • Is there a reason you’re using global variables (with ‘$’)? They’re
    frowned upon, unless there is a compelling reason to use them.

  • One of the greatest things about Ruby is the enumerable functions and
    the improved looping syntax. For the kind of thing you’re doing, it
    could be done more cleanly like this, without the need for the ‘i’
    variable:

limit.times { sum += twoArg(5,6) }

  • Ruby convention is to use “snake_case” for method names instead of
    “camelCase”: two_arg instead of twoArg (though ‘add’ would be a good
    name too).
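
Taken together, the three tips above might look like the following minimal sketch. The names two_arg, limit, and sum are illustrative renames of the identifiers in dispatch.rb, not code from the thread, and the iteration count is simply the one the original script used.

```ruby
# Style-only rewrite of dispatch.rb: snake_case method name,
# local variables instead of globals, and Integer#times instead
# of a hand-maintained counter.
def two_arg(a, b)
  a + b
end

limit = 100_000
sum = 0

# The block closes over the local `sum`, so no globals are needed.
limit.times { sum += two_arg(5, 6) }

puts "done #{sum}"  # prints "done 1100000"
```

This is about readability only; it says nothing about which loop form is the better choice inside a micro-benchmark.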

Regards,
Keith


Keith R. Bennett

Basically what Keith said: your benchmark is faulty. It doesn’t do
warmup, the iterations are too few, it doesn’t do multiple runs,
and accessing globals like that is slow.

Here is a somewhat better version.

require 'benchmark'

def twoArg(a,b)
  a + b
end

10.times {
  puts Benchmark.measure {
    i = 0
    limit = 10000000
    sum = 0
    while i < limit
      sum += twoArg(5, 6)
      i += 1
    end
  }
}

On 22 January 2013 18:11, Keith B. wrote:

  • One of the greatest things about Ruby is the enumerable functions and the
    improved looping syntax. For the kind of thing you’re doing, it could be done
    more cleanly like this, without the need for the ‘i’ variable:

limit.times { sum += twoArg(5,6) }

It’s nice syntax, but the block dispatch overhead starts to distort
benchmarks. In some of the benchmarks I do, I use a partially
unrolled while loop (e.g. doing 4 method invocations per loop
iteration), because even the i += 1 can add up: i += 1 causes a new
Fixnum to be created each iteration. If the method you are benching
does something trivial, such as adding two small integers (which hits
JRuby’s fixnum cache), then the loop counter increment is actually the
major overhead in each loop.
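
A partially unrolled loop of the kind described here could be sketched as follows. This is an illustration, not the benchmark actually used: the method name two_arg, the total of 1,000,000 calls, and the unroll factor of 4 are all assumptions chosen for the example.

```ruby
require 'benchmark'

def two_arg(a, b)
  a + b
end

calls = 1_000_000  # hypothetical total; must be a multiple of the unroll factor (4)
sum = 0

elapsed = Benchmark.realtime do
  i = 0
  while i < calls
    # Four invocations per iteration, so the counter is compared and
    # incremented only once for every four method calls.
    sum += two_arg(5, 6)
    sum += two_arg(5, 6)
    sum += two_arg(5, 6)
    sum += two_arg(5, 6)
    i += 4
  end
end

puts "#{calls} calls in #{elapsed} s, checksum #{sum}"
```

The checksum is printed so the calls have an observable result; as with the other benchmarks in this thread, the measurement should be repeated several times to allow for JIT warmup.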

I used the test case posted by Wayne, and I see that invokedynamic does
speed things up a bit for his test case.

I ran an experiment to see what happens if the same code is changed to
use globals. I observed the following:

a) With or without invokedynamic, code operating on locals is
significantly faster than code operating on globals.

b) While invokedynamic improves the code with locals, the code with
globals becomes very slow (or hangs?) with invokedynamic.

What is the reason for a) and b)?

The following code snippets and results should be self-explanatory.

==============================with locals==============================
[Karthik:~/test/jruby]cat loop_locals.rb
require 'benchmark'
def twoArg(a,b)
  a + b
end

limit = 0
sum = 0
i = 0

10.times {
  puts Benchmark.measure {
    i = 0
    limit = 10000000
    while i < limit
      sum = sum + twoArg(5,6)
      i = i + 1
    end
  }
}
puts("done. loop count: #{limit} checksum: #{sum}")
[Karthik:~/test/jruby]time jruby -Xcompile.invokedynamic=false loop_locals.rb
0.960000 0.050000 1.010000 ( 0.636000)
0.420000 0.010000 0.430000 ( 0.387000)
0.410000 0.000000 0.410000 ( 0.405000)
0.410000 0.000000 0.410000 ( 0.404000)
0.410000 0.000000 0.410000 ( 0.409000)
0.400000 0.000000 0.400000 ( 0.401000)
0.400000 0.010000 0.410000 ( 0.405000)
0.410000 0.000000 0.410000 ( 0.403000)
0.420000 0.000000 0.420000 ( 0.410000)
0.400000 0.010000 0.410000 ( 0.410000)
done. loop count: 10000000 checksum: 1100000000

real 0m6.214s
user 0m7.891s
sys 0m0.247s
[Karthik:~/test/jruby]time jruby -Xcompile.invokedynamic=true loop_locals.rb
0.760000 0.050000 0.810000 ( 0.531000)
0.320000 0.010000 0.330000 ( 0.291000)
0.310000 0.000000 0.310000 ( 0.303000)
0.310000 0.000000 0.310000 ( 0.309000)
0.350000 0.000000 0.350000 ( 0.347000)
0.400000 0.000000 0.400000 ( 0.393000)
0.370000 0.010000 0.380000 ( 0.369000)
0.370000 0.000000 0.370000 ( 0.368000)
0.410000 0.000000 0.410000 ( 0.414000)
0.390000 0.000000 0.390000 ( 0.388000)
done. loop count: 10000000 checksum: 1100000000

real 0m5.667s
user 0m7.420s
sys 0m0.251s

==============================with globals==============================

[Karthik:~/test/jruby]cat loop_globals.rb
require 'benchmark'
def twoArg(a,b)
  a + b
end

$limit = 0
$sum = 0
$i = 0

10.times {
  puts Benchmark.measure {
    $i = 0
    $limit = 10000000
    while $i < $limit
      $sum = $sum + twoArg(5,6)
      $i = $i + 1
    end
  }
}
puts("done. loop count: #{$limit} checksum: #{$sum}")
[Karthik:~/test/jruby]time jruby -Xcompile.invokedynamic=false loop_globals.rb
2.550000 0.070000 2.620000 ( 2.210000)
2.020000 0.010000 2.030000 ( 1.978000)
1.990000 0.000000 1.990000 ( 1.989000)
2.000000 0.010000 2.010000 ( 1.991000)
2.070000 0.010000 2.080000 ( 2.062000)
2.040000 0.000000 2.040000 ( 2.023000)
2.020000 0.010000 2.030000 ( 2.011000)
2.020000 0.000000 2.020000 ( 2.007000)
2.310000 0.010000 2.320000 ( 2.313000)
2.340000 0.010000 2.350000 ( 2.330000)
done. loop count: 10000000 checksum: 1100000000

real 0m22.756s
user 0m24.597s
sys 0m0.296s

[Karthik:~/test/jruby]time jruby -Xcompile.invokedynamic=true loop_globals.rb
^C
real 2m4.312s
user 2m26.083s
sys 0m0.851s
[Karthik:~/test/jruby]
[[ Killed the last test run. It does not complete even a single
iteration in about 6x the time in which the previous run (without
invokedynamic) finished 10 iterations. ]]

On 24 January 2013 11:37, Keith B. wrote:

Wayne -

I hadn’t thought about the block dispatch overhead. I guess raw looping would
always be faster than any kind of function/block call – some kind of goto or jump
call in byte code with no need to push or pop parameters to/from the stack I
presume.

However, if one is testing different strategies, the relationship of the
performance results should be accurate (which is greater than which), shouldn’t it?
It’s just that the ratios would not be accurate, right? That is, if the .times
iteration overhead took 0.5 seconds, and one strategy produced 1.0 second and another
2.0, then the latter would really be 3 times slower, (2.0 - 0.5 = 1.5) / (1.0 -
0.5 = 0.5), rather than the 2x indicated by the numbers (2.0 / 1.0).

The problem comes when e.g. block dispatch overhead = 0.9 seconds, and
the method under test takes 0.1 seconds. You start to lose
resolution. And you need to benchmark empty block dispatch and take
that away from the result … and it’s just easier to use a partially
unrolled while loop.
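
The overhead-subtraction arithmetic from the quoted question can be written out as a small sketch. The helper name corrected_ratio is made up for this illustration, and the timings are the hypothetical figures from the question, not measurements.

```ruby
# Ratio of two timings after subtracting a fixed loop overhead from each.
# All arguments are in seconds.
def corrected_ratio(slow, fast, overhead)
  (slow - overhead) / (fast - overhead)
end

raw_ratio      = 2.0 / 1.0                      # what the raw numbers suggest
adjusted_ratio = corrected_ratio(2.0, 1.0, 0.5) # (2.0 - 0.5) / (1.0 - 0.5)

puts "raw: #{raw_ratio}x, adjusted: #{adjusted_ratio}x"  # prints "raw: 2.0x, adjusted: 3.0x"
```

When the overhead is 0.9 seconds of a 1.0-second total, the denominator shrinks to 0.1 seconds, so a small error in the measured overhead changes the adjusted ratio a great deal. That is the loss of resolution described above.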

Can you say more about the production of the intermediate fixnum using i += 1?
I thought it was just a syntactic convenience that would be treated identically to
the more verbose i = i + 1.

It is identical (afaik). Fixnums in JRuby are immutable objects - so
when you do i = i + 1, you’re doing something equivalent to:

i = new RubyFixnum(i.getLongValue() + 1)

So, every loop iteration does object allocation.
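
Seen from the Ruby side, this immutability shows up as rebinding: i += 1 points i at a different integer object rather than changing the old one. The snippet below is only an illustration of that visible behaviour (the variable name before is invented for it); it does not show the JVM-level allocation itself, which is internal to JRuby.

```ruby
i = 5
before = i   # a second reference to the same integer value

i += 1       # shorthand for i = i + 1: binds i to a new value

puts before  # prints 5, the original integer was not modified
puts i       # prints 6
```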

Wayne -

I hadn’t thought about the block dispatch overhead. I guess raw looping
would always be faster than any kind of function/block call – some kind
of goto or jump call in byte code with no need to push or pop parameters
to/from the stack I presume.

However, if one is testing different strategies, the relationship of the
performance results should be accurate (which is greater than which),
shouldn’t it? It’s just that the ratios would not be accurate, right?
That is, if the .times iteration overhead took 0.5 seconds, and one
strategy produced 1.0 second and another 2.0, then the latter would
really be 3 times slower, (2.0 - 0.5 = 1.5) / (1.0 - 0.5 = 0.5), rather
than the 2x indicated by the numbers (2.0 / 1.0).

Can you say more about the production of the intermediate fixnum using i
+= 1? I thought it was just a syntactic convenience that would be
treated identically to the more verbose i = i + 1.

Thanks,
Keith