I need some help in collecting more information from a 32-bit Windows
JVM crash. Here are the first 60 or so lines of the resulting hs_err*
file.
It’s calling from Ruby → Java → C to make some COM calls. Note the
native frames listed in the pastie… every crash has the same offset
for the racob-x86.dll and, more often than not, the frame above it is
the same memory address (0xc6608a41).
Unfortunately, the code to reproduce this uses a proprietary Windows
program that I can’t distribute so getting other people to duplicate the
error is difficult/impossible. Does anyone have any suggestions on tools
or techniques I could use to narrow down the problem?
Please note that the code runs okay under MRI 1.9.2-p180. There are
threads in the Ruby code (WIN32OLE events fire on their own native
thread) so I am assuming there is some threading issue that MRI avoids
due to its GIL that JRuby cannot avoid. I have checked over my Ruby code
and verified that all events firing on other threads are serialized back
to a single “main” thread for processing.
I cannot run this code in production under MRI because it is too slow.
JRuby’s perf rocks on Windows, so I’d love to get this resolved. Thanks
in advance for any suggestions.
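For anyone following along, here is a minimal sketch of what I mean by serializing events back to a single "main" thread. It is not my production code: the EventPump class and its method names are made up for illustration, and plain Ruby threads stand in for the native WIN32OLE event threads.

```ruby
require 'thread'

# Hypothetical helper: collects events from any thread and hands them
# to a single consumer thread, one at a time.
class EventPump
  STOP = :__stop__

  def initialize
    @queue = Queue.new
  end

  # Safe to call from whatever thread the event fires on.
  def post(name, payload)
    @queue << [name, payload]
  end

  # Tells the consumer loop to finish once earlier events are handled.
  def stop
    @queue << [STOP, nil]
  end

  # Runs on the "main" thread; yields each event until stop is seen.
  def run
    loop do
      name, payload = @queue.pop
      break if name == STOP
      yield name, payload
    end
  end
end

pump = EventPump.new

# Simulated event threads (stand-ins for the WIN32OLE threads).
producers = 3.times.map { |i| Thread.new { pump.post(:tick, i) } }
producers.each(&:join)
pump.stop

handled = []
pump.run { |name, payload| handled << [name, payload, Thread.current] }
```

Every block passed to run executes on the thread that called run, regardless of which thread posted the event.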
Tweaking the filtering of ProcessMonitor has helped me narrow things
down in the past.
I followed its instructions for disassembling the DLL and tracking down
the offending offset. Here’s a pastie with a portion of the disassembled
DLL (the whole thing exceeds a gist size limit).
Take a look at line 1238 which corresponds to offset 10001BC4. Can
anyone tell me what C++ source code that maps back to? The racob source
is available here:
I have figured out the issue. I don’t know if it’s a bug in
jruby-win32ole or if it’s a limitation of COM/OLE. MRI doesn’t crash but
its threading model is quite different from JRuby.
If an application delivers events where the arguments are passed ByRef,
then that object must only be accessed from the Win32OLE thread that
dispatched the event. If the object is accessed from another thread, the
Ruby script will crash.
Events that pass arguments ByVal do not suffer from this limitation.
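One workaround consistent with the above is to copy whatever you need out of the ByRef argument while still on the dispatching thread, and let only the plain Ruby copy cross to other threads. The sketch below is an assumption-laden illustration: FakeByRefArg is a stand-in I wrote for this post (not a WIN32OLE or racob class), and it raises an exception where the real thing would crash the JVM.

```ruby
require 'thread'

# Stand-in for a COM object delivered ByRef. NOT a real WIN32OLE class.
# It simulates thread affinity by raising instead of crashing.
class FakeByRefArg
  def initialize(owner, fields)
    @owner = owner
    @fields = fields
  end

  def read(field)
    unless Thread.current.equal?(@owner)
      raise ThreadError, "accessed off the dispatching thread"
    end
    @fields.fetch(field)
  end
end

# Must be called on the dispatching thread: copies the requested fields
# into a frozen Hash of plain Ruby values.
def snapshot(arg, fields)
  fields.each_with_object({}) { |f, h| h[f] = arg.read(f) }.freeze
end

queue = Queue.new

# Simulated dispatching thread (hypothetical field names and values).
dispatcher = Thread.new do
  arg = FakeByRefArg.new(Thread.current, { :price => 101.5, :symbol => "ABC" })
  queue << snapshot(arg, [:price, :symbol]) # only the copy leaves this thread
end
dispatcher.join

copy = queue.pop # safe to use here; it never touches the original object
```

The cost is that any changes written back to the ByRef argument must also happen inside the event handler, before it returns.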
I tried to gin up an example using Explorer.exe, but none of its events
actually pass a complex object ByRef. It looks like JRuby will
translate simple types (TrueClass & FalseClass for Boolean, Fixnum for
Long, etc.) if that simple type is passed ByRef, so it doesn’t cause a
crash.
I’ll continue looking through the standard suite of Windows apps to see
if I can find a contender for writing an example that shows the crash.
When I do, I’ll also open up a bug report in JIRA.
If an application delivers events where the arguments are passed ByRef, then
that object must only be accessed from the Win32OLE thread that dispatched the
event. If the object is accessed from another thread, the Ruby script will crash.
I wonder if we can add some crash protection, even if that means we
actually disallow the invocation (which I think is a valid limitation
based on what little I know of COM). Perhaps we could record the thread
in a field and then compare the executing thread against that field, so
we can at least report a warning/error if you try calling from a
non-originating thread.
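To make the idea concrete, here is a rough sketch of such a guard written in Ruby. The real check would presumably live in the Java side of racob or jruby-win32ole; ThreadBoundProxy and WrongThreadError are names invented for this example, and a String stands in for the COM object.

```ruby
# Hypothetical error raised instead of letting the call reach COM.
class WrongThreadError < StandardError; end

# Wraps an object, remembers the thread that created the wrapper, and
# refuses forwarded calls from any other thread.
# Note: methods already defined on Object (to_s, inspect, ...) are not
# forwarded and therefore not guarded in this simple sketch.
class ThreadBoundProxy
  def initialize(target)
    @target = target
    @owner = Thread.current # the "originating" thread
  end

  def method_missing(name, *args, &block)
    unless Thread.current.equal?(@owner)
      raise WrongThreadError,
            "#{name} called from a thread other than the one that received the object"
    end
    @target.public_send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name, include_private) || super
  end
end

proxy = ThreadBoundProxy.new("payload") # String stands in for a COM object
proxy.length                            # allowed: same thread

Thread.new do
  begin
    proxy.length                        # rejected: different thread
  rescue WrongThreadError => e
    warn "blocked: #{e.message}"
  end
end.join
```

An exception here is much easier to diagnose than an hs_err file, though it only helps for calls that go through the wrapper.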
Events that pass arguments ByVal do not suffer from this limitation.
I tried to gin up an example using Explorer.exe, but none of its events actually pass a complex object ByRef. It looks like JRuby will translate simple types
(TrueClass & FalseClass for Boolean, Fixnum for Long, etc.) if that simple type is
passed ByRef, so it doesn’t cause a crash.
Can you think of any simple type where we do this and it will break?
I think this is probably the most radical change that Racob makes
relative to its Jacob ancestor.
I’ll continue looking through the standard suite of Windows apps to see if I can
find a contender for writing an example that shows the crash. When I do, I’ll also
open up a bug report in JIRA.
Interesting. If they are related, it’s inside the Java guts of the
jruby-win32ole gem.
I did not really follow this… racob (the Java guts part) calls into JNI
directly, and that is C/C++ code calling the MS libraries directly.
Could calling convention really come into play here? [BTW, there are
some portions of win32ole which use FFI via win32/registry… but that is
nowhere near your stack, unless it crashed it from a second thread?]