Hey everyone,
Tracking down this problem was NOT easy, but here we go.
I created a small program that made some basic embedded API calls,
mirroring
the functionality I was trying to add into E.V.E. Paradox. It worked
perfectly, suggesting the problem was not Ruby.
I then took the existing E.V.E. Paradox code and tried to add traces and
extra
calls in to find the problem. Basically I added calls that repeatedly
called
rb_funcall at various points. Basically at various points in my code it
was
possible to call rb_funcall successfully, branch off into other code,
then
make the exact same call again and have it crash.
Tracking it down further was a nightmare. Basically the point of the bug
moved
as extra calls were added to perform traces and call rb_funcall. The
probability of a crash tended towards zero in any region of code as more
traces and calls were added. Thus as I got closer to the problem, it
would
move somewhere else. It suggested some kind of stack dependency from
Ruby or
corruption being caused in my code or the libraries I was using.
For a while I thought the cause was a poor interaction between Ruby and
freetype, as it most commonly crashed after calls relating to font
manipulation when inside that library.
However, I was finally able to create some isolated code that crashed
with
high frequency. So it’s not E.V.E. Paradox or freetype. I did this by
writing
a function that simulates API calls being made in the context of a
larger
program through controlled recursive calls and alloca. Basically the
call
wraps various Ruby interactions inside and alongside repeated adding and
removing of stack frames through nested calls.
This code is included below (NB: Linux only).
Interestingly enough, I haven’t seen it crash yet when profiling is
disabled.
Only when it is enabled. Is there some negative interaction between the
profiling code in the gcc 4.* series and Ruby? Anyway, I thought it
interesting.
As for running the code, simply build it with make (you may need to fix
the
path to Ruby) and run it in one of these four modes:
./foo 0
This will run a simple set of code that outputs an series of annoying
traces. No stack trickery is done. It works.
./foo 1
This will run a simple set of code that calls an embedded Ruby script,
which calls a callback in the code. It works.
./foo 2
This performs the same traces but uses “messy” to fire them off within
differing stack contexts. It is slow, but works.
./foo 3
This performs the same set of Ruby calls, but uses “messy” to fire them
off
within differing stack contexts. It is slow, and eventually crashes.
Thus Embedded Ruby works fine in simple code, but put it in the context
of
a larger body and it has the potential to crash.
At some point I will have a play around with different versions of Ruby
to
see if they also trigger this crash, and if I can work around the
problem
this way.
Anyway, what I’d like to ask of the ordinary Ruby users like me and of
the
Ruby Gods is:
- Does the test code look okay? Did I make any mistakes? I’m only human,
and
the crash test isn’t trivial.
- Does “./foo 3” also crash on your machine? Or is it just mine?
- Are there any Ruby Gods for whom the code crashes with the skill to
diagnose the problem with Ruby itself? I fear this may be beyond me
(but
I’ll be giving it a shot).
Other info:
- The rest of E.V.E. Paradox and the multiple external libraries work
fine
with profiling enabled. IF it is a compiler/profiler bug, it
affects Ruby
only.
- My development machine is pretty stable, quite able to perform
days-long
tests and compiles consistently- although I have to admit the DVD
tray
mechanism is a bit touchy. This isn’t a memory or solar flare thing.
- I use alloca to simulate calls to functions with different stack depth
requirements. If you object to alloca, please note that mode 2 of the
test program works perfectly fine- this is heavy alloca use without
Ruby
calls. It crashes only when we add Ruby calls to the mix.
- As noted, I think it only happens if profiling is enabled. I can’t
guarantee
it doesn’t crash without, I just haven’t seen it do so.
- “Don’t enable profiling” isn’t really an acceptable workaround when it
works fine for the rest of the large body of code and external
libraries
and my development build includes it by default. I really don’t want
to
have to rebuild the whole project when I’m after timing
information…
Besides, “does not work with profiling” appears to be a significant
bug.
Anyway, this is enormously frustrating, as after all of this struggle
I’ve
been able to put together a proof-of-concept that I can use the
embedded
Ruby interface in my code, but I cannot actually keep it due to the
stability issues. :(
I am available for questions, further information, and testing should
anyone
with the requisite knowledge to diagnose the problem need my input.
Please
feel free to drop me an email if I can help (see the contact page at
www.entropicsoftware.com for my direct email).
Further detailed information below. Apologies in advance for the huge
post,
but there is a lot of relevant detail to convey.
Garth
============================================================================
== Contents of foo.cpp:
#include <ruby.h>
#include
#include
#include
#include
#define MYRAND(r) (rand()%r)
using namespace std;
static VALUE obj;
typedef void (MessyFunc)(void *);
typedef VALUE (*RubyGenericFunc)(…);
static int mode = 0;
static bool use_messy;
static bool use_ruby;
static VALUE RubyP(int argc, VALUE *argv, VALUE self)
{
VALUE str = rb_str_new(“”, 0);
for (int i=0; i<argc; i++)
{
if (i > 0)
rb_str_cat(str, ", ", 2);
rb_str_concat(str, rb_inspect(argv[i]));
}
cerr << "Ruby P: " << RSTRING(str)->ptr << “\n”;
return Qnil;
}
static VALUE RubySafeGCX(VALUE a)
{
rb_gc();
return Qnil;
}
static VALUE RubySafeGC()
{
int status;
VALUE rv = rb_protect(RubySafeGCX, Qnil, &status);
if (status)
{
VALUE einfo = rb_obj_as_string(ruby_errinfo);
cerr << "Ruby reported error: " << status
<< " " << RSTRING(einfo)->ptr << “.\n”;
}
return rv;
}
static void RubySafeGCZ(void *t)
{
RubySafeGC();
}
static void InitRuby(void *t)
{
ruby_init();
RubySafeGC();
}
static void InitRuby2(void *t)
{
rb_define_global_function(“p”, (RubyGenericFunc)RubyP, -1);
}
static void InitRuby3(void *t)
{
int status;
rb_eval_string_protect(
“class Foo\n”
" def initialize(z)\n"
" @a = z\n"
" end\n"
" def update(v)\n"
" p(v)\n"
" end\n"
“end\n”
, &status);
if (status)
{
VALUE einfo = rb_obj_as_string(ruby_errinfo);
cerr << "Ruby reported error: " << status
<< " " << RSTRING(einfo)->ptr << “.\n”;
}
}
static void InitRuby4(void *t)
{
VALUE c_foo = rb_const_get(rb_cObject, rb_intern(“Foo”));
VALUE c_bar = INT2NUM(555);
obj = rb_funcall(c_foo, rb_intern(“new”), 1, c_bar);
}
static void update(int i)
{
rb_funcall(obj, rb_intern(“update”), 1, INT2NUM(i));
}
static void update2(void t)
{
update(((int *)t));
}
static void badger(int i)
{
cerr << (((i % 6) < 4) ? "Badger " : “Mushroom”) << " " << i <<
“\n”;
}
static void badger2(void t)
{
badger(((int *)t));
}
static void Nothing(void *t)
{
}
static void messy2(MessyFunc t, void *td, int depth, bool &done)
{
// Add random crap to the stack that will be checked and reclaimed at
// the end.
int ss = MYRAND(40)*4+8;
int mv = MYRAND(256);
unsigned char *data = (unsigned char *)alloca(ss);
memset(data, mv, ss);
for (int i=0; i<ss; i++)
{
if (data[i] != mv)
{
cerr << “Stack corrupted immediately.\n”;
cerr << (int)(data[i]) << " " << mv << “\n”;
exit(1);
}
}
//cerr << "D: " << depth << “\n”;
if (depth < 0)
return;
for (int i=0; i<(MYRAND(3)+1); i++)
messy2(t, td, depth-1, done);
if ((depth == 0) && (!done) && (t))
{
(*t)(td);
done = true;
}
for (int i=0; i<(MYRAND(3)+1); i++)
messy2(t, td, depth-1, done);
for (int i=0; i<ss; i++)
if (data[i] != mv)
{
cerr << “Stack was corrupted.\n”;
cerr << (int)(data[i]) << " " << mv << “\n”;
exit(1);
}
}
void messy(MessyFunc t, void *td = NULL)
{
int depth = MYRAND(5)+6;
bool done = false;
if (use_messy)
messy2(t, td, depth, done);
else
{
if (t)
(*t)(td);
}
}
int main(int argc, char *argv[])
{
if (argc != 2)
{
cerr << “Usage: foo \n”;
cerr << " 0 - Simple Badger.\n";
cerr << " 1 - Simple Ruby.\n";
cerr << " 2 - Badger with stack use.\n";
cerr << " 3 - Ruby with stack use.\n";
exit(1);
}
mode = atoi(argv[1]);
use_messy = (mode & 2);
use_ruby = (mode & 1);
cerr << "Mode " << mode
<< ". Messy " << (use_messy ? “enabled” : “disabled”)
<< ". Ruby " << (use_ruby ? “enabled” : “disabled”)
<< “.\n”;
cerr << “Empty test…\n”;
messy(Nothing);
// Initialise.
cerr << “Initialising…\n”;
if (use_ruby)
{
messy(InitRuby);
messy(RubySafeGCZ);
messy(InitRuby2);
messy(RubySafeGCZ);
messy(InitRuby3);
messy(RubySafeGCZ);
messy(InitRuby4);
messy(RubySafeGCZ);
}
// Test over and over.
cerr << “Thrashing…\n”;
for (int i=0; i<10000; i++)
{
if (use_ruby)
{
messy(RubySafeGCZ);
messy(update2, &i);
messy(RubySafeGCZ);
}
else
messy(badger2, &i);
}
// Clean up.
cerr << “Done thrashing.\n”;
if (use_ruby)
{
messy(RubySafeGCZ);
}
cerr << “Everything was okay.\n”;
}
============================================================================
== Contents of Makefile:
foo: foo.o
g++ -Wall -pg -g -o foo foo.o -L/packages/ruby/lib -lruby
foo.o: foo.cpp
g++ -Wall -pg -g -c -o foo.o foo.cpp
-I/packages/ruby/lib/ruby/1.8/i686-linux
clean:
rm -f foo.o foo *~
============================================================================
== Extra Notes
uname -a
Linux notimportant 2.6.18-1.2798.fc6 #1 SMP Mon Oct 16 14:54:20 EDT 2006
i686
athlon i386 GNU/Linux
cat /etc/fedora-release
Fedora Core release 6 (Zod)
ldd foo | grep ruby
libruby.so.1.8 => /packages/ruby/lib/libruby.so.1.8
(0x00110000)
ls -la /packages/ruby/lib/libruby.so*
lrwxrwxrwx 1 root root 16 Jan 15 23:19
/packages/ruby/lib/libruby.so →
libruby.so.1.8.5
lrwxrwxrwx 1 root root 16 Jan 15 23:19
/packages/ruby/lib/libruby.so.1.8 →
libruby.so.1.8.5
-rwxr-xr-x 1 root root 1897472 Jan 15 23:18
/packages/ruby/lib/libruby.so.1.8.5
ls -la /usr/lib/libruby.so*
ls: No match.
echo $LD_LIBRARY_PATH | grep /packages/ruby/lib | wc -l
1
g++ -v
Using built-in specs.
Target: i386-redhat-linux
Configured with: …/configure --prefix=/usr --mandir=/usr/share/man
–infodir=/usr/share/info --enable-shared --enable-threads=posix
–enable-checking=release --with-system-zlib --enable-__cxa_atexit
–disable-libunwind-exceptions --enable-libgcj-multifile
–enable-languages=c,c++,objc,obj-c++,java,fortran,ada
–enable-java-awt=gtk
–disable-dssi --enable-plugin
–with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre
–with-cpu=generic
–host=i386-redhat-linux
Thread model: posix
gcc version 4.1.1 20061011 (Red Hat 4.1.1-30)
ruby -v
ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-linux]
##########################
Ruby was built from source with:
tar xfz …/ruby-1.8.5-p12.tar.gz
cd ruby-1.8.5-p12 || exit 1
./configure --enable-shared --prefix=/packages/ruby-1.8.5-p12 || exit 1
make || exit 1
make install || exit 1
Also note:
ls -lad /packages/ruby
lrwxrwxrwx 1 root root 14 Jan 15 23:19 /packages/ruby → ruby-1.8.5-p12
##########################
make
g++ -Wall -pg -g -c -o foo.o foo.cpp
-I/packages/ruby/lib/ruby/1.8/i686-linux
g++ -Wall -pg -g -o foo foo.o -L/packages/ruby/lib -lruby
./foo 3
Mode 3. Messy enabled. Ruby enabled.
Empty test…
Initialising…
Thrashing…
Ruby P: 0
Ruby P: 1
[ … lines removed … ]
Ruby P: 124
Ruby P: 125
(eval):7: [BUG] Segmentation fault
ruby 1.8.5 (2006-12-25) [i686-linux]
Abort
##########################
gdb ./foo
GNU gdb Red Hat Linux (6.5-8.fc6rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type “show copying” to see the conditions.
There is absolutely no warranty for GDB. Type “show warranty” for
details.
This GDB was configured as “i386-redhat-linux-gnu”…Using host
libthread_db
library “/lib/libthread_db.so.1”.
(gdb) set args 3
(gdb) run
Starting program: /home/garthy/dev/test/ruby/foo 3
Mode 3. Messy enabled. Ruby enabled.
Empty test…
Initialising…
Thrashing…
Ruby P: 0
Ruby P: 1
[ … lines removed … ]
Ruby P: 124
Ruby P: 125
Program received signal SIGSEGV, Segmentation fault.
0x004935fe in st_lookup (table=0x0, key=6905, value=0xbfd83c08) at
st.c:250
250 hash_val = do_hash(key, table);
Current language: auto; currently c
(gdb) bt
#0 0x004935fe in st_lookup (table=0x0, key=6905, value=0xbfd83c08) at
st.c:250
#1 0x0042aa7c in search_method (klass=3085892280, id=6905,
origin=0xbfd83c38)
at eval.c:480
#2 0x0042aadd in rb_get_method_body (klassp=0xbfd83c68, idp=0xbfd83c74,
noexp=0xbfd83c78) at eval.c:501
#3 0x00434263 in rb_call (klass=3085892280, recv=3085892320, mid=6905,
argc=1, argv=0xbfd83ca0, scope=1) at eval.c:6024
#4 0x00434866 in vafuncall (recv=3085892320, mid=6905, n=1,
ar=0xbfd83d04)
at eval.c:6125
#5 0x004349d0 in rb_funcall (recv=3085892320, mid=6905, n=1) at
eval.c:6142
#6 0x08048ce3 in update (i=126) at foo.cpp:101
#7 0x08048d02 in update2 (t=0xbfd8440c) at foo.cpp:106
#8 0x080491bd in messy2 (t=0x8048cea , td=0xbfd8440c, depth=0,
done=@0xbfd843cf) at foo.cpp:150
#9 0x08049151 in messy2 (t=0x8048cea , td=0xbfd8440c, depth=1,
done=@0xbfd843cf) at foo.cpp:147
#10 0x08049151 in messy2 (t=0x8048cea , td=0xbfd8440c, depth=2,
done=@0xbfd843cf) at foo.cpp:147
#11 0x08049151 in messy2 (t=0x8048cea , td=0xbfd8440c, depth=3,
done=@0xbfd843cf) at foo.cpp:147
#12 0x08049151 in messy2 (t=0x8048cea , td=0xbfd8440c, depth=4,
done=@0xbfd843cf) at foo.cpp:147
#13 0x08049151 in messy2 (t=0x8048cea , td=0xbfd8440c, depth=5,
done=@0xbfd843cf) at foo.cpp:147
#14 0x08049151 in messy2 (t=0x8048cea , td=0xbfd8440c, depth=6,
done=@0xbfd843cf) at foo.cpp:147
#15 0x08049151 in messy2 (t=0x8048cea , td=0xbfd8440c, depth=7,
done=@0xbfd843cf) at foo.cpp:147
#16 0x0804934b in messy (t=0x8048cea , td=0xbfd8440c) at
foo.cpp:171
#17 0x08049611 in main (argc=2, argv=0xbfd844b4) at foo.cpp:223