okay, after wrestling with rails, jruby and hadoop for a few days, i’m
finally able to use my rails app in the mapper class. the key sticking
point turned out to be the fact that rails searches for its
configuration by walking up the directory tree using File.dirname.
but when running from a jar, jruby’s RubyFile implementation refuses to
treat the “root” directory as a valid path entry. the easy solution was
to modify rubydoop’s package.rb to jar up my rails app code under a
/classes directory instead of at the top level. i then needed to add
/classes and /classes/lib to the $LOAD_PATH. (it has to be /classes
because of how hadoop sets up the classpath when the workers run the
job jar.)
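
for anyone following along, here is a rough sketch of the kind of upward search described above, plus the load path change. this is not rails' actual code: find_root_with_flag here is a simplified stand-in, the config.ru flag file is just an example, the temp directory only imitates the /classes layout, and the exact form the load path entries take inside the jar is an assumption.

```ruby
require 'tmpdir'
require 'fileutils'

# hypothetical helper (not rails' actual code): walk upward from start_dir
# until a directory containing `flag` is found; nil if the walk hits the top.
def find_root_with_flag(start_dir, flag)
  path = start_dir
  while path && File.directory?(path) && !File.exist?(File.join(path, flag))
    parent = File.dirname(path)
    # File.dirname("/") is "/", so an unchanged parent means we ran out of tree
    path = parent == path ? nil : parent
  end
  path if path && File.exist?(File.join(path, flag))
end

# stand-in for the jar layout: app code lives under classes/, not at the top
Dir.mktmpdir do |tmp|
  root = File.join(tmp, 'classes')
  FileUtils.mkdir_p(File.join(root, 'lib', 'myapp'))
  FileUtils.touch(File.join(root, 'config.ru'))
  found = find_root_with_flag(File.join(root, 'lib', 'myapp'), 'config.ru')
  puts found == root # the walk stops at classes/ before reaching the top
end

# load path entries named in the post; their exact form inside the jar
# is an assumption
%w[/classes /classes/lib].each do |dir|
  $LOAD_PATH.unshift(dir) unless $LOAD_PATH.include?(dir)
end
```

the point of the extra /classes level is that the search finds its flag file one directory below the top, so it never has to treat the root of the jar as a path entry.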
i’m attaching my patched-up version of package.rb, which is the only change
to rubydoop.rb i needed to make. if it looks interesting, let me know
and i can send you a pull request on github.
thanks for all your help.
Ilya K. wrote in post #1089906:
thanks, that helped me make progress. the thing i’m struggling with now
is that EMR does not seem to unjar my jar before running it, the way
my local hadoop does (i.e. everything gets loaded from a
/tmp/… directory that my jar gets uncompressed to locally). in EMR, i
get stacktraces like:
org.jruby.exceptions.RaiseException: (ENOENT) No such file or directory
jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!
at org.jruby.RubyFile.realpath(org/jruby/RubyFile.java:760)
at RUBY.realpath(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/lib/jruby-complete-1.7.1.jar!/META-INF/jruby.home/lib/ruby/1.9/pathname.rb:446)
at RUBY.find_root_with_flag(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/rails/engine.rb:637)
at RUBY.config(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/rails/engine.rb:511)
at org.jruby.RubyBasicObject.send(org/jruby/RubyBasicObject.java:1659)
at RUBY.config(jar:0)
at RUBY.Engine(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/cs-base/engine.rb:11)
from which it looks like EMR hadoop is loading directly from that jar
file, which makes any attempts to call Pathname.new().realpath (for
example) fail. some gems try to do that (to load their configs, etc).
i’m trying to figure out how to get EMR to stage my jar similarly to
what my local hadoop installation does.
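
to make the realpath problem concrete, here is a small sketch of how calling code might guard against it. it does not make EMR stage the jar and is not taken from rails or any gem: inside_jar? and resolve_path are hypothetical helpers, and the jar path in the example is made up.

```ruby
require 'pathname'

# hypothetical helper: does this look like a path inside a jar?
def inside_jar?(path)
  s = path.to_s
  s.start_with?('jar:') || s.include?('.jar!')
end

# hypothetical helper: only ask the real filesystem to resolve the path
# when the path can actually exist there
def resolve_path(path)
  return Pathname.new(path.to_s) if inside_jar?(path)
  Pathname.new(path.to_s).realpath
end

# made-up path in the same shape as the ones in the trace above
jar_path = 'jar:file:/some/job.jar!/lib/rails/engine.rb'
puts inside_jar?(jar_path)      # jar paths are passed through untouched
puts resolve_path('.').absolute? # ordinary paths still go through realpath
```

this only helps for code you control; a gem that calls realpath itself would still fail when loaded straight from the jar.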