Hi,
Since the Ferret site has been down for a while, I am reporting this bug
here so that it gets documented somewhere and people with more experience
with Ferret can comment.
I was hitting this crash easily in Sup, which uses Ferret for its
index[1]. After some investigation I’ve found the cause of the crash,
but I don’t know what the best behaviour for Ferret would be in this
case.
The crash happens when the IndexReader object that a lazy_doc was loaded
from gets closed. After the IndexReader is closed, trying to get a field
from the lazy_doc triggers a read from a closed and freed InputStream,
sometimes causing segfaults and sometimes causing spurious I/O errors.
I first saw the bug with Ferret 0.11.6, but I have also tested Ferret
from the git repository[2], and it happens there too.
Below is a simple script that triggers the crash:
========================================
# Example A
require "ferret"
p = "/tmp/ferret-test.#$$"
puts "Using #{p} as storage"
i = Ferret::Index::Index.new(:path => p)
i << { :body => "Loren ipsum dolor "*1000 }
doc = i[0]
# this will cause the IndexReader to be closed by Ferret::Index::Index
i << { :body => "another document" }
puts doc[:body]
It happens because writing to the Ferret index closes the IndexReader.
A simpler script that triggers the crash is:
========================================
# Example B
require "ferret"
p = "/tmp/ferret-test.#$$"
puts "Using #{p} as storage"
puts "Generating a simple index"
i = Ferret::Index::Index.new(:path => p)
i << { :body => "Loren ipsum dolor "*1000 }
i.close
puts "Closed it. Will reopen and use it"
i = Ferret::Index::IndexReader.new(p)
doc = i[0]
i.close
puts doc[:body]
I see two issues here:
The first one is the crash itself: what should happen to loaded lazy_docs
when an IndexReader is closed? The Lucene documentation[3] says an exception
may be thrown in these cases. The same behavior could be the proper fix
for Ferret in Example B, which can be considered invalid usage of the
IndexReader anyway.
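To make the "throw an exception" idea concrete, here is a rough pure-Ruby sketch of the behaviour I have in mind. None of these classes are Ferret code: SketchReader, SketchLazyDoc and ClosedReaderError are names I made up for illustration, and the real fix would have to live in Ferret's C extension. The point is only that a lazy document checks its reader before reading, so an unloaded field raises a normal Ruby exception instead of touching freed memory.

```ruby
# Hypothetical stand-ins for illustration only; these are NOT Ferret classes.
class ClosedReaderError < IOError; end

class SketchReader
  def initialize(docs)
    @docs = docs      # array of hashes, standing in for stored fields
    @closed = false
  end

  def closed?
    @closed
  end

  def close
    @closed = true
  end

  # Returns a lazy document that reads its fields on demand.
  def [](doc_id)
    SketchLazyDoc.new(self, doc_id)
  end

  # Called by lazy docs; refuses to read once the reader is closed.
  def read_field(doc_id, field)
    raise ClosedReaderError, "reader is closed" if closed?
    @docs.fetch(doc_id)[field]
  end
end

class SketchLazyDoc
  def initialize(reader, doc_id)
    @reader = reader
    @doc_id = doc_id
    @cache = {}
  end

  # Fields already loaded stay available; unloaded ones need a live reader.
  def [](field)
    @cache.fetch(field) { @cache[field] = @reader.read_field(@doc_id, field) }
  end
end

reader = SketchReader.new([{ :body => "Loren ipsum", :title => "t" }])
doc = reader[0]
doc[:title]          # loaded while the reader is still open
reader.close
begin
  doc[:body]         # never loaded, so this needs the (closed) reader
rescue ClosedReaderError => e
  puts "caught: #{e.message}"
end
```

With this behaviour Example B would fail with a rescuable error at the last line instead of a segfault, while fields that were already loaded before the close would keep working.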
The second issue is what the behaviour of Ferret::Index::Index should be
after writing to the index while documents are loaded (Example A). Should it
really invalidate all lazy_docs read from the index on every write? That
is the current behavior, because its IndexReader is always closed when
writing to the index, but I wonder whether it is really desired.
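One possible alternative for this second issue, sketched below in plain Ruby, is to reference-count the reader so that a write only drops the index's own reference and the old reader stays open while any lazy_doc still points at it. This is just a thought experiment: CountedReader, CountedLazyDoc and SketchIndex are hypothetical names, not Ferret classes, and I have not checked how well this would map onto Ferret's internals.

```ruby
# Hypothetical sketch; none of these names exist in Ferret.
class CountedReader
  attr_reader :refs

  def initialize(docs)
    @docs = docs
    @refs = 1          # the index itself holds one reference
    @closed = false
  end

  def closed?
    @closed
  end

  def retain
    @refs += 1
    self
  end

  # Drops one reference; the reader really closes only at zero.
  def release
    @refs -= 1
    @closed = true if @refs.zero?
  end

  def read_field(doc_id, field)
    raise IOError, "reader is closed" if closed?
    @docs.fetch(doc_id)[field]
  end
end

class CountedLazyDoc
  def initialize(reader, doc_id)
    @reader = reader.retain   # keep the reader alive while this doc exists
    @doc_id = doc_id
  end

  def [](field)
    @reader.read_field(@doc_id, field)
  end

  # The caller signals that it is done with the document.
  def release
    @reader.release
  end
end

class SketchIndex
  def initialize
    @docs = []
    @reader = nil
  end

  def <<(doc)
    @docs << doc
    # Drop the index's own reference instead of closing outright.
    @reader.release if @reader
    @reader = nil
    self
  end

  def [](doc_id)
    @reader ||= CountedReader.new(@docs.dup)
    CountedLazyDoc.new(@reader, doc_id)
  end
end

index = SketchIndex.new
index << { :body => "first" }
doc = index[0]
index << { :body => "another document" }
puts doc[:body]   # the old reader is still open because doc holds a reference
doc.release       # now the old reader can really be closed
```

The obvious cost is that someone has to release the documents (explicitly, as here, or from a finalizer), and old readers may hold file handles open longer than they do today.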
[1] http://rubyforge.org/pipermail/sup-talk/2008-November/001782.html
[2] https://github.com/dbalmain/ferret
[3] http://lucene.apache.org/java/2_4_0/api/core/org/apache/lucene/index/IndexReader.html#document(int,%20org.apache.lucene.document.FieldSelector)