On Jun 1, 2006, at 6:15 AM, David B. wrote:

> [...] higher priority.
>
> I’m sure you’ve considered this, but what does that add compared to a
> GCJ+SWIG approach, as with PyLucene? Without having looked at it, is
> there anything which prevents that method from being applied to Ruby?

It can be done, but it’s still a lot of work and I just didn’t feel up
to the task. Plus we get better performance this way with a much
smaller download.

Java Lucene is built on the assumption, quite reasonable for Java as
a compiled language[1], that method calls are cheap and object
creation and destruction are cheap. The fact that they are much more
expensive in an interpreted language is the main reason the pure-Perl
port of Lucene, Plucene, runs so slowly
(<http://www.rectangular.com/kinosearch/benchmarks.html>). Lack of
access to primitive data types
such as int is another reason, but it’s actually not that great a
factor compared to the OO overhead (I did extensive hacking on
Plucene before deciding I had no choice but to start from scratch,
and rewriting the IO classes in C didn’t help as much as anyone
expected). Presumably similar factors are at work slowing down the
pure-Ruby Ferret.
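
To make the cost concrete, here is roughly the shape of the per-token
loop that drives indexing in 1.x/2.x-era Lucene (simplified, not the
exact DocumentWriter code; addPosition is a made-up stand-in for the
real posting-list bookkeeping): one method call and at least one Token
object per term, which is cheap in Java and painful when every call
and allocation goes through an interpreter.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    class InvertSketch {
        // Roughly what happens per field at index time: pull Tokens one
        // at a time from the analysis chain.  Every term costs at least
        // one next() call and one Token allocation -- cheap in Java,
        // expensive in Perl or Ruby.
        static void invert(Analyzer analyzer, String field, Reader reader)
                throws IOException {
            TokenStream stream = analyzer.tokenStream(field, reader);
            int position = 0;
            for (Token t = stream.next(); t != null; t = stream.next()) {
                addPosition(field, t.termText(), position++);
            }
            stream.close();
        }

        // Made-up stand-in for the posting-list bookkeeping the real
        // indexer does.
        static void addPosition(String field, String term, int position) {
            // ... record (field, term, position) ...
        }
    }
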
The OO overhead problems are mitigated by going the GCJ route, but
not eliminated. Say you want to subclass Analyzer – which most
significant deployments of Lucene will want to do eventually. The
way a TokenStream works in Lucene, several method calls are required
for each and every token – one for each Analyzer the token passes
through. That gets extremely expensive in an interpreted language.
Furthermore, none of Perl’s native string manipulation tools work
with UTF-16 strings. So if you wanted to, say, insert a custom Perl
TokenFilter into a Lucene Analysis chain, you’d have to translate
between UTF-8 and UTF-16 each time you cross the Perl/Java boundary,
making the TokenStream concept a double disaster.
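
For concreteness, here is a minimal sketch of the kind of Analyzer
subclass I mean, against the Lucene 1.x/2.x analysis API; MyStemFilter
is a made-up stand-in for whatever site-specific filtering a
deployment needs:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    class MyAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(reader);
            stream = new LowerCaseFilter(stream);
            stream = new MyStemFilter(stream); // the custom link in the chain
            return stream;
        }
    }

    // Every Token emitted by the tokenizer triggers one next() call on
    // LowerCaseFilter and another on MyStemFilter: at least one method
    // call per filter, per token.  Were MyStemFilter written in Perl or
    // Ruby, each of those calls would also cross the language boundary
    // (and, for Perl, a UTF-16 <-> UTF-8 conversion).
    class MyStemFilter extends TokenFilter {
        MyStemFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            Token t = input.next();  // one call per token
            // ... site-specific transformation would go here ...
            return t;
        }
    }
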
An alternate way of processing Tokens is to have each link in the
Analyzer chain accept a “TokenBatch” instead of a TokenStream: an
array of Tokens, rather than a stream of Tokens. That way, each
Analyzer can iterate over all the Tokens in a tight loop, either
natively or in C. The downside of this technique is that it’s not
possible to feed it directly from a filehandle/Reader, but that’s
small potatoes.
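
A rough sketch of the idea, with made-up names (this isn’t
KinoSearch’s actual API, and the real code lives in C, but it shows
the shape):

    // Hypothetical TokenBatch: a plain array of token data rather than
    // a stream that has to be pumped one Token at a time.
    class TokenBatch {
        String[] terms;
        int[] starts;
        int[] ends;
        int size;
    }

    // Each link in the chain is called once per *batch*, not once per
    // token, and loops over all the tokens natively.
    class LowerCaseBatchFilter {
        void process(TokenBatch batch) {
            for (int i = 0; i < batch.size; i++) {
                batch.terms[i] = batch.terms[i].toLowerCase();
            }
        }
    }
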
It would be possible to graft the TokenBatch concept onto a GCJ’d
Lucene: create a native full analysis chain which spits out a
TokenBatch, then have the TokenBatch pretend it’s a TokenStream,
feeding Tokens to Lucene using a C version of next(). That would
perform OK – but you couldn’t ever mix and match Java Lucene
Analyzers with native Analyzers, only prepend the native ones onto
the front. Therefore, you’d have to rewrite the entire
org.apache.lucene.analysis package anyway – it’s the only way you’re
going to get both full flexibility and performance. And once you’ve
started down the path of rewriting large portions of Lucene, it’s
hard to see why you’d put up with the headache of the GCJ approach.
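
The graft described above would amount to an adapter something like
this (sketched in Java for readability, though in practice the next()
implementation would be the C version; TokenBatch is the made-up class
from the previous sketch):

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // A TokenBatch pretending to be a TokenStream: the native analysis
    // chain fills the batch up front, and Lucene drains it via next().
    class TokenBatchStream extends TokenStream {
        private final TokenBatch batch;
        private int pos = 0;

        TokenBatchStream(TokenBatch batch) {
            this.batch = batch;
        }

        public Token next() {
            if (pos >= batch.size) {
                return null;  // end of stream
            }
            Token t = new Token(batch.terms[pos], batch.starts[pos],
                                batch.ends[pos]);
            pos++;
            return t;
        }
    }
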
There are many other areas where Lucene’s architecture is poorly
suited for use with an interpreted language. Dave has solved those
problems mainly by rewriting the whole thing in C. KinoSearch has
taken that approach in some cases, but, more often than Ferret does,
it uses modified algorithms instead. TokenBatch is one example; the
best one, which is harder to explain here, is how KinoSearch merges
together inverted documents during indexing. (In summary, it’s
faster, simpler, and requires far, far fewer objects.)

It would be possible to port some of these algorithm changes to
Lucene, but they would be pretty disruptive. Lucene’s a mature,
heavily-used library and changing anything at all requires a lot of
consideration. Some of the changes I would like to see, I don’t
think I could lobby for in good conscience. The bytecounts-as-string-
headers patch is a good example. For Ferret and KinoSearch, its
adoption would yield a very significant benefit, as it would open the
door to using Luke to browse indexes. For Java Lucene, though, it
can only be justified by further changes which build upon it.

The downside of the full-port approach that Dave and I have taken is
that it’s a lot of work to build and maintain. However, we’ve
already done the vast majority of the up-front work once. Re-doing
it for Lucy will be a cakewalk in comparison. The maintenance
problem that KinoSearch and Ferret currently face, we’re addressing
by sharing the C core. We would not be surprised if others join us
– I know of at least one other person who rewrote Lucene in C:
Robert Kirchgessner, who did a partial PHP/C port. Heck, it will
presumably be easier to maintain a Python port against Lucy than
against GCJ’d Lucene, provided that we achieve what we’ve set out to
achieve.

The only question remaining, I think, is whether the project will
actually be hosted at Apache. When Dave and I approached Doug
Cutting about it, he specifically requested that development take
place there – before Dave or I had had a chance to indicate that
that was our preference as well. However, we’ve been waiting for
approval by the Lucene PMC for a couple of weeks now, and I’m not sure
it’s coming. I’m guessing that Erik “One Lucene To Rule Them All”
Hatcher hasn’t cast his +1. IMO, it would be best for everybody
if we did this within the Lucene family, but we’ll just have to see.

Marvin H.
Rectangular Research
http://www.rectangular.com/

[1] What constitutes a compiled vs. a dynamic language is debatable
– see http://en.wikipedia.org/wiki/Interpreted_language. It might
be more accurate to describe Java as a “more compiled” language.