On Jun 1, 2006, at 6:15 AM, David B. wrote:

> [...] higher priority.
>
> I’m sure you’ve considered this, but what does that add compared to a
> GCJ+SWIG approach, as with PyLucene? Without having looked at it, is
> there anything which prevents that method from being applied to Ruby?

It can be done, but it’s still a lot of work and I just didn’t feel up
to the task. Plus we get better performance this way with a much
smaller download.

Java Lucene is built on the assumption, quite reasonable for Java as
a compiled language[1], that method calls are cheap and object
creation and destruction are cheap. The fact that they are much more
expensive in an interpreted language is the main reason the pure-Perl
port of Lucene, Plucene, runs so slowly
(<http://www.rectangular.com/kinosearch/benchmarks.html>). Lack of
access to primitive data types
such as int is another reason, but it’s actually not that great a
factor compared to the OO overhead (I did extensive hacking on
Plucene before deciding I had no choice but to start from scratch,
and rewriting the IO classes in C didn’t help as much as anyone
expected). Presumably similar factors are at work slowing down the
pure-Ruby Ferret.
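
To make the cost concrete, here is roughly the shape of the per-token
loop that drives indexing in 1.x/2.x-era Lucene (simplified, not the
exact DocumentWriter code; addPosition is a made-up stand-in for the
real posting-list bookkeeping): one method call and at least one Token
object per term, which is cheap in Java and painful when every call
and allocation goes through an interpreter.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    class InvertSketch {
        // Roughly what happens per field at index time: pull Tokens one
        // at a time from the analysis chain.  Every term costs at least
        // one next() call and one Token allocation -- cheap in Java,
        // expensive in Perl or Ruby.
        static void invert(Analyzer analyzer, String field, Reader reader)
                throws IOException {
            TokenStream stream = analyzer.tokenStream(field, reader);
            int position = 0;
            for (Token t = stream.next(); t != null; t = stream.next()) {
                addPosition(field, t.termText(), position++);
            }
            stream.close();
        }

        // Made-up stand-in for the posting-list bookkeeping the real
        // indexer does.
        static void addPosition(String field, String term, int position) {
            // ... record (field, term, position) ...
        }
    }
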
The OO overhead problems are mitigated by going the GCJ route, but
not eliminated. Say you want to subclass Analyzer – which most
significant deployments of Lucene will want to do eventually. The
way a TokenStream works in Lucene, several method calls are required
for each and every token – one for each Analyzer the token passes
through. That gets extremely expensive in an interpreted language.
Furthermore, none of Perl’s native string manipulation tools work
with UTF-16 strings. So if you wanted to, say, insert a custom Perl
TokenFilter into a Lucene Analysis chain, you’d have to translate
between UTF-8 and UTF-16 each time you cross the Perl/Java boundary,
making the TokenStream concept a double disaster.
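
For concreteness, here is a minimal sketch of the kind of Analyzer
subclass I mean, against the Lucene 1.x/2.x analysis API; MyStemFilter
is a made-up stand-in for whatever site-specific filtering a
deployment needs:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    class MyAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(reader);
            stream = new LowerCaseFilter(stream);
            stream = new MyStemFilter(stream); // the custom link in the chain
            return stream;
        }
    }

    // Every Token emitted by the tokenizer triggers one next() call on
    // LowerCaseFilter and another on MyStemFilter: at least one method
    // call per filter, per token.  Were MyStemFilter written in Perl or
    // Ruby, each of those calls would also cross the language boundary
    // (and, for Perl, a UTF-16 <-> UTF-8 conversion).
    class MyStemFilter extends TokenFilter {
        MyStemFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            Token t = input.next();  // one call per token
            // ... site-specific transformation would go here ...
            return t;
        }
    }
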
An alternate way of processing Tokens is to have each link in the
Analyzer chain accept a “TokenBatch” instead of a TokenStream: an
array of Tokens, rather than a stream of Tokens. That way, each
Analyzer can iterate over all the Tokens in a tight loop, either
natively or in C. The downside of this technique is that it’s not
possible to feed it directly from a filehandle/Reader, but that’s
small potatoes.
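
A rough sketch of the idea, with made-up names (this isn’t
KinoSearch’s actual API, and the real code lives in C, but it shows
the shape):

    // Hypothetical TokenBatch: a plain array of token data rather than
    // a stream that has to be pumped one Token at a time.
    class TokenBatch {
        String[] terms;
        int[] starts;
        int[] ends;
        int size;
    }

    // Each link in the chain is called once per *batch*, not once per
    // token, and loops over all the tokens natively.
    class LowerCaseBatchFilter {
        void process(TokenBatch batch) {
            for (int i = 0; i < batch.size; i++) {
                batch.terms[i] = batch.terms[i].toLowerCase();
            }
        }
    }
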
It would be possible to graft the TokenBatch concept onto a GCJ’d
Lucene: create a native full analysis chain which spits out a
TokenBatch, then have the TokenBatch pretend it’s a TokenStream,
feeding Tokens to Lucene using a C version of next(). That would
perform OK – but you couldn’t ever mix and match Java Lucene
Analyzers with native Analyzers, only prepend the native ones onto
the front. Therefore, you’d have to rewrite the entire
org.apache.lucene.analysis package anyway – it’s the only way you’re
going to get both full flexibility and performance. And once you’ve
started down the path of rewriting large portions of Lucene, it’s
hard to see why you’d put up with the headache of the GCJ approach.
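
The graft described above would amount to an adapter something like
this (sketched in Java for readability, though in practice the next()
implementation would be the C version; TokenBatch is the made-up class
from the previous sketch):

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // A TokenBatch pretending to be a TokenStream: the native analysis
    // chain fills the batch up front, and Lucene drains it via next().
    class TokenBatchStream extends TokenStream {
        private final TokenBatch batch;
        private int pos = 0;

        TokenBatchStream(TokenBatch batch) {
            this.batch = batch;
        }

        public Token next() {
            if (pos >= batch.size) {
                return null;  // end of stream
            }
            Token t = new Token(batch.terms[pos], batch.starts[pos],
                                batch.ends[pos]);
            pos++;
            return t;
        }
    }
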
There are many other areas where Lucene’s architecture is poorly
suited for use with an interpreted language. Dave has solved those
problems mainly by rewriting the whole thing in C. KinoSearch has
taken that approach in some cases, but, more often than Ferret does,
it uses modified algorithms instead. TokenBatch is one example; the
best one, which is harder to explain here, is how KinoSearch merges
together inverted documents during indexing. (In summary, it’s
faster, simpler, and requires far, far fewer objects.)

It would be possible to port some of these algorithm changes to
Lucene, but they would be pretty disruptive. Lucene’s a mature,
heavily-used library and changing anything at all requires a lot of
consideration. Some of the changes I would like to see, I don’t
think I could lobby for in good conscience. The bytecounts-as-string-
headers patch is a good example. For Ferret and KinoSearch, its
adoption would yield a very significant benefit, as it would open the
door to using Luke to browse indexes. For Java Lucene, though, it
can only be justified by further changes which build upon it.

The downside of the full-port approach that Dave and I have taken is
that it’s a lot of work to build and maintain. However, we’ve
already done the vast majority of the up-front work once. Re-doing
it for Lucy will be a cakewalk in comparison. The maintenance
problem that KinoSearch and Ferret currently face, we’re addressing
by sharing the C core. We would not be surprised if others join us
– I know of at least one other person who rewrote Lucene in C:
Robert Kirchgessner, who did a partial PHP/C port. Heck, it will
presumably be easier to maintain a Python port against Lucy than
against GCJ’d Lucene, provided that we achieve what we’ve set out to
achieve.

The only question remaining, I think, is whether the project will
actually be hosted at Apache. When Dave and I approached Doug
Cutting about it, he specifically requested that development take
place there – before Dave or I had had a chance to indicate that
that was our preference as well. However, we’ve been waiting for
approval by the Lucene PMC for a couple of weeks now, and I’m not sure
it’s coming. I’m guessing that Erik “One Lucene To Rule Them All”
Hatcher hasn’t cast his +1. IMO, it would be best for everybody
if we did this within the Lucene family, but we’ll just have to see.

Marvin H.
Rectangular Research
http://www.rectangular.com/

[1] What constitutes a compiled vs. a dynamic language is debatable
– see http://en.wikipedia.org/wiki/Interpreted_language. It might
be more accurate to describe Java as a “more compiled” language.