Ferret/AAF Stability?

sams · November 15, 2007, 3:37pm

Hello. I’m the author of DataMapper (http://datamapper.org), and am
trying to choose what Full-Text-Indexing engine/plugin I want to
include by default. I was hoping you guys could help.

Sphinx comes highly recommended, but without live index updates, it
just doesn’t seem practical for most of my work.

I’m most experienced with Solr, but the whole HTTP::Request and
general complexity of it is off-putting.

I haven’t used Ferret in an application yet, but I love what I see so
far. The ability to have an in-process server in development, and the
clean Ruby API are big wins for me. But I’ve heard a lot of scary
things about corrupted indexes, even when using the DRb server. Is
this just FUD? Are there any unresolved issues revolving around
corrupted indexes? Can I afford to use Ferret in big applications for
Fortune-500 clients? (I know that sounds… pompous really, but it’s a
genuine concern.)

Any advice you could offer would be greatly appreciated.

I’ve also read a few messages about serializing index requests/updates
to Ferret through message-queues. Are there any decent
guides/blog-posts on this topic?

Thanks, -Sam

sams · November 15, 2007, 4:45pm

We have several 3GB indexes with approximately 1 million documents in
each of them. Here are some quick notes, feel free to reach out with
other questions:

- No corruption problems that weren’t our fault.
- There was an issue with large index files (> ~2GB) that was patched,
  but I’m honestly not sure whether the patch is in trunk, as the Ferret
  trac/svn is frequently MIA (which is a concern, of course).
- The code is clear and fairly easy to follow. AAF is very easy to
  follow.
- I’ve been very happy with the performance of the actual indexing/
  searching; however, you need to watch out for the processes that
  actually do the synchronization for writes. DRb is a bottleneck for
  us right now, though our volume isn’t high enough that I’d call it a
  real problem yet.
- For moderately high-volume sites you’ll want to consider batching
  index updates “offline”, though for large indexes make sure that you
  have enough IO capacity to optimize the index. We host on EC2 and the
  $0.10/hour instances simply do not have anywhere near the IO capacity
  to optimize a large index without having every other process waiting
  for IO. I haven’t tested the larger instance types yet.
- We love how easy and efficient it is to combine many indexes into
  one. We index tens of thousands of websites in parallel and then
  combine 100 or so indexes into one index very quickly.
- The mailing list is great. Jens is on top of things, very receptive
  to new ideas and takes very good care of AAF. Haven’t seen Dave
  Balmain in a while.
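
To make the "index in parallel, then combine" idea concrete, here is a toy Ruby sketch. It is not Ferret's or AAF's API: `TinyIndex`, its methods, and the sample sites are invented for illustration, and the "index" is just an in-memory hash of term to document ids. It assumes document ids are unique across the partial indexes.

```ruby
# Toy inverted index: term => list of document ids.
# NOT Ferret's API; it only illustrates building partial indexes and merging them.
class TinyIndex
  attr_reader :postings, :docs

  def initialize
    @postings = Hash.new { |hash, term| hash[term] = [] }
    @docs = {}
  end

  def add(doc_id, text)
    @docs[doc_id] = text
    text.downcase.scan(/\w+/).uniq.each { |term| @postings[term] << doc_id }
    self
  end

  # Combine several partial indexes into a new one.
  # Assumption: doc ids are globally unique across the partial indexes.
  def self.merge(indexes)
    merged = new
    indexes.each do |index|
      index.docs.each { |doc_id, text| merged.docs[doc_id] = text }
      index.postings.each { |term, ids| merged.postings[term].concat(ids) }
    end
    merged.postings.each_value { |ids| ids.sort!.uniq! }
    merged
  end

  def search(term)
    @postings.fetch(term.downcase, [])
  end
end

# Hypothetical sample data: one small index per site, each built in its own thread.
sites = {
  "a.example" => { "a1" => "ferret index stability", "a2" => "solr caching" },
  "b.example" => { "b1" => "ferret drb server" }
}

partials = sites.map do |_site, pages|
  Thread.new do
    pages.each_with_object(TinyIndex.new) { |(id, text), idx| idx.add(id, text) }
  end
end.map(&:value)

combined = TinyIndex.merge(partials)
```

Each partial index is only written by its own thread, so no locking is needed until the single merge step at the end.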

Overall we are happy. There are times when search accuracy questions
come up, and frequently the problem is that we are not effectively
parsing queries or using the right analyzer for the problem at hand,
so RTFM (http://www.oreilly.com/catalog/9780596527853/).

That’s all I can think of now…

Erik

sams · November 15, 2007, 7:42pm

Hey …

I haven’t used Ferret in an application yet, but I love what I see so
far. The ability to have an in-process server in development, and the
clean Ruby API are big wins for me. But I’ve heard a lot of scary
things about corrupted indexes, even when using the DRb server. Is
this just FUD? Are there any unresolved issues revolving around
corrupted indexes? Can I afford to use Ferret in big applications for
Fortune-500 clients? (I know that sounds… pompous really, but it’s a
genuine concern.)

We’ve been using Ferret on omdb.org for 14 months without any problems.
There are a few things you might want to work around (Erik pointed
some out). If you expect a huge number of index updates, you need
to think about a few infrastructure problems, because right now AAF
does not allow you to cluster indexing servers. But I know there is a
solution for that.

If you just have a huge number of search queries, there is no need
to worry… I would not suggest using AAF’s Ferret server for searching,
though… but it’s quite easy to do the searching in each Mongrel, so
no concern here either.

I guess we need more information about the data you want to index
to give more detailed advice.

I’ve also read a few messages about serializing index requests/updates
to Ferret through message-queues. Are there any decent
guides/blog-posts on this topic?

Yes, that’s currently being worked on… so there will be some guides
later on.
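
Until such a guide exists, the general shape of "serialize updates through a queue" can be sketched with nothing but Ruby's built-in `Queue`. This is a hedged, in-process toy: `SerializedIndexWriter` is a made-up name, a plain Hash stands in for the real index, and a production setup would more likely use an external message queue feeding a single indexing process.

```ruby
# Toy sketch: many producers enqueue updates, exactly one worker thread
# applies them, so the index only ever has a single writer.
# The Hash passed in is a stand-in for a real index (assumption).
class SerializedIndexWriter
  def initialize(index)
    @index = index
    @queue = Queue.new
    @worker = Thread.new { drain }
  end

  # Safe to call from any thread; Queue handles the synchronization.
  def enqueue(doc_id, fields)
    @queue << [doc_id, fields]
  end

  # Ask the worker to stop after everything already queued has been applied.
  def shutdown
    @queue << :stop
    @worker.join
  end

  private

  def drain
    loop do
      job = @queue.pop          # blocks until a job is available
      break if job == :stop
      doc_id, fields = job
      @index[doc_id] = fields   # the only place the index is written
    end
  end
end

index = {}
writer = SerializedIndexWriter.new(index)

# Four concurrent producers, 25 updates each.
producers = 4.times.map do |n|
  Thread.new do
    25.times { |i| writer.enqueue("doc-#{n}-#{i}", { body: "text #{i}" }) }
  end
end
producers.each(&:join)
writer.shutdown
```

The request threads never touch the index themselves; they only pay the cost of pushing onto the queue.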

Cheers
Ben

Benjamin K.

sams · November 15, 2007, 8:29pm

On Nov 15, 2007, at 1:41 PM, Benjamin K. wrote:

I would not suggest using AAF’s Ferret server for searching,
though… but it’s quite easy to do the searching in each Mongrel, so
no concern here either.

I’m confused… what does “searching” mean in this context?

John

sams · November 15, 2007, 9:07pm

John,

On Nov 15, 2007, at 1:41 PM, Benjamin K. wrote:

I would not suggest using AAF’s Ferret server for searching,
though… but it’s quite easy to do the searching in each Mongrel, so
no concern here either.

I’m confused… what does “searching” mean in this context?

If you’re using AAF, you should use the Ferret DRb server to index
your objects. However, using the Ferret server means that whenever
someone searches (if you’re using Model.find_by_contents),
the search is forwarded to the Ferret server.

The Ferret server processes the search request and sends
the response back to the Mongrel. This overhead isn’t
necessary, as each Mongrel could use a local index to do the
search. There is no need to bother the Ferret server.

So, indexing (aka updating, creating, saving, whatever) should
use the Ferret server. Searching (using find_by_contents)
will also go through the Ferret server if you’re using standard AAF,
even though that’s not really necessary and could become a bottleneck.

Don’t get me wrong: it is totally fine to use standard AAF unless
you have huge numbers of searches or live searches. I would
not recommend using a custom Ferret solution unless you
expect a problem or already have one.
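
The split described above (writes through the one server, reads done locally) can be pictured with a small Ruby toy. None of this is AAF or Ferret code: `ToyIndexServer`, `ToyLocalReader`, and `SearchRouter` are invented names, and a shared Hash stands in for the index files that both the server and the app processes would read.

```ruby
# Stand-in for the single index-writing process (the role the DRb server plays).
class ToyIndexServer
  attr_reader :documents, :search_calls

  def initialize
    @documents = {}
    @search_calls = 0
  end

  def add(id, text)
    @documents[id] = text
  end

  def search(term)
    @search_calls += 1   # count how often searches hit the server
    @documents.select { |_id, text| text.include?(term) }.keys
  end
end

# Stand-in for a read-only handle on the same index, opened in each app process.
class ToyLocalReader
  def initialize(documents)
    @documents = documents
  end

  def search(term)
    @documents.select { |_id, text| text.include?(term) }.keys
  end
end

# Writes always go to the server; reads go local when a local reader is given.
class SearchRouter
  def initialize(server, local_reader = nil)
    @server = server
    @local_reader = local_reader
  end

  def index(id, text)
    @server.add(id, text)
  end

  def search(term)
    (@local_reader || @server).search(term)
  end
end

server = ToyIndexServer.new
standard = SearchRouter.new(server)                                   # everything via the server
local = SearchRouter.new(server, ToyLocalReader.new(server.documents)) # searches stay local

standard.index(1, "ferret stability")
standard.index(2, "solr caching")
via_server = standard.search("ferret")
via_local = local.search("ferret")
```

Both routers return the same hits, but only the first one adds to `server.search_calls`, which is the overhead being discussed.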

Cheers
Ben

Benjamin K.

sams · November 17, 2007, 1:39pm

Hi!

On Fri, Nov 16, 2007 at 12:19:10PM -0500, Stuart Sierra wrote:
[…]

For a different perspective: I’m in the middle of switching from
Ferret to Solr. I like Ferret a lot, and still use it on several
sites, but I had some problems with one large site:

the patches for large-index support are still in development;

Let’s hope Dave reads this. However, there are several sites I know of
with index sizes of several GB, so they seem to be working well enough.

each update to Ferret requires rebuilding the index;

This is certainly annoying, but I’d consider it normal for a library
that has developed that fast. I think Dave has had very good reasons for
each of the changes he made to the index format. Plus, I don’t think
every release had a new index format.

Ferret doesn’t yet support compressed indexes.

At least from the docs it looks like it does, see
http://ferret.davebalmain.com/api/classes/Ferret/Index/FieldInfo.html .
I didn’t ever try this out however.

My other reason for switching is that Rails’ ActiveRecord is not
well-suited to storing large documents, which made acts_as_ferret less
compelling.

That’s a good point, and we plan to make aaf independent from
active_record in the future.

I was nervous about tackling Solr, but I’ve found it quite easy to
use, and the built-in caching and multithreading make it fast.

numbers, please

I think Ferret is adequate for most search tasks, but if (like me)
you’re building a dedicated search engine, Solr is currently a
stronger candidate.

Well, as Solr uses Lucene internally, the mechanics and performance
characteristics naturally can’t be that different from Ferret’s. Maybe
Ferret has a bug or two and non-working inter-process locking (which
doesn’t matter when you think about building a dedicated search server
like Solr is, since it’s only one process), but the general internal
handling of the index is the same, i.e. you can also only have one
Writer open to a Lucene index at a time, and Searchers won’t see index
changes until re-opened, too.
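
The two rules mentioned here (one writer at a time, and searchers only seeing changes after being re-opened) can be modelled in a few lines of Ruby. This is a toy under stated assumptions, not Lucene's or Ferret's implementation: `ToyIndex` and `ToySearcher` are made-up names, and a frozen Hash plays the role of the committed index.

```ruby
# Toy model of "single writer" plus "searchers work on a snapshot".
class ToyIndex
  def initialize
    @committed = {}.freeze      # last committed state, never mutated in place
    @writer_lock = Mutex.new
  end

  # Only one writer may be open at a time; a second attempt fails fast.
  def with_writer
    raise "index is locked by another writer" unless @writer_lock.try_lock
    begin
      pending = @committed.dup  # dup of a frozen Hash is an unfrozen copy
      yield pending
      @committed = pending.freeze
    ensure
      @writer_lock.unlock
    end
  end

  # A searcher keeps whatever was committed at the moment it was opened.
  def open_searcher
    ToySearcher.new(@committed)
  end
end

class ToySearcher
  def initialize(snapshot)
    @snapshot = snapshot
  end

  def search(term)
    @snapshot.select { |_id, text| text.include?(term) }.keys
  end
end

index = ToyIndex.new
index.with_writer { |docs| docs[1] = "ferret stability" }

searcher = index.open_searcher
index.with_writer { |docs| docs[2] = "ferret and solr" }

stale = searcher.search("ferret")             # still the old snapshot
fresh = index.open_searcher.search("ferret")  # re-opened, sees the new document
```

The old searcher keeps answering from its snapshot, which is why a long-lived searcher has to be re-opened after updates.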

Having that said, if my application’s main concern would be search, I
most probably wouldn’t choose any pre-cooked solution like aaf or Solr,
but build exactly the thing I need from scratch, basing it either on
Lucene or Ferret. But maybe that’s just me

Cheers,
Jens

–
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

sams · November 16, 2007, 6:19pm

On Nov 15, 2007 9:37 AM, Sam S. wrote:

Hello. I’m the author of DataMapper (http://datamapper.org), and am
trying to choose what Full-Text-Indexing engine/plugin I want to
include by default. I was hoping you guys could help.

Sphinx comes highly recommended, but without live index updates, it
just doesn’t seem practical for most of my work.

I’m most experienced with Solr, but the whole HTTP::Request and
general complexity of it is off-putting.

For a different perspective: I’m in the middle of switching from
Ferret to Solr. I like Ferret a lot, and still use it on several
sites, but I had some problems with one large site:

- the patches for large-index support are still in development;
- each update to Ferret requires rebuilding the index;
- Ferret doesn’t yet support compressed indexes.

My other reason for switching is that Rails’ ActiveRecord is not
well-suited to storing large documents, which made acts_as_ferret less
compelling.

I was nervous about tackling Solr, but I’ve found it quite easy to
use, and the built-in caching and multithreading make it fast.

I think Ferret is adequate for most search tasks, but if (like me)
you’re building a dedicated search engine, Solr is currently a
stronger candidate.

-Stuart Sierra

sams · November 19, 2007, 3:59am

On Nov 17, 2007 7:39 AM, Jens K. wrote:

Ferret doesn’t yet support compressed indexes.

At least from the docs it looks like it does, see
http://ferret.davebalmain.com/api/classes/Ferret/Index/FieldInfo.html .
I didn’t ever try this out however.

Yes, it’s in the API, but there’s no code for it yet.

I was nervous about tackling Solr, but I’ve found it quite easy to
use, and the built-in caching and multithreading make it fast.

numbers, please

I make no claim that it’s faster than Ferret, but it’s fast enough.

Having that said, if my application’s main concern would be search, I
most probably wouldn’t choose any pre-cooked solution like aaf or Solr,
but build exactly the thing I need from scratch, basing it either on
Lucene or Ferret. But maybe that’s just me

I’d like to do that, but I lack sufficient time and skill. In the
meantime, I’m hoping Solr will let me offer an open search API to my
users without too much extra effort on my part. We’ll see how it
goes; I may end up back on Ferret at some point.

-Stuart

sams · November 18, 2007, 11:25am

On Nov 17, 2007, at 7:39 AM, Jens K. wrote:

Writer open to a Lucene index at a time, and Searchers won’t see index
changes until re-opened, too.

That’s all true. However, Solr manages all the IndexWriter/
IndexSearcher stuff for you quite transparently (which I guess is
comparable to Ferret + DRb, eh?). Because it is a single point of
access to the index, it takes care of the single writer situation,
and also handles warming IndexSearchers before coming online so that
caches are built and a search on an updated index is as fast as it
was before being updated.
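
For readers who have not seen searcher warming before, the idea can be sketched roughly as follows. This is not how Solr implements it: `CachingSearcher` and `SearcherManager` are invented names, a Hash stands in for the index, and "warming" here simply means replaying recently seen queries against the new searcher before it replaces the old one.

```ruby
# Toy searcher with a per-instance result cache.
class CachingSearcher
  def initialize(documents)
    @documents = documents
    @cache = {}
  end

  def search(term)
    @cache[term] ||= @documents.select { |_id, text| text.include?(term) }.keys
  end

  def cached?(term)
    @cache.key?(term)
  end
end

# Single point of access: remembers recent queries and swaps searchers.
class SearcherManager
  attr_reader :current

  def initialize(documents)
    @recent_queries = []
    @current = CachingSearcher.new(documents)
  end

  def search(term)
    @recent_queries << term unless @recent_queries.include?(term)
    @current.search(term)
  end

  # Build the replacement, warm its cache, and only then put it online.
  def reopen(documents)
    replacement = CachingSearcher.new(documents)
    @recent_queries.each { |term| replacement.search(term) }
    @current = replacement
  end
end

manager = SearcherManager.new({ 1 => "ferret index" })
manager.search("ferret")

# After an index update, the new searcher is warmed before the swap.
manager.reopen({ 1 => "ferret index", 2 => "ferret drb" })
warmed = manager.current.cached?("ferret")
hits = manager.search("ferret")
```

The first search after the swap is served from the warmed cache instead of paying the cold-start cost.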

Having that said, if my application’s main concern would be search, I
most probably wouldn’t choose any pre-cooked solution like aaf or Solr,
but build exactly the thing I need from scratch, basing it either on
Lucene or Ferret. But maybe that’s just me

You’d be reinventing a lot of wheels doing that, with IndexWriter
synchronization, IndexSearcher warming, caching, and much more.

Erik
