Road map of Ferret

That is one awesome rebuttal, Jens. I read the forum topic below, and
while I have great respect for Ezra (from his fine book Deploying
Rails Applications), I must say I disagree with him with respect to
the Ferret/AAF combination.

We run Ferret/AAF as a DRb server in production and on our staging
servers and I’ve never seen a Ferret segfault. That said, we’re not
high search load like Google, but even when hit with heavy load
testing, I haven’t experienced a Ferret segfault, nor corrupt indexes.

Now, corrupt indexes in development are another issue. In development,
you are not running a DRb server; each mongrel is hitting the index
directly. You typically have only one mongrel running in development.
But if you open an interactive script/console session and play with
your models side by side with a running mongrel, you WILL corrupt your
Ferret index. That’s because both the mongrel and the script/console
will be writers to the same index, something that Ferret doesn’t
support. Heck, running a rake db:migrate alongside a running mongrel
will cause index corruption, for the same reason: multiple writers.
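The multiple-writer failure mode can be illustrated with a plain
advisory file lock - a sketch only, since Ferret’s real locking is
internal to the library and does not use flock:

```ruby
require "tempfile"

# Illustration only: the single-writer constraint behaves like an
# exclusive advisory lock on the index. Two processes (say, a mongrel
# and a script/console session) both opening the index for writing is
# the corruption scenario described above.
lockfile = Tempfile.new("ferret-index-lock")

mongrel = File.open(lockfile.path, "w")
console = File.open(lockfile.path, "w")

# The first writer acquires the lock...
got_first = mongrel.flock(File::LOCK_EX)                  # => 0 (acquired)

# ...and a second writer must not proceed. Ferret has no such guard
# across processes, which is why the second writer corrupts the index.
got_second = console.flock(File::LOCK_EX | File::LOCK_NB) # => false (denied)
```

If Ferret enforced something like this, the script/console session
would fail loudly instead of silently corrupting the index.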

I wonder if that’s why so many people experience Ferret indexing
problems in development. It’s not always immediately obvious that
you’re in a multiple-writer scenario.

For now, I’m sticking with the Ferret/AAF combination until one or the
other falls over completely.

Sheldon M.
Developer
http://ideas.veer.com

On 28.08.2008, at 17:17, Erik H. wrote:

If you’re talking about custom analyzers being in Ruby, more on that
below.

It’s not only custom analyzers; there’s also the fact that
acts_as_ferret’s DRb server runs with the full Rails application
loaded, so, for example, to bulk-index a number of records aaf just
hands the server the ids and class name of the records, and the server
does the rest. It’s debatable whether one approach is better than the
other; in terms of index server load it might even be better to do as
much as possible on the client side, but it’s still a much tighter
coupling than you get with the application-agnostic interfaces of Solr
or Stellr.

I must admit that I have a hard time coming up with another example
besides my synonym/thesaurus analysis stuff where this might be
useful, but I think there are more use cases where such a tight
integration might come in handy.

It’s an independent server indexing whatever you throw over the
fence via http+xml.

Solr can now index CSV, as well as a relational database directly
(with the new DataImportHandler).

It also responds with Ruby hash structure (just add &wt=ruby to the
URLs, or use solr-ruby which does that automatically and hides all
server communication from you anyway).
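As a sketch of what that looks like from the client side (the
response string below is an abridged, made-up example of the wt=ruby
format, not output from a real server):

```ruby
# A wt=ruby response is a Ruby hash literal, so a client can eval it
# directly into a data structure. The string below is an abridged,
# invented example of the format.
raw = "{'responseHeader'=>{'status'=>0}," \
      "'response'=>{'numFound'=>2,'docs'=>[{'id'=>'1'},{'id'=>'2'}]}}"

# Note: eval assumes a trusted source; eval-ing arbitrary input is unsafe.
data = eval(raw)

data['response']['numFound']                    # number of hits
data['response']['docs'].map { |d| d['id'] }    # document ids
```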

Yeah, I know, but there is still a strict line between your
application and Solr, which doesn’t know a thing about the
application using it.

How do you use a custom analyzer with Solr? You have to code it in
Java (or do your analysis before feeding the data into Java land,
which I wouldn’t consider good app design).

Most users would not need to write a custom analyzer. Many of the
built-in ones are quite configurable. Yes, Solr does require schema
configuration via an XML file, but there have been acts_as_solr
variants (good and bad thing about this git craze) that generate
that for you automatically from an AR model.

Glad you mentioned this ;) I don’t want to configure an analyzer via
XML when I can throw my own together with 4 or 5 lines of
easy-to-read Ruby code. Same for index structure. A philosophical
mismatch between the Java and Ruby worlds, I think :)

But even if you do that then you have
a) half a java project (I don’t want that)

That’s totally fair, and really the primary compelling reason for a
Ferret over Solr for pure Ruby/Rails projects. I dig that.

But isn’t Ferret like 60k lines of C code too?!

True, but I don’t have to compile that every time I deploy my app…

and b) no way to use your existing rails classes in that custom
analyzer (I have analyzers using rails models to retrieve
synonyms and narrower terms for thesaurus based query expansion)

You could leverage client-side query expansion with Solr… just
take the user’s query, massage it, and send whatever query you like
to Solr. Solr also has synonym and stop word capability too.
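A minimal sketch of such client-side expansion, assuming a
hypothetical SYNONYMS thesaurus (none of this is Solr or solr-ruby
API - it runs entirely in the client before the query is sent):

```ruby
# Hypothetical thesaurus; in practice this might come from a database.
SYNONYMS = { "car" => %w[auto automobile] }

# Expand each query term into an OR-group of the term and its synonyms,
# producing a Lucene/Solr-style query string.
def expand(query)
  query.split.map do |term|
    syns = SYNONYMS.fetch(term.downcase, [])
    syns.empty? ? term : "(#{([term] + syns).join(' OR ')})"
  end.join(" ")
end

expand("red car")  # => "red (car OR auto OR automobile)"
```

The massaged string is then sent to Solr as an ordinary query.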

yeah, I could do that. But that’s moving analysis stuff into my
application, which is quite contrary to the purpose of analyzers -
encapsulate this logic and make it pluggable into the search engine
library. So less style points for this solution…

However, there is also no reason (and I have this on my
copious-free-time TODO list) that JRuby couldn’t be used behind the
scenes of a Solr analyzer/tokenizer/filter or even request handler…
and do all the cool Ruby stuff you like right there. Heck, you could
even send the Ruby code over to Solr to execute there if you like ;)

that sounds sexy ;)

Just using Solr and fixing up acts_as_solr to meet your needs (if it
doesn’t) would be even easier than all that :) Solr really is a
better starting point than Lucene directly, for caching,
scalability, replication, faceting, etc.

Depends on whether you need these features or not. From my experience,
lots of projects don’t need these things anyway, because they’re
running on a single host and nearly every other part of the
application is slower than search… Maybe it’s because I’m quite
involved with the topic and familiar with Lucene’s API, but to me
Solr looks like an additional layer of abstraction and complexity
which I only want when it really gives me a feature I need. Plus, the
last time I checked, Lucene didn’t need XML configuration files ;)

In development environments, and especially when it comes to
automated tests / CI, it’s also quite convenient not to have to run a
separate server but to take the shortcut directly to the index, which
isn’t possible with Solr.

I’d be curious to see scalability comparisons between Ferret and
Solr - or perhaps more properly between Stellr and Solr - as it
boils down to number of documents, queries per second, and faceting
and highlighting speed. I’m betting on Solr myself (by being so
into it and basing my professional life on it).

This would be interesting, but I wouldn’t be that disappointed if
Stellr ended up second, given the little time I’ve spent building it
so far. Just out of curiosity, do you have some kind of performance
testing suite for Solr which I could throw at Stellr?

Cheers,
Jens


Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

On Aug 28, 2008, at 10:10 AM, Jens Krämer wrote:

With Ferret I can use custom tokenizers to inject additional terms
at the same offset (e.g., synonyms). Is there another way to achieve
that with KinoSearch?

Synonym support isn’t part of the public API right now, but since the
basic principle is the same in KinoSearch as it is in Ferret and
Lucene, it shouldn’t be hard to add.

I don’t think we’d do this by extending Tokenizer; I think we’d want
SynonymFilter/SynonymMap classes akin to the ones provided by Solr.
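The principle shared by all three libraries is emitting the synonym
as an extra token with a position increment of 0, so it occupies the
same position as the original term. A minimal sketch (Token and
synonym_filter here are made-up names, not Ferret, KinoSearch, or
Solr API):

```ruby
# A token carries its text and a position increment: 1 means "next
# position", 0 means "same position as the previous token".
Token = Struct.new(:text, :pos_inc)

# Inject each synonym right after its source token with pos_inc 0.
def synonym_filter(tokens, synonyms)
  tokens.flat_map do |tok|
    extras = synonyms.fetch(tok.text, []).map { |s| Token.new(s, 0) }
    [tok, *extras]
  end
end

stream = [Token.new("fast", 1), Token.new("car", 1)]
out = synonym_filter(stream, "car" => ["auto"])
# => fast/1, car/1, auto/0 -- "auto" shares "car"'s position, so a
#    phrase query for "fast auto" would match this document too.
```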

Marvin H.
Rectangular Research
http://www.rectangular.com/

On Aug 28, 2008, at 1:02 PM, Jens K. wrote:

What advantage does Ferret have in terms of ActiveRecord
integration that Solr wouldn’t have?

If you’re talking about custom analyzers being in Ruby, more on
that below.

It’s not only custom analyzers; there’s also the fact that
acts_as_ferret’s DRb server runs with the full Rails application
loaded, so, for example, to bulk-index a number of records aaf just
hands the server the ids and class name of the records, and the
server does the rest.

Gotcha. Meaning the search server is pulling from the DB directly.
That’s what the DataImportHandler in Solr does as well. It’d be a
simple single HTTP request to Solr (once the DB stuff is configured,
of course) to have it do full or incremental DB indexing.

Glad you mentioned this ;) I don’t want to configure an analyzer via
XML when I can throw my own together with 4 or 5 lines of
easy-to-read Ruby code. Same for index structure. A philosophical
mismatch between the Java and Ruby worlds, I think :)

Don’t get me wrong… I’m a Ruby fanatic myself! XML makes me ill,
generally speaking (it has its uses, but for configuration it is just
plain wrong).

For using the built-in tokenizer/filters, a smarter acts_as_solr could
generate the right config based on a model specifying parameters for
analysis.
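A sketch of what such generation might look like - FIELDS and
schema_fields are assumptions for illustration, not acts_as_solr API:

```ruby
# Field spec a smarter acts_as_solr might derive from an AR model:
# attribute name => Solr field type.
FIELDS = { title: :text, published_on: :date, views: :sint }

# Emit the corresponding <field/> declarations for schema.xml.
def schema_fields(fields)
  fields.map do |name, type|
    %(<field name="#{name}" type="#{type}" indexed="true" stored="true"/>)
  end.join("\n")
end

schema = schema_fields(FIELDS)
# schema now holds three <field/> lines ready to paste into schema.xml
```

The point being that the XML stays an implementation detail the Rails
developer never has to touch by hand.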

But even if you do that then you have
a) half a java project (I don’t want that)

That’s totally fair, and really the primary compelling reason for a
Ferret over Solr for pure Ruby/Rails projects. I dig that.

But isn’t Ferret like 60k lines of C code too?!

True, but I don’t have to compile that every time I deploy my app…

My point was that Ferret isn’t just Ruby, just a counterpoint to your
“half a java project”. No one has to recompile Solr either.

encapsulate this logic and make it pluggable into the search engine
library. So less style points for this solution…

I was just saying :) It’s debatable exactly where in the
client-server spectrum synonym expansion belongs… and it really
depends on the needs of the project. Nothing wrong with a client
doing some user-input massaging before a query hits the search
server.

However, there is also no reason (and I have this on my
copious-free-time TODO list) that JRuby couldn’t be used behind the
scenes of a Solr analyzer/tokenizer/filter or even request handler…
and do all the cool Ruby stuff you like right there. Heck, you could
even send the Ruby code over to Solr to execute there if you like ;)

that sounds sexy ;)

Should be fairly trivial to wire JRuby in. The DataImportHandler
already has scripting language support for data transformation (see
the DataImportHandler page on the Solr wiki - shield your eyes from
the XML wrapping it!), so I believe JRuby should already work in that
context. This is sort of like the Mapper stuff I built into
solr-ruby, transforming data from domain objects to search engine
“documents”.
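In the spirit of that Mapper idea, a domain-to-document mapping might
be sketched like this (the record fields and helper are made up for
illustration, not solr-ruby’s actual Mapper API):

```ruby
# Declare how a domain record maps onto a search-engine "document":
# each document field is an extractor lambda over the record.
mapping = {
  id:    ->(rec) { rec[:isbn] },
  title: ->(rec) { rec[:name].strip },
}

# Apply every extractor to one record, yielding a flat document hash.
def map_doc(mapping, record)
  mapping.transform_values { |extractor| extractor.call(record) }
end

doc = map_doc(mapping, isbn: "978-3-16", name: "  Ferret in Action ")
# doc[:id] is the ISBN, doc[:title] the whitespace-stripped name
```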

Solr looks like an additional layer of abstraction and complexity
which I only want when it really gives me a feature I need. Plus, the
last time I checked, Lucene didn’t need XML configuration files ;)

I hear ya about the XML config files. And, to be fair to Solr here,
you really only need to set things up from a basic example
configuration that covers most scenarios already - so it really isn’t
necessary to even touch the XML config except for tweaking little
things.

But Solr’s advantages over plain Lucene are built out of experience
that most Lucene projects eventually accumulate anyway. Caching -
really important for faceting, which every project I touch these days
needs. Replication - really, really important for scaling under
massive query load. It’s really not such a big chunk over Lucene to
bite off… and in almost all respects it is even simpler to use Solr
than Lucene anyway.

In development environments, and especially when it comes to
automated tests / CI, it’s also quite convenient not to have to run a
separate server but to take the shortcut directly to the index, which
isn’t possible with Solr.

Not true. Solr can work embedded. There is a base SolrServer
abstraction, with an implementation that runs embedded (inside the
same JVM) versus over HTTP. Exactly the same interface for both
operations, using a very simple API (SolrJ, much like Lucene’s basic
API actually).

I’d be curious to see scalability comparisons between Ferret and
Solr - or perhaps more properly between Stellr and Solr - as it
boils down to number of documents, queries per second, and faceting
and highlighting speed. I’m betting on Solr myself (by being so
into it and basing my professional life on it).

This would be interesting, but I wouldn’t be that disappointed if
Stellr ended up second, given the little time I’ve spent building it
so far. Just out of curiosity, do you have some kind of performance
testing suite for Solr which I could throw at Stellr?

No, I don’t have those kinds of tests myself. While I can speak to
Solr’s performance based on what I hear from our clients and the
reports on the mailing lists, I don’t consider myself a
performance-savvy person.

I’m curious - what numbers of documents are being put into Ferret
indexes out there? Millions? Hundreds of millions? Billions? And
are folks doing faceting? Does Ferret have faceting support?

Erik

On 28.08.2008, at 20:03, Erik H. wrote:

index a number of records aaf just hands the server the ids and
class name of the records to index, and the server does the rest.

Gotcha. Meaning the search server is pulling from the DB directly.
That’s what the DataImportHandler in Solr does as well. It’d be a
simple single HTTP request to Solr (once the DB stuff is configured,
of course) to have it do full or incremental DB indexing.

With the slight difference that custom model logic defined in the
Rails model class is still involved, to preprocess data, calculate
index values at indexing time, or even have certain records refuse to
be indexed based on their current state. Having per-document boosts
depending on some value from the database (e.g. record popularity) is
also a classic… Aaf never just pulls data from the db; it always
uses Rails model objects. Doesn’t make indexing faster, of course…
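A sketch of this model-driven indexing - Article and to_doc are
hypothetical names standing in for aaf’s model hooks, not its actual
API:

```ruby
class Article
  attr_reader :title, :popularity, :draft

  def initialize(title, popularity, draft: false)
    @title, @popularity, @draft = title, popularity, draft
  end

  # Build the document the index will see: the model preprocesses its
  # data, computes a popularity-based per-document boost, and can
  # refuse indexing outright based on its current state.
  def to_doc
    return nil if draft  # record refuses to be indexed in this state
    {
      title: title.strip,                     # preprocessing at index time
      boost: 1.0 + Math.log(1 + popularity),  # classic popularity boost
    }
  end
end

Article.new("Ferret vs Solr", 0).to_doc          # => { title: ..., boost: 1.0 }
Article.new("WIP", 99, draft: true).to_doc       # => nil (skipped)
```

None of this is visible to a search server that only pulls raw rows
from the database.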

[…]

XML makes me ill, generally speaking (it has its uses, but for
configuration it is just plain wrong).

FULL ACK :)

But isn’t Ferret like 60k lines of C code too?!

True, but I don’t have to compile that every time I deploy my app…

My point was that Ferret isn’t just Ruby, just a counterpoint to
your “half a java project”. No one has to recompile Solr either.

but the custom analyzer implemented in Java… By saying ‘half a java
project’ I didn’t mean Solr, but the parts of my application logic
that would have to be implemented in Java in order to be plugged into
Solr. But the JRuby route looks promising here, of course.

encapsulate this logic and make it pluggable into the search engine
library. So less style points for this solution…

I was just saying :) It’s debatable exactly where in the
client-server spectrum synonym expansion belongs… and it really
depends on the needs of the project. Nothing wrong with a client
doing some user-input massaging before a query hits the search
server.

[…]

API, but to me Solr looks like an additional layer of abstraction
and complexity which I only want when it really gives me a feature I
need. Plus, the last time I checked, Lucene didn’t need XML
configuration files ;)

I hear ya about the XML config files. And, to be fair to Solr here,
you really only need to set things up from a basic example
configuration that covers most scenarios already - so it really
isn’t necessary to even touch the XML config except for tweaking
little things.

But I still have to read it in order to see if it fits my needs.
Okay, I’ll stop whining about the XML now ;)

[…]

In development environments, and especially when it comes to
automated tests / CI, it’s also quite convenient not to have to run
a separate server but to take the shortcut directly to the index,
which isn’t possible with Solr.

Not true. Solr can work embedded. There is a base SolrServer
abstraction, with an implementation that runs embedded (inside the
same JVM) versus over HTTP. Exactly the same interface for both
operations, using a very simple API (SolrJ, much like Lucene’s basic
API actually).

cool, but that won’t work for Rails projects running on MRI and
accessing Solr via solr-ruby.

No, I don’t have those kinds of tests myself. While I can speak to
Solr’s performance based on what I hear from our clients and the
reports on the mailing lists, I don’t consider myself a
performance-savvy person.

I’m curious - what numbers of documents are being put into Ferret
indexes out there? Millions? Hundreds of millions? Billions? And
are folks doing faceting? Does Ferret have faceting support?

Not sure about the billions, but AFAIR an earlier message in this
thread stated an index size of 90 million documents with aaf.
Altlaw.org reported an index size of > 4GB with around 700k documents
last fall. The selfhtml.org index has approximately 1 million forum
entries indexed, with an index size of around 2GB. Stellr never uses
more than around 50MB of RAM while indexing and searching this index.
I know RAM is cheap and all, but RAM size still has quite a large
influence on the price of the server you rent for your app, at least
here in Germany.

Without doubt Solr has many more references in the area of such large
installations than ferret/aaf. I myself never saw aaf as a drop-in
solution for indexes of this size, but more as an easy-to-use,
out-of-the-box solution for the average Rails app with maybe several
thousand or tens of thousands of records, but I’m happy to see it
still works in larger-scale setups.

Heck, it all began with a simple full-text search for my blog ;)

Regarding the faceting - it’s not built into Ferret, and aaf doesn’t
support it either, since I haven’t needed it yet and nobody else has
requested the feature so far. All in all I think the average usage
scenarios of Solr and aaf are quite different atm…

I’ll try to find the time to benchmark the selfhtml.org data set with
solr and stellr. I’ll report my findings here.

Cheers,
Jens


Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

On Aug 28, 2008, at 3:02 PM, Jens K. wrote:

popularity) is also a classic… Aaf never just pulls data from the
db; it always uses Rails model objects. Doesn’t make indexing faster,
of course…

All great points. ActiveRecord is much more pleasant than any other
database access layer I’ve ever worked with. I don’t generally work
with databases personally, though; the bulk of my full-text searching
experience doesn’t involve databases at all.

I suppose the Java counterpart would be Hibernate Search - surely
involving a lot more hideous XML and @annotations - ewww.

basic API actually).

cool, but that won’t work for Rails projects running on MRI and
accessing solr via solr-ruby.

Fair point.

Again, the answer comes back to JRuby ;) Forget MRI. Good point
about solr-ruby - it is specifically designed for Solr over HTTP. It
wouldn’t take much to refactor it to work with embedded Solr via
JRuby, though. But if JRuby is a given, it’d be just as easy to work
with SolrJ’s API directly.

Though for testing purposes, solr-ruby is easily mocked. solr-ruby
touts great (98% or something like that) code coverage with unit
tests; many of those tests run against solr-ruby’s API with Solr
itself mocked, and there are tests that fire up Solr in the
background for full functional tests. So for unit testing purposes,
having Solr running isn’t needed, but it launches plenty fast enough
for testing end-to-end if desired.
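A minimal sketch of that mocking approach - StubSolr is a made-up
stand-in, not a class from solr-ruby:

```ruby
# A stub standing in for a solr-ruby-style connection: it records
# added documents and answers queries without a running Solr server,
# which is all most unit tests need.
class StubSolr
  attr_reader :added

  def initialize
    @added = []
  end

  def add(doc)
    @added << doc
  end

  def query(_q)
    { "response" => { "numFound" => @added.size, "docs" => @added } }
  end
end

solr = StubSolr.new
solr.add("id" => 1, "title" => "hello")
solr.query("title:hello")["response"]["numFound"]  # => 1
```

Application code that only depends on add/query can be exercised
against the stub and against a live server interchangeably.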

ever use more than around 50MB of RAM during indexing and searching
this index. I know RAM is cheap and all, but RAM size still has quite
a large influence on the price of the server you rent for your app,
at least here in Germany.

90 million is impressive for sure.

RAM - well, when Ferret/Stellr does faceting we’ll revisit that
discussion :) Solr loves RAM! It can still run in modest
environments, but the more RAM you can give it for caches (depending
on your needs), the better.

Without doubt Solr has many more references in the area of such
large installations than ferret/aaf. I myself never saw aaf as a
drop-in solution for indexes of this size, but more as an
easy-to-use, out-of-the-box solution for the average Rails app with
maybe several thousand or tens of thousands of records, but I’m
happy to see it still works in larger-scale setups.

Indeed! ferret: +1 - no question!

Heck, it all began with a simple full-text search for my blog ;)

Same for me (though I abandoned it when I realized that regular
blogging and server maintenance weren’t for me).

Regarding the faceting - it’s not built into Ferret, and aaf doesn’t
support it either, since I haven’t needed it yet and nobody else has
requested the feature so far. All in all I think the average usage
scenarios of Solr and aaf are quite different atm…

I’m really surprised by that. Faceting is the major feature that
attracts folks to Solr. It’s critical for all of our customers.

But yeah, no question that Lucene/Solr and Ferret/Stellr can happily
coexist and aren’t necessarily competition for every project. But
there definitely are those areas of overlap where a project could go
with either solution. And I would definitely not try to shoehorn Solr
into a project where it didn’t fit and Ferret worked fine. I’m
pragmatic like that.

I’ll try to find the time to benchmark the selfhtml.org data set
with solr and stellr. I’ll report my findings here.

Awesome. If you have the data in some easily digestible format, I’d
be happy to toss it into Solr and report back numbers from my
development machine. Drop me a line offline if you’d like.

Erik