In our company, we want to use Ferret as the main index/search engine of our applications, and we are looking for testimonials about how well Ferret performs when deployed in production.
Has Ferret already been deployed in production at some companies? Are there any testimonials about that?
What is the maximum number of documents we can index with Ferret? Does anyone have information about that?
What is the best way to access a very large Ferret index? Can we distribute it across several machines or not?
By the way, can Ferret read Solr indexes, as they are both clones of Lucene?
We found that search and indexing with Solr were too slow, and we decided to look for an alternative. Ferret seems to be a good choice. We tried Ferret on some examples and found that it performed better.
Thank you for the information. But is there a way to recover the contents of an existing Solr index and reindex it with Ferret?
It’ll probably be easier and faster to reindex your original content, which presumably you still have handy. But… you’d have to have your fields “stored” in Solr for them to be recoverable. Using solr-ruby’s Solr::Importer::SolrSource would make it easy to iterate over all documents in Solr (using a query of *:*).
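A rough sketch of the loop Erik describes might look like the following. The source and destination here are duck-typed stand-ins, not the real APIs: in practice the source would be a solr-ruby Solr::Importer::SolrSource built with a match-all query, and the index a Ferret::Index::Index, whose << method accepts a hash of fields.

```ruby
# Hypothetical sketch of the Solr-to-Ferret reindexing loop.
# `source` is anything enumerable that yields hashes of stored fields
# (e.g. a Solr::Importer::SolrSource); `index` is anything with a <<
# method (e.g. Ferret::Index::Index). Returns the number of documents
# copied over.
def reindex(source, index)
  count = 0
  source.each do |doc|
    index << doc   # Ferret's Index#<< takes a field => value hash
    count += 1
  end
  count
end
```

Only fields marked as stored in your Solr schema will come back from the source; anything indexed but not stored is unrecoverable this way, which is why reindexing the original content is usually the better route.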
We found that search and indexing with Solr were too slow, and we decided to look for an alternative. Ferret seems to be a good choice. We tried Ferret on some examples and found that it performed better.
Thanks for the feedback. If you don’t mind elaborating further, what kind of documents are you indexing (database rows? file system files? other?), how many documents do you have, and how are you indexing them?
Thanks,
Erik
Right now we are indexing file system files, varying from HTML pages (85%) to images (10%, where we index the meta information), PDF (2%), Word (2%) and plain text (1%). We have 100,000,000 documents to index; 10% is already done. As for the last question, I didn’t exactly understand what you mean by “how we are indexing”. What I can say is that before we index non-full-text documents (like PDF, Word and HTML), we perform content extraction (using pdftotext, antiword and the ‘hpricot’ Ruby library). We also extract the metadata related to each document we index.
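The extraction step described here amounts to a dispatch on file type. A minimal sketch, assuming the tools named above; the exact command-line flags are assumptions and should be checked against your installed versions:

```ruby
# Sketch of per-type content extraction as described above. Returns
# the shell command that would extract plain text from the file, or a
# symbol for formats handled differently. Flags are assumptions; check
# the pdftotext/antiword man pages for your versions.
def extractor_for(path)
  case File.extname(path).downcase
  when ".pdf"          then ["pdftotext", path, "-"]  # text to stdout
  when ".doc"          then ["antiword", path]
  when ".html", ".htm" then :hpricot  # parse with the hpricot library
  else :native                        # plain text, image metadata, ...
  end
end
```

Keeping the dispatch in one place makes it easy to add formats later (e.g. a catdoc fallback) without touching the indexing loop itself.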
Yes, I use Ferret whenever I need some kind of search for a site or application I’m working on. Usually these are full text searches for product catalogs and/or HTML content - not really large scale, at most around 10,000 documents. The most recent example is www.fahrrad-xxl.de.
Is 10,000 your maximum number of documents? We have more than 100,000,000 documents to index. 2,800,000 are already done, but the indexing machine is starting to strain. Do you think Ferret will be able to index all of this?
What is the best way to access a very large Ferret index? Can we distribute it across several machines or not?
AFAIR there’s no way to distribute an index across multiple machines built into Ferret. You could do the distribution yourself, of course, by clustering your data and distributing it across several independent Ferret indexes. The downside is that search result scores from different indexes aren’t directly comparable.
Yes, that is a good idea. But how will we merge the results when we get them back after a request?
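One common workaround for the score-comparability problem Jens mentions is to normalize each index’s result set by its own top score before interleaving. A minimal sketch; Hit is a hypothetical stand-in for whatever result struct your search layer returns, and normalization by top score is only a heuristic, not something Ferret provides:

```ruby
# Hypothetical sketch: merging hits from several independent Ferret
# indexes. Raw scores from different indexes aren't directly
# comparable, so each result set is scaled by its own top score
# (a heuristic) before the sets are interleaved and truncated.
Hit = Struct.new(:doc_id, :score)

def merge_results(result_sets, limit = 10)
  normalized = result_sets.flat_map do |hits|
    next [] if hits.empty?
    top = hits.map(&:score).max.to_f
    hits.map { |h| Hit.new(h.doc_id, h.score / top) }
  end
  normalized.sort_by { |h| -h.score }.first(limit)
end
```

This makes every index’s best hit score 1.0, which is crude but keeps one shard with systematically larger raw scores from drowning out the others.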
In our company, we want to use Ferret as the main index/search engine of our applications, and we are looking for testimonials about how well Ferret performs when deployed in production. Has Ferret already been deployed in production at some companies? Are there any testimonials about that?
Yes, I use Ferret whenever I need some kind of search for a site or application I’m working on. Usually these are full text searches for product catalogs and/or HTML content - not really large scale, at most around 10,000 documents. The most recent example is www.fahrrad-xxl.de.
We also use Ferret + aaf in a knowledge management system I’m working
on for xscio AG (xscio.de).
What is the maximum number of documents we can index with Ferret? Does anyone have information about that?
I have no idea whether there is an upper limit for the number of documents, other than the maximum value a Ruby Fixnum instance can have…
What is the best way to access a very large Ferret index? Can we distribute it across several machines or not?
AFAIR there’s no way to distribute an index across multiple machines built into Ferret. You could do the distribution yourself, of course, by clustering your data and distributing it across several independent Ferret indexes. The downside is that search result scores from different indexes aren’t directly comparable.
By the way, can Ferret read Solr indexes, as they are both clones of Lucene?
Ferret isn’t really index-compatible with Lucene anymore; it uses a slightly different index format, mostly due to differences in the representation of UTF-8 values, but I think there were other changes, too.
Oh, and Solr also isn’t a clone of Lucene; it’s a search server that internally uses the Lucene library.
Cheers,
Jens
–
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49351467660 | Telefax +493514676666 | [email protected] | www.webit.de
We found that search and indexing with Solr were too slow, and we decided to look for an alternative. Ferret seems to be a good choice. We tried Ferret on some examples and found that it performed better.
We recently switched the other way around, because Ferret, running with a DRb server, was the sole cause of a lot of scalability issues with our application. We didn’t really have a lot of documents, altogether maybe 800,000, give or take. We did have several tens of thousands of index updates per day running asynchronously (as in: from a job queue; having them run directly from every application server would’ve led to certain death). We deliver a lot of our content directly from the index, and at peak times that would end in dozens of Mongrel timeouts. I wasn’t sure if Ferret really was the cause of all this, but when we switched, I was. Even if we run into scalability issues with Solr, it still has a lot more nuts and bolts to tune than Ferret.
While Ferret itself is fast, it pretty much boils down to how you’re using it. If you only use one server with acts_as_ferret, and if you have a lot of index updates, you might just run into the same problem. If you have a central place where you create your index and then distribute it to the separate server(s), you might be fine.
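The job-queue pattern described above (funneling all index updates through one asynchronous worker instead of writing from every app server) can be sketched with Ruby’s built-in Queue. The index object here is a duck-typed stand-in for e.g. a Ferret::Index::Index; real deployments would use a persistent queue rather than an in-process one:

```ruby
# Minimal sketch of serializing index updates through a single worker,
# as described above: producers (app servers) enqueue documents, and
# one thread applies them to the index, so the write-locked index only
# ever sees one writer. `index` is anything with a << method.
def start_index_worker(index)
  queue = Queue.new
  worker = Thread.new do
    while (doc = queue.pop)   # a pushed nil acts as the shutdown signal
      index << doc
    end
  end
  [queue, worker]
end
```

Usage: `queue, worker = start_index_worker(index)`, then `queue << {:id => 1, :title => "…"}` from any thread, and `queue << nil; worker.join` to drain and stop. The point is the topology, not the mechanism: one writer, many producers.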