Sphinx vs ferret

On 19/01/2008, at 10:17 AM, Jeff wrote:

How difficult would it be to change over to Sphinx?

The overall process? Not hard, with the caveat Adrian mentioned (ie:
advanced Ferret features).

But keep in mind Sphinx does not allow updating fields of index
records (Ferret does) - you have to re-index to get the latest changes
into Sphinx. There are ways around this, to some extent - delta
indexes, containing just the recent changes - but it doesn’t seem to
be critical to everyone.

Essentially, though:

  • Choose a sphinx plugin, and install it.
  • Set up the configuration and indexes, either manually, or within
    your models (depending on the plugin)
  • Install sphinx
  • Index your data
  • Switch your ferret-specific search calls to use the sphinx plugin’s
    search calls.
  • Start the sphinx daemon (searchd)
  • Confirm everything works

Or something along those lines. I’m sure the EngineYard crew have a
better idea though.


Pat
e: [email protected] || m: 0413 273 337
w: http://freelancing-gods.com || p: 03 9386 0928
discworld: http://ausdwcon.org || skype: patallan

No advanced features. Did everything through acts_as_ferret. Other
than that, added one method for pagination from ferret searches.

But Sphinx is good to go in production? What about UTF-8? Ferret
seemed to do that out of the box. I just tested some non-Latin
characters and was surprised to find Ferret indexed them all properly.

Ericson S. wrote:

If you consider using PostgreSQL, then tsearch2 is awesome. It’s built
into the latest version of PostgreSQL.

How would you do the integration into Rails 2 ?

I tried the acts_as_tsearch plugin

and the first line of the example works, but it really
does not seem to be ready for prime time to me and at
this moment …

Thanks for any insights,

Peter V.
(new to rails)

On 22.1.2008, at 23.46, Peter V. wrote:

and the first line of the example works, but it really
does not seem to be ready for prime time to me and at
this moment …

I haven’t used the plugin, but interfacing with tsearch2 is easy
enough so you can write your own in a day:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

We did that back in early '06 and since talking with tsearch2 is
basically normal SQL, all you have to do is to write a custom finder
method.
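
As a rough illustration of the "custom finder" idea, here is a minimal sketch of a helper that builds the conditions for such a query. The helper name and the `idx_fti` column name are assumptions made for this example (they do not come from the thread); `@@` and `to_tsquery` are the standard tsearch2 operator and function.

```ruby
# Hypothetical helper: builds an ActiveRecord-style conditions array for a
# tsearch2 lookup. `idx_fti` is an assumed name for the tsvector column.
def tsearch_conditions(user_input, column = 'idx_fti')
  # Keep only word characters, then AND the terms together with '&'.
  terms = user_input.to_s.scan(/\w+/)
  return nil if terms.empty?
  ["#{column} @@ to_tsquery(?)", terms.join(' & ')]
end

conditions = tsearch_conditions('rails  search!')
# Inside a model this could be passed to a finder (Rails 2 style, not run here):
#   Article.find(:all, :conditions => conditions)
```

The user input only ever travels as a bind value; the column name is interpolated into the SQL, so it should never come from user input.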

I have no idea how the performance compares to other engines but I
find it pretty cool that everything happens transparently inside the
database so you have one less process to monitor and keep fresh. So if
you’re using PostgreSQL, it should definitely be worth a shot. It’s
been around forever, so it should be void of most pediatric diseases.

Cheers,
//jarkko


Jarkko L.

http://www.railsecommerce.com
http://odesign.fi

Hi!

Ferret is unstable in production. Segfaults, corrupted indexes
galore. We’ve switched around 40 clients from Ferret to Sphinx and
solved their problems this way. I will never use Ferret again after
all the problems I have seen it cause in people’s production apps.

I’d really like anybody experiencing problems like this to contact me
or, even better, the ferret-talk mailing list. I have several sites
using Ferret with the DRb server, and it runs rock solid there. I must
admit that they’re relatively low traffic, but high load is nothing
that will make Ferret crash or corrupt indexes if you use it in the
right way (say, one process accessing the index). Without doubt there
are cases when Ferret will segfault, e.g. because of platform-specific
problems, poor argument checking and error handling in the C code and
so on, but they can be circumvented most of the time. Not nice, but
acts_as_ferret already does most of this for you.
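
To make the "one process accessing the index" idea concrete, here is a minimal sketch using Ruby's standard DRb library. `TinyIndex` is a toy stand-in invented for this example; it is not Ferret's or acts_as_ferret's API, and in a real setup the client part would run in separate processes (e.g. each Rails instance).

```ruby
require 'drb/drb'

# Toy stand-in for a search index; NOT Ferret's API.
class TinyIndex
  def initialize
    @docs = {}
    @lock = Mutex.new
  end

  # All writes go through the single owning process.
  def add(id, text)
    @lock.synchronize { @docs[id] = text.downcase }
    true
  end

  # Returns the ids of documents containing the term.
  def search(term)
    @lock.synchronize do
      @docs.select { |_id, text| text.include?(term.downcase) }.keys.sort
    end
  end
end

# Server side: one process owns the index and exposes it over DRb.
# Port 0 asks the OS for any free port; DRb.uri then reports the real one.
DRb.start_service('druby://localhost:0', TinyIndex.new)
uri = DRb.uri

# Client side: callers only hold a remote reference, never the index itself.
index = DRbObject.new_with_uri(uri)
index.add(1, 'Sphinx vs Ferret')
index.add(2, 'tsearch2 on PostgreSQL')
hits = index.search('ferret')
```

Because every caller goes through the one server object, index access is serialized in a single place, which is the property described above.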

I also did some load tests with acts_as_ferret’s DRb server a while
ago, where it handled 30 mixed indexing and search requests per second
from multiple client processes for hours, and no crash or index
corruption happened (index size was 7GB at the end of the run).

So to summarize: it’s definitely possible to have a stable Ferret
setup. Before you take on the work of switching to something else, why
not drop me a line? I’ll be happy to have a look at your problem.

However, from what I’ve read here I’ll be sure to check out Sphinx soon
so I know what you’re talking about :wink:

Cheers,
Jens

I’d really like anybody experiencing problems like this to contact me

I had trouble with it using version 0.11.6. I was having intermittent
problems every time I tagged a store. When I removed my rescue I
found it was ferret (don’t have the exact error on me… sorry).
Stepping back to 0.11.3 seems to have resolved this (this is the last
version I can remember that worked for me somewhat reliably). With
0.11.6, removing my index solves it temporarily (6 or 7 tag actions)
but then it comes back.

Feel free to move this to the ferret talk list, I’ll go check on it
there.

-Vince

support independent business – http://www.buyindie.net/

I took a good look at Sphinx and Ultrasphinx, even tried
implementation in my app. Unfortunately these were show stoppers for
me:

  • No real integration with activerecord (plugin just generates sql
    statements outside of the context of AR. Therefore you can’t really
    use your own custom model methods as fields… as far as I could tell)
  • No wildcards at all (Sphinx doesn’t support them)
  • No automatic updates - must rebuild entire index using cron jobs.
    Again using straight SQL, not the current state of your models

On the contrary, I could see Sphinx being very appropriate for certain
types of apps… but these were important features for my particular
use (especially wildcards)

On 1/29/2008, “Peter V.” [email protected]
wrote:

on r1065 (still testing the newest r1112) and that seems
to work OK for me. Set “enable_star” to 1 and set a
min_prefix_len or a min_infix_len.

Yup, works for me too - it’s just not turned on by default. The
enable_star feature has been around in at least the last 4 releases of
0.9.8. Fairly certain it’s not in 0.9.7 though (the last ‘production’
release).

  • No automatic updates - must rebuild entire index using cron jobs.

Indeed. But automatic rotation of indexes seems to work OK.
Indexing on my dataset takes 15 seconds (37000 records,
28 MByte) on a desktop PC.

Thinking Sphinx has delta indexes, which keep track of changes between
explicit indexes. I know Evan’s working on adding something like this
to UltraSphinx as well.

Because the delta indexes are so small, they get indexed really
quickly, straight after a model is updated.
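
The main-plus-delta idea can be sketched in plain Ruby. This is only a toy model of the concept (class and method names are invented for the example), not Sphinx or Thinking Sphinx code.

```ruby
# Toy model of the main + delta idea; not Sphinx or Thinking Sphinx code.
class MainPlusDelta
  def initialize(records)
    @main  = records.dup   # snapshot built by the last full index run
    @delta = {}            # only the records changed since that run
  end

  # Called right after a model is saved: cheap, touches only the delta.
  def record_changed(id, text)
    @delta[id] = text
  end

  # Searches consult both indexes; a delta entry overrides the main one.
  def search(term)
    @main.merge(@delta).select { |_id, text| text.include?(term) }.keys.sort
  end

  # Full re-index: fold the delta into main and start a fresh delta.
  def full_reindex
    @main = @main.merge(@delta)
    @delta = {}
  end

  def delta_size
    @delta.size
  end
end

idx = MainPlusDelta.new({ 1 => 'ruby job', 2 => 'perl job' })
idx.record_changed(2, 'ruby on rails job')   # visible without a full rebuild
found = idx.search('ruby')
```

The full rebuild can then run on a slower schedule (e.g. from cron), since recent changes are already searchable through the delta.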

Again using straight SQL, not the current state of your models

You’re correct that model methods (as opposed to standard attributes)
are not accessible for index generation in UltraSphinx - that’s the
same with Thinking Sphinx and perhaps all of the other plugins as well.

Because you’re dealing with MySQL directly when the data is indexed,
there’s no instantiation of models (and no Ruby at all), so it’s not
really an option. If the data you want isn’t available somewhere in the
database, you’re out of luck.

Ferret uses model methods, I believe, if that’s an option available to
you.

Use the :include key.
So in my Model for jobs, that is as simple as e.g.:

Thinking Sphinx equivalent (just to provide a comparison):

class Job < ActiveRecord::Base

  define_index do |index|
    index.includes.title
    index.includes.employer.name
  end

end

Cheers


Pat

Jeff Cc wrote:

  • No wildcards at all (Sphinx doesn’t support them)

Do you mean the “*” feature (prefix and infix)? Where
the search term “program*” matches the database text
“program”, “programmer”, “programs” …

Those work for me in version sphinx-0.9.8-svn-r1065 and
sphinx-0.9.8-svn-r1112 … I have done quite some testing
on r1065 (still testing the newest r1112) and that seems
to work OK for me. Set “enable_star” to 1 and set a
min_prefix_len or a min_infix_len.

  • No automatic updates - must rebuild entire index using cron jobs.

Indeed. But automatic rotation of indexes seems to work OK.
Indexing on my dataset takes 15 seconds (37000 records,
28 MByte) on a desktop PC.

Again using straight SQL, not the current state of your models

At least in one (limited) test, I have just used the :include
feature of ultrasphinx and that automatically created the SQL
for the sphinx configuration file. So, if I understand well,
that did use the AR model ?

From:
http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/classes/ActiveRecord/Base.html

  • Including a field from an association

Use the :include key.

Accepts an array of hashes.

:include => [{:association_name => 'category', :field => 'name', :as => 'category_name'}]

Each should contain an :association_name key (the association name for
the included model), a :field key (the name of the field to include),
and an optional :as key (what to name the field in the parent).

So in my Model for jobs, that is as simple as e.g.:

class Job < ActiveRecord::Base

  is_indexed :fields => ['title'],
             :include => [
               {:association_name => 'employer', :field => 'name'}]

  belongs_to :employer

end

The Sphinx config file that Ultrasphinx generates then automatically
contains:


sql_query = SELECT (jobs.id * 1 + 0) AS id, 'Job' AS class, 0 AS class_id, jobs.title AS title, employer.name AS name, …

index complete
{
  source = jobs
  charset_type = utf-8
  charset_table = 0..9, A..Z->a..z, -, _, &, a..z, U+410..U+42F->U+430..U+44F, … and a lot more …
  min_word_len = 2
  min_infix_len = 4
  stopwords =
  enable_star = 1
  path = /var/sphinx//sphinx_index_complete
  docinfo = extern
  morphology = none
  min_prefix_len = 4
}

All of this seems to work for me (no production experience yet …).
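
To show why an option like :include can be turned into SQL without instantiating any models, here is a hypothetical sketch of the mapping. It is not Ultrasphinx's real implementation; the function name is invented, and it only mirrors the `employer.name AS name` fragment seen in the generated sql_query above.

```ruby
# Hypothetical illustration, not Ultrasphinx's real implementation: turn
# :include hashes into the column fragments of a SELECT list.
def include_columns(includes)
  includes.map do |inc|
    table = inc[:association_name]
    field = inc[:field]
    name  = inc[:as] || field          # :as is optional
    "#{table}.#{field} AS #{name}"
  end
end

columns = include_columns([
  { :association_name => 'employer', :field => 'name' },
  { :association_name => 'category', :field => 'name', :as => 'category_name' }
])
```

Everything needed is plain metadata (association, field, alias), which is why the indexer can work from SQL alone and why custom model methods are out of reach.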

Hi Jens,

It’s been a long time :wink: Hope you’re doing well.
I have something to say about all that: even if you find a way
of making Ferret quite “stable”, development has been stopped for
more than a year now, leaving a LOT of bugs unsolved.
Ferret has no future for the moment, and building a website
on top of it is like doing extreme sports on a just-recovered broken
leg…

Cheers,
Jérémie

Do you mean the “*” feature (prefix and infix)? Where
the search term “program*” matches the database text
“program”, “programmer”, “programs” …

Those work for me in version sphinx-0.9.8-svn-r1065 and
sphinx-0.9.8-svn-r1112 … I have done quite some testing
on r1065 (still testing the newest r1112) and that seems
to work OK for me. Set “enable_star” to 1 and set a
min_prefix_len or a min_infix_len.

No that’s the stemming feature I believe… it just changes prefixes
and suffixes on words and is language dependent. Awesome feature
however. Not sure how (or if) Ferret implements it.

What I meant was just straight wildcards as in a MySQL LIKE clause,
example: “*@gmail.com” to find all emails @gmail.com

Are you sure Ferret development has stopped? According to the Ferret
trac, last change to the trunk was only a few weeks ago, and last tag
(0.11.6) was dated Nov 28 2007, only two months ago. I also see the
developer replying to tickets just this month. Am I missing something
here?

Hi Jeff,

The 0.11.6 release has only a LITTLE bugfix. Also, closing tickets as
“wont fix” is easy. Have a look at the latest MEANINGFUL changesets:
they were a long time ago :slight_smile:

Cheers,
Jérémie

Jeff Cc wrote:

No that’s the stemming feature I believe… it just changes prefixes
and suffixes on words and is language dependent. Awesome feature
however. Not sure how (or if) Ferret implements it.

What I meant was just straight wildcards as in a MySQL LIKE clause,
example: “*@gmail.com” to find all emails @gmail.com

I have the impression that enable_star is really the feature that
allows a search for “*@gmail.com” to find all emails @gmail.com (if you
add the ‘@’ sign to the charset_table, that is … which is another
problem, since ‘@’ also has a special meaning as a field indicator for
field-specific search).
With enable_star the user must explicitly give a ‘*’. Without a ‘*’
the match is only an exact match. I give an example at the end of
my blog (http://www.vandenabeele.com/Ultrasphinx-performance), where
I tested with and without the enable_star feature and always without
stemming (since I had no stemmer for the Dutch language).

0.001 sec [ext/0/rel 1409 (0,20)] [complete] c
0.001 sec [ext/0/rel 1409 (0,20)] [complete] c*
0.000 sec [ext/0/rel 35 (0,20)] [complete] co
0.000 sec [ext/0/rel 35 (0,20)] [complete] co*
0.000 sec [ext/0/rel 5 (0,20)] [complete] com
0.000 sec [ext/0/rel 5 (0,20)] [complete] com*
0.000 sec [ext/0/rel 10 (0,20)] [complete] comp
0.003 sec [ext/0/rel 5343 (0,20)] [complete] comp*
0.000 sec [ext/0/rel 0 (0,20)] [complete] compl
0.000 sec [ext/0/rel 1473 (0,20)] [complete] compl*
0.000 sec [ext/0/rel 0 (0,20)] [complete] comple
0.000 sec [ext/0/rel 1214 (0,20)] [complete] comple*
0.000 sec [ext/0/rel 0 (0,20)] [complete] complet
0.000 sec [ext/0/rel 793 (0,20)] [complete] complet*
0.000 sec [ext/0/rel 458 (0,20)] [complete] complete
0.000 sec [ext/0/rel 642 (0,20)] [complete] complete*
0.000 sec [ext/0/rel 30 (0,20)] [complete] completed
0.000 sec [ext/0/rel 30 (0,20)] [complete] completed*
0.000 sec [ext/0/rel 0 (0,20)] [complete] completel
0.000 sec [ext/0/rel 130 (0,20)] [complete] completel*
0.000 sec [ext/0/rel 10 (0,20)] [complete] completely.

What happens is that with fewer than 4 characters, the * has no effect,
but from 4 characters on, the * expands to all words that begin with
the search term. And that is an interesting feature the major public
search engines do not offer. At this time, with the relatively small
database I expect initially for our project (< 10 MByte or so), it
should not be a problem to keep indices with star expansion after 4
letters in memory.
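
The behaviour seen in the query log above can be modelled in a few lines of Ruby. This is a toy model of the observed results only, assuming a minimum prefix length of 4 as in the config quoted earlier; it is not how Sphinx matches internally, and the function name is invented.

```ruby
MIN_PREFIX_LEN = 4  # same value as min_prefix_len in the quoted config

# Toy model of the observed behaviour, not Sphinx's matcher.
def star_match?(term, word, min_prefix_len = MIN_PREFIX_LEN)
  if term.end_with?('*')
    prefix = term.chomp('*')
    # Prefixes shorter than min_prefix_len are not expanded: the star is
    # ignored and the term behaves like an exact match.
    return word == prefix if prefix.length < min_prefix_len
    word.start_with?(prefix)
  else
    word == term   # no star: exact match only
  end
end

words = %w[com comp complete completed completely computer]
expanded = words.select { |w| star_match?('comp*', w) }
unexpanded = words.select { |w| star_match?('com*', w) }
```

Here `expanded` contains every word starting with "comp", while `unexpanded` contains only the exact word "com", mirroring the identical counts for "com" and "com*" in the log.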

An issue that I still have is that a final ‘.’ of a sentence is attached
to the index data and so not found without attaching a ‘.’ or ‘*’ to the
search term.

++++

I solved the ‘.’ issue in the meanwhile with a crude solution of
removing the ‘.’ character from the char_table list (which causes other
problems …).

Stemming will, e.g., reduce ‘companies’ and ‘company’ to a stem of
‘compani’ (both in the search term and in the database index), without
the user needing to add a special * to the search, so any combination
of ‘company’ and ‘companies’ will match.
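
A minimal sketch of that idea, using two toy suffix rules. This is far simpler than a real stemmer (such as the Porter algorithm) and the function names are invented; it is just enough to show both forms collapsing to the same stem on the query side and the index side.

```ruby
# Toy suffix rules only; a real stemmer has many more rules and exceptions.
def toy_stem(word)
  w = word.downcase
  return w.sub(/ies\z/, 'i') if w.end_with?('ies')
  return w.sub(/y\z/, 'i')   if w.end_with?('y')
  w
end

# A query word matches an indexed word when both reduce to the same stem.
def stem_match?(query, indexed_word)
  toy_stem(query) == toy_stem(indexed_word)
end

same = stem_match?('company', 'companies')
```

Unlike the star expansion, nothing is required from the user here: the same reduction is applied to the search term and to the indexed text.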

HTH,

Peter

On Mar 17, 12:45 am, lamyseba [email protected] wrote:

Seems like the maintenance is back, just have a look at the trac
timeline, dbalmain seems to be the new maintenance guy

Lol, dbalmain has always been the one and only coder of Ferret. He’s
back, but his current commits only fix a few very small bugs and add
more features such as compression with zlib… Adding features instead
of fixing existing CORE bugs? Have fun guys :slight_smile:

I REALLY hope Ferret becomes stable one day, since it’s the most
flexible and easy-to-use product I’ve seen for this job, but don’t use
it in production now…

Jérémie.


Hi,

That is a very interesting thread. I am currently deploying a Rails app
with aaf. I am having a lot of trouble and basically cannot get it
working. I am surprised because everything was so simple in development.

I must admit that I understand nothing about the DRb server. (I am
learning this new thing.)
My app is on a shared host. I do not even know whether the DRb server
can run on it…

When I run: script/ferret_server -e production start
I get: starting ferret server…
That is all.
But when I stop it (script/ferret_server -e production stop) I get:
ferret_server doesn’t appear to be running
I guess that is not normal (can someone confirm please?)…

Then when I do script/console production
Article.rebuild_index
I get the first time :
DRb::DRbConnError: druby://ferret.myhost.com:9010 - #<SocketError:
getaddrinfo: Name or service not known>
And if I do it a second time :
LoadError: Expected article.rb to define Article

However bad I may be at this, this is just awful behavior for a piece
of software, sorry to say, because I loved aaf in dev.

(I tried a chmod -R 777 index without success)
For info I am deploying with Capistrano, in case it rings a bell to
someone.

My options:

  • more help from my host; I am currently discussing this with them
  • help from you with the aaf configuration; but even if I make it
    work, from what I have read here I should not build my app with it…
  • try another search engine: sphinx; but I read from brfsa “FERRET is
    in my second choice only because shared hosts won’t support sphinx…”
    Can I have more details on that? Or alternatively, for those of you
    on a shared host, how do you manage your search?

Finally, I would welcome your suggestions about:

  • the web host: which ones allow a search engine such as
    ferret/sphinx/other?
  • how to configure aaf?
  • which plugins for the engine? (Do not worry, I will read the whole
    thread again!)

Thx !
H

Don’t even try running ferret on a shared host. I don’t think you really
have any other option but MySQL fulltext indexes in a shared hosting
environment.

AEM
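
For reference, a MySQL fulltext search needs no extra daemon, only a FULLTEXT index and a MATCH … AGAINST condition. Below is a minimal sketch of a helper that builds such a condition; the helper name and the example column names are assumptions for illustration, and the columns must match an existing FULLTEXT index.

```ruby
# Hypothetical helper: builds an ActiveRecord-style conditions array for a
# MySQL fulltext search. The columns must be covered by a FULLTEXT index.
def fulltext_conditions(columns, user_input)
  query = user_input.to_s.strip
  return nil if query.empty?
  ["MATCH(#{columns.join(', ')}) AGAINST (?)", query]
end

conditions = fulltext_conditions(%w[title body], ' ferret ')
# Rails 2 style usage inside an app (not run here):
#   Article.find(:all, :conditions => conditions)
```

The column list is interpolated into the SQL and must therefore be fixed in code; only the search text is passed as a bind value.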


Posted via http://www.ruby-forum.com/.


Adrian Esteban Madrid
Lead Developer, Prefab Markets
http://www.prefabmarkets.com

Heartbreaking !

Adrian M. wrote:

Don’t even try running ferret on a shared host. I don’t think you really
have any other option but MySQL fulltext indexes in a shared hosting
environment.

You might take a look at tsearch2 on PostgreSQL (for a shared host
solution).
IIRC, it only requires special indexes in the database, but no daemon
process (as e.g. Sphinx does). This was mentioned higher up in this
thread too, by Ericson S…

I did some experiments with tsearch2 and it worked OK (but then I
switched to Sphinx, mainly because MySQL is more common as a Rails
back-end and because a clean and full plug-in (Ultrasphinx) was
available). In older versions of PostgreSQL it is a plug-in; since 8.3
it is built in by default.

HTH,

Peter