Similar words

quik77 · June 11, 2008, 12:38pm

Is there a way to get a list of similar words to the ones a user has
searched for?

So if they search for (in my case) ‘transferaze’, which has no matches, I
can get back an array like this: [‘transferase’]?

I know I can just add ~ on the end to make it fuzzy, but what I’d like
is to be able to say “Sorry, no matches for ‘transferaze’. Did you
mean ‘transferase’ (310 matches)?”

Ideally I’d like to get the number of matches for those similar words,
but I know I could just do a search for each of those to get it.

Is this possible?

-Rob

quik77 · June 11, 2008, 1:45pm

> Ideally I’d like to get the number of matches for those similar words, but I
> know I could just do a search for each of those to get it.

And I think it might have to boil down to that, really. If you want to
get the number of results for ‘transferase’ when someone searches
for ‘transferaze’, then you’ll need to hit the index once
more, separately, with ‘transferase’.

quik77 · June 11, 2008, 1:46pm

Is there a way to get a list of results so that you also get, for each
result, the phrase that was successfully matched? If so, you could shove
all the results into a hash, with the keys being the matched phrases (and
the value for each being an array of results), and return the asked-for
key’s array along with the key and size of the largest other array.

Just thinking aloud… I’d like to be able to do this as well. I don’t know
if you can get that info back, though.
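
A minimal sketch of the hash idea above, in plain Ruby. It assumes each hit already carries the term it matched; the `Hit` struct and the `split_by_matched_term` helper are hypothetical names for illustration, and whether the index can report the matched term per hit is exactly the open question.

```ruby
# Hypothetical result record: each hit carries the term that matched it.
# Whether a search library exposes this per-hit information is an assumption.
Hit = Struct.new(:doc_id, :matched_term)

# Group hits by matched term, then return the hits for the term the user
# asked for, plus the most frequent other term and its hit count.
def split_by_matched_term(hits, asked_for)
  groups = hits.group_by(&:matched_term)
  exact = groups.fetch(asked_for, [])
  best_other = groups.reject { |term, _| term == asked_for }
                     .max_by { |_, docs| docs.size }
  suggestion = best_other && { term: best_other[0], count: best_other[1].size }
  { hits: exact, suggestion: suggestion }
end

hits = [
  Hit.new(1, 'transferase'),
  Hit.new(2, 'transferase'),
  Hit.new(3, 'transferases')
]
result = split_by_matched_term(hits, 'transferaze')
puts result[:suggestion].inspect
```

If nothing other than the asked-for term matched, `suggestion` is `nil`, so the caller can skip the “did you mean” line.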

2008/6/11 Julio Cesar O.:

quik77 · June 12, 2008, 1:27am

So the problem you have is where to pull recommendations from. For my
own needs, I use a spell checker to do the “did you mean”, which means
my data source is external, and thus I never hit the index twice.

As you seem to want to correlate the user’s input with existing
entries in your index, I still think you’ll need to hit the index
twice: once using the analyzers you’d normally use, and once with a
fuzzy query.

To help things scale, you could have two indexes. But that’s another
story.
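
A rough sketch of the two-pass idea, in plain Ruby. The `search` argument is a stand-in for the real index lookup (any callable that takes a query string and returns a hit count); it and the `search_with_fallback` name are assumptions for illustration, not part of any library.

```ruby
# Run the query as typed; if nothing matches, retry it as a fuzzy query.
# `search` is any callable that takes a query string and returns a hit count.
# It stands in for a real index lookup and is an assumption of this sketch.
def search_with_fallback(term, search)
  exact = search.call(term)
  return { query: term, count: exact, fuzzy: false } if exact > 0

  fuzzy_query = "#{term}~" # a trailing ~ marks the term as fuzzy
  { query: fuzzy_query, count: search.call(fuzzy_query), fuzzy: true }
end

# Toy stand-in for the index: canned hit counts keyed by query string.
canned_counts = { 'transferaze~' => 310 }
search = ->(query) { canned_counts.fetch(query, 0) }

puts search_with_fallback('transferaze', search).inspect
```

The second lookup only happens when the first one comes back empty, so queries that match as typed cost a single hit on the index.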

quik77 · June 11, 2008, 3:05pm

> And I think it might have to boil down to that, really. If you want to
> get the number of results for ‘transferase’ when someone searches
> for ‘transferaze’, then you’ll need to hit the index once
> more, separately, with ‘transferase’.

I’m happy to do that if it’s the only way, but that’s
really a secondary issue.

Imagine I have in my index the following terms:
abcd
abce
abcf

and I search for abca

I’d get 0 matches.

What I’d like is to be able to present to the user:

No matches found for ‘abca’. Did you mean ‘abcd’, ‘abce’, or ‘abcf’?

So I need Ferret to have a method call that would return [‘abcd’,
‘abce’, ‘abcf’].

Is this possible?

-Rob

quik77 · June 12, 2008, 1:55pm

> So the problem you have is where to pull recommendations from. For my
> own needs, I use a spell checker to do the “did you mean”, which means
> my data source is external, and thus I never hit the index twice.

No.

I definitely need to pull the recommendations from the Ferret index
(or reimplement this bit of Ferret in Ruby). I can’t use a spell checker
with an ordinary dictionary because the terms that are stored in my
index (which is an index of Protein Databank File headers among other
things) are often not ordinary words. I could build my own
dictionary of all the words that are indexed, then loop through those
and compute the Levenshtein distance for each - but that’s obviously
what query~ does (it must query a Ferret dictionary to find the
matches with a Levenshtein distance below some threshold, then create a
query that does word1 or word2 or word3…), so it seems extraordinarily
silly (not to mention slow) to reimplement (in Ruby) something that is
already in Ferret.
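
For reference, the dictionary-loop fallback described above could look roughly like this in plain Ruby. It is only a sketch: the term list is a hard-coded array (getting the indexed terms out of Ferret is outside its scope), and `levenshtein` and `similar_terms` are illustrative names, not Ferret methods.

```ruby
# Classic dynamic-programming edit distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Return every term within max_distance edits of the query, nearest first
# (ties broken alphabetically). `terms` is assumed to be a plain array of
# the words in the index.
def similar_terms(query, terms, max_distance: 1)
  terms.map { |term| [term, levenshtein(query, term)] }
       .select { |_, distance| distance <= max_distance }
       .sort_by { |term, distance| [distance, term] }
       .map(&:first)
end

terms = %w[abcd abce abcf wxyz]
puts similar_terms('abca', terms).inspect
```

This walks the whole term list for every query, which is the slowness complained about above.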

My question really is whether access to this information is exposed
through the Ferret API. I think a Ferret developer is needed to
answer this question.

I’m very surprised that I’m the first person (AFAICT from searching
the mailing list archive) to ask this question.

-Rob