Substring search?

dawillis · November 24, 2007, 10:14pm

Is it possible to use Ferret to do substring searches efficiently? If
not, what can I use?

Problem: 1 million+ strings need to be matched with 1 million+
substrings. For example:

iliketotraveltohawaii
travelmagazine

will both be matched with the substring “travel” but only the first will
match with “hawaii”.

What I have tried:

I used Ferret to create an index with a WhiteSpaceAnalyzer, after first
splitting each string into space-separated characters:

travelmagazine -> t r a v e l m a g a z i n e
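For anyone following along, the splitting step can be sketched in plain Ruby like this. The helper name to_unigram_tokens is made up for illustration and is not part of Ferret; the output is simply the text that a whitespace-based analyzer would then break into one token per character.

```ruby
# Hypothetical helper (not a Ferret API): put a space between every
# character so a whitespace tokenizer emits one token per character.
def to_unigram_tokens(text)
  text.chars.join(" ")
end

puts to_unigram_tokens("travelmagazine")
```

With this scheme a search for "travel" becomes a phrase of six one-character terms, and terms like "a" or "e" appear in nearly every indexed string.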

This works, and the index is generated very quickly, but the search
(a PhraseQuery) is very slow: about 200-300 ms per search.

I’m concerned that either Ferret is the wrong tool for this or I’m just
taking the wrong approach to this problem.

dawillis · November 25, 2007, 3:37am

To answer my own question… using n-grams instead of unigrams provided
a huge speedup.
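
To show why n-grams help, here is a minimal sketch of the idea in plain Ruby, without Ferret. It assumes trigrams (n = 3) and an in-memory hash as the index; the names ngrams and NgramIndex are made up for this example, and this is not necessarily how the Ferret index above was set up. Each string is indexed under every overlapping 3-character slice, a query is broken up the same way, and only strings that contain all of the query's trigrams are checked with a real substring test.

```ruby
require "set"

# Hypothetical helper: all overlapping substrings of length n.
def ngrams(text, n = 3)
  return [] if text.length < n
  (0..text.length - n).map { |i| text[i, n] }
end

# Hypothetical in-memory index: n-gram => set of ids of strings containing it.
class NgramIndex
  def initialize(n = 3)
    @n = n
    @postings = Hash.new { |hash, gram| hash[gram] = Set.new }
    @docs = []
  end

  # Stores the string and records its id under each of its n-grams.
  def add(text)
    id = @docs.length
    @docs << text
    ngrams(text, @n).each { |gram| @postings[gram] << id }
    id
  end

  # Returns every indexed string that contains `query` as a substring.
  def search(query)
    grams = ngrams(query, @n).uniq
    # Queries shorter than n have no n-grams, so fall back to a full scan.
    return @docs.select { |doc| doc.include?(query) } if grams.empty?

    # Intersect the posting lists, starting with the smallest one.
    lists = grams.map { |gram| @postings.fetch(gram, Set.new) }.sort_by(&:size)
    candidates = lists.reduce { |acc, ids| acc & ids }
    # Sharing all n-grams does not prove they are adjacent, so verify.
    candidates.map { |id| @docs[id] }.select { |doc| doc.include?(query) }
  end
end

index = NgramIndex.new
index.add("iliketotraveltohawaii")
index.add("travelmagazine")
p index.search("travel")
p index.search("hawaii")
```

The speedup comes from selectivity: a trigram such as "haw" occurs in far fewer strings than a single letter does, so the lists being intersected are much shorter. The trade-off is a larger index and a slow path for queries shorter than n.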