On Feb 9, 2006, at 7:48, [email protected] wrote:
> Since then it has been running for a further 12 hours trying to use
> that index to obtain likely matches for the same 3000 items; i.e. for
> each of the 3000 items I am trying to get the best matches from the
> index (using find related).
> Should I even bother waiting for it to finish or should I be
> investigating something else to achieve similar results?
Can’t comment on the time it takes, but the data you’re using doesn’t
seem particularly suited to LSI, in my opinion (and this sort of
thing is my occupation these days). LSI’s not magic - what it’s
doing is taking advantage of the statistical properties of language.
So it needs two things to work well: a relatively large set of words
compared to the number of items, and the items should be (more or
less) standard language.
Obviously I don’t know exactly what the product names are, but as a
class, product names don’t strike me as fitting those constraints
very well. Firstly because I expect them to be fairly short (5-6
words, tops?), and secondly because they lack a lot of the syntax and
semantic relations that you’d find in a sentence (nominals don’t have
very much internal structure, in general).
Other approaches that might be promising are standard word/document
search (like ferret, already mentioned), or a language model approach,
which works using relative frequencies of words. In the
power tool domain, for instance, “grit” might correlate highly with
“sander”, and so you could say that anything with “grit” in it is
related to sanding.
That said, I’m not aware of any Ruby libraries which implement this
sort of thing, so if you wanted to stick with Ruby, you’d be doing it
yourself (it’s not a particularly sophisticated approach, though, so
it likely wouldn’t be that hard).
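A rough sketch of the relative-frequency idea in plain Ruby, just to show the shape of it. Everything here is an assumption made for illustration: the sample product names are invented, and the method names (tokens, cooccurrence_counts, association, related) are hypothetical, not taken from any library. The association score is simply the fraction of items containing one word that also contain the other.

```ruby
require 'set'

# Invented sample data; real product names would come from your own catalogue.
PRODUCTS = [
  "orbital sander 80 grit",
  "belt sander 120 grit pads",
  "sanding sheets 240 grit",
  "cordless drill 18v",
  "hammer drill 18v battery"
].freeze

# Split a name into a set of lowercase word tokens.
def tokens(name)
  name.downcase.scan(/[a-z0-9]+/).to_set
end

# For every word, count the items containing it; for every ordered pair
# of distinct words, count the items containing both.
def cooccurrence_counts(names)
  word_counts = Hash.new(0)
  pair_counts = Hash.new(0)
  names.each do |name|
    words = tokens(name).to_a
    words.each { |w| word_counts[w] += 1 }
    words.permutation(2) { |a, b| pair_counts[[a, b]] += 1 }
  end
  [word_counts, pair_counts]
end

# Relative frequency of `other` among the items that contain `word`.
def association(word, other, word_counts, pair_counts)
  return 0.0 if word_counts[word].zero?
  pair_counts[[word, other]].to_f / word_counts[word]
end

# Rank the other items against `query`: a shared word scores 1.0, and any
# other word pair scores its association. Scores are not normalised, so
# longer names tend to score higher.
def related(query, names, word_counts, pair_counts)
  q_words = tokens(query)
  scored = names.reject { |n| n == query }.map do |name|
    score = 0.0
    q_words.each do |q|
      tokens(name).each do |w|
        score += q == w ? 1.0 : association(q, w, word_counts, pair_counts)
      end
    end
    [name, score]
  end
  scored.sort_by { |_, s| -s }
end

word_counts, pair_counts = cooccurrence_counts(PRODUCTS)
related("orbital sander 80 grit", PRODUCTS, word_counts, pair_counts).each do |name, score|
  puts format("%-28s %.2f", name, score)
end
```

With 3000 short names this is a single pass over the data plus a lookup per word pair, so it should run in seconds rather than hours. Whether the rankings are any good depends on how much vocabulary the names actually share.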
matthew smillie.
Matthew S. [email protected]
Institute for Communicating and Collaborative Systems
University of Edinburgh