Lucene’s standard analyzer splits words separated by underscores.
Ferret doesn’t do this. For example, if I create an index with only
document ‘test_case’ and search for ‘case’ it doesn’t find anything.
Lucene on the other hand finds it. The same story goes for words
separated by colons.
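The difference can be sketched in plain Ruby (no Ferret needed); the regexps here are illustrative stand-ins for the two tokenization rules, not Ferret’s or Lucene’s actual implementation:

```ruby
# Splitting on anything non-alphanumeric breaks "test_case" at the
# underscore (so a search for "case" can match), while a \w-based
# tokenizer keeps '_' as a word character and emits one token.
lucene_style = "test_case".scan(/[A-Za-z0-9]+/)
word_style   = "test_case".scan(/\w+/)

p lucene_style  # => ["test", "case"]
p word_style    # => ["test_case"]
```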
Which analyzer should I use to emulate Lucene’s StandardAnalyzer
behavior?
No analyzer currently emulates Lucene’s StandardAnalyzer exactly.
You’d have to port it to Ruby which shouldn’t be too hard if you know
how to use racc. But it sounds to me like you don’t need anything so
complex. If you are indexing code you might want to try using the
AsciiLetterAnalyzer. Or you could use the RegExpAnalyzer and describe
your tokens with a Ruby RegExp. Something like this:
include Ferret
include Ferret::Analysis
index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/))
# or if you want case-sensitive searches:
index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/, false))
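For illustration, here is what a token regexp like the one above would do to a mixed-case input, simulated with String#scan rather than Ferret itself (the downcase step stands in for the analyzer’s lowercasing pass):

```ruby
# The same input under the two configurations: by default tokens are
# lowercased; the case-sensitive variant leaves them as written.
tokens = "Test_Case".scan(/[A-Za-z0-9]+/)

p tokens.map(&:downcase)  # default (lowercasing): ["test", "case"]
p tokens                  # case-sensitive:        ["Test", "Case"]
```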
On 9/6/06, Kent S. [email protected] wrote:
No analyzer currently emulates Lucene’s StandardAnalyzer exactly.
You’d have to port it to Ruby which shouldn’t be too hard if you know
how to use racc. But it sounds to me like you don’t need anything so
complex. If you are indexing code you might want to try using the
AsciiLetterAnalyzer.
No, it doesn’t do what I want. Looking at the code, I’m slightly
confused. The criterion is that if isalpha returns 0 then we have
reached the end of a token. Does that mean the ‘_’ character is
considered alphanumeric?
Or you could use the RegExpAnalyzer and describe
your tokens with a Ruby RegExp. Something like this:
include Ferret
include Ferret::Analysis
index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/))
# or if you want case-sensitive searches:
index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/, false))
It would be great if this code worked, but it segfaulted on me. I’ve
glanced at the code and noticed that for this type of stream
typedef struct RegExpTokenStream {
    CachedTokenStream super;
    VALUE rtext;
    VALUE regex;
    VALUE proc;
    int curr_ind;
} RegExpTokenStream;
you initialize three VALUE objects but never mark them for the garbage
collector. Eventually they get freed behind my back. What you should do
is keep the type of the stream in the TokenStream structure and rework
the frt_ts_mark method.
So no, ‘_’ is not considered alphanumeric (or in this case alpha, as
AsciiLetterAnalyzer won’t match numbers).
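For what it’s worth, the distinction can be checked from plain Ruby, since the POSIX [[:alpha:]] character class mirrors C’s isalpha:

```ruby
# '_' is not alpha, so isalpha-style tokenizers split on it; Perl-style
# \w does include '_', which is why \w-based tokenizers keep it.
p("_" =~ /[[:alpha:]]/)  # => nil (underscore is not alpha)
p("a" =~ /[[:alpha:]]/)  # => 0
p("_" =~ /\w/)           # => 0   (\w matches underscore)
```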
you initialize three VALUE objects but never mark them for the garbage
collector. Eventually they get freed behind my back. What you should do
is keep the type of the stream in the TokenStream structure and rework
the frt_ts_mark method.
Hope that helps,
Kent
Actually, frt_rets_mark already marks the three VALUE objects
correctly. What would really help would be if you could give me an
example script that segfaults. If you can do this I’ll fix it and get
a new gem out as soon as possible.
Actually, hold on that, I think I’ve found the problem.
Hi Kent,
I’ve put in a fix which I think should resolve your segfault.
Unfortunately I can’t seem to replicate the bug here to test it. Even
calling GC.start doesn’t seem to collect any of the three VALUEs in
RegExpTokenStream. I’ve had problems like this before when trying to
test an implementation of a weak-key Hash. I really need to look into
how the Ruby garbage collector works, but it never seems to behave
predictably for me.
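That behaviour is consistent with MRI’s conservative collector: anything still reachable, even through a live local variable on the stack, survives GC.start. A minimal stdlib-only sketch of observing this with WeakRef:

```ruby
# WeakRef lets us watch an object across a GC cycle without keeping it
# alive through the weak reference itself. Because 'obj' is still a live
# local here, GC.start cannot collect it.
require 'weakref'

obj = Object.new
ref = WeakRef.new(obj)
GC.start
p ref.weakref_alive?  # => true -- obj is still referenced by a local
```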
Anyway, I was hoping you could help me out, either by testing your
code against the latest version of Ferret in subversion or sending me
a short (or long, I don’t really care) script which causes the
problem. If it’ll make it any easier I can email you a gem of the
current working version of Ferret.
So no, ‘_’ is not considered alphanumeric (or in this case alpha, as
AsciiLetterAnalyzer won’t match numbers)
Yes, it seems to work correctly, but I’ve noticed that index.search_each
doesn’t return more than 10 documents. Is there an option to change that?
problem. If it’ll make it any easier I can email you a gem of the
current working version of Ferret.
Can you send it to my email at ksibilev at yahoo dot com?
Thanks.