First, thanks to Jens K. for pointing out a stupid error on my part regarding the use of test_token_stream().
My current problem: a custom tokenizer I’ve written in Ruby does not properly create an index (or at least searches on the index don’t work). Using test_token_stream() I have verified that my tokenizer creates the token stream properly; certainly each Token’s attributes are set correctly. Nevertheless, simple searches return zero results.
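One sanity check I’ve found useful: dump the terms the index actually contains. If the stored terms look wrong (mixed case, empty strings, tags included), the fault is on the indexing side; if they look right, suspect the query side. A sketch, with a stand-in analyzer, a hypothetical :content field, and the IndexReader#terms enumeration as I understand it:

require 'rubygems'
require 'ferret'

# Substitute the custom analyzer here; WhiteSpaceAnalyzer is just a stand-in.
index = Ferret::Index::Index.new(
  :analyzer => Ferret::Analysis::WhiteSpaceAnalyzer.new)
index << {:content => "<doc>Some sample text</doc>"}

# Enumerate every term actually stored for the field.
index.reader.terms(:content).each do |term, doc_freq|
  puts "#{term} (#{doc_freq})"
end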
The essence of my tokenizer is to skip past the XML tags in a file and break up and return the text components as tokens. I use this approach rather than an Hpricot approach because I need to keep track of the location of the text relative to the XML tags: after a search for a phrase, I’ll want to extract the nearby XML tags, since they contain important context. My tokenizer (XMLTokenizer) contains the obligatory initialize, next and text= methods (shown below) as well as a number of parsing methods that are called at the top level by XMLTokenizer.get_next_token, which does the primary work within next. I didn’t include the details of get_next_token because I’m assuming that if each token it produces has the proper attributes, it shouldn’t be the cause of the problem. What more should I be looking for? I’ve also been looking for a custom tokenizer written in Ruby to model after; any suggestions?
def initialize(xmlText)
  # Treat sentence punctuation as whitespace so it never ends up in tokens.
  @xmlText = xmlText.gsub(/[;,!]/, ' ')
  @currPtr = 0
  @currWordStart = nil
  # Position the stream at the first text region past the opening tag.
  @currTextStart = XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)
  @nextTagStart  = XMLTokenizer.skip_beyond_current_text(@currTextStart, @xmlText)
  @currPtr = @currTextStart
  @startOfTextRegion = 1
end

def next
  tkn = get_next_token
  # Debug output: start | end | pos_inc | text
  unless tkn.nil?
    puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
  end
  tkn
end

def text=(text)
  # Reuse initialize to reset all parsing state for the new input.
  initialize(text)
  @xmlText
end
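For what it’s worth, below is the stripped-down token stream I’ve been testing against while debugging. The names are my own and it just splits on whitespace, but it shows the contract as I understand it (corrections welcome): next returns a Ferret::Analysis::Token, or nil at the end of the stream, and text= resets the stream for reuse. Note the downcase, since query terms are typically lowercased:

require 'rubygems'
require 'ferret'

class SimpleTokenizer
  def initialize(text = "")
    self.text = text
  end

  # Ferret can reuse one tokenizer across documents by assigning new
  # text, so this must fully reset the stream state.
  def text=(text)
    @text = text
    @pos = 0
    @text
  end

  def next
    # Skip whitespace between tokens.
    @pos += 1 while @pos < @text.length && @text[@pos, 1] =~ /\s/
    return nil if @pos >= @text.length
    start = @pos
    @pos += 1 while @pos < @text.length && @text[@pos, 1] !~ /\s/
    # Lowercase so that lowercased query terms match the stored terms.
    Ferret::Analysis::Token.new(@text[start...@pos].downcase, start, @pos)
  end
end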
Below is text from a previous, related message showing that the StopFilter is not working:
> I’ve written a tokenizer/analyzer that parses a file, extracting tokens,
> and I operate this analyzer/tokenizer on ASCII data consisting of XML
> files (the tokenizer skips over XML elements but maintains relative
> positioning). I’ve written many unit tests to check the produced token
> stream and was confident that the tokenizer was working properly. Then I
> noticed two problems:
>
> 1. StopFilter (using English stop words) does not properly filter the
>    token stream output from my tokenizer. If I explicitly pass an array
>    of stop words to the stop filter, it still doesn’t work. If I simply
>    switch my tokenizer to a StandardTokenizer, the stop words are
>    appropriately filtered (of course the XML tags are treated
>    differently).
>
> 2. When I try a simple search, no results come up. I can see that my
>    tokenizer is adding files to the index, but a simple search (using
>    Ferret::Index::Index.search_each) produces no results.
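One theory I’m still trying to rule out: the standard stop-word lists are lowercase and, as far as I can tell, StopFilter compares token text verbatim, so mixed-case tokens would slip past the filter and also fail to match lowercased query terms. If that’s right, the usual arrangement is to lowercase before stop-filtering, something like the sketch below. XMLAnalyzer is my own name for the wrapper; if the C-implemented filters won’t wrap a pure-Ruby token stream, downcasing inside XMLTokenizer#next would sidestep that:

require 'rubygems'
require 'ferret'
include Ferret::Analysis

class XMLAnalyzer < Analyzer
  def token_stream(field, str)
    # Lowercase first, then drop stop words from the lowercased stream.
    StopFilter.new(LowerCaseFilter.new(XMLTokenizer.new(str)),
                   FULL_ENGLISH_STOP_WORDS)
  end
end

index = Ferret::Index::Index.new(:analyzer => XMLAnalyzer.new)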
Any suggestions are appreciated.
John