It’s my understanding that the tokens in a token_stream consist of text along with start/stop positions that represent the byte positions of the text within the corresponding document field. The documentation I’ve been reading (i.e., O’Reilly - Ferret - page 67) suggests that these byte positions represent positions within the entire field, but based on my testing it appears that the byte positions are relative to the line that contains the corresponding text within the field. I read my fields following Brian McCallister:
  index.add_document :file => path,
                     :content => file.readlines
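For reference, readlines returns an array with one element per line, so the :content field is handed over as several separate strings rather than one string (shown here with StringIO as a stand-in for a real file):

```ruby
require "stringio"

# readlines splits on line endings and keeps them attached,
# so the field value is an array of per-line strings.
file = StringIO.new("this is a\nsentence")
lines = file.readlines
# lines == ["this is a\n", "sentence"]
```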
Hence, if I have a file that contains carriage returns, the token positions will be reset with each new line. For example, a file with the following contents (File A)

  this is a sentence

will result in a token for the text “sentence” with start position equal to 10 (assume “this” starts at position 0), while a file with a carriage return

  this is a
  sentence

will result in a token for the text “sentence” with start position equal to 0. I get the same results for my custom tokenizer as well as StandardTokenizer. The above does not seem consistent with the documentation, but more importantly, it seems that global positions are more useful than line-based positions (e.g., for highlighting).
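To make the offsets above concrete, here is a pure-Ruby toy (tokenize is my own stand-in for illustration, not Ferret’s tokenizer) showing how the same field produces global offsets when tokenized as one string, but restarted offsets when each line is tokenized separately, as happens with readlines:

```ruby
# Toy tokenizer: returns [text, start, stop] triples with byte
# offsets relative to the string it is given.
def tokenize(str)
  tokens = []
  str.scan(/\w+/) do |word|
    start = Regexp.last_match.begin(0)
    tokens << [word, start, start + word.length]
  end
  tokens
end

# Whole field tokenized at once: "sentence" starts at byte 10
# (the "\n" counts as one byte).
whole = tokenize("this is a\nsentence")

# Field split into lines first: each line is tokenized on its own,
# so "sentence" starts at byte 0 within its line.
per_line = "this is a\nsentence".lines.map { |line| tokenize(line) }
```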
Digging a little deeper, it seems that the tokenizer’s initialize method is called each time the token_stream method of the containing analyzer is called:
  class CustomAnalyzer
    def token_stream(field, str)
      ts = StandardTokenizer.new(str)
    end
  end
Am I missing something here? Are the start/stop byte positions intended to be relative to the line? Is there a way for token_stream to be called only once for an entire string sequence (even if it contains carriage returns)?
Thanks,
John