It’s my understanding that the tokens in a token_stream consist of text along with start/stop positions that represent the byte positions of the text within the corresponding document field. The documentation I’ve been reading (i.e., O’Reilly - Ferret - page 67) suggests that these byte positions represent positions within the entire field, but based on my testing it appears that the byte positions are relative to the line that contains the corresponding text within the field. I read my fields following Brian McCallister:
  index.add_document :file => path,
                     :content => file.readlines
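For reference, readlines returns an array with one element per line, so the :content field is handed over as several separate strings rather than one string (shown here with StringIO as a stand-in for a real file):

```ruby
require "stringio"

# readlines splits on line endings and keeps them attached,
# so the field value is an array of per-line strings.
file = StringIO.new("this is a\nsentence")
lines = file.readlines
# lines == ["this is a\n", "sentence"]
```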
Hence, if I have a file that contains carriage returns, the token positions will be reset with each new line. For example, a file with the following contents (File A)

  this is a sentence

will result in a token for the text “sentence” with start position equal to 10 (assume “this” starts at position 0), while a file with a carriage return

  this is a
  sentence

will result in a token for the text “sentence” with start position equal to 0. I get the same results for my custom tokenizer as well as StandardTokenizer. The above does not seem consistent with the documentation, but more importantly, it seems that global positions are more useful than line-based positions (e.g., for highlighting).
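To make the offsets above concrete, here is a pure-Ruby toy (tokenize is my own stand-in for illustration, not Ferret’s tokenizer) showing how the same field produces global offsets when tokenized as one string, but restarted offsets when each line is tokenized separately, as happens with readlines:

```ruby
# Toy tokenizer: returns [text, start, stop] triples with byte
# offsets relative to the string it is given.
def tokenize(str)
  tokens = []
  str.scan(/\w+/) do |word|
    start = Regexp.last_match.begin(0)
    tokens << [word, start, start + word.length]
  end
  tokens
end

# Whole field tokenized at once: "sentence" starts at byte 10
# (the "\n" counts as one byte).
whole = tokenize("this is a\nsentence")

# Field split into lines first: each line is tokenized on its own,
# so "sentence" starts at byte 0 within its line.
per_line = "this is a\nsentence".lines.map { |line| tokenize(line) }
```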
Digging a little deeper, it seems that the tokenizer’s initialize method is called each time the token_stream method of the containing analyzer is called:
  class CustomAnalyzer
    def token_stream(field, str)
      ts = StandardTokenizer.new(str)
    end
  end
Am I missing something here? Are the start/stop byte positions intended to be relative to the line? Is there a way for token_stream to be called only once for an entire string sequence (even if it contains carriage returns)?
Thanks,
John