StandardTokenizer Doesn't Support token_stream method

According to the Analyzer doc and the StandardTokenizer doc:

http://ferret.davebalmain.com/api/classes/Ferret/Analysis/Analyzer.html
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/StandardTokenizer.html

I ought to be able to construct a StandardTokenizer like this:

t = StandardTokenizer.new(true) # true to downcase tokens

and then later:

stream = t.token_stream(ignored_field_name, some_string)

to create a new TokenStream from some_string. This approach would be
valuable for my application: I am analyzing many short strings, so not
rebuilding my 5-deep analyzer chain for each small string should be a
nice savings.
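For concreteness, here is a pure-Ruby sketch of the reuse model the rdoc seems to promise. ToyTokenizer is a hypothetical stand-in, not the real Ferret class; it only illustrates the shape of the API (options fixed once at construction, a fresh stream per string via token_stream):

```ruby
# Hypothetical stand-in for the documented pattern -- NOT Ferret's
# StandardTokenizer, just an illustration of the advertised API shape.
class ToyTokenizer
  def initialize(lower = false) # boolean option, as the docs advertise
    @lower = lower
  end

  # Build a fresh token list from +text+; the tokenizer object itself
  # (and its options) is reused across calls, so no chain is rebuilt.
  def token_stream(_field, text)
    tokens = text.scan(/\w+/)
    @lower ? tokens.map(&:downcase) : tokens
  end
end

t = ToyTokenizer.new(true)                            # construct once...
one = t.token_stream(:body, "Short String One")       # ...reuse per string
two = t.token_stream(:body, "Another Short String")
```

Under this model, constructing the tokenizer (or a whole analyzer chain) is a one-time cost, and each short string pays only for its own token_stream call.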

Unfortunately, StandardTokenizer#initialize does not work as advertised.
It takes a string, not a boolean. So it does not support the reuse model
from the documentation cited above. If you have a look at the “source”
link on the StandardTokenizer documentation for “new”:
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/StandardTokenizer.html#

You’ll see that the rdoc comment apparently lies :slight_smile: The formal
parameter that should hold “lower” is instead named “rstr”. Fishy. A quick
look indicates that WhiteSpaceTokenizer has a similar mismatch with its
documentation.

Is there an idiomatic way to reuse analyzer chains?