According to the Analyzer doc and the StandardTokenizer doc:
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/Analyzer.html
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/StandardTokenizer.html
I ought to be able to construct a StandardTokenizer like this:
t = StandardTokenizer.new(true) # true to downcase tokens
and then later:
stream = t.token_stream(ignored_field_name, some_string)
to create a new TokenStream from some_string. This approach would be
valuable for my application since I am analyzing many short strings –
so I’m thinking that not having to rebuild my 5-deep analyzer chain for
each small string will be a nice savings.
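To make the pattern I'm after concrete, here is a self-contained toy in plain Ruby – this is my own illustration, not Ferret code; ToyTokenizer and its text= setter are hypothetical stand-ins for what the docs seem to promise:

```ruby
# Toy version of the reuse model: construct the tokenizer once
# (with the downcase flag), then point it at each new short string.
class ToyTokenizer
  attr_writer :text  # reset the input without rebuilding the tokenizer

  def initialize(lower = false)
    @lower = lower
    @text = ""
  end

  # Yield each whitespace-delimited token, downcased if requested.
  def each_token
    @text.split.each { |w| yield(@lower ? w.downcase : w) }
  end
end

t = ToyTokenizer.new(true)  # built once, up front
t.text = "Hello World"      # reused per string
tokens = []
t.each_token { |tok| tokens << tok }
tokens  # => ["hello", "world"]
```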
Unfortunately, StandardTokenizer#initialize does not work as advertised:
it takes a string, not a boolean, so it does not support the reuse model
from the documentation cited above. If you have a look at the “source”
link on the StandardTokenizer documentation for “new”:
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/StandardTokenizer.html#
You’ll see that the rdoc comment apparently lies: the formal
parameter that should hold “lower” is named “rstr”. Fishy. A quick
look indicates that WhiteSpaceTokenizer has a similar mismatch with its
documentation.
Is there an idiomatic way to reuse analyzer chains?
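To clarify what I mean by a reusable chain, here is a hedged pure-Ruby sketch (again, not Ferret – LowerFilter, StopFilter, and analyze are hypothetical stand-ins for my real filter stack): the chain is assembled once and then applied to many strings.

```ruby
# Each stage transforms a token list; the chain is built once.
# Rebuilding stages like these for every short string is the
# overhead I'm trying to avoid.
LowerFilter = ->(tokens) { tokens.map(&:downcase) }
StopFilter  = ->(tokens) { tokens.reject { |t| %w[the a an].include?(t) } }

chain = [LowerFilter, StopFilter]

# Run one string through the pre-built chain of stages.
def analyze(chain, str)
  chain.reduce(str.split) { |toks, stage| stage.call(toks) }
end

analyze(chain, "The Quick Fox")  # => ["quick", "fox"]
```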