I have spent days trying to figure out how to get UTF-8 working with my
site.
Here’s my environment:
Linux version 2.6.16.29-xen_3.0.3.0
Ruby 1.8.4 (2005-12-24) [i386-linux]
Rails 1.2.3
mongrel (1.0.1)
mongrel_cluster (1.0.2, 0.2.1)
ferret (0.11.4)
acts_as_ferret stable plugin
Ferret DRB server
When I don’t use an analyzer with my acts_as_ferret declaration,
everything works fine. However, I can’t expect users to enter “Álex
Rodríguez” when searching… they’re going to put “alex rodriguez” (or
some variation of his name, which I handle using a fuzzy search).
So I pass an analyzer in my acts_as_ferret declaration:
acts_as_ferret({ :fields => { :first_name => { :store => :no },
                              :last_name  => { :store => :no },
                              :db_state   => { :index => :untokenized_omit_norms,
                                               :term_vector => :no } },
                 :remote => true },
               { :analyzer => UtfAnalyzer.new })
Here’s the analyzer I’m using… pretty much taken from here:
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/MappingFilter.html
class UtfAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  CHARACTER_MAPPINGS = {
    ['à','á','â','ã','ä','å','ā','ă']         => 'a',
    'æ'                                       => 'ae',
    ['ď','đ']                                 => 'd',
    ['ç','ć','č','ĉ','ċ']                     => 'c',
    ['è','é','ê','ë','ē','ę','ě','ĕ','ė']     => 'e',
    ['ƒ']                                     => 'f',
    ['ĝ','ğ','ġ','ģ']                         => 'g',
    ['ĥ','ħ']                                 => 'h',
    ['ì','ì','í','î','ï','ī','ĩ','ĭ']         => 'i',
    ['į','ı','ĳ','ĵ']                         => 'j',
    ['ķ','ĸ']                                 => 'k',
    ['ł','ľ','ĺ','ļ','ŀ']                     => 'l',
    ['ñ','ń','ň','ņ','ŉ','ŋ']                 => 'n',
    ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ','ŏ'] => 'o',
    ['œ']                                     => 'oek',
    ['ą']                                     => 'q',
    ['ŕ','ř','ŗ']                             => 'r',
    ['ś','š','ş','ŝ','ș']                     => 's',
    ['ť','ţ','ŧ','ț']                         => 't',
    ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
    ['ŵ']                                     => 'w',
    ['ý','ÿ','ŷ']                             => 'y',
    ['ž','ż','ź']                             => 'z'
  }

  def token_stream(field, str)
    MappingFilter.new(StandardTokenizer.new(str), CHARACTER_MAPPINGS)
  end
end
I think Ferret is working fine… because when I run some tests, the
mapping filter correctly pulls out the accented characters… exactly as
it should.
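A check of this kind can be approximated without Ferret at all. What follows is only a minimal sketch, assuming Ruby 1.9 or later with UTF-8 source: `fold_accents` and `SAMPLE_MAPPINGS` are hypothetical names, the table is a small subset of CHARACTER_MAPPINGS, and plain `String#gsub` stands in for the MappingFilter rather than reproducing it.

```ruby
# encoding: utf-8
# Hypothetical stand-in for the mapping step: plain String#gsub, no Ferret.
# SAMPLE_MAPPINGS is a small illustrative subset of CHARACTER_MAPPINGS.
SAMPLE_MAPPINGS = {
  ['à', 'á', 'â', 'ã', 'ä', 'å'] => 'a',
  ['ì', 'í', 'î', 'ï']           => 'i',
  ['ñ']                          => 'n'
}

# Replace every listed character with its ASCII substitute.
def fold_accents(str, mappings = SAMPLE_MAPPINGS)
  mappings.inject(str.dup) do |result, (from, to)|
    Array(from).inject(result) { |acc, ch| acc.gsub(ch, to) }
  end
end

puts fold_accents('Rodríguez')  # lowercase í is listed, so it is folded
puts fold_accents('Álex')       # uppercase Á is not listed, so it is left alone
```

One thing this makes visible: the table lists only lowercase forms, so a capital such as “Á” passes through unchanged unless uppercase entries are added or the text is lowercased in a UTF-8-aware way first.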
However, when something is persisted via the model (acts_as_ferret and
DRB server), I get unexpected behavior…
- using a model with ONE field declared in acts_as_ferret, and a string
  with accented characters: I can search it as expected, with either
  accented or non-accented characters, and I get the results returned;
  however, I don’t get any other results for the non-accented records.
  ONLY the accented records get returned when searching.
- using a model with multiple fields defined (as in the Player model
  above): nothing gets returned, neither accented nor non-accented
  records, nor any combination
My ferret_server.log file shows characters that are very different from
the accented characters I’m trying to search on…
Search entered in form: Álex Rodríguez
ferret_server.log: Ãlex rodrÃÂguez
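Output like this is what typically appears when UTF-8 bytes are treated as a single-byte encoding somewhere along the way and then converted to UTF-8 a second time. Whether that is what happens between Rails and the DRB server here is an assumption, not something the log proves. The sketch below only reproduces the effect in isolation; it assumes Ruby 1.9 or later (`String#force_encoding` and `String#encode` do not exist on 1.8, where Iconv would be the rough equivalent).

```ruby
# encoding: utf-8
# Illustration only: simulates one extra ISO-8859-1 -> UTF-8 conversion.
original = 'Rodríguez'                       # í is two bytes (C3 AD) in UTF-8

# Pretend some layer reads the UTF-8 bytes as ISO-8859-1 and converts them to UTF-8.
misread = original.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts original.bytesize   # 10
puts misread.bytesize    # 12: each byte of í became its own two-byte character
puts misread.unpack('U*').map { |cp| format('U+%04X', cp) }.join(' ')

# Reversing the extra conversion recovers the original text.
repaired = misread.encode('ISO-8859-1').force_encoding('UTF-8')
puts repaired == original
```

If the logged text can be turned back into the search string with a reversal like this, that points to an extra conversion step rather than to lost data.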
Not sure why this is occurring, but I’ve also redisplayed the submitted
text on a web page and it displays correctly. This leads me to believe
that Ruby/Rails is successfully getting the information, and that html
page encoding is correct, along with environment variables, etc… As I
stated earlier, my Ferret test takes the string “Rodríguez” and returns
token["Rodriguez":0:10:1] demonstrating the UtfAnalyzer works fine
outside of acts_as_ferret…
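Since a browser, terminal, or log viewer can each decode text differently, comparing raw bytes at every hop (controller, model, DRB server) is usually more reliable than comparing what gets displayed. A minimal sketch, where `hex_bytes` is a hypothetical helper name and `String#unpack('C*')` is also available on Ruby 1.8:

```ruby
# encoding: utf-8
# hex_bytes is a hypothetical helper: dumps a string as space-separated hex bytes.
def hex_bytes(str)
  str.unpack('C*').map { |b| format('%02X', b) }.join(' ')
end

puts hex_bytes('Rodríguez')
# 52 6F 64 72 C3 AD 67 75 65 7A
```

Ten bytes with a single C3 AD pair is what correctly encoded UTF-8 looks like for this name, and it would be consistent with the 0:10 offsets in the token above. A longer dump containing something like C3 83 C2 AD at the server side would suggest the string was re-encoded before it reached Ferret.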
So any help here would be much appreciated.
Thanks,
Brandon