I've been using ferret with acts_as_ferret (aaf) for a year now. Everything is
running with UTF-8 (even the MSSQL DB plays nice with UTF-8), just ferret on
Windows doesn't work with UTF-8. The only locale it accepts is
"German_Germany.1252", which gets assigned automatically when doing
"Ferret.locale = ''".
I had quite some trouble getting German umlauts indexed and searchable, but
after countless tries I had it working, somehow.
But when I tried to improve performance by storing data directly in the
index, I came across a new issue I can't seem to solve.
And that issue is sorting. o.O
I was able to get properly sorted output, but then none of the umlauts were
searchable.
With the help of iconv, which converts all UTF-8 data that gets tokenized
into ISO-8859-1, searching works. But when I want to have the results
sorted, strange things happen (:type => :string and :type => :byte both
return the results in the wrong order).
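A minimal sketch of that conversion step, for reference. It assumes a Ruby version with String#encode (1.9 or later) instead of the Iconv library, and to_latin1 is just an illustrative helper name, not something from ferret or aaf:

```ruby
# Hypothetical helper: convert UTF-8 text to ISO-8859-1 before it is tokenized.
# Characters that have no ISO-8859-1 equivalent are replaced with "?".
def to_latin1(text)
  text.encode('ISO-8859-1', 'UTF-8',
              invalid: :replace, undef: :replace, replace: '?')
end

latin1 = to_latin1('Lehmann-Mönke')
puts latin1.encoding   # the string is now tagged as ISO-8859-1
puts latin1.bytesize   # one byte per character, including the umlaut
```

Anything outside ISO-8859-1 (for example Polish "ł") is lost in this step, which is one more reason the conversion feels like a workaround.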
I've indexed my app's users, groups, organizational units and positions
(replicated from Active Directory).
When sorting by name I get results like
–
Administratoren
Adamek, Angela
Arrens, Anika
Baar, Benjamin
Alle Benutzer
Company Administrators
Drobny, Ewa
…
While one could think "Alle Benutzer" was indexed by "Benutzer" on
purpose, it wasn't! It just seems ferret ignores words on a couple of
(random?) records.
–
…
Knorr, Fanz-Josef
Danisch, Krystyna
Kuhn, Frieda
Layout Designer
Lehmann-Mönke, Elke
Dammann, Lennart
Jhim, Li-Soue
Lessing, Lisa
…
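For comparison, this is the order I would expect for some of the names above. It is a plain Ruby sketch, not ferret code: sort_key and expected_order are made-up names, and it assumes Ruby 2.2 or later for unicode_normalize. Each name is treated as one single string (not split into words), accents are stripped, and case is ignored:

```ruby
# Names taken from the result lists above.
NAMES = [
  'Administratoren', 'Adamek, Angela', 'Arrens, Anika', 'Baar, Benjamin',
  'Alle Benutzer', 'Company Administrators', 'Drobny, Ewa',
  'Lehmann-Mönke, Elke', 'Lessing, Lisa'
].freeze

# Hypothetical sort key: the whole name, accents stripped, lowercased.
def sort_key(name)
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Sort on the complete key so "Alle Benutzer" files under "A", not "B".
def expected_order(names)
  names.sort_by { |name| sort_key(name) }
end

puts expected_order(NAMES)
```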
I can't think of anything else I could try to get it working without
breaking either search or sort.
So my last resort would be a mapping that replaces the special chars, so
ferret won't have to deal with them at all.
"müller" would then be indexed as "muller", and when someone searches for
"müller" the query would be translated the same way and retrieve hits with
"muller", even if that means losing some precision in the results. Just to
get out of encoding hell once and for all.
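Roughly what I have in mind, as a plain Ruby sketch rather than anything ferret provides. The fold helper and the EXTRA_FOLDS table are invented for illustration, and it assumes Ruby 2.4 or later (Unicode-aware downcase and unicode_normalize). The point is that the same function runs over the text being indexed and over the query:

```ruby
# Characters that do not decompose into base letter + accent need explicit entries.
EXTRA_FOLDS = { 'ß' => 'ss', 'æ' => 'ae', 'ø' => 'o', 'ł' => 'l' }.freeze

# Hypothetical helper: lowercase, strip accents, then apply the extra folds.
def fold(text)
  decomposed    = text.downcase.unicode_normalize(:nfd) # "ü" becomes "u" + combining mark
  without_marks = decomposed.gsub(/\p{Mn}/, '')         # drop the combining marks
  without_marks.gsub(/[ßæøł]/, EXTRA_FOLDS)
end

# Both sides go through the same folding, so they meet at the same term.
indexed_term = fold('Müller')
query_term   = fold('müller')
puts indexed_term == query_term
```

The precision loss is visible here too: "Muller" and "Müller" become indistinguishable once folded.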
Is that what the MappingFilter is meant for?
I tried the one from Dave's examples:
require 'ferret'

class MappingAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  # Accented characters (keys) and the ASCII replacement for each group.
  CHARACTER_MAPPINGS = {
    ['à','á','â','ã','ä','å','ā','ă','À','Á','Â'] => 'a',
    'æ' => 'ae',
    ['ď','đ'] => 'd',
    ['ç','ć','č','ĉ','ċ'] => 'c',
    ['è','é','ê','ë','ē','ę','ě','ĕ','ė','É','È'] => 'e',
    ['ƒ'] => 'f',
    ['ĝ','ğ','ġ','ģ'] => 'g',
    ['ĥ','ħ'] => 'h',
    ['ì','í','î','ï','ī','ĩ','ĭ','Î'] => 'i',
    ['į','ı','ĳ','ĵ'] => 'j',
    ['ķ','ĸ'] => 'k',
    ['ł','ľ','ĺ','ļ','ŀ'] => 'l',
    ['ñ','ń','ň','ņ','ŉ','ŋ'] => 'n',
    ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ','Ô'] => 'o',
    'œ' => 'oek',
    'ą' => 'q',
    ['ŕ','ř','ŗ'] => 'r',
    ['ś','š','ş','ŝ','ș'] => 's',
    ['ß'] => 'ss',
    ['ť','ţ','ŧ','ț'] => 't',
    ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų','Û'] => 'u',
    'ŵ' => 'w',
    ['ý','ÿ','ŷ'] => 'y',
    ['ž','ż','ź'] => 'z'
  }

  def initialize(stop_words = FULL_GERMAN_STOP_WORDS)
    @stop_words = stop_words  # stored, but not used by token_stream below
  end

  # Tokenize the field text, then replace mapped characters in each token.
  def token_stream(field, str)
    MappingFilter.new(StandardTokenizer.new(str), CHARACTER_MAPPINGS)
  end
end
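To sanity-check a table like this without going through ferret at all, the substitution can be mimicked in plain Ruby. This is only a sketch: SAMPLE_MAPPINGS is a small excerpt, flatten_mappings and apply_mappings are made-up helpers, and it says nothing about how MappingFilter works internally:

```ruby
# Small excerpt of the table above; array keys share one replacement.
SAMPLE_MAPPINGS = {
  ['à', 'á', 'ä'] => 'a',
  ['ò', 'ó', 'ö'] => 'o',
  ['ù', 'ú', 'ü'] => 'u',
  'ß'             => 'ss'
}.freeze

# Expand array keys so every source character gets its own entry.
def flatten_mappings(mappings)
  mappings.each_with_object({}) do |(keys, replacement), flat|
    Array(keys).each { |key| flat[key] = replacement }
  end
end

# Replace every mapped character; unmapped text passes through unchanged.
def apply_mappings(text, mappings)
  flat = flatten_mappings(mappings)
  text.gsub(Regexp.union(flat.keys), flat)
end

puts apply_mappings('müller', SAMPLE_MAPPINGS)
puts apply_mappings('Lehmann-Mönke', SAMPLE_MAPPINGS)
```

If the plain Ruby version gives the expected output but the index does not, the problem is more likely in the encoding of the strings that reach ferret than in the table itself.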