I need to convert some accented text, and I would like to know
what arguments I have to give Iconv to produce the desired output.
For example, the Italian word for Friday is “venerdì”, where the
final letter is a small i with grave accent.
If you type this into the search box of the Italian Wikipedia
(which I believe uses UTF-8 encoding),
it will load:
gives me “venerd\303\254” when I convert from latin1 encoding.
That looks right to me - if I write that into a UTF-8 HTML document, it
displays correctly. What are you expecting?
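
For anyone who wants to check those bytes themselves, here is a minimal sketch. It assumes Ruby 1.9 or later, where String#encode is built in and Iconv is no longer part of the standard library; the variable names are only illustrative and not from the posts above.

```ruby
# In ISO-8859-1 (latin1), "ì" is the single byte 0xEC.
latin1 = "venerd\xEC".dup.force_encoding(Encoding::ISO_8859_1)

# Transcoding to UTF-8 turns that one byte into the pair 0xC3 0xAC.
utf8 = latin1.encode(Encoding::UTF_8)

# Show non-ASCII bytes as octal escapes, similar to how Ruby 1.8 printed them.
octal = utf8.bytes.map { |b| b < 128 ? b.chr : format('\\%03o', b) }.join

puts octal  # prints venerd\303\254
```

So \303\254 is simply the UTF-8 form of ì written in octal, which is why it displays correctly in a UTF-8 document.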
/usr/local/lib/ruby/1.8/uri/common.rb:436:in `split': bad URI(is not URI?): http://www.wikipedia.org/wiki/venerdì (URI::InvalidURIError)
        from /usr/local/lib/ruby/1.8/uri/common.rb:485:in `parse'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/rl/withpath.rb:285:in `uri_from_string_'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/rl/uri.rb:74:in `arg0_info_'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/rl/uri.rb:83:in `init_from_args_'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/rl/uri.rb:56:in `initialize'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/rl/base.rb:80:in `new'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/rl/base.rb:80:in `parse'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/rl/builder.rb:111:in `build'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/factory.rb:412:in `create_state'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio.rb:65:in `initialize'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio.rb:76:in `new'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio.rb:76:in `rio'
        from /usr/local/lib/ruby/gems/1.8/gems/rio-0.4.0/lib/rio/kernel.rb:42:in `rio'
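
The trace above comes from the standard uri library rejecting the raw "ì" in the address. A small sketch that reproduces this without rio, assuming a current Ruby where the exception message may be worded differently from the 1.8 one; the percent-encoded form %C3%AC is the UTF-8 byte pair for ì:

```ruby
require 'uri'

# A raw non-ASCII character in the URL is rejected by URI.parse.
raw = 'http://www.wikipedia.org/wiki/venerdì'
begin
  URI.parse(raw)
  raw_ok = true
rescue URI::InvalidURIError
  raw_ok = false
end

# The same address with the UTF-8 bytes of "ì" percent-encoded parses fine.
escaped = 'http://www.wikipedia.org/wiki/venerd%C3%AC'
uri = URI.parse(escaped)

puts raw_ok    # prints false
puts uri.path  # prints /wiki/venerd%C3%AC
```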
require 'iconv'

# Percent-encode a single byte value as "%XX", e.g. 195 -> "%C3".
def to_hex(number)
  '%%%02X' % number.abs
end

class String
  # Convert a Latin-1 string to UTF-8 and percent-encode every byte
  # other than ASCII letters, digits, underscore and space.
  def wiki_addr
    converted_doc = Iconv.new('utf-8', 'latin1').iconv(self)
    res = ''
    converted_doc.each_byte { |b|
      if b.chr =~ /[a-zA-Z0-9_ ]/
        res << b.chr
      else
        res << to_hex(b)
      end
    }
    res
  end
end
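
As a possible shorter route than converting each byte to hex by hand, the standard library can do the percent-encoding. This is only a sketch under assumptions: it targets Ruby 1.9 or later (String#encode instead of Iconv), the helper name wiki_path is hypothetical, and it is not necessarily what was suggested in this thread.

```ruby
require 'erb'

# Hypothetical helper: takes a word whose bytes are Latin-1, transcodes it
# to UTF-8, and percent-encodes everything outside the URL-safe characters.
def wiki_path(latin1_word)
  utf8 = latin1_word.dup.force_encoding(Encoding::ISO_8859_1)
                    .encode(Encoding::UTF_8)
  ERB::Util.url_encode(utf8)
end

puts wiki_path("venerd\xEC")  # prints venerd%C3%AC
```

Note that ERB::Util.url_encode also escapes spaces as %20, whereas the code above leaves them untouched.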
Thank you for bringing this to my notice!
(Slightly varying Voltaire, I might
have been able to write a shorter
program had I had more leisure and
more knowledge).
I’ll try your suggestion.
Best regards,