Subject: Re: Converting between ASCII-8BIT and UTF-8
Date: Tue 04 Nov 14 05:22:25PM -0500
Quoting Darryl L. Pierce ([email protected]):
> What I have is this: my project has the concept of a message and a
> messenger that can send/receive messages. The wire protocol requires
> that any string of data be UTF-8 encoded.
Strings are just sequences of bits, like other data. But while with
binary, ASCII and older per-country encodings you take the bits one
byte at a time, in UTF-8 a single character may take one, two, three
or four bytes. And some byte sequences are illegal.
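
To make the variable width concrete, here is a small sketch. The sample characters and the names `samples`, `byte_counts` and `broken` are my own picks for illustration, nothing from your project:

```ruby
# Each sample is one character; \u escapes always produce UTF-8 strings,
# so this does not depend on the encoding of the source file.
samples = {
  "a"       => "a",          # U+0061, plain ASCII
  "a-acute" => "\u00E1",     # U+00E1
  "euro"    => "\u20AC",     # U+20AC
  "emoji"   => "\u{1F600}"   # U+1F600
}

# Number of bytes each single character occupies in UTF-8.
byte_counts = samples.map { |name, char| [name, char.bytesize] }.to_h
byte_counts.each { |name, n| puts "#{name}: #{n} byte(s)" }

# A lead byte with no continuation byte after it is an illegal sequence.
broken = [0xC3].pack("C*").force_encoding("UTF-8")
puts "lone 0xC3 valid? #{broken.valid_encoding?}"
```

The counts come out as 1, 2, 3 and 4, and the lone 0xC3 is reported as invalid.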
In Ruby, a string has always been associated with an encoding since
1.9, and since 2.0 UTF-8 is the default encoding for source files, so
a plain string literal is UTF-8.
    s = 'ábc'
    p s.encoding   # => #<Encoding:UTF-8>
There are two main operations you can do with your string regarding
encoding. The first is to transcode:
    s.encode!('ISO-8859-1')
In this case, the first character (lower case ‘a’ with acute accent,
represented as two bytes \xC3\xA1 in UTF-8) is changed to its
equivalent in ISO-8859-1, the single byte \xE1.
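
A quick way to see this for yourself; a sketch only, using the non-destructive `encode` instead of `encode!` so the original stays around for comparison (`latin1` is just my name for the result):

```ruby
s = "\u00E1bc"                    # "ábc" as UTF-8
latin1 = s.encode("ISO-8859-1")   # transcode: same characters, new bytes

p s.bytes        # => [195, 161, 98, 99]
p latin1.bytes   # => [225, 98, 99]

# The character count is unchanged; only the byte representation differs.
p s.length == latin1.length       # => true

# Transcoding back to UTF-8 gives a string equal to the original.
p latin1.encode("UTF-8") == s     # => true
```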
The second operation allows you to keep the exact sequence of bytes,
but tells the system that it has to interpret the string with another
encoding:
    s.force_encoding('ISO-8859-1')
In this case, the new string, interpreted as ISO-8859-1, will be
    Ã¡bc
because in ISO-8859-1 \xC3 is A with tilde, and \xA1 is the inverted
exclamation mark.
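
Again as a sketch (the names `relabelled` and `shown` are mine): relabelling leaves the bytes alone and changes only how they are counted and displayed.

```ruby
s = "\u00E1bc"                                   # "ábc" as UTF-8, 3 characters
relabelled = s.dup.force_encoding("ISO-8859-1")  # same bytes, new label

p relabelled.bytes == s.bytes   # => true
p s.length                      # => 3
p relabelled.length             # => 4, every byte is now its own character

# Transcode the relabelled string to UTF-8 to see which characters
# ISO-8859-1 assigns to those bytes: U+00C3 (A tilde) and U+00A1.
shown = relabelled.encode("UTF-8")
p shown == "\u00C3\u00A1bc"     # => true
```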
At any time, you can inspect the exact bytes that make up a string.
For the original UTF-8 string:
    b = s.bytes
    p b   # => [195, 161, 98, 99]
In your case, since you receive stuff, it should be the other party’s
responsibility to make sure the strings are proper UTF-8. What you
should not do is mangle them. If I were you, in order not to be
mistaken I’d get the string as an array of bytes, and write a method
that’s something like this:
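    def massage_input(array)
      # 'C*' packs every element as one unsigned byte (0..255)
      s = array.pack('C*').force_encoding('UTF-8')
      unless s.valid_encoding?
        # complain in some way, for example:
        raise ArgumentError, 'input is not valid UTF-8'
      end
      s
    end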
and then make sure I do not modify the string I receive anymore.
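
To tie this back to the subject line, here is a rough sketch of what I mean. I am assuming the data reaches you as a string tagged ASCII-8BIT (here simulated with `pack`); `to_utf8` is a made-up helper name, and freezing the result is just one way to enforce the "do not modify" part:

```ruby
# Stand-in for data read off the wire: raw bytes tagged ASCII-8BIT.
raw = [195, 161, 98, 99].pack("C*")
p raw.encoding == Encoding::ASCII_8BIT   # => true

# Transcoding is the wrong tool here: ASCII-8BIT has no character
# mapping for bytes above 127, so encode raises.
transcode_error = nil
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  transcode_error = e
end
p transcode_error.class                  # => Encoding::UndefinedConversionError

# Hypothetical helper: relabel a copy as UTF-8, validate it, freeze it.
def to_utf8(binary_string)
  s = binary_string.dup.force_encoding("UTF-8")
  raise ArgumentError, "input is not valid UTF-8" unless s.valid_encoding?
  s.freeze   # later attempts to modify the string will raise
end

text = to_utf8(raw)
p text == "\u00E1bc"                     # => true
p text.frozen?                           # => true
```

So for ASCII-8BIT input the conversion is a relabel plus a validity check, not a transcode.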
Carlo