#encode and (pre/de)composed characters

aris · October 27, 2012, 8:41pm

Why does encode to UTF-8 (in at least this one case) appear to favor a
decomposed character while encode to UTF-16LE favors a precomposed
character?

See below…

Thanks,
Hal

1.9.2p290 :009 > str = [233].pack("c*").force_encoding("ISO-8859-1")
=> "\xE9"
1.9.2p290 :010 > s2 = str.encode("UTF-8")
=> "é"
1.9.2p290 :012 > s2.bytes.to_a
=> [195, 169]
1.9.2p290 :018 > s3 = str.encode("UTF-16LE")
=> "\u00E9"
1.9.2p290 :019 > s3.bytes.to_a
=> [233, 0]

Hal_F · October 27, 2012, 9:39pm

It doesn’t. UTF-8 just needs two bytes to encode this character.
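
To make that concrete, here is a minimal sketch using only core String methods (it assumes Ruby 1.9 or later; the variable names are mine, not from the original post). Both encoded strings hold the same single precomposed code point, U+00E9, and only the byte serialization differs. A genuinely decomposed string would show two code points instead.

```ruby
# Build the ISO-8859-1 string from the original post: one byte, 0xE9.
latin1 = [233].pack("C*").force_encoding("ISO-8859-1")

utf8  = latin1.encode("UTF-8")
utf16 = latin1.encode("UTF-16LE")

# The same single precomposed code point in both encodings...
p utf8.codepoints.to_a    # [233]
p utf16.codepoints.to_a   # [233]

# ...but a different byte serialization.
p utf8.bytes.to_a         # [195, 169]
p utf16.bytes.to_a        # [233, 0]

# For comparison, a decomposed "e" + U+0301 is two code points
# and three bytes in UTF-8.
decomposed = "e\u0301"
p decomposed.codepoints.to_a  # [101, 769]
p decomposed.bytes.to_a       # [101, 204, 129]
```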

You can use the unicode_utils gem to decompose and compose characters, as
well as to check what's in a string:

irb(main):001:0> require 'unicode_utils'
=> true
irb(main):004:0> UnicodeUtils.char_name [195, 169].pack("c*").force_encoding("utf-8")
=> "LATIN SMALL LETTER E WITH ACUTE"
irb(main):005:0> UnicodeUtils.char_name [195].pack("c*").force_encoding("utf-8")
ArgumentError: invalid byte sequence in UTF-8
irb(main):006:0> UnicodeUtils.char_name [233, 0].pack("c*").force_encoding("utf-16le")
=> "LATIN SMALL LETTER E WITH ACUTE"
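
The ArgumentError above comes from byte 195 (0xC3) being only the lead byte of a two-byte UTF-8 sequence, so on its own it is not a complete character. As a rough sketch, the same thing can be checked without the gem via the core method String#valid_encoding? (the variable names here are just for illustration):

```ruby
# 0xC3 alone is a lead byte with no continuation byte after it.
lone_lead = [195].pack("C*").force_encoding("UTF-8")
# 0xC3 0xA9 together form the complete encoding of U+00E9.
complete  = [195, 169].pack("C*").force_encoding("UTF-8")

p lone_lead.valid_encoding?  # false
p complete.valid_encoding?   # true

# One character, stored in two bytes.
p complete.length            # 1
p complete.bytesize          # 2
```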

To decompose the char:
irb(main):007:0> e_acute = [195, 169].pack("c*").force_encoding("utf-8")
=> "\u00E9"
irb(main):014:0> nfkd = UnicodeUtils.nfkd e_acute
=> "e\u0301"
irb(main):015:0> UnicodeUtils.char_name nfkd[0]
=> "LATIN SMALL LETTER E"
irb(main):016:0> UnicodeUtils.char_name nfkd[1]
=> "COMBINING ACUTE ACCENT"
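
As an aside, and assuming Ruby 2.2 or later (this thread is on 1.9.2, where it is not available), String#unicode_normalize can do the same round trip without the gem. The sketch below uses NFD and NFC rather than NFKD; for this particular character the decomposition is the same.

```ruby
e_acute = "\u00E9"  # precomposed LATIN SMALL LETTER E WITH ACUTE

# Decompose into "e" followed by COMBINING ACUTE ACCENT (U+0301).
nfd = e_acute.unicode_normalize(:nfd)
p nfd.codepoints   # [101, 769]
p nfd.length       # 2

# Recompose back into the single precomposed code point.
nfc = nfd.unicode_normalize(:nfc)
p nfc.codepoints   # [233]
p nfc == e_acute   # true

# Check whether a string is already in a given normalization form.
p e_acute.unicode_normalized?(:nfc)  # true
p nfd.unicode_normalized?(:nfc)      # false
```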

– Matma R.

Hal_F · October 28, 2012, 2:42am

On Sat, Oct 27, 2012 at 2:38 PM, Bartosz Dziewoński wrote:

It doesn’t. UTF-8 just needs two bytes to encode this character.

Ahh, of course, I see now. Thank you.

Hal