Ruby 1.8 - character encoding

thomthom · July 7, 2009, 2:28pm

I write Ruby plugins for Google Sketchup.

Sketchup uses UTF-8 strings and passes them to Ruby (1.8), which
handles Strings as a simple series of bytes. This caused problems when I
tried to use a String from Sketchup that contained a file path
with some Norwegian letters (æøåÆØÅ): Ruby raised an error
saying the file/path didn’t exist.

This was because æøåÆØÅ lie outside the ASCII character set, so each was
returned as a two-byte sequence in UTF-8.

Searching the net I found some hacks that converted UTF-8 into single
byte characters: str_utf8.unpack(‘U*’).pack(‘C*’)

The Norwegian characters lie outside the ASCII range, but they still
get packed into single-byte characters that the File class can
handle.

Example:
'æøåÆØÅ'.length # <- all these characters cause the File class to fail

12

'æøåÆØÅ'.unpack('U*').pack('C*').length # <- the File class can now handle this

6
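A minimal sketch of what the unpack/pack trick does step by step, and of its limit. The string is written with octal escapes so the result does not depend on the source file's encoding; the helper name utf8_to_single_bytes is hypothetical and not part of Ruby or Sketchup.

```ruby
# "æøå€" as raw UTF-8 bytes (octal escapes).
utf8 = "\303\246\303\270\303\245\342\202\254"

# Step 1: decode the UTF-8 byte sequences into Unicode code points.
codepoints = utf8.unpack('U*')   # [230, 248, 229, 8364]

# Step 2: write every code point as ONE byte. 'C' keeps only the low
# 8 bits, so 8364 (0x20AC, the euro sign) silently becomes 172 (0xAC).
bytes = codepoints.pack('C*')

# Hypothetical helper: return nil instead of producing a corrupted string
# when a character does not fit into a single byte.
def utf8_to_single_bytes(str)
  cps = str.unpack('U*')
  return nil if cps.any? { |cp| cp > 255 }
  cps.pack('C*')
end
```

So the trick is only safe while every character in the path has a code point of 255 or less; anything above that is truncated without warning.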

So it seems that the File class doesn’t just handle ASCII, but maybe
ANSI (Windows-1252) or ISO-8859-1. Or does this depend on some system
setting?

My tests have been on a Norwegian Windows XP system with a Norwegian
locale. The default language for applications that don’t support Unicode
is also set to Norwegian.

To sum up, what I’m trying to work out is how UTF-8 characters above
the ASCII range (0-127) are mapped to the 128-255 range. Does the 128-255
range refer to ANSI (1252) or ISO-8859-1, and is this due to system
settings?

thomthom · July 7, 2009, 3:23pm

Sorry, I just realised that the extra number was the octal variant.
So it could still be ANSI…

thomthom · July 7, 2009, 3:11pm

Looking at ISO/IEC 8859-1 - Wikipedia
There’s an extra character code besides the code that equals the ANSI
code. It consists of 3 digits ranging from 0-7. This code can be used
in Ruby in conjunction with the escape character:

Char - ANSI - ISO-8859-1 (decimal / octal)

æ - 230 - 230 / 346
ø - 248 - 248 / 370
å - 229 - 229 / 345
Æ - 198 - 198 / 306
Ø - 216 - 216 / 330
Å - 197 - 197 / 305

"\306".length # Code for Æ

1

Since this code doesn’t exist in ANSI it seems to me that Ruby interprets
ISO-8859 encoding. But I’m still wondering if this is system
dependent…
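A short check of the octal notation, as a sketch: the three-digit code in the table is the same byte value written in base 8, so it says nothing about which character set Ruby assumes.

```ruby
byte  = "\306"                    # one byte, written as an octal escape
value = byte.unpack('C').first    # the byte's numeric value

# 0306 (octal), 198 (decimal) and 0xC6 (hex) are one and the same number.
same_number = (value == 0306) && (value == 198) && (value == 0xC6)

octal_text = '%o' % 198           # formats 198 in base 8
```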

thomthom · July 7, 2009, 4:18pm

On Tue, Jul 7, 2009 at 8:28 AM, Thomas T.[email protected]
wrote:

I write Ruby plugins for Google Sketchup.

Sketchup uses UTF-8 strings and passes this to ruby (1.8) - which
handles Strings as simple series of bytes. This caused problems when I
tried to pass a String I got from Sketchup which contained a file path
with some Norwegian letters. (æøåÆØÅ) as ruby then raised an error
saying the file/path didn’t exist.

http://blog.grayproductions.net/categories/character_encodings

thomthom · July 7, 2009, 4:43pm

On Tue, Jul 7, 2009 at 10:21 AM, Thomas T.[email protected]
wrote:

Gregory B. wrote:

http://blog.grayproductions.net/categories/character_encodings

I have seen that series. I still can’t work out how Ruby determines what
UTF-8 character to map to the 128-255 spaces.

I missed why you wouldn’t just set $KCODE=“U” and stick w. UTF-8?

Anyway, I think chars.pack(“C*”) is going to give you ISO-8859-1 but
someone else will need to verify for you.

-greg

thomthom · July 7, 2009, 4:44pm

On Tue, Jul 7, 2009 at 10:42 AM, Gregory
Brown[email protected] wrote:

someone else will need to verify for you.

Also, since you know the original encoding, you can use Iconv to
explicitly convert to whatever you want.
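A sketch of the explicit conversion suggested here. Iconv is part of the Ruby 1.8 standard library, but it may be missing from the partial Ruby that Sketchup bundles, and it was removed from later Ruby versions, so this example falls back to String#encode when Iconv cannot be loaded. The helper name utf8_to is hypothetical.

```ruby
begin
  require 'iconv'   # Ruby 1.8 standard library; not available everywhere
rescue LoadError
  # No Iconv: the helper below uses String#encode (Ruby 1.9+) instead.
end

# Convert a UTF-8 string to the named target encoding.
def utf8_to(target, utf8)
  if defined?(Iconv)
    Iconv.conv(target, 'UTF-8', utf8)
  else
    utf8.dup.force_encoding('UTF-8').encode(target)
  end
end

latin1 = utf8_to('ISO-8859-1', "\303\246\303\270\303\245")  # "æøå"
cp1252 = utf8_to('CP1252', "\342\202\254")                  # euro sign
```

Unlike unpack/pack, both code paths raise an exception for a character the target encoding cannot represent, rather than truncating it.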

-greg

thomthom · July 7, 2009, 4:21pm

Gregory B. wrote:

http://blog.grayproductions.net/categories/character_encodings

I have seen that series. I still can’t work out how Ruby determines what
UTF-8 character to map to the 128-255 spaces.

thomthom · July 7, 2009, 4:47pm

On Jul 7, 2009, at 7:28 AM, Thomas T. wrote:

Searching the net I found some hacks that converted UTF-8 into single
byte characters: str_utf8.unpack(‘U*’).pack(‘C*’)

What you are doing there is transcoding from UTF-8 to Latin-1 (or
ISO-8859-1). Here’s the proof:

$ ruby -KU -r iconv -e 'utf8 = "æøåÆØÅ"; p utf8.unpack("U*").pack("C*") == Iconv.conv("ISO-8859-1", "UTF-8", utf8)'
true

James Edward G. II

thomthom · July 7, 2009, 4:48pm

2009/7/7 Thomas T. [email protected]:

[…]
So it seems that the File class doesn’t just handle ASCII, but maybe
ANSI (Windows-1252) or ISO-8859-1. Or does this depend on some system
setting?
[…]

Hello,

how Windows interprets file paths depends on which API calls you use
and on the current system locale. There is one set of Windows API
functions that always use UTF-16 and another one that always uses the
encoding associated with the current system locale.

I think Ruby indirectly accesses the latter API and doesn’t do any
character set conversions before passing strings to the operating
system, but I’m not entirely sure there.

cu,
Thomas

thomthom · July 7, 2009, 7:24pm

I’ve been doing some testing on the 128-255 range, and from what I can
gather all code points within the ISO 8859-1 range are identical to the
Unicode code points that UTF-8 encodes.
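A small sketch of that test. It only uses pack and unpack, so it should behave the same on Ruby 1.8: for every value from 128 to 255, the UTF-8 form is two bytes, while pack('C') writes a single byte whose value equals the code point.

```ruby
range = (128..255)

# UTF-8 needs two bytes for each of these code points...
two_bytes_in_utf8 = range.all? { |cp| [cp].pack('U').unpack('C*').length == 2 }

# ...while 'C' writes exactly one byte with the same numeric value.
one_byte_same_value = range.all? { |cp| [cp].pack('C').unpack('C*') == [cp] }

# Example: æ is code point 230; its UTF-8 form is the bytes 195 and 166.
ae_utf8_bytes = [230].pack('U').unpack('C*')
```

It is the code point numbers that match ISO-8859-1, not the UTF-8 bytes, which is why a conversion step is needed at all.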

thomthom · July 7, 2009, 4:51pm

Gregory B. wrote:

I missed why you wouldn’t just set $KCODE=“U” and stick w. UTF-8?

Because Sketchup uses Ruby to allow users to write plugins for the
application. That flag, as far as I understand, is global and would
affect all scripts, which might break a number of things. Also, the Ruby
1.8 version shipped with SU is not the whole package. I’m not sure it
would even be possible if I wanted to.

Anyway, I think chars.pack(“C*”) is going to give you ISO-8859-1 but
someone else will need to verify for you.

-greg

I’m currently looking into whether the UTF-8 decimal code points (in the
range 128-255) are the same as in ISO-8859-1 or ANSI. That might be the
answer.
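One way to compare the two candidates byte by byte, as a sketch. It relies on String#force_encoding and String#encode, so it needs Ruby 1.9 or later and cannot run inside the Ruby 1.8 bundled with Sketchup; it is meant for checking the tables on a separate installation.

```ruby
# For every byte 0x80..0xFF, decode it once as ISO-8859-1 and once as
# Windows-1252 ("ANSI") and record the bytes where the results differ.
differences = {}
(0x80..0xFF).each do |byte|
  raw = [byte].pack('C')
  latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8').unpack('U').first
  cp1252 = begin
    raw.dup.force_encoding('Windows-1252').encode('UTF-8').unpack('U').first
  rescue Encoding::UndefinedConversionError
    nil   # this byte is unassigned in the converter's Windows-1252 table
  end
  differences[byte] = [latin1, cp1252] if latin1 != cp1252
end

norwegian = [230, 248, 229, 198, 216, 197]   # æ ø å Æ Ø Å
norwegian_differs = norwegian.any? { |b| differences.key?(b) }
```

The two encodings only disagree in the 0x80-0x9F block (for example, byte 0x80 is the euro sign in Windows-1252), so for the Norwegian letters either answer gives the same bytes.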

thomthom · July 8, 2009, 6:30pm

I checked the $KCODE variable and it returns “UTF8”.

Now, what does that do to Ruby? Why does File.exist?(‘c:\Test æøå’) fail
if it’s UTF-8 encoded?

thomthom · July 8, 2009, 6:38pm

On Jul 8, 2009, at 11:30 AM, Thomas T. wrote:

I checked the $KCODE variable and it returns “UTF8”.

Now, what does that do to Ruby?

I answer that question in detail in this article:

http://blog.grayproductions.net/articles/the_kcode_variable_and_jcode_library

Why does File.exist?(‘c:\Test æøå’) fail if it’s UTF-8 encoded?

The IO methods are not $KCODE aware. You will likely need to
transcode the Strings you pass them.

James Edward G. II

thomthom · July 8, 2009, 6:48pm

On Jul 8, 2009, at 11:42 AM, Thomas T. wrote:

Why does File.exist?(‘c:\Test æøå’) fail if it’s UTF-8
encoded?

The IO methods are not $KCODE aware. You will likely need to
transcode the Strings you pass them.

James Edward G. II

What does the IO method require?

That’s a good question. I’m not sure what it does on Windows.

Is it the Ruby IO methods or the system methods it calls that
doesn’t handle UTF-8?

I assume it’s the underlying Windows API, though I’m just guessing
there.

Windows’ NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?

I think it depends on which API methods you call, so I’m guessing you
cannot do this. I think Ruby would need to be changed to use those
methods first.

I’m trying to avoid transcoding to an 8-bit-only encoding as that’ll just
cause grief when I encounter characters outside the range.

Have you had a look at Ruby 1.9 yet? I’m wondering if this issue has
been improved there, using the new encoding support. I don’t know
that it has. I’m more just wondering out-loud…

James Edward G. II

thomthom · July 8, 2009, 6:59pm

James G. wrote:

On Jul 8, 2009, at 11:42 AM, Thomas T. wrote:

Why does File.exist?(‘c:\Test æøå’) fail if it’s UTF-8
encoded?

The IO methods are not $KCODE aware. You will likely need to
transcode the Strings you pass them.

James Edward G. II

What does the IO method require?

That’s a good question. I’m not sure what it does on Windows.

Any clues what it does on OS X? The scripts will run on Macs as well.

Is it the Ruby IO methods or the system methods it calls that
doesn’t handle UTF-8?

I assume it’s the underlying Windows API, though I’m just guessing
there.

Windows’ NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?

I think it depends on which API methods you call, so I’m guessing you
cannot do this. I think Ruby would need to be changed to use those
methods first.

Since NTFS supports UTF, then I guess it’s the Ruby API that calls the
wrong WinAPIs?
Can I make my own API calls?

I’m trying to avoid transcoding to an 8-bit-only encoding as that’ll just
cause grief when I encounter characters outside the range.

Have you had a look at Ruby 1.9 yet? I’m wondering if this issue has
been improved there, using the new encoding support. I don’t know
that it has. I’m more just wondering out-loud…

James Edward G. II

The scripts I write are plugins for Google Sketchup, so the Ruby version
I have at my disposal is the one Sketchup bundles: a partial 1.8 version.
While searching for solutions I’ve noticed that v1.9 has
better support for various encodings, but unfortunately it’s of no use
to me.

So my problem is that I have to deal with string data that comes from
Sketchup in UTF-8 format. I might even have to deal with files and folders
that include characters outside the Windows-1252 or ISO-8859 range
(whichever the IO functions are using; I’ve not been able to pinpoint
this). If I get characters outside that range it’s impossible to
transcode.
And I also don’t know what would happen for an Eastern user. I’m
wondering if the IO functions would assume a different 8-bit encoding…

thomthom · July 8, 2009, 6:42pm

James G. wrote:

On Jul 8, 2009, at 11:30 AM, Thomas T. wrote:

I checked the $KCODE variable and it returns “UTF8”.

Now, what does that do to Ruby?

I answer that question in detail in this article:

http://blog.grayproductions.net/articles/the_kcode_variable_and_jcode_library

Why does File.exist?(‘c:\Test æøå’) fail if it’s UTF-8 encoded?

The IO methods are not $KCODE aware. You will likely need to
transcode the Strings you pass them.

James Edward G. II

What does the IO method require? Is it the Ruby IO methods or the system
methods they call that don’t handle UTF-8?
Windows’ NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?
I’m trying to avoid transcoding to an 8-bit-only encoding as that’ll just
cause grief when I encounter characters outside the range.

thomthom · July 8, 2009, 7:01pm

From: “James G.” [email protected]

Have you had a look at Ruby 1.9 yet? I’m wondering if this issue has been improved there, using the new encoding support. I
don’t know that it has. I’m more just wondering out-loud…

It’s only begun to improve as of the ruby 1.9.2 development version.
(1.9.1 and earlier use the 8-bit windows file API routines.)

This ruby-core post provides a partial list of methods in 1.9.2dev
which now work with windows unicode paths, as of
1.9.2dev (2009-06-24) [i386-mswin32_71]

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/24010

Regards,

Bill

thomthom · July 8, 2009, 7:22pm

I just tried on a Mac - It worked fine with Norwegian letters there. So
it seems that Ruby 1.8 on OSX calls UTF-8 aware IO system calls.

Then it’s the question of what encoding is used on Windows.

And can I call UTF-8 aware Windows IO API methods myself, bypassing the
built-in Ruby ones?
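Given the Mac result above, a plugin could leave paths untouched there and only convert on Windows. The following is a sketch under stated assumptions: the helper name path_for_io is hypothetical, and the Windows branch assumes the system's 8-bit code page agrees with ISO-8859-1 for the characters involved, which holds for the Norwegian letters in this thread but depends on the system locale.

```ruby
# Hypothetical helper: prepare a UTF-8 path for Ruby 1.8's File methods.
# Returns nil when the path cannot be represented in single bytes.
def path_for_io(utf8_path, windows = (RUBY_PLATFORM =~ /mswin|mingw/))
  # OS X (and Linux) accept UTF-8 paths as they are.
  return utf8_path unless windows

  cps = utf8_path.unpack('U*')
  # Assumption: code points up to 255 match the system's 8-bit code page.
  return nil if cps.any? { |cp| cp > 255 }
  cps.pack('C*')
end

# Usage:
#   path = path_for_io(path_from_sketchup)
#   File.exist?(path) if path
```

Returning nil makes the unconvertible case visible to the caller instead of handing a damaged path to the File class.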

thomthom · July 8, 2009, 7:28pm

From: “Thomas T.” [email protected]

James G. wrote:

That’s a good question. I’m not sure what it does on Windows.

Any clues what it does on OS X? The scripts will run on Macs as well.

Unlike that other OS, both OS X and Linux have taken an approach
I like to refer to as, NOT MIND-NUMBINGLY STUPID.

In OS X and Linux, one can use the same API calls one has always
used, as they are now UTF-8 savvy.

Windows’ NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?

I think it depends on which API methods you call, so I’m guessing you
cannot do this. I think Ruby would need to be changed to use those
methods first.

Since NTFS supports UTF, then I guess it’s the Ruby API that calls the
wrong WinAPIs?
Can I make my own API calls?

In ruby 1.8 embedded into our C++ application, I’ve created hooks
so that I can call our unicode-savvy C++ routines from ruby.

I suppose it may be possible to do this without involving a
ruby C extension, assuming the ruby Win32API module can
be made to call routines like _wopen and such. I haven’t tried that.
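If the Win32API route were tried, the wide-character routines would need the path as NUL-terminated UTF-16LE rather than UTF-8. The sketch below covers only that conversion step, using pack and unpack so it should also work on Ruby 1.8. The helper name utf8_to_utf16le_z is hypothetical, and actually passing the result to a routine such as _wopen through Win32API is untested here.

```ruby
# Hypothetical helper: UTF-8 string -> NUL-terminated UTF-16LE byte string.
def utf8_to_utf16le_z(utf8)
  units = []
  utf8.unpack('U*').each do |cp|
    if cp > 0xFFFF
      # Characters beyond the Basic Multilingual Plane need a surrogate pair.
      cp -= 0x10000
      units << (0xD800 + (cp >> 10)) << (0xDC00 + (cp & 0x3FF))
    else
      units << cp
    end
  end
  units << 0            # terminating NUL code unit
  units.pack('v*')      # 'v' = 16-bit unsigned, little-endian
end

wide = utf8_to_utf16le_z("a\303\246")   # "aæ"
```

This route avoids the 8-bit limitation altogether, since UTF-16 can represent every character Sketchup can hand over in UTF-8.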

this.). If I get characters outside that range it’s impossible to
transcode.
And I also don’t know what would happen for an Eastern user. I’m
wondering if the IO functions would assume a different 8bit encoding…

For best 8-bit compatibility you’ll want to encode to Windows1252.

But, this (of course) won’t help at all with chinese characters, etc.

Regards,

Bill

thomthom · July 8, 2009, 7:06pm

Bill K. wrote:

From: “James G.” [email protected]

Have you had a look at Ruby 1.9 yet? I’m wondering if this issue has been improved there, using the new encoding support. I
don’t know that it has. I’m more just wondering out-loud…

It’s only begun to improve as of the ruby 1.9.2 development version.
(1.9.1 and earlier use the 8-bit windows file API routines.)

This ruby-core post provides a partial list of methods in 1.9.2dev
which now work with windows unicode paths, as of
1.9.2dev (2009-06-24) [i386-mswin32_71]

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/24010

Regards,

Bill

I see. So Ruby calls Win32 APIs that don’t handle UTF-8. But what do
they use then? Windows-1252? (Or would that be system dependent?) If
it’s not a fixed character set, is there any way of finding that out,
so that I have a chance to transcode correctly?
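One possible way to find out, as a sketch: Windows reports the code page used by its 8-bit file APIs through the kernel32 function GetACP, which Ruby 1.8 can reach through the Win32API module. Whether Sketchup's partial Ruby ships Win32API is an assumption; the helper names are hypothetical, and mapping the number to a name of the form "CP1252" for use with Iconv is also an assumption about the Iconv build.

```ruby
# Hypothetical helper: the system ANSI code page number (e.g. 1252),
# or nil when not on Windows or when Win32API is unavailable.
def ansi_code_page
  return nil unless RUBY_PLATFORM =~ /mswin|mingw/
  require 'Win32API'
  Win32API.new('kernel32', 'GetACP', [], 'L').call
rescue LoadError
  nil
end

# Hypothetical helper: an encoding name to try with Iconv, or nil.
def ansi_encoding_name
  cp = ansi_code_page
  cp && "CP#{cp}"
end
```

On a Western European system this would be expected to report 1252, while an Eastern system would report a different code page, which is why a fixed transcoding target is risky.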