On 15-jun-2006, at 2:11, Charles O Nutter wrote:

> with unicode support, it would not take much effort to modify
> JRuby. So then there’s a simple question:

Yukihiro M. wrote:

> Define “proper Unicode support” first.
> I’m planning enhancing Unicode support in 1.9 in a year or so
> (finally). But I’m not sure that conforms your definition of “proper
> Unicode support”. Note that 1.8 handles Unicode (UTF-8) if your
> string operations are based on Regexp.
Hello everyone, and sorry for chiming in so fiercely. I got into some
confusion with the ML controls.
I just joined the list after seeing the subject pop up once more. I am
doing Unicode-aware apps in Rails and Ruby right now and it hurts.
I’ll try to define “proper Unicode support” as I (dream of it at
night) see it.
- All string indexing (length, index, slice, insert) works with
characters instead of bytes, whatever length in bytes the characters
have to be.
String methods (index or =~) should never return offsets that will
damage the string’s characters if employed for slicing - you
shouldn’t have to manually translate a byte offset of 2 into a
character offset of 1 just because one of the preceding characters is
multibyte.
Simple example:

    def translate_offset(str, byte_offset)
      chunk = str[0..byte_offset]
      begin
        chunk.unpack("U*").length - 1
      rescue ArgumentError # this offset is just wrong! shift upwards and retry
        chunk = str[0...(byte_offset += 1)]
        retry
      end
    end
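For what it’s worth, the same dance can be exercised end to end. This is a sketch, not part of the original helper: it assumes a 1.9-or-later Ruby (where String#byteslice exists and [] is character-based), and “café!” is a made-up sample string.

```ruby
# Byte-offset-to-character-offset translation, sketched with byteslice
# so the chunk is taken by bytes regardless of how [] indexes.
def translate_offset(str, byte_offset)
  chunk = str.byteslice(0..byte_offset)
  begin
    chunk.unpack("U*").length - 1
  rescue ArgumentError # mid-character offset: widen the chunk by a byte and retry
    chunk = str.byteslice(0..(byte_offset += 1))
    retry
  end
end

str = "café!" # "é" occupies bytes 3 and 4 in UTF-8
translate_offset(str, 4) # => 3 (byte 4 is inside "é", which is character 3)
translate_offset(str, 3) # => 3 (byte 3 lands mid-character; retry widens past it)
```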
I think it’s unnecessarily painful for something as easy as string
=~ /pattern/. Yes, you can take the offset you receive from =~,
slice the string, and then split that slice again with /./mu to get
the same number in characters, etc.
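The dance in question looks roughly like this (a sketch; the sample string is made up):

```ruby
str = "héllo wörld"
# Counting characters with a multibyte-aware regexp, one character per
# match - the workaround for byte-based length:
char_count = str.scan(/./mu).length # 11 characters, not 13 bytes
```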
- Case-insensitive regexes actually work. Even in my Oniguruma-enabled
builds of 1.8.2 it was not true (maybe that has changed now). At the
least, “Unicode general” collation casefolding (such a thing exists)
should be available built-in on every platform.
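What I mean, sketched on a Ruby where built-in Unicode casefolding does work (the strings are made up examples):

```ruby
# Unicode-aware casefolding and case-insensitive matching:
"ÉTÉ".downcase # => "été" - not a no-op on non-ASCII letters
"É" =~ /é/i    # => 0     - /i actually folds beyond ASCII
```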
- Locale-aware sorting, including multibyte charsets, if provided by
the OS
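To illustrate why this matters - plain codepoint sorting (a sketch with made-up strings) puts accented letters after “z”:

```ruby
# Codepoint order is not locale order: "é" (U+00E9) sorts after "z" (U+007A),
# while a French locale would want "é" next to "e".
["z", "é", "a"].sort # => ["a", "z", "é"]
```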
- Preferably a separate (and strictly purposed) Bytestring that you
get out of Sockets and use in servers etc. - or the ability to
“force” all strings received from external resources to be flagged
uniformly as being of a certain encoding in your program, not
somewhere in someone’s library. If flags have to be set by libraries,
they won’t be set, because most developers sadly don’t care:
http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
http://thraxil.org/users/anders/posts/2005/11/01/unicodification/
- Unicode-aware strip dealing with weirdo whitespaces (hair space,
thin space etc.)
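A sketch of the difference (the padded string is made up; the [[:space:]] class relies on the regexp engine knowing Unicode whitespace):

```ruby
padded = "\u{200A}hello\u{2009}" # hair space in front, thin space behind
# An ASCII-only strip leaves exotic whitespace in place; a Unicode-aware
# strip can be faked with a whitespace character class:
padded.gsub(/\A[[:space:]]+|[[:space:]]+\z/u, "") # => "hello"
```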
- And no, as I mentioned - 1.8 doesn’t handle it properly, because
the /i modifier is broken, and to deal without it you need to
downcase BOTH the regexp and the string itself. A closed circle - you
go and get the Unicode gem with its tables.
All of this can be controlled either per String (in which case 99 out
of 100 libraries I use will get it wrong - see above) or by a global
setting such as $KCODE.
As an example, something that is ridiculously backwards to do in Ruby
right now is this (I spent some time refactoring it today):
http://dev.rubyonrails.org/browser/trunk/actionpack/lib/action_view/helpers/text_helper.rb#L44
Here you have a major problem, because the /i flag doesn’t do anything
(Ruby is incapable of Unicode-aware casefolding), and using offsets
means that you are always one step away from damaging someone’s text.
It’s just wrong that it has to be so painful.
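A character-safe variant of such a helper can be sketched like this (truncate_chars is a made-up name, not the Rails API, and the sample string is invented):

```ruby
# Truncate by characters, never slicing into the middle of a multibyte
# character: split with a multibyte-aware regexp first, then cut.
def truncate_chars(text, length = 30, omission = "...")
  chars = text.scan(/./mu) # one array element per character, not per byte
  chars.length > length ? chars[0, length].join + omission : text
end

truncate_chars("Привет, мир", 6) # => "Привет..."
```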
Python3000, IMO, gets this right (as does Java) - a byte array and a
String are completely separate, and a String operates with characters
and characters only.
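In Ruby terms that separation would look something like this (a sketch using later-Ruby methods such as String#b, which are an assumption here, not something 1.8 offers):

```ruby
text = "héllo" # a character string
raw  = text.b  # a binary copy: the "Bytestring" view of the same data
text.length    # => 5 characters
raw.length     # => 6 bytes ("é" is two bytes in UTF-8)
```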
That’s what I would expect. I hope this makes sense somewhat.
Julian ‘Julik’ Tarkhanov
please send all personal mail to
me at julik.nl