On 15-jun-2006, at 2:11, Charles O Nutter wrote:

> with unicode support, it would not take much effort to modify
> JRuby. So then there’s a simple question:

Yukihiro M. wrote:

> Define “proper Unicode support” first.
> I’m planning enhancing Unicode support in 1.9 in a year or so
> (finally). But I’m not sure that conforms your definition of “proper
> Unicode support”. Note that 1.8 handles Unicode (UTF-8) if your
> string operations are based on Regexp.
Hello everyone, and sorry for chiming in so fiercely. I got into some
confusion with the ML controls.
I just joined the list after seeing the subject pop up once more. I am
doing Unicode-aware apps in Rails and Ruby right now and it hurts.
I’ll try to define “proper Unicode support” as I (dream of it at
night) see it.
- All string indexing (length, index, slice, insert) works with
characters instead of bytes, whatever length in bytes the characters
have to be.
String methods (index or =~) should never return offsets that will
damage the string’s characters if employed for slicing - you
shouldn’t have to manually translate a byte offset of 2 into a
character offset of 1 just because one of the preceding characters is
multibyte.
Simple example:

    def translate_offset(str, byte_offset)
      chunk = str[0..byte_offset]
      begin
        chunk.unpack("U*").length - 1
      rescue ArgumentError # this offset is just wrong! shift upwards and retry
        chunk = str[0...(byte_offset += 1)]
        retry
      end
    end
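For what it’s worth, the same dance can be exercised end to end. This is a sketch, not part of the original helper: it assumes a 1.9-or-later Ruby (where String#byteslice exists and [] is character-based), and “café!” is a made-up sample string.

```ruby
# Byte-offset-to-character-offset translation, sketched with byteslice
# so the chunk is taken by bytes regardless of how [] indexes.
def translate_offset(str, byte_offset)
  chunk = str.byteslice(0..byte_offset)
  begin
    chunk.unpack("U*").length - 1
  rescue ArgumentError # mid-character offset: widen the chunk by a byte and retry
    chunk = str.byteslice(0..(byte_offset += 1))
    retry
  end
end

str = "café!" # "é" occupies bytes 3 and 4 in UTF-8
translate_offset(str, 4) # => 3 (byte 4 is inside "é", which is character 3)
translate_offset(str, 3) # => 3 (byte 3 lands mid-character; retry widens past it)
```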
I think it’s unnecessarily painful for something as easy as string
=~ /pattern/. Yes, you can take the offset you receive from =~,
slice the string, and then split that slice again with /./mu to get
the same number in characters, etc.
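The dance in question looks roughly like this (a sketch; the sample string is made up):

```ruby
str = "héllo wörld"
# Counting characters with a multibyte-aware regexp, one character per
# match - the workaround for byte-based length:
char_count = str.scan(/./mu).length # 11 characters, not 13 bytes
```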
- Case-insensitive regexes actually work. Even in my Oniguruma-enabled
builds of 1.8.2 it was not true (maybe that has changed now). At the
least, “Unicode general” collation casefolding (such a thing exists)
should be available built-in on every platform.
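What I mean, sketched on a Ruby where built-in Unicode casefolding does work (the strings are made up examples):

```ruby
# Unicode-aware casefolding and case-insensitive matching:
"ÉTÉ".downcase # => "été" - not a no-op on non-ASCII letters
"É" =~ /é/i    # => 0     - /i actually folds beyond ASCII
```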
- Locale-aware sorting, including multibyte charsets, if provided by
the OS
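To illustrate why this matters - plain codepoint sorting (a sketch with made-up strings) puts accented letters after “z”:

```ruby
# Codepoint order is not locale order: "é" (U+00E9) sorts after "z" (U+007A),
# while a French locale would want "é" next to "e".
["z", "é", "a"].sort # => ["a", "z", "é"]
```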
- Preferably a separate (and strictly purposed) Bytestring that you
get out of Sockets and use in servers etc. - or the ability to
“force” all strings received from external resources to be flagged
uniformly as being of a certain encoding in your program, not
somewhere in someone’s library. If flags have to be set by libraries,
they won’t be set, because most developers sadly don’t care:
http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
http://thraxil.org/users/anders/posts/2005/11/01/unicodification/
- Unicode-aware strip dealing with weirdo whitespaces (hair space,
thin space etc.)
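A sketch of the difference (the padded string is made up; the [[:space:]] class relies on the regexp engine knowing Unicode whitespace):

```ruby
padded = "\u{200A}hello\u{2009}" # hair space in front, thin space behind
# An ASCII-only strip leaves exotic whitespace in place; a Unicode-aware
# strip can be faked with a whitespace character class:
padded.gsub(/\A[[:space:]]+|[[:space:]]+\z/u, "") # => "hello"
```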
- And no, as I mentioned - 1.8 doesn’t handle it properly, because
the /i modifier is broken, and to deal without it you need to
downcase BOTH the regexp and the string itself. A closed circle - you
go and get the Unicode gem with its tables.
All of this can be controlled either per String (in which case 99 out
of 100 libraries I use will get it wrong - see above) or by a global
setting such as $KCODE.
As an example, something that is ridiculously backwards to do in Ruby
right now is this (I spent some time refactoring it today):
http://dev.rubyonrails.org/browser/trunk/actionpack/lib/action_view/helpers/text_helper.rb#L44
Here you have a major problem, because the /i flag doesn’t do anything
(Ruby is incapable of Unicode-aware casefolding), and using offsets
means that you are always one step away from damaging someone’s text.
It’s just wrong that it has to be so painful.
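A character-safe variant of such a helper can be sketched like this (truncate_chars is a made-up name, not the Rails API, and the sample string is invented):

```ruby
# Truncate by characters, never slicing into the middle of a multibyte
# character: split with a multibyte-aware regexp first, then cut.
def truncate_chars(text, length = 30, omission = "...")
  chars = text.scan(/./mu) # one array element per character, not per byte
  chars.length > length ? chars[0, length].join + omission : text
end

truncate_chars("Привет, мир", 6) # => "Привет..."
```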
Python3000, IMO, gets this right (as does Java) - a byte array and a
String are completely separate, and a String operates with characters
and characters only.
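In Ruby terms that separation would look something like this (a sketch using later-Ruby methods such as String#b, which are an assumption here, not something 1.8 offers):

```ruby
text = "héllo" # a character string
raw  = text.b  # a binary copy: the "Bytestring" view of the same data
text.length    # => 5 characters
raw.length     # => 6 bytes ("é" is two bytes in UTF-8)
```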
That’s what I would expect. I hope this makes sense somewhat.
Julian ‘Julik’ Tarkhanov
please send all personal mail to
me at julik.nl