State of Unicode support

cpeterson · August 1, 2006, 11:40am

On 8/1/06, Daniel DeLorme wrote:

round and round in circles; if we as a community could identify some
“weird” characters. IMHO no amount of “transparent support” will change that.
But I would love to be shown otherwise with examples of languages that “do it
right”.

By transparent I mean that I can iterate, compare, match, index, …
not only bytes but also at least code points (and grapheme clusters, if
somebody is kind enough to implement them, though that is not very
important to me right now), all using the standard string class that
every standard function accepts.

In Ruby 1.8, working with anything but bytes is like scratching your
right ear with your left hand … or leg.
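
For illustration, here is a rough sketch of what that kind of transparency can look like on the standard String class. It assumes a much later Ruby (2.5 or newer, for `each_grapheme_cluster` and `unicode_normalize`); none of it is available in Ruby 1.8, and the sample string is made up for the example.

```ruby
# Assumption: Ruby 2.5+; this does not work in Ruby 1.8.
# "é" written as e + combining acute accent, then "t", then a precomposed "é".
s = "e\u0301t\u00e9"

bytes    = s.bytes                       # raw UTF-8 bytes
codes    = s.codepoints                  # Unicode code points
clusters = s.each_grapheme_cluster.to_a  # user-perceived characters

puts bytes.length     # 6
puts codes.length     # 4
puts clusters.length  # 3

# Indexing and matching work in characters (code points), not bytes.
t_index = s.index("t")  # 2
last    = s[3]          # the precomposed "é"

# Comparing two spellings of "é" needs normalization first.
same = "e\u0301".unicode_normalize(:nfc) == "\u00e9"  # true
```

The same string gives three different lengths depending on the unit, which is why it matters that the standard functions agree on which one they use.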

Thanks

Michal

cpeterson · August 1, 2006, 8:43pm

Daniel DeLorme wrote:

I second that. I see a lot of people asking for “transparent” Unicode
support but I don’t see how that is possible. To me it’s like asking for
a language that has transparent bug recovery. I know that Ruby has
weaknesses when it comes to multibyte encodings, but the main problem is
human in nature; too many people assume that char==byte, which results
in bugs when someone unexpectedly uses “weird” characters. IMHO no
amount of “transparent support” will change that. But I would love to be
shown otherwise with examples of languages that “do it right”.
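
A small sketch of the char==byte bug described here, assuming Ruby 1.9.3 or later and UTF-8 strings. The helper names `truncate_bytes` and `truncate_chars` are hypothetical, made up for this example.

```ruby
# Hypothetical helpers for illustration; assumes Ruby 1.9.3+ and UTF-8 input.

# Cuts at a byte offset, so it can split a multibyte character in half.
def truncate_bytes(str, n)
  str.byteslice(0, n)
end

# Cuts at a character offset, so characters stay intact.
def truncate_chars(str, n)
  str[0, n]
end

name = "Zo\u00eb M\u00fcller"  # "Zoë Müller"; "ë" takes two bytes in UTF-8

broken = truncate_bytes(name, 3)
ok     = truncate_chars(name, 3)

puts broken.valid_encoding?  # false: the cut landed inside "ë"
puts ok.valid_encoding?      # true
puts ok                      # Zoë
```

With plain ASCII names both helpers behave the same, which is how the assumption survives testing until someone enters a “weird” character.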

It can be done. Java gets it almost right, and in such a way that most
people will never stub their toes on the flaws. Python, it seems, is
going to get it right next time around. It’s clearly possible to do
Unicode correctly. What Matz wants is much harder; a String type that
can contain strings of characters from arbitrary character sets in
arbitrary encodings, Unicode being just one special case, and also serve
as a byte buffer.
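
As a sketch of both points, assuming Ruby 1.9 or later, where each String carries its own encoding: the first part shows one String class holding UTF-8 text, Latin-1 text and a raw byte buffer; the second shows what is presumably the Java flaw alluded to, namely that a character outside the Basic Multilingual Plane takes two UTF-16 code units. The sample strings are made up for the example.

```ruby
# Assumption: Ruby 1.9+; the variable names are only for illustration.
utf8   = "r\u00e9sum\u00e9"                       # "résumé" as UTF-8 text
latin1 = utf8.encode("ISO-8859-1")                # same characters, different bytes
raw    = utf8.dup.force_encoding(Encoding::BINARY) # same bytes, no character meaning

puts utf8.length      # 6 characters
puts utf8.bytesize    # 8 bytes
puts latin1.bytesize  # 6 bytes
puts raw.length       # 8: a binary string counts bytes

round_trip = latin1.encode("UTF-8") == utf8  # true

# A musical G clef (U+1D11E) is one code point but two UTF-16 code units,
# so a length measured in UTF-16 units reports 2 for it.
clef        = "\u{1D11E}"
utf16_units = clef.encode("UTF-16LE").bytesize / 2  # 2
code_points = clef.length                           # 1
```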

-Tim

cpeterson · August 1, 2006, 1:45pm

On 1-aug-2006, at 12:05, Michal S. wrote:

Last time I looked ICU was in C++. Requiring a C++ compiler and
runtime is quite a bit of bloat.

It still is. And it’s huge and takes ages to build. If only I knew
something much lighter and better I would have dismissed it.

I am not sure how large that might be. But if it is about the size of
the interpreter including the rest of the standard libraries I would
consider it “heavyweight”. It would be a reason to start “optional
standard libraries”, I guess.

I’m stopping right here. Unicode is not an option.

It has also been said that giving more options does not stop you from
using only Unicode.

In 90% of cases, giving more options means programmers ignore
Unicode, for reasons ranging from speed to ignorance. My experience
with users over the years has proven it.

But then again, I stop right here. And I urge you to do the same.