Unicode roadmap?

On 6/20/06, Michal S. [email protected] wrote:

If I read pieces of text from web pages they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subset of
unicode.

Having different encodings on one web page is a good way to make sure
that the page won’t display correctly, since all the browsers I know
of display all text on a page using just one encoding. Granted, if
the encoding is a subset of unicode, it may still manage to work out,
but personally I keep running into pages that display some of the
characters as garbage no matter what encoding I instruct my browser
to use. So, no, I don’t think it should be valid to concatenate
strings with different encodings.

On Jun 20, 2006, at 14:54, Timothy B. wrote:

Having different encodings on one web page is a good way to make sure
that the page won’t display correctly […] So, no, I don’t think it
should be valid to concatenate strings with different encodings.
So we shouldn’t do it because it doesn’t work in web browsers?

Hopefully we don’t apply that criterion globally, or we’d never get
anything done.

On 6/20/06, Timothy B. [email protected] wrote:

the page won’t display correctly, since all the browsers I know of display
all text on a page using just one encoding. Granted, if the encoding is a
subset of unicode, it may still manage to work out, but personally I keep
running into pages that display some of the characters as garbage no matter
what encoding I instruct my browser to use. So, no, I don’t think it should
be valid to concatenate strings with different encodings.

No, I meant that the strings are, of course, converted to a common
encoding such as utf-8 before they are concatenated.
The point is that you do not have to care which encoding the pieces
arrived in, or convert them manually to a common encoding, if the
string class can do it automatically for you.

Thanks

Michal

On Jun 19, 2006, at 6:31 AM, Austin Z. wrote:

This entire discussion is centered around a proposal to do exactly
that. There are many very good reasons to avoid doing this. Unicode
Is Not Always The Answer.

It’s usually the answer, but there are times when it’s just easier
to work with data in an established code page.

To enlighten the ignorant, could you describe one or two scenarios
where a Unicode-based String class would get in the way? To use your
words, make things less easy? I would probably not agree that there
are “many good” reasons to avoid this, but probably that’s just
because I’ve been fortunate enough to not encounter the problem
scenarios. This material would have application in a far larger
domain than just Ruby, obviously. -Tim

On Jun 20, 2006, at 8:09 AM, Michal S. wrote:

If I read pieces of text from web pages they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subset of
unicode.

I’m not sure I understand what ‘subset of unicode’ means.

Do you mean two different encodings of Unicode code points?
As in ‘UTF-8 and UTF-16 are subsets of Unicode’?

That usage seems unusual to me. Are you using ‘subset’ and ‘encoding’
as synonyms, or am I missing a subtle difference?

Gary W.

On Jun 20, 2006, at 6:54 AM, Timothy B. wrote:

Having different encodings on one web page is a good way to make sure
that the page won’t display correctly

So, no, I don’t think it should be valid to concatenate strings with
different encodings.

Well, unless you had a String class that took care of the encoding
details and, when you were ready to output, allowed you to say “Give
me that in ISO-8859 or UTF-8 or whatever”. -Tim

Hi,

In message “Re: Unicode roadmap?”
on Tue, 20 Jun 2006 23:33:43 +0900, “Michal S.”
[email protected] writes:

|No, I meant that the strings are, of course, converted to a common
|encoding such as utf-8 before they are concatenated.
|The point is that you do not have to care which encoding the pieces
|arrived in, or convert them manually to a common encoding, if the
|string class can do it automatically for you.

If you choose to convert all input text data into Unicode (and convert
them back at output), there’s no need for unreliable automatic
conversion.
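
A minimal sketch of that convert-at-the-boundary model, using the
iconv standard library (the file names and the EUC-JP input encoding
are assumptions for illustration, not from the thread):

  require 'iconv'

  INTERNAL = 'UTF-8'  # the one encoding used inside the application

  # Convert once on input...
  raw  = File.read('input.txt')              # assume this file is EUC-JP
  text = Iconv.conv(INTERNAL, 'EUC-JP', raw)

  # ...work with `text` only in the internal encoding, then convert
  # back once on output.
  File.open('output.txt', 'w') do |f|
    f.write(Iconv.conv('EUC-JP', INTERNAL, text))
  end

An error in the input data then surfaces at the Iconv.conv call at
the boundary, rather than deep inside the application.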

						matz.

On 6/20/06, Gary W. [email protected] wrote:

As in ‘UTF-8 and UTF-16 are subsets of Unicode’?

That usage seems unusual to me. Are you using ‘subset’ and ‘encoding’
as synonyms, or am I missing a subtle difference?

I mean that the iso-8859-1 and iso-8859-2 encodings (as well as many
others) encode a subset of the characters available in Unicode, and
in any of its utf-* encodings. Thus any string encoded using such an
encoding can be losslessly and automatically converted to an encoding
of full unicode such as utf-8, and operations on several such
converted strings make sense even if the strings were encoded using
different encodings before the conversion.

The automatic conversion would simplify things if you get strings in
different encodings from outside sources such as various web pages,
databases, etc.
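
For illustration, a minimal sketch with the iconv standard library
(the byte values come from the ISO-8859-1 and ISO-8859-2 tables; the
rest is an assumption for the example):

  require 'iconv'

  latin1 = "\xE9"  # "é" in ISO-8859-1
  latin2 = "\xB3"  # "ł" in ISO-8859-2

  # Both encodings cover subsets of Unicode, so each string converts
  # losslessly to UTF-8, and the converted strings combine safely.
  utf8 = Iconv.conv('UTF-8', 'ISO-8859-1', latin1) +
         Iconv.conv('UTF-8', 'ISO-8859-2', latin2)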

Thanks

Michal

On 6/20/06, Yukihiro M. [email protected] wrote:

If you choose to convert all input text data into Unicode (and convert
them back at output), there’s no need for unreliable automatic
conversion.

Well, it’s actually you who chose the conversion on input for me.
Since the strings aren’t automatically converted, I have to ensure
that all my strings use the same encoding. And the only reasonable
way I can think of is to convert any string that enters my
application (or class) to an arbitrary encoding I choose in advance.

This is no more reliable than automatic conversion. The
(un)reliability of the conversion depends on how (un)reliably the
actual encoding of the string was determined when it was obtained.
If the encoding tag is wrong, the string will be converted
incorrectly. That is the only cause of incorrect conversion, whether
it happens manually or automatically.

If conversion were done automatically by the string class, it could
be performed lazily. The strings would be kept in the encoding in
which they were obtained, and converted only when needed because they
are combined with a string in a different encoding. And users of the
strings would still have the choice to convert them explicitly when
they see fit.
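
A toy sketch of that lazy scheme (the TaggedString class is entirely
hypothetical, not an existing Ruby API; conversion is done with the
iconv standard library):

  require 'iconv'

  # Hypothetical: a string that remembers its encoding and converts
  # lazily, only when combined with a string in a different encoding.
  class TaggedString
    attr_reader :bytes, :encoding

    def initialize(bytes, encoding)
      @bytes, @encoding = bytes, encoding
    end

    def +(other)
      if encoding == other.encoding
        # Same encoding: no conversion needed at all.
        TaggedString.new(bytes + other.bytes, encoding)
      else
        # Promote both operands to a common encoding on demand.
        TaggedString.new(
          Iconv.conv('UTF-8', encoding, bytes) +
          Iconv.conv('UTF-8', other.encoding, other.bytes),
          'UTF-8')
      end
    end
  end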

When such automatic conversion is not available it makes interfacing
with libraries that fetch external data more difficult.

a) I could instruct the library that fetches data from a database or
the web to always return it in the encoding I chose for representing
strings in my application, regardless of the encoding the data was
originally obtained in.
The disadvantage is that if the encoding was determined incorrectly
on input to the library, the data is already garbled.

b) I could get the data from the library in the original encoding in
which it was obtained, either because I would like to check that the
encoding is correct before converting the data, or because the
library does not implement the interface for (a).
The disadvantage is that I have to traverse a potentially complex
data structure and convert all strings so that they work with the
other strings inside my application (see the sketch after this list).

c) Every time I perform a string operation I could first check
(manually) that the two strings are compatible (or catch the
exception very near the operation so that I can convert the arguments
and retry).
I do not think this is a reasonable option for the common case, which
should be made as simple as possible: strings that can be represented
in Unicode. This may be necessary to some extent in applications
dealing with encodings that are incompatible with Unicode, but it
should not be required for the common case.

The people with experience from other languages are complaining that
they have to do (b) or (c) because (a) is usually not implemented.
And ensuring any of the three looks like an additional problem that
could be solved elsewhere: in the string class.
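
To illustrate the burden in (b), a sketch of the traversal it
requires (the nested structure and the recode_all helper are made up
for the example; conversion uses the iconv standard library):

  require 'iconv'

  # Recursively convert every String in a nested structure returned
  # by a library into the application's internal encoding.
  def recode_all(obj, from, to = 'UTF-8')
    case obj
    when String then Iconv.conv(to, from, obj)
    when Array  then obj.map { |e| recode_all(e, from, to) }
    when Hash   then obj.inject({}) { |h, (k, v)|
                       h[k] = recode_all(v, from, to); h }
    else obj
    end
  end

  rows = [{ :name => "Mot\xF6rhead" }]   # ISO-8859-1 from a legacy DB
  rows = recode_all(rows, 'ISO-8859-1')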

Thanks

Michal

Hi,

In message “Re: Unicode roadmap?”
on Wed, 21 Jun 2006 20:45:38 +0900, “Michal S.”
[email protected] writes:

|> If you choose to convert all input text data into Unicode (and convert
|> them back at output), there’s no need for unreliable automatic
|> conversion.
|
|Well, it’s actually you who chose the conversion on input for me.
|Since the strings aren’t automatically converted, I have to ensure
|that all my strings use the same encoding. And the only reasonable
|way I can think of is to convert any string that enters my
|application (or class) to an arbitrary encoding I choose in advance.

Agreed. It is me. Perhaps you don’t know how terrible code
conversion can be. In the ideal world, lazy conversion seems
attractive, but reality bites. Conversions fail so easily.
Characters lost, text broken. Failures cannot be avoided for various
reasons, mostly historical ones we can’t fix anymore. When errors
happen (and they often do), it’s good to detect them as early as
possible, i.e. on input/output. So I encourage the universal
character set model as far as it is applicable. You may use UTF-8 or
ISO8859-1 for the universal character set. I may use EUC-JP for it.

Only in rare cases might there be a need to handle multiple encodings
in an application. I do want to allow that. But I am not sure how we
can help such applications, since they are fundamentally complex.
And we don’t have enough experience to design a framework for such
applications.

						matz.

On 6/21/06, Yukihiro M. [email protected] wrote:

Only in rare cases might there be a need to handle multiple encodings
in an application. I do want to allow that. But I am not sure how we
can help such applications, since they are fundamentally complex.
And we don’t have enough experience to design a framework for such
applications.

I can see one more problem with setting the encoding per file and
tagging string literals in it accordingly.
If operations on strings with different encodings always throw an
exception, problems can arise when one calls such a third-party
library from a script with a different encoding.

Here’s a small example:

library code in file some_utility.rb:

  # -*- coding: EUC-JP -*-

  module SomeUtility
    def SomeUtility.fancy_format(str)
      "" + str + ""  # these literals are tagged as EUC-JP, right?
    end
  end

application code in file my_app.rb:

  # -*- coding: UTF-8 -*-

  require 'some_utility'
  # this literal is tagged as UTF-8:
  puts SomeUtility.fancy_format("an utf8 string")

If the last call throws some kind of EncodingMismatchError, how do we
deal with that?
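
One conceivable way for the caller to cope, sketched with the iconv
standard library against the some_utility.rb example above (the
conversion choices are assumptions, and EncodingMismatchError itself
is hypothetical):

  require 'iconv'
  require 'some_utility'

  # Convert the argument into the library's encoding before the call,
  # and convert the result back into the application's encoding.
  arg = Iconv.conv('EUC-JP', 'UTF-8', "an utf8 string")
  out = Iconv.conv('UTF-8', 'EUC-JP', SomeUtility.fancy_format(arg))

This is exactly the kind of per-call boilerplate the rest of the
thread complains about.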

On 6/21/06, Yukihiro M. [email protected] wrote:

I recommend using the “ascii” encoding, which is the default, for
library files, unless you are sure what encoding your input data is
in. For localization, tools like gettext would help in dealing with
strings in the native encoding.

Just a thought. Might it be possible to have a new String literal for
what will be, I think, the most common encoding chosen (UTF-8)? That is,
in addition to:

  # -*- coding: EUC-JP -*-

  ""   # tagged as EUC-JP

We allow:

  # -*- coding: EUC-JP -*-

  ""   # tagged as EUC-JP
  u""  # tagged as UTF-8

Despite my belief that we should avoid an enforced universal encoding as
the String representation, I do plan on making most of my applications
and libraries UTF-8 friendly and aware. It’s extremely important that we
be able to work with this cleanly, and if I can simply do either u"foo"
or U"foo" I would find it much easier to deal with in those places where
I need UTF-8/Unicode support.

-austin

On 21-jun-2006, at 17:18, Yukihiro M. wrote:

|If the last call throws some kind of EncodingMismatchError, how do we
|deal with that?

I recommend using the “ascii” encoding, which is the default, for
library files, unless you are sure what encoding your input data is
in. For localization, tools like gettext would help in dealing with
strings in the native encoding.

Matz, this would be a disaster (if in such a situation a library
throws). It’s gonna be like Python.
Because it means that 99 percent of the libraries will throw.

Hi,

In message “Re: Unicode roadmap?”
on Thu, 22 Jun 2006 00:41:02 +0900, Julian ‘Julik’ Tarkhanov
[email protected] writes:

|Matz, this would be a disaster (if in such a situation a library
|throws). It’s gonna be like Python.
|Because it means that 99 percent of the libraries will throw.

Can you elaborate? I don’t want to see a disaster, whatever it is.

						matz.

Hi,

In message “Re: Unicode roadmap?”
on Wed, 21 Jun 2006 23:56:47 +0900, “Dmitry S.”
[email protected] writes:

|I can see one more problem with setting the encoding per file and
|tagging string literals in it accordingly.

Indeed.

|Here’s a small example:
|
|library code in file some_utility.rb:
|
|  # -*- coding: EUC-JP -*-
|  module SomeUtility
|    def SomeUtility.fancy_format(str)
|      "" + str + ""  # these literals are tagged as EUC-JP, right?
|    end
|  end
|
|application code in file my_app.rb:
|
|  # -*- coding: UTF-8 -*-
|  require 'some_utility'
|  puts SomeUtility.fancy_format("an utf8 string")  # tagged as UTF-8
|
|If the last call throws some kind of EncodingMismatchError, how do we
|deal with that?

I recommend using the “ascii” encoding, which is the default, for
library files, unless you are sure what encoding your input data is
in. For localization, tools like gettext would help in dealing with
strings in the native encoding.

						matz.

Hi,

In message “Re: Unicode roadmap?”
on Thu, 22 Jun 2006 00:34:27 +0900, “Austin Z.”
[email protected] writes:

|Just a thought. Might it be possible to have a new String literal for
|what will be, I think, the most common encoding chosen (UTF-8)? That is,
|in addition to:
|
|  # -*- coding: EUC-JP -*-
|  ""   # tagged as EUC-JP
|
|We allow:
|
|  # -*- coding: EUC-JP -*-
|  ""   # tagged as EUC-JP
|  u""  # tagged as UTF-8

I am not sure yet whether this is a good idea. If your “u” text
contains only ASCII characters, I see no need to tag it “UTF-8”, and
if it doesn’t, how do we prepare the text? I think, for example,

u"\346\235\276\346\234\254" => my family name in Kanji

is too ugly.

						matz.

On 21-jun-2006, at 18:20, Yukihiro M. wrote:

Can you elaborate? I don’t want to see a disaster, whatever it is.

I imagine that in the case mentioned, the encoding assumed for a
library will depend on the pragma in its source.

For instance, I am writing a program that needs to work with UTF-8
data, but one of the libraries I am using has ASCII in the pragma.
What is going to happen if I ship this library UTF-8 strings? Python
libraries just throw, because they do all kinds of non-Unicode-aware
operations on strings or request Unicode strings explicitly. So any
time you want to ship something to a library (or get something from
STDIN) you have to decode and encode.
As soon as you forget to, you get exceptions everywhere.

On 6/21/06, Yukihiro M. [email protected] wrote:

Can you elaborate? I don’t want to see a disaster, whatever it is.

                                                    matz.

Single scripts and small self-contained applications are almost
always written in the same codepage. Usually text data processing is
also done in that same codepage, which simplifies life a lot even
with the current String as byte vector. So recoding is an overhead
here, and external data is only recoded on input/output in a
relatively small number of well-defined places, using a known subset
of source and target encodings. In this case, when you know what to
expect from your file/network IO, things are OK.

It is also OK when part of a script is extracted and evolves into a
library, as long as you use it in the same environment.

But let’s consider a case where several third-party libraries are
used, all returning strings with different encodings. gettext for
libraries won’t solve everything, as even externalized strings will
have some particular encoding; e.g. localization libraries can’t fit
in ASCII only.

And now calls to methods will behave like some kind of IO with
respect to the encoding of passed parameters.
The number of I/O points grows drastically.

How can it be solved in a consistent and reliable manner?

a) just declare in the documentation: “Methods in these classes
require strings to be in UTF-16, you’ve been warned!!!”

So users of that code will have to remember those constraints and
enforce the encoding of their data before calling those methods.
With the dynamic nature of Ruby, things will break in unexpected
places. No, I dislike the idea of writing:

  str.enforce_encoding!(BooClass::INTERNAL_ENCODING)
  b = BooClass.new(str)

b) take care in the called methods to enforce the encoding:

  def process_formatting(str)
    str.enforce_encoding!(MY_INTERNAL_ENCODING)
    # now it is compatible with the rest of my code
    # and I can do something with it
  end

This is also too error-prone :(

And what about processing the results of calls? Take care of it in
the caller code?

  res_str = SomeUtil.fancy_format(str)
  res_str.enforce_encoding!(MY_INTERNAL_ENCODING)

With input parameters and returned results that represent complex
structures with String fields, things get even worse.

Who will ever cope with these issues?
Probably this is what Julik meant by “disaster”?

Things shouldn’t be that complicated.

On 21-jun-2006, at 19:17, Dmitry S. wrote:


Who will ever cope with these issues?
Probably this is what Julik meant by “disaster”?

Things shouldn’t be that complicated.

What I meant is the description of how you get a Python program
assembled from different libraries to be Unicode-aware.
If Ruby works like that, I won’t be happy. Basically, some libraries
accept Unicode in Python’s 16-bit form, some accept utf-8
bytestrings, and some can only grok ASCII and will throw up anyway.
These are not going to work on Python 3000, as I understand it.

On 6/21/06, Yukihiro M. [email protected] wrote:

i.e. on input/output. So I encourage the universal character set
model as far as it is applicable. You may use UTF-8 or ISO8859-1 for
the universal character set. I may use EUC-JP for it.

I do not see how converting the strings on input makes the situation
better than converting them later. The exact place where the text is
garbled by an incorrect conversion does not change the fact that it
is no longer usable, does it?
Well, it may be possible to detect characters that are invalid for a
certain encoding, either by scanning the string or by attempting a
conversion. But I would rather have optional checks that can be added
when something breaks or is likely to break than a forced conversion.

Or to put it another way: if I get a string from somewhere with an
incorrectly marked encoding, it is wrong and should be expected to
fail. And I can add checks if I think my source of data is not
reliable in this respect. But if I get a string that is marked
correctly and it fails because I did not manually convert it, that
is frustrating. And needlessly so.
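
For such an optional check, a minimal sketch with the iconv standard
library (the valid_in? helper is made up for the example; a failed
trial conversion is treated as “invalid bytes”):

  require 'iconv'

  # Returns true if str is valid in the given encoding, by attempting
  # a trial conversion and treating failure as invalid input.
  def valid_in?(str, encoding)
    Iconv.conv('UTF-8', encoding, str)
    true
  rescue Iconv::IllegalSequence, Iconv::InvalidCharacter
    false
  end

  valid_in?("\xE9", 'ISO-8859-1')  # => true  (0xE9 is "é" in Latin-1)
  valid_in?("\xE9", 'UTF-8')       # => false (broken UTF-8 sequence)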

Only in rare cases might there be a need to handle multiple encodings
in an application. I do want to allow that. But I am not sure how we
can help such applications, since they are fundamentally complex.
And we don’t have enough experience to design a framework for such
applications.

I do not think it is that rare. Most people want new web (or any
other) stuff in utf-8, but there is a need to interface with legacy
databases or applications. Sometimes converting the data to fit the
new application is not practical; for one, the legacy application
may still be in use as well.

Anyway, Ruby being as dynamic as it is, I should be able to add
support for automatic recoding myself quite easily. The problem is
that I would not be able to use it in libraries (should I ever write
some) without risking a clash with a similar feature added by
somebody else.

Thanks

Michal