On 6/17/06, Stefan L. [email protected] wrote:
Full ACK. Ruby programs shouldn’t need to care about the
internal string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.
This is incorrect. Most Ruby programs won’t need to care about the
internal string encoding. Experience suggests, however, that it is
only most of them. Definitely not all.
Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.
This could look like:
my_character_str = Encoding::UTF8.encode(my_byte_buffer)
buffer = Encoding::UTF8.decode(my_character_str)
Unnecessarily complex and inflexible. Before you go too much further, I
really suggest that you look in the archives and Google to find more
about Matz’s m17n String proposal. It’s a really good one, as it allows
developers (both pure Ruby and extension) to choose what is appropriate
with the ability to transparently convert as well.
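For illustration, a sketch of what that transparent, developer-chosen conversion looks like, using Ruby 1.9+ string methods (an assumption relative to this 2006 thread, where no such API existed yet):

```ruby
# Sketch using Ruby 1.9+ methods (assumed, not part of the proposal):
# the same bytes can be tagged with an encoding in place, or
# transcoded to a different encoding, at the developer's choice.
bytes = "Fran\xC3\xA7ais".b            # raw bytes, encoding ASCII-8BIT
str   = bytes.force_encoding("UTF-8")  # reinterpret in place, no copy
latin = str.encode("ISO-8859-1")       # transcode to another encoding
latin.bytesize                         # => 8 ("ç" is one byte in Latin-1)
str.bytesize                           # => 9 ("ç" is two bytes in UTF-8)
```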
- IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings.
String-IO operations work as expected.
I propose one of:
- A low level IO API that reads/writes ByteBuffers. String IO
can be implemented on top of this byte-oriented API.
[…]
- The File class/IO module as of current Ruby just gets
additional methods for binary IO (through ByteBuffers) and
an encoding attribute. The methods that do binary IO don’t
need to care about the encoding attribute.
I think the first option is cleaner.
I think neither is necessary and both would be a mistake. It is, as I
indicated to Juergen, sometimes impossible to determine the encoding
to be used for an IO until you have some data from the IO already.
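A hypothetical sketch of that chicken-and-egg problem, using Ruby 1.9+ APIs and a made-up detect_encoding helper (both are assumptions, not part of the proposal under discussion): you must read raw bytes first, then decide what encoding they are in.

```ruby
require 'stringio'

# Hypothetical helper: peek at the first raw bytes of an IO to pick
# an encoding (e.g. from a byte-order mark) before decoding any text.
def detect_encoding(io)
  head = io.read(4) || "".b              # raw bytes; no encoding known yet
  case head
  when /\A\xEF\xBB\xBF/n then "UTF-8"    # UTF-8 BOM
  when /\A\xFE\xFF/n     then "UTF-16BE" # UTF-16 big-endian BOM
  when /\A\xFF\xFE/n     then "UTF-16LE" # UTF-16 little-endian BOM
  else "US-ASCII"                        # no marker; real code sniffs further
  end
end

detect_encoding(StringIO.new("\xEF\xBB\xBFhello".b))  # => "UTF-8"
```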
- Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.
If the strings are represented as a sequence of Unicode codepoints, it
is possible for external libraries to implement more advanced Unicode
operations.
This would be true regardless of the encoding.
Since IMO a new “character” class would be overkill, I propose that
the String class provides codepoint-wise iteration (and indexing) by
representing a codepoint as a Fixnum. AFAIK a Fixnum consists of 31
bits on a 32 bit machine, which is enough to represent the whole range
of Unicode codepoints.
This does not match what Matz will be doing.
str = "Fran\303\247ais"
str[5] # => "\303\247"
This is better than doing a Fixnum representation. It is character
iteration, but each character is, itself, a String.
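That character-as-String behavior can be sketched with Ruby 1.9+ semantics (an assumption here, since 1.9 did not exist yet; it ultimately adopted this design):

```ruby
# Assumed Ruby 1.9+ behavior, for illustration: indexing a UTF-8
# string yields a one-character String, not a Fixnum codepoint.
str = "Français"
str[4]               # => "ç" (a String of length 1)
str.length           # => 8 characters
str.bytesize         # => 9 bytes ("ç" occupies two bytes in UTF-8)
str.each_char.first  # => "F"
```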
- This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like
FixInt and BigInt).
I think providing different internal String representations
would be too much work, especially for maintenance in the long
run.
If you’re depending on classes to do that, especially given that Ruby’s
String, Array, and Hash classes don’t inherit well, you’re right.
The advantages of this proposal over the current situation and
tagging a string with an encoding are:
The problem, of course, is that this proposal – and your take on it –
don’t account for the m17n String that Matz has planned. The current
situation is a mess. But the current situation is not what is planned.
I’ve had to do some encoding work professionally in the last two years, and
while I prefer a UTF-8/UTF-16 internal representation, I also know
that’s impossible in some situations and you have to be flexible. I
also know that POSIX handles this situation worse than any other
setup.
With the work that I’ve done on this, Matz is right about this, and
the people claiming that Unicode is the Only Way … are wrong. In an
ideal world, Unicode would be the correct and only way. In the real
world, however, it’s a lot messier, and Ruby has to be aware of that.
We can still make it as easy as possible for the common case (which
will be UTF-8 encoding data and filenames). But we shouldn’t make the
mistake of assuming that the common case is all that Ruby should handle.
- There is only one internal string (where string means a
string of characters) representation. String operations
don’t need to be written for different encodings.
This is still (mostly) correct under the m17n String proposal.
This is true under the m17n String.
- Separation of concerns. I always found it strange that most dynamic
languages simply mix handling of character and arbitrary binary data
(just think of pack/unpack).
The separation makes things harder most of the time.
- Reading of character data in one encoding and representing it in
other encoding(s) would be easy.
This is true under the m17n String.
It seems that the main argument against using Unicode strings in Ruby
is because Unicode doesn’t work well for eastern countries. Perhaps
there is another character set that works better that we could use
instead of Unicode. The important point here is that there is only
one representation of character data in Ruby.
This is a mistake.
If Unicode is chosen as the character set, there is the question of which
encoding to use internally. UTF-32 would be a good choice with regards
to simplicity in implementation, since each codepoint takes a fixed
number of bytes. Consider indexing of Strings:
Yes, but this would be very hard on memory requirements. There are
people who are trying to get Ruby to fit into small-memory environments.
This would destroy any chance of that.
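The memory cost is easy to quantify; a rough sketch using Ruby 1.9+ transcoding (an assumption, since that API postdates this thread):

```ruby
# Assumed Ruby 1.9+ API: UTF-32 spends four bytes per codepoint even
# for pure-ASCII text, quadrupling memory relative to UTF-8.
ascii = "hello world" * 1000
ascii.bytesize                     # => 11000 bytes as UTF-8
ascii.encode("UTF-32LE").bytesize  # => 44000 bytes as UTF-32
```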
[…]
Thank you for reading so far. Just in case Matz decides to implement
something similar to this proposal, I am willing to help with Ruby
development (although I don’t know much about Ruby’s internals and not
too much about Unicode either).
I would suggest that you look for discussions about m17n Strings in
Ruby. Matz has this one right.
I do not have a CS degree and I’m not a Unicode expert, so perhaps the
proposal is garbage; in that case, please tell me what is wrong with
it or why it is not realistic to implement.
I don’t have a CS degree either, but I have been in the business for a
long time and I’ve been immersed in Unicode and encoding issues for
the last two years. If everyone used Unicode – and POSIX weren’t stupid
– your proposal would be much more realistic. I agree that Ruby
should encourage the use of Unicode as much as is practical. But it also
shouldn’t tie our hands like other programming languages do.
-austin