On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Z. wrote:
- Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons.
The internal encoding has to be decided by the String class implementor,
probably choosing between UTF-8, UTF-16, and UTF-32.
Completely disagree. Matz has the right choice on this one. You can’t
think in just terms of a pure Ruby implementation – you must think
in terms of the Ruby/C interface for extensions as well.
I admit I don’t know much about Ruby’s C extensions. Are they unable to
access String’s methods? That is all that is needed to work with them.
And since this String class does not have a parametric encoding
attribute, it should be even easier to work with from C.
fails because your #2 is unacceptable.
Note that explicit conversion to characters, arrays, etc., is possible
for any supported character set and encoding. I have even given method
examples. “External” is to be seen in the context of the String class.
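Roughly, the kind of interface I have in mind looks like this (method and
parameter names are illustrative only, not a finished API, and none of this
exists today):

    s = String.new("Grüße")           # internal representation is the class's business
    s.size                            # => 5, counted in characters, not bytes
    s.codepoints                      # => [71, 114, 252, 223, 101], Unicode code points
    s.to_bytes(:utf8)                 # explicit conversion to an external encoding
    s.to_bytes(:utf16)                # same text, different external byte sequence
    String.from_bytes(data, :latin1)  # construct from external bytes, decoded up front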
case folding, sorting, comparing etc.
Agreed, but this would be expected regardless of the actual encoding of
a String.
I am unaware of Matz’s exact plan. Any good English-language links?
I was under the impression that users of Matz’s String instances need to
look at the encoding tag to implement e.g. #version_sort. If that is
not the case, our proposals are not that different, only Matz’s is
even more complex to implement than mine.
tradeoff reasons which work transparently together (a bit like Fixnum
and Bignum).
Um. Disagree. Matz’s proposed approach does this; yours does not. Yours,
in fact, makes things much harder.
If Matz’s approach requires looking at the encoding tag from the
outside, it is not as transparent as mine. If it doesn’t, it just boils
down to a design decision between a parametric class and a subclass
hierarchy; I don’t see much difference and would be happy with either one.
- Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby’s canonical String class. This will break some
old uses of String, but now is the right time for that.
“Now” isn’t; Ruby 2.0 is. Maybe Ruby 1.9.1.
My original title, somewhere snipped out, was “A Plan for Unicode
Strings in Ruby 2.0”. I don’t want to rush things or break 1.8 either.
- The String class does not worry about character representation on
screen; the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.
The String class doesn’t worry about that now.
I was just playing safe here.
- Be flexible.
And little is more flexible than Matz’s m17n String.
I’ve had flexibility with respect to the Unicode standards in mind, to
avoid falling into traps similar to Java’s. My goal was a simple-to-use
String class, powerful enough to include every character in the world,
with the ability to convert to and from other external (from the
String class’s point of view) representations.
The flexibility to have parametric String encodings inside the String
class was not what I was going for; rather, I would keep that
inaccessible, or at least unnecessary to access, for the common String
user, and I provided a somewhat weaker but maybe still sufficient
technique via subclassing.
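To make the subclassing idea concrete, here is roughly how I picture it
(class names purely illustrative; the point is that subclasses differ only
in internal storage, never in observable behaviour):

    class UTF8String  < String; end   # compact storage for mostly-ASCII text
    class UTF32String < String; end   # constant-time character indexing

    # Callers never need to care which concrete class they hold,
    # much as Fixnum and Bignum mix freely today:
    (UTF8String.new("abc") + UTF32String.new("d")).size   # => 4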
Remember: POLS is not an acceptable reason for anything. Matz’s m17n
Strings would be predictable, too. a + b would be possible if and only
if a and b are the same encoding or one of them is “raw” (which would
mean that the other is treated as the defined encoding) or there is a
built-in conversion for them.
Since I probably cannot control which Strings I get from libraries,
and don’t want to worry about which ones I’ll have to provide to them,
this is weaker than my approach in this respect; see my next point.
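Just to check my reading of the rule quoted above, this is the compatibility
test I understand an m17n concatenation to require (a sketch only; the helper
conversion_exists? is made up for illustration):

    def compatible?(a, b)
      return true if a.encoding == b.encoding                  # same encoding: concatenate directly
      return true if a.encoding == :raw || b.encoding == :raw  # raw side adopts the other's encoding
      conversion_exists?(a.encoding, b.encoding)                # else a built-in conversion must exist
    end

With my proposal that test simply never comes up, because there is only one
kind of String.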
work for Ruby/C interfaced items. Sorry.
Please elaborate on this or provide pointers. I cannot believe C cannot
crunch my Strings, which are less parametric than Matz’s.
whether it’s actually UTF-8 or not until I get HTTP headers – or
worse, a tag. Assuming UTF-8 reading in today’s world
is doomed to failure.
Read it as binary, and decide later. These problems should be locally
containable, and methods can still return Strings once the encoding
has been determined.
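Concretely, with the ByteBuffer and the illustrative String.from_bytes from
my proposal (and assuming some socket and headers objects are at hand), I
imagine the HTTP case looking like this:

    raw     = ByteBuffer.new(socket.read)                      # no encoding assumed yet
    charset = headers['Content-Type'][/charset=([\w-]+)/i, 1] || 'utf-8'
    body    = String.from_bytes(raw, charset)                  # decode once the charset is known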
tags. Merely that they could. I suspect that there will be pragma-
like behaviours to enforce a particular internal representation at all
times.
Previously you stated that users need to look at the encoding to determine
whether simple operations like a + b work.
Can you point to more info? I am interested in how this pragma mechanism
works, and in whether not doing it “right” can break things.
Disadvantages (with mitigating reasoning of course)
- String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that’s not too hard, and it applies everywhere. Users do not
need to learn about an encoding tag, which would surely be worse for
them to handle.
True, but the encoding tag is not worse. Anyone who assumes that
developers can ignore encoding at any time simply doesn’t know about
the level of problems that can be encountered.
For String concatenation, substring access, search, etc., I expect to be
able to ignore encoding totally. Only when interfacing with
non-String-class objects (I/O and/or explicit conversion) would I need
encoding info.
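To illustrate what I mean by ignoring encoding (a, b and other stand for
ordinary Strings; #byte_length is the hypothetical method from my proposal):

    s = a + b                # always works, no tag to inspect first
    s.include?(other)        # search and comparison are encoding-free
    s[0, 3]                  # the first three characters, never half a character
    s.size                   # character count
    s.byte_length(:utf8)     # only matters at the boundary, when leaving the String class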
- Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expression support, Regexp can be extended for ByteBuffers or
even more.
I see no reason for this.
In my proposal, Unicode Strings cannot represent arbitrary binary data
in their internal representation, since not every byte sequence would
form valid characters. In fact, you cannot set the internal representation
directly.
The interface could accept a sequence of code point values
(0..255), but that would be wasteful compared to an array of bytes.
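A sketch of how I picture the split between the two classes (ByteBuffer and
String.from_bytes are the proposed, not existing, interfaces):

    buf  = ByteBuffer.new(socket.read)      # raw bytes, no character semantics attached
    buf[0]                                  # => an Integer in 0..255, not a character
    buf.index("\xFF\xD8")                   # byte-level search; Regexp support could be layered on top
    text = String.from_bytes(buf, :utf8)    # promote to a String only once we know it is text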
- Some String operations may perform worse than a naive user might
expect, in both time and space. But we do this so the String user
doesn’t need to do it himself, and we are probably better at it than
the user, too.
This is a wash.
Only trying to refute weak arguments in advance.
- For very simple uses of String, there might be unnecessary
conversions. If a String is just to be passed through somewhere,
without being inspected or modified at all, inward and outward conversion
will still take place. You could and should use a ByteBuffer to avoid
this.
This is a wash.
Not a big problem either, but someone was bound to bring it up.
users really do get unexpected foreign characters in their Strings. I
concluded case folding. I think it is more than that: we are lazy and
understood this could be handled by future Unicode revisions
The way I see it, we have to choose a character set. I proposed
Unicode, because its official goal is to be the one unifying set,
and if it isn’t there yet, I hope it will be someday.
If that is not enough, we will effectively create our own character
set, call it RubyCode, which will contain characters from the
union of Unicode and a few other sets. Each String will have a
particular encoding, which will determine which characters of RubyCode
are valid in that particular String instance. Hopefully many
characters will be valid in multiple encodings. But it doesn’t sound
like a very clear design to me.
Jürgen