On Aug 7, 2009, at 10:41 AM, Vít Ondruch wrote:
You are not allowed to set the source encoding to a non-ASCII
compatible encoding, if memory serves.
Where is it documented please?
I’m not sure it’s officially documented yet.
Ruby does throw an error in this scenario though:
$ ruby_dev
# encoding: UTF-16BE
ruby_dev: UTF-16BE is not ASCII compatible (ArgumentError)
and:
$ ruby_dev -e 'puts "\uFEFF# encoding: UTF-16BE".encode("UTF-16BE")' | ruby_dev
-:1: invalid multibyte char (UTF-8)
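A quick look at the bytes suggests why the second attempt cannot work. This is only an illustrative sketch (the variable names are mine): in UTF-16BE every ASCII character is preceded by a NUL byte, so the ASCII byte sequence of the magic comment never appears in the input.

```ruby
magic = "# encoding: UTF-16BE"
utf16 = magic.encode("UTF-16BE")

p magic.bytes.first(4)        # => [35, 32, 101, 110]
p utf16.bytes.first(4)        # => [0, 35, 0, 32]

# Compare raw bytes: the ASCII form of the comment is not a substring
# of the UTF-16BE form.
p utf16.b.include?(magic.b)   # => false
```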
I believe this is the relevant code from Ruby’s parser:
    static void
    parser_set_encode(struct parser_params *parser, const char *name)
    {
        int idx = rb_enc_find_index(name);
        rb_encoding *enc;

        if (idx < 0) {
            rb_raise(rb_eArgError, "unknown encoding name: %s", name);
        }
        enc = rb_enc_from_index(idx);
        if (!rb_enc_asciicompat(enc)) {
            rb_raise(rb_eArgError, "%s is not ASCII compatible",
                     rb_enc_name(enc));
        }
        parser->enc = enc;
    }
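The same check can be approximated from Ruby itself. This is a minimal sketch, not the parser's code: check_source_encoding is a hypothetical helper name of mine, built only on Encoding.find and Encoding#ascii_compatible?.

```ruby
# Hypothetical helper that mirrors the two checks in the C function above.
def check_source_encoding(name)
  enc = begin
    Encoding.find(name)
  rescue ArgumentError
    # Encoding.find raises ArgumentError for names it does not know.
    raise ArgumentError, "unknown encoding name: #{name}"
  end

  unless enc.ascii_compatible?
    raise ArgumentError, "#{enc.name} is not ASCII compatible"
  end
  enc
end

p check_source_encoding("UTF-8")   # => #<Encoding:UTF-8>

begin
  check_source_encoding("UTF-16BE")
rescue ArgumentError => e
  puts e.message                   # prints "UTF-16BE is not ASCII compatible"
end
```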
That eliminates any issues with encodings like UTF-16. This makes
perfect sense as there’s no way to reliably support the magic encoding
comment unless we can count on being able to read at least that far.
Needed to say that XML parsers can handle such cases, i.e. when xml
header is in different encoding than the rest of document.
I doubt we can say that universally.
Also, what you said isn’t quite accurate. For example, “in different
encoding than the rest of document” is not something the XML 1.1
specification (Extensible Markup Language (XML) 1.1 (Second Edition))
allows. It states:
“It is a fatal error when an XML processor encounters an entity with
an encoding that it is unable to process. It is a fatal error if an
XML entity is determined (via default, encoding declaration, or
higher-level protocol) to be in a certain encoding but contains byte
sequences that are not legal in that encoding.”
All XML parsers are required to assume UTF-8 unless told otherwise and
to be able to recognize UTF-16 by a required BOM. Beyond that, they
are not required to recognize any other encodings, though of course
they may. The encoding declaration itself can be expressed in ASCII
and, since UTF-8 is assumed by default, this is similar to what Ruby
does: it allows a switch to another ASCII-compatible encoding.
XML processors may do more. For example, they can accept a different
encoding from an external source to support things like HTTP headers
and MIME types. Ruby doesn’t really have access to such sources at
execution time, so that option doesn’t apply to the case we are
discussing. However, XML processors may also recognize other BOMs
and Ruby could do the same.
A BOM could be handled similarly to what I showed. You need to open
the file in ASCII-8BIT and check the beginning bytes, then you could
switch to US-ASCII and finish reading the first line (or up to the
second if a shebang line is included), then switch encodings again if
needed and finish processing.
Maybe this technique could be used for reading UTF-16 encoded files,
if needed?
Yes, Ruby could recognize BOMs for non-ASCII compatible encodings to
support them. A BOM would be required in this case though, just as it
is in an XML processor that doesn’t have external information.
Ruby doesn’t currently do this, as near as I can tell.
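To make the idea concrete, here is a rough sketch of what such detection might look like in plain Ruby. Everything in it is an assumption on my part rather than anything Ruby's parser does: the BOMS table and sniff_source_encoding are hypothetical names, the table is deliberately tiny (it ignores UTF-32, whose little-endian BOM begins with the same two bytes as UTF-16LE), and the UTF-8 fallback is just an assumed default.

```ruby
# Hypothetical BOM table; keys are raw bytes (ASCII-8BIT).
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b     => Encoding::UTF_16BE,
  "\xFF\xFE".b     => Encoding::UTF_16LE
}.freeze

# Hypothetical helper: guess a source encoding from the raw bytes.
def sniff_source_encoding(raw)
  bytes = raw.b   # work on a binary (ASCII-8BIT) copy

  # A BOM wins, and is the only way a non-ASCII compatible encoding is found.
  BOMS.each { |bom, enc| return enc if bytes.start_with?(bom) }

  # No BOM: look for a magic comment on the first two lines
  # (two, so that a shebang line may come first).
  bytes.lines.first(2).each do |line|
    return Encoding.find($1) if line =~ /\A#.*coding[:=]\s*([\w.-]+)/
  end

  Encoding::UTF_8   # assumed default when nothing else is declared
end

utf16 = "\uFEFF# encoding: UTF-16BE".encode("UTF-16BE")
p sniff_source_encoding(utf16)   # => #<Encoding:UTF-16BE>
p sniff_source_encoding("#!/usr/bin/ruby\n# encoding: Shift_JIS\n")
```

Note that the BOM branch never reads the magic comment at all, which is why the BOM would have to be mandatory for UTF-16 sources.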
Note that this would not give what you proposed in your initial
message: multiple encodings in the same file. Ruby doesn’t support
that and isn’t ever likely to. An XML processor that supports such
things is in violation of its specification as I understand it.
Besides, not many text editors that I’m aware of make it super easy to
edit in multiple encodings.
James Edward G. II