Bug in how Ruby 2.1,2.2 handles Encoding::ConverterNotFoundError

addis_a · January 9, 2015, 5:08am

Given:

"\x80".force_encoding("ASCII-8BIT").encode(Encoding::Emacs_Mule)

raises an Encoding::ConverterNotFoundError on all rubies.

On all rubies except MRI 2.1 and 2.2, encoding with the :invalid
option has no effect. But on 2.1 and 2.2, it replaces the "\x80"
(128.chr) with the replacement string ("?", 63.chr).
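
A minimal sketch of the first failure described above, using only core Ruby. The helper name `try_encode` is hypothetical and exists only for this illustration; it reports the exception class instead of letting the error propagate.

```ruby
# Hypothetical helper: attempt a transcoding and return either the
# converted string or the class of the ConverterNotFoundError raised.
def try_encode(str, target, **opts)
  str.encode(target, **opts)
rescue Encoding::ConverterNotFoundError => e
  e.class
end

binary = "\x80".dup.force_encoding("ASCII-8BIT")

# No transcoder is registered between ASCII-8BIT and Emacs-Mule,
# so the call fails before any byte is examined.
p try_encode(binary, Encoding::Emacs_Mule)
# => Encoding::ConverterNotFoundError

# Passing :invalid => :replace with an explicit target does not help:
# a missing converter is a different failure from an invalid byte.
p try_encode(binary, Encoding::Emacs_Mule, :invalid => :replace)
# => Encoding::ConverterNotFoundError
```

The version-dependent behavior in the table below only appears when `encode` is called without a target encoding, as in the test code.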

For each Ruby version with UTF-8 encoding:
["1.9.2", "ruby", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 128]
["1.9.3", "ruby", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 128]
["2.0.0", "ruby", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 128]
["2.1.5", "ruby", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 63]
["2.2.0", "ruby", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 63]
["2.0.0", "jruby", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 128]
["2.1.0", "rbx", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 128]

["1.9.3", "jruby", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 128]
["2.0.0", "jruby", #<Encoding:UTF-8>, #<Encoding:UTF-8>, 128]

Here’s my test code

LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 \
rvm ruby-1.9.2-p330,ruby-1.9.3-p551,ruby-2.0.0-p598,ruby-2.1.5,ruby-2.2.0,jruby-1.7.18,rbx-2.2.2 \
do \
ruby -e 'p [RUBY_VERSION, RUBY_ENGINE, Encoding.default_external,
__ENCODING__] +
"\x80".force_encoding("ASCII-8BIT").force_encoding("Emacs-Mule").encode(:invalid => :replace).bytes.to_a'

for version in 1.9 2.0; do
export JRUBY_OPTS="-Xcompat.version=${version}" ;
LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 \
rvm jruby-1.7.18 do \
ruby -e 'p [RUBY_VERSION, RUBY_ENGINE, Encoding.default_external,
__ENCODING__] +
"\x80".force_encoding("ASCII-8BIT").force_encoding("Emacs-Mule").encode(:invalid => :replace).bytes.to_a' ; done

For the particularly curious, this is relevant to a PR I have for
rspec-support

beninem · January 9, 2015, 5:48am

String#encode converts to Encoding.default_internal by default, not
default_external.
And default_internal is nil by default, so it doesn't change the encoding.
In this case, it did nothing except make a copy, up through 2.0.

But because many people made the same mistake as you, expecting invalid
chars to be removed/replaced, the behavior changed in 2.1 to
replace such chars if :invalid => :replace is given.
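
A small sketch of the behavior described in this reply, assuming Ruby 2.1 or later and that Encoding.default_internal is nil (the default). It uses a UTF-8 string rather than the thread's Emacs-Mule one so the result is easy to check by eye; the helper name `replace_invalid` is hypothetical.

```ruby
# Hypothetical helper: call #encode with no target encoding, only the
# :invalid => :replace option, as the test code in this thread does.
def replace_invalid(str)
  str.encode(:invalid => :replace)
end

# A UTF-8 tagged string ending in a lone continuation byte.
broken = "abc\x80".dup.force_encoding("UTF-8")
p broken.valid_encoding?

fixed = replace_invalid(broken)

# With no target given, the string keeps its own encoding.
p fixed.encoding

# On 2.1+ the invalid byte is replaced; per the reply above, 2.0 and
# earlier would return an unchanged copy here.
p fixed.valid_encoding?
p fixed.bytes.to_a
```

For a Unicode encoding the default replacement is U+FFFD, which is why the output bytes differ from the single 63 ("?") seen for Emacs-Mule in the table above.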

beninem · January 9, 2015, 6:10am

Nobuyoshi N. wrote in post #1166362:

Thank you! That is very helpful!

> But as many people made same mistake like yours, expecting invalid
> chars were removed/replaced,

Actually, I’ve just been reading code and testing and observing what
Ruby does.

I was asked why the behavior changed and I didn't know, since it is
reasonable for Ruby to ignore the :invalid directive when no converter
is found.

Thank you so much!

-Benjamin

p.s. Now I just need to figure out how all the Rubies reconcile the
:undef, :invalid, and :replace with :fallback

I don’t really know C, but it appears to me from the code that :invalid
and :undef are called before :fallback (ecflags?)

https://github.com/ruby/ruby/blob/34fbf57aaafee9390a0f7427eb90efac099e33ec/transcode.c#L2677-L2733

    if (argc == 0) {
        arg1 = rb_enc_default_internal();
        if (NIL_P(arg1)) {
            if (!ecflags) return -1;
            arg1 = rb_obj_encoding(str);
        }
        if (!(ecflags & ECONV_INVALID_MASK)) {
            explicitly_invalid_replace = FALSE;
        }
        ecflags |= ECONV_INVALID_REPLACE | ECONV_UNDEF_REPLACE;
    }
    else {
        arg1 = argv[0];
    }
    arg2 = argc<=1 ? Qnil : argv[1];
    dencidx = str_transcode_enc_args(str, &arg1, &arg2, &sname, &senc, &dname, &denc);

    if ((ecflags & (ECONV_NEWLINE_DECORATOR_MASK|
                    ECONV_XML_TEXT_DECORATOR|
                    ECONV_XML_ATTR_CONTENT_DECORATOR|


https://github.com/ruby/ruby/blob/34fbf57aaafee9390a0f7427eb90efac099e33ec/transcode.c#L2838-L2858

 *  The +options+ Hash gives details for conversion and can have the following
 *  keys:
 *
 *  :invalid ::
 *    If the value is +:replace+, #encode replaces invalid byte sequences in
 *    +str+ with the replacement character.  The default is to raise the
 *    Encoding::InvalidByteSequenceError exception
 *  :undef ::
 *    If the value is +:replace+, #encode replaces characters which are
 *    undefined in the destination encoding with the replacement character.
 *    The default is to raise the Encoding::UndefinedConversionError.
 *  :replace ::
 *    Sets the replacement string to the given value. The default replacement
 *    string is "\uFFFD" for Unicode encoding forms, and "?" otherwise.
 *  :fallback ::
 *    Sets the replacement string by the given object for undefined
 *    character.  The object should be a Hash, a Proc, a Method, or an
 *    object which has [] method.
 *    Its key is an undefined character encoded in the source encoding
 *    of current transcoder. Its value can be any encoding until it
 *    can be converted into the destination encoding of the transcoder.
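
The options documented above can be tried out with a conversion that does have a transcoder. This is a rough sketch, not taken from the thread: it converts UTF-8 to US-ASCII, and the helper name `ascii` is hypothetical. The last call combines :undef with :fallback and only prints the result, since which one wins is exactly the open question in the p.s. above.

```ruby
# Hypothetical helper: encode +text+ to US-ASCII with the given options.
def ascii(text, opts = {})
  text.encode("US-ASCII", **opts)
end

text = "caf\u00e9"   # U+00E9 has no US-ASCII equivalent

# :undef => :replace substitutes the default replacement string,
# which is "?" for a non-Unicode destination.
p ascii(text, :undef => :replace)                    # => "caf?"

# :replace overrides the replacement string.
p ascii(text, :undef => :replace, :replace => "*")   # => "caf*"

# :fallback supplies a per-character substitute; a Hash is the
# simplest of the accepted objects.
p ascii(text, :fallback => { "\u00e9" => "e" })      # => "cafe"

# With none of these options the undefined character raises.
begin
  ascii(text)
rescue Encoding::UndefinedConversionError => e
  p e.class
end

# Both :undef and :fallback given: printed without a claimed result,
# so the precedence can be observed on whichever Ruby runs this.
p ascii(text, :undef => :replace, :fallback => { "\u00e9" => "e" })
```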

https://github.com/ruby/ruby/blob/34fbf57aaafee9390a0f7427eb90efac099e33ec/transcode.c#L2282-L2290

    ec = rb_econv_open_opts(src_encoding, dst_encoding, ecflags, ecopts);
    if (!ec)
        rb_exc_raise(rb_econv_open_exc(src_encoding, dst_encoding, ecflags));

    if (!NIL_P(ecopts) && RB_TYPE_P(ecopts, T_HASH)) {
        fallback = rb_hash_aref(ecopts, sym_fallback);
        if (RB_TYPE_P(fallback, T_HASH)) {
            fallback_func = hash_fallback;
        }
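
The excerpt above raises as soon as no converter can be opened, before the :fallback option is even read. From Ruby, one way to probe for a converter without encoding anything is Encoding::Converter.search_convpath. The sketch below assumes that approach; the helper name `convertible?` is hypothetical.

```ruby
# Hypothetical helper: true when Ruby can find a conversion path from
# +src+ to +dst+, false when it raises ConverterNotFoundError.
def convertible?(src, dst)
  Encoding::Converter.search_convpath(src, dst)
  true
rescue Encoding::ConverterNotFoundError
  false
end

# A pair with a registered transcoder.
p convertible?("ISO-8859-1", "UTF-8")       # => true

# The pair from this thread: no path exists.
p convertible?("ASCII-8BIT", "Emacs-Mule")  # => false
```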