csv.rb, UTF-16LE, 1.9 compatibility

I have another tiny CSV file (UTF-16LE),
and also a tiny script reading it with the help of csv.rb.

require 'csv'

# $ $HOME/.rvm/bin/ruby-1.9.2-p180       -w test-utf.00.rb
# $ $HOME/.rvm/bin/jruby-head      --1.9 -w test-utf.00.rb

f = open('g.UTF-16LE.csv', "r:UTF-16LE:UTF-8")

STDERR.printf("=%d: %s=>{%s},%s=>{%s} // %s\n",__LINE__,
              'f.external_encoding',f.external_encoding,
              'f.internal_encoding',f.internal_encoding,
              '...')

f.read(2)                         # read off the BOM

CSV.new(f,headers: true, row_sep: "\r\n").each do |csv_record|

  printf "%s=>{%05.5d},{%s}=>{%s}\n",
    '$.',$.,
    'f1',csv_record['f1']

end

It runs through just as expected with ruby-1.9.2-p180,
but it raises an exception with jruby-head:

$ ~/.rvm/bin/jruby-head --version
jruby 1.6.0.RC2 (ruby 1.8.7 patchlevel 330) (2011-03-02 7fd4e3f) (OpenJDK Server VM 1.6.0_20) [linux-i386-java]

$ ~/.rvm/bin/jruby-head --1.9 -w test-utf.00.rb
=8: f.external_encoding=>{UTF-16LE},f.internal_encoding=>{UTF-8} // ...
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
    sub! at org/jruby/RubyString.java:2582
   shift at /home/jochen_hayek/.rvm/rubies/jruby-head/lib/ruby/1.9/csv.rb:1831
    loop at org/jruby/RubyKernel.java:1417
   shift at /home/jochen_hayek/.rvm/rubies/jruby-head/lib/ruby/1.9/csv.rb:1825
    each at /home/jochen_hayek/.rvm/rubies/jruby-head/lib/ruby/1.9/csv.rb:1767
  (root) at test-utf.00.rb:15

As before:
I talked to csv.rb’s developer and maintainer;
he also thinks that jruby (--1.9) needs fixing, not csv.rb.

J.

If I quote all the entries in the CSV file,
so that it looks like this (albeit UTF-16LE encoded):

"f1","f2"
"bla","bla"

… the stack trace is also quite intimidating:

CSV::MalformedCSVError: CSV::MalformedCSVError
   shift at /home/jochen_hayek/.rvm/rubies/jruby-head/lib/ruby/1.9/csv.rb:1886
    each at org/jruby/RubyArray.java:1572
   shift at /home/jochen_hayek/.rvm/rubies/jruby-head/lib/ruby/1.9/csv.rb:1863
    loop at org/jruby/RubyKernel.java:1417
   shift at /home/jochen_hayek/.rvm/rubies/jruby-head/lib/ruby/1.9/csv.rb:1825
    each at /home/jochen_hayek/.rvm/rubies/jruby-head/lib/ruby/1.9/csv.rb:1767
  (root) at test-utf.00.rb:26

The following lines of code read the same file as plain text;
see for yourself what they yield:

f = open('g.UTF-16LE.csv', "r:UTF-16LE:UTF-8")

STDERR.printf("=%d: %s=>{%s},%s=>{%s} // %s\n",__LINE__,
              'f.external_encoding',f.external_encoding,
              'f.internal_encoding',f.internal_encoding,
              '...')

f.read(2)                         # read off the BOM

line_1 = f.gets

STDERR.printf("=%d: %s=>{%s} // %s\n",__LINE__,
              'line_1.encoding.name',line_1.encoding.name,
              '...')

STDERR.printf("=%d: %s=>{%s} // %s\n",__LINE__,
              'line_1',line_1,
              '...')
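
The f.read(2) call in the scripts above skips the byte-order mark. As a minimal sketch, one could check that those two bytes really are the UTF-16LE BOM (FF FE) before reading on. The sample data and the temporary file below are assumptions for illustration only; they stand in for g.UTF-16LE.csv, whose exact content is not shown in this thread.

```ruby
require 'tempfile'

# Hypothetical sample data: a UTF-16LE BOM followed by two CSV rows.
bom  = [0xFF, 0xFE].pack("C*")
body = "f1,f2\r\nbla,bla\r\n".encode("UTF-16LE").force_encoding("BINARY")

# Write the raw bytes to a temporary file standing in for the real CSV.
tmp = Tempfile.new("g.UTF-16LE.csv")
tmp.binmode
tmp.write(bom + body)
tmp.close

f = File.open(tmp.path, "r:UTF-16LE:UTF-8")

# IO#read with a length returns raw bytes, so the BOM can be compared directly.
first_two = f.read(2)
has_bom   = (first_two.bytes.to_a == [0xFF, 0xFE])

# Everything read after this point is transcoded to the internal encoding.
line_1 = f.gets

f.close
tmp.unlink
```

With MRI, line_1 should come back as a UTF-8 string, which is the behaviour the scripts above rely on.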

It looks as if UTF-16 / UTF-16LE is not (really / fully) implemented yet.

Hey, you jruby developer guys, we all still admire your work a lot!!

J.

I almost have this fixed now. master will properly transcode the
gets, but it does not transcode the separator, so it is unable to find
\r\n, since in UTF-16LE that is really \0\r\0\n (something like that).
Pretty close to solved though.

-Tom
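
For reference, the exact byte layout of the separator can be checked directly in Ruby 1.9. This is only an illustrative sketch, not part of the JRuby fix; the variable names are made up here.

```ruby
# The same two characters, "\r\n", as raw bytes in three encodings.
sep = "\r\n"

utf8_bytes    = sep.encode("UTF-8").bytes.to_a     # one byte per character
utf16le_bytes = sep.encode("UTF-16LE").bytes.to_a  # low byte first
utf16be_bytes = sep.encode("UTF-16BE").bytes.to_a  # high byte first

p utf8_bytes     # [13, 10]
p utf16le_bytes  # [13, 0, 10, 0]
p utf16be_bytes  # [0, 13, 0, 10]
```

So in UTF-16LE the zero byte follows each character, while the \0\r\0\n order corresponds to UTF-16BE. Either way, a two-byte \r\n pattern cannot match the raw file data.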

On Sat, Mar 5, 2011 at 7:35 AM, Jochen H.
[email protected] wrote:

[…]


To unsubscribe from this list, please visit:

http://xircles.codehaus.org/manage_email


blog: http://blog.enebo.com twitter: tom_enebo
mail: [email protected]

Thomas E Enebo writes:

I almost have this fixed now.
master will properly transcode the gets,
but it does not transcode the separator,
so it is unable to find \r\n,
since in UTF-16LE that is really \0\r\0\n (something like that).

yes, exactly.

f = open('g.UTF-16LE.csv', "r:UTF-16LE:UTF-8")

I actually thought that "r:UTF-16LE:UTF-8" lets all subsequent
operations deal with UTF-8,
just as if the file had been UTF-8 encoded from the beginning.
Then \r\n would be the right thing again.

Is that perspective incorrect?

Pretty close to solved though.

-Tom

J.

On Sat, Mar 5, 2011 at 9:57 AM, Jochen H.
[email protected] wrote:

f = open('g.UTF-16LE.csv', "r:UTF-16LE:UTF-8")

I actually thought that "r:UTF-16LE:UTF-8" lets all subsequent operations deal
with UTF-8,
just as if the file had been UTF-8 encoded from the beginning.
Then \r\n would be the right thing again.

Is that perspective incorrect?

That is true for anything which comes out of File/IO methods. When
reading from the source we need to read it as UTF-16LE, since that is
what we indicated by the external encoding option. Without reading
the entire file in and then transcoding it to UTF-8, how could we
apply a UTF-8 separator to get a single line, unless we transcode the
separator to UTF-16LE instead (MRI also does this)?

For what it’s worth I did not realize this either until you reported
this problem…

-Tom
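
The idea Tom describes can be sketched on raw bytes: transcode the separator to the file's encoding, search for it there, and only then transcode the line that was found. This is a simplified illustration under the assumption that the data is held in memory as a binary string; it is not the actual JRuby or MRI implementation, and all names are hypothetical.

```ruby
# Raw bytes as they would sit on disk in a UTF-16LE file (BOM omitted).
raw = "f1,f2\r\nbla,bla\r\n".encode("UTF-16LE").force_encoding("BINARY")

# The separator as the caller wrote it (UTF-8 bytes) and the same
# separator transcoded to the file's external encoding.
sep_utf8    = "\r\n".dup.force_encoding("BINARY")
sep_utf16le = "\r\n".encode("UTF-16LE").force_encoding("BINARY")

miss = raw.index(sep_utf8)      # the 2-byte form never matches the raw data
hit  = raw.index(sep_utf16le)   # the 4-byte UTF-16LE form does

# Cut the first line at the match, then transcode just that line to UTF-8.
first_line = raw[0, hit + sep_utf16le.bytesize]
               .force_encoding("UTF-16LE")
               .encode("UTF-8")
```

A real implementation would also have to make sure a match starts on a two-byte boundary; this sketch ignores that.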



Your example is working. We need tons of specs added for IO
internal/external, and I am not in a position to add them before
Monday, but your examples have at least ‘roughed’ in the feature. I
will make an effort (and honestly I would love it if some kind soul
dug into IO specs for internal/external encoding features) to get
better coverage for 1.6.1.

-Tom

On Sat, Mar 5, 2011 at 11:31 AM, Thomas E Enebo [email protected]
wrote:

[…]

Thomas E Enebo writes:

Your example is working…

And the application from which I derived the example now works as
well.

No more blockers for now, that’s great!
Thanks a lot!

Just … that we cannot write the args for open as a hash:

{:internal_encoding=>"UTF-8", :external_encoding=>"UTF-16LE"}

Instead we have to write

"r:UTF-16LE:UTF-8"

But I don’t mind that.
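
On MRI 1.9 the two spellings are expected to behave the same. A minimal sketch for comparing them side by side follows; the temporary file and its content are assumptions standing in for g.UTF-16LE.csv, and File.open is used here in place of the bare open from the scripts above.

```ruby
require 'tempfile'

# Hypothetical stand-in for the real CSV file, written as raw UTF-16LE bytes.
tmp = Tempfile.new("g.UTF-16LE.csv")
tmp.binmode
tmp.write("f1,f2\r\nbla,bla\r\n".encode("UTF-16LE").force_encoding("BINARY"))
tmp.close

# The same encodings, once as a mode string and once as hash options.
by_mode = File.open(tmp.path, "r:UTF-16LE:UTF-8")
by_hash = File.open(tmp.path, :external_encoding => "UTF-16LE",
                              :internal_encoding => "UTF-8")

same_external = by_mode.external_encoding == by_hash.external_encoding
same_internal = by_mode.internal_encoding == by_hash.internal_encoding
same_line     = by_mode.gets == by_hash.gets

by_mode.close
by_hash.close
tmp.unlink
```

If any of the three comparisons comes out false on a given interpreter, that points at a difference in how the hash form is handled.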

we need tons of specs added for IO internal/external
[…]

-Tom

J.

Thomas E Enebo writes:

On Sun, Mar 6, 2011 at 2:48 AM, Jochen H. [email protected] wrote:

Just … that we cannot write the args for open as a hash:

{:internal_encoding=>"UTF-8", :external_encoding=>"UTF-16LE"}

Changing your example to use hash or even ruby special embedded hash
syntax works for me. Can you show me an example of this not working?

$ …/.rvm/rubies/ruby-1.9.2-p180/bin/ruby …/.rvm/rubies/ruby-1.9.2-p180/bin/irb

ruby-1.9.2-p180 :001 > f = open('g.UTF-16LE.csv', "r:UTF-16LE:UTF-8")
=> #<File:g.UTF-16LE.csv>

ruby-1.9.2-p180 :002 > f = open('g.UTF-16LE.csv', {:internal_encoding=>"UTF-8", :external_encoding=>"UTF-16LE"})
=> #<File:g.UTF-16LE.csv>

$ …/.rvm/rubies/jruby-head/bin/jruby --1.9 …/.rvm/rubies/jruby-head/bin/irb

ruby-1.9.2-p180 :001 > f = open('g.UTF-16LE.csv', "r:UTF-16LE:UTF-8")
=> #<File:g.UTF-16LE.csv>

ruby-1.9.2-p180 :002 > f = open('g.UTF-16LE.csv', {:internal_encoding=>"UTF-8", :external_encoding=>"UTF-16LE"})
TypeError: can't convert Hash into String
    from org/jruby/RubyFile.java:456:in `initialize'
    from org/jruby/RubyIO.java:1115:in `open'
    from org/jruby/RubyKernel.java:308:in `open'
    from (irb):2:in `evaluate'
    from org/jruby/RubyKernel.java:1092:in `eval'
    from org/jruby/RubyKernel.java:1417:in `loop'
    from org/jruby/RubyKernel.java:1204:in `catch'
    from org/jruby/RubyKernel.java:1204:in `catch'
    from …/.rvm/rubies/jruby-head/bin/irb:17:in `(root)'

HTH

J.


Fixed this as well.

-Tom

On Sun, Mar 6, 2011 at 9:54 AM, Thomas E Enebo [email protected]
wrote:

[…]

Thomas E Enebo writes:

Fixed this as well.

I like that. Thanks!

I am glad I could help by finding a few differences
that you were able to fix in the meantime.

-Tom

J.

Ah cool. Thanks. RubyKernel.open is deferring to the 1.8 open instead of the 1.9
open…

-Tom

On Sun, Mar 6, 2011 at 9:18 AM, Jochen H.
[email protected] wrote:

[…]