UTF-8 String.strip bug

Dobai-Pataky_BSSSSl · December 14, 2010, 5:22pm

Hello, with Rails 3.0.3

"Café Noir ".strip => "Café Noir"
but
"Café ".strip => "Caf\303\251"

In fact, strip() doesn’t work if the last printable character is
accented.
Surprisingly, " écologie".strip works fine.

I’ve tried to dig deeper into the active_support multibyte source code but
didn’t find any solution.

Any help?

bobm · December 14, 2010, 5:48pm

On 14 December 2010 16:22, Bob M. wrote:

Hello, with Rails 3.0.3

"Café Noir ".strip => "Café Noir"
but
"Café ".strip => "Caf\303\251"

In fact, strip() doesn’t work if the last printable character is
accented.

Strange, I get:
$ rails console
Loading development environment (Rails 3.0.3)
ruby-1.9.2-p0 > "Café Noir ".strip
=> "Café Noir"
ruby-1.9.2-p0 > "Café ".strip
=> "Café"

Which Ruby are you using?

Colin

bobm · December 15, 2010, 1:02am

I don’t see the ‘é’ in your snippet code. Did you try with real
accented chars?

I’m using Ruby Enterprise Edition 1.8.x. I hadn’t thought about a
possible bug in Ruby itself. I might try a more recent 1.8 version of
REE… I don’t want to switch to 1.9 just for such a small (but annoying)
problem…

bobm · December 15, 2010, 9:52am

On 15 December 2010 00:02, Bob M. wrote:

I don’t see the ‘é’ in your snippet code. Did you try with real
accented chars?

Is this in reply to my response? You have not quoted anything and
have changed the subject line so gmail has not linked up the thread.

If so I don’t understand when you say you do not see the accented
char, copying from my previous post:
$ rails console
Loading development environment (Rails 3.0.3)
ruby-1.9.2-p0 > "Café Noir ".strip
=> "Café Noir"
ruby-1.9.2-p0 > "Café ".strip
=> "Café"
I see the accented char é.

It is interesting, though, that the é in your mail looks different to
the one here, even though I have just copied and pasted it from your
email into mine. It does look like yours though when I paste it into
the ruby console. What happens if you copy it from here and use it in
your console?

I’m using Ruby Enterprise Edition 1.8.x. I hadn’t thought about a
possible bug in Ruby itself. I might try a more recent 1.8 version of
REE… I don’t want to switch to 1.9 just for such a small (but annoying)
problem…

This is the result in 1.8.7
$ ruby script/console
Loading development environment (Rails 2.3.2)
ruby-1.8.7-p302 > "Café Noir ".strip
=> "Café Noir"
ruby-1.8.7-p302 > "Café ".strip
=> "Café"
ruby-1.8.7-p302 >

Of course maybe your response was not to my mail at all, in which case
I have been wasting my time.

Colin

bobm · December 15, 2010, 12:09pm

Colin L. wrote in post #968503:

Is this in reply to my response? You have not quoted anything and
have changed the subject line so gmail has not linked up the thread.

I’m posting through ruby-forum, so maybe something got mixed up during
the process?

So, in your case String.strip() does work correctly with both versions
of Ruby. I really don’t understand why it goes wrong for me. Maybe it is a
bug in the REE code.

bobm · December 15, 2010, 11:28am

On Dec 14, 4:22pm, Bob M. wrote:

Hello, with Rails 3.0.3

"Café Noir ".strip => "Café Noir"
but
"Café ".strip => "Caf\303\251"

While it may not look pretty, this is accurate if you are using UTF-8:
é is 0xC3 0xA9 in UTF-8, which is 0o303 0o251 in octal. I’m not sure
why inspect is choosing to show the octal escape codes, but your string
does contain the correct bytes. (Maybe there is some heuristic that tries to
determine whether the string is UTF-8 and should be displayed as such, or
whether it just contains random binary gunk.)

Fred
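
For anyone who wants to check Fred's byte values, here is a minimal sketch. It assumes a current Ruby (1.9 or later) rather than the 1.8 interpreters discussed in the thread, and the helper names octal_bytes and hex_bytes are made up for this illustration.

```ruby
# Return each byte of +str+ as an octal string, e.g. 0xC3 -> "303".
# (Hypothetical helper, not part of Ruby or Rails.)
def octal_bytes(str)
  str.bytes.map { |b| format("%o", b) }
end

# Return each byte of +str+ as a two-digit upper-case hex string.
# (Hypothetical helper, not part of Ruby or Rails.)
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }
end

# "\u00e9" is the code point for é; strip only removes the trailing space.
stripped = "Caf\u00e9 ".strip

puts hex_bytes(stripped).join(" ")    # 43 61 66 C3 A9
puts octal_bytes(stripped).join(" ")  # 103 141 146 303 251
```

The last two bytes are the UTF-8 encoding of é, so the "\303\251" in the original output is the same character written as octal escapes.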

bobm · December 15, 2010, 2:54pm

Colin L. wrote in post #968534:

On 15 December 2010 11:09, Bob M. wrote:

of Ruby. I really don’t understand why it goes wrong for me. Maybe it is a
bug in the REE code.

Have you seen Fred’s reply back in your original thread?

Colin

Yes, I did. But I still don’t see how to avoid having all my
right-stripped strings being garbled by octal escapes. I would like to avoid
needing a regexp call to revert them to something readable.

Thanks

bobm · December 15, 2010, 12:18pm

On 15 December 2010 11:09, Bob M. wrote:

of Ruby. I really don’t understand why it goes wrong for me. Maybe it is a
bug in the REE code.

Have you seen Fred’s reply back in your original thread?

Colin

bobm · December 16, 2010, 6:00pm

Frederick C. wrote in post #968521:

On Dec 14, 4:22pm, Bob M. wrote:

Hello, with Rails 3.0.3

"Café Noir ".strip => "Café Noir"
but
"Café ".strip => "Caf\303\251"

While it may not look pretty, this is accurate if you are using UTF-8:
é is 0xC3 0xA9 in UTF-8, which is 0o303 0o251 in octal. I’m not sure
why inspect is choosing to show the octal escape codes, but your string
does contain the correct bytes. (Maybe there is some heuristic that tries to
determine whether the string is UTF-8 and should be displayed as such, or
whether it just contains random binary gunk.)

Fred

I tried in 3 different versions of Ruby, and the way it is rendered in
irb is indeed different (and is confusing):

ruby-1.8.7-p302 > "Caf\303\251"
=> "Caf\303\251"
…
ree-1.8.7-2010.02 > "Caf\303\251"
=> "Caf\303\251"
…
ruby-1.9.2-head > "Caf\303\251"
=> "Café"

@Bob, are you sure you use UTF-8 encoding for your web page?

HTH,

Peter
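
A small sketch of what Peter's irb sessions suggest: the octal-escaped literal and the literal containing é are the same string, and only the rendering differs. This assumes a current Ruby with UTF-8 source files; the helper name as_utf8 is hypothetical.

```ruby
# Tag a copy of +str+ as UTF-8 without changing its bytes.
# (Hypothetical helper; force_encoding only relabels, it does not transcode.)
def as_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8)
end

escaped = as_utf8("Caf\303\251 ")  # é written as two octal byte escapes
literal = "Caf\u00e9 "             # é written as a Unicode code point

puts escaped == literal              # true
puts escaped.strip == literal.strip  # true
puts escaped.strip.length            # 4
```

Under these assumptions, strip leaves the é intact in both spellings, which matches the 1.9.2 result above.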

bobm · December 15, 2010, 3:20pm

On 15 December 2010 13:54, Bob M. wrote:

stripped strings being garbled by octal escapes. I would like to avoid
needing a regexp call to revert them to something readable.

If I understand Fred correctly there is nothing wrong with the string,
it is just the display that is wrong in the console. Are you seeing
the same thing when you show it on a web page?

Colin
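
To illustrate the distinction Colin is drawing between the string and its console display, here is a rough sketch on a current Ruby. It is only an analogy to the 1.8 behaviour: modern Ruby prints hex escapes (\xC3\xA9) rather than octal ones, and the binary-tagged copy stands in for a string the interpreter does not treat as UTF-8. The helper name same_bytes? is made up for this example.

```ruby
# True when both strings hold exactly the same byte sequence,
# regardless of how each one is tagged or displayed.
# (Hypothetical helper for this illustration.)
def same_bytes?(a, b)
  a.bytes == b.bytes
end

utf8   = "Caf\u00e9"
binary = utf8.dup.force_encoding(Encoding::BINARY)  # same bytes, no UTF-8 tag

puts binary.inspect             # "Caf\xC3\xA9"  (escaped for display only)
puts same_bytes?(utf8, binary)  # true
```

The escapes appear only in the inspect output; the bytes that would be written to a file or a web page are unchanged.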