Hi,
I’m using Ruby 1.8.6, and I just discovered something rather
interesting, here is a test:
require 'test/unit'

class TestRegexBug < Test::Unit::TestCase
  def test_bug
    hours = "pon-čet"
    assert(hours =~ /[č]et/i)
    assert(hours =~ /čet/i)
    assert(hours =~ /-čet/i)
    assert(hours =~ /[cč]et/i)
    assert(hours =~ /-[č]et/i)
  end
end
As you can see, this only happens with Unicode letters (the last
assertion fails). I’m used to the fact that //i doesn’t work for Unicode
chars, and I already know that you need two dots to match one of these.
But this problem is different and weirder, because what triggers it is a
minus sign before the square brackets: if you remove either the ‘-’ or
the ‘[]’ from the regex, it works.
Can you comment?
thank you,
david
On Mar 3, 2008, at 2:24 PM, D. Krmpotic wrote:
> Hi,
> I’m using Ruby 1.8.6, and I just discovered something rather
> interesting, here is a test:
$KCODE = 'UTF8'
require 'jcode'
> assert(hours =~ /-čet/i)
> and I already know that you need two dots to match one of these… But
> this problem is different and weirder, because what triggers it is a
> minus sign before the square brackets… if you remove either the ‘-’
> or ‘[]’ from the regex, it works…
> Can you comment?
> thank you,
> david
Ruby is not natively aware of Unicode, but you can get all of these to
pass if you give it the $KCODE hint.
-Rob
Rob B. http://agileconsultingllc.com
Great info… completely forgot that this is available…
thank you
david
2008/3/3, D. Krmpotic:
> end
> As you can see, this only happens with unicode letters… (the last test
> fails)… I’m used to the fact that //i doesn’t work for unicode chars
> and I already know that you need two dots to match one of these… But
> this problem is different and weirder, because what triggers it is a
> minus sign before the square brackets… if you remove either the ‘-’ or
> ‘[]’ from the regex, it works…
In the regex, [č] is a character class containing two bytes, because
without $KCODE Ruby sees the UTF-8 encoding of "č" as the two separate
bytes \304 and \215. So Ruby tries to match a minus followed by one of
those bytes followed by "et". The regex would therefore match
"pon-\304et" or "pon-\215et", but not "pon-\304\215et".
Stefan
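For anyone reading this on a newer Ruby, the byte-level behaviour Stefan
describes can be sketched roughly as below. This is only an approximation
under stated assumptions: Ruby 1.9+ strings carry an encoding, so the
1.8 behaviour is imitated here with a binary copy of the string
(String#b) and byte-oriented regexes (the /n flag). The constant names
are made up for illustration.

```ruby
# "č" (U+010D) is the two bytes \xC4\x8D (octal \304\215) in UTF-8.
# String#b returns a binary (ASCII-8BIT) copy, i.e. plain bytes.
BYTES = "pon-\xC4\x8Det".b

# Byte-oriented regexes (/n), imitating Ruby 1.8 without $KCODE.
# The class holds two separate bytes, not one character.
CLASS_ONLY = /[\xC4\x8D]et/n    # one of the two bytes, then "et"
DASH_CLASS = /-[\xC4\x8D]et/n   # "-", ONE of the two bytes, then "et"

# Matches, but only from the second byte (\x8D) onwards.
p BYTES =~ CLASS_ONLY            # => 5

# Fails: two bytes sit between "-" and "et", the class consumes only one.
p BYTES =~ DASH_CLASS            # => nil

# A string with a single byte after the dash does match, as Stefan says.
p "pon-\xC4et".b =~ DASH_CLASS   # => 3

# Encoding-aware matching (what $KCODE = 'UTF8' gave 1.8, and what
# UTF-8 strings give by default today): the class is one character.
TEXT = "pon-\u010Det"
p TEXT =~ /-[\u010D]et/          # => 3
```

The returned numbers are match offsets (byte offsets for the binary
string, character offsets for the UTF-8 one); nil means no match.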