I posted this as a question on Stack Overflow.
Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the
Unicode “Decimal_Number” category. However, in Ruby 1.9.1 and 1.9.2 it
only matches Latin 0-9 characters. Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?
Test program:
# encoding: utf-8
require 'open-uri'
html = open("Unicode Characters in the 'Number, Decimal Digit' Category").read
# Collect every "U+XXXX" code point on the page and pack them into one UTF-8 string.
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')
puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…
p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
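For anyone who wants to reproduce this without fetching the page, here is a minimal sketch. The three code points (the ASCII, Arabic-Indic, and Devanagari zeros) are an illustrative sample of my own choosing, not taken from the page above, and \p{Nd} is the explicit Unicode property syntax, shown only for comparison with \d:

```ruby
# encoding: utf-8
# Hypothetical sample: three "zero" digits from the Nd category, built from
# code points so that no network access is needed.
sample = [0x0030, 0x0660, 0x0966].pack('U*')

ascii_only  = sample.scan(/\d/)       # \d as Ruby applies it
by_property = sample.scan(/\p{Nd}/)   # explicit Decimal_Number property

p ascii_only
#=> ["0"]
p by_property.length
```

If \d followed the Decimal_Number category, both scans should return the same three characters.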
Feel free to discuss here, or answer on Stack Overflow if you have a
solid answer and want the rep.
[1] Oniguruma documentation (the original link now leads to a "Notice of service termination" page)