On Dec 5, 6:15 pm, Daniel DeLorme removed_email_address@domain.invalid wrote:

> representation. If strings were fundamentally made of characters then we
> believe that in 99% of those cases you would be better served with regex
> than this pretend "array" disguise.
>
> Daniel
Here is a micro-benchmark of three common string operations (split,
index, length), using bytestrings and a unicode regexp versus native
utf-8 strings in 1.9.0 (release).
$ ruby19 -v
ruby 1.9.0 (2007-10-15 patchlevel 0) [i686-linux]
$ echo && cat bench.rb
#!/usr/bin/ruby19
# -*- coding: ascii -*-
require "benchmark"
require "test/unit/assertions"
include Test::Unit::Assertions
$KCODE = "u"
$target = "!日本語!" * 100
$unichr = "本".force_encoding('utf-8')
$regchr = /[本]/u
def uni_split
$target.split($unichr)
end
def reg_split
$target.split($regchr)
end
def uni_index
$target.index($unichr)
end
def reg_index
$target =~ $regchr
end
def uni_chars
$target.length
end
def reg_chars
$target.unpack("U*").length
# this is a lot slower:
# $target.scan(/./u).length
end
$target.force_encoding("ascii")
a = reg_split
$target.force_encoding("utf-8")
b = uni_split
assert_equal(a.length, b.length)
$target.force_encoding("ascii")
a = reg_index
$target.force_encoding("utf-8")
b = uni_index
assert_equal(a-2, b)
$target.force_encoding("ascii")
a = reg_chars
$target.force_encoding("utf-8")
b = uni_chars
assert_equal(a, b)
n = 10_000
Benchmark.bm(12) { |x|
$target.force_encoding("ascii")
x.report("reg_split") { n.times { reg_split } }
$target.force_encoding("utf-8")
x.report("uni_split") { n.times { uni_split } }
puts
$target.force_encoding("ascii")
x.report("reg_index") { n.times { reg_index } }
$target.force_encoding("utf-8")
x.report("uni_index") { n.times { uni_index } }
puts
$target.force_encoding("ascii")
x.report("reg_chars") { n.times { reg_chars } }
$target.force_encoding("utf-8")
x.report("uni_chars") { n.times { uni_chars } }
}
====
With caches initialized, and 5 prior runs, I got these numbers:
$ ruby19 bench.rb
user system total real
reg_split 2.550000 0.010000 2.560000 ( 2.799292)
uni_split 1.820000 0.020000 1.840000 ( 2.026265)
reg_index 0.040000 0.000000 0.040000 ( 0.097672)
uni_index 0.150000 0.000000 0.150000 ( 0.202700)
reg_chars 0.790000 0.010000 0.800000 ( 0.919995)
uni_chars 0.130000 0.000000 0.130000 ( 0.193307)
====
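For anyone puzzled by the `assert_equal(a-2, b)` check in the script: with the string tagged as ascii, the regexp match position comes back as a byte offset, while on the utf-8 string String#index returns a character offset. A minimal sketch of that difference (run on a current Ruby, using the same sample text; this is my illustration, not part of the benchmark):

```ruby
s = "!日本語!"                      # UTF-8 sample, as in the benchmark
char_idx = s.index("本")            # character-based index
byte_idx = s[0, char_idx].bytesize  # byte offset of that same character

# "!" is 1 char / 1 byte, "日" is 1 char / 3 bytes,
# so the character index is 2 while the byte offset is 4.
```

That 4 vs 2 gap is exactly the `a-2 == b` relation the assertion checks.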
So String#=~ with a bytestring and a unicode regexp is roughly twice as
fast as String#index here (0.098s vs 0.203s real). In the other two
cases, the opposite is true.
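As a side note on reg_chars vs uni_chars: on a string tagged utf-8, String#length already counts characters, while unpack("U*") first decodes the raw bytes into an array of codepoints and then counts those, which is where the extra time goes. A quick sketch (current Ruby, with sample text of my own choosing):

```ruby
s = "日本語" * 3                        # UTF-8 text
chars_native  = s.length               # encoding-aware character count
chars_decoded = s.unpack("U*").length  # decode bytes as UTF-8 codepoints, then count
bytes         = s.bytesize             # raw byte count

# 9 characters either way, but 27 bytes (3 bytes per character here).
```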
P.S. In case there is any confusion, bytestrings aren't going away; you
can, as you see above, specify a magic encoding comment to ensure that
you have bytestrings by default. You can also explicitly force the
encoding from utf-8 back to ascii, and you can get a byte enumerator
(or an array, by calling to_a on the enumerator) from String#bytes, and
an iterator from #each_byte, regardless of the encoding.
Regards,
Jordan