On Fri, Feb 22, 2013 at 1:21 AM, Tikhon B. [email protected] wrote:
> Robert K. wrote in post #1098199:
>> On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. [email protected]
>> wrote:
>> You are doing the measurement wrong. You are not only measuring the
>> matching but also the creation of the String. Given that fact, all
>> your results are moot.
> I actively considered this matter when submitting the post, but in the
> end I included the string creation in the test to allow anyone to try
> out the example by copying one line.
Having a second line which defines a variable or constant isn't really
that much more inconvenient.
> The time necessary for the actual creation process is negligible; I
> need to run the string creation ten thousand times before the result
> is substantial enough to significantly affect the numbers I'm getting.
Still, you are measuring one thing but reasoning about another.
> I am not sure what sort of proper and complete tests you are looking
> for.
Tests which measure exactly the functionality you want to scrutinize,
not something else.
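For example, something minimal like this (just a sketch; the pattern,
size and label are picked arbitrarily) builds the String once up front
so that only the matching itself ends up in the numbers:

require 'benchmark'

subject = "a" * 10_000   # pathological input: many "a"s, no trailing "b"
rx      = /a+b/          # pattern chosen just for illustration

Benchmark.bm do |x|
  # The String is built once above, so only the matching is timed here.
  x.report("match only") { 20.times { rx.match subject } }
end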
> Regarding the actual slowdown, the issue comes down to the engine
> repeating the match from the start for every single "a" in the string,
> then failing and trying again with the next letter.
Backtracking.
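The effect is easy to see if you let the input grow; a rough sketch
(sizes picked arbitrarily, the pattern is just one of those discussed
here):

require 'benchmark'

# With no "b" in the input every match attempt fails, and the engine
# retries at each starting position, so the time grows roughly
# quadratically with the length of the string.
[1_000, 2_000, 4_000, 8_000].each do |n|
  s = "a" * n
  printf "n=%5d  %.4fs\n", n, Benchmark.realtime { /a+b/.match(s) }
end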
> I do not see how a greedy modifier can help in this situation, and I
> do not know enough about the ruby regex engine to comment on the
> underlying design.
Mine was a general statement about issues that can be avoided with
greediness modifiers. I didn’t say it’s the solution to all
performance issues of this kind.
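For illustration only (this is not part of the benchmark below): a
possessive quantifier is one such greediness modifier. It avoids
backtracking inside the repetition, but it does not remove the retry at
every starting position, so on this kind of input it behaves much like
the atomic group variants measured below:

require 'benchmark'

s = "a" * 10_000   # no "b", so every match attempt fails

# a++ (possessive) never gives characters back, but the engine still
# restarts the whole pattern at every position in the string.
puts Benchmark.realtime { /a++b/.match(s) }   # possessive: still slow here
puts Benchmark.realtime { /a+b/.match(s) }    # plain greedy, for comparison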
I had a closer look at your expressions. They are of course artificial,
intended to demonstrate the effects of backtracking. I would assume
that one would typically write different expressions in real
applications, for example using anchoring, which dramatically improves
the situation here:
$ ruby rx-bm.rb
                              user      system       total        real
 0 /(?>a[^b]*b)/         11.732000    0.000000   11.732000 ( 11.737671)
 1 /(?>a[^b]*b)/          0.015000    0.000000    0.015000 (  0.003000)
 0 /\A(?>a[^b]*b)/        0.000000    0.000000    0.000000 (  0.002000)
 1 /\A(?>a[^b]*b)/        0.000000    0.000000    0.000000 (  0.003000)
 0 /(?<!a)(?>a[^b]*b)/    0.016000    0.000000    0.016000 (  0.012001)
 1 /(?<!a)(?>a[^b]*b)/    0.000000    0.000000    0.000000 (  0.003000)
 0 /aa*b/                11.700000    0.000000   11.700000 ( 11.703670)
 1 /aa*b/                 0.000000    0.000000    0.000000 (  0.002000)
 0 /\Aaa*b/               0.000000    0.000000    0.000000 (  0.002000)
 1 /\Aaa*b/               0.015000    0.000000    0.015000 (  0.002000)
 0 /(?<!a)aa*b/           0.000000    0.000000    0.000000 (  0.013001)
 1 /(?<!a)aa*b/           0.016000    0.000000    0.016000 (  0.002000)
 0 /a+b/                 11.248000    0.000000   11.248000 ( 11.252644)
 1 /a+b/                  0.000000    0.000000    0.000000 (  0.002000)
 0 /\Aa+b/                0.000000    0.000000    0.000000 (  0.003000)
 1 /\Aa+b/                0.000000    0.000000    0.000000 (  0.002000)
 0 /(?<!a)a+b/            0.015000    0.000000    0.015000 (  0.012001)
 1 /(?<!a)a+b/            0.000000    0.000000    0.000000 (  0.003000)
 0 /a.*b/                 4.649000    0.000000    4.649000 (  4.641265)
 1 /a.*b/                 0.000000    0.000000    0.000000 (  0.001000)
 0 /\Aa.*b/               0.000000    0.000000    0.000000 (  0.001000)
 1 /\Aa.*b/               0.000000    0.000000    0.000000 (  0.001000)
 0 /(?<!a)a.*b/           0.016000    0.000000    0.016000 (  0.011001)
 1 /(?<!a)a.*b/           0.000000    0.000000    0.000000 (  0.000000)
$ cat -n rx-bm.rb
 1
 2  require 'benchmark'
 3
 4  strings = [
 5    ("a" * 10_000).freeze,
 6    ("a" * 10_000 + "b").freeze
 7  ]
 8
 9  data = [
10    /(?>a[^b]*b)/,
11    /\A(?>a[^b]*b)/,
12    /(?<!a)(?>a[^b]*b)/,
13
14    /aa*b/,
15    /\Aaa*b/,
16    /(?<!a)aa*b/,
17
18    /a+b/,
19    /\Aa+b/,
20    /(?<!a)a+b/,
21
22    /a.*b/,
23    /\Aa.*b/,
24    /(?<!a)a.*b/,
25  ]
26
27  REP = 20
28
29  Benchmark.bm 25 do |x|
30    data.each do |rx|
31      strings.each_with_index do |s, i|
32        x.report sprintf("%2d %p", i, rx) do
33          REP.times do
34            rx.match s
35          end
36        end
37      end
38    end
39  end
$
So, yes, there is a cost for backtracking involved, but it only shows
in particular situations:
- large inputs
- specific expressions (absence of anchoring, use of unlimited
repetition operators vs. limited ones, e.g. /a{0,10}/); a short sketch
follows below.
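A quick sketch of the repetition point (pattern and size picked
arbitrarily): a bounded quantifier caps the work done at each starting
position, while the unlimited one degrades on a large input without a
match:

require 'benchmark'

s = "a" * 10_000   # large input with no "b"

puts Benchmark.realtime { /a{0,10}b/.match(s) }   # bounded: cheap per position
puts Benchmark.realtime { /a+b/.match(s) }        # unbounded: degrades here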
The phenomenon is well known with regular expression engines, I
believe. The cases you mention seem to be the exception rather than
the common state of affairs. Which doesn't mean that Ruby's engine
cannot be improved; I just don't think it's that dramatic an issue.
Kind regards
robert