Regex#match(re, position) with start of string (\A) bug?

vlazar · January 9, 2022, 5:21pm

Found this. Is this a bug in Regexp#match?

From the Regexp#match docs:

If the second parameter is present, it specifies the position in the string to begin the search.

str = "hello world"
/\Aworld/.match(str, 6) # => nil

I would expect “specifies the position in the string to begin the search” to mean that starting from position 6 in “hello world” is equivalent to starting from position 0 in “world”, and thus \A should match in this case too.

Consider another example. If I use the same pattern with StringScanner#scan, it works as I would expect:

require "strscan"

str = "hello world"
scanner = StringScanner.new(str)
scanner.pos = 6
scanner.scan(/\Aworld/) # => "world"

To me, these two cases (at least going by the current API docs) should work the same.
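For comparison, here is a small sketch of how StringScanner can be made to treat \A the way Regexp#match does. It assumes a strscan version that accepts the fixed_anchor keyword (as far as I know, the one bundled with Ruby 2.7 and later); the variable names are only for illustration.

```ruby
require "strscan"

str = "hello world"

# Default mode: \A is treated as "the current scan position".
default_scanner = StringScanner.new(str)
default_scanner.pos = 6
default_result = default_scanner.scan(/\Aworld/)

# Assumption: fixed_anchor: true is supported by the installed strscan.
# In this mode \A means "start of the whole string", as in Regexp#match.
fixed_scanner = StringScanner.new(str, fixed_anchor: true)
fixed_scanner.pos = 6
fixed_result = fixed_scanner.scan(/\Aworld/)

# Without \A, scan still only matches at the current position.
plain_result = fixed_scanner.scan(/world/)

p default_result # "world"
p fixed_result   # nil
p plain_result   # "world"
```

So the two APIs can be made to agree, but the defaults differ.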

SouravGoswami · January 9, 2022, 8:25pm

I have never tried match with \A or ^ before. It looks like the behaviour should be the same as StringScanner when an offset is used. It makes sense for the string to be treated as starting with “world” when offset 6 is set.

In other words, I find it confusing that you can’t use \A or ^ whenever an offset is used. This might make sense for some other cases, don’t know. But it also looks like a bug to me.

I think you should report it on https://bugs.ruby-lang.org/
Even if it’s not a bug, you will find your answer there why match is behaving that way…

vlazar · January 11, 2022, 8:15am

Opened an issue in Ruby tracker https://bugs.ruby-lang.org/issues/18471

vlazar · January 12, 2022, 6:11am

The issue was closed.

The behavior difference between StringScanner#scan and Regexp#match when using a starting position might be confusing, but the issue contains some speculation about why StringScanner#scan was designed this way. See Bug #18471: Regex#match(re, position) with start of a string anchors ^ and \A (https://bugs.ruby-lang.org/issues/18471).

The proposed solution for Regexp#match with a position is to use \G:

str = "hello world"
/\Gworld/.match(str, 6) # => #<MatchData "world">

Found this description of \G from PCRE regex engine:

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.

So it looks like Ruby uses the same logic: \A matches only at the start of the whole string, while \G matches at the start point of the search, which is specified by the second argument of Regexp#match.
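To make the difference concrete, here is a minimal sketch comparing the two anchors at offset 6, plus the slicing workaround. The variable names are mine and only for illustration.

```ruby
str = "hello world"

# \A is tied to the start of the whole string, so it cannot match at offset 6.
a_match = /\Aworld/.match(str, 6)

# \G is tied to the position where the search begins (the second argument).
g_match = /\Gworld/.match(str, 6)

# Slicing first makes \A work, but offsets are then relative to the slice.
slice_match = /\Aworld/.match(str[6..-1])

p a_match              # nil
p g_match[0]           # "world"
p g_match.begin(0)     # 6
p slice_match.begin(0) # 0
```

Note that with \G the MatchData offsets still refer to the original string, which is usually what you want when you keep working with that string afterwards.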

SouravGoswami · January 12, 2022, 11:27am

That’s insightful. The Ruby docs also cover this:

^ - Matches beginning of line

$ - Matches end of line

\A - Matches beginning of string.

\Z - Matches end of string. If string ends with a newline, it matches just before newline

\z - Matches end of string

\G - Matches first matching position:

In methods like String#gsub and String#scan, it changes on each iteration. It initially matches the beginning of subject, and in each following iteration it matches where the last match finished.

"    a b c".gsub(/ /, '_')   #=> "____a_b_c"
"    a b c".gsub(/\G /, '_') #=> "____a b c"

In methods like Regexp#match and String#match that take an (optional) offset, it matches where the search begins.

"hello, world".match(/,/, 3)   #=> #<MatchData ",">
"hello, world".match(/\G,/, 3) #=> nil
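Because \G pins the match to the offset, Regexp#match with a position can be used for StringScanner-style scanning. Below is a rough sketch of that idea; the TOKEN pattern and the tokenize helper are hypothetical names made up for this example, not anything from the Ruby docs.

```ruby
# Hypothetical token pattern: a run of word characters or a run of whitespace,
# pinned by \G to the offset passed to Regexp#match.
TOKEN = /\G(?:\w+|\s+)/

# Hypothetical helper: walk the input by matching at the current offset
# and then moving the offset to the end of the match.
def tokenize(input)
  tokens = []
  pos = 0
  while pos < input.length
    m = TOKEN.match(input, pos)
    raise ArgumentError, "unexpected character at offset #{pos}" unless m
    tokens << m[0]
    pos = m.end(0)
  end
  tokens
end

p tokenize("hello world") # ["hello", " ", "world"]
```

Without \G the pattern would silently skip over characters it does not recognize and match further along the string; with \G an unrecognized character (such as the comma in "hello, world") is reported instead.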