Regular expression matches last occurrence instead of first

I’ve found an anomaly in the way Ruby handles non-greedy regular
expressions and wonder whether it’s been discussed before. A search of
the documentation and a general Internet search didn’t turn up
information on this issue.

When I want to match the first quoted string in a string such as:

"aaaaa""bbb""ccc"

I match the last quoted string instead. The exact characters don’t
matter.

Here’s the sample code; note that (.*?) and ([^"]+) behave the same
way–and not the way I’d expect:

str = '"aaaaa""bbb""ccc"'

str.scan(/"(.*?)"/)
puts $1
# ccc

Andy Oram
str.scan(/"([^"]+)"/)
puts $1
# ccc

str.scan(/"(.*?)"(.*)/)
puts $1
# aaaaa

Adding an extra (.*) to the end produces the result I want, but I
don’t believe it should make any difference.

Here is the equivalent Perl, which works as expected:

$str = q{"aaaaa""bbb""ccc"};
$str =~ /"(.*?)"/;
print $1 , "\n";
# aaaaa

$str =~ /"([^"]+)"/;
print $1 , "\n";
# aaaaa

$str =~ /"(.*?)"(.*)/;
print $1 , "\n";
# aaaaa

And the equivalent PHP:

<?php

$str = '"aaaaa""bbb""ccc"';
preg_match('/"(.*?)"/', $str, $matches);
echo $matches[1] , "\n";
// aaaaa

preg_match('/"([^"]+)"/', $str, $matches);
echo $matches[1] , "\n";
// aaaaa

preg_match('/"(.*?)"(.*)/', $str, $matches);
echo $matches[1] , "\n";
// aaaaa

?>

andyo wrote:

Here’s the sample code; note that (.*?) and ([^"]+) behave the same
way–and not the way I’d expect:

str = '"aaaaa""bbb""ccc"'

str.scan(/"(.*?)"/)
puts $1
# ccc

That’s normal. #scan is not what you’re looking for:

------------------------------------------------------------ String#scan
str.scan(pattern) => array
str.scan(pattern) {|match, ...| block } => str

 Both forms iterate through str, matching the pattern (which may be
 a Regexp or a String). For each match, a result is generated and
 either added to the result array or passed to the block. [...]

scan finds all successive matches for the pattern and sets the
capture-group variables every time it finds one. So here you simply
get the $1 of the last match, i.e. "ccc".
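
To make that concrete, here is a small sketch of what scan does with the string from the question. The variable names (all_matches, seen, greedy_tail) are only illustrative and are not from the original post:

```ruby
str = '"aaaaa""bbb""ccc"'

# Without a block, scan returns every match. With one capture group,
# each element is a one-element array holding that group.
all_matches = str.scan(/"(.*?)"/)
p all_matches   # [["aaaaa"], ["bbb"], ["ccc"]]

# After scan returns, $1 still refers to the last successful match.
p $1            # "ccc"

# With a block, $1 is updated on every iteration.
seen = []
str.scan(/"(.*?)"/) { seen << $1 }
p seen          # ["aaaaa", "bbb", "ccc"]

# A trailing (.*) lets the first match swallow the rest of the string,
# so scan finds only one match and $1 stays "aaaaa".
greedy_tail = str.scan(/"(.*?)"(.*)/)
p greedy_tail   # [["aaaaa", "\"bbb\"\"ccc\""]]
```

This would also explain why adding (.*) seemed to "fix" the pattern: it does not change which quoted string matches first, it only leaves nothing for scan to match afterwards.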

What you’re looking for is simply =~, as in Perl:

irb(main):001:0> str = '"aaaaa""bbb""ccc"'
=> "\"aaaaa\"\"bbb\"\"ccc\""
irb(main):002:0> str =~ /"(.*?)"/
=> 0
irb(main):003:0> $1
=> "aaaaa"
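
If you would rather not rely on the $1 global, a few other standard String methods should also return only the first match. This is just a sketch, and the variable names are made up for illustration:

```ruby
str = '"aaaaa""bbb""ccc"'

# String#[] with a regexp and a group index returns that group from
# the first match, or nil when nothing matches.
first_by_index = str[/"(.*?)"/, 1]

# String#match returns a MatchData object for the first match only
# (nil if there is no match, hence the guard).
md = str.match(/"(.*?)"/)
first_by_match = md && md[1]

# If all matches are wanted anyway, flatten scan's result and take
# the first element.
first_by_scan = str.scan(/"(.*?)"/).flatten.first

p first_by_index, first_by_match, first_by_scan
```

All three should give "aaaaa" for this input; which one reads best is mostly a matter of taste.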

Cheers,

Vincent