Hi,
I’m used to being able to use the following in PHP. What it basically
does is return all matches, including the captures, ordered by match set,
and provide the offsets.
$ php -r 'preg_match_all("/(\w+)/", "foo bar", $matches,
PREG_SET_ORDER|PREG_OFFSET_CAPTURE); var_dump($matches);'
array(2) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "foo"
      [1]=>
      int(0)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "foo"
      [1]=>
      int(1)
    }
  }
  [1]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "bar"
      [1]=>
      int(6)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "bar"
      [1]=>
      int(7)
    }
  }
}
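In Ruby terms, the structure above is an array of match sets, each set being an array of [text, offset] pairs. A single MatchData object already holds the data for one such set. The following is only a minimal sketch: the helper name pairs_with_offsets is made up for illustration, and it assumes that character offsets (which is what MatchData#begin reports in Ruby 1.9) are acceptable in place of PHP's byte offsets.

```ruby
# Hypothetical helper (not part of Ruby): turn one MatchData into
# PHP-style [text, offset] pairs, group 0 first, then each capture.
def pairs_with_offsets(m)
  (0...m.size).map { |i| [m[i], m.begin(i)] }
end

m = "x=12".match(/(\w)=(\d+)/)
p pairs_with_offsets(m)
# prints [["x=12", 0], ["x", 0], ["12", 2]]
```

A group that did not take part in the match shows up as [nil, nil], since both m[i] and m.begin(i) return nil in that case.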
I’ve found two ways in Ruby that go in this direction, String#match and
String#scan, but each gives me only part of the information. I guess I
could combine the two, but before attempting this I wanted to check
whether I’ve overlooked something. Here are my Ruby attempts:
ruby-1.9.2-p180 :001 > m = "foo bar".match(/(\w+)/)
=> #<MatchData "foo" 1:"foo">
ruby-1.9.2-p180 :002 > [ m[0], m[1] ]
=> ["foo", "foo"]
ruby-1.9.2-p180 :003 > [ m.begin(0), m.begin(1) ]
=> [0, 1]
But here I’m missing the further possible matches, “bar” and “bar”. Or
the #scan approach:
ruby-1.9.2-p180 :004 > m = "foo bar".scan(/(\w+)/)
=> [["foo"], ["bar"]]
But in this case I have even less information: the full match including
foo or bar is not present, and I can’t get the offsets either.
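One detail that may be relevant here: when String#scan is called with a block, Ruby updates the last-match data for every hit, so Regexp.last_match (or $~) inside the block returns the full MatchData, including offsets. A minimal sketch, with variable names chosen only for illustration:

```ruby
# Collect one MatchData per match by reading Regexp.last_match
# inside the block that String#scan yields to.
matches = []
"foo bar baz".scan(/(\w+)/) { matches << Regexp.last_match }

matches.each { |m| p [m[0], m.begin(0), m[1], m.begin(1)] }
# prints:
# ["foo", 0, "foo", 0]
# ["bar", 4, "bar", 4]
# ["baz", 8, "baz", 8]
```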
I re-read the documentation for Regexp#match and found out that you can
pass an offset into the string as the second parameter, so I guess I can
iterate over the string in a loop until I find no further matches?
With this in mind I came up with:
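$ cat test_match_all.rb
require 'pp'

class String
  # Return a MatchData object for every match of pattern in the string.
  def match_all(pattern)
    matches = []
    offset = 0
    while (m = match(pattern, offset))
      matches << m
      # Resume after this match; step one further on an empty match
      # so the loop cannot stall.
      offset = m[0].empty? ? m.end(0) + 1 : m.end(0)
    end
    matches
  end
end

pp "foo bar baz".match_all(/(\w+)/)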
$ ruby test_match_all.rb
[#<MatchData "foo" 1:"foo">,
 #<MatchData "bar" 1:"bar">,
 #<MatchData "baz" 1:"baz">]
I have lots of data to parse, so I can foresee this approach becoming a
bottleneck. Is there a more direct solution?
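Whether the loop really is a bottleneck can be measured rather than guessed. The sketch below times the offset loop against a variant that lets String#scan drive the search, using the standard Benchmark module. The helper names match_all_loop and match_all_scan and the sample input are assumptions made up for this illustration, and the timings will depend on the Ruby version and the data.

```ruby
require 'benchmark'

# Hypothetical helper names; neither is part of Ruby's standard library.

# Loop approach: restart the search at the end of the previous match.
def match_all_loop(str, pattern)
  matches = []
  offset = 0
  while (m = str.match(pattern, offset))
    matches << m
    # Step past empty matches so the loop always makes progress.
    offset = m[0].empty? ? m.end(0) + 1 : m.end(0)
  end
  matches
end

# Scan approach: String#scan finds the matches, the block keeps the MatchData.
def match_all_scan(str, pattern)
  matches = []
  str.scan(pattern) { matches << Regexp.last_match }
  matches
end

data = "foo bar baz " * 2_000 # made-up sample input

Benchmark.bm(6) do |x|
  x.report("loop:") { match_all_loop(data, /(\w+)/) }
  x.report("scan:") { match_all_scan(data, /(\w+)/) }
end
```

For this pattern both helpers return the same matches; patterns that can match the empty string may be handled differently by the two and would need checking separately.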
thanks,
- Markus