Restricted capture in Regexp

Is there a regexp feature that lets me require something to be present
in the input string for the regexp to match, but for that to not become
captured as part of the match?

I want this so that I can scan and gsub on a string of code and replace
variables. Matching just variables requires looking at the context
around them, but if I capture this, I replace the context too.

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]/

but using that on “exp(x)” will match (and replace) “(x”, which I don’t
want at all.

Cheers,
Benjohn

[email protected] wrote:

Is there a regexp feature that lets me require something to be present
in the input string for the regexp to match, but for that to not become
captured as part of the match?

Neither yes nor no, because of how you have worded your question. See
below.

I want this so that I can scan and gsub on a string of code and replace
variables. Matching just variables requires looking at the context
around them, but if I capture this, I replace the context too.

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]/

but using that on “exp(x)” will match (and replace) “(x”, which I don’t
want at all.

There are a number of ways to accomplish this. The simplest is to put
the part you want to preserve in parentheses, and refer to it in the
replacement.

Like this:

data.sub!(%r{(^|[^a-zA-Z])([xy])([^a-zA-Z]|$)}, '\1\2\3')

Notice about this example that the [xy] character class is now captured
and used as part of the replacement, so its original value is preserved.

Using this approach, you preserve the parts you don’t want to replace,
and replace the parts you do. In the above example, everything is
preserved, but it is just meant to show the pattern.
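
A minimal sketch of the capture-and-reinsert approach actually replacing something. The input string and the replacement value 42 are made up for illustration; only the variable x is targeted here.

```ruby
# Hypothetical input and replacement value, for illustration only.
data = "exp(x)+y"

# Group 1 holds the context before x, group 2 the context after it.
# Both are written back unchanged; only the x between them is replaced.
result = data.gsub(/(^|[^a-zA-Z])x([^a-zA-Z]|$)/) { "#{$1}42#{$2}" }

puts result
```

The x inside "exp" is left alone because it is preceded by a letter, so group 1 cannot match there.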

Hi Benjohn,

On Wednesday, 13 Dec 2006, 18:24:08 +0900, [email protected] wrote:

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]/

but using that on “exp(x)” will match (and replace) “(x”, which I don’t
want at all.

/\b[xy]\b/

The \b pattern (word boundary) will look to the left like the ^ pattern
does.
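
As a small sketch of what that means in practice (the input string is invented for illustration), \b checks the surrounding context without making it part of the match, so only the variable itself is replaced:

```ruby
# Hypothetical input: "xy" is a different, two-letter name that must survive,
# and the x inside "exp" is not a variable either.
code = "exp(x)+y*xy"

# \b is zero-width, so each match is just the single letter x or y.
result = code.gsub(/\b[xy]\b/) { |name| name.upcase }

puts result
```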

I would appreciate a general pattern that looks to the left, as a
counterpart to (?=re), which looks to the right without consuming.

Bertram

Using this approach, you preserve the parts you don’t want to replace,
and replace the parts you do. In the above example, everything is
preserved, but it is just meant to show the pattern.

Hi Paul,

thanks for the reply. I know I can do this, but it means that the
substitution ('\1\2\3') has to be aware of the composition of the
regular expression. The Regexp is no longer a neat little machine that
only grabs things to replace. It’s now grabbing the packaging around the
thing to replace too, so you’ve got to be aware of this in writing the
substitution.

Cheers,
Benjohn

/\b[xy]\b/

The \b pattern (word boundary) will look to the left like the ^ pattern
does.

This seems like the best approach in this case, as it’s a good enough
way to find variables. It does break down in the complex case though.

I would appreciate a general pattern that looks to the left, as a
counterpart to (?=re), which looks to the right without consuming.

The book I’m reading (O’Reilly pocket reference) hints at the lookaround
constructs being:

(?=...) - look ahead.
(?!...) - negated look ahead.
(?<=...) - look behind.
(?<!...) - negated look behind.

So perhaps one of those is what you want?

[email protected] wrote:

/ …

thanks for the reply. I know I can do this, but it means that the
substitution ('\1\2\3') has to be aware of the composition of the
regular expression.

Yes, that is true for all regular expressions.

The Regexp is no longer a neat little machine that
only grabs things to replace. It’s now grabbing the packaging around the
thing to replace too, so you’ve got to be aware of this in writing the
substitution.

Yes, but this cannot be avoided. You have two choices for examined text
that surrounds the area to be modified: you can capture it while
examining it, and use the captured text in the replacement, or you can
use non-capturing references:

(?=non-captured text)

But the two alternatives work much the same way: they examine text that
is preserved as part of the overall regular expression. All that changes
is /how/ the text is preserved.
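
To make the comparison concrete, here is a small sketch of both alternatives applied to the same made-up input. The goal in this example (replace x only where it is directly followed by a closing parenthesis) is an assumption chosen purely for illustration.

```ruby
# Hypothetical input: only the x directly before ")" should be replaced.
text = "f(x)+x"

# Alternative 1: capture the context and write it back in the replacement.
captured = text.gsub(/x(\))/) { "1#{$1}" }

# Alternative 2: examine the context with a lookahead, which does not
# become part of the match, so the replacement need not mention it.
lookahead = text.gsub(/x(?=\))/, "1")

puts captured
puts lookahead
```

Both calls produce the same string; the difference is only whether the replacement has to know about the surrounding text.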

So, to move ahead, please post a specific example of what you need. Post
an example of the original string and the desired replacement.

It is scarcely possible to describe in prose what one wants from a
regular expression. It /is/ possible to take a first step by posting an
example of original text, and replacement text. Maybe we should try that.

On 12/13/06, [email protected] [email protected] wrote:
[snip]

cf = CodeFragment.new
cf.code_fragment = "sin(x+y)"
puts cf.output_substitution({'x'=>1, 'y'=>2})

should give "sin(1+2)"
[snip]

prompt> cat a.rb
s = "sin(x+y)"
h = {
  'x' => '1',
  'y' => '2',
}
h.each do |pattern, replacement|
  # Match the name only where it stands alone as a word.
  r = Regexp.new('\b' + Regexp.escape(pattern) + '\b')
  s.gsub!(r) { replacement }
end
p s

prompt> ruby a.rb
"sin(1+2)"

you can use non-capturing references:

(?=non-captured text)

I think this may be what I should use. Also, the suggestion of using word
edge tokens works for the specific case.

But the two alternatives work much the same way: they examine text that
is preserved as part of the overall regular expression. All that changes
is /how/ the text is preserved.

So, to move ahead, please post a specific example of what you need. Post
an example of the original string and the desired replacement.

:) Well, I have a solution for the specific case. That’s not what I’m
getting at though. I’m trying to find out if regexps allow me to do
something more general. I want to do this (sorry, I don’t have a Ruby to
hand):

class CodeFragment
  attr_accessor :code_fragment

  def variables_regexp
    /\b[xyz]\b/
  end

  def utilised_variables
    code_fragment.scan(variables_regexp).uniq.sort
  end

  def output_substitution(substitutes)
    # v is the whole matched variable name, e.g. "x"
    code_fragment.gsub(variables_regexp) do |v|
      substitutes[v]
    end
  end
end

cf = CodeFragment.new
cf.code_fragment = "sin(x+y)"
puts cf.output_substitution({'x'=>1, 'y'=>2})

should give "sin(1+2)"

What I want is for the thing that provides the regular expression to not
need to know about the function that is using it; and for the function
that uses the regular expression to not know about the expression
provided.

It /is/ possible to take a first step by posting an example of original
text, and replacement text. Maybe we should try that.

Thank you for your help here.

I’m not trying to solve a single problem though, I’m trying to
understand what kinds of problem I can solve.

I want something that acts as an abstract machine for finding things in
a string (in this case variables, but the rules could be more complex).
One should be able to use this machine without knowing what it finds, or
how it finds. All I should need to know is that it finds things. I’m
trying to understand if regexps are able to do this - to provide this
separation. Perhaps they don’t, which is fine. I’d just like to know if
they do or not, or if they do a bit, how much.

Thanks,
Benjohn

[email protected] wrote:

I’m trying to understand if regexps are able to do this - to provide
this separation. Perhaps they don’t, which is fine. I’d just like to
know if they do or not, or if they do a bit, how much.

Again, your prose description is not precise enough for a reader to know
exactly what you want, which is why we have such things as computer
languages and mathematics. But one can offer educated guesses.

Here is a function that doesn’t know in advance what will be sought, it
simply and blindly carries out a certain kind of filtering based on
caller-provided strings:

def get_text_between_tags(data,tag)
  return data.scan(%r{<#{tag}>(.*?)</#{tag}>})
end

If I call this function with a set of HTML data in “data” (containing an
HTML page) and a tag string like “td”, this function will return an
array containing the text between each pair of <td> … </td> tags in the
data string.

Note that this function will accept any data string whatsoever, and it
will also accept any search tag whatsoever.
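
A short usage sketch, with an invented one-row HTML snippet as input. Because the pattern contains a capture group, scan returns one small array per match rather than a flat list of strings:

```ruby
def get_text_between_tags(data, tag)
  data.scan(%r{<#{tag}>(.*?)</#{tag}>})
end

# Hypothetical input, for illustration only.
html = "<tr><td>one</td><td>two</td></tr>"

cells = get_text_between_tags(html, "td")
p cells          # one array per match, holding the captured group
p cells.flatten  # plain list of the captured strings
```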

Is this what you mean? Can you extrapolate this way of approaching the
problem to solve your own?

[email protected] wrote:

The book I’m reading (O’Reilly pocket reference) hints at the lookaround
constructs being:

(?=...) - look ahead.
(?!...) - negated look ahead.

The following two aren’t supported in the current Ruby regexp engine,
but they are in the one that Ruby 1.9 and later will use.

(?<=…) - look behind.
(?<!..) - negated look behind.

So perhaps one of those is what you want?

Either way, it’s possible to emulate positive lookbehinds by capturing
what would be the pre-match and putting it into the replacement:

string.sub(/(some lookbehind pattern)(what you’re looking for)/) {
  $1 + replacement_of($2)
}

instead of:

string.sub(/(?<=some lookbehind pattern)what you’re looking for/) {
  replacement_of($~.to_s)
}

and kludge negative lookbehinds by instead enumerating all the patterns
that would match in a positive one. Real lookbehinds just make the
pattern (sometimes much) more elegant in most cases.
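
A runnable sketch of the two forms side by side, assuming an engine that supports lookbehind (Ruby 1.9 or later). The input string and the "double the number after a dollar sign" task are made up for illustration.

```ruby
# Hypothetical input: only the number preceded by "$" should be doubled.
price = 'cost $5 or 5'

# Emulated lookbehind: capture the prefix and put it back by hand.
emulated = price.gsub(/(\$)(\d+)/) { $1 + ($2.to_i * 2).to_s }

# Real lookbehind (Ruby 1.9+): the prefix is checked but not matched.
lookbehind = price.gsub(/(?<=\$)\d+/) { |digits| (digits.to_i * 2).to_s }

puts emulated
puts lookbehind
```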

David V.

Hi,

On Wednesday, 13 Dec 2006, 19:28:08 +0900, [email protected] wrote:

I would appreciate a general pattern that looks to the left, as a
counterpart to (?=re), which looks to the right without consuming.

The book I’m reading (O’Reilly pocket reference) hints at the lookaround
constructs being:

(?=...) - look ahead.
(?!...) - negated look ahead.
(?<=...) - look behind.
(?<!...) - negated look behind.

The latter two don’t work in Ruby as far as I know.

irb(main):001:0> "hello" =~ /(?<=e)ll/
SyntaxError: compile error
(irb):1: undefined (?..) sequence: /(?<=e)ll/
from (irb):1

Bertram

[email protected] wrote:

but using that on “exp(x)” will match (and replace) “(x”, which I don’t
want at all.

Cheers,
Benjohn

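class String
  # Replace only the text matched by the first capture group,
  # keeping the rest of each match intact.
  def gsub_capture( regexp, replacement )
    gsub( regexp ){|s|
      # Position of group 1 relative to the start of the whole match.
      offset = $~.offset(1)[0] - $~.offset(0)[0]
      s[ 0, offset ] +
        replacement + s[ offset + $1.size .. -1 ] }
  end
end

puts "1. Take two <2> cups and three <3> spoons".
  gsub_capture( / <(\d)> /, "x")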

On 12/13/06, [email protected] [email protected] wrote:
[snip]

I want something that acts as an abstract machine for finding things in
a string (in this case variables, but the rules could be more complex).
One should be able to use this machine without knowing what it finds, or
how it finds. All I should need to know is that it finds things. I’m
trying to understand if regexps are able to do this - to provide this
separation. Perhaps they don’t, which is fine. I’d just like to know if
they do or not, or if they do a bit, how much.

In a language like Ruby, it’s not possible to distinguish between
a variable name and a method name by just looking at the name.
A regexp just looks at the name.

If you want to replace a variable name then you need to
parse the code.

On 13 Dec 2006, at 17:35, Paul L. wrote:

Is this what you mean? Can you extrapolate this way of approaching the
problem to solve your own?

I was able to. I had not understood that scan and gsub work
differently when capturing takes place. Scan seems to have more
sensible behaviour. I would like gsub’s block or second parameter to
provide an array, and for this to replace the captured parts of the
regexp, so:

"axb".gsub(/(.)x(.)/, ['A', 'B'])

would return:

"AxB"

gsub doesn’t behave like this, but I imagine it would be possible to
build a gsub-like function that did. :) It would probably need to
inspect the regular expression given to it with a regular expression.
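
One possible sketch of such a function, using the offsets that the match object already provides instead of inspecting the pattern. The name gsub_groups is hypothetical (it is not part of Ruby's String API), and the sketch assumes the capture groups are not nested.

```ruby
# Hypothetical helper: replace each captured group with the corresponding
# element of `replacements`, leaving the rest of each match untouched.
# Assumes capture groups do not nest or overlap.
def gsub_groups(string, regexp, replacements)
  string.gsub(regexp) do
    m = Regexp.last_match
    whole = m[0].dup
    base = m.begin(0)
    # Work from the last group to the first so earlier offsets stay valid.
    (m.size - 1).downto(1) do |i|
      next if m.begin(i).nil? # group did not take part in this match
      whole[m.begin(i) - base, m[i].length] = replacements[i - 1].to_s
    end
    whole
  end
end

p gsub_groups("axb", /(.)x(.)/, ['A', 'B'])
```

The caller supplies only the pattern and the replacement values; the function itself never needs to know what lies between the groups.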

Thanks everyone,
Benjohn

On Dec 14, 2006, at 7:05 PM, Paul L. wrote:

Not really. Each of sub(), gsub() and scan() has its niche. It is more a
matter of learning how to use them.
[snip]
And, I am sure, in many other ways.


Paul L.
http://www.arachnoid.com

If you’re not interested in the other groupings you can use (?:) to
group the regexp without capturing.

rep = ['A', 'B']
result = "axb".gsub(/(?:.)(x)(?:.)/, "#{rep[0]}\\1#{rep[1]}")

Of course, these RE’s don’t even need to be grouped at all:

rep = ['A', 'B']
result = "axb".gsub(/.(x)./, "#{rep[0]}\\1#{rep[1]}")

And (x) is just a match for ‘x’, so you don’t have to use a group at
all.

In general, you could take your regexp, rewrite to capture the parts
between the desired replacements, and use a replacement (or a block)
similar to what Paul introduced to get the result you desire.

-Rob

Rob B. http://agileconsultingllc.com
[email protected]

Benjohn B. wrote:

/…

gsub doesn’t behave like this, but I imagine it would be possible to
build a gsub-like function that did. :)

result = "axb".gsub(/(.)(x)(.)/, 'A\2B') # gets what you want.

It would probably need to
inspect the regular expression given to it with a regular expression.

Not really. Each of sub(), gsub() and scan() has its niche. It is more a
matter of learning how to use them.

And, now that I think about it, your example using a provided array of
replacement values can be implemented like this:

rep = ['A', 'B']

result = "axb".gsub(/(.)(x)(.)/, "#{rep[0]}\\2#{rep[1]}")

And, I am sure, in many other ways.
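
One caveat worth a quick sketch when trying replacement strings like these: in a double-quoted Ruby string a backreference needs a doubled backslash, because "\2" on its own is read as an octal character escape. Single-quoted strings do not have this problem. The variable names below are illustrative only.

```ruby
rep = ['A', 'B']

# Single quotes: \2 reaches gsub as a backreference.
single = "axb".gsub(/(.)(x)(.)/, 'A\2B')

# Double quotes with interpolation: the backslash must be doubled.
double = "axb".gsub(/(.)(x)(.)/, "#{rep[0]}\\2#{rep[1]}")

# Double quotes with a single backslash: "\2" is the character U+0002,
# so the matched x is lost.
octal = "axb".gsub(/(.)(x)(.)/, "A\2B")

p single, double, octal
```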