RE lookahead RE problems

I’ll start by confessing that this comes originally from something I
worked on in Perl, and I’ve assumed, rightly or wrongly, that regular
expressions are regular expressions are regular expressions.

See
http://www.ilovejackdaniels.com/cheat-sheets/regular-expressions-cheat-sheet/

The context is that there is a whole pile of patterns that must be
preceded by … well, not words or some punctuation. Call them
“sort-of zero width”. That is, white space, beginning of line and some
opening sequences, call them ‘[’ and ‘(’ and ‘{’ for the sake of the
example, are allowed.

I’m trying to put the RE into a ‘constant’ so that I don’t have to keep
repeating it - all the DRY stuff about changes and so forth!

I’m trying to use RE lookahead and lookbehind assertions.

This works in perl

   $STARTWORD = qr/^|(?<=[\s\(\[\{])/m;

There is also the corresponding end word

   $ENDWORD   = qr/$|(?=[ \t\n\,\.\;\:\!\?\)])/om;

When I translate these into Ruby I get an error.
It doesn’t seem to like the lookbehind.
The error message is

SyntaxError  undefined (?...) sequence: /^|(?<=[\s\(])/

Well, possibly. Or it may be that I’m having problems when combining
it with an actual pattern.

What I’ve done is separate out the pattern to a constant (and tried to
eliminate things that might confuse the parser)

STARTWORD = %r{^|(?<=[\s(])}m

And LO! The parser chokes on that.
Does it choke because there isn’t actually a pattern being compared?
Well, maybe. If I remove the ‘%r{’ stuff the parser doesn’t choke.
But it doesn’t choke on

ENDWORD = %r{$|(?=[\s,.;:!?)])}m

And I seem to be getting confused when combining these with other
regular expressions because of this inconsistency.

Right now I don’t know if the problem is having the REs as constants.
Does this make them ‘precompiled’?
ENDWORD.type ==> “Regexp”
so I’m presuming it is. In which case why can’t I precompile STARTWORD?

So: Is it that Ruby can’t handle the ‘?<=’ lookbehind assertion … or
what? Am I completely hung up on a wrong track?


Any simple problem can be made insoluble if enough meetings are held to
discuss it.

The ruby regular expression engine doesn’t support look-behind.

2008/1/16, Anton J Aylward:

So: Is it that Ruby can’t handle the ‘?<=’ lookbehind assertion …

As far as I know, look-behind assertions are not handled by
Ruby 1.8.* but I think Oniguruma in 1.9 can.

You should ask your question on the Ruby-Talk mailing list,
which is a more appropriate place for this kind of question.

-- Jean-François.
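
For anyone reading this on Ruby 1.9 or later, here is a minimal sketch of how the two constants from the question might look there. It assumes the built-in engine accepts (?<=...), as suggested above; find_word is a made-up helper for illustration only. The Perl /m modifier is dropped because in Ruby ^ and $ already match at line boundaries, and /m instead makes . match newlines.

```ruby
# Sketch for Ruby 1.9+ only; on 1.8 the first literal is a SyntaxError.
STARTWORD = /^|(?<=[\s(\[{])/
ENDWORD   = /$|(?=[\s,.;:!?)])/

# Hypothetical helper: returns every occurrence of +word+ that is
# bracketed by STARTWORD and ENDWORD. Interpolating a Regexp wraps it
# in a non-capturing group, so the inner | stays contained.
def find_word(text, word)
  text.scan(/#{STARTWORD}#{Regexp.escape(word)}#{ENDWORD}/)
end

hits = find_word("cat (cat) concat cat, cats", "cat")
puts hits.length
```

Here "concat" is skipped because the inner "cat" follows a letter, and "cats" is skipped because "s" is not in the ENDWORD set.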

John H. said the following on 16/01/08 12:43 PM:

The ruby regular expression engine doesn’t support look-behind.

Comparison of regular expression engines - Wikipedia

{{ExpletiveDeleted!}}

Suggestions?
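
One possible workaround on 1.8, offered as a sketch and not as a drop-in replacement for the original patterns: since look-ahead is accepted but look-behind is not, the opening character can be consumed by an ordinary capture group and written back in the replacement. START_PREFIX, END_LOOKAHEAD and wrap_word are hypothetical names used only for illustration.

```ruby
# The prefix is captured instead of asserted, so no (?<=...) is needed.
START_PREFIX  = '(^|[\s(\[{])'
# Look-ahead is kept for the end, so the trailing character is not consumed.
END_LOOKAHEAD = '(?=$|[\s,.;:!?)])'

# Hypothetical helper: wraps each stand-alone occurrence of +word+ in <...>.
def wrap_word(text, word)
  pattern = Regexp.new(START_PREFIX + '(' + Regexp.escape(word) + ')' + END_LOOKAHEAD)
  # $1 is the consumed prefix (possibly empty), $2 is the word itself.
  text.gsub(pattern) { "#{$1}<#{$2}>" }
end

puts wrap_word("foo (foo) xfoo foo.", "foo")
```

The cost of this approach is that the prefix character becomes part of the match, so any code using the match has to put it back or skip over it.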


The state can’t give you free speech, and the state can’t take it away.
You’re born with it, like your eyes, like your ears. Freedom is
something you assume, then you wait for someone to try to take it away.
The degree to which you resist is the degree to which you are free…
–Utah Phillips

Oniguruma

http://oniguruma.rubyforge.org/

This engine is the RegExp engine for Ruby 1.9 and onwards, so you only
need this gem for 1.8.x.

Jason

Jason R. said the following on 16/01/08 01:21 PM:

Oniguruma

http://oniguruma.rubyforge.org/

This engine is the RegExp engine for Ruby 1.9 and onwards, so you only
need this gem for 1.8.x.

Roll on 1.9 then, because I get pages and pages of error messages when I
try installing this gem, starting with

oregexp.c:2:23: error: oniguruma.h: No such file or directory

Now that can’t be because I don’t have the Ruby sources installed, can
it?


If God does not write LisP, God writes some code so similar to
LisP as to make no difference.
See also: xkcd: Lisp

The Oniguruma gem is just a wrapper around the actual library. I haven’t
installed this myself, though I assumed it would come with the needed
code.
You just need to install Oniguruma itself, then get the gem.

Jason

Thanks for all the help everyone. The problem was solved with the help
from pullmonkey on Rails Forum! Here is the solution:

Objective:

  1. Extract vowels and consonants from a string
  2. Handle the conditional treatment of ‘y’ as a vowel under the
    following circumstances:
    • y is a vowel if it is surrounded by consonants
    • y is a consonant if it is adjacent to a vowel

Here is the code that works:

# Plain vowels, plus y when no vowel sits on either side of it
def vowels(name_str)
  reg = Oniguruma::ORegexp.new('[aeiou]|(?<![aeiou])y(?![aeiou])')
  reg.match_all(name_str).to_s.scan(/./)
end

# Explicit consonants, plus y when a vowel sits before or after it
def consonants(name_str)
  reg = Oniguruma::ORegexp.new('[bcdfghjklmnpqrstvwx]|(?<=[aeiou])y|y(?=[aeiou])')
  reg.match_all(name_str).to_s.scan(/./)
end

(Note, the .scan(/./) can be eliminated to return an array)

The major problem was getting the code to accurately treat “y” as a
consonant. The key to solving this problem was to:

  1. define unconditional consonants explicitly (i.e.,
    [bcdfghjklmnpqrstvwx]) – not as [^aeiou], which automatically includes
    “y” and so overrides any conditional treatment of “y” that follows

  2. define conditional “y” regexp assertions independently, i.e.,
    “|(?<=[aeiou])y|y(?=[aeiou])” – not “|(?<=[aeiou])y(?=[aeiou])”,
    which only matches “y” preceded AND followed by a vowel, not preceded
    OR followed by a vowel

HTH.
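
As a rough sketch of the same two rules on Ruby 1.9 or later, where the built-in engine is assumed to handle look-behind so that no gem is needed: VOWEL_RE and CONSONANT_RE are names made up for this example, and the consonant class here also includes “z”, which is an assumption about intent.

```ruby
# Sketch for Ruby 1.9+ only; plain literals with look-behind fail on 1.8.
VOWEL_RE     = /[aeiou]|(?<![aeiou])y(?![aeiou])/i
CONSONANT_RE = /[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|y(?=[aeiou])/i

# Returns an array of single-character strings.
def vowels(name_str)
  name_str.scan(VOWEL_RE)
end

def consonants(name_str)
  name_str.scan(CONSONANT_RE)
end

p vowels("rhythm")
p consonants("yay")
```

In "rhythm" the y has consonants on both sides and is collected as a vowel; in "yay" both y's touch the vowel "a" and are collected as consonants.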

The library can be found here:
http://www.geocities.jp/kosako3/oniguruma/

I am trying to get look behind working as well. However, having got
past the errors, I am now wrestling with syntax:

** Starting Rails with development environment…
Exiting
/usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `gem_original_require':
./lib/string_extensions.rb:4: undefined (?...) sequence: /[aeiou]|(?<![aeiou])y(?![aeiou])/ (SyntaxError)
./lib/string_extensions.rb:8: undefined (?...) sequence: /![aeiou]|(?<=[aeiou])y(?=[aeiou])/
from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `require'
It seems to be complaining about the look-behind and look-ahead
assertions in the following code fragment (which Oniguruma is
supposed to support):
class String
  def vowels
    scan(/[aeiou]|(?<![aeiou])y(?![aeiou])/i)
  end
  def consonants
    scan(/![aeiou]|(?<=[aeiou])y(?=[aeiou])/i)
  end
end
According to the Oniguruma reference (doc/RE.txt), the look-behind and
look-ahead syntax that I am using appears to be correct (ref. section 7,
Extended groups) but apparently is not.
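
A likely explanation, offered with some caution: a /.../ literal is compiled by the interpreter's own engine when the file is parsed, so on 1.8 installing the gem would presumably not change what literals accept; the ORegexp class used in the working code earlier in the thread is what goes through the library. Separately, the leading ! in the second pattern is a literal exclamation mark in regex syntax, not a negation. If avoiding the gem is acceptable, the same classification can be sketched without any look-behind at all. PLAIN_VOWELS, plain_vowel? and split_letters are hypothetical names for this example.

```ruby
# Classifies letters by inspecting neighbours directly, so no
# look-behind or look-ahead assertion is needed.
PLAIN_VOWELS = %w[a e i o u]

def plain_vowel?(ch)
  !ch.nil? && PLAIN_VOWELS.include?(ch)
end

# Returns [vowels, consonants] as two arrays of single-character strings.
def split_letters(str)
  chars = str.downcase.split(//)
  vowels = []
  consonants = []
  chars.each_with_index do |ch, i|
    next unless ch =~ /[a-z]/          # ignore spaces, digits, punctuation
    prev_ch = i > 0 ? chars[i - 1] : nil
    next_ch = chars[i + 1]             # nil past the end of the array
    if plain_vowel?(ch)
      vowels << ch
    elsif ch == 'y' && !plain_vowel?(prev_ch) && !plain_vowel?(next_ch)
      vowels << ch                     # y with no vowel on either side
    else
      consonants << ch                 # all other letters, incl. y next to a vowel
    end
  end
  [vowels, consonants]
end

p split_letters("rhythm")
```

This trades the compactness of the regex for code that should parse on any Ruby version.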