Alternate Regular Expressions?

Just randomly curious -

Is there an alternate RegExp “language” to the current one in Ruby
and Perl?

Don’t get me wrong, I love the current RegExp in Ruby, but I’m
allowed to be curious…

Also, is Ruby going to jump on the Perl 6 RegExp ship?

^^^^^^^ That’s a big one to some people I know.

Thanks,
~ Ari
English is like a pseudo-random number generator - there are a
bajillion rules to it, but nobody cares.

Ari B. wrote:

Just randomly curious -

Is there an alternate RegExp “language” to the current one in Ruby and
Perl?

I don’t know. So here’s a dissertation on where to start.

The good news is a RegExp is only two things at heart…

  • a Domain-Specific Language to program
  • a state machine.

The bad news is, back in the day, people used to invent DSLs as long strings
of easily parsed characters. For example, a language called LSYSTEM might
describe turtle graphics like this:

s=[::cc!!!&&[FFcccZ]^^^^FFcccZ] # upper spikes

The really bad news is RegExp is one of these string-oriented DSLs that
stuck. It will always be useful, so programmers forget how much room it has
for improvement.

The good news is Ruby excels at generating light DSLs. The equivalent
expression for a modern implementation of LSYSTEM might look like this:

upper_spikes = push.twist(2).thinner(2).increase_angle(4)…

etc. Because Ruby gives your programming interfaces extreme notational
flexibility, you can declare the interfaces most convenient for your
domain.
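As a rough illustration of that notational flexibility, here is a minimal sketch of a chainable interface. The TurtlePath class is hypothetical; only the method names push, twist, thinner and increase_angle come from the line above, and what each one would actually draw is left out.

```ruby
# Hypothetical fluent builder: each call records a command and returns self,
# so calls can be chained like the upper_spikes line above.
class TurtlePath
  attr_reader :commands

  def initialize
    @commands = []
  end

  # Define one chainable method per command name used in the post.
  [:push, :twist, :thinner, :increase_angle].each do |name|
    define_method(name) do |count = 1|
      @commands << [name, count]
      self
    end
  end
end

upper_spikes = TurtlePath.new.push.twist(2).thinner(2).increase_angle(4)
p upper_spikes.commands
```

Returning self from every method is what makes the chain read like a sentence; a real implementation would interpret the recorded commands afterwards.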

So start writing! And research other DSLs as you go. For example, here’s a
DSL written with C++ metaprogramming:

http://boost-sandbox.sourceforge.net/libs/xpressive/doc/html/index.html

Whenever you like, that language slips back to raw RegExp. Your effort
should have a similar shunt.

English is like a pseudo-random number generator - there are a bajillion
rules to it, but nobody cares.

Of all the world’s languages, English is both the ugliest and the
beautifulest.

On Aug 6, 2007, at 9:40 PM, Phlip wrote:

So start writing! and research other DSLs as you go.

Ugh. If I must (which I must). What would you suggest as syntax?

Also, should I completely try to reinvent the wheel, or create a
wrapper for current RegExp?

Man. I need a mentor on this :expressionless:

aRi
--------------------------------------------|
IMO, Arabic has THE most beautiful script.
Poetically, English is extremely beautiful. It’s like a language of
RegExp - except there are no rules!
Spoken, the most beautiful language is either French (sorry) or
Esperanto.

Ari B. wrote:

Man. I need a mentor on this :expressionless:

This might give you a place to start:

Ari B. wrote:

Ugh. If I must (which I must).

You missed where I said I didn’t know the actual answer.

What would you suggest as syntax?

Ruby itself, as a DSL; that was the point.

rx = match('foo') | match('bar') # like /(foo|bar)/
assert_equal ['foo', 'bar'], rx.scan('a foo b bar')

Make match() return an object that overloads the | operator (Ruby's `or`
keyword itself cannot be overloaded), and away you go!
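A minimal sketch of that idea, assuming a hypothetical Pattern class and a top-level match() helper (neither is an existing library). It uses | for alternation because Ruby does not allow the `or` keyword to be redefined.

```ruby
# Hypothetical wrapper: holds a Regexp and overloads | for alternation.
class Pattern
  attr_reader :regexp

  def initialize(regexp)
    @regexp = regexp
  end

  # Alternation: either this pattern or the other one.
  def |(other)
    Pattern.new(Regexp.union(regexp, other.regexp))
  end

  # Return every substring of text that matches.
  def scan(text)
    text.scan(regexp)
  end
end

# Build a pattern that matches a literal string.
def match(literal)
  Pattern.new(Regexp.new(Regexp.escape(literal)))
end

rx = match('foo') | match('bar')
p rx.scan('a foo b bar')
```

Regexp.union does the actual combining here, so the wrapper stays a thin layer over the built-in engine.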

Ari,

How serious are you about this? Several years ago I wrote a Python library
that treats Python regular expressions as semantic, not syntactic, objects,
and that has been incredibly useful to me. I’ve started to port it to Ruby,
but simply don’t have the time. If you do (you’re probably looking at a
couple of weeks of full-time-equivalent hours to do a good job, including
decent documentation), I’m happy to pass on the Python code, the Ruby code,
and give advice and so on.

To help you evaluate this, and also as a potential source of ideas in case
you do something else, I’ve appended my (probably out of date) intro text to
the library at the bottom of this reply.

Cheers,
Ken

Ari B. wrote:

Text from the Python library (in retrospect, I would do quite a bit
differently):

Overview
========

'rex' provides regular expression and parsing facilities. It uses (and is
intended to functionally replace) the Python 're' module.

Regular expression functionality is provided through the '_Rexp' and
'MatchResult' classes, and the CHAR, REP0, REP1, OPT, PAT, and ALT constructs.
These constructs can be used as or provide functions to create rexps, and
also define attributes for commonly used rexps. (For example, PAT.float
provides a rexp which matches basic floating-point (no exponent) numbers.)

Pattern-Matching Example
----------------------

If you are familiar with regular expressions, the following will probably
make at least some sense. If you are not, skip this example for now. In
either case, come back to it once you have read the formal definitions of
functions and constructs provided by rex.

    COMPLEX = PAT.float['re']       + \
              REP0.whitespace       + \
              ALT("+", "-")['op']   + \
              REP0.whitespace       + \
              PAT.float['im']       + \
              'i'

The above example defines a pattern which will match complex numbers of the
form "-2.718 + 3.14i", for example. It uses the predefined match expressions
PAT.float and REP0.whitespace to ease the definition. Applied to the example
complex number string, the result will contain three named substrings: 're'
will map to "-2.718", 'op' will map to "+", and 'im' will map to "3.14".

SEQ is an alternative form of joining rexps; the above is equivalent to:

    COMPLEX = SEQ(
        PAT.float['re'],
        REP0.whitespace,
        ALT("+", "-")['op'],
        REP0.whitespace,
        PAT.float['im'],
        'i'
    )
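For comparison, here is a rough Ruby counterpart to the COMPLEX example using only the built-in Regexp class, not the rex library. The FLOAT and SPACES definitions are assumptions about what PAT.float and REP0.whitespace match; the group names are the ones used above.

```ruby
# Plain Ruby Regexp: small named pieces interpolated into a larger pattern.
FLOAT   = /-?\d+(?:\.\d+)?/   # basic float, no exponent (assumed meaning of PAT.float)
SPACES  = /\s*/               # stands in for REP0.whitespace
COMPLEX = /(?<re>#{FLOAT})#{SPACES}(?<op>[+-])#{SPACES}(?<im>#{FLOAT})i/

m = COMPLEX.match("-2.718 + 3.14i")
puts m[:re], m[:op], m[:im]
```

Interpolating a Regexp object inserts it as a self-contained group, which is what makes building the larger pattern from the smaller ones safe.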

Regular Expressions
---------------

This is an introduction to using the pattern-matching
(regular-expression-related) part of rex. See documentation associated
with a specific method/function/name for details on that entity.

In the following, we use the abbreviation RE to refer to standard regular
expressions defined as strings, and the word 'rexp' to refer to rex objects
which denote regular expressions.

The starting point for building a rexp is either rex.PAT, which we'll just
call PAT, or rex.CHAR, which we'll just call CHAR, or rex.LIT.
CHAR provides rexps defining a set of characters, and which will match a
single-character string if that character is in the given set. In addition
to providing attributes which provide prebuilt character sets, the CHAR
function may be used to define your own character sets.

LIT builds rexps which match strings of varying lengths.

REP0 and REP1 denote zero or more and one or more repetitions, respectively.

Also:


    - PAT._someattribute_ returns (for defined attributes) a corresponding
      rexp. For example, PAT.stringstart returns a rexp matching at the
      start of a string.

    - CHAR(a1, a2, ...) returns a rexp matching a single character from a
      set of characters defined by its arguments. For example,
      CHAR("-", ["0","9"], ".") matches the characters necessary to build
      basic floating point numbers. See CHAR docs for details.

    - CHAR._someattribute_ returns (for defined attributes) a corresponding
      rexp defining a set of characters. For example, CHAR.digit returns a
      rexp matching a single digit.

Now assume that X, Y, Z, ... are rexps. The following Python expressions
(not strings) may be used to build more complex rexps:

    - X | Y | Z ... : returns a rexp which matches a string if any of the
      operands match that string. Similar to "X|Y|Z" in normal REs, except
      of course you can't use Python code to define a normal RE.

    - X + Y + Z ... : returns a rexp which matches a string if all of X, Y,
      Z match consecutive substrings of the string in succession. Like
      "XYZ" in normal REs.

    - X*n : returns a rexp which matches a number of times as defined by n.
      This replaces '?', '+', and '*' as used in normal REs. See docs for
      details. 'rex' defines constants which allow you to say X*REP0,
      X*REP1, or X*MAYBE, indicating (0 or more matches), (1 or more
      matches), or (0 or 1 matches), respectively.

    - X**n : Like X*n, but does nongreedy matching.

    - +X : positive lookahead assertion: matches if X matches, but doesn't
      consume any of the input.

    - ~+X : negative lookahead assertion: matches if X _doesn't_ match,
      but doesn't consume any of the input.

    - -X, ~-X : positive and negative lookback assertions. Like lookahead
      assertions, but in the other direction.

    - X[name] : name must be a string: anything matched by X can be
      referred to by the given name in the match result object. (This is
      the equivalent of named groups in the re module.)

    - X.group() : X will be in an unnamed group, referable by number.

In addition, a few other operations may be performed:

    - Some of the attributes defined in PAT have "natural inverses"; for
      such attributes, the inverse may be taken. For example, ~PAT.digit is
      a pattern matching any character except a digit.

    - Character classes may be inverted: ~CHAR("aeiouAEIOU") returns a
      pattern matching any character except a vowel.

    - 'ALT' gives a different way to denote alternation: ALT(X, Y, Z,…)
      does the same thing as X | Y | Z | ..., except that none of the
      arguments to ALT need be rexps; any which are normal strings will be
      converted to a rexp using PAT.

    - 'SEQ' can take multiple arguments: PAT(X, Y, Z,...), which gives the
      same result as PAT(X) + PAT(Y) + PAT(Z) + ... .

Finally, a very convenient shortcut is that only the first object in a
sequence of operator/method calls needs to be a rexp; all others will be
automatically converted as if LIT(…) had been called on them. For example,
the sequence X | "hello" is the same as X | LIT("hello").
Ari B. wrote:

            :string => ["com", "net", "org", "edu"]

)

case line
when a
# …

My idea is to make it logical and human readable. Ruby is a language
for humans and UberBeings, and I think this should reflect Ruby’s ideas.

Reflecting on my own experience, I’d suggest a less verbose notation,
and one that uses Ruby idioms more. For example:

letters = CharClass.new('a'..'z').case_insensitive
a = letters + "@" + letters + "." + (Literal.new("com") | "net" | "org" | "edu")

It’s not at all difficult to do this with Ruby. Strings can be used for
literals and character classes, and ranges are perfect for use as char
ranges in character classes.

Also, the ability to safely combine regular expressions (as shown above,
where “letters” is used in “a”) is paramount in making this sort of wrapper
really useful.
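A minimal sketch of how the notation above could be made to run. CharClass and Literal are the names from the example; the Piece base class is hypothetical, and treating a character class as "one or more characters" is an assumption.

```ruby
# Hypothetical base class: holds regex source text and combines safely.
class Piece
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # Strings on the right-hand side are escaped and treated as literals.
  def self.wrap(obj)
    obj.is_a?(Piece) ? obj : Literal.new(obj)
  end

  def +(other)
    Piece.new("(?:#{source})(?:#{Piece.wrap(other).source})")
  end

  def |(other)
    Piece.new("(?:#{source})|(?:#{Piece.wrap(other).source})")
  end

  def to_regexp
    Regexp.new(source)
  end
end

class Literal < Piece
  def initialize(text)
    super(Regexp.escape(text))
  end
end

class CharClass < Piece
  # A Range such as 'a'..'z' becomes [a-z]+ (one or more: an assumption).
  def initialize(range)
    super("[#{range.first}-#{range.last}]+")
  end

  # Embed the flag locally so it survives composition.
  def case_insensitive
    Piece.new("(?i:#{source})")
  end
end

letters = CharClass.new('a'..'z').case_insensitive
a = letters + "@" + letters + "." + (Literal.new("com") | "net" | "org" | "edu")
p a.to_regexp.match("Ari@example.com")[0]
```

Note that "letters" is reused twice without any risk of its flag or its brackets interfering with the surrounding pattern.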

Also, was your library a wrapper for underlying Perl RegExp? or was it
the whole RegExp engine?

It was in Python; instances of my ‘rex’ class simply construct and use
Python patterns, and their associated functions, internally and invisibly
to the user.

Ken

On 07.08.2007 05:10, Ari B. wrote:

            :string => ["com", "net", "org", "edu"]

)

You cannot do this because Hashes are unordered so you lose the
original order. Also [a-z] is only valid if you define local variables
a and z.
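A short illustration of the problem, using made-up values. Ruby 1.9 and later do keep Hash insertion order, but a Hash still cannot hold the same key twice, so the repeated :letters and :string entries would collapse either way; an Array of pairs is one possible alternative.

```ruby
# Pairs in the order the LeetExp call lists them (values are placeholders).
pairs = [[:letters, "a-z"], [:string, "@"], [:letters, "a-z"], [:string, "."]]

# Converted to a Hash, repeated keys collapse: only the last value per key survives.
as_hash = pairs.to_h
p as_hash

# An Array of pairs keeps every element and its position.
pairs.each { |kind, value| puts "#{kind}: #{value}" }
```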

Personally I find regular expressions pretty readable - at least if they
are crafted properly. :slight_smile: See also below.

case line
when a
# …

My idea is to make it logical and human readable. Ruby is a language for
humans and UberBeings, and I think this should reflect Ruby’s ideas.

Do you know the /x modifier? That can go a long way to make a regular
expression readable. For example:

input = <<TEXT
adjasdkajda dadkajd [email protected] adklskkdaldjskj
[email protected] adkjasdjk
blah@org akjsd askdl asd [email protected] hello
asdj
TEXT

input.scan %r{
  \b                   # word boundary
  (?i:[a-z]+)          # user name
  @                    # the famous "at" sign
  (?i:[a-z]+)          # host name
  \.                   # a literal dot
  (?:com|net|org|edu)  # only some of the TLDs
  \b                   # word boundary
}x do |match|
  puts "Found email address #{match}"
end

Kind regards

robert

I’m moderately serious. This is going to be one of those projects
that won’t see the light of day for maybe 6 months to a year.
This looks largely like what I was hoping to make, although in Ruby I had
envisioned this:

# matching email addresses (sample):
a = LeetExp.new(:letters => [[a-z], :insensitive],
                :string => "@",
                :letters => [[a-z], :insensitive],
                :string => ".",
                :string => ["com", "net", "org", "edu"]
)

case line
when a

My idea is to make it logical and human readable. Ruby is a language
for humans and UberBeings, and I think this should reflect Ruby’s ideas.

Also, was your library a wrapper for underlying Perl RegExp? or was it
the whole RegExp engine?

Thanks,
Ari

On Aug 6, 2007, at 10:51 PM, Kenneth McDonald wrote:

on the Python code, the Ruby code, and give advice and so on.

To help you evaluate this, and also as a potential source of ideas
in case you do something else, I’ve
appended my (probably out of date) intro text to the library at the
bottom of this reply.

Cheers,
Ken

--------------------------------------------|
If you’re not living on the edge,
then you’re just wasting space.

Hi –

On Tue, 7 Aug 2007, Ari B. wrote:

    :string => ["com", "net", "org", "edu"]

)

case line
when a

My idea is to make it logical and human readable. Ruby is a language for
humans and UberBeings, and I think this should reflect Ruby’s ideas.

Regular expressions are nothing if not logical :slight_smile: And readability,
as always, is largely in the eye of the beholder. I think the quest
for an alternative notation is fine, but there’s nothing inherently
un-Ruby-like about what’s there already. Then again, I’m in a small
minority who find /x with a lot of extra whitespace a serious
impediment to understanding a pattern :slight_smile:

Anyway – somewhere out there, though I haven’t been able to find it,
is a library called Regexp::English by Florian G., which provides a
kind of English-language wrapper for regexes. I don’t know whether
it’s still in development and/or at a point of usability.

David

Ari B. wrote:

Is there an alternate RegExp “language” to the current one in Ruby
and Perl?

Snobol4 patterns are now available as a Python library. It should be
possible to port it to Ruby. I don’t think that the implementation is
complete, because I didn’t see the possibility of recursive pattern
definitions, which give Snobol4 its extreme power.
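As a point of comparison only (this is Ruby's built-in regex engine in current Ruby versions, not Snobol4 or the Python library mentioned), a pattern can refer to itself by name with a subexpression call. A small sketch that matches balanced parentheses:

```ruby
# The group named "paren" calls itself via \g<paren> to handle nesting.
BALANCED = /(?<paren>\((?:[^()]|\g<paren>)*\))/

p "f(a(b)c) d"[BALANCED]
```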

More information:

http://permalink.gmane.org/gmane.comp.python.announce/7217 (Snobol4 in
Python)

SNOBOL - Wikipedia (has some links)

Wolfgang Nádasi-Donner

On Aug 6, 10:10 pm, Ari B. [email protected] wrote:

                            :string => ["com", "net", "org", "edu"]

the whole RegExp engine?

How serious are you about this? Several years ago I wrote a Python
in case you do something else, I’ve
appended my (probably out of date) intro text to the library at the
bottom of this reply.

Cheers,
Ken


Ari,

There have been other responses to this already, but I thought I’d
give you something else to look at:

I second (or third, or whatever) the contention that regular
expressions are pretty readable on their own (given some knowledge of
the syntax and good formatting). The thing to keep in mind is that
they’re a language of their own. Once you learn the language, you
find you can use it in many a programming language (though there are
some dialect differences here and there).

Ari,

Do it!
Excellent project, even if it fails in the long run, or if you pass
it off to somebody else.

I like the Rails-like hash-looking idea. Of course you would need
some ordering, so it would need to be some kind of array or struct,
but it is an idea worth toying with.

I was actually 2 unread emails away from writing the list, thanking
everyone for their help, and saying that I would only write the wrapper if
someone really wanted me to.

Looks like I’m writing it.

-Ari

On Aug 7, 2007, at 11:15 AM, John J. wrote:

Ari,

Do it!
Excellent project, even if it fails in the long run, or if you pass
it off to somebody else.

I like the Rails-like hash-looking idea. Of course you would need
some ordering, so it would need to be some kind of array or struct,
but it is an idea worth toying with.

Ari
-------------------------------------------|
Nietzsche is my copilot

Speaking as someone who has actually written and used (in Python) a more
abstract regex library, the biggest problem with regular expressions in most
languages isn’t the syntax, but rather the inability to easily compose small
REs into larger REs. Which is why so many programs end up with huge,
unreadable REs. As a small example, it’s really nice (and obvious) to be
able to say

re3 = re1 + re2

instead of

re3 = "(?:#{re1})(?:#{re2})"

And the advantages go well beyond the convenience illustrated in the
above example…
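For what it's worth, Ruby's built-in Regexp already offers some composition, though without a + operator. A small sketch with made-up patterns: interpolating Regexp objects into a regex literal embeds each one as a group carrying its own flags, and Regexp.union builds an alternation.

```ruby
re1 = /[a-z]+/i   # case-insensitive word
re2 = /\d+/       # digits

# Interpolating Regexp objects embeds each with its own flags, e.g. (?i-mx:[a-z]+).
re3 = /#{re1}#{re2}/

# Regexp.union builds an alternation from existing patterns.
either = Regexp.union(re1, re2)

p re3.match("ABC123")[0]
p "x 42".scan(either)
```

This covers concatenation and alternation; it does not give the named, reusable building blocks that a wrapper library would.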

Also, I think that people who are accustomed to regular expressions (or
any other DSL) tend to forget about the problems with that DSL: the need for
newcomers to learn another syntax, the inability to use standard language
tools with the DSL, and so on.

So, though I’ve used REs for years, I certainly don’t agree with the
contention that “REs are actually pretty good”. RE syntax in RE languages is
optimized for quickly entering one-time REs on the command line, not for
building robust REs that can be easily maintained by other programmers. It’s
the difference between weird, Perl-style variables, and meaningful variable
names. A good abstract wrapper in Ruby would be very useful.

Ken

On 8/7/07, [email protected] [email protected] wrote:
[snip]

Anyway – somewhere out there, though I haven’t been able to find it,
is a library called Regexp::English by Florian G., which provides a
kind of English-language wrapper for regexes. I don’t know whether
it’s still in development and/or at a point of usability.

A long time ago I wrote a regexp engine 96% compatible with Ruby’s,
as it was at that point in time. Maybe it’s useful to somebody?
http://raa.ruby-lang.org/project/regexp/

Kenneth McDonald wrote:

Speaking as someone who has actually written and used (in Python) a more
abstract regex library, the biggest problem with regular expressions in most
languages isn’t the syntax, but rather the inability to easily compose small
REs into larger REs. Which is why so many programs end up with huge,
unreadable REs. As a small example, it’s really nice (and obvious) to be
able to say

re3 = re1 + re2

I agree with this and that’s why I have the following add-on in my
standard lib:

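class Regexp
  # Concatenate this pattern with another Regexp or with a literal string.
  def +(other)
    if other.is_a?(Regexp)
      if self.options == other.options
        # Same flags: join the raw sources and keep the shared options.
        Regexp.new(source + other.source, options)
      else
        # Different flags: other.to_s embeds its own flags, e.g. (?i-mx:...).
        Regexp.new(source + other.to_s, options)
      end
    else
      # Anything else is treated as literal text and escaped.
      Regexp.new(source + Regexp.escape(other.to_s), options)
    end
  end
end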

It could easily be improved so that, for example, a range would get
appended as a character class, etc.
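One way that improvement might look, written as standalone helper methods rather than as a change to Regexp#+ itself. The names regexp_source_for and regexp_concat are hypothetical, and the Range handling assumes single-character endpoints.

```ruby
# Hypothetical helper: turn one operand into regex source text.
def regexp_source_for(part)
  case part
  when Regexp then part.to_s   # keeps the operand's own flags, e.g. (?i-mx:...)
  when Range  then "[#{Regexp.escape(part.first.to_s)}-#{Regexp.escape(part.last.to_s)}]"
  else             Regexp.escape(part.to_s)
  end
end

# Concatenate any mix of Regexps, Ranges and Strings into one Regexp.
def regexp_concat(*parts)
  Regexp.new(parts.map { |part| regexp_source_for(part) }.join)
end

rx = regexp_concat(/\d+/, '-', 'a'..'f')
p rx.match("42-c")[0]
```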

Daniel

Thanks! I was actually thinking about this myself.

Please people, send an email if you want to see something in a Ruby
RegExp wrapper. Don’t be shy. If I can get drafted into making this,
then you can tell me what you want to see.

Thanks,
Ari

On Aug 7, 2007, at 12:55 PM, Kenneth McDonald wrote:

to learn another syntax,
variable names. A good abstract wrapper in Ruby would be very useful.

Ken

Yossef M. wrote:


Ari B. wrote:

            :string => ["com", "net", "org", "edu"]

)

Another way to do something like this is to use a “compound” regular
expression where each part continues where the last one ended.
Something like @.* where $1 would be everything up to the @ sign,
i.e. the name. $2 would be everything between the @ and the ., i.e. the
ISP name. And $3 would be the remainder of the address. The only time
something like this would fail would be an address like mine, where the
ISP name is worldnet.att rather than just worldnet.
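The pattern itself did not survive intact in the post above, so the following is only one guess at a three-group expression that behaves as described, including the failure on a two-part ISP name. The ADDRESS constant and the sample address are made up for illustration.

```ruby
# One guess at the idea: split on "@" and on the first dot after it.
ADDRESS = /\A([^@]+)@([^.]+)\.(.+)\z/

m = ADDRESS.match("someone@worldnet.att.net")
name, isp, rest = m.captures
puts "name: #{name}"
puts "isp:  #{isp}"    # "worldnet", not "worldnet.att"
puts "rest: #{rest}"
```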

2007/8/7, Ari B. [email protected]:

Please people, send an email if you want to see something in a Ruby
RegExp wrapper. Don’t be shy. If i can get drafted into making this,
then you can tell me what you want to see.

Did you think about something like this (attached)? This is just a
raw hack to illustrate a possible way to do it.

Kind regards

robert