Ari,
How serious are you about this? Several years ago I wrote a Python
library that treats Python regular
expressions as semantic, not syntactic, objects, and that has been
incredibly useful to me. I’ve started
to port it to Ruby, but simply don’t have the time. If you do (you’re
probably looking at a couple of
weeks of full-time-equivalent hours to do a good job, including decent
documentation), I’m happy to pass
on the Python code and the Ruby code, and to give advice and so on.
To help you evaluate this, and also as a potential source of ideas in
case you do something else, I’ve
appended my (probably out of date) intro text to the library at the
bottom of this reply.
Cheers,
Ken
Ari B. wrote:
Text from the Python library (in retrospect, I would do quite a bit
differently):
Overview
========
'rex' provides regular expression and parsing facilities. It uses
(and is intended to functionally
replace) the Python ‘re’ module.
Regular expression functionality is provided through the '_Rexp' and
'MatchResult' classes,
and the CHAR, REP0, REP1, OPT, PAT, and ALT constructs.
These constructs can be used as functions, or provide functions, to
create rexps, and they also define
attributes for commonly used rexps. (For example, PAT.float provides
a rexp
which matches basic floating-point numbers with no exponent.)
Pattern-Matching Example
------------------------
If you are familiar with regular expressions, the following will
probably make at
least some sense. If you are not, skip this example for now. In
either case, come
back to it once you have read the formal definitions of
functions and
constructs provided by rex.
    COMPLEX = PAT.float['re'] + \
              REP0.whitespace + \
              ALT("+", "-")['op'] + \
              REP0.whitespace + \
              PAT.float['im'] + \
              'i'
The above example defines a pattern which will match complex
numbers of the form "-2.718 + 3.14i". It uses the
predefined
match expressions PAT.float and REP0.whitespace to
ease the definition. Applied to the example complex number string,
the result will contain three
named substrings: 're' will map to "-2.718", 'op' will map to "+",
and 'im' will map to "3.14".
SEQ is an alternative form of joining rexps; the above is equivalent
to:
    COMPLEX = SEQ(
        PAT.float['re'],
        REP0.whitespace,
        ALT("+", "-")['op'],
        REP0.whitespace,
        PAT.float['im'],
        'i'
    )
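For comparison, here is roughly what the COMPLEX pattern looks like when
written directly against the standard 're' module. This is a sketch, not
rex output: the FLOAT sub-pattern is an assumption about what PAT.float
matches (an optionally signed decimal with no exponent), and
parse_complex is a hypothetical helper name.

```python
import re

# Assumed stand-in for PAT.float: optional sign, digits, optional fraction.
FLOAT = r"[-+]?\d+(?:\.\d+)?"

# Named groups play the role of ['re'], ['op'], and ['im'] in the rex version.
COMPLEX_RE = re.compile(
    rf"(?P<re>{FLOAT})\s*(?P<op>[+-])\s*(?P<im>{FLOAT})i"
)


def parse_complex(text):
    """Return the named substrings as a dict, or None if there is no match."""
    m = COMPLEX_RE.fullmatch(text)
    return m.groupdict() if m else None
```

Calling parse_complex("-2.718 + 3.14i") returns a dict with the keys
're', 'op', and 'im'. The point of the comparison is readability: the
string form packs everything into one line of punctuation, while the rex
form names each piece.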
Regular Expressions
-------------------
This is an introduction to using the pattern-matching
(regular-expression-related)
part of rex. See documentation associated
with a specific method/function/name for details on that entity.
In the following, we use the abbreviation RE to refer to standard
regular
expressions defined as strings, and the word ‘rexp’ to refer to rex
objects
which denote regular expressions.
The starting point for building a rexp is either rex.PAT,
which we'll just call PAT, or rex.CHAR, which we'll just call CHAR,
or rex.LIT.
CHAR provides rexps which define a set of characters and which
match a single-character string if that character is in the
given
set. In addition to providing attributes for prebuilt
character
sets, the CHAR function may be used to define your own character
sets.
LIT builds rexps which match strings of varying lengths.
REP0 and REP1 denote zero or more and one or more repetitions,
respectively.
Also:
- PAT._someattribute_ returns (for defined attributes) a
corresponding rexp.
For example, PAT.stringstart returns a rexp matching at the
start of a string.
- CHAR(a1, a2, . . .) returns a rexp matching a single character
from a set
of characters defined by its arguments. For example,
CHAR("-", ["0","9"], ".")
matches the characters necessary to build basic floating point
numbers.
See CHAR docs for details.
- CHAR._someattribute_ returns (for defined attributes) a
corresponding rexp
defining a set of characters.
For example, CHAR.digit returns a rexp matching a single
digit.
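To make the CHAR description concrete, the sketch below builds an
ordinary 're' character class from the same kind of arguments: single
characters, strings of characters, and two-element [low, high] ranges.
The function name char_class and its negate flag are hypothetical; this
illustrates the idea and is not how rex implements CHAR.

```python
import re


def char_class(*items, negate=False):
    """Build RE source for a character class from characters and ranges."""
    parts = []
    for item in items:
        if isinstance(item, (list, tuple)):
            # A two-element sequence denotes an inclusive range, e.g. ["0", "9"].
            low, high = item
            parts.append(f"{re.escape(low)}-{re.escape(high)}")
        else:
            # A string contributes each of its characters individually.
            parts.extend(re.escape(ch) for ch in item)
    prefix = "^" if negate else ""
    return f"[{prefix}{''.join(parts)}]"


# Characters needed for basic floating point numbers, as in the CHAR example.
FLOAT_CHARS = char_class("-", ["0", "9"], ".")
```

Every character is passed through re.escape, so "-" and "." are taken
literally inside the class rather than as a range marker or wildcard.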
Now assume that X, Y, Z,... are rexps. The following Python
expressions
(not strings) may be used to build more complex rexps:
- X | Y | Z . . . : returns a rexp which matches a string if any of
the operands
match that string. Similar to "X|Y|Z" in normal REs, except
of course you can't
use Python code to define a normal RE.
- X + Y + Z ...: returns a rexp which matches a string if all of X,
Y, Z match consecutive
substrings of the string in succession. Like "XYZ" in normal
REs.
- X*n : returns a rexp which matches a number of times as defined
by n.
This replaces '?', '+', and '*' as used in normal REs. See
docs for details.
'rex' defines constants which allow you to say X*REP0,
X*REP1, or X*MAYBE,
indicating (0 or more matches), (1 or more matches), or (0 or 1
matches),
respectively.
- X**n : Like X*n, but does nongreedy matching.
- +X : positive lookahead assertion: matches if X matches, but doesn't
consume any of the input.
- ~+X : negative lookahead assertion: matches if X _doesn't_ match,
but doesn't consume any of the input.
- -X, ~-X : positive and negative lookback assertions. Like
lookahead assertions,
but in the other direction.
- X[name] : name must be a string: any substring matched by X can be
referred
to by the given name in the match result object. (This is
the equivalent
of named groups in the re module).
- X.group() : X will be in an unnamed group, referable by
number.
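The operator forms above map naturally onto Python's special methods.
The following is a minimal sketch of that idea, assuming a toy class
(MiniRexp, a hypothetical name) that simply wraps RE source text. It
covers only |, +, *, **, unary + and [name]; the REP0, REP1, and MAYBE
values here are plain strings chosen for illustration, and none of this
is taken from the actual rex code.

```python
import re

# Hypothetical repetition markers; how rex represents these is not shown here.
REP0, REP1, MAYBE = "*", "+", "?"


class MiniRexp:
    """Toy stand-in for a rexp: wraps the source text of an ordinary RE."""

    def __init__(self, source):
        self.source = source

    @staticmethod
    def _coerce(other):
        # Plain strings are taken literally, so escape RE metacharacters.
        if isinstance(other, MiniRexp):
            return other
        return MiniRexp(re.escape(other))

    def _grouped(self):
        # A non-capturing group keeps precedence intact when composing.
        return f"(?:{self.source})"

    def __or__(self, other):        # X | Y
        return MiniRexp(f"{self._grouped()}|{self._coerce(other)._grouped()}")

    def __add__(self, other):       # X + Y
        return MiniRexp(self._grouped() + self._coerce(other)._grouped())

    def __mul__(self, marker):      # X * REP0 (greedy repetition)
        return MiniRexp(self._grouped() + marker)

    def __pow__(self, marker):      # X ** REP0 (non-greedy repetition)
        return MiniRexp(self._grouped() + marker + "?")

    def __pos__(self):              # +X (positive lookahead)
        return MiniRexp(f"(?={self.source})")

    def __getitem__(self, name):    # X["name"] (named group)
        return MiniRexp(f"(?P<{name}>{self.source})")

    def fullmatch(self, text):
        return re.fullmatch(self.source, text)


digit = MiniRexp(r"\d")
number = digit * REP1
expr = number["lhs"] + (MiniRexp(r"\+") | "-")["op"] + number["rhs"]
```

Note that only the left-hand operand has to be a MiniRexp; the "-" on
the right of | is converted automatically, which mirrors the shortcut
rex offers for plain strings.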
In addition, a few other operations may be performed:
- Some of the attributes defined in PAT have "natural inverses";
for such
attributes, the inverse may be taken. For example,
~PAT.digit is
a pattern matching any character except a digit.
- Character classes may be inverted: ~CHAR("aeiouAEIOU") returns
a pattern
matching any character except a vowel.
- 'ALT' gives a different way to denote alternation: ALT(X, Y,
Z,…) does
the same thing as X | Y | Z | . . ., except that none of the
arguments
to ALT need be rexps; any which are normal strings will be
converted
to a rexp using PAT.
- 'SEQ' can take multiple arguments: SEQ(X, Y, Z,...), which
gives the same
result as PAT(X) + PAT(Y) + PAT(Z) + . . . .
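The function forms can be sketched in the same spirit. The example
below assumes a hypothetical marker type, Source, to tell strings that
are already RE source apart from plain literal strings; alt and seq are
lowercase stand-ins written for this illustration, not the rex ALT and
SEQ themselves.

```python
import re


class Source(str):
    """Hypothetical marker type: a string that is already RE source."""


def _as_source(part):
    # Plain strings are literals and get escaped; Source values pass through.
    text = part if isinstance(part, Source) else re.escape(part)
    return f"(?:{text})"


def alt(*parts):
    """Match any one of the parts, like X | Y | Z."""
    return Source("|".join(_as_source(p) for p in parts))


def seq(*parts):
    """Match the parts one after another, like X + Y + Z."""
    return Source("".join(_as_source(p) for p in parts))


# Two integers joined by a literal "+" or "-".
SUM = seq(Source(r"\d+"), alt("+", "-"), Source(r"\d+"))
```

Because plain strings are escaped on the way in, alt("+", "-") treats
"+" as a literal plus sign rather than as a repetition operator.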
Finally, a very convenient shortcut is that only the first object in
a sequence of
operator/method calls needs to be a rexp; all others will be
automatically
converted as if LIT(…) had been called on them. For example, the
sequence X | "hello" is the same as X | LIT("hello").