Regular Expressions (
The translator is written in the scheme language, and uses scheme data types for the regexp descriptions. Each one is a list of elements. A list is a pair of parenthesis with space-separated elements inside. The relevant elements are strings ("double quoted"), characters (#\A, with space written #\space. If you don't like this syntax, just use strings of one character, like "A"), keywords (barewords), and complex keywords ((keyword arguments)). Recognized symbols, both stand-alone and the complex keywords made from a sub-list of keyword and arguments, are described below. But first, a few examples to let people get the hang of it.
The translator can print compiled regular expressions with the appropriate characters escaped with backslashes for use in softcode (As an argument to regmatch() or another regexp function), or $-commands, or without any extra escapes. Most of the examples below are meant to be $-commands, but for clearness, only normal escaping is done. To use as a $-command, put a backslash before any ;'s in the regexps.
If you're here because of some other reason than mushes, don't worry about the escaping options. Just keep the default 'normal' radio button selected. Hopefully the examples should still be followable.
There is also an optimizer that transforms the regular expressions so they will match faster. In the examples, I assume this is turned off. (And need to re-write one of the examples, from before I made it possible to turn off the optimizer. Bah)
The first example is a regexp that makes
lowercase-who an alias for a +who command. The long
description of it looks like (start (maybe #\+) "who"
end). The keyword start means to match
at the very beginning at the string, and end at
the end. When a regexp is bounded by them, only an exact
match of the entire string being compared against the regexp
can succeed; otherwise, the pattern can match just a portion
of the string. For $-commands, this isn't normally
desired. The complex (maybe
argument) sub-list will either match, or
not be present at all. #\+ is the +
character. Since + is a special character in regexps, it is
escaped with a backslash in the compiled form. The string
"who" matches that exact text. The whole thing
compiles into: ^\+?who$.
For the next example, consider the command
used to list a game's admins. Sometimes it's +admin, or
+admins, or +staff, or +wizards, or something else. Rather
than making newcomers to a game guess which one it is, it's
easy to make one command that matches all of them. The
description for this one is (start (or "+admin"
"+admins" "+staff" "+wizards") end). The new
(or arguments) complex keyword will
match any one of its arguments. The whole thing is compiled
into
^(?:\+admin|\+admins|\+staff|\+wizards)$. However,
this regexp can be improved. First, since every argument to
the or starts with a +, that can be moved outside it, so it
only has to be looked at once, instead of possibly 4 times
(The compiler is smart enough to do this automatically for
simple cases, if optimization is turned on). Re-written to
make this prefix explicit rather than relying on the
optimizer (And incidently allowing another optimization),
the description is now (start #\+ (or "admin" "admins"
"staff" "wizards") end), which compiles to:
^\+(?:admin|admins|staff|wizards)$. Second, the
admin and admins parts are only different because of the
trailing s. "admins" would have to be compared against
"admin", fail, and then match "admins". That can be
simplified with the maybe complex keyword from the first
example: (start #\+ (or (cat "admin" (maybe #\s))
"staff" "wizard) end). The new (cat
arguments) complex keyword simply combines
all its arguments together into one expression for things
like alternation. The final compiled regexp is now
^\+(?:admins?|staff|wizards)$.
NOTE on the (?:) construct in the compiled regexps if you know regexps but are unfamilar with it. This is a perl extension that groups the sub-regexp inside it together so it's treated as one token, without capturing the text it matches. If you don't understand capturing, I'll mention that later.
Going back to the first example, let's
consider the case when you want your +who to take an
optional prefix argument to show only players whose name
start with that argument. This, and the no-argument form can
all be handled by one regular expression, like
so. (start (maybe #\+) "who" (maybe spaces (capture
lots)) end). This introduces two new simple keywords:
spaces, which matches one or more whitespace
characters (Space and tab. There's also space,
which matches any single whitespace character), and
lots, which matches at least one character and
up to as many as possible given the rest of the
pattern. Regular expressions in general try to match the
longest possible part of what they're matched against. The
new complex keyword, (capture
arguments), will save the text it matches
for use later. In the $-command, this text, the prefix
argument, will be in %1. If just a simple +who was given, %1
will be empty. In perl, it would in the variable $1. The
regexp compiles to ^\+who(?:\s+(.+))?$
Time for an example that doesn't deal with
$-command patterns. Here's one that tries to check to see if
a string looks like a simple number (And yes, I know in
mushes you can just use isnum() and isint()). (start
(maybe (one-of #\+ #\-)) (zero-or-more digit) (maybe ".")
digits end). The new keyword one-of will match a
single character from its arguments. zero-or-more will match
as many repetitions of its arguments as possible in a row,
down to possibly none at all. So, broken down, this will
look for strings that start with an optional + or - sign,
possibly followed by some digits, possibly a decimal point, and
then some more digits. The string "123" will match, as will
"1.23" or "+.23", but not "123 and something". The regexp is
compiled down to ^[+\-]?\d*\.?\d+$
Characters are written like so: #\X, where X is the character in question. For example, 'A' is written #\A, 'a' #\a. Space is written as #\space. If you don't like this style of characters, you can usually just use a string instead. "A", "a", " ". Characters are written out into the compiled regexp, escaped with a backslash if they have to because they have a special meaning in regular expressions (Metacharacters).
Strings are any text inclosed in double-quotes. Literal double-quotes can be escaped with a backslash ("a \"quoted\" word"). Strings are written out into the compiled regexp, with any metacharacters in them escaped.
| Keyword | Meaning | Compiled to |
|---|---|---|
| start | Anchors to the start of the string | ^ |
| end | Anchors to the end of the string | $ |
| digit | Matches any single digit | \d |
| digits | Matches one or more digits | \d+ |
| any | Matches any single character | . |
| lots | Matches one or more characters | .+ |
| space | Matches any single whitespace character | \s |
| spaces | Matches one or more whitespace characters | \s+ |
| letter | Matches any single alphanumeric character, or _ | \w |
| letters | Matches one or more alphanumeric or _ | \w+ |
| alpha | Matches any single alphabetic character | [[:alpha:]] |
| alphanumeric | Matches any single alphanumeric character. Unlike the letter keyword, it will not match an underscore. | [[:alnum:]] |
| lower-case | Matches any single lowercase letter | [[:lower:]] |
| upper-case | Matches any single uppercase letter | [[:upper:]] |
| non-space | Matches any single non-whitespace character | \S |
| non-digit | Matches any single non-digit character | \D |
| non-letter | Matches any single non-alphanumeric, non-_ character | \W |
| word-boundary | Asserts that the current location
being looked at in the string to be matched is between a
letter and a
non-letter. | \b |
| not-a-word-boundary | Assorts that the current
location being looked at is either between two
letters or two
non-letter's. | \B |
| Keyword | Meaning | Compiled to |
|---|---|---|
| (group expression ...) | Makes its arguments act like they're one for the repitition keywords described below. This usually isn't needed, because the compiler will add them where appropriate. | (?:expressions) |
| (capture expression ...) | Like
group, this makes its arguments act like one
for reptition, but also saves a copy of the text they
match. This saved text can be used with
match-captured, described below, or, on mushes,
after the match in a stack or register
variable. | (expressions) |
| (cat expression ...) | This one just makes its arguments look like one to the compiler, and not to the generated regexp. It just concatenates its arguments together. | expressions |
These complex keywords will match one of a number of choices.
| Keyword | Meaning | Compiled to |
|---|---|---|
| (or expressions...) | This will try to the match the first of its expressions that it can, in order from left to right. | (?:expression1|expression2|...) |
| (one-of character set ...) | Matches one character in the set described by its arguments. | [chars] |
| (not-one-of character set ...) | Matches one character NOT in the set described by its arguments. | [^chars] |
Character sets for one-of and
not-one-of are created from the keyword's
arguments. Each argument is either a single character (#\A), a
string of characters ("ABC"), a keyword, or a range of
characters ((#\A . #\Z)). For the latter, any character in that
range will be matched. If you want to match a caret (^), it
should not be the first thing in the set. If you want to match a
dash (-), it should be the last thing in the set.
| Keyword | Meaning | Compiled to |
|---|---|---|
| digit | Any digit character | \d |
| space | Any whitespace character | \s |
| letter | Any alphanumeric character or _ | \w |
| non-digit | Any non-digit character | \D |
| non-space | Any nonwhitespace character | \S |
| non-letter | Anything but alphanumeric characters or _ | \W |
| alpha | Any alphabetic letter | [:alpha:] |
| alphanumeric | Any alphanumeric character | [:alnum:] |
| lower-case | Any lower-case letter | [:lower:] |
| upper-case | Any upper-case letter | [:upper:] |
An example: (one-of
#\A #\B digit (#\X . #\Z) "!.,:") will match the characters A,
B, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, X, Y, Z, !, ., comma, and
:.
These complex keywords will check to see if their arguments are repeated a certain number of times.
| Keyword | Meaning | Compiled to |
|---|---|---|
| (maybe expression ...) | Matches the expressions 1 time, or not at all if that is what's needed to get the longest possible match of the overall regexp. | expressions? |
| (zero-or-more expression ...) | The enclosed arguments are matched as many times as possible, or not at all if that is required to make the overall pattern match. | expressions* |
| (one-or-more expression ...) | The enclosed arguments are matched as many times as possible, and always at least once. | expressions+ |
| (maybe-min expression ...) | Matches the expressions 0 times, or only once, if that is what's needed to get the longest possible match of the overall regexp. | expressions?? |
| (zero-or-more-min expression ...) | The enclosed arguments are matched as few times as possible, or not at all if that is required to make the overall pattern match. | expressions*? |
| (one-or-more-min expression ...) | The enclosed arguments are matched as few times as possible, but always at least once. | expressions+? |
| (n-to-m-times expression n m) | The expression must match at least n times, and at most m times. n and m are both integers. | expression{n,m} |
| (at-least-n-times expression n) | The expression must match at least n times. n is an integer. | expression{n,} |
| (exactly-n-times expression n) | The expression must match n times, no more; no less. n is an integer. | expression{n} |
| Keyword | Meaning | Compiled to |
|---|---|---|
| (raw text) | Inserts text unchanged into the compiled regexp. text must be a string. | text |
| (match-captured n) | Matches the same text as the n-th capture expression did, n starts with 1, and increases from left to right. | \N |
| (check-match expression ...) | Checks to see if its arguments will result in a match, without actually eating up any characters of the string being matched against. Like the start, end, boundry and not-boundry keywords, this is known as an assertation. It checks to see if something is true, and doesn't match a specific character. | (?=expressions) |
| (check-not-match expression ...) | This is very similar to
check-match, but it succeeds if its arguments
do NOT match. | (?!expressions) |
The or expressions are where the compiler does
most of its optimizations. These currently only work when all
the alternatives are either strings or characters. If everything
has a common prefix, that prefix is moved out of the alternation
to cut down on the amount of backtracking needed to match
it. Otherwise, the set of all the possible initial characters is
extracted, and an assertation that one of these characters is
present is made. This will make comparing, say "cheetah" against
(start (or "lion" "tiger" "bear")) fail very
quickly. The set of initial characters is l, t, and b. c, the
first character in "cheetah", isn't one of these, so the
assertation fails and no further checking is needed. When
matching against "leopard", however, the first-character
assertation will succeed, but the match will fail, so it isn't
foolproof. It just helps in obvious mismatches. Eventually the
compiler will be able to handle things besides strings and
characters.
A tarball of the current version (1.0.1) is available here. I develop and test it Chicken Scheme. It should run without much trouble on other modern schemes. At a minimum, it needs to support the R5RS standard and SRFI-0 and SRFI-13. See Schemers.org for more information on the standard, SRFI's, links to scheme implementations, and more.
There are three interfaces: The CGI script, which currently only works with chicken, and is hardwired to depend on the URL of this page, and thus probably isn't very useful to anyone else. There's also a Read-Eval-Print-Loop (REPL) interface that just repeats reading in a regexp description and showing the compiled form over and over. Finally, there are the individual functions for turning a regexp description into a compiled line-noise string, escaping it appropriately, and displaying it nicely.
Roughly in order of priority... Improve documentation. Make the compiler optimizer smarter about not adding groupings where they aren't needed, and implement more optimizations and get smarter about escapes in one-of and not-one-of clauses. An actual regexp matcher in addition to the compiler. A decompiler to go from line-noise to verbose description form.
The regexps the translator/compiler produces are meant for use with perl, or the PCRE (Perl Compatible Regular Expressions) library that PennMUSH and many other things use. Converting the compiler to use a different flavor should be fairly easy, though it'll come at a cost of loosing many of the available features.
Using scheme s-expressions to describe regular expressions is not a new idea, as I found out after writing the first version of the compiler (And here I had thought I was being brilliant...). For an example of another take on it, see Olin Shiver's SRE package. That is an actual regexp matcher, not just a compiler like mine (Though mine could be turned into a full matcher, and I plan to do so someday). It was influenced by different design decisions, and the results end up looking quite different despite sharing the same basic representation. Personally, I like mine better. But I'm biased.