Regexp Translator

Translator

Enter a regular expression description (Note that it must be enclosed in parenthesis!):
Example: (start (zero-or-more digit) (maybe ".") digits end)

Escape characters like:
Normal Softcode $-command
Optimize regular expression?

Introduction

Regular Expressions (regexps) are a very powerful tool for seeing if arbitrary text matches a complex pattern. However, to the uninitiated, they can look like line noise, and can be hard to learn. So, a while back, I hit upon the idea of writing a program that takes a more structured and readable description of a regexp, and compiles it into the line-noise form. The intent is for it to be a tool to help people learn regexps, by comparing the more verbose and understandable version to the resulting line-noise to see what symbols do what, and for people who just need the occasional regexp without wanting to learn everything about them. When I brought this up, people thought it was a good idea. This is the first fruits of that idea. Actually, it's gone a little beyond that, because it has an optimizing stage that tries to make the regexp 'better'. It's not much of an optimizer at the moment, but it's getting there...

The translator is written in the scheme language, and uses scheme data types for the regexp descriptions. Each one is a list of elements. A list is a pair of parenthesis with space-separated elements inside. The relevant elements are strings ("double quoted"), characters (#\A, with space written #\space. If you don't like this syntax, just use strings of one character, like "A"), keywords (barewords), and complex keywords ((keyword arguments)). Recognized symbols, both stand-alone and the complex keywords made from a sub-list of keyword and arguments, are described below. But first, a few examples to let people get the hang of it.

Get it!

Examples

The translator can print compiled regular expressions with the appropriate characters escaped with backslashes for use in softcode (As an argument to regmatch() or another regexp function), or $-commands, or without any extra escapes. Most of the examples below are meant to be $-commands, but for clearness, only normal escaping is done. To use as a $-command, put a backslash before any ;'s in the regexps.

If you're here because of some other reason than mushes, don't worry about the escaping options. Just keep the default 'normal' radio button selected. Hopefully the examples should still be followable.

There is also an optimizer that transforms the regular expressions so they will match faster. In the examples, I assume this is turned off. (And need to re-write one of the examples, from before I made it possible to turn off the optimizer. Bah)

  1. The first example is a regexp that makes lowercase-who an alias for a +who command. The long description of it looks like (start (maybe #\+) "who" end). The keyword start means to match at the very beginning at the string, and end at the end. When a regexp is bounded by them, only an exact match of the entire string being compared against the regexp can succeed; otherwise, the pattern can match just a portion of the string. For $-commands, this isn't normally desired. The complex (maybe argument) sub-list will either match, or not be present at all. #\+ is the + character. Since + is a special character in regexps, it is escaped with a backslash in the compiled form. The string "who" matches that exact text. The whole thing compiles into: ^\+?who$.

  2. For the next example, consider the command used to list a game's admins. Sometimes it's +admin, or +admins, or +staff, or +wizards, or something else. Rather than making newcomers to a game guess which one it is, it's easy to make one command that matches all of them. The description for this one is (start (or "+admin" "+admins" "+staff" "+wizards") end). The new (or arguments) complex keyword will match any one of its arguments. The whole thing is compiled into ^(?:\+admin|\+admins|\+staff|\+wizards)$. However, this regexp can be improved. First, since every argument to the or starts with a +, that can be moved outside it, so it only has to be looked at once, instead of possibly 4 times (The compiler is smart enough to do this automatically for simple cases, if optimization is turned on). Re-written to make this prefix explicit rather than relying on the optimizer (And incidently allowing another optimization), the description is now (start #\+ (or "admin" "admins" "staff" "wizards") end), which compiles to: ^\+(?:admin|admins|staff|wizards)$. Second, the admin and admins parts are only different because of the trailing s. "admins" would have to be compared against "admin", fail, and then match "admins". That can be simplified with the maybe complex keyword from the first example: (start #\+ (or (cat "admin" (maybe #\s)) "staff" "wizard) end). The new (cat arguments) complex keyword simply combines all its arguments together into one expression for things like alternation. The final compiled regexp is now ^\+(?:admins?|staff|wizards)$.

    NOTE on the (?:) construct in the compiled regexps if you know regexps but are unfamilar with it. This is a perl extension that groups the sub-regexp inside it together so it's treated as one token, without capturing the text it matches. If you don't understand capturing, I'll mention that later.

  3. Going back to the first example, let's consider the case when you want your +who to take an optional prefix argument to show only players whose name start with that argument. This, and the no-argument form can all be handled by one regular expression, like so. (start (maybe #\+) "who" (maybe spaces (capture lots)) end). This introduces two new simple keywords: spaces, which matches one or more whitespace characters (Space and tab. There's also space, which matches any single whitespace character), and lots, which matches at least one character and up to as many as possible given the rest of the pattern. Regular expressions in general try to match the longest possible part of what they're matched against. The new complex keyword, (capture arguments), will save the text it matches for use later. In the $-command, this text, the prefix argument, will be in %1. If just a simple +who was given, %1 will be empty. In perl, it would in the variable $1. The regexp compiles to ^\+who(?:\s+(.+))?$

  4. Time for an example that doesn't deal with $-command patterns. Here's one that tries to check to see if a string looks like a simple number (And yes, I know in mushes you can just use isnum() and isint()). (start (maybe (one-of #\+ #\-)) (zero-or-more digit) (maybe ".") digits end). The new keyword one-of will match a single character from its arguments. zero-or-more will match as many repetitions of its arguments as possible in a row, down to possibly none at all. So, broken down, this will look for strings that start with an optional + or - sign, possibly followed by some digits, possibly a decimal point, and then some more digits. The string "123" will match, as will "1.23" or "+.23", but not "123 and something". The regexp is compiled down to ^[+\-]?\d*\.?\d+$

  5. More examples to come...

Keyword Reference

Characters

Characters are written like so: #\X, where X is the character in question. For example, 'A' is written #\A, 'a' #\a. Space is written as #\space. If you don't like this style of characters, you can usually just use a string instead. "A", "a", " ". Characters are written out into the compiled regexp, escaped with a backslash if they have to because they have a special meaning in regular expressions (Metacharacters).

Strings

Strings are any text inclosed in double-quotes. Literal double-quotes can be escaped with a backslash ("a \"quoted\" word"). Strings are written out into the compiled regexp, with any metacharacters in them escaped.

Simple keywords

Recognized Simple Keywords
KeywordMeaningCompiled to
startAnchors to the start of the string^
endAnchors to the end of the string$
digitMatches any single digit\d
digitsMatches one or more digits\d+
anyMatches any single character.
lotsMatches one or more characters.+
spaceMatches any single whitespace character\s
spacesMatches one or more whitespace characters\s+
letterMatches any single alphanumeric character, or _\w
lettersMatches one or more alphanumeric or _\w+
alphaMatches any single alphabetic character[[:alpha:]]
alphanumericMatches any single alphanumeric character. Unlike the letter keyword, it will not match an underscore.[[:alnum:]]
lower-caseMatches any single lowercase letter[[:lower:]]
upper-caseMatches any single uppercase letter[[:upper:]]
non-spaceMatches any single non-whitespace character\S
non-digitMatches any single non-digit character\D
non-letterMatches any single non-alphanumeric, non-_ character\W
word-boundaryAsserts that the current location being looked at in the string to be matched is between a letter and a non-letter.\b
not-a-word-boundaryAssorts that the current location being looked at is either between two letters or two non-letter's.\B

Complex keywords

Grouping

KeywordMeaningCompiled to
(group expression ...)Makes its arguments act like they're one for the repitition keywords described below. This usually isn't needed, because the compiler will add them where appropriate.(?:expressions)
(capture expression ...)Like group, this makes its arguments act like one for reptition, but also saves a copy of the text they match. This saved text can be used with match-captured, described below, or, on mushes, after the match in a stack or register variable.(expressions)
(cat expression ...)This one just makes its arguments look like one to the compiler, and not to the generated regexp. It just concatenates its arguments together.expressions

Alternatives

These complex keywords will match one of a number of choices.

KeywordMeaningCompiled to
(or expressions...)This will try to the match the first of its expressions that it can, in order from left to right.(?:expression1|expression2|...)
(one-of character set ...)Matches one character in the set described by its arguments.[chars]
(not-one-of character set ...)Matches one character NOT in the set described by its arguments.[^chars]

Character sets for one-of and not-one-of are created from the keyword's arguments. Each argument is either a single character (#\A), a string of characters ("ABC"), a keyword, or a range of characters ((#\A . #\Z)). For the latter, any character in that range will be matched. If you want to match a caret (^), it should not be the first thing in the set. If you want to match a dash (-), it should be the last thing in the set.

Character Set keywords
KeywordMeaningCompiled to
digitAny digit character\d
spaceAny whitespace character\s
letterAny alphanumeric character or _\w
non-digitAny non-digit character\D
non-spaceAny nonwhitespace character\S
non-letterAnything but alphanumeric characters or _\W
alphaAny alphabetic letter[:alpha:]
alphanumericAny alphanumeric character[:alnum:]
lower-caseAny lower-case letter[:lower:]
upper-caseAny upper-case letter[:upper:]

An example: (one-of #\A #\B digit (#\X . #\Z) "!.,:") will match the characters A, B, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, X, Y, Z, !, ., comma, and :.

Repetition

These complex keywords will check to see if their arguments are repeated a certain number of times.

KeywordMeaningCompiled to
(maybe expression ...)Matches the expressions 1 time, or not at all if that is what's needed to get the longest possible match of the overall regexp.expressions?
(zero-or-more expression ...)The enclosed arguments are matched as many times as possible, or not at all if that is required to make the overall pattern match.expressions*
(one-or-more expression ...)The enclosed arguments are matched as many times as possible, and always at least once.expressions+
(maybe-min expression ...)Matches the expressions 0 times, or only once, if that is what's needed to get the longest possible match of the overall regexp.expressions??
(zero-or-more-min expression ...)The enclosed arguments are matched as few times as possible, or not at all if that is required to make the overall pattern match.expressions*?
(one-or-more-min expression ...)The enclosed arguments are matched as few times as possible, but always at least once.expressions+?
(n-to-m-times expression n m)The expression must match at least n times, and at most m times. n and m are both integers.expression{n,m}
(at-least-n-times expression n)The expression must match at least n times. n is an integer.expression{n,}
(exactly-n-times expression n)The expression must match n times, no more; no less. n is an integer.expression{n}

Others

KeywordMeaningCompiled to
(raw text)Inserts text unchanged into the compiled regexp. text must be a string.text
(match-captured n)Matches the same text as the n-th capture expression did, n starts with 1, and increases from left to right.\N
(check-match expression ...)Checks to see if its arguments will result in a match, without actually eating up any characters of the string being matched against. Like the start, end, boundry and not-boundry keywords, this is known as an assertation. It checks to see if something is true, and doesn't match a specific character.(?=expressions)
(check-not-match expression ...)This is very similar to check-match, but it succeeds if its arguments do NOT match.(?!expressions)

Optimizations

The or expressions are where the compiler does most of its optimizations. These currently only work when all the alternatives are either strings or characters. If everything has a common prefix, that prefix is moved out of the alternation to cut down on the amount of backtracking needed to match it. Otherwise, the set of all the possible initial characters is extracted, and an assertation that one of these characters is present is made. This will make comparing, say "cheetah" against (start (or "lion" "tiger" "bear")) fail very quickly. The set of initial characters is l, t, and b. c, the first character in "cheetah", isn't one of these, so the assertation fails and no further checking is needed. When matching against "leopard", however, the first-character assertation will succeed, but the match will fail, so it isn't foolproof. It just helps in obvious mismatches. Eventually the compiler will be able to handle things besides strings and characters.

Getting the translator

A tarball of the current version (1.0.1) is available here. I develop and test it Chicken Scheme. It should run without much trouble on other modern schemes. At a minimum, it needs to support the R5RS standard and SRFI-0 and SRFI-13. See Schemers.org for more information on the standard, SRFI's, links to scheme implementations, and more.

There are three interfaces: The CGI script, which currently only works with chicken, and is hardwired to depend on the URL of this page, and thus probably isn't very useful to anyone else. There's also a Read-Eval-Print-Loop (REPL) interface that just repeats reading in a regexp description and showing the compiled form over and over. Finally, there are the individual functions for turning a regexp description into a compiled line-noise string, escaping it appropriately, and displaying it nicely.

To Do

Roughly in order of priority... Improve documentation. Make the compiler optimizer smarter about not adding groupings where they aren't needed, and implement more optimizations and get smarter about escapes in one-of and not-one-of clauses. An actual regexp matcher in addition to the compiler. A decompiler to go from line-noise to verbose description form.

Further Notes

The regexps the translator/compiler produces are meant for use with perl, or the PCRE (Perl Compatible Regular Expressions) library that PennMUSH and many other things use. Converting the compiler to use a different flavor should be fairly easy, though it'll come at a cost of loosing many of the available features.

Using scheme s-expressions to describe regular expressions is not a new idea, as I found out after writing the first version of the compiler (And here I had thought I was being brilliant...). For an example of another take on it, see Olin Shiver's SRE package. That is an actual regexp matcher, not just a compiler like mine (Though mine could be turned into a full matcher, and I plan to do so someday). It was influenced by different design decisions, and the results end up looking quite different despite sharing the same basic representation. Personally, I like mine better. But I'm biased.


Shawn Wagner
Last modified: Wed May 2 20:24:54 PDT 2007