Clumsy regex syntax in Python is a feature

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

–Jamie Zawinski

A common complaint about Python is that its regular-expression syntax is clumsy compared to some other languages. But library support, rather than core-language support, is actually a feature, not a bug: it keeps the core syntax clean, and it makes you think twice before reaching for regular expressions.

When regular expressions are the right tool, they're great. They have one real advantage:

  • Sometimes they're more concise than the equivalent code

But regular expressions have some disadvantages compared to straight code that make them best avoided in most cases:

  • They're harder to read (at least compared to languages like Python)
  • They're harder to debug
  • They're harder to maintain
  • They're less reliable (easy to miss edge cases)

Because of these disadvantages, regular expressions should usually be (at most) the second thing you consider. When regular expression support is provided in a library rather than the core language, it gives you that moment to think before rushing ahead with a regular-expression solution to a problem that could have been solved more easily, correctly, and maintainably with another technique.

First, if you simply want to test a string for a prefix, suffix, or substring, use the string methods directly:

line.startswith("Warning")
line.strip().lower().endswith("new york.")
"spam" in my_onigiri
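To see why the string methods win here, compare them side by side with `re` equivalents (a contrived sketch with made-up sample strings):

```python
import re

line = "Warning: disk nearly full"
place = "  We moved to New York.  "

# String methods state the intent directly...
assert line.startswith("Warning")
assert place.strip().lower().endswith("new york.")

# ...while the regex equivalents bury the same checks in syntax.
assert re.match(r"Warning", line)
assert re.search(r"new york\.$", place.strip().lower())
```

Both versions pass, but only one of them can be read aloud.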

Also, if there's a library that already does what you want, don't write a regular expression anyway. Remember the old adage: Good programmers write good code, but great programmers steal great code.

Here are some more cases where I would say that regular expressions are ill advised.

Parsing links in HTML text

Say we want to extract all the href values and link text from the anchors in an HTML page. Here's a handy regular expression to do this, from Mastering Regular Expressions:

# Note: the regex in the while(…) is overly simplistic-see text for discussion
while ($Html =~ m{<a\b([^>]+)>(.*?)</a>}ig)
{
    my $Guts = $1; # Save results from the match above, to their own…
    my $Link = $2; # …named variables, for clarity below.

    if ($Guts =~ m{
        \b HREF     #  "href" attribute
        \s* = \s*   #  "=" may have whitespace on either side
        (?:         #  Value is…
        "([^"]*)"   #    double-quoted string,
        |           #    or…
        '([^']*)'   #    single-quoted string,
        |           #    or…
        ([^'">\s]+) #    "other stuff"
        )           #
    }xi)
    {
        my $Url = $+; # Gives the highest-numbered actually-filled $1, $2, etc.
        print "$Url with link text: $Link\n";
    }
}

OK, not the prettiest code, but certainly the best, right?
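(And for what it's worth, a fairly direct Python translation of that regex is no prettier. This is just a sketch using the `re` module, in Python 3 syntax, run against a made-up one-line snippet of HTML:)

```python
import re

html = '<a href="http://google.com/">Google</a>'

# Direct translation of the Perl regexes above -- still hard to read.
anchor = re.compile(r'<a\b([^>]+)>(.*?)</a>', re.I | re.S)
href = re.compile(r'''\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'">\s]+))''',
                  re.I | re.X)

for guts, text in anchor.findall(html):
    m = href.search(guts)
    if m:
        # Take whichever alternation group actually matched.
        url = next(g for g in m.groups() if g is not None)
        print(url, "with link text:", text)
```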

Let's try this without a regular expression. First, get yourself the lovely Beautiful Soup module. Now, try this out for size:

from BeautifulSoup import BeautifulSoup, SoupStrainer
from urllib2 import urlopen

anchors = SoupStrainer('a')
soup = BeautifulSoup(urlopen("http://google.com/"),
                     parseOnlyThese=anchors)

for tag in soup:
    print tag['href'], "with link text:", tag.renderContents()

The regular expression is written in Perl and the code version in Python, but I'm not trying to argue the relative merits of one language over the other. I'm sure Perl has a perfectly good HTML parsing library or seven. What I'm getting at is that the regular expressions themselves are less readable, less maintainable, less reliable, and more brittle than their code equivalents.
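And if you'd rather not install anything at all, even the standard library can do this without a regex. Here's a minimal sketch using `html.parser` (the Python 3 name for the module; the `LinkExtractor` class is my own invention):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


parser = LinkExtractor()
parser.feed('<p><a href="http://google.com/">Google</a></p>')
print(parser.links)  # [('http://google.com/', 'Google')]
```

More lines than the regex, yes, but every line does one obvious thing, and malformed attribute quoting is the parser's problem rather than yours.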

Parsing CSV files

Here's another example from Mastering Regular Expressions, this time parsing fields from a CSV file using Java (and java.util.regex).

    //Prepare the regexes we'll use

    Pattern pCSVmain = Pattern.compile(
        "   \\G(?:^|,)                                   \n"+
        "   (?:                                          \n"+
        "      # Either a double-quoted field…         \n"+
        "      \" # field's opening quote                \n"+
        "       ( (?> [^\"]*+ ) (?> \"\" [^\"]*+ )*+ )   \n"+
        "      \" # field's closing quote                \n"+
        "    # … or …                                \n"+
        "    |                                           \n"+
        "      # … some non-quote/non-comma text …   \n"+
        "      ( [^\",]*+ )                              \n"+
        "   )                                            \n",
        Pattern.COMMENTS);
    Pattern pCSVquote = Pattern.compile("\"\"");
    // Now create Matcher objects, with dummy text, that we'll use later.
    Matcher mCSVmain  = pCSVmain.matcher("");
    Matcher mCSVquote = pCSVquote.matcher("");

    mCSVmain.reset(csvText); // Tie the target text to the mCSVmain object
    while ( mCSVmain.find() )
    {
        String field; // We'll fill this in with $1 or $2 . . .
        String first = mCSVmain.group(2);
        if ( first != null )
            field = first;
        else {
            // If $1, must replace paired double-quotes with one double quote

            mCSVquote.reset(mCSVmain.group(1));
            field = mCSVquote.replaceAll("\"");
        }
        // We can now work with field . . .
        System.out.println("Field [" + field + "]");
    }

I hope he's getting paid by the line!

OK, here's the behemoth cut down to size using Python's built-in csv library (adapted from the excellent Python docs):

import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    for field in row:
        print "Field [%s]" % field

Again, the code version is much more readable, reliable, maintainable, extensible, etc. There's really no reason to use a regular expression here except masochism.
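The reliability point is worth making concrete. Embedded commas and doubled quotes are exactly the edge cases the hand-rolled regex has to handle explicitly, and the csv module gets them right for free (a small Python 3 sketch with made-up data):

```python
import csv
import io

# One row containing an embedded comma and an escaped ("" doubled) quote --
# the cases the Java regex above spends most of its lines on.
data = io.StringIO('plain,"with, comma","she said ""hi"""\n')
row = next(csv.reader(data))
print(row)  # ['plain', 'with, comma', 'she said "hi"']
```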

Conclusion

Having regular expression support in a library instead of the core language gives us two advantages: (1) it serves as a sanity check, making us think before whipping out some big, complex regex; and (2) it reduces the amount of syntax we have to remember. That second point might seem trivial, but each extra bit of syntax we have to remember increases our cognitive load, reducing the processing power we have left for writing cool code. Take it from an old C++ programmer :).

Of course, there'll be times when a regular expression is the best way to go. But even when you're sure you've got a problem only a regex can solve without painful coding contortions, take a look at PyParsing first!

Finally, how attractive regular expressions look will probably vary with the power of your language. They usually seem a lot more tempting when I'm writing C++ than when I'm writing Python.

2 comments to Clumsy regex syntax in Python is a feature

  • Ben

    What’s the point of comparing library routines to do a job with regular expressions?

  • > What’s the point of comparing library routines to do a job with regular expressions?

    My point is that when regular expressions have first-class syntax, they encourage programmers to use them when a library routine would be the better choice.

    That is one reason why, IMO, you see so many examples of using regular expressions to parse HTML, CSV, etc. when a library routine would have been better.
