What price elegance?

In a recent post, I gave some code for counting the top n most frequent words in an arbitrary text file using itertools.groupby.

The code is written in a somewhat functional style. It's short and, dare I say, kind of elegant. But it turns out to be quite a bit slower than an equivalent imperative version that uses collections.defaultdict.

Here are the two functions:

from itertools import groupby
from collections import defaultdict

def get_top_freqs_gb(filename, num):
    """Get the top num words from filename as a list
    of (word, freq) tuples, using itertools.groupby
    "
""
    freqs = [(len(list(g)), k)
        for k, g in groupby(sorted(get_words(filename)))]
    return get_top(freqs, num)

def get_top_freqs_dd(filename, num):
    """Get the top num words from filename as a list
    of (word, freq) tuples, using collections.defaultdict
    "
""
    freq_dict = defaultdict(int)
    for word in get_words(filename):
        freq_dict[word] += 1
    freqs = [(v, k)
             for k, v in freq_dict.iteritems()]
    return get_top(freqs, num)

Here are the helper functions:

import re

def get_words(filename):
    """Get the words from filename"""
    split = re.compile(r"\b\w+\b").findall
    return [word
             for line in open(filename)
             for word in split(line.lower())]

def get_top(freqs, num):
    """Return the num most frequent (freq, word) pairs from freqs,
    flipped to (word, freq) tuples, most frequent first."""
    return [(b, a) for a, b in reversed(sorted(freqs)[-num:])]
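
To make the tuple juggling in get_top concrete, here's roughly what it does on a small hand-made freqs list (the sample words and counts are mine, not taken from any real file):

freqs = [(3, "the"), (1, "python"), (2, "of"), (5, "a")]
print get_top(freqs, 2)
# -> [('a', 5), ('the', 3)]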

The groupby version is shorter than the defaultdict version, and I'd say it's simpler and more readable as well. Being shorter and simpler, it's also less likely to contain bugs and easier to maintain. In particular, the defaultdict version mutates a local variable (the accumulator in its for loop), which is a classic source of bugs.

But the defaultdict version of the function winds up being considerably faster.

Here are the times it took to run these functions 10 times on my computer, retrieving the top 50 most frequent words from "/python25/readme.txt" (seconds, rounded to 4 decimal places).

                         Without psyco    With psyco
groupby version          0.3133 s         0.2193 s
defaultdict version      0.2852 s         0.1818 s
groupby / defaultdict    1.41             1.58

The defaultdict version is about 1.4x as fast as the groupby version. The gap grows even further when psyco is used, making the defaultdict version nearly 1.6x as fast. I'd say that most of the slowness comes from the groupby version performing two sorts, one of them over every word in the file, compared to a single sort over the much smaller frequency list in the defaultdict version.
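
That first sort is unavoidable because groupby only merges runs of adjacent equal items; on unsorted input the same word comes out as several separate groups. A tiny illustration (the toy word list is mine):

from itertools import groupby

words = ["b", "a", "b", "a"]
print [(k, len(list(g))) for k, g in groupby(words)]
# -> [('b', 1), ('a', 1), ('b', 1), ('a', 1)]
print [(k, len(list(g))) for k, g in groupby(sorted(words))]
# -> [('a', 2), ('b', 2)]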

(The psyco speedup for the defaultdict version comes from the for loop; changing get_words to return a generator expression eliminates the speedup. The speedup for the groupby version comes from the freqs list comprehension; changing this to a generator expression likewise eliminates its speedup.)
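
For concreteness, the generator-expression variant of get_words I'm referring to would look something like this (get_words_gen is just my name for the sketch; the body is unchanged apart from the parentheses):

def get_words_gen(filename):
    """Like get_words, but yields words lazily instead of building a list."""
    split = re.compile(r"\b\w+\b").findall
    return (word
            for line in open(filename)
            for word in split(line.lower()))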

So which one should I use?

It's pretty common for Python code written in a functional style to be slower than equivalent code written in an imperative style. Nevertheless, I tend to prefer the more functional style of programming, switching to a more imperative style (or other forms of optimization) if performance isn't satisfactory.

It is easier to optimize correct code than to correct optimized code.

–Yves Deville

A big question here is how to tell whether the functional version is fast enough. My general rule of thumb is that the user will tolerate a wait of up to two seconds for a typical "grovel through these files and tell me something interesting" command that's performed infrequently (how often do you need word frequencies from files?). For a more common action, the wait should be under a second, with less than 0.5 seconds being ideal (this covers GUI responsiveness but not Web page loading). Given the times above, and assuming the user will search no more than 50 files of sizes comparable to Python's README file, either version of the function is sufficient. If we assume the user will search up to 100 files, or files substantially larger than the README, then only the imperative version is acceptable (and we may need to optimize it further if our demands grow beyond that).
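
As a rough sanity check on those file-count limits, here's the back-of-envelope arithmetic I have in mind, scaling the 10-run totals from the table above (an estimate only; it ignores I/O caching and other real-world effects):

# Per-file time is one tenth of each 10-run total measured above.
for name, total in [("groupby", 0.3133),
                    ("defaultdict", 0.2852),
                    ("defaultdict + psyco", 0.1818)]:
    per_file = total / 10
    for nfiles in (50, 100):
        print "%s, %d files: ~%.1f s" % (name, nfiles, per_file * nfiles)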

That's why it's so important to profile and test Python programs from the very beginning. I keep a suite of test cases that I profile with every build (performed at least daily), noting trends in performance and optimizing when the code has gelled and bottlenecks remain.

Here is the test code:

from time import clock

def time_func(func, iterations, *args, **kwargs):
    """Return the time it takes to execute func
    itertations times."
""
    start = clock()
    for x in xrange(iterations):
        func(*args, **kwargs)
    return clock() - start

def main():
    filename = "/python25/readme.txt"
    top_gb = get_top_freqs_gb(filename, 100)
    top_dd = get_top_freqs_dd(filename, 100)
    assert top_gb == top_dd

    for func in [get_top_freqs_gb, get_top_freqs_dd]:
        name = func.__name__
        seconds = time_func(func, 10, filename, 50)
        print "%s: %s" % (name, seconds)

    print "With psyco"

    import psyco
    psyco.full()

    for func in [get_top_freqs_gb, get_top_freqs_dd]:
        name = func.__name__
        seconds = time_func(func, 10, filename, 50)
        print "%s: %s" % (name, seconds)

if __name__ == "__main__":
    main()

The whole shebang:

#coding: UTF8
"""
Testing functional programming stuff
"
""

from itertools import groupby
from collections import defaultdict
import re
from time import clock

def get_words(filename):
    """Get the words from filename"""
    split = re.compile(r"\b\w+\b").findall
    return [word
             for line in open(filename)
             for word in split(line.lower())]

def get_top(freqs, num):
    """Return the num most frequent (freq, word) pairs from freqs,
    flipped to (word, freq) tuples, most frequent first."""
    return [(b, a) for a, b in reversed(sorted(freqs)[-num:])]

def get_top_freqs_gb(filename, num):
    """Get the top num words from filename as a list
    of (word, freq) tuples, using itertools.groupby
    "
""
    freqs = [(len(list(g)), k)
        for k, g in groupby(sorted(get_words(filename)))]
    return get_top(freqs, num)

def get_top_freqs_dd(filename, num):
    """Get the top num words from filename as a list
    of (word, freq) tuples, using collections.defaultdict
    "
""
    freq_dict = defaultdict(int)
    for word in get_words(filename):
        freq_dict[word] += 1
    freqs = [(v, k)
             for k, v in freq_dict.iteritems()]
    return get_top(freqs, num)

def time_func(func, iterations, *args, **kwargs):
    """Return the time it takes to execute func
    itertations times."
""
    start = clock()
    for x in xrange(iterations):
        func(*args, **kwargs)
    return clock() - start

def main():
    filename = "/python25/readme.txt"
    top_gb = get_top_freqs_gb(filename, 100)
    top_dd = get_top_freqs_dd(filename, 100)
    assert top_gb == top_dd

    for func in [get_top_freqs_gb, get_top_freqs_dd]:
        name = func.__name__
        seconds = time_func(func, 10, filename, 50)
        print "%s: %s" % (name, seconds)

    print "With psyco"

    import psyco
    psyco.full()

    for func in [get_top_freqs_gb, get_top_freqs_dd]:
        name = func.__name__
        seconds = time_func(func, 10, filename, 50)
        print "%s: %s" % (name, seconds)

if __name__ == "__main__":
    main()
