Intermediate Python: Pythonic file searches

It's very easy to get up and running with Python, but programmers coming from more verbose or procedural languages tend to write code that's not very pythonic; that is, it doesn't use the idioms that experienced Python programmers use.

The problem with un-pythonic code is that it tends to be more verbose, harder to understand, and sometimes even slower. Here's a naive implementation of a function that finds every line in a given file containing a specified string. It returns a list of (line_num, line) tuples.

def naive_way(to_find, filename):
    """Find string to_find in file filename"""
    file_handle = open(filename)
    line_number = 0
    lines = []
    done = False
    while done == False:
        line = file_handle.readline()
        if not line:
            done = True
        else:
            line_number += 1
            index = line.find(to_find)
            if index > -1:
                lines.append((line_number, line))
    return lines

This code is fairly readable and it gets the job done, but we can do better. Notice all these variables lying around? Those are bad because they clutter up the function (making the intent of the function harder to see), and actually slow down the code. Things like "line_number += 1" are more costly than you might expect, because every time you write "1" you're creating an object.

We can get rid of "done" and "file_handle" by iterating over the file directly rather than calling the low-level readline() method. We can drop the code that increments "line_number" by using the built-in enumerate function. Finally, we can get rid of "index" by using the "in" operator.

Here's a more pythonic version of the above function:

def pythonic_way(to_find, filename):
    """Find string to_find in file filename"""
    lines = []
    for line_num, line in enumerate(open(filename)):
        if to_find in line:
            lines.append((line_num+1, line))
    return lines
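
A slightly tighter variant is sketched below, assuming Python 2.6 or later (where enumerate accepts a start argument). The name pythonic_way_closing is made up for this illustration. It uses a with block so the file is closed as soon as the search finishes, and lets enumerate count from 1 so the "+1" disappears:

```python
def pythonic_way_closing(to_find, filename):
    """Find string to_find in file filename (hypothetical variant)."""
    lines = []
    # The with block closes the file when the block is exited.
    with open(filename) as file_handle:
        # start=1 makes enumerate yield 1-based line numbers directly.
        for line_num, line in enumerate(file_handle, start=1):
            if to_find in line:
                lines.append((line_num, line))
    return lines
```

The return value has the same shape as before: a list of (line_num, line) tuples with 1-based line numbers.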

Remember what I said about pythonic code being faster? Here are the times I got for running each function 100 times, searching for "your system" in "/python25/readme.txt" (rounded to three decimal places):

                Without psyco   With psyco
naive_way       0.411 s         0.213 s
pythonic_way    0.116 s         0.082 s

Psyco manages to narrow the gap a bit (probably by optimizing away those object creations), but even with psyco the pythonic function is roughly 2.5x faster, not to mention more readable (to a Python programmer, at least!). And since bug counts tend to grow with the number of lines of source code, it's likely to have fewer bugs as well.
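
The timing setup itself isn't shown above. One way to run a similar comparison is the standard timeit module. The sketch below is an assumption about how such a measurement could be done, not the original benchmark; find_lines, time_search and the throwaway sample file are all hypothetical.

```python
import os
import tempfile
import timeit


def find_lines(to_find, filename):
    """Return (line_num, line) tuples for lines containing to_find."""
    with open(filename) as file_handle:
        return [(num, line)
                for num, line in enumerate(file_handle, start=1)
                if to_find in line]


def time_search(func, to_find, filename, runs=100):
    """Return the total seconds taken to call func(to_find, filename) `runs` times."""
    return timeit.timeit(lambda: func(to_find, filename), number=runs)


if __name__ == "__main__":
    # Build a throwaway sample file (an assumption; the article searched a readme).
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as sample:
        sample.write("nothing here\ncheck your system\n" * 1000)
    try:
        print("%.3f s" % time_search(find_lines, "your system", path))
    finally:
        os.remove(path)
```

Passing each candidate function to time_search with the same file and search string gives totals that can be compared directly. The absolute numbers will depend on the machine and the Python version.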

14 comments to Intermediate Python: Pythonic file searches

  • You might want to use the fileinput module. It keeps track of line numbers and does things like allowing a filename of ‘-’ to mean stdin, and allows multiple files to be searched nearly as easily as one.

    - Paddy.
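
A minimal sketch of the fileinput approach described in the comment above, assuming a non-empty list of file names is passed in (with an empty list, fileinput falls back to reading stdin). The function name find_with_fileinput is hypothetical.

```python
import fileinput


def find_with_fileinput(to_find, filenames):
    """Return (filename, line_num, line) tuples across several files."""
    matches = []
    try:
        for line in fileinput.input(filenames):
            if to_find in line:
                # filelineno() is the line number within the current file;
                # lineno() would give the cumulative count across all files.
                matches.append((fileinput.filename(), fileinput.filelineno(), line))
    finally:
        # Reset the module-level state so input() can be called again.
        fileinput.close()
    return matches
```

Including the file name in each tuple is a design choice for the multi-file case. For a single file it can be dropped to match the (line_num, line) shape used in the article.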

  • Jean-Baptiste Potonnier

    Even in Python there is more than one way to do it ;)

    What do you think of:

    def find_comprension(to_find, filename):
        """Find string to_find in file filename"""
        return [(line_num+1, line) for line_num, line in enumerate(open(filename))
                if to_find in line]

    This one is my favorite. I find list comprehensions very elegant and readable, but I usually avoid nested list comprehensions, for readability.

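    def find_functional(to_find, filename):
        """Find string to_find in file filename"""

        def is_inline(pair):
            # pair is an (index, line) tuple from enumerate; tuple parameters
            # such as "def f((n, line))" are a syntax error in Python 3.
            return to_find in pair[1]

        def incr_line_nb(pair):
            return (pair[0] + 1, pair[1])

        # list() keeps the return type a list in Python 3, where map is lazy.
        return list(map(incr_line_nb, filter(is_inline, enumerate(open(filename)))))

    This one is just for fun. I find it quite tricky, in Python at least. It may be more idiomatic in a language like Haskell.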

    def find_gen(to_find, filename):
        """Find string to_find in file filename"""
        return ((line_num+1, line) for line_num, line in enumerate(open(filename))
                if to_find in line)

    Note the ‘()’ instead of the ‘[]’ used in the first function.
    This one is semantically different: you get a generator to iterate over rather than a list. I find generator expressions useful in some cases, but I prefer a list when memory usage is not a concern.

  • Jean-Baptiste Potonnier

    It seems that your blog doesn’t like Python code in comments :/
    And it even ate my ‘pre’ tags!

  • @Jean-Baptiste:
    I also prefer the list comprehension, but I thought that it seemed harder for a beginner to Python to understand. And sorry about eating tags in the comments; I’ve been meaning to add a markdown plugin for comments.

    @Paddy3118
    fileinput is a great module. I was just trying to illustrate a point about idiomatic code, but for heavy-duty uses I’d definitely recommend it. Incidentally, I timed a version of the function using fileinput, and it was as slow as the naive version — something to keep in mind.

  • Ian

    If the file size is large you might find a generator approach useful.

    def search(target, stream):
        for line in stream:
            if target in line:
                yield line

    or

    search = lambda target, stream: (line for line in stream if target in line)

    for line in search("foo", open("bar")):
        process(line)

  • One line, most pythonic:

    [(line_num,line) for line_num,line in enumerate(open(filename)) if to_find in line]

  • Anonymous

    Or, even shorter and easier to read – remove the line_num, line stuff and just focus on the result:

    [result for result in enumerate(open(filename)) if to_find in result[1]]

  • mackstann

    Python generally caches numeric (and maybe string) constants, so it actually only creates the 1 integer object once.

  • @Ian
    Yes, if the number of expected results is large. The enumerator is already a generator.

    @Anonymous
    True, but I’m assuming that the line numbers are actually needed. (The function is adapted from an actual query of comp.lang.python that asked something similar)

    @mackstann
    You’re right, I had totally forgotten about that. So the increment operation isn’t contributing to the slowdown; just the line count :)

  • Pete Cable

    @ryan
    Anonymous’ list comprehension still gives you line numbers… he just returns the tuple from enumerate. However, the line numbers are zero-indexed instead of one-indexed like yours.

  • @Pete Cable
    Quite right, sorry about that Anonymous. Nice trick about not unpacking the tuples returned by enumerate, although indexing on result seems somehow kind of vulgar… :)

  • wcyee

    I’m just learning python and I see code like this often:

    for line_num, line in enumerate(open(filename)):

    My question is: don’t you need the file handle to close the file? A lot of python code out there never takes that into account. Is there something I don’t know? Or am I just unnecessarily paranoid about resource leaks.

  • @wcyee
    Good question. Once nothing refers to the file object returned by open any more, CPython's reference counting reclaims it and closes the file right away. Other Python implementations may close it later, whenever their garbage collector runs. If you’re paranoid about these things, you can use the “with” syntax:
    with open(filename) as f:
        for num, line in enumerate(f):
            pass

    Here’s a nice link that explains how file handles and scope work in CPython:
    http://www.diveintopython.org/object_oriented_framework/instantiating_classes.html#fileinfo.scope

  • wcyee

    Ryan, thanks for answering my question. I had guessed, but didn’t know, that the GC handled closing file handles. Also, the “with” syntax is very nice. I’ll be using that a fair bit now. Thanks!
