Splitting queries into search terms with Python

I'm currently writing a site to host the archives for the Honyaku mailing list. I needed to split a search query into individual terms, so that, for example, the query "懸念 risk" would retrieve all posts containing both "懸念" and "risk", not necessarily in that order. I also wanted to support quoted strings, to enable searching for exact phrases.

My first thought was to use the built-in shlex module.

>>> import shlex
>>> shlex.split('parrot "lovely plumage"')
['parrot', 'lovely plumage']

It seemed promising, but some further experimentation showed that the module wasn't going to work without some serious hackery.

>>> shlex.split('"it\'s just sleeping" -deceased')
["it's just sleeping", '-deceased']

Still fine, but…

>>> shlex.split("-sleeping it's deceased")
Traceback (most recent call last):
ValueError: No closing quotation

So I discarded the shlex option.
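The error above comes from shlex following shell quoting rules: a lone apostrophe is read as an opening single quote that never gets closed. As a rough Python 3 sketch of what that means for arbitrary user queries, the hypothetical helper below (try_shlex is my name for illustration, not part of shlex) shows that every call would need guarding:

```python
import shlex

def try_shlex(query):
    """Return the shlex tokens, or None if shlex rejects the query."""
    try:
        return shlex.split(query)
    except ValueError:
        # shlex raises ValueError for unbalanced quotes,
        # which includes a lone apostrophe such as the one in "it's"
        return None

print(try_shlex('parrot "lovely plumage"'))  # ['parrot', 'lovely plumage']
print(try_shlex("-sleeping it's deceased"))  # None
```

Catching the exception only tells you the query was rejected; it doesn't recover the terms, which is why this route would take more work.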

Then I remembered the (also built-in) csv module. Let's give that one a try:

import csv

def get_terms(query):
    """Get the terms from the query string"""
    # Collapse runs of whitespace so csv doesn't emit empty fields
    reader = csv.reader([' '.join(query.split())],
                        delimiter=' ')
    return next(reader)  # Since it's only one line

queries = """parrot "lovely plumage"
"it's sleeping" -deceased
-sleeping it's deceased
"not sleeping" it's deceased"""

for query in queries.splitlines():
    print(query)
    print("    =>", get_terms(query))

Notice how I normalize the whitespace in the query before feeding it to the csv module. Running the script prints:

parrot "lovely plumage"
    => ['parrot', 'lovely plumage']
"it's sleeping" -deceased
    => ["it's sleeping", '-deceased']
-sleeping it's deceased
    => ['-sleeping', "it's", 'deceased']
"not sleeping" it's deceased
    => ['not sleeping', "it's", 'deceased']

Looks pretty good so far!
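To see why the whitespace normalization matters, here is a small Python 3 sketch comparing the two approaches. The helper names split_raw and split_normalized are made up for this illustration; split_normalized does the same thing as get_terms.

```python
import csv

def split_raw(query):
    """Split on single spaces without normalizing whitespace first."""
    return next(csv.reader([query], delimiter=' '))

def split_normalized(query):
    """Collapse runs of whitespace first, then split on single spaces."""
    return next(csv.reader([' '.join(query.split())], delimiter=' '))

query = 'parrot   "lovely plumage" '
print(split_raw(query))         # ['parrot', '', '', 'lovely plumage', '']
print(split_normalized(query))  # ['parrot', 'lovely plumage']
```

Without normalization, every extra space becomes an empty field. One side effect worth knowing about: the normalization also collapses runs of whitespace inside quoted phrases, which is probably what you want for search anyway.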

For now at least, I'm going with the csv option for tokenizing query strings. It seems like the simplest thing that could possibly work <g>.
