October 25, 2007
Splitting queries into search terms with Python
I'm currently writing a site to host the archives for the Honyaku mailing list. I needed to split a search query into individual terms so that, for example, the query "懸念 risk" would retrieve all posts containing the words "懸念" and "risk", not necessarily in that order. I also wanted to support quoted strings, to enable searching for exact phrases.
My first thought was to use the built-in shlex module.
>>> import shlex
>>> shlex.split('parrot "lovely plumage"')
['parrot', 'lovely plumage']
It seemed promising, but some further experimentation showed that the module wasn't going to work without some serious hackery.
>>> shlex.split('"it\'s just sleeping" -deceased')
["it's just sleeping", '-deceased']
Still fine, but…
Traceback (most recent call last):
…
ValueError: No closing quotation
So I discarded the shlex option.
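To make the failure concrete, here is a minimal sketch of one way shlex can choke on a query. The input that produced the traceback above is not shown, so the unbalanced apostrophe below is an assumption about the kind of query that triggers it, and try_shlex is a hypothetical helper name used only for illustration.

```python
import shlex


def try_shlex(query):
    """Return the shlex tokens, or None if shlex rejects the query."""
    try:
        return shlex.split(query)
    except ValueError:
        # shlex treats a lone apostrophe as an opening quote and
        # raises ValueError when it never finds the closing one.
        return None


print(try_shlex('parrot "lovely plumage"'))  # balanced quotes: tokens come back
print(try_shlex("it's deceased"))            # lone apostrophe: None
```

Catching the exception keeps the search page from crashing, but it still leaves an ordinary query like this one without any terms, which is why shlex looked like more trouble than it was worth.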
Then I remembered the (also built-in) csv module. Let's give that one a try:
import csv

def get_terms(query):
    """Get the terms from the query string"""
    # Collapse runs of whitespace so csv sees single-space delimiters
    reader = csv.reader([' '.join(query.split())],
                        delimiter=' ')
    return next(reader)  # Since it's only one line

queries = """parrot "lovely plumage"
"it's sleeping" -deceased
-sleeping it's deceased
"not sleeping" it's deceased
"""

for query in queries.splitlines():
    print(query)
    print(" =>", get_terms(query))
Notice how I normalize the whitespace in the query before feeding it to the csv module. Without that step, every extra space would be read as a delimiter and produce an empty term.
Output:
parrot "lovely plumage"
=> ['parrot', 'lovely plumage']
"it's sleeping" -deceased
=> ["it's sleeping", '-deceased']
-sleeping it's deceased
=> ['-sleeping', "it's", 'deceased']
"not sleeping" it's deceased
=> ['not sleeping', "it's", 'deceased']
Looks pretty good so far!
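The whitespace normalization in get_terms is easy to overlook, so here is a small sketch of what it buys. The names split_raw and split_normalized are hypothetical, chosen just for this comparison; the second one mirrors the approach used in get_terms.

```python
import csv


def split_raw(query):
    """Split on single spaces with csv, without touching the whitespace."""
    return next(csv.reader([query], delimiter=' '))


def split_normalized(query):
    """Collapse runs of whitespace first, then split with csv."""
    return next(csv.reader([' '.join(query.split())], delimiter=' '))


sloppy = 'parrot   plumage '
print(split_raw(sloppy))         # empty strings appear between and after the words
print(split_normalized(sloppy))  # ['parrot', 'plumage']
```

One side effect, which seems acceptable for a search box: whitespace inside a quoted phrase is collapsed too, so a phrase typed with extra spaces is searched with single spaces.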
For now at least, I'm going with the csv option for tokenizing query strings. It seems like the simplest thing that could possibly work <g>.