Using custom functions with SQLAlchemy and SQLite

I recently developed a Web-based translation memory (TM) application in Python. One thing the application does is fuzzy glossary matching: given a source sentence, it'll find all terms in the glossary that are fuzzy substrings of that sentence (using my fuzzy substring matching module, which is based on the Levenshtein distance algorithm), and return the terms along with their translations.

Here's how I created a custom function for fuzzy glossary searches, using SQLAlchemy for the ORM, with SQLite as the database engine. Assuming you've got your SessionClass object, create a session, and get a connection object:

import subdist

def make_gloss_func(haystack):
    """Creates a fuzzy substring matcher using haystack
    "
""
    get_score = subdist.get_score

    def gloss_func(needle):
        return get_score(needle, haystack)
    return gloss_func

class TM(object):
    """Represents a translation memory (TM)/glossary"""

    # stuff skipped…

    def gloss_search(self, query, minscore):
        """Do a fuzzy glossary search.
        "
""

        try:
            session = self.SessionClass()

            # Create the custom function
            gloss_func = make_gloss_func(query)
            conn = session.bind.connect()
            conn.connection.create_function("gloss_score",
                                            1,
                                            gloss_func)

            # Execute the query
            search_string = """SELECT * FROM records
                    WHERE gloss_score(source)>=:minscore"
""
            return conn.execute(search_string,
                            dict(minscore=minscore))

        finally:
            self.SessionClass.remove()

Speedup

Just for fun, I compared the speed of (1) using custom functions in SQLite with (2) keeping the records as a Python array, and getting the matches in pure Python using a list comprehension. I found that the SQLAlchemy version is about 8 times faster. In the test, I created a glossary of 44,732 records using random word pairs, and got the fuzzy substrings for a query sentence.

version time
native Python 0.7837 s
SQLAlchemy 0.0966 s

Since the fuzzy-matching code and database code are written in C, the SQLAlchemy version is probably approaching near-C speeds, with the only slowdown being the overhead of calling them from Python (which is pretty minimal; most of the work is done elsewhere).

More importantly, the SQLAlchemy version easily meets my performance target of a 50,000-record search in 0.25 seconds, while the native Python version falls pretty far short.

Also interestingly, I found that psyco didn't speed up either version at all, and in fact made both slightly slower. Another demonstration that you should profile rather than applying psyco as a panacea.

Here's the code used for the test.

4 comments to Using custom functions with SQLAlchemy and SQLite

  • Hi Ryan,

    Searched through Python arrays will take order(n) time, no matter what you do. Psyco will not help at that at all.

    I like your SQLAlchemy solution. Just for fun, if you compare its performance against a Python dictionary (instead of array) or the same size, you’ll see completely different results.

    Searching through a dictionary takes order(log-n) time. Inserting to the dictionary would be slower – but not a killer.

    Amir

  • @Amir

    I’m not sure whether a dictionary would gain anything, because this is an iteration problem, not a search problem. You’ve still got to iterate through every entry in your database, because your search query is new each time. The dictionary would speed things up if you could leverage the hash function.

  • I probably didn’t understand your original algorithm. What I meant is that if there’s a long array and you’re implicitly searching it (like array.count()), you’re getting execution time that’s linear to the array’s size.

  • I probably didn’t explain the algorithm well enough. You’ve got to iterate through each record in your glossary, retrieving all the records with gloss_score(query) gte minscore. SQLite iterates through lots of records much faster than pure Python.

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>