Archive for December, 2007

Version 0.2 of subdist module released

Just a quick note that I've released version 0.2 of my subdist module.
What is subdist?
subdist is a C Python extension that calculates fuzzy substring matches, based on Levenshtein distance.
subdist works purely with Unicode strings; calling one of its functions with a non-Unicode string will raise an error.
What's new in version 0.2?
Version 0.2 adds a get_score […]

Aim high

Alex Martelli has a great quote on optimization in Python in a Nutshell:
Start by designing, coding, and testing your application in Python, using available extension modules if they save you work. This takes much less time than it would with a classic compiled language. Then benchmark the application to find out if the resulting code […]

The past, present, and future of optimization

I have a relative ("Dan") who used to earn a living optimizing code in the late 70s and early 80s. Around then, a new-fangled high-level language named "C" was starting to catch on, but companies didn't like all the wasted cycles in C programs due to the under-optimized assembly code that their C compilers were […]

The machine-translation pipe dream

An article in Sankei News about NEC putting machine translation onto mobile phones (Japanese) has created a bit of buzz on Honyaku (a mailing list for J<>E translators).
Every time some new development in the machine translation world comes out, translators start to worry about whether they're going to be put out of work. Let […]

Extending Python with C: A case study

Near-100x speedup with a C extension
I recently wrote about an algorithm for fuzzy matching of substrings implemented in Python. This is a feature that I needed for a piece of software I'm currently developing.
When I started using the fuzzy_substring function on some test cases, however, it was unacceptably slow. Using a modestly large test corpus […]

Fuzzy substring matching with Levenshtein distance in Python

Levenshtein distance is a well-known technique for fuzzy string matching. With a couple of modifications, it's also possible to use Levenshtein distance to do fuzzy matching of substrings.
Let's take a simple example just to show what I mean.
needle: "aba"
haystack: "c abba c"
We can intuitively see that "aba" should match up against "abba." Here's a […]
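
For readers who want something concrete while the excerpt is cut off, here is a minimal sketch of one common way to adapt Levenshtein distance to substrings. It is not necessarily the code from the full post: the function name fuzzy_substring is borrowed from the case-study excerpt above, and the two modifications shown (a first row of zeros so a match can start anywhere in the haystack, and taking the minimum of the last row so it can end anywhere) are assumptions about what "a couple of modifications" means.

```python
def fuzzy_substring(needle, haystack):
    """Return the smallest Levenshtein distance between needle and
    any substring of haystack (illustrative sketch, not the original code)."""
    # Modification 1 (assumed): the first row is all zeros, so a match
    # may begin at any position in haystack at no cost.
    previous = [0] * (len(haystack) + 1)
    for i, needle_char in enumerate(needle, start=1):
        # Column 0: i needle characters matched against nothing cost i edits.
        current = [i]
        for j, hay_char in enumerate(haystack, start=1):
            substitution = previous[j - 1] + (needle_char != hay_char)
            current.append(min(previous[j] + 1,      # skip a needle character
                               current[j - 1] + 1,   # skip a haystack character
                               substitution))        # match or substitute
        previous = current
    # Modification 2 (assumed): the match may end anywhere in haystack,
    # so take the minimum over the whole last row.
    return min(previous)


print(fuzzy_substring("aba", "c abba c"))
```

With the needle and haystack from the example, the best substring is one edit away from "aba". The sketch keeps only two rows in memory at a time, so it uses space proportional to the length of the haystack.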