Version 0.2 of subdist module released

Just a quick note that I've released version 0.2 of my subdist module. What is subdist? subdist is a C Python extension that calculates fuzzy substring matches, based on Levenshtein distance. subdist works purely with Unicode strings; calling one of its functions with a non-Unicode string will raise an error. What's new in version 0.2? […]

Aim high

Alex Martelli has a great quote on optimization in Python in a Nutshell: Start by designing, coding, and testing your application in Python, using available extension modules if they save you work. This takes much less time than it would with a classic compiled language. Then benchmark the application to find out if the resulting […]

The past, present, and future of optimization

I have a relative ("Dan") who used to earn a living optimizing code in the late 70s and early 80s. Around then, a new-fangled high-level language named "C" was starting to catch on, but companies didn't like all the wasted cycles in C programs due to the under-optimized assembly code that their C compilers were […]

The machine-translation pipe dream

An article in Sankei News about NEC putting machine translation onto mobile phones (Japanese) has created a bit of buzz on Honyaku (a mailing list for J<>E translators). Every time some new development in the machine translation world comes out, translators start to worry about whether they're going to be put out of work. Let […]

Extending Python with C: A case study

Near-100x speedup with a C extension. I recently wrote about an algorithm for fuzzy matching of substrings, implemented in Python. This is a feature that I needed for a piece of software I'm currently developing. When I started using the fuzzy_substring function on some test cases, however, it was unacceptably slow. Using a modestly large […]

Fuzzy substring matching with Levenshtein distance in Python

Levenshtein distance is a well-known technique for fuzzy string matching. With a couple of modifications, it's also possible to use Levenshtein distance to do fuzzy matching of substrings. Let's take a simple example just to show what I mean. Needle: "aba"; haystack: "c abba c". We can intuitively see that "aba" should match up […]
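As a rough illustration of the idea in this excerpt, here is a minimal sketch of one common way to adapt Levenshtein distance to substrings. It is not necessarily the code from the post: the function name substring_edit_distance is hypothetical, and the two modifications shown (a first row of zeros, and taking the minimum of the last row) are an assumption about which modifications are meant.

```python
def substring_edit_distance(needle, haystack):
    """Smallest Levenshtein distance between needle and any substring of haystack.

    Hypothetical helper for illustration; not the post's own function.
    """
    if not needle:
        return 0  # the empty needle matches anywhere at no cost

    # Modification 1: the first row is all zeros, so a match may start
    # at any position in the haystack without penalty.
    previous = [0] * (len(haystack) + 1)

    for i, n_char in enumerate(needle, start=1):
        # Matching the first i needle characters against nothing costs i.
        current = [i]
        for j, h_char in enumerate(haystack, start=1):
            cost = 0 if n_char == h_char else 1
            current.append(min(
                previous[j] + 1,         # drop a needle character
                current[j - 1] + 1,      # absorb an extra haystack character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current

    # Modification 2: take the minimum of the last row, so a match may
    # end at any position in the haystack.
    return min(previous)


print(substring_edit_distance("aba", "c abba c"))  # 1
```

With the needle and haystack from the example, the result is 1: "aba" does not occur exactly, but it is one edit away from "abba".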