Archive for the 'python' Category

Intermediate Python: Pythonic file searches

It's very easy to get up and running with Python, but programmers coming from more verbose or procedural languages tend to write code that's not very Pythonic; that is, it doesn't use the idioms that experienced Python programmers use.
The problems with un-pythonic code are that it tends to be more verbose, more difficult to […]

The partial rewrite

I haven't been doing much blogging lately. Instead, I've been busy working and hacking. On the hacking side of things, I've been doing a partial rewrite of my big application, a translation-memory application written in C++.
The application is now about 10 years old, and over time and several releases it's grown more and more difficult […]

Version 0.2 of subdist module released

Just a quick note that I've released version 0.2 of my subdist module.
What is subdist?
subdist is a Python extension written in C that calculates fuzzy substring matches, based on Levenshtein distance.
subdist works purely with Unicode strings; calling one of its functions with a non-Unicode string will raise an error.
What's new in version 0.2?
Version 0.2 adds a get_score […]

Aim high

Alex Martelli has a great quote on optimization in Python in a Nutshell:
Start by designing, coding, and testing your application in Python, using available extension modules if they save you work. This takes much less time than it would with a classic compiled language. Then benchmark the application to find out if the resulting code […]

The past, present, and future of optimization

I have a relative ("Dan") who used to earn a living optimizing code in the late 70s and early 80s. Around then, a new-fangled high-level language named "C" was starting to catch on, but companies didn't like all the wasted cycles in C programs due to the under-optimized assembly code that their C compilers were […]

Extending Python with C: A case study

Near-100x speedup with a C extension
I recently wrote about an algorithm for fuzzy matching of substrings implemented in Python. This is a feature that I needed for a piece of software I'm currently developing.
When I started using the fuzzy_substring function on some test cases, however, it was unacceptably slow. Using a modestly large test corpus […]

Fuzzy substring matching with Levenshtein distance in Python

Levenshtein distance is a well-known technique for fuzzy string matching. With a couple of modifications, it's also possible to use Levenshtein distance to do fuzzy matching of substrings.
Let's take a simple example just to show what I mean.
needle: "aba"
haystack: "c abba c"
We can intuitively see that "aba" should match up against "abba". Here's a […]
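
As a rough illustration of the idea (not the code from the post itself, which is cut off in this excerpt), the usual way to adapt Levenshtein distance to substrings is to let a match start anywhere in the haystack at no cost and end anywhere. The sketch below assumes that approach; the function name fuzzy_substring_distance is made up for this example.

```python
def fuzzy_substring_distance(needle: str, haystack: str) -> int:
    """Smallest edit distance between needle and any substring of haystack.

    Illustrative sketch only; the name and signature are assumptions.
    """
    if not needle:
        return 0
    # Modification 1: the first row is all zeros, so a match may begin
    # at any position in the haystack without paying for skipped text.
    previous = [0] * (len(haystack) + 1)
    for i, n_char in enumerate(needle, start=1):
        # Matching needle[:i] against an empty haystack costs i edits.
        current = [i]
        for j, h_char in enumerate(haystack, start=1):
            cost = 0 if n_char == h_char else 1
            current.append(min(
                previous[j] + 1,         # drop a needle character
                current[j - 1] + 1,      # absorb an extra haystack character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    # Modification 2: take the minimum of the last row, so a match may
    # end at any position in the haystack.
    return min(previous)


print(fuzzy_substring_distance("aba", "c abba c"))  # 1 ("abba" is one edit from "aba")
```

Keeping only two rows of the table holds memory use proportional to the haystack length, though the running time is still proportional to the product of the two lengths.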

Parsing multilingual email with Python

The email module in the Python standard library provides just about everything you need to parse multilingual emails with Python. There are a few traps, however, that can catch the unaware and unwary.
Parsing an email message
The email module provides a couple of handy functions for parsing email: message_from_string and message_from_file. Both functions return a Message […]
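
For readers who want something concrete, here is a minimal sketch of parsing a message with a non-ASCII subject using message_from_string. It assumes Python 3 and a made-up sample message; it is not taken from the post, and the decode_header/make_header step is just one common way to turn an encoded header into readable text.

```python
from email import message_from_string
from email.header import Header, decode_header, make_header

# Build a sample message whose Subject is an RFC 2047 encoded word.
# The addresses and text are invented for this example.
encoded_subject = Header("日本語のメール", "utf-8").encode()
raw = (
    "From: sender@example.com\n"
    f"Subject: {encoded_subject}\n"
    "\n"
    "Body text.\n"
)

msg = message_from_string(raw)

# With the default parser, msg["Subject"] is still in encoded form,
# so it has to be decoded explicitly before display.
subject = str(make_header(decode_header(msg["Subject"])))
print(subject)
```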

Fixing JIS mojibake with Python

JIS (iso-2022-jp) is a Japanese text encoding. The beginning and end of a JIS sequence are marked by escape sequences:
Beginning: 1B $ B (or 1B $ @)
Ending: 1B ( J (or 1B ( B)
This encoding is often used in email. Unfortunately, some email programs (and even mail routers) strip out the escape characters. This doesn't […]
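
To see the escape sequences in practice, the following sketch encodes a short string with Python's iso-2022-jp codec and then strips the escape bytes, the way a misbehaving mail program might. It only demonstrates the problem; it is not the fix described in the post, and the sample text is arbitrary.

```python
ESC = b"\x1b"

text = "日本語"
encoded = text.encode("iso-2022-jp")

# The double-byte run is bracketed by the escape sequences listed above.
starts_with_jis = encoded.startswith(ESC + b"$B")
ends_with_ascii = encoded.endswith(ESC + b"(B")
print(starts_with_jis, ends_with_ascii)

# Simulate a mailer that removes the ESC bytes. What is left is plain
# 7-bit data, so a decoder has no way to tell it is Japanese text.
stripped = encoded.replace(ESC, b"")
mojibake = stripped.decode("iso-2022-jp")
print(repr(mojibake))

# With the escape bytes intact, the round trip works as expected.
round_trip = encoded.decode("iso-2022-jp")
```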

Honyaku mailing-list code open sourced

I've open sourced the code for the honyaku mailing-list archive, and posted the code to Google Code (I named it ml-archive because my plan is to make it a generic mailing-list archive site). The site is written in Python, using the Django web framework. It's released under the MIT license.
One of the challenges I'm facing […]
