December 1, 2007
Fuzzy substring matching with Levenshtein distance in Python
Levenshtein distance is a well known technique for fuzzy string matching. With a couple of modifications, it's also possible to use Levenshtein distance to do fuzzy matching of substrings.
Let's take a simple example just to show what I mean.
needle: "aba"
haystack: "c abba c"
We can intuitively see that "aba" should match up against "abba." Here's a graph of the Levenshtein distance between these strings:
| c | a | b | b | a | c | ||||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| a | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | 7 |
| b | 2 | 2 | 2 | 3 | 2 | 3 | 4 | 5 | 6 |
| a | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 4 | 5 |
The Levenshtein distance is in the bottom-right corner: 5. That's how many edits you need to make in order to change "aba" into "c abba c." But we can see that the substring "abba" should match "aba" with just one edit.
You can kind of already see where the substrings should match up:
| c | a | b | b | a | c | ||||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| a | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | 7 |
| b | 2 | 2 | 2 | 3 | 2 | 3 | 4 | 5 | 6 |
| a | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 4 | 5 |
To get the substring fuzzy match, set the top row to all zeroes. This means in effect that we don't care how many characters in the haystack that we skip until we start matching.
Then on the bottom row, we take the lowest score on the row, instead of the bottom-right corner. This effectively means that we don't care how many characters in the haystack that we skip after we stop matching.
So what is the score after this change?
| c | a | b | b | a | c | ||||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| a | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| b | 2 | 2 | 2 | 1 | 0 | 1 | 1 | 1 | 2 |
| a | 3 | 3 | 3 | 2 | 1 | 1 | 1 | 2 | 2 |
We take the lowest score on the bottom row, so our score is 1, exactly what our intuitions tell us it should be.
Here's a python function that implements this algorithm:
"""Calculates the fuzzy match of needle in haystack,
using a modified version of the Levenshtein distance
algorithm.
The function is modified from the levenshtein function
in the bktree module by Adam Hupp"""
m, n = len(needle), len(haystack)
# base cases
if m == 1:
return not needle in haystack
if not n:
return m
row1 = [0] * (n+1)
for i in range(0,m):
row2 = [i+1]
for j in range(0,n):
cost = ( needle[i] != haystack[j] )
row2.append( min(row1[j+1]+1, # deletion
row2[j]+1, #insertion
row1[j]+cost) #substitution
)
row1 = row2
return min(row1)
Your method seems to discard the starting point of the match - how do you get that when you have the end point?
@Ben
The intuition is that since we’re looking for a fuzzy substring, we don’t care what comes before or after the string we’re matching. So given the string “bbb aca bbb” and the substring candidate “aaa,” we can match the “aca” of the first string against the “aaa,” ignoring the “bbb” that comes before and after.
The top row is zeroed out to say that we don’t care what comes before the string we’re matching, and we take the lowest score on the bottom row to say that we don’t care what comes after.
However, you’re right that it’s hard to backtrack and find the exact path we took (e.g. in order to mark up the differences); there will often be ambiguity about the proper path.