Fuzzy substring matching with Levenshtein distance in Python
Levenshtein distance is a well known technique for fuzzy string matching. With a couple of modifications, it's also possible to use Levenshtein distance to do fuzzy matching of substrings.
Let's take a simple example just to show what I mean.
haystack: "c abba c"
We can intuitively see that "aba" should match up against "abba." Here's a graph of the Levenshtein distance between these strings:
The Levenshtein distance is in the bottom-right corner: 5. That's how many edits you need to make in order to change "aba" into "c abba c." But we can see that the substring "abba" should match "aba" with just one edit.
You can kind of already see where the substrings should match up:
To get the substring fuzzy match, set the top row to all zeroes. This means in effect that we don't care how many characters in the haystack that we skip until we start matching.
Then on the bottom row, we take the lowest score on the row, instead of the bottom-right corner. This effectively means that we don't care how many characters in the haystack that we skip after we stop matching.
So what is the score after this change?
We take the lowest score on the bottom row, so our score is 1, exactly what our intuitions tell us it should be.
Here's a python function that implements this algorithm:
"""Calculates the fuzzy match of needle in haystack,
using a modified version of the Levenshtein distance
The function is modified from the levenshtein function
in the bktree module by Adam Hupp"""
m, n = len(needle), len(haystack)
# base cases
if m == 1:
return not needle in haystack
if not n:
row1 =  * (n+1)
for i in range(0,m):
row2 = [i+1]
for j in range(0,n):
cost = ( needle[i] != haystack[j] )
row2.append( min(row1[j+1]+1, # deletion
row1 = row2