## Fuzzy substring matching with Levenshtein distance in Python

Levenshtein distance is a well known technique for fuzzy string matching. With a couple of modifications, it's also possible to use Levenshtein distance to do fuzzy matching of substrings.

Let's take a simple example just to show what I mean.

**needle:** "aba"

**haystack:** "c abba c"

We can intuitively see that "aba" should match up against "abba." Here's a graph of the Levenshtein distance between these strings:

c | a | b | b | a | c | ||||
---|---|---|---|---|---|---|---|---|---|

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |

a | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | 7 |

b | 2 | 2 | 2 | 3 | 2 | 3 | 4 | 5 | 6 |

a | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 4 | 5 |

The Levenshtein distance is in the bottom-right corner: 5. That's how many edits you need to make in order to change "aba" into "c abba c." But we can see that the substring "abba" should match "aba" with just one edit.

You can kind of already see where the substrings should match up:

c | a | b | b | a | c | ||||
---|---|---|---|---|---|---|---|---|---|

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |

a | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | 7 |

b | 2 | 2 | 2 | 3 | 2 | 3 | 4 | 5 | 6 |

a | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 4 | 5 |

To get the substring fuzzy match, set the top row to all zeroes. This means in effect that we don't care how many characters in the haystack that we skip until we start matching.

Then on the bottom row, we take the lowest score on the row, instead of the bottom-right corner. This effectively means that we don't care how many characters in the haystack that we skip after we stop matching.

So what is the score after this change?

c | a | b | b | a | c | ||||
---|---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |

a | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |

b | 2 | 2 | 2 | 1 | 0 | 1 | 1 | 1 | 2 |

a | 3 | 3 | 3 | 2 | 1 | 1 | 1 | 2 | 2 |

We take the lowest score on the bottom row, so our score is 1, exactly what our intuitions tell us it should be.

Here's a python function that implements this algorithm:

"""Calculates the fuzzy match of needle in haystack,

using a modified version of the Levenshtein distance

algorithm.

The function is modified from the levenshtein function

in the bktree module by Adam Hupp"""

m, n = len(needle), len(haystack)

# base cases

if m == 1:

return not needle in haystack

if not n:

return m

row1 = [0] * (n+1)

for i in range(0,m):

row2 = [i+1]

for j in range(0,n):

cost = ( needle[i] != haystack[j] )

row2.append( min(row1[j+1]+1, # deletion

row2[j]+1, #insertion

row1[j]+cost) #substitution

)

row1 = row2

return min(row1)

Your method seems to discard the starting point of the match – how do you get that when you have the end point?

@Ben

The intuition is that since we’re looking for a fuzzy substring, we don’t care what comes before or after the string we’re matching. So given the string “bbb aca bbb” and the substring candidate “aaa,” we can match the “aca” of the first string against the “aaa,” ignoring the “bbb” that comes before and after.

The top row is zeroed out to say that we don’t care what comes before the string we’re matching, and we take the lowest score on the bottom row to say that we don’t care what comes after.

However, you’re right that it’s hard to backtrack and find the exact path we took (e.g. in order to mark up the differences); there will often be ambiguity about the proper path.

Tried a version in C++.. about 20 LOC, minus inclues/’using’ declarations:

Nice! I created a home for it on codepad:

http://codepad.org/GFfEVDHu

Thanks for your code. I wanted to get the the Similarity index and so updated the Levenshtein distance result like so (this is usually calculated based on the max length of the two strings):

@TomG — Thanks for the code, much appreciated.

[...] smart enough to figure it out or at least google it properly. Then, finally, I stumbled on this short, clear and helpful blog post that delivered me from ignorance. The solution was so simple and obvious, I’m surprised I [...]

I have been playing around with your modification to Levenshtein, and I have found some interesting results.

Most importantly: the original Levenshtein is bi-directional, meaning that Levenshtein(a,b) == Levenshtein(b,a)

However with your modification, this is no longer the case! In other words, FuzzyLevenshtein(a,b) != FuzzyLevenshtein(b,a)

From my experiments, it appears that FuzzyLevenshtein(a,b) can/will be a lower value when b (the target) has extra characters in its prefix/suffix.

When a (the source) has the extra characters, then the Fuzzy version is the same as the original.

Does this make sense to you? Do you agree? Thanks!

@McDougall

It makes sense that this is not bi-directional, because it’s for matching substrings. It should be bi-directional if both strings are of the same length, as you pointed out.

Hi, I am trying to translate this into a C# extension method. Its not quite there yet, I am getting an index out of range in the nested for loop

Hi Again, I have made some progress but I cant seem to get the cost scoring correct. do you have any ideas ?

@Derek

Here is my C++ code that implements fuzzy substring matching. This might be easier to translate.

I found this while looking for a way to do a fuzzy substring search. However I’m not only interested in the distance . I’m also interested in the position and length of the best candidate.

Is it somehow possible to get that ?

This is a neat way of doing it. Any ideas on how to tweak it to also get the match in the haystack? e.g., to get “abb”, “bba”, or “abba” when you search for “aba” in “c abba c”?

> Yati

You could cluster the lowest scores on the bottom row (adjusted for the length of the needle), and match them up with the haystack.

In the example, you have a cluster of three ones, so you could derive the haystack match of “bba.”