Fuzzy substring matching with Levenshtein distance in Python

Levenshtein distance is a well known technique for fuzzy string matching. With a couple of modifications, it's also possible to use Levenshtein distance to do fuzzy matching of substrings.

Let's take a simple example just to show what I mean.
needle: "aba"
haystack: "c abba c"

We can intuitively see that "aba" should match up against "abba." Here's a graph of the Levenshtein distance between these strings:

    c   a b b a   c
  0 1 2 3 4 5 6 7 8
a 1 1 2 2 3 4 5 6 7
b 2 2 2 3 2 3 4 5 6
a 3 3 3 2 3 3 3 4 5

The Levenshtein distance is in the bottom-right corner: 5. That's how many edits you need to make in order to change "aba" into "c abba c." But we can see that the substring "abba" should match "aba" with just one edit.

You can kind of already see where the substrings should match up:

    c   a b b a   c
  0 1 2 3 4 5 6 7 8
a 1 1 2 2 3 4 5 6 7
b 2 2 2 3 2 3 4 5 6
a 3 3 3 2 3 3 3 4 5

To get the substring fuzzy match, set the top row to all zeroes. This means in effect that we don't care how many characters in the haystack that we skip until we start matching.

Then on the bottom row, we take the lowest score on the row, instead of the bottom-right corner. This effectively means that we don't care how many characters in the haystack that we skip after we stop matching.

So what is the score after this change?

    c   a b b a   c
  0 0 0 0 0 0 0 0 0
a 1 1 1 0 1 1 0 1 1
b 2 2 2 1 0 1 1 1 2
a 3 3 3 2 1 1 1 2 2

We take the lowest score on the bottom row, so our score is 1, exactly what our intuitions tell us it should be.

Here's a python function that implements this algorithm:

def fuzzy_substring(needle, haystack):
    """Calculates the fuzzy match of needle in haystack,
    using a modified version of the Levenshtein distance
    algorithm.
    The function is modified from the levenshtein function
    in the bktree module by Adam Hupp"
""

    m, n = len(needle), len(haystack)

    # base cases
    if m == 1:
        return not needle in haystack
    if not n:
        return m

    row1 = [0] * (n+1)
    for i in range(0,m):
        row2 = [i+1]
        for j in range(0,n):
            cost = ( needle[i] != haystack[j] )

            row2.append( min(row1[j+1]+1, # deletion
                               row2[j]+1, #insertion
                               row1[j]+cost) #substitution
                           )
        row1 = row2
    return min(row1)

9 comments to Fuzzy substring matching with Levenshtein distance in Python

  • Ben

    Your method seems to discard the starting point of the match – how do you get that when you have the end point?

  • @Ben

    The intuition is that since we’re looking for a fuzzy substring, we don’t care what comes before or after the string we’re matching. So given the string “bbb aca bbb” and the substring candidate “aaa,” we can match the “aca” of the first string against the “aaa,” ignoring the “bbb” that comes before and after.

    The top row is zeroed out to say that we don’t care what comes before the string we’re matching, and we take the lowest score on the bottom row to say that we don’t care what comes after.

    However, you’re right that it’s hard to backtrack and find the exact path we took (e.g. in order to mark up the differences); there will often be ambiguity about the proper path.

  • Dan

    Tried a version in C++.. about 20 LOC, minus inclues/’using’ declarations:

    int fuzzy_substring(const string& needle,
                        const string& haystack) {
        const int nlen = needle.size(),
                  hlen = haystack.size();
        if (hlen == 0) return -1;
        if (nlen == 1) return haystack.find(needle);
        vector<int> row1(hlen+1, 0);
        for (int i = 0; i < nlen; ++i) {
            vector<int> row2(1, i+1);
            for (int j = 0; j < hlen; ++j) {
                const int cost = needle[i] != haystack[j];
                row2.push_back(std::min(row1[j+1]+1,
                                        std::min(row2[j]+1,
                                                 row1[j]+cost)));
            }
            row1.swap(row2);
        }
        return *std::min_element(row1.begin(), row1.end());
    }
  • TomG

    Thanks for your code. I wanted to get the the Similarity index and so updated the Levenshtein distance result like so (this is usually calculated based on the max length of the two strings):

            public static float GetSubStringSimilarity(string string1, string string2)
            {
                float dis = fuzzy_substring(string1, string2);
                float minLen = string1.Length;
                if (minLen > string2.Length)
                    minLen = string2.Length;
                if (minLen == 0.0F)
                    return 1.0F;
                else
                    return 1.0F - dis / minLen;
            }
    
  • @TomG — Thanks for the code, much appreciated.

  • [...] smart enough to figure it out or at least google it properly. Then, finally, I stumbled on this short, clear and helpful blog post that delivered me from ignorance. The solution was so simple and obvious, I’m surprised I [...]

  • McDougall

    I have been playing around with your modification to Levenshtein, and I have found some interesting results.

    Most importantly: the original Levenshtein is bi-directional, meaning that Levenshtein(a,b) == Levenshtein(b,a)

    However with your modification, this is no longer the case! In other words, FuzzyLevenshtein(a,b) != FuzzyLevenshtein(b,a)

    From my experiments, it appears that FuzzyLevenshtein(a,b) can/will be a lower value when b (the target) has extra characters in its prefix/suffix.
    When a (the source) has the extra characters, then the Fuzzy version is the same as the original.

    Does this make sense to you? Do you agree? Thanks!

  • @McDougall

    It makes sense that this is not bi-directional, because it’s for matching substrings. It should be bi-directional if both strings are of the same length, as you pointed out.

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>