Fuzzy substring matching with Levenshtein distance in Python

Levenshtein distance is a well known technique for fuzzy string matching. With a couple of modifications, it's also possible to use Levenshtein distance to do fuzzy matching of substrings.

Let's take a simple example just to show what I mean.
needle: "aba"
haystack: "c abba c"

We can intuitively see that "aba" should match up against "abba." Here's a graph of the Levenshtein distance between these strings:

    c   a b b a   c
  0 1 2 3 4 5 6 7 8
a 1 1 2 2 3 4 5 6 7
b 2 2 2 3 2 3 4 5 6
a 3 3 3 2 3 3 3 4 5

The Levenshtein distance is in the bottom-right corner: 5. That's how many edits you need to make in order to change "aba" into "c abba c." But we can see that the substring "abba" should match "aba" with just one edit.

You can kind of already see where the substrings should match up:

    c   a b b a   c
  0 1 2 3 4 5 6 7 8
a 1 1 2 2 3 4 5 6 7
b 2 2 2 3 2 3 4 5 6
a 3 3 3 2 3 3 3 4 5

To get the substring fuzzy match, set the top row to all zeroes. This means in effect that we don't care how many characters in the haystack that we skip until we start matching.

Then on the bottom row, we take the lowest score on the row, instead of the bottom-right corner. This effectively means that we don't care how many characters in the haystack that we skip after we stop matching.

So what is the score after this change?

    c   a b b a   c
  0 0 0 0 0 0 0 0 0
a 1 1 1 0 1 1 0 1 1
b 2 2 2 1 0 1 1 1 2
a 3 3 3 2 1 1 1 2 2

We take the lowest score on the bottom row, so our score is 1, exactly what our intuitions tell us it should be.

Here's a python function that implements this algorithm:

def fuzzy_substring(needle, haystack):
    """Calculates the fuzzy match of needle in haystack,
    using a modified version of the Levenshtein distance
    algorithm.
    The function is modified from the levenshtein function
    in the bktree module by Adam Hupp"
""

    m, n = len(needle), len(haystack)

    # base cases
    if m == 1:
        return not needle in haystack
    if not n:
        return m

    row1 = [0] * (n+1)
    for i in range(0,m):
        row2 = [i+1]
        for j in range(0,n):
            cost = ( needle[i] != haystack[j] )

            row2.append( min(row1[j+1]+1, # deletion
                               row2[j]+1, #insertion
                               row1[j]+cost) #substitution
                           )
        row1 = row2
    return min(row1)

15 comments to Fuzzy substring matching with Levenshtein distance in Python

  • Ben

    Your method seems to discard the starting point of the match – how do you get that when you have the end point?

  • @Ben

    The intuition is that since we’re looking for a fuzzy substring, we don’t care what comes before or after the string we’re matching. So given the string “bbb aca bbb” and the substring candidate “aaa,” we can match the “aca” of the first string against the “aaa,” ignoring the “bbb” that comes before and after.

    The top row is zeroed out to say that we don’t care what comes before the string we’re matching, and we take the lowest score on the bottom row to say that we don’t care what comes after.

    However, you’re right that it’s hard to backtrack and find the exact path we took (e.g. in order to mark up the differences); there will often be ambiguity about the proper path.

  • Dan

    Tried a version in C++.. about 20 LOC, minus inclues/’using’ declarations:

    int fuzzy_substring(const string& needle,
                        const string& haystack) {
        const int nlen = needle.size(),
                  hlen = haystack.size();
        if (hlen == 0) return -1;
        if (nlen == 1) return haystack.find(needle);
        vector<int> row1(hlen+1, 0);
        for (int i = 0; i < nlen; ++i) {
            vector<int> row2(1, i+1);
            for (int j = 0; j < hlen; ++j) {
                const int cost = needle[i] != haystack[j];
                row2.push_back(std::min(row1[j+1]+1,
                                        std::min(row2[j]+1,
                                                 row1[j]+cost)));
            }
            row1.swap(row2);
        }
        return *std::min_element(row1.begin(), row1.end());
    }
  • TomG

    Thanks for your code. I wanted to get the the Similarity index and so updated the Levenshtein distance result like so (this is usually calculated based on the max length of the two strings):

            public static float GetSubStringSimilarity(string string1, string string2)
            {
                float dis = fuzzy_substring(string1, string2);
                float minLen = string1.Length;
                if (minLen > string2.Length)
                    minLen = string2.Length;
                if (minLen == 0.0F)
                    return 1.0F;
                else
                    return 1.0F - dis / minLen;
            }
    
  • @TomG — Thanks for the code, much appreciated.

  • [...] smart enough to figure it out or at least google it properly. Then, finally, I stumbled on this short, clear and helpful blog post that delivered me from ignorance. The solution was so simple and obvious, I’m surprised I [...]

  • McDougall

    I have been playing around with your modification to Levenshtein, and I have found some interesting results.

    Most importantly: the original Levenshtein is bi-directional, meaning that Levenshtein(a,b) == Levenshtein(b,a)

    However with your modification, this is no longer the case! In other words, FuzzyLevenshtein(a,b) != FuzzyLevenshtein(b,a)

    From my experiments, it appears that FuzzyLevenshtein(a,b) can/will be a lower value when b (the target) has extra characters in its prefix/suffix.
    When a (the source) has the extra characters, then the Fuzzy version is the same as the original.

    Does this make sense to you? Do you agree? Thanks!

  • @McDougall

    It makes sense that this is not bi-directional, because it’s for matching substrings. It should be bi-directional if both strings are of the same length, as you pointed out.

  • Derek

    Hi, I am trying to translate this into a C# extension method. Its not quite there yet, I am getting an index out of range in the nested for loop

            public static int FuzzySubstring(this string needle, string haystack, IUnityContainer container)
            {
                int result = -1;
                try
                {
                    int nlen = needle.Length;
                    int hlen = haystack.Length;
                    if (hlen == 0)
                        return -1;
                    if (nlen == 1)
                        return haystack.IndexOf(needle);
    
                    var row1 = new List(hlen + 1);
    
                    for (int i = 0; i < nlen; ++i)
                    {
                        var row2 = new List(i + 1);
                        for (int j = 0; j < hlen; ++j)
                        {
                            int cost = needle[i] != haystack[j] ? +1 : 0;
                            var one = row1[j + 1] + 1;
                            var two =  Math.Min(row2[j] + 1, row1[j] + cost);
                            var three= Math.Min(one,two);
    
                            row2.Add(three);
                        }
                        SwapLists(row1,row2);
                    }
                    result = row1.Min();
                }
                catch (Exception exception)
                {
                    container.Resolve().Error(exception.Message);
                    throw;
                }
                return result; 
            }
    
            static void SwapLists(List list1, List list2)
            {
                List temp = new List(list1);
                list1.Clear();
                list1.AddRange(list2);
                list2.Clear();
                list2.AddRange(temp);
            }
    
  • Derek

    Hi Again, I have made some progress but I cant seem to get the cost scoring correct. do you have any ideas ?

     public static int FuzzySubstring(this string needle, string haystack, IUnityContainer container)
            {
                int result = -1;
                try
                {
                    int nlen = needle.Length;
                    int hlen = haystack.Length;
                    if (hlen == 0)
                        return -1;
                    if (nlen == 1)
                        return haystack.IndexOf(needle);
    
                    var row1 = new List(new int[hlen + 1]);
    
                    for (int i = 0; i < nlen; ++i)
                    {
                        var row2 = new List(new int[i + 1]);
                        for (int j = 0; j < hlen; ++j)
                        {
                            int cost = needle[i] != haystack[j] ? 1 : 0;
                            row2.Add(Math.Min(row1[j + 1] + 1, Math.Min(row2[j] + 1, row1[j] + cost)));
                        }
                        SwapLists(row1,row2);
                    }
                    result = row1.Min();
                }
                catch (Exception exception)
                {
                    container.Resolve().Error(exception.Message);
                    throw;
                }
                return result; 
            }
    
            static void SwapLists(List list1, List list2)
            {
                List temp = new List(list1);
                list1.Clear();
                list1.AddRange(list2);
                list2.Clear();
                list2.AddRange(temp);
            }
    
  • @Derek

    Here is my C++ code that implements fuzzy substring matching. This might be easier to translate.

    class Distance
    {
      // various other stuff..
    
      // get fuzzy substring distance
      size_t subdist(const wstring &needle, const wstring &haystack)
      {
    	// prep
    	const wchar_t* needle_str = needle.c_str() ; 
    	size_t needle_len = needle.size() ; 
    	const wchar_t *haystack_str = haystack.c_str() ;
    	size_t haystack_len = haystack.size() ;
    
    	// ensure our static rows are large enough
    	ensure_size(haystack_len+1) ;
    
    	// init first row
    	std::fill(row1, row1+haystack_len+1, 0) ;
    
    	size_t cost = 0;
    
    	// Fill the matrix costs
    	for (size_t i = 0; i < needle_len; ++i)
    	{
    		row2[0] = i+1;
    
    		for (size_t j = 0; j < haystack_len; ++j)
    		{
    			cost = 1;
    			if (needle_str[i] == haystack_str[j])
    			{
    				cost = 0;
    			}
    
    			row2[j+1] = min3(row1[j+1]+1, //  deletion
    							 row2[j]+1, // insertion
    							 row1[j]+cost) // substitution
    							;
    		}
    		// row1 = row2
    		std::swap(row1, row2) ;
    	}
    
    	// return lowest cost on bottom row
    	return *std::min_element(row1, row1+haystack_len+1) ;
      }
    
      // make sure our rows are at least large enough to hold needle & haystack
      void ensure_size(size_t min_row_size)
      {
    	if (m_row_size < min_row_size)
    	{
    		if (row1)
    		{
    			free(row1) ;
    		}
    		if (row2)
    		{
    			free(row2) ;
    		}
    		m_row_size = min_row_size;
    		row1 = (size_t*)calloc(m_row_size, sizeof(size_t));
    		row2 = (size_t*)calloc(m_row_size, sizeof(size_t));
    		if (!row1 || !row2) // Allocation failed
    		{
    			throw std::bad_alloc("Failed to allocated memory for Distance test") ;
    		}
    	}
      }
      // Convenience function to get min of 3 values
      size_t min3( size_t a, size_t b, size_t c ) const 
      {
    	return std::min(std::min(a, b), c) ;
      }
    }
    
  • Dirk Reimann

    I found this while looking for a way to do a fuzzy substring search. However I’m not only interested in the distance . I’m also interested in the position and length of the best candidate.
    Is it somehow possible to get that ?

  • Yati Sagade

    This is a neat way of doing it. Any ideas on how to tweak it to also get the match in the haystack? e.g., to get “abb”, “bba”, or “abba” when you search for “aba” in “c abba c”?

  • > Yati

    You could cluster the lowest scores on the bottom row (adjusted for the length of the needle), and match them up with the haystack.

    In the example, you have a cluster of three ones, so you could derive the haystack match of “bba.”

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>