<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Fuzzy substring matching with Levenshtein distance in Python</title>
	<atom:link href="http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/feed/" rel="self" type="application/rss+xml" />
	<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/</link>
	<description>Random scribbling about programming, translation, and Japan</description>
	<lastBuildDate>Thu, 10 May 2012 05:41:14 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-204092</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Sun, 19 Feb 2012 21:54:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-204092</guid>
		<description>@Derek

Here is my C++ code that implements fuzzy substring matching. This might be easier to translate.

&lt;pre lang=&quot;C++&quot;&gt;
class Distance
{
  // various other stuff..

  // get fuzzy substring distance
  size_t subdist(const wstring &amp;needle, const wstring &amp;haystack)
  {
	// prep
	const wchar_t* needle_str = needle.c_str() ; 
	size_t needle_len = needle.size() ; 
	const wchar_t *haystack_str = haystack.c_str() ;
	size_t haystack_len = haystack.size() ;

	// ensure our static rows are large enough
	ensure_size(haystack_len+1) ;

	// init first row
	std::fill(row1, row1+haystack_len+1, 0) ;

	size_t cost = 0;

	// Fill the matrix costs
	for (size_t i = 0; i &lt; needle_len; ++i)
	{
		row2[0] = i+1;

		for (size_t j = 0; j &lt; haystack_len; ++j)
		{
			cost = 1;
			if (needle_str[i] == haystack_str[j])
			{
				cost = 0;
			}

			row2[j+1] = min3(row1[j+1]+1, //  deletion
							 row2[j]+1, // insertion
							 row1[j]+cost) // substitution
							;
		}
		// row1 = row2
		std::swap(row1, row2) ;
	}

	// return lowest cost on bottom row
	return *std::min_element(row1, row1+haystack_len+1) ;
  }

  // make sure our rows are at least large enough to hold needle &amp; haystack
  void ensure_size(size_t min_row_size)
  {
	if (m_row_size &lt; min_row_size)
	{
		if (row1)
		{
			free(row1) ;
		}
		if (row2)
		{
			free(row2) ;
		}
		m_row_size = min_row_size;
		row1 = (size_t*)calloc(m_row_size, sizeof(size_t));
		row2 = (size_t*)calloc(m_row_size, sizeof(size_t));
		if (!row1 &#124;&#124; !row2) // Allocation failed
		{
			throw std::bad_alloc() ; // std::bad_alloc takes no message argument in standard C++
		}
	}
  }
  // Convenience function to get min of 3 values
  size_t min3( size_t a, size_t b, size_t c ) const 
  {
	return std::min(std::min(a, b), c) ;
  }
};
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>@Derek</p>
<p>Here is my C++ code that implements fuzzy substring matching. This might be easier to translate.</p>
<pre lang="C++">
class Distance
{
  // various other stuff..

  // get fuzzy substring distance
  size_t subdist(const wstring &#038;needle, const wstring &#038;haystack)
  {
	// prep
	const wchar_t* needle_str = needle.c_str() ;
	size_t needle_len = needle.size() ;
	const wchar_t *haystack_str = haystack.c_str() ;
	size_t haystack_len = haystack.size() ;

	// ensure our static rows are large enough
	ensure_size(haystack_len+1) ;

	// init first row
	std::fill(row1, row1+haystack_len+1, 0) ;

	size_t cost = 0;

	// Fill the matrix costs
	for (size_t i = 0; i < needle_len; ++i)
	{
		row2[0] = i+1;

		for (size_t j = 0; j < haystack_len; ++j)
		{
			cost = 1;
			if (needle_str[i] == haystack_str[j])
			{
				cost = 0;
			}

			row2[j+1] = min3(row1[j+1]+1, //  deletion
							 row2[j]+1, // insertion
							 row1[j]+cost) // substitution
							;
		}
		// row1 = row2
		std::swap(row1, row2) ;
	}

	// return lowest cost on bottom row
	return *std::min_element(row1, row1+haystack_len+1) ;
  }

  // make sure our rows are at least large enough to hold needle &amp; haystack
  void ensure_size(size_t min_row_size)
  {
	if (m_row_size < min_row_size)
	{
		if (row1)
		{
			free(row1) ;
		}
		if (row2)
		{
			free(row2) ;
		}
		m_row_size = min_row_size;
		row1 = (size_t*)calloc(m_row_size, sizeof(size_t));
		row2 = (size_t*)calloc(m_row_size, sizeof(size_t));
		if (!row1 || !row2) // Allocation failed
		{
			throw std::bad_alloc() ; // std::bad_alloc takes no message argument in standard C++
		}
	}
  }
  // Convenience function to get min of 3 values
  size_t min3( size_t a, size_t b, size_t c ) const
  {
	return std::min(std::min(a, b), c) ;
  }
};
</pre>
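<p>For readers who would rather start from Python, the two-row idea in the C++ above can be sketched as follows. This is a minimal sketch and not the original implementation: the name fuzzy_substring_distance is made up for illustration, and it builds a fresh list per row instead of reusing preallocated buffers. Note that the first cell of each new row is the single value i + 1.</p>

```python
def fuzzy_substring_distance(needle, haystack):
    """Smallest edit distance between needle and any substring of haystack.

    Illustrative name; a sketch of the two-row approach, not the C++ class above.
    """
    # The first row is all zeros: a match may start at any haystack position.
    previous = [0] * (len(haystack) + 1)
    for i, needle_char in enumerate(needle):
        # The first cell holds the single value i + 1.
        current = [i + 1]
        for j, haystack_char in enumerate(haystack):
            cost = 0 if needle_char == haystack_char else 1
            current.append(min(previous[j + 1] + 1,   # deletion
                               current[j] + 1,        # insertion
                               previous[j] + cost))   # substitution
        previous = current
    # A match may also end anywhere, so take the lowest cost on the last row.
    return min(previous)


print(fuzzy_substring_distance("abc", "xxabcxx"))  # 0
print(fuzzy_substring_distance("abd", "xxabcxx"))  # 1
```

<p>Each row is built from one starting value plus one appended cell per haystack character, so every row ends up with len(haystack) + 1 entries, the same size the C++ version reserves with ensure_size.</p>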
]]></content:encoded>
	</item>
	<item>
		<title>By: Derek</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-204085</link>
		<dc:creator>Derek</dc:creator>
		<pubDate>Sun, 19 Feb 2012 21:25:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-204085</guid>
		<description>Hi again, I have made some progress, but I can&#039;t seem to get the cost scoring correct. Do you have any ideas?

&lt;pre&gt;
 public static int FuzzySubstring(this string needle, string haystack, IUnityContainer container)
        {
            int result = -1;
            try
            {
                int nlen = needle.Length;
                int hlen = haystack.Length;
                if (hlen == 0)
                    return -1;
                if (nlen == 1)
                    return haystack.IndexOf(needle);

                var row1 = new List&lt;int&gt;(new int[hlen + 1]);

                for (int i = 0; i &lt; nlen; ++i)
                {
                    var row2 = new List&lt;int&gt;(new int[i + 1]);
                    for (int j = 0; j &lt; hlen; ++j)
                    {
                        int cost = needle[i] != haystack[j] ? 1 : 0;
                        row2.Add(Math.Min(row1[j + 1] + 1, Math.Min(row2[j] + 1, row1[j] + cost)));
                    }
                    SwapLists(row1,row2);
                }
                result = row1.Min();
            }
            catch (Exception exception)
            {
                container.Resolve().Error(exception.Message);
                throw;
            }
            return result; 
        }

        static void SwapLists(List&lt;int&gt; list1, List&lt;int&gt; list2)
        {
            List&lt;int&gt; temp = new List&lt;int&gt;(list1);
            list1.Clear();
            list1.AddRange(list2);
            list2.Clear();
            list2.AddRange(temp);
        }
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>Hi again, I have made some progress, but I can&#8217;t seem to get the cost scoring correct. Do you have any ideas?</p>
<pre>
 public static int FuzzySubstring(this string needle, string haystack, IUnityContainer container)
        {
            int result = -1;
            try
            {
                int nlen = needle.Length;
                int hlen = haystack.Length;
                if (hlen == 0)
                    return -1;
                if (nlen == 1)
                    return haystack.IndexOf(needle);

                var row1 = new List&lt;int&gt;(new int[hlen + 1]);

                for (int i = 0; i &lt; nlen; ++i)
                {
                    var row2 = new List&lt;int&gt;(new int[i + 1]);
                    for (int j = 0; j &lt; hlen; ++j)
                    {
                        int cost = needle[i] != haystack[j] ? 1 : 0;
                        row2.Add(Math.Min(row1[j + 1] + 1, Math.Min(row2[j] + 1, row1[j] + cost)));
                    }
                    SwapLists(row1,row2);
                }
                result = row1.Min();
            }
            catch (Exception exception)
            {
                container.Resolve().Error(exception.Message);
                throw;
            }
            return result;
        }

        static void SwapLists(List&lt;int&gt; list1, List&lt;int&gt; list2)
        {
            List&lt;int&gt; temp = new List&lt;int&gt;(list1);
            list1.Clear();
            list1.AddRange(list2);
            list2.Clear();
            list2.AddRange(temp);
        }
</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Derek</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-204051</link>
		<dc:creator>Derek</dc:creator>
		<pubDate>Sun, 19 Feb 2012 19:12:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-204051</guid>
		<description>
Hi, I am trying to translate this into a C# extension method. It&#039;s not quite there yet: I am getting an index out of range error in the nested for loop.

&lt;pre&gt;
        public static int FuzzySubstring(this string needle, string haystack, IUnityContainer container)
        {
            int result = -1;
            try
            {
                int nlen = needle.Length;
                int hlen = haystack.Length;
                if (hlen == 0)
                    return -1;
                if (nlen == 1)
                    return haystack.IndexOf(needle);

                var row1 = new List&lt;int&gt;(hlen + 1);

                for (int i = 0; i &lt; nlen; ++i)
                {
                    var row2 = new List&lt;int&gt;(i + 1);
                    for (int j = 0; j &lt; hlen; ++j)
                    {
                        int cost = needle[i] != haystack[j] ? +1 : 0;
                        var one = row1[j + 1] + 1;
                        var two =  Math.Min(row2[j] + 1, row1[j] + cost);
                        var three= Math.Min(one,two);

                        row2.Add(three);
                    }
                    SwapLists(row1,row2);
                }
                result = row1.Min();
            }
            catch (Exception exception)
            {
                container.Resolve().Error(exception.Message);
                throw;
            }
            return result; 
        }

        static void SwapLists(List&lt;int&gt; list1, List&lt;int&gt; list2)
        {
            List&lt;int&gt; temp = new List&lt;int&gt;(list1);
            list1.Clear();
            list1.AddRange(list2);
            list2.Clear();
            list2.AddRange(temp);
        }
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>Hi, I am trying to translate this into a C# extension method. It&#8217;s not quite there yet: I am getting an index out of range error in the nested for loop.</p>
<pre>
        public static int FuzzySubstring(this string needle, string haystack, IUnityContainer container)
        {
            int result = -1;
            try
            {
                int nlen = needle.Length;
                int hlen = haystack.Length;
                if (hlen == 0)
                    return -1;
                if (nlen == 1)
                    return haystack.IndexOf(needle);

                var row1 = new List&lt;int&gt;(hlen + 1);

                for (int i = 0; i &lt; nlen; ++i)
                {
                    var row2 = new List&lt;int&gt;(i + 1);
                    for (int j = 0; j &lt; hlen; ++j)
                    {
                        int cost = needle[i] != haystack[j] ? +1 : 0;
                        var one = row1[j + 1] + 1;
                        var two =  Math.Min(row2[j] + 1, row1[j] + cost);
                        var three= Math.Min(one,two);

                        row2.Add(three);
                    }
                    SwapLists(row1,row2);
                }
                result = row1.Min();
            }
            catch (Exception exception)
            {
                container.Resolve().Error(exception.Message);
                throw;
            }
            return result;
        }

        static void SwapLists(List&lt;int&gt; list1, List&lt;int&gt; list2)
        {
            List&lt;int&gt; temp = new List&lt;int&gt;(list1);
            list1.Clear();
            list1.AddRange(list2);
            list2.Clear();
            list2.AddRange(temp);
        }
</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-176777</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Sat, 07 Jan 2012 17:27:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-176777</guid>
		<description>@McDougall

It makes sense that this is not bi-directional, because it&#039;s for matching substrings. It should be bi-directional if both strings are of the same length, as you pointed out.</description>
		<content:encoded><![CDATA[<p>@McDougall</p>
<p>It makes sense that this is not bi-directional, because it&#8217;s for matching substrings. It should be bi-directional if both strings are of the same length, as you pointed out.</p>
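<p>A small Python check can make the asymmetry concrete. This is only a sketch: edit_rows, levenshtein and fuzzy_levenshtein are illustrative names, not code from the post, and the only difference between the two distances is how the first row is initialised and which cell of the last row is read.</p>

```python
def edit_rows(source, target, free_ends):
    """Return the last row of the edit-distance table (illustrative helper).

    With free_ends=True the first row is all zeros, so a match may start
    anywhere in target; otherwise the first row is 0, 1, 2, ... as in the
    classic Levenshtein table.
    """
    if free_ends:
        previous = [0] * (len(target) + 1)
    else:
        previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source):
        current = [i + 1]
        for j, target_char in enumerate(target):
            cost = 0 if source_char == target_char else 1
            current.append(min(previous[j + 1] + 1,
                               current[j] + 1,
                               previous[j] + cost))
        previous = current
    return previous


def levenshtein(a, b):
    # Classic distance: read the bottom-right cell.
    return edit_rows(a, b, free_ends=False)[-1]


def fuzzy_levenshtein(a, b):
    # Substring distance: a match may end anywhere, so take the row minimum.
    return min(edit_rows(a, b, free_ends=True))


print(levenshtein("abc", "xxabcxx"), levenshtein("xxabcxx", "abc"))              # 4 4
print(fuzzy_levenshtein("abc", "xxabcxx"), fuzzy_levenshtein("xxabcxx", "abc"))  # 0 4
```

<p>Extra characters around the match are free only when they sit in the second argument. When the longer string is passed first, the fuzzy result here equals the plain Levenshtein distance.</p>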
]]></content:encoded>
	</item>
	<item>
		<title>By: McDougall</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-176182</link>
		<dc:creator>McDougall</dc:creator>
		<pubDate>Fri, 06 Jan 2012 21:22:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-176182</guid>
		<description>I have been playing around with your modification to Levenshtein, and I have found some interesting results.

Most importantly: the original Levenshtein is bi-directional, meaning that Levenshtein(a,b) == Levenshtein(b,a)

However with your modification, this is no longer the case! In other words,  FuzzyLevenshtein(a,b) != FuzzyLevenshtein(b,a)

From my experiments, it appears that FuzzyLevenshtein(a,b) can/will be a lower value when b (the target) has extra characters in its prefix/suffix. 
When a (the source) has the extra characters, then the Fuzzy version is the same as the original.

Does this make sense to you? Do you agree? Thanks!</description>
		<content:encoded><![CDATA[<p>I have been playing around with your modification to Levenshtein, and I have found some interesting results.</p>
<p>Most importantly: the original Levenshtein is bi-directional, meaning that Levenshtein(a,b) == Levenshtein(b,a)</p>
<p>However with your modification, this is no longer the case! In other words,  FuzzyLevenshtein(a,b) != FuzzyLevenshtein(b,a)</p>
<p>From my experiments, it appears that FuzzyLevenshtein(a,b) can/will be a lower value when b (the target) has extra characters in its prefix/suffix.<br />
When a (the source) has the extra characters, then the Fuzzy version is the same as the original.</p>
<p>Does this make sense to you? Do you agree? Thanks!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fuzzy substring searching &#171; codehost</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-108937</link>
		<dc:creator>Fuzzy substring searching &#171; codehost</dc:creator>
		<pubDate>Tue, 13 Sep 2011 00:08:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-108937</guid>
		<description>[...] smart enough to figure it out or at least google it properly. Then, finally, I stumbled on this short, clear and helpful blog post that delivered me from ignorance. The solution was so simple and obvious, I&#8217;m surprised I [...]</description>
		<content:encoded><![CDATA[<p>[...] smart enough to figure it out or at least google it properly. Then, finally, I stumbled on this short, clear and helpful blog post that delivered me from ignorance. The solution was so simple and obvious, I&#8217;m surprised I [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-65221</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Wed, 06 Apr 2011 01:31:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-65221</guid>
		<description>@TomG -- Thanks for the code, much appreciated.</description>
		<content:encoded><![CDATA[<p>@TomG &#8212; Thanks for the code, much appreciated.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: TomG</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-65207</link>
		<dc:creator>TomG</dc:creator>
		<pubDate>Tue, 05 Apr 2011 22:51:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-65207</guid>
		<description>Thanks for your code. I wanted to get the similarity index, so I updated the Levenshtein distance result like so (this is usually calculated based on the max length of the two strings):

&lt;pre&gt;
        public static float GetSubStringSimilarity(string string1, string string2)
        {
            float dis = fuzzy_substring(string1, string2);
            float minLen = string1.Length;
            if (minLen &gt; string2.Length)
                minLen = string2.Length;
            if (minLen == 0.0F)
                return 1.0F;
            else
                return 1.0F - dis / minLen;
        }
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>Thanks for your code. I wanted to get the similarity index, so I updated the Levenshtein distance result like so (this is usually calculated based on the max length of the two strings):</p>
<pre>
        public static float GetSubStringSimilarity(string string1, string string2)
        {
            float dis = fuzzy_substring(string1, string2);
            float minLen = string1.Length;
            if (minLen &gt; string2.Length)
                minLen = string2.Length;
            if (minLen == 0.0F)
                return 1.0F;
            else
                return 1.0F - dis / minLen;
        }
</pre>
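<p>The same normalisation can be sketched in Python. This is a rough port under stated assumptions: fuzzy_substring here is a stand-in written for this example (it takes the needle first, then the haystack), and substring_similarity is an illustrative name for the method above.</p>

```python
def fuzzy_substring(needle, haystack):
    """Stand-in for the substring distance; written for this example."""
    previous = [0] * (len(haystack) + 1)
    for i, needle_char in enumerate(needle):
        current = [i + 1]
        for j, haystack_char in enumerate(haystack):
            cost = 0 if needle_char == haystack_char else 1
            current.append(min(previous[j + 1] + 1,
                               current[j] + 1,
                               previous[j] + cost))
        previous = current
    return min(previous)


def substring_similarity(string1, string2):
    """Map the distance to a score, dividing by the shorter length."""
    distance = fuzzy_substring(string1, string2)
    min_len = min(len(string1), len(string2))
    if min_len == 0:
        return 1.0
    return 1.0 - distance / min_len


print(substring_similarity("abc", "xxabcxx"))            # 1.0
print(round(substring_similarity("abd", "xxabcxx"), 2))  # 0.67
print(round(substring_similarity("xxabcxx", "abc"), 2))  # -0.33
```

<p>The last call shows that the score can drop below zero when the first argument is the longer string, because the distance can then exceed the shorter length. Dividing by the length of the first argument would keep the score between 0 and 1, since the distance never exceeds the needle length. Which divisor fits best depends on the use case.</p>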
]]></content:encoded>
	</item>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-104</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Thu, 23 Oct 2008 01:00:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-104</guid>
		<description>Nice! I created a home for it on codepad:

&lt;a href=&quot;http://codepad.org/GFfEVDHu&quot;&gt;http://codepad.org/GFfEVDHu&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>Nice! I created a home for it on codepad:</p>
<p><a href="http://codepad.org/GFfEVDHu">http://codepad.org/GFfEVDHu</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-103</link>
		<dc:creator>Dan</dc:creator>
		<pubDate>Wed, 22 Oct 2008 20:19:15 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-103</guid>
		<description>Tried a version in C++, about 20 LOC, minus includes/&#039;using&#039; declarations:

&lt;pre&gt;
int fuzzy_substring(const string&amp; needle,
                    const string&amp; haystack) {
    const int nlen = needle.size(),
              hlen = haystack.size();
    if (hlen == 0) return nlen;  // every needle character is unmatched
    if (nlen == 1) return haystack.find(needle) == string::npos;  // 0 if found, else 1
    vector&lt;int&gt; row1(hlen+1, 0);
    for (int i = 0; i &lt; nlen; ++i) {
        vector&lt;int&gt; row2(1, i+1);
        for (int j = 0; j &lt; hlen; ++j) {
            const int cost = needle[i] != haystack[j];
            row2.push_back(std::min(row1[j+1]+1,
                                    std::min(row2[j]+1,
                                             row1[j]+cost)));
        }
        row1.swap(row2);
    }
    return *std::min_element(row1.begin(), row1.end());
}&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>Tried a version in C++, about 20 LOC, minus includes/&#8217;using&#8217; declarations:</p>
<pre>
int fuzzy_substring(const string&amp; needle,
                    const string&amp; haystack) {
    const int nlen = needle.size(),
              hlen = haystack.size();
    if (hlen == 0) return nlen;  // every needle character is unmatched
    if (nlen == 1) return haystack.find(needle) == string::npos;  // 0 if found, else 1
    vector&lt;int&gt; row1(hlen+1, 0);
    for (int i = 0; i &lt; nlen; ++i) {
        vector&lt;int&gt; row2(1, i+1);
        for (int j = 0; j &lt; hlen; ++j) {
            const int cost = needle[i] != haystack[j];
            row2.push_back(std::min(row1[j+1]+1,
                                    std::min(row2[j]+1,
                                             row1[j]+cost)));
        }
        row1.swap(row2);
    }
    return *std::min_element(row1.begin(), row1.end());
}</pre>
]]></content:encoded>
	</item>
</channel>
</rss>

