<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Fuzzy substring matching with Levenshtein distance in Python</title>
	<atom:link href="http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/feed/" rel="self" type="application/rss+xml" />
	<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/</link>
	<description>Random scribbling about programming, translation, and Japan</description>
	<lastBuildDate>Fri, 03 Feb 2012 09:05:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-176777</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Sat, 07 Jan 2012 17:27:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-176777</guid>
		<description>@McDougall

It makes sense that this is not bi-directional, because it&#039;s for matching substrings. It should be bi-directional if both strings are of the same length, as you pointed out.</description>
		<content:encoded><![CDATA[<p>@McDougall</p>
<p>It makes sense that this is not bi-directional, because it&#8217;s for matching substrings. It should be bi-directional if both strings are of the same length, as you pointed out.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: McDougall</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-176182</link>
		<dc:creator>McDougall</dc:creator>
		<pubDate>Fri, 06 Jan 2012 21:22:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-176182</guid>
		<description>I have been playing around with your modification to Levenshtein, and I have found some interesting results.

Most importantly: the original Levenshtein is bi-directional, meaning that Levenshtein(a,b) == Levenshtein(b,a)

However with your modification, this is no longer the case! In other words,  FuzzyLevenshtein(a,b) != FuzzyLevenshtein(b,a)

From my experiments, it appears that FuzzyLevenshtein(a,b) can be lower than Levenshtein(a,b) when b (the target) has extra characters in its prefix/suffix.
When a (the source) has the extra characters, then the Fuzzy version is the same as the original.

Does this make sense to you? Do you agree? Thanks!</description>
		<content:encoded><![CDATA[<p>I have been playing around with your modification to Levenshtein, and I have found some interesting results.</p>
<p>Most importantly: the original Levenshtein is bi-directional, meaning that Levenshtein(a,b) == Levenshtein(b,a)</p>
<p>However with your modification, this is no longer the case! In other words,  FuzzyLevenshtein(a,b) != FuzzyLevenshtein(b,a)</p>
<p>From my experiments, it appears that FuzzyLevenshtein(a,b) can be lower than Levenshtein(a,b) when b (the target) has extra characters in its prefix/suffix.<br />
When a (the source) has the extra characters, then the Fuzzy version is the same as the original.</p>
<p>Does this make sense to you? Do you agree? Thanks!</p>
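<p>A minimal Python sketch of the asymmetry described above. The function name edit_distance and its free_ends flag are hypothetical, chosen for illustration, and are not taken from the original post: with free_ends off it computes the classic Levenshtein distance, and with free_ends on it zeroes the top row and takes the lowest value on the bottom row, so unmatched characters before and after the match in the target cost nothing.</p>

```python
def edit_distance(source, target, free_ends=False):
    """Edit distance from source to target.

    free_ends=False: classic Levenshtein distance.
    free_ends=True:  fuzzy substring variant (hypothetical flag); extra
                     characters around the match in target are free.
    """
    # Top row: zeros when a match may start anywhere in target.
    prev = [0 if free_ends else j for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source):
        cur = [i + 1]
        for j, t_ch in enumerate(target):
            cost = int(s_ch != t_ch)
            cur.append(min(prev[j + 1] + 1,   # drop a source character
                           cur[j] + 1,        # absorb a target character
                           prev[j] + cost))   # match or substitute
        prev = cur
    # Bottom row: lowest value when a match may end anywhere in target.
    return min(prev) if free_ends else prev[-1]


a, b = "abc", "xxabcxx"
print(edit_distance(a, b), edit_distance(b, a))              # 4 4
print(edit_distance(a, b, True), edit_distance(b, a, True))  # 0 4
```

<p>In this example only the target gets the free prefix and suffix, so swapping the arguments changes the fuzzy result while the classic distance stays the same.</p>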
]]></content:encoded>
	</item>
	<item>
		<title>By: Fuzzy substring searching &#171; codehost</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-108937</link>
		<dc:creator>Fuzzy substring searching &#171; codehost</dc:creator>
		<pubDate>Tue, 13 Sep 2011 00:08:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-108937</guid>
		<description>[...] smart enough to figure it out or at least google it properly. Then, finally, I stumbled on this short, clear and helpful blog post that delivered me from ignorance. The solution was so simple and obvious, I&#8217;m surprised I [...]</description>
		<content:encoded><![CDATA[<p>[...] smart enough to figure it out or at least google it properly. Then, finally, I stumbled on this short, clear and helpful blog post that delivered me from ignorance. The solution was so simple and obvious, I&#8217;m surprised I [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-65221</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Wed, 06 Apr 2011 01:31:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-65221</guid>
		<description>@TomG -- Thanks for the code, much appreciated.</description>
		<content:encoded><![CDATA[<p>@TomG &#8212; Thanks for the code, much appreciated.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: TomG</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-65207</link>
		<dc:creator>TomG</dc:creator>
		<pubDate>Tue, 05 Apr 2011 22:51:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-65207</guid>
		<description>Thanks for your code. I wanted to get the similarity index and so updated the Levenshtein distance result like so (this is usually calculated based on the max length of the two strings):

&lt;pre&gt;
        public static float GetSubStringSimilarity(string string1, string string2)
        {
            float dis = fuzzy_substring(string1, string2);
            float minLen = string1.Length;
            if (minLen &gt; string2.Length)
                minLen = string2.Length;
            if (minLen == 0.0F)
                return 1.0F;
            else
                return 1.0F - dis / minLen;
        }
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>Thanks for your code. I wanted to get the similarity index and so updated the Levenshtein distance result like so (this is usually calculated based on the max length of the two strings):</p>
<pre>
        public static float GetSubStringSimilarity(string string1, string string2)
        {
            float dis = fuzzy_substring(string1, string2);
            float minLen = string1.Length;
            if (minLen &gt; string2.Length)
                minLen = string2.Length;
            if (minLen == 0.0F)
                return 1.0F;
            else
                return 1.0F - dis / minLen;
        }
</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-104</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Thu, 23 Oct 2008 01:00:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-104</guid>
		<description>Nice! I created a home for it on codepad:

&lt;a href=&quot;http://codepad.org/GFfEVDHu&quot;&gt;http://codepad.org/GFfEVDHu&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>Nice! I created a home for it on codepad:</p>
<p><a href="http://codepad.org/GFfEVDHu">http://codepad.org/GFfEVDHu</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-103</link>
		<dc:creator>Dan</dc:creator>
		<pubDate>Wed, 22 Oct 2008 20:19:15 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-103</guid>
		<description>Tried a version in C++... about 20 LOC, minus includes/&#039;using&#039; declarations:

&lt;pre&gt;
int fuzzy_substring(const string&amp; needle,
                    const string&amp; haystack) {
    const int nlen = needle.size(),
              hlen = haystack.size();
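    if (hlen == 0) return nlen;  // nothing to match against: every needle character is an edit
    if (nlen == 1) return haystack.find(needle) == string::npos ? 1 : 0;  // a distance, not a position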
    vector&lt;int&gt; row1(hlen+1, 0);
    for (int i = 0; i &lt; nlen; ++i) {
        vector&lt;int&gt; row2(1, i+1);
        for (int j = 0; j &lt; hlen; ++j) {
            const int cost = needle[i] != haystack[j];
            row2.push_back(std::min(row1[j+1]+1,
                                    std::min(row2[j]+1,
                                             row1[j]+cost)));
        }
        row1.swap(row2);
    }
    return *std::min_element(row1.begin(), row1.end());
}&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>Tried a version in C++... about 20 LOC, minus includes/&#8217;using&#8217; declarations:</p>
<pre>
int fuzzy_substring(const string&amp; needle,
                    const string&amp; haystack) {
    const int nlen = needle.size(),
              hlen = haystack.size();
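    if (hlen == 0) return nlen;  // nothing to match against: every needle character is an edit
    if (nlen == 1) return haystack.find(needle) == string::npos ? 1 : 0;  // a distance, not a position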
    vector&lt;int&gt; row1(hlen+1, 0);
    for (int i = 0; i &lt; nlen; ++i) {
        vector&lt;int&gt; row2(1, i+1);
        for (int j = 0; j &lt; hlen; ++j) {
            const int cost = needle[i] != haystack[j];
            row2.push_back(std::min(row1[j+1]+1,
                                    std::min(row2[j]+1,
                                             row1[j]+cost)));
        }
        row1.swap(row2);
    }
    return *std::min_element(row1.begin(), row1.end());
}</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-101</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Thu, 26 Jun 2008 02:57:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-101</guid>
		<description>@Ben

The intuition is that since we&#039;re looking for a fuzzy substring, we don&#039;t care what comes before or after the string we&#039;re matching. So given the string &quot;bbb aca bbb&quot; and the substring candidate &quot;aaa,&quot; we can match the &quot;aca&quot; of the first string against the &quot;aaa,&quot; ignoring the &quot;bbb&quot; that comes before and after.

The top row is zeroed out to say that we don&#039;t care what comes before the string we&#039;re matching, and we take the lowest score on the bottom row to say that we don&#039;t care what comes after.

However, you&#039;re right that it&#039;s hard to backtrack and find the exact path we took (e.g. in order to mark up the differences); there will often be ambiguity about the proper path.</description>
		<content:encoded><![CDATA[<p>@Ben</p>
<p>The intuition is that since we&#8217;re looking for a fuzzy substring, we don&#8217;t care what comes before or after the string we&#8217;re matching. So given the string &#8220;bbb aca bbb&#8221; and the substring candidate &#8220;aaa,&#8221; we can match the &#8220;aca&#8221; of the first string against the &#8220;aaa,&#8221; ignoring the &#8220;bbb&#8221; that comes before and after.</p>
<p>The top row is zeroed out to say that we don&#8217;t care what comes before the string we&#8217;re matching, and we take the lowest score on the bottom row to say that we don&#8217;t care what comes after.</p>
<p>However, you&#8217;re right that it&#8217;s hard to backtrack and find the exact path we took (e.g. in order to mark up the differences); there will often be ambiguity about the proper path.</p>
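<p>The zeroed top row and the lowest score on the bottom row can be sketched in Python roughly as follows. This is an illustrative sketch, not the code from the post: the name fuzzy_substring_with_end and the extra end index in the return value are assumptions added here to show what the bottom row does and does not tell us.</p>

```python
def fuzzy_substring_with_end(needle, haystack):
    """Return (distance, end) for the best fuzzy match of needle in haystack.

    distance is the lowest edit distance between needle and any substring
    of haystack; end is the smallest haystack index just past such a match.
    """
    # Top row is all zeros: skipping any prefix of haystack is free.
    prev = [0] * (len(haystack) + 1)
    for i, n_ch in enumerate(needle):
        # First column: the first i + 1 needle characters against nothing.
        cur = [i + 1]
        for j, h_ch in enumerate(haystack):
            cost = int(n_ch != h_ch)
            cur.append(min(prev[j + 1] + 1,   # drop a needle character
                           cur[j] + 1,        # absorb an extra haystack character
                           prev[j] + cost))   # match or substitute
        prev = cur
    # Lowest score on the bottom row: whatever follows the match is free.
    distance = min(prev)
    return distance, prev.index(distance)


print(fuzzy_substring_with_end("aaa", "bbb aca bbb"))  # (1, 7)
```

<p>The end index falls out of the bottom row, but the start of the match is not recorded anywhere. Recovering it would take back-pointers or a second pass, and several paths can share the same score, which is where the ambiguity comes from.</p>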
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben</title>
		<link>http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/comment-page-1/#comment-102</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Fri, 20 Jun 2008 12:50:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/#comment-102</guid>
		<description>Your method seems to discard the starting point of the match - how do you get that when you have the end point?</description>
		<content:encoded><![CDATA[<p>Your method seems to discard the starting point of the match &#8211; how do you get that when you have the end point?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

