<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The GITS Blog &#187; C</title>
	<atom:link href="http://ginstrom.com/scribbles/category/programming/c/feed/" rel="self" type="application/rss+xml" />
	<link>http://ginstrom.com/scribbles</link>
	<description>Random scribbling about programming, translation, and Japan</description>
	<lastBuildDate>Wed, 20 Apr 2011 05:09:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Version 0.2 of subdist module released</title>
		<link>http://ginstrom.com/scribbles/2007/12/16/version-02-of-subdist-module-released/</link>
		<comments>http://ginstrom.com/scribbles/2007/12/16/version-02-of-subdist-module-released/#comments</comments>
		<pubDate>Sun, 16 Dec 2007 06:34:40 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
				<category><![CDATA[C]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/16/version-02-of-subdist-module-released/</guid>
		<description><![CDATA[Just a quick note that I've released version 0.2 of my subdist module. What is subdist? subdist is a C Python extension that calculates fuzzy substring matches, based on Levenshtein distance. subdist works purely with Unicode strings; calling one of its functions with a non-Unicode string will raise an error. What's new in version 0.2? [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick note that I've released <a href="/code/subdist.html">version 0.2 of my subdist module</a>.</p>
<h3>What is subdist?</h3>
<p><strong>subdist</strong> is a C Python extension that calculates fuzzy substring matches, based on <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>.</p>
<p><strong>subdist</strong> works purely with Unicode strings; calling one of its functions with a non-Unicode string will raise an error.</p>
<h3>What's new in version 0.2?</h3>
<p>Version 0.2 adds a <code>get_score</code> function, which computes a score between 0.0 and 1.0 showing the match score of <em>needle</em> in <em>haystack</em>, based on the fuzzy substring distance.</p>
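<p>As a rough illustration of how such a score can be derived from a fuzzy substring distance, here is a minimal pure-Python 3 sketch. It is not the module's C code: the formula (1 minus the distance divided by the needle length), the treatment of an empty needle, and the helper name <code>substring_distance</code> are assumptions made for this example.</p>

```python
def substring_distance(needle, haystack):
    """Smallest edit distance between needle and any substring of haystack."""
    previous = [0] * (len(haystack) + 1)  # a match may start anywhere for free
    for i, needle_char in enumerate(needle, start=1):
        current = [i]
        for j, hay_char in enumerate(haystack, start=1):
            substitution = previous[j - 1] + (needle_char != hay_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return min(previous)  # a match may end anywhere


def get_score(needle, haystack):
    """Map the distance onto 0.0 (no match) to 1.0 (exact match).

    Assumed formula: 1 - distance / len(needle). The distance can never
    exceed len(needle), so the result always stays inside that range.
    """
    if not needle:
        return 1.0  # assumption: an empty needle matches trivially
    return 1.0 - substring_distance(needle, haystack) / len(needle)


print(get_score("needle", "Find the needle in the haystack"))
print(get_score("nedle", "Find the needle in the haystack"))
```

<p>Dividing by the needle length keeps scores comparable across needles of different sizes, which is one plausible reason to expose a score alongside the raw distance.</p>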
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2007/12/16/version-02-of-subdist-module-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Extending Python with C: A case study</title>
		<link>http://ginstrom.com/scribbles/2007/12/02/extending-python-with-c-a-case-study/</link>
		<comments>http://ginstrom.com/scribbles/2007/12/02/extending-python-with-c-a-case-study/#comments</comments>
		<pubDate>Sun, 02 Dec 2007 01:36:36 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
				<category><![CDATA[C]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.ginstrom.com/scribbles/2007/12/02/extending-python-with-c-a-case-study/</guid>
		<description><![CDATA[Near-100x speedup with a C extension I recently wrote about an algorithm for fuzzy matching of substrings implemented in Python. This is a feature that I needed for a piece of software I'm currently developing. When I started using the fuzzy_substring function on some test cases, however, it was unacceptably slow. Using a modestly large [...]]]></description>
			<content:encoded><![CDATA[<h3>Near-100x speedup with a C extension</h3>
<p>I recently wrote about an algorithm for <a href="/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/">fuzzy matching of substrings implemented in Python</a>. This is a feature that I needed for a piece of software I'm currently developing.</p>
<p>When I started using the fuzzy_substring function on some test cases, however, it was unacceptably slow. Using a modestly large test corpus and about 1,000 search terms, the function was taking about 30 seconds to run. Since this needs to be run in response to a user query, 30 seconds was just too long to wait.</p>
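<p>For context, a fuzzy substring match of this kind can be sketched as a Levenshtein table whose first row is all zeros, so that a match may begin at any position in the haystack. The version below is a minimal Python 3 sketch of that idea, not necessarily the exact function from the earlier post; the variable names are illustrative.</p>

```python
def fuzzy_substring(needle, haystack):
    """Smallest edit distance between needle and any substring of haystack."""
    # Row 0 is all zeros: skipping leading haystack characters costs nothing.
    previous = [0] * (len(haystack) + 1)
    for i, needle_char in enumerate(needle, start=1):
        # Column 0: matching i needle characters against nothing costs i edits.
        current = [i]
        for j, hay_char in enumerate(haystack, start=1):
            substitution = previous[j - 1] + (needle_char != hay_char)
            current.append(min(previous[j] + 1,      # drop a needle character
                               current[j - 1] + 1,   # absorb a haystack character
                               substitution))        # match or substitute
        previous = current
    # The match may end anywhere too, so take the minimum over the last row.
    return min(previous)


print(fuzzy_substring("needle", "Find the needle in the haystack"))
print(fuzzy_substring("nedle", "Find the needle in the haystack"))
```

<p>The two nested loops do work proportional to the needle length times the haystack length on every call, which is why running it for about 1,000 search terms against a whole corpus adds up quickly in pure Python.</p>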
<h3>First attempt</h3>
<p>The first thing I tried to get this running faster was <a href="http://psyco.sourceforge.net/">psyco</a>. Merely by sticking a psyco.full() at the top of my script, the run time went down to 11 seconds. A nearly 3x speedup for zero effort is pretty cool!</p>
<p>Psyco rocks, but 11 seconds was still too slow. So, I decided to bite the bullet and write this function as an extension module in C.</p>
<h3>The C extension</h3>
<p>I'd never written a pure-C extension module before, and hadn't touched C in about 10 years, so it was with some trepidation that I read the <a href="http://docs.python.org/ext/ext.html">official docs on extending and embedding</a> and an <a href="http://starship.python.net/crew/mwh/toext/toext.html">online tutorial by Michael Hudson</a>.</p>
<p>It turned out to be incredibly easy. I simply copied an example C file and setup.py from the docs, filled in my method, ran setup.py, and pow! A shiny new subdist.pyd file! Since I'd already written the function in Python, it only took me about an hour to write it in C, and I could use the same unit tests to verify it.</p>
<p><strong>Update</strong>: I've set up <a href="/code/subdist.html">a Web page for this module</a> with code &amp; binaries.</p>
<h3>Benchmark</h3>
<p>I ran a benchmark on the C extension versus the pure-Python version (with and without psyco). For the benchmark, I took 1,000 unique words from the Python 2.5 README file as my needles, and lines 101 to 200 of this same file (the slice <code>[100:200]</code>) as my haystacks.</p>
<div class="dean_ch" style="white-space: wrap;">
text = <span class="kw2">unicode</span><span class="br0">&#40;</span><span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;/python25/readme.txt&quot;</span><span class="br0">&#41;</span>.<span class="me1">read</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#41;</span><br />
sentences = text.<span class="me1">splitlines</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#91;</span><span class="nu0">100</span>:<span class="nu0">200</span><span class="br0">&#93;</span><br />
words = <span class="kw2">list</span><span class="br0">&#40;</span><span class="kw2">set</span><span class="br0">&#40;</span>text.<span class="me1">split</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#41;</span><span class="br0">&#41;</span><span class="br0">&#91;</span>:<span class="nu0">1000</span><span class="br0">&#93;</span></div>
<p>I then got the fuzzy substring match for each word against each of my "sentences."</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">import</span> <span class="kw3">time</span><br />
<br />
<span class="kw1">def</span> time_func<span class="br0">&#40;</span>func, words, sentences<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; start = <span class="kw3">time</span>.<span class="kw3">time</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="kw1">for</span> sentence <span class="kw1">in</span> sentences:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span> word <span class="kw1">in</span> words:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; func<span class="br0">&#40;</span>word, sentence<span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="kw1">return</span> <span class="kw3">time</span>.<span class="kw3">time</span><span class="br0">&#40;</span><span class="br0">&#41;</span> - start</div>
<h3>The speedup</h3>
<p>Each function was run 10 times; here are the low, median, and high run times:</p>
<table>
<tr>
<th>&nbsp;</th>
<th>Low</th>
<th>Median</th>
<th>High</th>
</tr>
<tr>
<th align="left">Python</th>
<td class="number">53.6410 s</td>
<td class="number">54.6560 s</td>
<td class="number">55.2190 s</td>
</tr>
<tr>
<th align="left">Python (psyco)</th>
<td class="number">17.6100 s</td>
<td class="number">17.7960 s</td>
<td class="number">18.0620 s</td>
</tr>
<tr>
<th align="left">C</th>
<td class="number">0.7030 s</td>
<td class="number">0.7190 s</td>
<td class="number">0.7340 s</td>
</tr>
</table>
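<p>Low, median, and high figures like those in the table can be gathered with a small wrapper around the timing function. The following is a Python 3 sketch rather than the original benchmark script: <code>summarize</code> is a hypothetical helper, the workload is a trivial stand-in, and <code>time.perf_counter()</code> is used in place of <code>time.time()</code> because it is the clock recommended for measuring short intervals.</p>

```python
import statistics
import time


def time_func(func, words, sentences):
    """Time one full pass of func over every (word, sentence) pair."""
    start = time.perf_counter()
    for sentence in sentences:
        for word in words:
            func(word, sentence)
    return time.perf_counter() - start


def summarize(func, words, sentences, runs=10):
    """Run the benchmark `runs` times and return (low, median, high) in seconds."""
    timings = sorted(time_func(func, words, sentences) for _ in range(runs))
    return timings[0], statistics.median(timings), timings[-1]


# Stand-in workload: a plain containment test instead of the real matcher.
words = ["needle", "hay", "stack"]
sentences = ["Find the needle in the haystack"] * 100
low, median, high = summarize(lambda word, sentence: word in sentence,
                              words, sentences)
print(f"low={low:.6f} s  median={median:.6f} s  high={high:.6f} s")
```

<p>Reporting the median alongside the extremes makes it easy to see whether a single slow run, for example one disturbed by other processes, is skewing the picture.</p>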
<h3>Ratios</h3>
<p>Python (psyco) to Python: <strong>0.3256</strong><br />
C to Python (psyco): <strong>0.0404</strong><br />
C to Python: <strong>0.0132</strong></p>
<p>So the C extension is about 76 times faster than the pure Python function (1 / 0.0132), and over 20 times faster than Python with psyco, all for an hour's work. Not too shabby!</p>
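<p>The ratios follow directly from the median run times in the table. A quick Python 3 check of the arithmetic (the dictionary and function names are only for illustration):</p>

```python
# Median run times in seconds, copied from the table above.
medians = {"Python": 54.6560, "Python (psyco)": 17.7960, "C": 0.7190}


def ratio(faster, slower):
    """Run time of the faster variant as a fraction of the slower one."""
    return medians[faster] / medians[slower]


pairs = [("Python (psyco)", "Python"), ("C", "Python (psyco)"), ("C", "Python")]
for faster, slower in pairs:
    r = ratio(faster, slower)
    # 1 / r is the speedup factor, e.g. a ratio of 0.5 means twice as fast.
    print(f"{faster} to {slower}: {r:.4f} (speedup {1 / r:.1f}x)")
```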
<p>There are actually a couple more tricks I could have used in the C code to get a further 10% to 30% speedup (e.g. not calculating the lower left and upper right corners), but it's already fast enough for my purposes. If I need to squeeze out more speed in the future, I'll add the somewhat obfuscating optimizations then.</p>
<h3>Conclusion</h3>
<p>Compiling extensions in C is incredibly easy, much easier than I expected. Pure C is especially suited to cases like this, where you can immediately get away from Python objects and work with pure C data types. Writing my fuzzy string matching algorithm as a C extension turned an algorithm that was sort of interesting into something that will actually be useful.</p>
<p>It's fantastic to be able to get this kind of speedup on performance bottlenecks, yet still write 99% of my code using pure Python.</p>
<h3>Downloads</h3>
<p><a href="/code/subdist.html">Downloads are available here.</a> All code and binaries are released under the MIT license.</p>
<h3>Usage</h3>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">from</span> subdist <span class="kw1">import</span> substring<br />
<span class="kw1">print</span> substring<span class="br0">&#40;</span>u<span class="st0">&quot;needle&quot;</span>, u<span class="st0">&quot;Find the needle in the haystack&quot;</span><span class="br0">&#41;</span></div>
<p>Caveat: The function only accepts Unicode strings, and will raise an error if non-Unicode strings are passed in.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2007/12/02/extending-python-with-c-a-case-study/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

