<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The GITS Blog &#187; iterators</title>
	<atom:link href="http://ginstrom.com/scribbles/tag/iterators/feed/" rel="self" type="application/rss+xml" />
	<link>http://ginstrom.com/scribbles</link>
	<description>Random scribbling about programming, translation, and Japan</description>
	<lastBuildDate>Fri, 11 May 2012 05:10:41 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Conditional &#8220;tee&#8221; with Python</title>
		<link>http://ginstrom.com/scribbles/2009/03/02/conditional-tee-with-python/</link>
		<comments>http://ginstrom.com/scribbles/2009/03/02/conditional-tee-with-python/#comments</comments>
		<pubDate>Mon, 02 Mar 2009 01:21:38 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[conditional]]></category>
		<category><![CDATA[functional]]></category>
		<category><![CDATA[generators]]></category>
		<category><![CDATA[iterators]]></category>
		<category><![CDATA[tee]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/?p=858</guid>
		<description><![CDATA[This post describes the conditional tee ("ctee") module I wrote to split a sequence into two generators, according to a filter function. The problem David Beazley has a great article about generator pipelining using Python. This is a technique for handling (potentially very large) streams of data in a flexible yet efficient way. As an [...]]]></description>
			<content:encoded><![CDATA[<p>This post describes the <a href="/code/ctee.zip">conditional tee ("ctee") module</a> I wrote to split a sequence into two generators, according to a filter function.</p>
<h3>The problem</h3>
<p>David Beazley has <a href="http://www.dabeaz.com/generators/">a great article about generator pipelining</a> using Python. This is a technique for handling (potentially very large) streams of data in a flexible yet efficient way. As an example, here's a code snippet he gives for summing the total bytes from a log file:</p>
<div class="dean_ch" style="white-space: wrap;">
wwwlog = <span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;access-log&quot;</span><span class="br0">&#41;</span><br />
bytecolumn = <span class="br0">&#40;</span>line.<span class="me1">rsplit</span><span class="br0">&#40;</span><span class="kw2">None</span>,<span class="nu0">1</span><span class="br0">&#41;</span><span class="br0">&#91;</span><span class="nu0">1</span><span class="br0">&#93;</span> <span class="kw1">for</span> line <span class="kw1">in</span> wwwlog<span class="br0">&#41;</span><br />
bytes = <span class="br0">&#40;</span><span class="kw2">int</span><span class="br0">&#40;</span>x<span class="br0">&#41;</span> <span class="kw1">for</span> x <span class="kw1">in</span> bytecolumn <span class="kw1">if</span> x != <span class="st0">'-'</span><span class="br0">&#41;</span></p>
<p><span class="kw1">print</span> <span class="st0">&quot;Total&quot;</span>, <span class="kw2">sum</span><span class="br0">&#40;</span>bytes<span class="br0">&#41;</span></div>
<p>The code above first opens a log file, then gets the byte column for each entry. The byte value (if any) is then calculated for each row. Finally, the generator is consumed (or "pumped"), yielding the sum.</p>
<p>Since the entire file is never loaded into memory, you could run this on very large log files, or even add a few steps and run it on collections of log files, without running out of memory. Another feature of this technique is that it's very flexible: you can add steps, combine steps into atomic actions, rearrange them, and so on.</p>
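<p>As a rough sketch of what "adding a step" can look like, here is a Python 3 version of the same pipeline with one extra stage that joins several logs lazily. The log lines, the names <code>log_a</code> and <code>log_b</code>, and the <code>total_bytes</code> helper are made up for illustration; real code would pass open file objects instead of lists.</p>

```python
import itertools

# Hypothetical in-memory "log files"; real code would pass open file objects.
log_a = [
    '1.2.3.4 - - "GET /a HTTP/1.1" 200 512',
    '1.2.3.5 - - "GET /b HTTP/1.1" 304 -',
]
log_b = [
    '1.2.3.6 - - "GET /c HTTP/1.1" 200 1024',
]

def total_bytes(logs):
    """Sum the last column of every line in every log, one line at a time."""
    # Extra step: chain the logs together without loading any of them fully.
    lines = itertools.chain.from_iterable(logs)
    bytecolumn = (line.rsplit(None, 1)[1] for line in lines)
    sizes = (int(x) for x in bytecolumn if x != '-')
    return sum(sizes)

print("Total", total_bytes([log_a, log_b]))
```

<p>The rest of the pipeline is unchanged; only the source of <code>lines</code> differs.</p>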
<p>This works great, as long as your pipe doesn't branch. If you want to split your pipe (say, dividing a stream of integers into one stream of even numbers and another of odds), things get a little complicated. One elegant way to handle this situation is with the <code>itertools.tee</code> function. <code>tee</code> takes an iterable and returns <em>n</em> independent iterators over it (two by default), each of which yields every item of the original.</p>
<h3>Using <code>itertools.tee</code></h3>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">import</span> <span class="kw3">itertools</span></p>
<p>lines = <span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;numbers.txt&quot;</span><span class="br0">&#41;</span><br />
numbers = <span class="br0">&#40;</span><span class="kw2">int</span><span class="br0">&#40;</span>line<span class="br0">&#41;</span> <span class="kw1">for</span> line <span class="kw1">in</span> lines<span class="br0">&#41;</span></p>
<p>first, second = <span class="kw3">itertools</span>.<span class="me1">tee</span><span class="br0">&#40;</span>numbers<span class="br0">&#41;</span></p>
<p>evens = <span class="br0">&#40;</span>i <span class="kw1">for</span> i <span class="kw1">in</span> first <span class="kw1">if</span> <span class="kw1">not</span> i % <span class="nu0">2</span><span class="br0">&#41;</span><br />
odds = <span class="br0">&#40;</span>i <span class="kw1">for</span> i <span class="kw1">in</span> second <span class="kw1">if</span> i % <span class="nu0">2</span><span class="br0">&#41;</span></p>
<p><span class="kw1">print</span> <span class="st0">&quot;Evens total:&quot;</span>, <span class="kw2">sum</span><span class="br0">&#40;</span>evens<span class="br0">&#41;</span><br />
<span class="kw1">print</span> <span class="st0">&quot;Odds total:&quot;</span>, <span class="kw2">sum</span><span class="br0">&#40;</span>odds<span class="br0">&#41;</span></div>
<p>The code first opens a file containing a bunch of random integers, and creates a generator that's a stream of integers. It then uses <code>itertools.tee</code> to make two copies of that generator (first and second), and applies generator expressions to create two streams: one of even numbers, and one of odd numbers. The built-in <code>sum</code> function is then used to consume each tee.</p>
<p><a href="/code/numbers.txt">Here's the number file</a> that I used for this code. It's a list of 1,000 random integers between 1 and 1,000,000.</p>
<p>Running the filter on both copies is fine in this case, where the filter expression is relatively inexpensive. But what if we have an expensive filter, like testing whether the number is prime, or making a database query? Each element gets tested twice, once per copy, which could really hurt our performance. Ideally, we'd perform the test just once for each element in our pipeline.</p>
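<p>To make the doubled work concrete, here is a small Python 3 sketch that counts how often the filter runs when both <code>tee</code> copies are filtered. The <code>is_even</code> function and the <code>calls</code> counter are stand-ins for an expensive test, not part of the ctee module.</p>

```python
import itertools

calls = 0

def is_even(n):
    """Stand-in for an expensive test; counts how often it runs."""
    global calls
    calls += 1
    return n % 2 == 0

numbers = iter(range(10))
first, second = itertools.tee(numbers)

# Each copy sees all ten numbers, so the filter runs on every number twice.
evens = [n for n in first if is_even(n)]
odds = [n for n in second if not is_even(n)]

print(evens, odds, calls)
```

<p>With ten input numbers, the filter ends up being called twenty times.</p>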
<p>There are lots of ways to handle that situation. One common way is the <a href="http://en.wikipedia.org/wiki/Continuation-passing_style">"continuation-passing" style</a>, where you pass some data, a filter condition, and one or more functions to perform depending on the results of the test.</p>
<p>This works, but it disrupts the pipeline. That costs us the flexibility and dynamic nature of the generator paradigm.</p>
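<p>A minimal Python 3 sketch of the continuation-passing approach described above might look like this. The name <code>split_cps</code> and its parameters are hypothetical, chosen only for illustration.</p>

```python
def split_cps(items, condition, on_match, on_miss):
    """Test each item once and hand it to one of two callbacks."""
    for item in items:
        if condition(item):
            on_match(item)
        else:
            on_miss(item)

evens, odds = [], []
split_cps(range(10), lambda n: n % 2 == 0, evens.append, odds.append)
print(evens, odds)
```

<p>The test runs only once per item, but the caller gets no iterators back, so there is nothing to feed into further generator stages.</p>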
<p>I wrote the conditional tee (ctee) module for cases when you want to use a generator pipeline, but you need to split the sequence into two generators, and the filter condition is expensive. It creates a pair of instances of the <code>ConditionalTee</code> class, which are linked to each other.</p>
<h3>Using conditional tee</h3>
<p>Here's the meat of the code. The module can be <a href="/code/ctee.zip">downloaded here (ctee.zip)</a>.</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">from</span> <span class="kw3">Queue</span> <span class="kw1">import</span> <span class="kw3">Queue</span></p>
<p><span class="kw1">class</span> ConditionalTee<span class="br0">&#40;</span><span class="kw2">object</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;A conditional tee class&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> <span class="kw4">__init__</span><span class="br0">&#40;</span><span class="kw2">self</span>, sequence, condition<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">sequence</span> = sequence<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">condition</span> = condition<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">othertee</span> = <span class="kw2">None</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">q</span> = <span class="kw3">Queue</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> next<span class="br0">&#40;</span><span class="kw2">self</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; Get the next item that matches the condition.<br />
&nbsp; &nbsp; &nbsp; &nbsp; Adds items to the queue of the other sequence until<br />
&nbsp; &nbsp; &nbsp; &nbsp; one matching this condition is reached.<br />
&nbsp; &nbsp; &nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="kw1">not</span> <span class="kw2">self</span>.<span class="me1">q</span>.<span class="me1">empty</span><span class="br0">&#40;</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw2">self</span>.<span class="me1">q</span>.<span class="me1">get</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; item = <span class="kw2">self</span>.<span class="me1">sequence</span>.<span class="me1">next</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">while</span> <span class="kw1">not</span> <span class="kw2">self</span>.<span class="me1">condition</span><span class="br0">&#40;</span>item<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">othertee</span>.<span class="me1">q</span>.<span class="me1">put</span><span class="br0">&#40;</span>item<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; item = <span class="kw2">self</span>.<span class="me1">sequence</span>.<span class="me1">next</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> item</p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> <span class="kw4">__iter__</span><span class="br0">&#40;</span><span class="kw2">self</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;We are an iterator&quot;</span><span class="st0">&quot;&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw2">self</span></p>
<p><span class="kw1">def</span> ctee<span class="br0">&#40;</span>sequence, condition<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
&nbsp; &nbsp; Creates two sequences from sequence: one where<br />
&nbsp; &nbsp; condition holds, and the other where it doesn't<br />
&nbsp; &nbsp; sequence -&gt; (x for x in sequence if condition(x)),<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (x for x in sequence if not condition(x))<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
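<p>&nbsp; &nbsp; <span class="co1"># share one iterator so plain lists work too</span><br />
&nbsp; &nbsp; sequence = <span class="kw2">iter</span><span class="br0">&#40;</span>sequence<span class="br0">&#41;</span><br />
&nbsp; &nbsp; yes_iter = ConditionalTee<span class="br0">&#40;</span>sequence, condition<span class="br0">&#41;</span><br />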
&nbsp; &nbsp; nocond = <span class="kw1">lambda</span> x : <span class="kw1">not</span> condition<span class="br0">&#40;</span>x<span class="br0">&#41;</span><br />
&nbsp; &nbsp; no_iter = ConditionalTee<span class="br0">&#40;</span>sequence, nocond<span class="br0">&#41;</span><br />
&nbsp; &nbsp; yes_iter.<span class="me1">othertee</span> = no_iter<br />
&nbsp; &nbsp; no_iter.<span class="me1">othertee</span> = yes_iter</p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> yes_iter, no_iter</div>
<p>The <code>ConditionalTee</code> class takes a sequence and a filter condition as arguments to its <code>__init__</code> method. The <code>__init__</code> method also creates an empty queue member, and an <code>othertee</code> member that's initialized to None.</p>
<p>When the <code>next</code> method of a <code>ConditionalTee</code> instance is called, it first looks for any items in its queue. If the queue isn't empty, it returns the first item. Otherwise, it pulls items from the shared sequence, adding each one that doesn't match to the queue of its <code>othertee</code> member, until it either finds an item that matches or the sequence runs out and raises a <code>StopIteration</code> exception.</p>
<p>The <code>ctee</code> function also takes a sequence and a filter condition as arguments. It creates two <code>ConditionalTee</code> instances, and sets their <code>othertee</code> members to each other, then returns the two instances as a pair.</p>
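<p>For readers on Python 3, the same idea can be sketched as follows. This is an adaptation rather than the module's actual code: it uses <code>__next__</code> instead of <code>next</code>, swaps <code>Queue.Queue</code> for <code>collections.deque</code> on the assumption that no threads are involved, and calls <code>iter()</code> so that both halves share one iterator. The <code>is_even</code> function and the <code>calls</code> list are there only to show that each item is tested once.</p>

```python
from collections import deque

class ConditionalTee:
    """One half of a conditional tee (Python 3 sketch)."""

    def __init__(self, source, condition):
        self.source = source        # iterator shared with the other half
        self.condition = condition
        self.othertee = None        # linked up by ctee()
        self.q = deque()            # items the other half has set aside for us

    def __iter__(self):
        return self

    def __next__(self):
        if self.q:
            return self.q.popleft()
        item = next(self.source)    # StopIteration propagates when the source ends
        while not self.condition(item):
            self.othertee.q.append(item)
            item = next(self.source)
        return item

def ctee(sequence, condition):
    """Return (matching, non_matching) iterators over sequence."""
    source = iter(sequence)         # both halves must share a single iterator
    yes_iter = ConditionalTee(source, condition)
    no_iter = ConditionalTee(source, lambda x: not condition(x))
    yes_iter.othertee, no_iter.othertee = no_iter, yes_iter
    return yes_iter, no_iter

calls = []

def is_even(n):
    calls.append(n)                 # record every evaluation of the filter
    return n % 2 == 0

evens, odds = ctee(range(10), is_even)
even_list = list(evens)
odd_list = list(odds)
print(even_list, odd_list, len(calls))
```

<p>Consuming <code>evens</code> first queues every odd number on the other half, so <code>odds</code> is served entirely from its queue and the filter runs ten times in total.</p>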
<p>Here's some sample code using ctee:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">import</span> ctee<br />
<br />
lines = <span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;numbers.txt&quot;</span><span class="br0">&#41;</span><br />
nums = <span class="br0">&#40;</span><span class="kw2">int</span><span class="br0">&#40;</span>line<span class="br0">&#41;</span> <span class="kw1">for</span> line <span class="kw1">in</span> lines<span class="br0">&#41;</span><br />
iseven = <span class="kw1">lambda</span> x : <span class="kw1">not</span> x % <span class="nu0">2</span><br />
evens, odds = ctee.<span class="me1">ctee</span><span class="br0">&#40;</span>nums, iseven<span class="br0">&#41;</span></div>
<p>There's still a problem if you "pump" each of these generators in succession, though: while you consume the first one, every item it rejects is queued on the other <code>ConditionalTee</code> instance, so with a large amount of data that queue becomes huge. It would be better to pump the two generators alternately, taking an item from each in turn and processing it, in order to avoid building up a big queue.</p>
<h3>Pumping generators alternately instead of consecutively</h3>
<p>Here's a function that'll do that:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">def</span> diagonalize<span class="br0">&#40;</span>sequences<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
&nbsp; &nbsp; Takes each sequence in turn, retrieving one item from that<br />
&nbsp; &nbsp; sequence and performing action on it, until all sequences<br />
&nbsp; &nbsp; are exhausted.<br />
&nbsp; &nbsp; sequences is a sequence of (iterable, action) pairs.<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span><br />
&nbsp; &nbsp; sequences = <span class="br0">&#91;</span><span class="br0">&#40;</span><span class="kw2">iter</span><span class="br0">&#40;</span>s<span class="br0">&#41;</span>, a<span class="br0">&#41;</span> <span class="kw1">for</span> <span class="br0">&#40;</span>s, a<span class="br0">&#41;</span> <span class="kw1">in</span> sequences<span class="br0">&#93;</span><br />
&nbsp; &nbsp; <span class="kw1">while</span> sequences:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span> sequence, action <span class="kw1">in</span> sequences:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">try</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; item = sequence.<span class="me1">next</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; action<span class="br0">&#40;</span>item<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">except</span> <span class="kw2">StopIteration</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># remove the exhausted sequence from the list</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sequences = <span class="br0">&#91;</span><span class="br0">&#40;</span>s, a<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span> <span class="br0">&#40;</span>s, a<span class="br0">&#41;</span> <span class="kw1">in</span> sequences<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> s != sequence<span class="br0">&#93;</span></div>
<p>This takes a sequence of (sequence, action) pairs. It iterates through each pair, taking the next item in the sequence and applying the action to it. If the sequence raises <code>StopIteration</code>, it's removed from the list of sequences. The list comprehension at the start of the function turns <code>sequences</code> into a list, so that it tests False once it's empty, and ensures that each sequence in it is an iterator (i.e. supports <code>next</code>).</p>
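<p>Here is a Python 3 sketch of the same round-robin idea, run on two small in-memory sequences so the ordering is easy to see. It differs from the listing above in one respect, which is my own choice: the action is called outside the <code>try</code> block, so a <code>StopIteration</code> raised by the action itself isn't mistaken for an exhausted sequence. The <code>order</code> list is just for illustration.</p>

```python
def diagonalize(sequences):
    """Round-robin over (iterable, action) pairs until all are exhausted."""
    pending = [(iter(s), action) for s, action in sequences]
    while pending:
        still_live = []
        for iterator, action in pending:
            try:
                item = next(iterator)
            except StopIteration:
                continue            # drop the exhausted iterator
            action(item)
            still_live.append((iterator, action))
        pending = still_live

order = []
diagonalize([
    ([1, 2, 3], lambda x: order.append(("a", x))),
    ("xy", lambda x: order.append(("b", x))),
])
print(order)
```

<p>Items are taken alternately from the two sequences until the shorter one runs out, after which the longer one is drained on its own.</p>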
<p>Here's an example of using this function:</p>
<div class="dean_ch" style="white-space: wrap;">
lines = <span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;numbers.txt&quot;</span><span class="br0">&#41;</span><br />
numbers = <span class="br0">&#40;</span><span class="kw2">int</span><span class="br0">&#40;</span>line<span class="br0">&#41;</span> <span class="kw1">for</span> line <span class="kw1">in</span> lines<span class="br0">&#41;</span></p>
<p>evens, odds = ctee<span class="br0">&#40;</span>numbers, <span class="kw1">lambda</span> x : <span class="kw1">not</span> x % <span class="nu0">2</span><span class="br0">&#41;</span></p>
<p>evenout = <span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;evens.txt&quot;</span>, <span class="st0">&quot;w&quot;</span><span class="br0">&#41;</span><br />
oddout = <span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;odds.txt&quot;</span>, <span class="st0">&quot;w&quot;</span><span class="br0">&#41;</span></p>
<p><span class="kw1">def</span> writeline<span class="br0">&#40;</span>out, item<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="kw1">print</span> &gt;&gt; out, item</p>
<p>evenaction = <span class="kw1">lambda</span> x : writeline<span class="br0">&#40;</span>evenout, x<span class="br0">&#41;</span><br />
oddaction = <span class="kw1">lambda</span> x : writeline<span class="br0">&#40;</span>oddout, x<span class="br0">&#41;</span></p>
<p>diagonalize<span class="br0">&#40;</span><span class="br0">&#40;</span><span class="br0">&#40;</span>evens, evenaction<span class="br0">&#41;</span>, <span class="br0">&#40;</span>odds, oddaction<span class="br0">&#41;</span><span class="br0">&#41;</span><span class="br0">&#41;</span></div>
<p>This code will write all the even numbers to "evens.txt", and all the odd numbers to "odds.txt".</p>
<p>You might ask, how is this different from the continuation-passing style? And you'd have a point; this is essentially continuation passing.</p>
<p>The thing is that here, the pumping only happens at the end. You can still go on wrapping all sorts of other filtering and transforming generators around your two conditional tees; the sequence won't actually be processed until you start pumping the generators at the end, so you won't build up enormous queues of data.</p>
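<p>The laziness this relies on can be seen with plain generator stages, without ctee at all. In this Python 3 sketch, the <code>source</code> function and the <code>reads</code> list are hypothetical; they record when items are actually pulled from the data source. The assumption is that stages wrapped around the two conditional tees behave the same way, since they are ordinary generators too.</p>

```python
reads = []

def source():
    """Hypothetical data source that records each item as it is produced."""
    for n in range(5):
        reads.append(n)
        yield n

numbers = source()
doubled = (n * 2 for n in numbers)          # extra transforming stage
big = (n for n in doubled if n > 3)         # extra filtering stage

reads_before = list(reads)                  # snapshot before any pumping
print("before pumping:", reads_before)

result = list(big)                          # pumping happens only here
print("after pumping:", reads, result)
```

<p>Building the pipeline reads nothing; items are pulled from the source only when the last stage is consumed.</p>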
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2009/03/02/conditional-tee-with-python/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

