<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The GITS Blog &#187; Unicode</title>
	<atom:link href="http://ginstrom.com/scribbles/tag/unicode/feed/" rel="self" type="application/rss+xml" />
	<link>http://ginstrom.com/scribbles</link>
	<description>Random scribbling about programming, translation, and Japan</description>
	<lastBuildDate>Wed, 20 Apr 2011 05:09:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Notes for using Unicode with Python 2.x</title>
		<link>http://ginstrom.com/scribbles/2008/11/16/notes-for-using-unicode-with-python-2x/</link>
		<comments>http://ginstrom.com/scribbles/2008/11/16/notes-for-using-unicode-with-python-2x/#comments</comments>
		<pubDate>Sat, 15 Nov 2008 16:04:23 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[doctests]]></category>
		<category><![CDATA[idle]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/?p=575</guid>
		<description><![CDATA[Python is very Unicode friendly, but there are still a few quirks that people new to the language (or not so new!) need to assimilate in order to use Unicode effectively. To avoid going over old ground, for a primer please see this excellent article on using Unicode with Python. Here, I want to talk [...]]]></description>
			<content:encoded><![CDATA[<p>Python is very Unicode friendly, but there are still a few quirks that people new to the language (or not so new!) need to assimilate in order to use Unicode effectively.</p>
<p>To avoid going over old ground, for a primer please see this <a href="http://www.amk.ca/python/howto/unicode">excellent article on using Unicode with Python</a>. Here, I want to talk about some of the corner cases remaining after you've absorbed the great advice in that article.</p>
<p>This is not, of course, to say that Unicode support in Python is in any way buggy. Nay, Python's Unicode support is a unique snowflake, perfect in its own special way. It's just us flawed humans who have trouble appreciating fully its snowy beauty, especially if we're not Dutch.</p>
<p>And of course, all strings are Unicode in Python 3.0. That and the <a href="http://www.python.org/dev/peps/pep-3132/">new syntax for extended iterable unpacking</a> are the two main reasons I'm looking forward to Python 3.0. But alas, for now we'll have to enjoy the unique aspects of Unicode in Python 2.x a bit longer.</p>
<h3>Input</h3>
<p>I like to keep my programs as bastions of sanity, where all text is handled as Unicode. I thus try to put gatekeepers on all code accepting input, passing it on to the rest of the program logic as Unicode.</p>
<p>Programs that fail to do this often break when dealing with text input that they were sure would be fine as "ascii." One example of this is file paths. Programmers generally expect paths to be in nice, ASCII characters, and that's why their scripts often break when I run them on my Japanese system. For example, on my system the Desktop folder contains Japanese characters:</p>
<div class="dean_ch" style="white-space: wrap;">
C:\Documents and Settings\Ryan Ginstrom\デスクトップ\</div>
<p>When a random Python script breaks when run from my Desktop folder, I peek inside, and it's invariably because the programmer never expected the path to contain characters that couldn't be expressed as ASCII.</p>
<p>Put it into Unicode as soon as you get it.</p>
<p>As mentioned in the article above, the <a href="http://www.python.org/doc/2.5.2/lib/module-codecs.html">codecs</a> module makes reading text files as Unicode very simple:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">import</span> <span class="kw3">codecs</span><br />
unitext = <span class="kw3">codecs</span>.<span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;/data.txt&quot;</span>, encoding=<span class="st0">&quot;utf-8&quot;</span><span class="br0">&#41;</span>.<span class="me1">read</span><span class="br0">&#40;</span><span class="br0">&#41;</span></div>
<p>There are just a couple of twists to watch out for when using the <code>codecs</code> module.</p>
<ol>
<li>It obviously can't guess the encoding; you've got to figure this out yourself.</li>
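<li>With the <code>utf-8</code> encoding, <code>open()</code> decodes the UTF-8 byte-order mark (BOM) ('\xef\xbb\xbf') to the BOM character (u'\ufeff') and leaves it at the start of the text, while the <code>utf-16</code> encoding removes its BOM. The endian-specific codecs such as <code>utf-16-be</code> also leave the BOM in place. This might not be what you expected.</li>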
</ol>
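<p>The BOM behaviour in the list above can be checked without touching a file. The following is only a minimal sketch: the variable names are made up for illustration, and the <code>utf-8-sig</code> codec it uses is not mentioned in the article above (it is available from Python 2.5 on).</p>

```python
import codecs

text = u"abc"

# UTF-8 data that starts with a byte-order mark
utf8_with_bom = codecs.BOM_UTF8 + text.encode("utf-8")

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character
decoded_plain = utf8_with_bom.decode("utf-8")

# The "utf-8-sig" codec strips the BOM if one is present
decoded_sig = utf8_with_bom.decode("utf-8-sig")

# The "utf-16" codec reads the BOM to pick the byte order, then drops it
utf16_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
decoded_utf16 = utf16_with_bom.decode("utf-16")

print(repr(decoded_plain))
print(repr(decoded_sig))
print(repr(decoded_utf16))
```

<p>If the encoding is already known to be UTF-8, passing <code>utf-8-sig</code> as the encoding is probably the simplest way to avoid a stray BOM character at the start of the text.</p>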
<p>Because of these <del datetime="2008-11-15T14:52:00+00:00">shortcomings</del> unique aspects of the <code>codecs</code> module, I normally use the <a href="http://chardet.feedparser.org/">chardet</a> module in a custom function to get a random (i.e. user-supplied) text file as Unicode:</p>
<div class="dean_ch" style="white-space: wrap;">
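<span class="kw1">import</span> <span class="kw3">codecs</span><br />
<span class="kw1">import</span> chardet<br />
<br />
<span class="kw1">def</span> bytes2unicode<span class="br0">&#40;</span>bytes, errors=<span class="st0">'replace'</span><span class="br0">&#41;</span>:<br />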
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Convert a byte string into Unicode</p>
<p>&nbsp; &nbsp; Have to chop off the BOM by hand.<br />
&nbsp; &nbsp; Usage:<br />
&nbsp; &nbsp; text = bytes2unicode(open(&quot;</span>somefile.<span class="me1">txt</span><span class="st0">&quot;, &quot;</span>rb<span class="st0">&quot;).read())<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw3">encodings</span> = <span class="br0">&#40;</span><span class="br0">&#40;</span><span class="kw3">codecs</span>.<span class="me1">BOM_UTF8</span>, <span class="st0">&quot;utf-8&quot;</span><span class="br0">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#40;</span><span class="kw3">codecs</span>.<span class="me1">BOM_UTF16_LE</span>, <span class="st0">&quot;utf-16-le&quot;</span><span class="br0">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#40;</span><span class="kw3">codecs</span>.<span class="me1">BOM_UTF16_BE</span>, <span class="st0">&quot;UTF-16BE&quot;</span><span class="br0">&#41;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">for</span> bom, enc <span class="kw1">in</span> <span class="kw3">encodings</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> bytes.<span class="me1">startswith</span><span class="br0">&#40;</span>bom<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw2">unicode</span><span class="br0">&#40;</span>bytes<span class="br0">&#91;</span><span class="kw2">len</span><span class="br0">&#40;</span>bom<span class="br0">&#41;</span>:<span class="br0">&#93;</span>, enc, errors=errors<span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; <span class="co1"># No BOM found, so use chardet (it can report None as the encoding)</span><br />
&nbsp; &nbsp; encoding = chardet.<span class="me1">detect</span><span class="br0">&#40;</span>bytes<span class="br0">&#41;</span>.<span class="me1">get</span><span class="br0">&#40;</span><span class="st0">'encoding'</span><span class="br0">&#41;</span> <span class="kw1">or</span> <span class="st0">'ascii'</span><br />
&nbsp; &nbsp; <span class="kw1">return</span> <span class="kw2">unicode</span><span class="br0">&#40;</span>bytes, encoding, errors=errors<span class="br0">&#41;</span></div>
<h3>Output</h3>
<p>As I mentioned, I like to get my text into Unicode as early as possible, and keep it as Unicode as late as possible. Ideally, I'd like to just output my text as Unicode, and let the output stream take care of the encoding (if any).</p>
<p>That's why when I need to output Unicode as a stream of bytes, I use the codecs module for files, and wrap the output stream otherwise. This is needed, for example, when using <a href="http://www.python.org/doc/2.5.2/lib/module-cStringIO.html">cStringIO</a>, which chokes on Unicode.</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="co1">#coding: UTF8</span><br />
<span class="kw1">import</span> <span class="kw3">cStringIO</span></p>
<p>myval = u<span class="st0">&quot;日本語&quot;</span></p>
<p>out = <span class="kw3">cStringIO</span>.<span class="kw3">StringIO</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
<span class="kw1">print</span> &gt;&gt; out, myval</div>
<p>Error message:</p>
<div class="dean_ch" style="white-space: wrap;">
Traceback (most recent call last):<br />
&nbsp; File &quot;C:\workspace\SpamTest\uni2.py&quot;, line 8, in &lt;module&gt;<br />
&nbsp; &nbsp; print &gt;&gt; out, myval<br />
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)</div>
<p>I can fix this by wrapping <code>out</code> with a class that intercepts the <code>write()</code> method, and converts Unicode strings to the specified encoding just before writing.</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">class</span> OutStreamEncoder<span class="br0">&#40;</span><span class="kw2">object</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
&nbsp; &nbsp; Wraps a stream with an encoder</p>
<p>&nbsp; &nbsp; usage:<br />
&nbsp; &nbsp; out = OutStreamEncoder(out, &quot;</span>utf<span class="nu0">-8</span><span class="st0">&quot;)<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> <span class="kw4">__init__</span><span class="br0">&#40;</span><span class="kw2">self</span>, outstream, encoding<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">out</span> = outstream<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">encoding</span> = encoding</p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> write<span class="br0">&#40;</span><span class="kw2">self</span>, obj<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; Wraps the output stream, encoding Unicode<br />
&nbsp; &nbsp; &nbsp; &nbsp; strings with the specified encoding<br />
&nbsp; &nbsp; &nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="kw2">isinstance</span><span class="br0">&#40;</span>obj, <span class="kw2">unicode</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">out</span>.<span class="me1">write</span><span class="br0">&#40;</span>obj.<span class="me1">encode</span><span class="br0">&#40;</span><span class="kw2">self</span>.<span class="me1">encoding</span><span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">out</span>.<span class="me1">write</span><span class="br0">&#40;</span>obj<span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> <span class="kw4">__getattr__</span><span class="br0">&#40;</span><span class="kw2">self</span>, attr<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Delegate everything but 'write' to the stream&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw2">getattr</span><span class="br0">&#40;</span><span class="kw2">self</span>.<span class="me1">out</span>, attr<span class="br0">&#41;</span></div>
<p>Now the example above works:</p>
<div class="dean_ch" style="white-space: wrap;">
myval = u<span class="st0">&quot;日本語&quot;</span></p>
<p>out = <span class="kw3">cStringIO</span>.<span class="kw3">StringIO</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
out = OutStreamEncoder<span class="br0">&#40;</span>out, <span class="st0">&quot;utf-8&quot;</span><span class="br0">&#41;</span><br />
<span class="kw1">print</span> &gt;&gt; out, myval</div>
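<p>The standard library offers a similar wrapper: <code>codecs.getwriter()</code> returns a stream-writer class for a given encoding. The sketch below is one possible alternative, not a drop-in replacement for <code>OutStreamEncoder</code>. It assumes <code>io.BytesIO</code> (Python 2.6 and later) as a stand-in for any byte stream, and <code>make_encoding_writer</code> is a hypothetical helper name. The text is written with escapes so the snippet needs no coding declaration.</p>

```python
import codecs
import io


def make_encoding_writer(stream, encoding="utf-8"):
    """Wrap a byte stream so that Unicode text is encoded on write."""
    writer_class = codecs.getwriter(encoding)
    return writer_class(stream)


raw = io.BytesIO()                       # stand-in for any byte stream
out = make_encoding_writer(raw, "utf-8")
out.write(u"\u65e5\u672c\u8a9e\n")       # the three characters used above

encoded = raw.getvalue()                 # the UTF-8 bytes that were written
print(repr(encoded))
```

<p>One difference to keep in mind: as far as I know, in Python 2 the codecs writer tries to decode byte strings as ASCII before encoding them, so non-ASCII byte strings passed to it will fail, whereas <code>OutStreamEncoder</code> passes byte strings through untouched.</p>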
<h3>IDLE</h3>
<p><a href="http://www.python.org/idle/doc/idle2.html">IDLE</a> has its own peculiarities regarding Unicode. It actually handles Unicode like a champ, but it assumes that everything you type at the command prompt is in the file-system encoding. Since I'm on a Japanese system, this is "mbcs." You can thus get into some odd states:</p>
<div class="dean_ch" style="white-space: wrap;">
&gt;&gt;&gt; <span class="co1"># A unicode string of multibyte chars as bytes&#8230;</span><br />
&gt;&gt;&gt; u<span class="st0">&quot;日本語&quot;</span><br />
u<span class="st0">'<span class="es0">\x</span>93<span class="es0">\x</span>fa<span class="es0">\x</span>96{<span class="es0">\x</span>8c<span class="es0">\x</span>ea'</span><br />
&gt;&gt;&gt; <span class="co1"># This is what it should be</span><br />
&gt;&gt;&gt; <span class="kw2">unicode</span><span class="br0">&#40;</span><span class="st0">&quot;日本語&quot;</span>, <span class="st0">&quot;mbcs&quot;</span><span class="br0">&#41;</span><br />
u<span class="st0">'<span class="es0">\u</span>65e5<span class="es0">\u</span>672c<span class="es0">\u</span>8a9e'</span></div>
<p>The general way to avoid these problems in IDLE is to decode what you type using the encoding reported by <code>sys.getfilesystemencoding()</code>.</p>
<div class="dean_ch" style="white-space: wrap;">
&gt;&gt;&gt; <span class="kw1">import</span> <span class="kw3">sys</span><br />
&gt;&gt;&gt; <span class="kw1">print</span> <span class="kw2">unicode</span><span class="br0">&#40;</span><span class="st0">&quot;日本語&quot;</span>, <span class="kw3">sys</span>.<span class="me1">getfilesystemencoding</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#41;</span><br />
日本語</div>
<h3>Doctests</h3>
<p><a href="http://docs.python.org/lib/module-doctest.html">doctest</a> is so full of snow-flaky uniqueness, I could put cherry syrup on it and call it a snow cone. Note in the example below that my "is_asian" function's doctests contain a Japanese character (日).</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="co1">#coding: UTF8</span></p>
<p><span class="co1"># 0x3000 is ideographic space (i.e. double-byte space)</span><br />
IDEOGRAPHIC_SPACE = 0x3000</p>
<p><span class="kw1">def</span> is_asian<span class="br0">&#40;</span>char<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
&nbsp; &nbsp; Is the character Asian?</p>
<p>&nbsp; &nbsp; &gt;&gt;&gt; is_asian(u'a')<br />
&nbsp; &nbsp; False<br />
&nbsp; &nbsp; &gt;&gt;&gt; is_asian(u'日')<br />
&nbsp; &nbsp; True<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> <span class="kw2">ord</span><span class="br0">&#40;</span>char<span class="br0">&#41;</span> &gt; IDEOGRAPHIC_SPACE</div>
<p>Running doctest on this gives a rather cryptic error:</p>
<div class="dean_ch" style="white-space: wrap;">
Failed example:<br />
&nbsp; &nbsp; is_asian(u'日')<br />
Exception raised:<br />
&nbsp; &nbsp; Traceback (most recent call last):<br />
&nbsp; &nbsp; &nbsp; File &quot;C:\Python25\lib\doctest.py&quot;, line 1228, in __run<br />
&nbsp; &nbsp; &nbsp; &nbsp; compileflags, 1) in test.globs<br />
&nbsp; &nbsp; &nbsp; File &quot;&lt;doctest __main__.is_asian[1]&gt;&quot;, line 1, in &lt;module&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; is_asian(u'日')<br />
&nbsp; &nbsp; &nbsp; File &quot;C:\workspace\SpamTest\uni1.py&quot;, line 15, in is_asian<br />
&nbsp; &nbsp; &nbsp; &nbsp; return ord(char) &gt; IDEOGRAPHIC_SPACE<br />
&nbsp; &nbsp; TypeError: ord() expected a character, but string of length 3 found</div>
<p>It turns out that doctest can't handle non-ASCII characters in the examples of an ordinary docstring. It makes the same mistake as IDLE, treating a string of UTF-8 bytes as if each byte were a Unicode character, and thus interprets one character ("日") as three.</p>
<p>So we have to trick doctest by taking the repr value of the Unicode text (I usually stick the actual characters in a comment above it). Here's a repaired version, which runs without errors:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">def</span> is_asian<span class="br0">&#40;</span>char<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
&nbsp; &nbsp; Repaired version of doctests</p>
<p>&nbsp; &nbsp; &gt;&gt;&gt; is_asian(u'a')<br />
&nbsp; &nbsp; False<br />
&nbsp; &nbsp; &gt;&gt;&gt; # u'日'<br />
&nbsp; &nbsp; &gt;&gt;&gt; is_asian(u'<span class="es0">\u</span>65e5')<br />
&nbsp; &nbsp; True<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> <span class="kw2">ord</span><span class="br0">&#40;</span>char<span class="br0">&#41;</span> &gt; IDEOGRAPHIC_SPACE</div>
<p>To see the silver lining in this, at least it encourages you to keep your complicated tests in unit tests, and save doctests for simple, illustrative purposes.</p>
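<p>One way to confirm that the escaped form of the doctest passes is to run the docstring's examples programmatically. This is a rough sketch using <code>DocTestParser</code> and <code>DocTestRunner</code> from the <code>doctest</code> module; <code>run_doctests</code> is a hypothetical helper name, not something from the code above.</p>

```python
import doctest

# 0x3000 is ideographic space (i.e. double-byte space)
IDEOGRAPHIC_SPACE = 0x3000


def is_asian(char):
    """
    Is the character Asian?

    >>> is_asian(u'a')
    False
    >>> is_asian(u'\u65e5')
    True
    """
    return ord(char) > IDEOGRAPHIC_SPACE


def run_doctests(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # Arguments: docstring, globals for the examples, name, filename, lineno
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


failed, attempted = run_doctests(is_asian)
print("failed=%d attempted=%d" % (failed, attempted))
```

<p>In a normal module you would more likely just call <code>doctest.testmod()</code>; the helper here only makes the result easy to inspect.</p>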
<h3>Conclusion</h3>
<p>Unicode support in Python is actually quite good, and much better than in most languages. And it will get even better with Python 3.0. In the meantime, however, there are a few gotchas to look out for when using Unicode in Python.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/11/16/notes-for-using-unicode-with-python-2x/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

