Notes for using Unicode with Python 2.x

Python is very Unicode friendly, but there are still a few quirks that people new to the language (or not so new!) need to assimilate in order to use Unicode effectively.

To avoid going over old ground, for a primer please see this excellent article on using Unicode with Python. Here, I want to talk about some of the corner cases remaining after you've absorbed the great advice in that article.

This is not, of course, to say that Unicode support in Python is in any way buggy. Nay, Python's Unicode support is a unique snowflake, perfect in its own special way. It's just us flawed humans who have trouble appreciating fully its snowy beauty, especially if we're not Dutch.

And of course, all strings are Unicode in Python 3.0. That and the new syntax for extended iterable unpacking are the two main reasons I'm looking forward to Python 3.0. But alas, for now we'll have to keep enjoying the unique aspects of Unicode in Python 2.x a bit longer.

Input

I like to keep my programs as bastions of sanity, where all text is handled as Unicode. I thus try to put gatekeepers on all code accepting input, passing it on to the rest of the program logic as Unicode.

Programs that fail to do this often break when dealing with text input that they were sure would be fine as "ascii." One example of this is file paths. Programmers generally expect paths to be in nice, ASCII characters, and that's why their scripts often break when I run them on my Japanese system. For example, on my system the Desktop folder contains Japanese characters:

C:\Documents and Settings\Ryan Ginstrom\デスクトップ\

When a random Python script breaks when run from my Desktop folder, I peek inside, and it's invariably because the programmer never expected the path to contain characters that couldn't be expressed as ASCII.

Put it into Unicode as soon as you get it.
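A gatekeeper for paths can be as small as the sketch below. The helper name path_to_unicode and the UTF-8 last-resort fallback are assumptions made for illustration, not part of any library; the idea is simply to decode byte strings with the file-system encoding and pass Unicode through untouched.

```python
import sys

def path_to_unicode(path, encoding=None):
    """Return `path` as Unicode text (hypothetical helper).

    Byte strings are decoded with `encoding`, falling back to the
    file-system encoding, then to UTF-8 (an assumed last resort).
    Text that is already Unicode is returned unchanged.
    """
    if isinstance(path, bytes):  # `bytes` is plain `str` on Python 2.6+
        encoding = encoding or sys.getfilesystemencoding() or "utf-8"
        return path.decode(encoding)
    return path
```

On Python 2, os.getcwd() and sys.argv hand back byte strings, so they are natural places to apply a helper like this; os.getcwdu() and os.listdir(u".") return Unicode directly.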

As mentioned in the article above, the codecs module makes reading text files as Unicode very simple:

import codecs
unitext = codecs.open("/data.txt", encoding="utf-8").read()

There are just a couple of twists to watch out for when using the codecs module.

  1. It obviously can't guess the encoding; you've got to figure this out yourself.
  2. codecs.open() decodes the UTF-8 byte-order mark (BOM) ('\xef\xbb\xbf') into the BOM character (u'\ufeff') and leaves it at the start of the text, while a UTF-16 BOM of either byte order is removed when you decode with "utf-16". This might not be what you expected.
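The BOM behaviour in the second point can be checked directly on byte strings, without touching a file. This is a small sketch with made-up sample text; "utf-8-sig" (available from Python 2.5) is a standard codec that skips a leading UTF-8 BOM if one is present.

```python
import codecs

text = u"data"

# "utf-8" decodes the BOM to U+FEFF and leaves it in the result
utf8_bytes = codecs.BOM_UTF8 + text.encode("utf-8")
kept = utf8_bytes.decode("utf-8")

# "utf-8-sig" drops a leading UTF-8 BOM
skipped = utf8_bytes.decode("utf-8-sig")

# "utf-16" uses the BOM to pick the byte order, then discards it
utf16_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
consumed = utf16_bytes.decode("utf-16")
```

The same codec names can be passed to codecs.open() as its encoding argument.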

Because of these unique aspects of the codecs module, I normally use the chardet module in a custom function to get a random (i.e. user-supplied) text file as Unicode:

import codecs
import chardet

def bytes2unicode(bytes, errors='replace'):
    """Convert a byte string into Unicode

    Have to chop off the BOM by hand.
    Usage:
    text = bytes2unicode(open("somefile.txt", "rb").read())
    """

    # The BOM is stripped below, so name the byte order explicitly
    encodings = ((codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"))

    for bom, enc in encodings:
        if bytes.startswith(bom):
            return unicode(bytes[len(bom):], enc, errors=errors)

    # No BOM found, so use chardet (it reports None when it can't guess)
    encoding = chardet.detect(bytes).get('encoding') or 'ascii'
    return unicode(bytes, encoding, errors=errors)

Output

As I mentioned, I like to get my text into Unicode as early as possible, and keep it as Unicode as late as possible. Ideally, I'd like to just output my text as Unicode, and let the output stream take care of the encoding (if any).

That's why when I need to output Unicode as a stream of bytes, I use the codecs module for files, and wrap the output stream otherwise. This is needed, for example, when using cStringIO, which chokes on Unicode.

#coding: UTF8
import cStringIO

myval = u"日本語"

out = cStringIO.StringIO()
print >> out, myval

Error message:

Traceback (most recent call last):
  File "C:\workspace\SpamTest\uni2.py", line 8, in <module>
    print >> out, myval
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I can fix this by wrapping out with a class that intercepts the write() method, and converts Unicode strings to the specified encoding just before writing.

class OutStreamEncoder(object):
    """
    Wraps a stream with an encoder

    usage:
    out = OutStreamEncoder(out, "utf-8")
    """

    def __init__(self, outstream, encoding):
        self.out = outstream
        self.encoding = encoding

    def write(self, obj):
        """
        Wraps the output stream, encoding Unicode
        strings with the specified encoding
        """

        if isinstance(obj, unicode):
            self.out.write(obj.encode(self.encoding))
        else:
            self.out.write(obj)

    def __getattr__(self, attr):
        """Delegate everything but 'write' to the stream"""

        return getattr(self.out, attr)

Now the example above works:

myval = u"日本語"

out = cStringIO.StringIO()
out = OutStreamEncoder(out, "utf-8")
print >> out, myval
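For comparison, the standard library offers a similar wrapper in codecs.getwriter(). The sketch below uses io.BytesIO as a stand-in byte stream (an assumption; any object with a write() method should do). Unlike OutStreamEncoder, this writer is meant to be fed Unicode strings only, whereas the class above also passes byte strings through untouched.

```python
import codecs
import io

buf = io.BytesIO()                     # stand-in for any byte stream
out = codecs.getwriter("utf-8")(buf)   # StreamWriter that encodes on write()
out.write(u"\u65e5\u672c\u8a9e\n")     # the three characters from the example above
encoded = buf.getvalue()               # UTF-8 bytes, ready for a file or socket
```

For files, codecs.open(path, "w", encoding="utf-8") gives the same encode-on-write behaviour without a separate wrapper.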

IDLE

IDLE has its own peculiarities regarding Unicode. It actually handles Unicode like a champ, but it assumes that everything you type at the command prompt is in the file-system encoding. Since I'm on a Japanese system, this is "mbcs." You can thus get into some odd states:

>>> # A unicode string of multibyte chars as bytes…
>>> u"日本語"
u'\x93\xfa\x96{\x8c\xea'
>>> # This is what it should be
>>> unicode("日本語", "mbcs")
u'\u65e5\u672c\u8a9e'

The general way to avoid these problems in IDLE is to decode what you type explicitly, using sys.getfilesystemencoding().

>>> import sys
>>> print unicode("日本語", sys.getfilesystemencoding())
日本語

Doctests

doctest is so full of snow-flaky uniqueness, I could put cherry syrup on it and call it a snow cone. Note in the example below that my "is_asian" function's doctests contain a Japanese character (日).

#coding: UTF8

# 0x3000 is ideographic space (i.e. double-byte space)
IDEOGRAPHIC_SPACE = 0x3000

def is_asian(char):
    """
    Is the character Asian?

    >>> is_asian(u'a')
    False
    >>> is_asian(u'日')
    True
    """

    return ord(char) > IDEOGRAPHIC_SPACE

Running doctest on this gives a rather cryptic error:

Failed example:
    is_asian(u'日')
Exception raised:
    Traceback (most recent call last):
      File "C:\Python25\lib\doctest.py", line 1228, in __run
        compileflags, 1) in test.globs
      File "<doctest __main__.is_asian[1]>", line 1, in <module>
        is_asian(u'日')
      File "C:\workspace\SpamTest\uni1.py", line 15, in is_asian
        return ord(char) > IDEOGRAPHIC_SPACE
    TypeError: ord() expected a character, but string of length 3 found

It turns out that doctests can't handle Unicode characters. It's making the same "string of utf-8 bytes as Unicode characters" error as IDLE, and thus interpreting one character ("日") as three.

So we have to trick doctest by taking the repr value of the Unicode text (I usually stick the actual characters in a comment above it). Here's a repaired version, which runs without errors:

def is_asian(char):
    """
    Repaired version of doctests

    >>> is_asian(u'a')
    False
    >>> # u'日'
    >>> is_asian(u'\u65e5')
    True
    """

    return ord(char) > IDEOGRAPHIC_SPACE

To see the silver lining in this, at least it encourages you to keep your complicated tests in unit tests, and save doctests for simple, illustrative purposes.
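One variation on the repaired version, offered as a sketch rather than a required step: marking the docstring as raw (r""") guarantees that the \u65e5 escape reaches doctest exactly as written, even in a module whose string literals are Unicode. The run_docstring helper is hypothetical; it just pushes one function's examples through doctest's own parser and runner.

```python
import doctest

IDEOGRAPHIC_SPACE = 0x3000

def is_asian(char):
    r"""
    Is the character Asian?

    >>> is_asian(u'a')
    False
    >>> is_asian(u'\u65e5')
    True
    """
    return ord(char) > IDEOGRAPHIC_SPACE

def run_docstring(func):
    """Run the doctest examples in func's docstring (hypothetical helper)."""
    parser = doctest.DocTestParser()
    # Arguments: docstring, globals for the examples, test name, filename, line number
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)  # counts of failed and attempted examples
```

Calling run_docstring(is_asian) prints nothing when every example passes; failures are reported on standard output in the usual doctest format.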

Conclusion

Unicode support in Python is actually quite good, much better than in most languages. And it will get even better with Python 3.0. In the meantime, however, there are a few gotchas to look out for when using Unicode in Python.

4 comments to Notes for using Unicode with Python 2.x

  • empii

    When converting utf-8 bytes to unicode you can use the encoding utf_8_sig instead of utf-8 (on python >= 2.5). utf_8_sig skips the leading UTF-8 encoded BOM if it’s there. This nicely hides the second twist in your codecs.open example.

  • Good content.

    I also think it’s worth noting that, if you have any C extensions you’re writing you can use the “et” format arguments in your PyArg_ParseTuple (or equivalent) to keep Python from attempting to encode everything you pass to it to ascii.

    This also means that you have to PyMem_Free all of the strings you get, however.

  • @empii

    I didn’t know that — thanks for pointing it out!

    @Jack

    Also good to know. In the only C extensions I’ve written, I just required that all strings be passed in as Unicode. “et” is an elegant alternative.

  • Ali

    I want to learn Python 3.0 because it asserts that all characters are in Unicode (UTF8), but I could not understand the situation.
    Could you explain what I did wrong?

    I am using Windows XP professional version 2002 Service pack 3. AMD Athlon(TM)XP 2400+ 2.00GHz 992MB RAM.

    I have downloaded Windows x86 MSI Installer Python 3.0 (sig) (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32

    Control Panel -> System -> Advanced -> Environment Variables.
    System Variables -> Path -> edit C:\Windows\System32\Wbem;C:\Python30

    start -> programs -> python 3.0 -> IDLE(Python GUI)

    -> IDLE 3.0 -> File -> New Window -> I wrote “print(‘ğüşçöı’)” without quotes
    -> File -> Save -> Python30 -> I gave the file name “d2.py” without quotes
    -> and Run -> Run Module -> it gives the error “invalid character in identifier”

    then i tried second method

    start -> run -> cmd -> d2.py and enter it gives the error

    C:\>d2.py
    Traceback (most recent call last):
    File "C:\Python30\d2.py", line 4, in <module>
    print('\u011fü\u015fçö\u0131')
    File "C:\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
    File "C:\Python30\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 0: character maps to <undefined>

    C:\>

    But if I write >>> print(‘ğüşçöı’) in the Python Shell and press Enter,
    it gives ‘ğüşçöı’, so it works.

    I also tried the characters below in the Python Shell: >>> print(‘?????????)
    but if I first write them in Notepad and then copy and paste them, it works:
    print(‘ḍḥḫṣṭẕāīū’)

    What is wrong?
