Using chardet to convert arbitrary byte strings to Unicode

chardet is a fantastic module for finding the encoding of arbitrary byte strings. You can combine this with a check for a BOM to pretty reliably turn them into Unicode.

Edit: Thanks to Kirit's comment below, I added code to check for UTF-32.

import chardet

def bytes2unicode(bytes, errors='replace'):
    """Convert a byte string into Unicode.
    First checks for a BOM, and if one is found returns
    the Unicode text minus the BOM. If there is no BOM,
    falls back to chardet.
    """
   
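    # Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
    # The byte order is explicit because the BOM is stripped before decoding.
    encoding_map = (('\xef\xbb\xbf', 'utf-8'),
        ('\xff\xfe\0\0', 'utf-32-le'),
        ('\0\0\xfe\xff', 'utf-32-be'),
        ('\xff\xfe', 'utf-16-le'),
        ('\xfe\xff', 'utf-16-be'))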

    for bom, encoding in encoding_map:
        if bytes.startswith(bom):
            return unicode(bytes[len(bom):],
                           encoding,
                           errors=errors)
   
    # No BOM found, so use chardet
    detection = chardet.detect(bytes)
    encoding = detection.get('encoding') or 'utf-16'
    return unicode(bytes, encoding, errors=errors)

Usage:

text = bytes2unicode(open(filename, 'rb').read(), 'replace')
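
The order of the BOM table matters, because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM. Here is a rough sketch of the BOM check on its own, using the BOM constants from the standard codecs module. The helper name detect_bom is made up for illustration; it is not part of chardet or the function above.

```python
import codecs

# Longer BOMs come first: BOM_UTF32_LE starts with the same two
# bytes as BOM_UTF16_LE, so testing UTF-16 first would misfire.
BOMS = (
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)

def detect_bom(data):
    """Hypothetical helper: return (encoding, bom_length),
    or (None, 0) if the data does not start with a known BOM."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

The returned names spell out the byte order, so the decoder does not have to guess endianness once the BOM has been sliced off.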

Discussion: Why check for a BOM?

You might ask why you should check for a BOM if chardet already does this. Although chardet will correctly detect the BOM, it won't tell you that it found one, so you won't know to chop it off before processing the text. That means you'd have to check for a BOM anyway in most cases.
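
To see the problem concretely, here is a small sketch that uses only the standard library. It makes no chardet call: the encoding is simply assumed to be known as UTF-8, however it was detected, and the sample bytes are made up.

```python
import codecs

# Made-up sample: a UTF-8 BOM followed by five ASCII bytes.
raw = codecs.BOM_UTF8 + b'hello'

# Decoding with the plain 'utf-8' codec keeps the BOM in the
# result as a leading U+FEFF character.
kept = raw.decode('utf-8')

# Slicing the BOM off before decoding gives clean text.
stripped = raw[len(codecs.BOM_UTF8):].decode('utf-8')
```

Here kept is one character longer than stripped, and that extra leading character is the kind of thing that quietly breaks string comparisons and parsers later on.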

Comments

  1. March 23rd, 2008| 9:27 pm

    In theory you should probably check whether the next two bytes are zero before being sure you’re looking at UTF-16 and not UTF-32.

  2. March 23rd, 2008| 11:54 pm

    @Kirit
    You’re quite right, thanks for pointing this out. I’ve fixed the code above to check for UTF-32 BOMs as well.

    Of course, it will fail with a LookupError if the Python installation doesn’t support UTF-32, but at least you’ll know why :)
