Using chardet to convert arbitrary byte strings to Unicode

chardet is a fantastic module for guessing the encoding of arbitrary byte strings. Combine it with a check for a byte order mark (BOM) and you can turn them into Unicode pretty reliably.

Edit: Thanks to Kirit's comment below, I added code to check for UTF-32.

import chardet

def bytes2unicode(bytes, errors='replace'):
    """Convert a byte string into Unicode.
    First checks for a BOM, and if one is found returns
    the Unicode text minus the BOM. If there is no BOM,
    falls back to chardet.
    """

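    # The BOM is stripped before decoding, so each codec names its byte
    # order explicitly (plain 'utf-16'/'utf-32' would assume native order).
    # UTF-32 BOMs come before UTF-16 ones because the UTF-32 LE BOM
    # starts with the same two bytes as the UTF-16 LE BOM.
    encoding_map = (
        ('\xef\xbb\xbf', 'utf-8'),
        ('\xff\xfe\0\0', 'utf-32-le'),
        ('\0\0\xfe\xff', 'utf-32-be'),
        ('\xff\xfe', 'utf-16-le'),
        ('\xfe\xff', 'utf-16-be'),
    )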

    for bom, encoding in encoding_map:
        if bytes.startswith(bom):
            return unicode(bytes[len(bom):],
                           encoding,
                           errors=errors)

    # No BOM found, so use chardet
    detection = chardet.detect(bytes)
    encoding = detection.get('encoding') or 'utf-16'
    return unicode(bytes, encoding, errors=errors)
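
The `or 'utf-16'` fallback in the function above is deliberate. Here is a minimal sketch of why, using plain dicts as stand-ins for a detection result. The dict shapes and the helper names `pick_encoding` and `pick_encoding_default_only` are assumptions made up for this illustration; the point is only what happens when the 'encoding' key is present but set to None.

```python
# Stand-ins for a detection result; assumed shapes, not real chardet output.
detected = {'encoding': 'ascii', 'confidence': 1.0}
undetected = {'encoding': None, 'confidence': 0.0}

def pick_encoding(detection, fallback='utf-16'):
    # `or` replaces any falsy value (None, '') as well as a missing key.
    return detection.get('encoding') or fallback

def pick_encoding_default_only(detection, fallback='utf-16'):
    # A .get() default only applies when the key is missing altogether.
    return detection.get('encoding', fallback)

print(pick_encoding(undetected))
print(pick_encoding_default_only(undetected))
```

With the `or` form, a None encoding still ends up as a usable codec name. With a `.get()` default alone, None would be passed on to the decoder.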

Usage:

text = bytes2unicode(open(filename, 'rb').read(), 'replace')
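
The function above is written for Python 2, where `unicode()` exists and plain string literals are bytes. What follows is a rough sketch of the same idea for Python 3, not a drop-in replacement. The name `bytes_to_text` and the `detect` parameter are made up for this example; the parameter is there so the BOM path can be tried without chardet installed.

```python
import codecs

# UTF-32 BOMs are listed before UTF-16 ones, because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)

def bytes_to_text(data, errors='replace', detect=None):
    """Decode bytes to str: strip a BOM if present, else guess the encoding."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding, errors)

    # No BOM found. Import chardet lazily so the BOM path has no dependency.
    if detect is None:
        import chardet
        detect = chardet.detect

    encoding = detect(data).get('encoding') or 'utf-16'
    return data.decode(encoding, errors)

print(bytes_to_text(b'\xef\xbb\xbfhello'))
```

As with the original, read the file in binary mode (`open(filename, 'rb')`) so the function sees raw bytes.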

Discussion: Why check for a BOM?

You might ask: why check for a BOM if chardet already does this? Although chardet correctly uses the BOM to detect the encoding, it won't tell you that it found one, so you won't know to chop it off before processing the text. In most cases, that means you'd have to check for a BOM anyway.
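
To see what goes wrong when the BOM is left in place, here is a small Python 3 sketch that uses only the standard library. Decoding BOM-prefixed UTF-8 with the plain 'utf-8' codec leaves U+FEFF at the start of the text, while stripping the BOM first (or using Python's 'utf-8-sig' codec) does not.

```python
import codecs

raw = codecs.BOM_UTF8 + 'hello'.encode('utf-8')

# Knowing only that the data is UTF-8 is not enough: the BOM survives decoding.
naive = raw.decode('utf-8')

# Chopping the BOM off by hand before decoding gives clean text.
stripped = raw[len(codecs.BOM_UTF8):].decode('utf-8')

# The standard 'utf-8-sig' codec skips a leading BOM if one is present.
via_sig = raw.decode('utf-8-sig')

print(repr(naive))     # '\ufeffhello'
print(repr(stripped))  # 'hello'
print(repr(via_sig))   # 'hello'
```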

3 comments to Using chardet to convert arbitrary byte strings to Unicode

  • In theory you should probably check whether the next two bytes are zero before being sure you’re looking at UTF-16 rather than UTF-32.

  • @Kirit
    You’re quite right, thanks for pointing this out. I’ve fixed the code above to check for UTF-32 BOMs as well.

    Of course, it will fail with a LookupError if the Python installation doesn’t support UTF-32, but at least you’ll know why 🙂

  • Thomas Grainger

    It might be better to return:

    return unicode(
        bytes,
        chardet.detect(bytes).get('encoding', 'utf-16'),
        errors=errors
    )
