Fixing JIS mojibake with Python

JIS (iso-2022-jp) is a 7-bit Japanese text encoding. The beginning and end of a JIS sequence are marked by escape sequences, each starting with the ESC character (hex 1B):

Beginning: 1B $ B (or 1B $ @)
Ending: 1B ( J (or 1B ( B)
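
To see these markers in real data, here is a small Python 3 sketch that encodes a short string with the built-in iso-2022-jp codec and inspects the raw bytes. The sample string is only an illustration, and Python's encoder happens to use the 1B $ B / 1B ( B pair rather than the alternatives.

```python
# Encode a short Japanese string and look at the raw bytes.
sample = "日本語"  # illustrative sample, chosen for this sketch
encoded = sample.encode("iso-2022-jp")

print(encoded)

# The Japanese run is bracketed by ESC $ B ... ESC ( B
print(encoded.startswith(b"\x1b$B"))  # opening escape sequence
print(encoded.endswith(b"\x1b(B"))    # closing escape sequence
```

Everything between the two escape sequences is plain printable ASCII, which is why the text turns into readable-looking garbage once the ESC bytes are lost.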

This encoding is often used in email. Unfortunately, some email programs (and even mail routers) strip out the escape characters. This doesn't happen too much any more, but back in ancient times (the 1990s), it was pretty frequent when dealing with Japanese email. So you end up with mojibake like this:

$BCO?LC59[MQ:nF}5!(J
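
This kind of damage can be reproduced roughly as follows. This is a minimal sketch assuming the faulty program simply drops every ESC byte; the helper name strip_escapes and the sample string are made up for illustration.

```python
def strip_escapes(raw: bytes) -> str:
    """Simulate a mail program that drops ESC (0x1B) bytes."""
    # iso-2022-jp is a 7-bit encoding, so what remains is plain ASCII.
    return raw.replace(b"\x1b", b"").decode("ascii")

original = "地震"  # hypothetical sample text
mangled = strip_escapes(original.encode("iso-2022-jp"))

# The result begins with "$B" and ends with "(B", with junk in between.
print(mangled)
```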

Fortunately, the problem is fairly easy to fix: just insert the escape characters back in. Here's some Python code to do that:

import re

ESC = chr(0x1B)

def de_bakefy(text):
    """Repair JIS mojibake in text (Python 3).
    The input text is assumed to be a str.
    If the decoding from JIS fails, then
    the original text is returned.
    """

    # "$B" or "$@" opens a JIS run; "(J" or "(B" closes it.
    matches = re.findall(r"\$[B@].*?\([JB]", text, re.S)

    try:
        out_text = text
        for m in matches:
            # Put the escape characters back, then decode the run.
            sub = ESC + m[:-2] + ESC + "(J"
            sub = sub.encode("ascii").decode("iso-2022-jp")
            out_text = out_text.replace(m, sub)
        return out_text
    except UnicodeError:
        # If we couldn't decode it, then it was a bogus sequence
        return text

Here's the above mojibake, run through this function:

>>> print(de_bakefy("$BCO?LC59[MQ:nF}5!(J"))
地震探鉱用作乳機
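
One possible variation, sketched here rather than taken from the original code: use re.sub with a callback, so that each candidate run is repaired or left alone on its own. That way a single bogus match (say, a literal "$B" in a price) doesn't throw away the repairs made elsewhere in the text. The names JIS_RUN, _repair_run and repair_jis are hypothetical.

```python
import re

ESC = "\x1b"
# "$B" or "$@" opens a run; "(J" or "(B" closes it.
JIS_RUN = re.compile(r"\$[B@].*?\([JB]", re.S)

def _repair_run(match):
    """Decode one candidate run; return it unchanged if decoding fails."""
    run = match.group(0)
    candidate = ESC + run[:-2] + ESC + "(J"
    try:
        return candidate.encode("ascii").decode("iso-2022-jp")
    except UnicodeError:
        # Not a real JIS sequence, so leave the text as it was.
        return run

def repair_jis(text):
    """Repair every decodable JIS run in text, independently of the others."""
    return JIS_RUN.sub(_repair_run, text)

# Round trip: encode, drop the ESC bytes, then repair.
sample = "日本語"  # illustrative sample
mangled = sample.encode("iso-2022-jp").replace(b"\x1b", b"").decode("ascii")
print(repair_jis("Subject: " + mangled))
```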
