November 17, 2007
Fixing JIS mojibake with Python
JIS (ISO-2022-JP) is a Japanese text encoding. The beginning and end of each run of Japanese text are marked by escape sequences, written here with 1B standing for the ESC byte in hex:
Beginning: 1B $ B (or 1B $ @)
Ending: 1B ( J (or 1B ( B)
This encoding is often used in email. Unfortunately, some email programs (and even mail routers) strip out the ESC (1B) bytes and leave the rest of each escape sequence behind. This doesn't happen too much any more, but back in ancient times (the 1990s), it was pretty frequent when dealing with Japanese email. So you end up with mojibake like this:
$BCO?LC59[MQ:nF}5!(J
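To see where a string like this comes from, here is a small Python 3 sketch that encodes some Japanese text as ISO-2022-JP and then drops the ESC bytes, the way a misbehaving mail program might. The variable names are only illustrative. Python's codec closes the run with ESC ( B rather than ESC ( J, so the result differs from the sample above only in its last character.

```python
# Encode Japanese text as ISO-2022-JP, then strip the ESC bytes.
original = "地震探鉱用作乳機"
encoded = original.encode("iso-2022-jp")

# The codec enters JIS X 0208 with ESC $ B and returns to ASCII with ESC ( B.
assert encoded.startswith(b"\x1b$B")
assert encoded.endswith(b"\x1b(B")

# Without the ESC bytes, what remains is plain ASCII that reads as garbage.
stripped = encoded.replace(b"\x1b", b"").decode("ascii")
print(stripped)  # $BCO?LC59[MQ:nF}5!(B
```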
Fortunately, the problem is fairly easy to fix: just insert the escape characters back in. Here's some Python code to do that:
import re

def de_bakefy(text):
    """Repair JIS mojibake in text.

    The input text is assumed to be a utf-8 byte string (Python 2 str).
    If the decoding from JIS fails, then
    the original text is returned."""
    # Find runs that look like JIS with the ESC bytes stripped:
    # "$B" or "$@", then anything, then "(J" or "(B"
    matches = re.findall(r"\$[B@].*?\([JB]", text, re.S)
    try:
        out_text = text
        for m in matches:
            # Put the ESC (0x1B) bytes back, always closing with ESC ( J
            sub = chr(0x1B) + m[:-2] + chr(0x1B) + "(J"
            sub = sub.decode("iso-2022-jp").encode("utf8")
            out_text = out_text.replace(m, sub)
        return out_text
    except UnicodeDecodeError:
        # If we couldn't decode it, then it was a bogus sequence
        return text
Here's the above mojibake, run through this function:
地震探鉱用作乳機
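The function above relies on str.decode, which exists only in Python 2. For Python 3, the same idea might be sketched as below. This is an adaptation under stated assumptions, not the original code: the name repair_jis_mojibake is hypothetical, the input is assumed to be a str rather than utf-8 bytes, and a run that fails to decode is left alone on its own instead of causing the whole text to be returned unchanged.

```python
import re

ESC = "\x1b"

# Hypothetical Python 3 variant of de_bakefy; operates on str, not bytes.
def repair_jis_mojibake(text: str) -> str:
    """Reinsert stripped ESC characters and decode the JIS runs in text."""

    def fix(match):
        run = match.group(0)
        # Restore ESC before the opening "$B"/"$@" and close with ESC ( J.
        candidate = ESC + run[:-2] + ESC + "(J"
        try:
            return candidate.encode("ascii").decode("iso-2022-jp")
        except UnicodeError:
            # Not a valid JIS run after all; keep the original characters.
            return run

    # "$B" or "$@", then anything (non-greedy), then "(J" or "(B".
    return re.sub(r"\$[B@].*?\([JB]", fix, text, flags=re.S)


print(repair_jis_mojibake("$BCO?LC59[MQ:nF}5!(J"))  # 地震探鉱用作乳機
```

Using re.sub with a callback replaces each run where it was found, which avoids a second search through the text with str.replace.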