November 19, 2007
Parsing multilingual email with Python
The email module in the Python standard library provides just about everything you need to parse multilingual emails with Python. There are a few traps, however, that can catch the unaware and unwary.
Parsing an email message
The email module provides a couple of handy functions for parsing email: message_from_string and message_from_file. Both functions return a Message instance, taking a string and a file as input, respectively. Below, I'll assume that you've got yourself an email message using one of these functions.
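As a minimal sketch of these two entry points, assuming Python 3 (where both functions still exist under the same names) and a made-up message, the following parses the same text once from a string and once from a file-like object. The variable names and the sample message are illustrative only.

```python
import email
import io

# A tiny, made-up message used only for illustration.
raw = "To: honyaku@googlegroups.com\nSubject: Test\n\nTest!\n"

# Parse directly from a string.
msg_from_str = email.message_from_string(raw)

# Parse from any file-like object opened in text mode; io.StringIO
# stands in for a real file here.
msg_from_file = email.message_from_file(io.StringIO(raw))

print(msg_from_str["To"])
print(msg_from_file["Subject"])
```

Either way you end up with the same kind of Message object, so the rest of the code does not need to care where the text came from.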
The headers
"Internationalized" headers (headers containing non-ASCII characters) use a special 7-bit encoding defined by RFC 2047; RFC 2822 itself only allows ASCII in headers. What this means is that headers with Japanese and other non-ASCII text will have funny-looking values like this:
From: =?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?=
To: honyaku@googlegroups.com
Subject: =?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?=

Test!
You can access the headers of an email like a dictionary. So, given the above "email" as a string named text, we get the headers like this:
>>> import email
>>> msg = email.message_from_string(text)
>>> msg["from"]
'=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?='
>>> msg["to"]
'honyaku@googlegroups.com'
>>> msg["subject"]
'=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?='
>>> print msg["nonexistent-field"]
None
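The dictionary-style access shown above can also be sketched for Python 3 (an assumption; the session above is Python 2). The sample message is made up, and the point is only that lookups are case-insensitive and that a missing header gives None rather than raising KeyError.

```python
import email

# Made-up message reusing the encoded Subject from the example above.
raw = (
    "To: honyaku@googlegroups.com\n"
    "Subject: =?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?=\n"
    "\n"
    "Test!\n"
)
msg = email.message_from_string(raw)

to_lower = msg["to"]                          # header names are case-insensitive
to_upper = msg["TO"]
missing = msg["nonexistent-field"]            # None, not a KeyError
fallback = msg.get("nonexistent-field", "")   # explicit default instead of None
names = msg.keys()                            # header names in message order
raw_subject = msg["subject"]                  # still in its encoded form
```

Note that the Subject value is returned exactly as it appears in the message; decoding it is a separate step.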
The email.header module provides a convenient function for decoding this gobbledygook, named, appropriately enough, decode_header. It takes an internationalized header and returns a list of pairs, each consisting of a text string and a charset name (the charset is None for sections that were not encoded). We can then glue these strings back together to reconstruct the header as a Unicode string. Here's a function that does that:
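from email.header import decode_header

def getheader(header_text, default="ascii"):
    """Decode the specified header into a Unicode string"""
    # decode_header returns (text, charset) pairs; charset is None
    # for sections that were not encoded, so fall back to the default
    headers = decode_header(header_text)
    header_sections = [unicode(text, charset or default)
                       for text, charset in headers]
    return u"".join(header_sections)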
Here's the code in action:
>>> print getheader(msg["subject"])
本日は晴天なり
>>> text = "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?="
>>> print getheader(text)
桃太郎 <momo@taro.ne.jp>
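For readers on Python 3, here is a rough equivalent of the function above. It is a sketch under the assumption that decode_header returns bytes for encoded sections and a plain str when the header contains no encoded words at all, so both cases are handled. The name get_header is mine, not part of the standard library.

```python
from email.header import decode_header


def get_header(header_text, default="ascii"):
    """Decode an RFC 2047 header value into a str (Python 3 sketch)."""
    sections = []
    for text, charset in decode_header(header_text):
        # Encoded sections arrive as bytes and need decoding; a header
        # with no encoded words arrives as an ordinary str.
        if isinstance(text, bytes):
            text = text.decode(charset or default, "replace")
        sections.append(text)
    return "".join(sections)


subject = get_header("=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?=")
sender = get_header("=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?=")
plain = get_header("honyaku@googlegroups.com")

print(subject)
print(sender)
print(plain)
```

The "replace" error handler is a design choice: a header with a wrong or missing charset produces replacement characters instead of an exception.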
The body
To get the text part of the email body, use the Message.get_payload method. But there's a catch: if the email is single-part (like the example above), then get_payload will return a string. If the message is multipart, get_payload returns a list of Message objects, one per part, so you've got to specify which part you want (by index, or by iterating over the parts). To find out whether a message is multipart, use its is_multipart method.
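To make the single-part versus multipart difference concrete, here is a small Python 3 sketch (Python 3 is an assumption, and both sample messages and the boundary string "XYZ" are made up for illustration).

```python
import email

# A made-up single-part message.
single = email.message_from_string(
    "Subject: plain\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Hello\n"
)

# A made-up multipart message with a plain-text and an HTML alternative.
multi = email.message_from_string(
    "Subject: two parts\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Hello\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=us-ascii\n"
    "\n"
    "<p>Hello</p>\n"
    "--XYZ--\n"
)

single_payload = single.get_payload()   # a str holding the body text
parts = multi.get_payload()             # a list of Message objects
first_part = multi.get_payload(0)       # one part selected by index
part_types = [part.get_content_type() for part in parts]

print(single.is_multipart(), multi.is_multipart())
print(part_types)
```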
Luckily, the email.iterators module (spelled email.Iterators in Python versions before 2.5) gives us the handy function typed_subpart_iterator, which we can use to iterate through the payloads we want (in this case those of type "text/plain").
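Another thing to watch out for is how the encoding of the email is declared. There are two methods that report the encoding: get_content_charset and get_charset. get_content_charset is preferred: it reads the charset parameter of the Content-Type header and returns its name as a string. get_charset returns the Charset object attached to the message with set_charset, which is usually None for a message you have just parsed. When you have a multipart email, each payload may have its own encoding. Furthermore, some email programs don't bother to set any encoding at all. To be robust, you've got to be prepared for all of this.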
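The difference between a declared and an undeclared charset can be sketched as follows, assuming Python 3 and two made-up messages. The variable names are illustrative only.

```python
import email

# One message declares its charset, the other has no Content-Type at all.
declared = email.message_from_string(
    "Content-Type: text/plain; charset=ISO-2022-JP\n\nTest!\n"
)
undeclared = email.message_from_string("Subject: no charset\n\nTest!\n")

declared_charset = declared.get_content_charset()       # name from Content-Type
undeclared_charset = undeclared.get_content_charset()   # nothing declared
with_default = undeclared.get_content_charset("ascii")  # caller-supplied fallback
charset_object = declared.get_charset()                 # set only via set_charset()

print(declared_charset, undeclared_charset, with_default, charset_object)
```

Passing a fallback to get_content_charset is one way to cope with mail programs that declare nothing.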
Here are a couple of functions that pull this all together.
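def get_charset(message, default="ascii"):
    """Get the message charset"""
    # Preferred: the charset parameter of the Content-Type header
    if message.get_content_charset():
        return message.get_content_charset()
    # Fallback: a Charset object attached with set_charset();
    # return its name, since unicode() expects a string
    if message.get_charset():
        return message.get_charset().input_charset
    return default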
from email.iterators import typed_subpart_iterator  # email.Iterators before Python 2.5

def get_body(message):
    """Get the body of the email message as a Unicode string"""
    if message.is_multipart():
        # get the plain text version only
        body = []
        for part in typed_subpart_iterator(message, 'text', 'plain'):
            # fall back to the top-level charset if the part declares none
            charset = get_charset(part, get_charset(message))
            # decode=True undoes base64/quoted-printable on each part
            body.append(unicode(part.get_payload(decode=True),
                                charset,
                                "replace"))
        return u"\n".join(body).strip()
    else:
        # if it is not multipart, the payload will be a string
        # representing the message body
        body = unicode(message.get_payload(decode=True),
                       get_charset(message),
                       "replace")
        return body.strip()
Calling get_payload with decode=True is very important, because it undoes the Content-Transfer-Encoding (base64 or quoted-printable) for you. One caveat: calling get_payload with decode set to True on a multipart message itself returns None, so use it only on single-part messages or on the individual parts.
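Pulling the body handling together for Python 3 might look like the sketch below. It is not a drop-in replacement for the functions above: it assumes Python 3, where get_payload(decode=True) returns bytes, and it uses a made-up multipart message whose plain-text part is base64-encoded ISO-2022-JP.

```python
import email
from email.iterators import typed_subpart_iterator


def get_charset(message, default="ascii"):
    """Return the declared charset name, or the default if none is set."""
    return message.get_content_charset() or default


def get_body(message):
    """Return the text/plain body of a message as a str (Python 3 sketch)."""
    if message.is_multipart():
        top_charset = get_charset(message)
        texts = []
        for part in typed_subpart_iterator(message, "text", "plain"):
            # decode=True undoes base64/quoted-printable and yields bytes.
            data = part.get_payload(decode=True)
            texts.append(data.decode(get_charset(part, top_charset), "replace"))
        return "\n".join(texts).strip()
    data = message.get_payload(decode=True)
    return data.decode(get_charset(message), "replace").strip()


# Made-up multipart message; the boundary "XYZ" is illustrative.
raw = (
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=iso-2022-jp\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "GyRCS1xGfCRPQDJFNyRKJGobKEI=\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=us-ascii\n"
    "\n"
    "<p>ignored</p>\n"
    "--XYZ--\n"
)
msg = email.message_from_string(raw)
body = get_body(msg)
simple_body = get_body(email.message_from_string("Subject: hi\n\nTest!\n"))

print(body)
print(simple_body)
```

Only the text/plain part contributes to the result; the HTML alternative is skipped by typed_subpart_iterator.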
Now you're ready to parse multilingual email for your own nefarious purposes (like maintaining an archive of email messages…)
Thanks a lot for your tip about decode_header, it was very useful to me.
However, decode_header fails to recognize the encoding if it's inside quotation marks (e.g. '"=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?="'). This is common in the Gmail "From" header… Removing the quotation marks solves the problem.
The examples here are great! They work out of the box, and they give you an idea of where to start in order to look deeper into things.
Thanks a lot!
@muriloq - Thanks for the tip. I am not seeing this behavior, however:
>>> t = decode_header("=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?=")[0]
>>> print unicode(t[0], t[1])
本日は晴天なり
>>> rq, t, lq = decode_header('"=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?="')
>>> print unicode(t[0], t[1])
本日は晴天なり
I’m using Python 2.5.1. Are you using an older version by any chance?