Parsing multilingual email with Python
The email module in the Python standard library provides just about everything you need to parse multilingual emails with Python. There are a few traps, however, that can catch the unaware and unwary.
Parsing an email message
The email module provides a couple of handy functions for parsing email: message_from_string and message_from_file. Both functions return a Message instance, taking a string and a file as input, respectively. Below, I'll assume that you've got yourself an email message using one of these functions.
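As a rough sketch of both entry points (written for Python 3, whereas the snippets in this post are Python 2), the following parses the same message once from a string and once from a file-like object. The sample message and its addresses are made up for illustration, and io.StringIO stands in for a real file opened in text mode.

```python
import email
import io

# Hypothetical sample message, used only for illustration.
RAW = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
)

# Parse from a string.
msg_from_string = email.message_from_string(RAW)

# Parse from a file-like object; a real file opened in text mode
# should behave the same way.
msg_from_file = email.message_from_file(io.StringIO(RAW))

print(type(msg_from_string).__name__)
print(msg_from_file["Subject"])
```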
"Internationalized" headers — headers containing other than ASCII characters — use a special 7-bit encoding defined by RFC 2822. What this means is that headers with Japanese and other text will have funny-looking values like this:
You can access the headers of an email like a dictionary. So, given the above "email" as a string named text, we get the headers like this:
>>> print msg["nonexistent-field"]
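Here is a small Python 3 sketch of dictionary-style header access. The message is a stand-in I made up: it reuses the encoded value that appears later in this post and assumes, for illustration only, that it sits in a From header.

```python
import email

# Hypothetical message; placing the encoded value in From is an assumption.
RAW = (
    "From: =?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?=\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(RAW)

# Header names are matched case-insensitively.
print(msg["subject"])

# The value comes back still encoded; decoding is a separate step.
print(msg["from"])

# A missing header gives None rather than raising KeyError.
print(msg["nonexistent-field"])

# get() lets you supply your own fallback.
print(msg.get("nonexistent-field", "fallback"))
```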
The email.header module provides a convenient function for decoding this gobbledygook, named, appropriately enough, decode_header. decode_header takes an internationalized header and returns a list of pairs, each pair consisting of a chunk of text (still a raw byte string) and its encoding, or None if that chunk was not encoded. We can then decode these chunks and glue them back together to reconstruct the header as a Unicode string. Here's a function that does that:
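from email.header import decode_header

def getheader(header_text, default="ascii"):
    """Decode the specified header"""
    # decode_header returns a list of (text, charset) pairs
    headers = decode_header(header_text)
    # charset is None for unencoded sections, so fall back to the default
    header_sections = [unicode(text, charset or default)
                       for text, charset in headers]
    return u"".join(header_sections)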
Here's the code in action:
>>> print getheader(text)
>>> text = "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?="
>>> print getheader(text)
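For readers on Python 3, where unicode() no longer exists, a comparable sketch might look like the one below. The name getheader_py3 and the "replace" error handling are my own choices, not part of the original function. In Python 3, decode_header hands back bytes for encoded sections and a plain str when the header has no encoded sections at all, so both cases are handled.

```python
from email.header import decode_header

def getheader_py3(header_text, default="ascii"):
    """Decode an RFC 2047 header value into a str (Python 3 sketch)."""
    sections = []
    for text, charset in decode_header(header_text):
        if isinstance(text, bytes):
            # Encoded sections arrive as bytes plus the declared charset
            # (None if the section was not encoded).
            sections.append(text.decode(charset or default, "replace"))
        else:
            # A header with no encoded sections arrives as a plain str.
            sections.append(text)
    return "".join(sections)

encoded = "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?="
decoded = getheader_py3(encoded)
print(decoded)  # a Japanese display name followed by <momo@taro.ne.jp>
```

The standard library also offers email.header.make_header, which can turn the output of decode_header into a Header object; the hand-rolled loop above just keeps the decoding steps visible.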
To get the text part of the email body, use the Message.get_payload method. But there's a catch: if the email is single-part (like the example above), then get_payload will return a string. If the message is multipart, then you've got to specify which part of the payload you want (or use a method to iterate the payloads). To find out if the message is multipart, use its is_multipart method.
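The difference in return types is easy to see with two tiny messages. This is a Python 3 sketch; the messages and the XYZ boundary are invented for illustration.

```python
import email

# Hypothetical single-part and multipart messages.
SINGLE = "Content-Type: text/plain\n\nhello\n"
MULTI = (
    "Content-Type: multipart/alternative; boundary=XYZ\n"
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain text\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html</p>\n"
    "--XYZ--\n"
)

single = email.message_from_string(SINGLE)
multi = email.message_from_string(MULTI)

# Single-part: the payload is the body text itself.
print(single.is_multipart(), type(single.get_payload()).__name__)

# Multipart: the payload is a list of Message objects, one per part.
print(multi.is_multipart(), type(multi.get_payload()).__name__)

# Passing an index picks out one part.
first_part = multi.get_payload(0)
print(first_part.get_content_type())
```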
Luckily, the email.iterators module (spelled email.Iterators in Python 2.4 and earlier) gives us the handy function typed_subpart_iterator, which we can use to iterate through the payloads we want (in this case those of type "text/plain").
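A minimal Python 3 sketch of typed_subpart_iterator, using a made-up two-part message: asking for "text" and "plain" yields only the plain-text part, while asking for "text" alone yields every text part.

```python
import email
from email.iterators import typed_subpart_iterator

# Hypothetical message with a plain-text and an HTML alternative.
MULTI = (
    "Content-Type: multipart/alternative; boundary=XYZ\n"
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain text\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html</p>\n"
    "--XYZ--\n"
)
msg = email.message_from_string(MULTI)

# Only the parts whose content type is text/plain.
plain_parts = list(typed_subpart_iterator(msg, "text", "plain"))

# Leaving out the subtype matches any text/* part.
all_text_parts = list(typed_subpart_iterator(msg, "text"))

print(len(plain_parts), len(all_text_parts))
```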
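Another thing to watch out for is how the encoding of the email is declared. The Message class has two methods that report it: get_content_charset and get_charset. get_content_charset is preferred, since it reads the charset parameter of the Content-Type header, whereas get_charset returns the Charset object attached to the payload, which is usually None for a message that was parsed rather than built in code. When you have a multipart email, each payload may have its own encoding. Furthermore, some email programs don't bother to set any encoding at all. To be robust, you've got to be prepared for all of this.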
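To make the charset situation concrete, here is a Python 3 sketch with an invented multipart message in which one part declares a charset and the other does not.

```python
import email

# Hypothetical message: the first part declares a charset, the second does not.
RAW = (
    "Content-Type: multipart/mixed; boundary=XYZ\n"
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="ISO-2022-JP"\n'
    "\n"
    "part one\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "part two\n"
    "--XYZ--\n"
)
msg = email.message_from_string(RAW)
declared, undeclared = msg.get_payload()

# The declared charset comes back lower-cased.
print(declared.get_content_charset())

# With no charset parameter the result is None, or the fallback you pass in.
print(undeclared.get_content_charset())
print(undeclared.get_content_charset("ascii"))

# get_charset() is None here because the message was parsed, not built.
print(declared.get_charset())
```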
Here are a couple of functions that pull this all together.
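from email.iterators import typed_subpart_iterator

def get_charset(message, default="ascii"):
    """Get the message charset"""
    charset = message.get_content_charset() or message.get_charset()
    return str(charset) if charset else default

def get_body(message):
    """Get the body of the email message"""
    if message.is_multipart():
        #get the plain text version only
        text_parts = [part
                      for part in typed_subpart_iterator(message,
                                                         "text", "plain")]
        body = []
        for part in text_parts:
            charset = get_charset(part, get_charset(message))
            body.append(unicode(part.get_payload(decode=True), charset))
        return u"\n".join(body)
    else: # if it is not multipart, the payload will be a string
          # representing the message body
        body = unicode(message.get_payload(decode=True),
                       get_charset(message))
        return body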
decode=True is very important when you've got a single-part email, because it makes get_payload undo the Content-Transfer-Encoding (base64 or quoted-printable) for you. Another caveat is that setting decode to True on a multipart message will cause get_payload to return None…
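Both behaviours can be checked with a short Python 3 sketch. The base64 message below is built on the spot from a sample string of my own choosing; in Python 3, get_payload(decode=True) returns bytes, which still have to be decoded with the declared charset.

```python
import base64
import email

# Build a hypothetical base64-encoded, single-part UTF-8 message.
text = "こんにちは"
b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
RAW = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + b64 + "\n"
)
msg = email.message_from_string(RAW)

# Without decode=True you get the base64 text as it appears on the wire.
raw_payload = msg.get_payload()

# With decode=True the transfer encoding is undone, leaving raw bytes.
decoded_bytes = msg.get_payload(decode=True)
body = decoded_bytes.decode(msg.get_content_charset() or "ascii")
print(body)

# On a multipart message, decode=True yields None.
multi = email.message_from_string(
    "Content-Type: multipart/mixed; boundary=XYZ\n\n--XYZ\n\npart\n--XYZ--\n"
)
print(multi.get_payload(decode=True))
```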
Edit: Following a suggestion in the comments, I changed the get_payload call on the parts, setting decode=True.
Now you're ready to parse multilingual email for your own nefarious purposes (like maintaining an archive of email messages…)