Parsing multilingual email with Python

The email module in the Python standard library provides just about everything you need to parse multilingual emails with Python. There are a few traps, however, that can catch the unaware and unwary.

Parsing an email message

The email module provides a couple of handy functions for parsing email: message_from_string and message_from_file. Both functions return a Message instance, taking a string and a file as input, respectively. Below, I'll assume that you've got yourself an email message using one of these functions.
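As a minimal sketch of both entry points, assuming Python 3 (the sample message is made up for illustration, and io.StringIO stands in for a real file opened in text mode):

```python
import email
import io

# A made-up, ASCII-only message used purely for illustration.
raw = (
    "From: sender@example.com\n"
    "To: honyaku@googlegroups.com\n"
    "Subject: Test\n"
    "\n"
    "Test!\n"
)

# Parse from a string.
msg_from_string = email.message_from_string(raw)

# Parse from a file-like object.
msg_from_file = email.message_from_file(io.StringIO(raw))

print(msg_from_string["To"])
print(msg_from_file.get_payload())
```

Both calls should yield equivalent Message objects, so the rest of the parsing code need not care where the text came from.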

The headers

"Internationalized" headers (headers containing non-ASCII characters) use a special 7-bit encoding, the "encoded-word" syntax defined by RFC 2047. What this means is that headers with Japanese and other non-ASCII text will have funny-looking values like this:

From: =?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?=
To: honyaku@googlegroups.com
Subject: =?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?=

Test!

You can access the headers of an email like a dictionary (lookups are case-insensitive). So, given the above "email" as a string named text, we get the headers like this:

>>> import email
>>> msg = email.message_from_string(text)
>>> msg["from"]
'=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?='
>>> msg["to"]
'honyaku@googlegroups.com'
>>> msg["subject"]
'=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?='
>>> print msg["nonexistent-field"]
None
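The dictionary-style access has a few conveniences worth knowing. The sketch below assumes Python 3 and a made-up message with a repeated Received header. It shows case-insensitive lookup, a fallback value via get, and get_all for headers that occur more than once:

```python
import email

# Made-up message with a header that appears twice.
raw = (
    "Received: from a.example.com\n"
    "Received: from b.example.com\n"
    "To: honyaku@googlegroups.com\n"
    "\n"
    "Test!\n"
)
msg = email.message_from_string(raw)

to_lower = msg["to"]                     # header names are case-insensitive
to_upper = msg["TO"]
missing = msg["nonexistent-field"]       # None rather than a KeyError
fallback = msg.get("nonexistent-field", "(none)")
received = msg.get_all("Received")       # every value, in message order
```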

The email.header module provides a convenient function for decoding this gobbledygook named, appropriately enough, decode_header. decode_header takes an internationalized header and returns a list of pairs, each pair consisting of a string and the name of its character set (None for parts that were not encoded). We can then glue these strings back together to reconstruct the header as a Unicode string. Here's a function that does that:

from email.header import decode_header

def getheader(header_text, default="ascii"):
    """Decode the specified header"""

    headers = decode_header(header_text)
    header_sections = [unicode(text, charset or default)
                       for text, charset in headers]
    return u"".join(header_sections)

Here's the code in action:

>>> text = "=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?="
>>> print getheader(text)
本日は晴天なり
>>> text = "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA+?="
>>> print getheader(text)
桃太郎 <momo@taro.ne.jp>
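The getheader function above targets Python 2, where unicode() exists. On Python 3, decode_header returns bytes for headers that contain encoded words and a plain str otherwise, so a port has to handle both. A minimal sketch under that assumption (getheader_py3 is a hypothetical name, not part of the email package):

```python
from email.header import decode_header


def getheader_py3(header_text, default="ascii"):
    """Decode an encoded-word header into a str (Python 3 sketch)."""
    sections = []
    for text, charset in decode_header(header_text):
        # Chunks arrive as bytes when the header contained encoded
        # words, and as str when it did not.
        if isinstance(text, bytes):
            text = text.decode(charset or default, "replace")
        sections.append(text)
    return "".join(sections)


print(getheader_py3("=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?="))
```

An unrecognized charset name would still raise LookupError here; a more defensive version might catch that and fall back to the default.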

The body

To get the text part of the email body, use the Message.get_payload method. But there's a catch: if the email is single-part (like the example above), then get_payload will return a string. If the message is multipart, then you've got to specify which part of the payload you want (or use a method to iterate over the payloads). To find out if the message is multipart, use its is_multipart method.
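To make the difference concrete, here is a small sketch assuming Python 3. The two messages are made up: one is parsed from a string, the other built with the email.mime helper classes.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Single-part message: the payload is a string.
single = email.message_from_string("Subject: hi\n\njust text\n")

# Multipart message: the payload is a list of sub-messages.
multi = MIMEMultipart("alternative")
multi.attach(MIMEText("plain version", "plain", "us-ascii"))
multi.attach(MIMEText("<p>html version</p>", "html", "us-ascii"))

single_body = single.get_payload()   # a str
parts = multi.get_payload()          # a list of Message objects
first = multi.get_payload(0)         # a single part, selected by index

print(single.is_multipart(), multi.is_multipart())
```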

Luckily, the email.iterators module (spelled email.Iterators before Python 2.5) gives us the handy function typed_subpart_iterator, which we can use to iterate through the payloads we want (in this case those of type "text/plain").
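A quick sketch of typed_subpart_iterator on its own, assuming Python 3 and a made-up two-part message, might look like this:

```python
from email.iterators import typed_subpart_iterator
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Made-up message carrying a plain-text and an HTML alternative.
msg = MIMEMultipart("alternative")
msg.attach(MIMEText("plain body", "plain", "utf-8"))
msg.attach(MIMEText("<p>html body</p>", "html", "utf-8"))

# Keep only the parts whose content type is text/plain.
plain_parts = list(typed_subpart_iterator(msg, "text", "plain"))

for part in plain_parts:
    print(part.get_content_type())
```

The iterator walks the whole message tree, so nested multipart containers are searched as well.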

Another thing to watch out for is how the encoding of the email is declared. There are two methods for finding the encoding: get_content_charset and get_charset. get_content_charset, which reads the charset parameter of the Content-Type header, is preferred; get_charset returns a Charset object rather than a string, and is normally None for a message that was parsed rather than built. When you have a multipart email, each payload may have its own encoding. Furthermore, some email programs don't bother to set any encoding at all. To be robust, you've got to be prepared for all of this.
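The difference between the two methods is easy to see on a parsed message. The sketch below assumes Python 3 and uses two made-up messages, one with a Content-Type header and one without:

```python
import email

raw = (
    'Content-Type: text/plain; charset="ISO-2022-JP"\n'
    "\n"
    "Test!\n"
)
msg = email.message_from_string(raw)

declared = msg.get_content_charset()   # read from Content-Type, lowercased
charset_obj = msg.get_charset()        # nothing was set while parsing

# A message with no Content-Type header at all.
bare = email.message_from_string("Subject: x\n\nbody\n")
undeclared = bare.get_content_charset()
with_default = bare.get_content_charset("ascii")

print(declared, charset_obj, undeclared, with_default)
```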

Here are a couple of functions that pull this all together.

from email.iterators import typed_subpart_iterator

def get_charset(message, default="ascii"):
    """Get the message charset"""

    if message.get_content_charset():
        return message.get_content_charset()

    if message.get_charset():
        # get_charset returns a Charset object; str() gives its name
        return str(message.get_charset())

    return default

def get_body(message):
    """Get the body of the email message"""

    if message.is_multipart():
        #get the plain text version only
        text_parts = [part
                      for part in typed_subpart_iterator(message,
                                                         'text',
                                                         'plain')]
        body = []
        for part in text_parts:
            charset = get_charset(part, get_charset(message))
            body.append(unicode(part.get_payload(decode=True),
                                charset,
                                "replace"))

        return u"\n".join(body).strip()

    else: # if it is not multipart, the payload will be a string
          # representing the message body
        body = unicode(message.get_payload(decode=True),
                       get_charset(message),
                       "replace")
        return body.strip()

Calling get_payload with decode=True is very important when you've got a single-part email, because that will decode base64- or quoted-printable-encoded emails for you. One caveat is that calling get_payload with decode=True on a multipart message itself (rather than on its parts) returns None.
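Both behaviors can be checked with a short sketch, assuming Python 3. The base64 message and the multipart container are made up for illustration:

```python
import base64
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Made-up single-part message whose body is base64-encoded UTF-8.
body = "本日は晴天なり"
encoded = base64.b64encode(body.encode("utf-8")).decode("ascii")
raw = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded + "\n"
)
msg = email.message_from_string(raw)

undecoded = msg.get_payload()            # still the base64 text
decoded = msg.get_payload(decode=True)   # bytes, transfer encoding removed
text = decoded.decode(msg.get_content_charset("ascii"), "replace")

# On a multipart container, decode=True gives None; decode the parts instead.
container = MIMEMultipart()
container.attach(MIMEText("part body", "plain", "utf-8"))
container_payload = container.get_payload(decode=True)
part_bytes = container.get_payload(0).get_payload(decode=True)
```

Note that on Python 3 the decoded payload is bytes, so the charset still has to be applied afterwards, just as unicode() does in the functions above.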

Edit: Following a suggestion in the comments, I changed the get_payload call on the parts, setting decode=True.

Now you're ready to parse multilingual email for your own nefarious purposes (like maintaining an archive of email messages…)

16 comments to Parsing multilingual email with Python

  • Thanks a lot for your tip about decode_header, it was very useful to me.

    However, decode_header fails to recognize the encoding if it’s inside quotation marks (e.g. '"=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?="'). This is common in Gmail “From” headers. Removing the quotation marks solves the problem.

  • The examples here are great! They work out of the box, and they give you an idea of where to start if you want to dig deeper.

    Thanks a lot!

  • @muriloq – Thanks for the tip. I am not seeing this behavior, however:

    >>> t = decode_header("=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?=")[0]
    >>> print unicode(t[0], t[1])
    本日は晴天なり
    >>> rq, t, lq = decode_header('"=?iso-2022-jp?b?GyRCS1xGfCRPQDJFNyRKJGobKEI=?="')
    >>> print unicode(t[0], t[1])
    本日は晴天なり

    I’m using Python 2.5.1. Are you using an older version by any chance?

  • Very good! Thanks for the snippet.

    You may want to use get_payload(decode=True) in the multipart mail processing section though. It took me a little time to figure out why I was getting weird encoding in that case.

    For people interested in attached documents, take a look at the documentation: http://docs.python.org/lib/node161.html (last example). There is a piece of code used to retrieve attachments.

    That was the only thing missing from this excellent post 🙂

  • Thanks for the information, Jeremy. In my testing, get_payload was returning None for multipart messages with decode=True, but I’ll try some more testing.

    I’m sure the information on handling attachments will be useful as well.

  • From the documentation of get_payload:
    “If the message is a multipart and the decode flag is True, then None is returned”
    So if your message is multipart, calling get_payload on that message will return None, which is quite logical, given that every part can have a different encoding (or, to be more accurate, a different transfer encoding: none, base64 or quoted-printable). But using get_payload on each part should not be an issue, unless they are multipart too (in which case it would just be necessary to re-run the get_body function that you wrote on that subpart). However, I do not fully understand how email works, so I’m going to do some testing on that and will post the results here later.

  • Thank you, this information was very helpful for me! 😉

  • @Jeremy

    You were right, thanks. Setting “decode=True” in the get_payload function of the parts is the right thing to do.

  • Igor Serko

    Thanks for the article. It helped me a lot with a problem I have.

    I’d like to continue where muriloq stopped.
    Gmail’s From field is usually constructed as follows:
    '"Pretty name" '

    The problem with decode_header is that it won’t work when only the “Pretty name” is encoded. Ex. '"=?UTF-8?Q?Igor_=C5=A0erko?=" '

    In this case I used the following:

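    import re
    from email.header import decode_header

    def get_from(matchobj):
        # strip the surrounding quotes, then decode the first chunk
        txt = decode_header(matchobj.group(0)[1:-1])[0]
        txt = txt[0].decode(txt[1]) if txt[1] else txt[0]
        return u'"%s"'%txt

    mail_from = re.sub(r'"[^"]+"', get_from, mail_from)
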
  • @Igor

    Thanks for the code, Igor. I’m not sure exactly what the problem is, though. I get this for your string:
    >>> decode_header('"=?UTF-8?Q?Igor_=C5=A0erko?="')
    [('"', None), ('Igor \xc5\xa0erko', 'utf-8'), ('"', None)]

  • Thanks for the code and the explanations!

  • Thank you, we’ve been having tons of trouble with weird Unicode issues. I think you may have just solved them for us!

  • Excellent post!

    Here’s a little problem I ran into… The parameters in content-type weren’t properly separated by “;”… here’s a slightly more robust version (and it doesn’t make any double calls):

    def get_charset(message, default="ascii"):
        """Get the message charset"""

        charset = message.get_content_charset()
        if not charset:
            charset = message.get_charset()

        if charset:
            # get_charset returns a Charset object; str() gives its name
            charset = str(charset)
            if charset.find('"')>0:
                charset = charset[:charset.find('"')]
            return charset
        return default
    

    PS. I also ran into Igor’s problem

  • The email field is quite problematic, I think I’ve nailed it now:

    import re
    from email import header, utils
    from email.errors import HeaderParseError

    def get_multilingual_header(header_text, default="ascii"):
        if header_text is not None:
            try:
                headers = header.decode_header(header_text)
            except HeaderParseError:
                return u"Error"

            try:
                header_sections = [unicode(text, charset if charset and charset != 'unknown' else default, errors='replace')
                                   for text, charset in headers]
            except LookupError:
                # unrecognized charset name: fall back to the default
                header_sections = [unicode(text, default, errors='replace')
                                   for text, charset in headers]

            return u"".join(header_sections)
        else:
            return None

    def decode_email(raw_email):
        # flatten line breaks and collapse doubled spaces
        raw_email = raw_email.replace('\r', ' ').replace('\n', ' ').replace('  ', ' ')
        if re.match(r'=\?.*?\?[QqBb]\?.*\?=$', raw_email):
            name, email = utils.parseaddr(get_multilingual_header(raw_email))
        else:
            name, email = utils.parseaddr(raw_email)
            name = get_multilingual_header(name)
            email = get_multilingual_header(email)

        decoded_email = utils.formataddr((name, email))
        return decoded_email


    This worked for all my test cases, in particular:

    print decode_email('=?UTF-8?B?5qGD5aSqLCDpg44=?= ')

    This is how Gmail sends it these days, and if you decode the whole string first you’re in trouble, because the decoded name contains a “,”, which means you end up with two email addresses.
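The ordering issue raised in the last comment can be sketched as follows, assuming Python 3. The address momo@taro.ne.jp is borrowed from the article's example purely for illustration, and decode_words and split_and_decode are hypothetical helper names. The idea is to split the header with parseaddr first and decode only the display name afterwards, so that a comma inside the decoded name is never seen by the address parser:

```python
from email.header import decode_header
from email.utils import parseaddr


def decode_words(value, default="ascii"):
    """Decode any encoded words in value into a str."""
    chunks = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            text = text.decode(charset or default, "replace")
        chunks.append(text)
    return "".join(chunks)


def split_and_decode(raw_from):
    """Split a From header first, then decode only the display name."""
    name, address = parseaddr(raw_from)
    return decode_words(name), address


# The display name decodes to text containing a comma.
raw_from = "=?UTF-8?B?5qGD5aSqLCDpg44=?= <momo@taro.ne.jp>"
name, address = split_and_decode(raw_from)
print(name, address)
```

Decoding the whole header before splitting would expose that comma to the address parser, which treats it as a separator between addresses.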
