Counting words, characters, and Asian characters with Python

As a translator, I often need to get word counts. As a Japanese-to-English translator, I need to get Japanese character counts as well.

These are the kinds of counts that MS Word gives:

  • Characters (with spaces)
  • Characters (without spaces)
  • Words
  • Asian characters
  • Non-Asian words

Let's calculate each of them. Say that the name text points to the text we want.
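
For instance, in the snippets below you could point it at a mixed Japanese/English string like this one (my own sample):

text = u"日本語spam日本語eggs"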

Characters (with spaces)

This one is the easiest.

characters = len(text)

Characters (without spaces)

Still pretty easy:

chars_no_spaces = sum([not x.isspace() for x in text])

Here, we take advantage of the fact that True is 1, and False is 0. If you think that's too much magic, you could just say len([x for x in text if not x.isspace()]).
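
For instance, with a short sample of my own:

sum([not x.isspace() for x in u"spam and eggs"]) # gives 11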

Asian characters

For this one, we need a way of telling which characters are Asian. Simple enough:

IDEOGRAPHIC_SPACE = 0x3000

def is_asian(char):
    """Is the character Asian?"""

    # 0x3000 is ideographic space (i.e. double-byte space)
    # Anything over is an Asian character
    return ord(char) > IDEOGRAPHIC_SPACE
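
To see the cutoff in action (characters of my choosing):

is_asian(u'日') # True: ord(u'日') is 0x65E5, well above the ideographic space
is_asian(u'A') # False: ord(u'A') is 0x41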

Now, we just count all the characters in our text that are Asian:

asian_chars = sum([is_asian(x) for x in text])
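
With the sample string from above, that works out to 6:

sum([is_asian(x) for x in u"日本語spam日本語eggs"]) # gives 6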

Non-Asian words

This is the tricky one. Say we have European text mixed in with Asian text. Odds are, the text won't be separated by spaces. For example, MS Word will give us a non-Asian word count of 2 for the string "日本語spam日本語eggs".

Here's one way to handle this:

def filter_jchars(c):
    """Filters Asian characters to spaces"""
    if is_asian(c):
        return ' '
    return c

def nonj_len(word):
    u"""Returns number of non-Asian words in {word}
    - 日本語AアジアンB -> 2
    - hello -> 1
    @param word: A word, possibly containing Asian characters
    """
    # Here are the steps:
    # 本spam本eggs
    # -> [' ', 's', 'p', 'a', 'm', ' ', 'e', 'g', 'g', 's']
    # -> ' spam eggs'
    # -> ['spam', 'eggs']
    # The length of which is 2!
    chars = [filter_jchars(c) for c in word]
    return len(u"".join(chars).split())
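
A quick check with the mixed string from before:

nonj_len(u"日本語spam日本語eggs") # gives 2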

Words

And this one is also very simple: each Asian character counts as one word, so we just add the two counts together:

words = non_asian_words + asian_chars
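
For the sample string, that's 2 + 6 = 8 words.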

The Full Monty

Here's the full code:

#coding: UTF8
"""
Get word, character, and Asian character counts

1. Get a word count as a dictionary:
    wc = get_wordcount(text)
    words = wc['words'] # etc.

2. Get a word count as an object
    wc = get_wordcount_obj(text)
    words = wc.words # etc.

properties counted:
    * characters
    * chars_no_spaces
    * asian_chars
    * non_asian_words
    * words

Python License
"""
__version__ = 0.1
__author__ = "Ryan Ginstrom"

IDEOGRAPHIC_SPACE = 0x3000

def is_asian(char):
    """Is the character Asian?"""

    # 0x3000 is ideographic space (i.e. double-byte space)
    # Anything over is an Asian character
    return ord(char) > IDEOGRAPHIC_SPACE

def filter_jchars(c):
    """Filters Asian characters to spaces"""
    if is_asian(c):
        return ' '
    return c

def nonj_len(word):
    u"""Returns number of non-Asian words in {word}
    - 日本語AアジアンB -> 2
    - hello -> 1
    @param word: A word, possibly containing Asian characters
    """
    # Here are the steps:
    # 日spam本eggs
    # -> [' ', 's', 'p', 'a', 'm', ' ', 'e', 'g', 'g', 's']
    # -> ' spam eggs'
    # -> ['spam', 'eggs']
    # The length of which is 2!
    chars = [filter_jchars(c) for c in word]
    return len(u"".join(chars).split())

def get_wordcount(text):
    """Get the word/character count for text

    @param text: The text of the segment
    """

    characters = len(text)
    chars_no_spaces = sum([not x.isspace() for x in text])
    asian_chars = sum([is_asian(x) for x in text])
    non_asian_words = nonj_len(text)
    words = non_asian_words + asian_chars

    return dict(characters=characters,
                chars_no_spaces=chars_no_spaces,
                asian_chars=asian_chars,
                non_asian_words=non_asian_words,
                words=words)

def dict2obj(dictionary):
    """Transform a dictionary into an object"""
    class Obj(object):
        def __init__(self, dictionary):
            self.__dict__.update(dictionary)
    return Obj(dictionary)

def get_wordcount_obj(text):
    """Get the wordcount as an object rather than a dictionary"""
    return dict2obj(get_wordcount(text))
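
Here's how it looks on the mixed string from earlier (the expected values are worked out by hand from the code above):

wc = get_wordcount(u"日本語spam日本語eggs")
# {'characters': 14, 'chars_no_spaces': 14, 'asian_chars': 6,
#  'non_asian_words': 2, 'words': 8}

wc = get_wordcount_obj(u"日本語spam日本語eggs")
# wc.words is 8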

The linked file (wordcount.zip) has this module and unit tests for it.
