Counting words (etc.) in an HTML file with Python

In a previous post, I wrote about how to count words, characters, and Asian characters using Python.
In this post I want to pull that together with code to get a word count from an HTML file.
What needs counting
What needs counting depends to some extent on what you need the word count for, but here I'm […]
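As a rough illustration of the idea (not the code from the full post), here is a minimal sketch that pulls the visible text out of an HTML string with the standard library's html.parser and counts the words in it. The names TextExtractor and count_html_words are made up for this example, and two things are assumptions on my part: that the contents of script and style tags should be skipped, and that a "word" is a run of \w characters.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP_TAGS = {"script", "style"}  # assumption: these never count as text

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def count_html_words(html):
    """Return the number of \\w+ runs in the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from adjacent elements doesn't run together.
    text = " ".join(parser.parts)
    return len(re.findall(r"\w+", text))


sample = "<p>One <b>two</b> three.</p><script>var x = 1;</script>"
print(count_html_words(sample))
```

Joining the pieces with a space keeps "one</p><p>two" from collapsing into a single word, at the cost of splitting a word that has inline markup in the middle of it. Which trade-off is right depends on what the count is for.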

The invisible translator

By the nature of our profession, translators are generally invisible when they're doing their jobs right.
I say "generally" because this isn't quite a universal truth. For example, in Japan, unlike in the United States, a movie subtitle translator (and arguably not even a stellar one) can become a television celebrity. But that's […]

Speeding up search on Honyaku archive site

Last summer, I launched a new archive site for the Honyaku mailing list.
The site is written in Python using the Django framework, with MySQL as the database. I chose MySQL because my tests showed that it was much faster than PostgreSQL at text searching.
Lately, however, the searches have been taking a huge amount of time. […]

Do the math

I don't like doing so-called "native checks," or proofing other translators' work in general. It just turns into a bad experience way too often.
I'm candid about this with clients. I tell them I prefer not to do that sort of work. Sometimes they ask anyway, and if they're good clients (i.e. they send […]

What price elegance?

In a recent post, I gave some code for counting the top n most frequent words in an arbitrary text file using itertools.groupby.
The code is written in a somewhat functional style. It's short and, dare I say, kind of elegant. But it turns out that this code is quite a bit slower than an imperative […]
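To make the comparison concrete, here is a sketch of the two styles side by side. This is not the code from either post. The function names, the \w+ tokenizer, and the tie-breaking rule (alphabetical among words with equal counts) are all assumptions made for the example.

```python
import re
from itertools import groupby


def words(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def top_words_groupby(text, n):
    """Functional style: sort the words, then count each run with groupby."""
    counts = [(word, len(list(group)))
              for word, group in groupby(sorted(words(text)))]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))[:n]


def top_words_loop(text, n):
    """Imperative style: tally the words in a dict in a single pass."""
    counts = {}
    for word in words(text):
        counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


text = "the cat and the hat and the bat"
print(top_words_groupby(text, 2))
print(top_words_loop(text, 2))
```

Both functions return the same list. One plausible reason for a speed gap is that the groupby version has to sort every word in the text before it can count anything, while the loop only sorts the much smaller set of distinct words at the end.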

Counting occurrences in a sequence with itertools.groupby

itertools.groupby is a great tool for counting the number of occurrences of each item in a sequence.
Here are some examples from the interactive interpreter.
A list of numbers

>>> # Create a random list of numbers
>>> from random import random
>>> numbers = [int(random() * 10) for x in range(20)]
>>> numbers
[8, 0, 3, 2, 3, 9, 8, 2, 8, 3, 0, […]

Making the robot dance

Some time around 1980, my elementary school classroom got a computer. While most of the other kids fooled around playing Hunt the Wumpus, my friend and I found the BASIC manual that came with the computer. We laboriously copied in the code to make a "robot" appear on the screen. After a lot of typos, […]

Using chardet to convert arbitrary byte strings to Unicode

chardet is a fantastic module for finding the encoding of arbitrary byte strings. You can combine this with a check for a BOM to pretty reliably turn them into Unicode.
Edit: Thanks to Kirit's comment below, I added code to check for UTF-32.

import chardet
def bytes2unicode(bytes, errors='replace'):
    """Convert a byte string into Unicode.
    First checks […]
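
The excerpt above is cut off, so what follows is not the post's code. It is a minimal Python 3 sketch of the approach the excerpt describes: look for a BOM first, and only fall back to chardet when there isn't one. The name decode_bytes, the order of the BOM table, and the UTF-8 fallback when chardet is missing or can't decide are all assumptions for illustration.

```python
import codecs

try:
    import chardet  # third-party package; assumed to be installed
except ImportError:
    chardet = None  # the sketch then falls back to UTF-8 for BOM-less input

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_bytes(data, errors="replace"):
    """Decode a byte string to str, preferring a BOM over a chardet guess."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Strip the BOM so it doesn't show up in the decoded text.
            return data[len(bom):].decode(encoding, errors)
    encoding = None
    if chardet is not None:
        # detect() returns a dict; its "encoding" value may be None.
        encoding = chardet.detect(data)["encoding"]
    return data.decode(encoding or "utf-8", errors)


print(decode_bytes(codecs.BOM_UTF8 + "日本語".encode("utf-8")))
```

A BOM is a near-certain signal when it is present, which is why it is checked first. The chardet result is only a statistical guess, so passing errors="replace" keeps a wrong guess from raising an exception.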

Delivering the bad news

A few weeks ago, a translation agency I occasionally work for called me in a panic. It seems that a major client had rejected one of their Japanese-to-English translations, calling it "unreadable," and providing another translation as a sample of the quality they were after.
The agency wanted to pay me to review their translation, and […]

Python GUI programming platforms for Windows

[Edit]
By popular demand, I've added a section on PyGTK. See bottom of post.
There are several platforms for programming Windows GUI applications in Python. Below I outline a few of them, with a simple "hello world" example for each. Where I've lifted the example from another site, there's a link to the source.
Tkinter
Tkinter is the ubiquitous […]
