April 29, 2008
Speeding up search on Honyaku archive site
Last summer, I launched a new archive site for the Honyaku mailing list.
The site is written in Python using the Django framework, with MySQL as the database. I chose MySQL because my tests showed that it was much faster than PostgreSQL at text searching.
Lately, however, the searches have been taking a huge amount of time. Sometimes they even time out. It makes sense, since I've got more than 216,000 emails in there now, and body__icontains (a case-insensitive substring match that has to scan every row) isn't exactly a speed demon.
But it was also taking forever just to get the posts for a given day. That was pretty easy to solve, though: duh, create an index on the date_sent field. So simple I never thought about it until the system was bogging down like a Golden Week traffic jam.
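To illustrate the index fix, here is a minimal sketch, not the archive's actual code. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the table name post, the column layout, and the index name are all assumptions. In Django itself the same effect is usually obtained by setting db_index=True on the model field.

```python
import sqlite3

# Hypothetical schema; SQLite stands in for MySQL so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, date_sent TEXT, body TEXT)"
)
# The fix: an index on the column used to pick out a day's posts.
conn.execute("CREATE INDEX idx_post_date_sent ON post (date_sent)")


def query_plan(connection, sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail.
    return " ".join(str(row[-1]) for row in rows)


# Posts for one day, expressed as a half-open range on date_sent.
plan = query_plan(
    conn,
    "SELECT id, body FROM post WHERE date_sent >= ? AND date_sent < ?",
    ("2008-04-29 00:00:00", "2008-04-30 00:00:00"),
)
print(plan)  # the plan should mention idx_post_date_sent rather than a full scan
```

MySQL has its own EXPLAIN statement that serves the same purpose, though its output format differs.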
That solved the date problem, but my text search problem remained. In the end, I had to create a full-text index for the simple search. This solved the speed problem (queries take a second or two now), but the problem with MySQL's full-text index is that it has lousy support for Japanese text (which isn't delimited by spaces). For that reason, I kept the old, slow search method for the advanced search. If you use that, I recommend narrowing the search rather than just entering some body text.
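The Japanese problem is easy to see with a toy model. The sketch below is a deliberate simplification, not how MySQL's parser actually works: it treats a "full-text index" as nothing more than the set of whitespace-delimited words in a message. The sample strings are made up for illustration.

```python
def word_index(text):
    """Very rough model of a word-based index: split on whitespace only."""
    return set(text.lower().split())


english = "Speeding up search on the archive site"
japanese = "アーカイブサイトの検索を速くする"

# English: each word becomes its own index entry, so a word lookup succeeds.
print("search" in word_index(english))

# Japanese: with no spaces, the whole sentence is a single "word",
# so looking up 検索 as an index entry fails...
print("検索" in word_index(japanese))

# ...even though a plain substring scan (what icontains does) finds it.
print("検索" in japanese)
```

The three lines print True, False, and True, which is exactly the trade-off described above: the word index is fast but blind to unsegmented text, while the substring scan is slow but correct.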
In the end, I'm going to have to bite the bullet and install some kind of n-gram indexing scheme that will support Japanese. Right now, though, I simply don't have the time.
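For a sense of what n-gram indexing means, here is a small in-memory sketch using character bigrams. It is only an illustration of the general technique under simple assumptions (whitespace is dropped, everything fits in a dict); the function names and sample documents are hypothetical, and a real deployment would use a proper search engine or database extension.

```python
def ngrams(text, n=2):
    """Return the overlapping character n-grams of text, ignoring whitespace."""
    compact = "".join(text.split())
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, docs, query, n=2):
    """Find documents containing the query as a contiguous string."""
    needle = "".join(query.split())
    if not needle:
        return set()
    if len(needle) < n:
        # Too short to form an n-gram: fall back to checking every document.
        candidates = set(docs)
    else:
        # Only documents containing every n-gram of the query can match.
        candidates = set.intersection(
            *(index.get(gram, set()) for gram in ngrams(needle, n))
        )
    # Verify the candidates, since sharing n-grams does not guarantee a match.
    return {d for d in candidates if needle in "".join(docs[d].split())}


docs = {
    1: "翻訳の検索が遅い",
    2: "検索エンジンを追加した",
    3: "Golden Week traffic jam",
}
index = build_index(docs)
print(search(index, docs, "検索"))
print(search(index, docs, "翻訳"))
```

Because the index is keyed on character pairs rather than words, it never needs to know where Japanese words begin or end.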
As a stopgap measure, I added a Google search for the Honyaku archive. Google doesn't seem to have indexed the site yet (I just took the main archive out of the robots.txt file), but when it does it'll be a quick way to search with good Japanese support. Google even offers a gadget that I can put on the Honyaku archive site, but I can't get it to keep the height I set for it.
Here's what I've come up with so far for the Asian character conundrum. In the simple search routine, if the query contains no Asian characters, it does a full-text search. Otherwise, it falls back to the slow __icontains method. Advanced search still uses the old method for every query.
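The dispatch described above might look roughly like the following. This is a sketch under assumptions, not the site's real code: the Unicode ranges are one reasonable guess at what counts as an "Asian character", the helper names are hypothetical, and body__search refers to the MySQL-only full-text lookup that older Django releases provided.

```python
import re

# Assumed ranges: hiragana and katakana, CJK ideographs (including
# Extension A), Hangul syllables, and full-width/half-width forms.
ASIAN_CHARS = re.compile(
    "[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7a3\uff00-\uffef]"
)


def contains_asian(query):
    """Return True if the query has at least one character in the ranges above."""
    return ASIAN_CHARS.search(query) is not None


def choose_lookup(query):
    """Pick the field lookup: full-text when safe, slow substring scan otherwise."""
    if contains_asian(query):
        return "body__icontains"
    return "body__search"


print(choose_lookup("translation memory"))
print(choose_lookup("翻訳メモリ"))
```

With a hypothetical Post model, the result could be passed to the ORM as Post.objects.filter(**{choose_lookup(query): query}), so a single code path serves both kinds of query.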