Speeding up search on Honyaku archive site

Last summer, I launched a new archive site for the Honyaku mailing list.

The site is written in Python using the Django framework, with MySQL as the database. I chose MySQL because my tests showed it was much faster than PostgreSQL at text searching.

Lately, however, the searches have been taking a huge amount of time; sometimes they even time out. It makes sense, since I've got more than 216,000 emails in there now, and body__icontains isn't exactly a speed demon.
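
For the record, the simple search was doing the Django equivalent of the following ("Message" is a stand-in for whatever the real model is called):

    # This lookup compiles to WHERE body LIKE '%...%', which can't use
    # an ordinary index, so MySQL scans all 216,000 rows every search.
    results = Message.objects.filter(body__icontains=query)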

But it was also taking forever just to get the posts for a given day. That was pretty easy to solve, though: duh, create an index on the date_sent field. So simple I never thought about it until the system was bogging down like a Golden Week traffic jam.
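
In Django that's a one-flag change; a sketch with a stand-in model, showing just the relevant field:

    from django.db import models

    class Message(models.Model):
        # db_index=True makes Django create an ordinary B-tree index,
        # so fetching the posts for a given day stops scanning the table.
        date_sent = models.DateTimeField(db_index=True)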

That solved the date problem, but my text search problem remained. In the end, I had to create a full-text index for the simple search. This solved the speed problem — queries take a second or two now — but the problem with MySQL's full-text index is that it has lousy support for Japanese text (which isn't delimited by spaces). For that reason, I kept the old, slow search method for the advanced search. If you use that, I recommend narrowing the search rather than just entering some body text.
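
For anyone curious, the setup is a one-time ALTER TABLE on the MySQL side plus a raw MATCH ... AGAINST clause from Django; the table and model names here are illustrative, not the actual ones:

    # One-time index creation in MySQL (FULLTEXT requires a MyISAM
    # table on current MySQL versions):
    #   ALTER TABLE archive_message ADD FULLTEXT INDEX ft_body (body);

    def simple_search(query):
        # MATCH ... AGAINST hits the full-text index instead of scanning.
        return Message.objects.extra(
            where=["MATCH(body) AGAINST (%s IN BOOLEAN MODE)"],
            params=[query],
        )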

Eventually, I'm going to have to bite the bullet and install some kind of n-gram indexing scheme that supports Japanese. Right now, though, I simply don't have the time.

As a stopgap measure, I added a Google search for the Honyaku archive. Google doesn't seem to have indexed the site yet (I just took the main archive out of the robots.txt file), but when it does it'll be a quick way to search with good Japanese support. They even have a gadget that I can put on the Honyaku archive site, but I can't get it to keep the height I set for it.

7 comments to Speeding up search on Honyaku archive site

  • Here’s what I’ve come up with so far for the Asian character conundrum. In the search routine, if the query contains no Asian characters, it does a full-text search; otherwise, it falls back to the slow __icontains method (roughly as sketched below). Advanced search still uses the old method.
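
    Roughly like this; the character-class test is the part I’m least sure about (it covers kana plus the main CJK ideograph block):

        import re

        # Hiragana, katakana, and the main CJK ideograph block.
        ASIAN_RE = re.compile(u'[\u3040-\u30ff\u4e00-\u9fff]')

        def simple_search(query):
            if ASIAN_RE.search(query):
                # Slow path, but correct for Japanese text.
                return Message.objects.filter(body__icontains=query)
            # Fast path: the MySQL full-text index.
            return Message.objects.extra(
                where=["MATCH(body) AGAINST (%s IN BOOLEAN MODE)"],
                params=[query],
            )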

  • Ben

    Hmm, this problem may be easier than you think. I made a search engine to search through about 8,000 files of Japanese, and I got a drastic increase in speed just by indexing on kanji.

    I started experimentally making the kanji index, expecting to have to use something more sophisticated later on, and got some surprises. First, there were only about 2,500 distinct kanji in total. Second, each kanji appears in surprisingly few files: the more files I indexed, the smaller the share containing any given kanji became. By the time I got to 8,000 files, the maximum was 17%, for a kanji that appears in the name of the institution. Even common characters like 中 were in only ten percent of the files.

    So I just made a big list of kanji/file (indexed by number) like

    中: 234 456 876
    急: 234 999

    etc. and read the file in. The file is about 1.5 megabytes. Then I selected the lines for the kanji in the search string (one bug occurred searching for 毒物劇物 because I’d forgotten to eliminate duplicates). Then I got the intersection of the lists of files and just searched through those files.

    This simple algorithm was easily enough to take the search from about ten seconds to running too fast to notice, so I expect you can speed up your search over 120,000 files enormously without any fancy algorithms. I’d guess you’ll have an index of about ten megabytes in that case, which could be shrunk a lot further if you use native C structures in your index file. But my search is already fast enough.
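
    In Python terms the whole thing would look something like this sketch (invented names; not my actual code):

        from collections import defaultdict

        def is_kanji(ch):
            # Crude test: the main CJK Unified Ideographs block only.
            return u'\u4e00' <= ch <= u'\u9fff'

        def build_index(files):
            """files: iterable of (number, text) pairs."""
            index = defaultdict(set)  # kanji -> set of file numbers
            for number, text in files:
                for ch in text:
                    if is_kanji(ch):
                        index[ch].add(number)
            return index

        def candidates(index, query):
            # set() removes duplicates -- the 毒物劇物 bug above.
            kanji = set(ch for ch in query if is_kanji(ch))
            if not kanji:
                return None  # nothing to narrow by; do a full scan
            # Only files containing every query kanji survive.
            return set.intersection(*(index.get(ch, set()) for ch in kanji))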

    One thing you need for this to work is a rapid way to decide whether a character is a kanji or not. I did this by creating a lookup table from the Unicode Character Database (UCD) data. You may have a better idea, but if not, I’d be willing to share the code.
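
    If you’re in Python, the standard unicodedata module can stand in for that table; something like this approximation (not my actual lookup-table code):

        import unicodedata

        def is_kanji(ch):
            # Names of CJK ideographs all begin with this prefix, so the
            # check also catches the extension blocks that a simple
            # range test misses.
            try:
                return unicodedata.name(ch).startswith('CJK UNIFIED IDEOGRAPH')
            except ValueError:
                return False  # character has no name in the UCD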

  • Thanks Ben — that’s an excellent idea. Using a “bag of kanji” approach should be pretty quick to compute, yet effectively narrow down the candidate posts. I’m going to work on adding a kanji-post index to my database, with a Django model for it (sketched below).

    (By the way, the archive is up to 220,000+ posts now — woot!)
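
    Something like this is what I have in mind; all names are provisional:

        from django.db import models

        class KanjiEntry(models.Model):
            # One row per (kanji, post) pair -- the "bag of kanji".
            kanji = models.CharField(max_length=1, db_index=True)
            message = models.ForeignKey('Message', on_delete=models.CASCADE)

        def candidate_ids(query):
            """Intersect the sets of posts containing each query kanji."""
            ids = None
            for ch in set(c for c in query if is_kanji(c)):  # is_kanji as above
                matches = set(KanjiEntry.objects
                              .filter(kanji=ch)
                              .values_list('message_id', flat=True))
                ids = matches if ids is None else ids & matches
            return ids  # None means the query had no kanji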

  • Ben

    Further on this topic, I found something else which may be useful:

    http://qwik.jp/senna/

    I haven’t tried it yet though.

    BTW if I forget to type the security code, my post contents disappear when I go back.

  • Thanks, Ben. I did look into that, but it looked like it was going to take some time and effort to get set up and running on my host (WebFaction). I’ll definitely keep it in mind when I get the search sorted out, though.

    Right now, the site has a bigger problem: some base64-encoded emails are getting into the database without being decoded. Thanks to this comment by Jeremy Morel, however, I have an idea of how to fix that.
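
    The fix will probably be something along these lines on the way into the database (a sketch using the standard email module, not the actual import code; the fallback charset is a guess for a Japanese list):

        import email

        def decoded_body(raw_message):
            msg = email.message_from_string(raw_message)
            part = msg.get_payload(0) if msg.is_multipart() else msg
            # decode=True honours Content-Transfer-Encoding, including
            # base64, so encoded bodies come back as plain bytes.
            data = part.get_payload(decode=True)
            charset = part.get_content_charset() or 'iso-2022-jp'
            return data.decode(charset, 'replace')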

  • Ben

    I had another look at Senna today and found its documentation very hard to understand. Also, the Perl binding is faulty:

    http://cpantesters.perl.org/show/Senna.html

    So I agree it would be a lot of work to use this system.

    P.S. your security code system deleted my comment again.

  • Sorry about the captcha thing. I’m going to just get rid of it, since Akismet is so good at catching spam anyway.
