My solution to the localization horror story

The localization horror story in this CPAN article about the Locale::Maketext module tells of the combinatorial explosion of translation "rules" required when localizing text with variables (placeholders) into multiple languages.

Since I translate (and localize) Japanese to English, this is a problem that really strikes home with me. Here's a simple example of this problem in Japanese:

%s件のコメントを削除しました

Japanese lacks a plural marker, so this would be translated as "Deleted 1 comment" or "Deleted n comments," depending on the value assigned to the variable "%s" — the Japanese is the same for both the singular and plural cases.

There's an obvious conundrum here: we have to make a distinction in the translation that doesn't exist in the source text. How can you localize this text without changing the source code of the software (often not feasible) or teaching the developers enough English to code for these differences (even less likely to happen)?

The article's proposed solution

The authors propose a solution to this problem in the form of a rules-based system. (One example is "You have [quant,_1,piece] of new mail."). I see two big problems with this system: (1) you're hard-coding your rules into your code, and relatedly (2) you'll likely have to modify the actual code (or at least page template) every time you add a new language.

Getting all the possible rules for any language would pretty much mean writing a machine translation system, and we all know how successful they are.

This is actually a pretty common beginner's solution. You realize that simple string substitution doesn't work for translating sentences, so you think: I know, I'll add some rules. In the end, though, this never works — with every new sentence and language you'll be adding new rules. And languages tend to vary along different axes, so your rule system grows increasingly complex.
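To make that concrete, here's a rough sketch (the function names and rule table are hypothetical, not from the article) of where the "just add rules" approach leads:

# One hypothetical plural "rule" per language...
def plural_en(n, singular, plural):
    return singular if n == 1 else plural

def plural_ja(n, singular, plural):
    # Japanese has no plural marker, so both forms are identical
    return singular

plural_rules = {"en": plural_en, "ja": plural_ja}

# ...and that's just singular vs. plural. Every new language and every
# new sentence that uses a number demands more rules, hard-coded into
# the program itself.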

Look at Google Translate. That's based on Systran, a rule-based system developed in the late 1960s. You can probably guess that in the 40-odd years since, they still haven't got the rules right.

If you only have a fixed number of sentences in a fixed number of languages to translate, the rules approach could work. But the problem that the authors are trying to address is arbitrarily many languages — and I'd add the probability that you'll have more text to translate later.

My solution

I propose that instead of hard-coding rules, we produce concrete translations for the exceptions. This follows a general principle of moving vectors of change out of the code whenever possible.

Then we write an "improved" gettext that first fills in the variable and checks whether there's a matching translation; if there isn't, it falls back to the translation with the placeholder still in it. Here's how a Python version might look:

# coding: utf-8

trans_dict = {u"スパム" : "spam",
              u"1件のコメントを削除しました" : "Deleted 1 comment",
              u"%s件のコメントを削除しました" : "Deleted %s comments" }

def get_translation(msgid):
    return trans_dict.get(msgid)

def get_and_fill(msgid, *args):
    """Get the translation, then fill in the variables"""

    trans = get_translation(msgid)
    if trans:
        return trans % args
    # If we didn't find any translation, fall back to the source text
    return msgid % args

def fill(msgid, *args):
    """See if we have a filled-in translation,
    otherwise get the translation and fill it"""

    trans = get_translation(msgid % args)
    return trans or get_and_fill(msgid, *args)

def gettext(msgid, *args):
    """Improved gettext that checks for filled-in versions"""

    # Base case — no variables
    if not args:
        return get_translation(msgid) or msgid
    # Try filling it in
    return fill(msgid, *args)

print gettext(u"スパム")
print gettext(u"%s件のコメントを削除しました", 1)
print gettext(u"%s件のコメントを削除しました", 2)
print gettext(u"ハム")

Notice that the code is fault tolerant — if a translation isn't found, it will just return the source text. In real life, we could (and probably should) log any source text with no translations.

Our output:

spam
Deleted 1 comment
Deleted 2 comments
ハム

Yay, it works! Here's how. We have a dictionary of 3 terms:
スパム : spam
1件のコメントを削除しました : Deleted 1 comment
%s件のコメントを削除しました : Deleted %s comments

We call gettext with msgid スパム. There are no variables, so we do a straight lookup and get "spam."

Next, we call it with msgid %s件のコメントを削除しました and a variable of 1. We fill in the string, and get 1件のコメントを削除しました.

get_translation finds the translation of 1件のコメントを削除しました, and returns "Deleted 1 comment."

Next, we call it with msgid %s件のコメントを削除しました and a variable of 2. We fill in the string, and get 2件のコメントを削除しました.

get_translation doesn't find a translation of 2件のコメントを削除しました, so we look it up without the variable filled in.

get_translation finds the translation of %s件のコメントを削除しました, and returns "Deleted %s comments."

We then supply the variable 2 to Deleted %s comments, and get Deleted 2 comments. Simple!

Finally, we call gettext with msgid ハム. We don't find a translation, so we just return ハム back.
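As mentioned above, in real life we'd probably want to log any source text that comes back untranslated. One way to bolt that on (just a sketch; the logging setup is my assumption, not part of the code above):

import logging

log = logging.getLogger("l10n")

def get_and_fill(msgid, *args):
    """Get the translation, then fill in the variables,
    logging any msgid we have no translation for"""

    trans = get_translation(msgid)
    if trans:
        return trans % args
    # No translation at all -- flag it for the localizer
    log.warning("No translation for: %s", msgid)
    return msgid % args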

Potential complication

One potential problem with this approach is a rule that would produce infinitely many "concrete" (filled-in) translations. The Russian example given in the article seems to fall into this category. Even so, it should be possible to give translations up to some reasonable number (say, 10,000 directories). The localizer should be able to generate them for us automatically, and it shouldn't hurt lookup times or take up too much disk space, as long as the number of such sentences isn't very large. At any rate, I think this still beats hard-coding exception rules for each new language.
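Generating those concrete entries could be a small piece of the localizer's tooling. A rough sketch (the helper and its arguments are my own invention):

def generate_entries(msgid, translate_number, upto=10000):
    """Pre-fill trans_dict with concrete translations for 1..upto"""

    for n in range(1, upto + 1):
        trans_dict[msgid % n] = translate_number(n)

# English only needs singular vs. plural here (and the %s fallback
# already covers it), but a language with more number forms would
# simply pass in its own translate_number function.
generate_entries(u"%s件のコメントを削除しました",
                 lambda n: "Deleted 1 comment" if n == 1
                 else "Deleted %d comments" % n)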

If you are going to add rules, though, I'd add a separate rule system for each language, with a separate rule for each sentence. The localizer engine for each language could load its specific rule set along with its dictionary, check for matching rules first, and fall back to the lookup sequence above when no rule matches. So for Russian, I'd have a rule for the equivalent of "Deleted %s comments," and generate translations for just that sentence.
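Here's a sketch of how such a per-language hook might sit in front of the lookup above (the rule signature is an assumption on my part):

# Each rule takes (msgid, args) and returns a translation, or None if
# it doesn't apply. The engine loads a different list per language.
language_rules = []

def gettext_with_rules(msgid, *args):
    """Check the language-specific rules first, then fall back to the
    normal filled-in / placeholder lookup"""

    for rule in language_rules:
        trans = rule(msgid, args)
        if trans is not None:
            return trans
    return gettext(msgid, *args)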

Conclusion

The beauty of this approach is that as we add localized languages, we only have to add more localized dictionaries — there's no need to change the code or our existing localized dictionaries.

Note that this example only works when you have one variable at most. And I think that as a rule of thumb, you should stick to no more than one variable per sentence.

With two or more variables, you'd have to use named placeholders and a dict (**kwargs instead of *args), since the order is likely to change between languages. The sentence above would become, for example, %(num_comments)s件のコメントを削除しました. Other languages use "{1}", "{2}"-style placeholders, which work the same way.
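A sketch of the same lookup with named placeholders (again just an illustration, using a hypothetical gettext_named alongside the functions above):

trans_dict[u"%(num_comments)s件のコメントを削除しました"] = \
    "Deleted %(num_comments)s comments"

def gettext_named(msgid, **kwargs):
    """Like gettext above, but with named placeholders so each
    language can reorder them freely"""

    if not kwargs:
        return get_translation(msgid) or msgid
    # Try the concrete, filled-in sentence first
    trans = get_translation(msgid % kwargs)
    if trans:
        return trans
    # Fall back to the version with the placeholders still in it
    return (get_translation(msgid) or msgid) % kwargs

print gettext_named(u"%(num_comments)s件のコメントを削除しました",
                    num_comments=2)   # Deleted 2 comments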

Edit: Added more explanation of why a rule-based approach isn't a good idea, and fleshed out when and how rules could be added for outliers.

3 comments to My solution to the localization horror story

  • “Even so, however, it should be possible to give translations for up to some reasonable number — say, 10,000 directories.”

    Aww no, doing that for the dozens of situations that even a small application would contain is a terrible waste of resources. The idea of translating specific cases instead of trying to generalise for nouns and grammatical roles is probably a good idea, but I’d add some way to dispatch on numbers, rather than on expanded text — you’d only need a few functions per language, which could categorize the numbers.

  • @Reginald

    Well, even with the worst-case Russian scenario, I’d imagine that even going up to 10,000 wouldn’t contain more than a couple of hundred exceptions. That doesn’t sound too bad to me.

    But yes, maybe in seriously degenerate cases you could write a rule. When all you have to deal with is singular/plural, or even singular/dual/plural, then it’s a no brainer (IMO).

    The article also mentioned the case of “Scanned 0 directories” being “Did not scan any directories” in Italian — the concrete translations would handle that as well.

  • Leo

    My solutions to this problem:

    A. Use concise texts that work in any language with any number, such as:
    “Deleted comments: %s”

    or

    B. Use more elaborate wording for very user friendly languages and localize your code with it:

    output := "Thank you, my friend! "
    if (n == 0)
        output += "Unfortunately I couldn't delete anything."
    elseif (n == 1)
        output += "At your request I deleted one message."
    else
        output += "At your request I deleted " + n + " messages."

    Then add complexity to code and translation when there is need.
    If you are going to put that effort into your UI texts then you can as well afford coding the exceptions.

    All other approaches are either too complicated or limited IMHO.
