Translating Web applications for the Swiss market

Posted by Patrice Neff Sat, 25 Mar 2006

When I create a new Web application, I usually try to implement translation right from the start. For example, as the Swiss blogosphere has four significant languages, I translated both the blog list and the Swiss weblog statistics. The list is available in English, German and French (thanks to Jérôme for the French translation!) and the stats are available in German and English so far. I'd like both applications to be translated into more languages, and potentially other future applications as well.

So I'm proposing a simple small "project": a contact point for translators. I want to collect some addresses of people who are willing to provide translations for free - for free applications of course. This would be used only for projects in the Swiss blogosphere (blogug.ch for example or swissblogs.com once that rock starts rolling again).

If you are willing and able to provide translations between German, English, Italian or French (pick two or more ;-) please contact me with a list of your languages and I'll add you to my list. I'll just keep you in my personal address book, so your name and address won't be made public. And I won't spam you, honest!

If you need a translation, also contact me and I will forward your project to everyone who has announced willingness to translate between the required languages.

There is no commitment for the translators. They can decide each time whether they have time for that translation or not.

That's basically a braindump of an idea I just had. So what do you think? Is the idea sensible or complete rubbish?

Be careful when cutting UTF-8 text

Posted by Patrice Neff Fri, 24 Mar 2006

I just fixed a nasty problem on the two planets I run (one for namics and one for local.ch). The aggregator script would run forever without stopping. A bit of debugging showed that the problem was in how UTF-8 characters were handled (or rather weren't handled).

The script uses PHP's DomDocument, more specifically its methods loadHTML and saveXML, to extract valid XML from the blog posts. That's necessary because the posts are shortened and this shortening can lead to a completely invalid (X)HTML structure, let alone all the rubbish markup that a lot of software produces.

Shortening the content was of course done with the PHP function substr. And that's where the problem was. The relevant part of the text that caused the problem was "steht nur auf Englisch zur Verfügung". Translated to UTF-8 (and displayed one byte at a time) this becomes "steht nur auf Englisch zur VerfÃ¼gung". To this string substr was applied and it produced "steht nur auf Englisch zur VerfÃ" - the second half of the UTF-8 character was cut off. If you know anything about UTF-8 you will go "ouch" here, smile about your knowledge and skip the following two paragraphs. If you don't see the problem yet, let me enlighten you.

UTF-8 is a Unicode encoding. So it can transport any of the characters defined in Unicode, which is just about any character that you might ever want to use in today's computing. It's neat because for most of the content in European languages it requires just one byte per character (versus two bytes in UTF-16 for example). When a byte is in the ASCII range it's displayed and all is well. But when the byte is outside of the ASCII range (which only knows about 128 characters and can be encoded in 7 bits per character as you probably know) this means that one or more of the following bytes belong to the same character. I'm sorry, I don't really know how to explain that any better so let me just give you an example.

The string Für becomes FÃ¼r in UTF-8 when you look at it byte by byte. So the UTF-8 decoder would read the first byte, the letter F. That fits nicely into ASCII, so that byte is read and the decoder continues with the next character. It reads one byte which is Ã. "Holy cow" you hear the decoder exclaim, "that's not ASCII". So the decoder has to read one additional byte and gets the ¼. Reading those two bytes, putting them together and calculating a bit, the decoder then knows that it has just read an ü.
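
To make the byte sequence concrete, here is a small sketch in Python (not the PHP the aggregator script is written in, it only serves as an illustration). It encodes Für as UTF-8, lists the individual bytes and shows what they look like when each byte is misread as one Latin-1 character.

```python
text = "Für"
encoded = text.encode("utf-8")

# F and r are one ASCII byte each, ü is the two bytes 0xC3 0xBC.
byte_values = [hex(b) for b in encoded]

# Misreading the UTF-8 bytes as Latin-1 yields one character per byte.
misread = encoded.decode("latin-1")

print(byte_values)
print(misread)
```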

So do you already see the problem in the UTF-8 string "steht nur auf Englisch zur VerfÃ"? The decoder arrives at the last byte which is Ã. It knows it has to read one more byte, but there are none. So somehow the PHP code in question decides to patiently wait until the string magically grows longer.
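
The cut itself can be reproduced with a few lines of Python. This is only a sketch of the byte-wise truncation, assuming a strict decoder; it does not reproduce the endless loop, which was specific to the PHP and libxml setup. The helper name is_complete_utf8 is made up for this illustration.

```python
def is_complete_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


encoded = "steht nur auf Englisch zur Verfügung".encode("utf-8")
prefix = "steht nur auf Englisch zur Verf".encode("utf-8")

# Byte-wise cut, similar to what substr does: the ASCII prefix plus
# only the first of the two bytes that make up the ü.
cut = encoded[:len(prefix) + 1]

print(hex(cut[-1]))
print(is_complete_utf8(encoded))
print(is_complete_utf8(cut))
```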

The real problem though is of course the careless use of substr. You shouldn't just cut UTF-8 characters in half. The problem can be solved with mb_substr, a substr function that is Unicode-aware. Just give it 'utf-8' as its fourth argument and the problem is solved.
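
As a rough illustration of the two ways to cut safely, here is a Python sketch. Slicing a decoded string counts characters, which is comparable to what mb_substr does with 'utf-8'. The hypothetical helper truncate_utf8 is an alternative for when the limit has to be in bytes: it steps back to the start of a character instead of cutting through it. It assumes valid UTF-8 input and a non-negative limit.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a UTF-8 character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the form 0b10xxxxxx. If the first excluded
    # byte is one of them, the cut is inside a character, so step back.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


text = "steht nur auf Englisch zur Verfügung"
encoded = text.encode("utf-8")

# Character-based cut: 32 characters, the ü stays whole.
by_chars = text[:32]

# Byte-based cut: 32 bytes would split the ü, so it is dropped entirely.
by_bytes = truncate_utf8(encoded, 32).decode("utf-8")

print(by_chars)
print(by_bytes)
```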


Update: It seems that the problem goes away automatically with newer libxml versions. On my server it's 2.6.16, while Chregu uses 2.6.23 and can't reproduce the problem. Thanks Chregu for digging into this.