Be careful when cutting UTF-8 text

I just fixed a nasty problem on the two planets I run (one for namics and one for local.ch). The aggregator script would run forever without stopping. A bit of debugging showed that the problem was in how UTF-8 characters were handled (or rather weren’t handled).

The script uses PHP’s DomDocument, more specifically its functions loadHTML and saveXML, to extract valid XML from the blog posts. That’s necessary because the posts are shortened, and this shortening can lead to a completely invalid (X)HTML structure. Let alone all the rubbish content that many a piece of software produces. Shortening the content was of course done with the PHP function substr. And that’s where the problem was.

The relevant part of the text that caused the problem was “steht nur auf Englisch zur Verfügung”. Encoded as UTF-8 and viewed byte by byte, this becomes “steht nur auf Englisch zur VerfÃ¼gung”: the ü is stored as the two bytes 0xC3 0xBC. To this string substr was applied and it produced “steht nur auf Englisch zur VerfÃ” - the second half of the UTF-8 character was cut off. If you know anything about UTF-8 you will go “ouch” here, smile about your knowledge and skip the next few paragraphs. If you don’t see the problem yet, let me enlighten you.

UTF-8 is a Unicode encoding. So it can transport any of the characters defined in Unicode, which is just about any character that you might ever want to use in today’s computing. It’s neat because for most of the content in European languages it requires just one byte per character (versus two bytes in UTF-16, for example). When a byte is in the ASCII range, it’s displayed and all is well. But when the byte is outside of the ASCII range (which only knows about 128 characters and can be encoded in 7 bits per character, as you probably know), this means that one or more of the following bytes belong to the same character. The first byte tells the decoder how many.

I’m sorry, I don’t really know how to explain that any better, so let me just give you an example. The string Für becomes FÃ¼r in UTF-8. So the UTF-8 decoder would read the first byte, the letter F. That fits nicely into ASCII, so that byte is read and the decoder continues with the next character. It reads one byte, which is Ã (0xC3). “Holy cow” you hear the decoder exclaim, “that’s not ASCII”. So the decoder has to read one additional byte and gets the ¼ (0xBC). Reading those two bytes, putting them together and calculating a bit, the decoder then knows that it has just read an ü.

So do you already see the problem in the UTF-8 string “steht nur auf Englisch zur VerfÃ”? The decoder arrives at the last byte, which is Ã. It knows it has to read one more byte, but there are none. So somehow the PHP code in question decides to patiently wait until the string magically grows longer.

The real problem though is of course the careless use of substr. You shouldn’t just cut UTF-8 characters in half. The problem can be solved with mb_substr, a substr function that is Unicode-aware. Just give it ‘utf-8’ as its fourth argument and the problem is solved.

Update: It seems that the problem goes away automatically with newer libxml versions. On my server it’s 2.6.16, while Chregu uses 2.6.23 and can’t reproduce the problem. Thanks Chregu for digging into this. ...
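
To make the difference between substr and mb_substr concrete, here is a minimal sketch, assuming PHP 7 or later with the mbstring extension loaded. The helper is_valid_utf8 is a hypothetical name used only for illustration and is not part of the aggregator script, and the cut-off length of 32 is chosen so that the byte-based cut lands in the middle of the ü.

```php
<?php
// Hypothetical helper (not from the aggregator script): reports whether
// a byte string is well-formed UTF-8.
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// The ü is written as a Unicode escape so the example does not depend on
// the encoding of this source file. PHP stores it as the bytes 0xC3 0xBC.
$text = "steht nur auf Englisch zur Verf\u{00FC}gung";

// substr counts bytes: 31 ASCII bytes plus only the first byte of the ü.
$broken = substr($text, 0, 32);

// mb_substr counts characters: 31 ASCII characters plus the complete ü.
$intact = mb_substr($text, 0, 32, 'UTF-8');

var_dump(is_valid_utf8($broken));        // bool(false)
var_dump(is_valid_utf8($intact));        // bool(true)
var_dump(bin2hex(substr($broken, -1)));  // string(2) "c3"
```

With the byte-based cut, the string ends in the lone lead byte 0xC3, which is the kind of input a decoder cannot complete. The character-based cut keeps both bytes of the ü together, at the cost of the result being 33 bytes long rather than 32.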

March 24, 2006 · Patrice Neff