PHP | Patrice's Weblog

Okapi 1.0

Yesterday we released version 1.0 of Okapi, a web framework built with PHP and XSLT. I’ve spent a substantial amount of time over the last few months working on that release. Okapi is the framework we use at local.ch for all our frontend needs and was originally developed by Silvan from Liip. Until now we used a heavily modified fork of Okapi. New projects at local no longer use that internal fork but the official central Okapi instead. To make that possible I’ve merged some things from our fork into the main repository and cleaned up the code base so that Okapi now sucks less. ...

PHP Testing with SimpleTest

Maarten’s post at Tillate finally gave me the motivation to document the PHP testing approach we use at local.ch. First let me give you a short introduction to our architecture at local.ch. We have a clear separation of frontend (presentation, user-visible parts) and backend (search logic and database access). The frontend is written in PHP and XSLT. The PHP part basically only orchestrates queries to our Java-based backend and passes the XML responses to XSLT. The bulk of the system is in the XSLT stylesheets. All this means that traditional unit tests don’t have much value for the frontend, as there isn’t much traditional logic. But we do need functional/integration testing. ...

Looking for a frontend developer

At local.ch we’re looking for a developer in the area of PHP/XSLT development. You will take over work on all user-visible aspects with technologies such as PHP, XSLT or JavaScript. A job at local.ch gives you a lot of freedom to explore, find good solutions, learn new technologies and bring in your opinions and knowledge. We want a developer who knows how to write clean XHTML and CSS, has experience in client-side JavaScript and is well-versed in XML. We will gladly teach you XSLT on the job, but if you already know how to program in XSLT, so much the better. ...

Mail testing with Selenium

E-mail processes will play a central role in the next phase of local.ch, so I wanted to include those processes in our Selenium tests. It’s actually quite easy to do. First create an account that test mails can go to. That account should be accessible by one of your scripts. I use a normal IMAP account for that. Then write a script which always outputs the newest mail on that account. I include some of the important headers plus the body (body parts for multi-part mails). I also made that page refresh itself every two seconds. ...
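The page that such a script outputs could look roughly like the following sketch. It only covers the rendering step: the function name renderMailPage, the element ids and the sample mail are made up for illustration, and fetching the newest mail from the IMAP account is left out entirely.

```php
<?php
// Minimal sketch: render one mail as a self-refreshing HTML page that a
// Selenium test can inspect. All names here are hypothetical.

/**
 * @param array $headers   map of header name => value (e.g. Subject, To)
 * @param array $bodyParts list of body parts (one entry for a simple mail)
 */
function renderMailPage(array $headers, array $bodyParts, int $refreshSeconds = 2): string
{
    $html  = "<html><head>\n";
    // Ask the browser to reload the page so a newly arrived mail shows up.
    $html .= '<meta http-equiv="refresh" content="' . $refreshSeconds . '">' . "\n";
    $html .= "</head><body>\n";

    // One element per header, with a predictable id for the test to target.
    foreach ($headers as $name => $value) {
        $html .= '<p id="header-' . htmlspecialchars(strtolower($name)) . '">'
               . htmlspecialchars($name . ': ' . $value) . "</p>\n";
    }

    // One element per body part (several for multi-part mails).
    foreach ($bodyParts as $index => $part) {
        $html .= '<pre id="body-' . $index . '">' . htmlspecialchars($part) . "</pre>\n";
    }

    return $html . "</body></html>\n";
}

// Hard-coded sample mail; a real script would read the newest mail
// from the test account instead.
$page = renderMailPage(
    ['Subject' => 'Test subject', 'To' => 'test@example.com'],
    ['Please confirm your address.']
);
echo $page;
```

Giving every header and body part its own id is a design choice for the sketch: it lets a test check a single field without parsing the whole page.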

Book Review: Building Scalable Web Sites

A short while back the book Building Scalable Web Sites came out on Safari. The book is written by Cal Henderson, one of the main people behind Flickr. As I’m currently involved in building a Web application (local.ch) I was interested in learning a few lessons from the Flickr people. Okay, Flickr is not beating any speed records right now, but it’s still an incredibly big application with tons of users and data. Management review: the book is worth a read. Technical short review: the book covers a little of a lot of things and nothing extremely well. The book does not completely live up to its title, as scaling is only part of the book. It seems to be more of a list of lessons learned while building Flickr. That’s also the reason for one of the book’s main deficiencies: it’s mostly PHP and MySQL only. But it also includes enough lessons that can be applied in other environments for it to be useful. A short chapter by chapter review follows. ...

Be careful when cutting UTF-8 text

I just fixed a nasty problem on the two planets I run (one for namics and one for local.ch). The aggregator script would run forever without stopping. A bit of debugging showed that the problem was in how UTF-8 characters were handled (or rather weren’t handled).

The script uses PHP’s DomDocument, more specifically its functions loadHTML and saveXML, to extract valid XML from the blog posts. That’s necessary because the posts are shortened, and this shortening can lead to a completely invalid (X)HTML structure, let alone all the rubbish content that a lot of software produces. Shortening the content was of course done with the PHP function substr. And that’s where the problem was.

The relevant part of the text that caused the problem was “steht nur auf Englisch zur Verfügung”. Encoded as UTF-8 and displayed as if it were Latin-1, this becomes “steht nur auf Englisch zur VerfÃ¼gung”. substr was applied to this string and produced “steht nur auf Englisch zur VerfÃ”: the second half of the UTF-8 character was cut off. If you know anything about UTF-8 you will go “ouch” here, smile about your knowledge and skip the following two paragraphs. If you don’t see the problem yet, let me enlighten you.

UTF-8 is a Unicode encoding, so it can transport any of the characters defined in Unicode, which is just about any character that you might ever want to use in today’s computing. It’s neat because for most of the content in European languages it requires just one byte per character (versus at least two bytes in UTF-16, for example). When a byte is in the ASCII range it’s displayed and all is well. But when the byte is outside of the ASCII range (which only knows 128 characters and can be encoded in 7 bits per character, as you probably know), this means that one or more of the following bytes belong to the same character. I’m sorry, I don’t really know how to explain that any better, so let me just give you an example.

The string Für becomes FÃ¼r in UTF-8. So the UTF-8 decoder would read the first byte, the letter F. That fits nicely into ASCII, so that byte is read and the decoder continues with the next character. It reads one byte, which is Ã. “Holy cow”, you hear the decoder exclaim, “that’s not ASCII”. So the decoder has to read one additional byte and gets the ¼. Reading those two bytes, putting them together and calculating a bit, the decoder then knows that it has just read an ü.

So do you already see the problem in the UTF-8 string “steht nur auf Englisch zur VerfÃ”? The decoder arrives at the last byte, which is Ã. It knows it has to read one more byte, but there is none. So somehow the PHP code in question decides to patiently wait until the string magically grows longer.

The real problem, though, is of course the careless use of substr. You shouldn’t just cut UTF-8 characters in half. The problem can be solved with mb_substr, a substr function that is Unicode-aware. Just give it ‘utf-8’ as its fourth argument and the problem is solved.

Update: It seems that the problem goes away automatically with newer libxml versions. On my server it’s 2.6.16, while Chregu uses 2.6.23 and can’t reproduce the problem. Thanks Chregu for digging into this.
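The difference between the two functions can be sketched as follows, assuming PHP 7 or later with the mbstring extension loaded. The variable names and the cut length of 32 are chosen for illustration only; this is not the aggregator’s actual code.

```php
<?php
// The sample sentence from above. The umlaut is written as a Unicode
// escape so the example does not depend on this file's own encoding.
$text = "steht nur auf Englisch zur Verf\u{00FC}gung";

// In UTF-8 the umlaut is one character but two bytes: 0xC3 0xBC.
$umlautHex = bin2hex("\u{00FC}");

// substr counts bytes: a limit of 32 falls between the two bytes.
$cutBytes = substr($text, 0, 32);

// mb_substr counts characters when given the encoding as fourth argument.
$cutChars = mb_substr($text, 0, 32, 'UTF-8');

// Check whether each result is still well-formed UTF-8.
$bytesValid = mb_check_encoding($cutBytes, 'UTF-8');
$charsValid = mb_check_encoding($cutChars, 'UTF-8');

printf("umlaut bytes:         %s\n", $umlautHex);
printf("last byte, substr:    %s\n", bin2hex(substr($cutBytes, -1)));
printf("substr valid UTF-8:   %s\n", var_export($bytesValid, true));
printf("mb_substr valid UTF-8: %s\n", var_export($charsValid, true));
```

The byte-based cut ends in a lone lead byte, which is exactly the situation the decoder trips over, while the character-based cut keeps the umlaut whole. If mbstring is not available, the shortened string could at least be validated before it is passed on.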