Autorefresh for local.ch location pages

Posted by Patrice Neff Mon, 23 Mar 2009

The local.ch location pages contain a lot of useful information about every city and village in Switzerland. Since we put the regional news online, I regularly look at the location page for Olten.

I’ve now created a Greasemonkey script which, once installed, automatically jumps from tab to tab every five seconds.

You can get the script. When you open that page Greasemonkey will ask you to install the script. I’ve tested the script with both the original Greasemonkey and the Safari implementation GreaseKit.

Architecture of news.local.ch

Posted by Patrice Neff Fri, 20 Feb 2009

Intro

Probably my favourite project at local.ch was news.local.ch. That’s a site which crawls Swiss newspapers for news and classifies the stories by town. It only shows news which can be associated with a Swiss town. This leads to a good news collection even for very small villages.

In this post I’m going to outline the architecture and how we used Amazon’s web services to make building this a lot easier than it could have been. There will be a follow-up detailing how we used Python in this project.

Pipeline processing

I initially started the project as a proof of concept. We quickly found that it had a lot of potential. And we quickly saw that there were a few components where we had to try different approaches to find out which one works best, mostly in the area of crawling and the geo classification. So Harry Fuecks and I decided to write a pipeline instead of one big application.

The pipeline in this case is implemented as eight different processes which communicate using a queue. Each component can be replaced without having to notify the other components.

This approach mainly buys us the following advantages: easy load balancing, good performance, modularity and extensibility.

  • Load balancing: Individual processes can run on different machines without having to touch a single line of code.
  • Performance: To speed things up each pipeline component can be loaded more than once. This way we can linearly scale the throughput of the whole pipeline.
  • Modularity: Every component is very small. Each of them can be read top to bottom in less than half an hour so it’s very easy to debug problems.
  • Extensibility: To add new functionality a new component can be inserted without having to touch other parts of the pipeline.

All these small processes cause quite some overhead. The queue processing in our case takes up big chunks of the processing time because the work units are very small. So raw performance would be better with a monolithic process. But we decided that in our case the advantages of the pipeline system easily outweigh the disadvantages.
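As a rough illustration of the idea, here is a minimal Python sketch of two stages connected by queues. It is not the news.local.ch code: the stage functions, the town list and the in-process queue.Queue (standing in for the real queue service) are all assumptions made for the example.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def run_stage(work, inbox, outbox):
    """Read items from inbox, apply work() and pass the results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # forward the shutdown signal downstream
            break
        result = work(item)
        if result is not None:  # a stage may decide to drop an item
            outbox.put(result)


def parse(url):
    # Hypothetical parser: turns a story URL into a structured article.
    return {"url": url, "body": "News from Olten"}


def classify(article):
    # Hypothetical geo classifier: attaches the towns found in the body.
    towns = [t for t in ("Olten", "Bern") if t in article["body"]]
    return dict(article, towns=towns) if towns else None


def run_pipeline(urls):
    q_urls, q_articles, q_done = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=run_stage, args=(parse, q_urls, q_articles)),
        threading.Thread(target=run_stage, args=(classify, q_articles, q_done)),
    ]
    for stage in stages:
        stage.start()
    for url in urls:
        q_urls.put(url)
    q_urls.put(STOP)
    for stage in stages:
        stage.join()

    results = []
    while True:
        item = q_done.get()
        if item is STOP:
            break
        results.append(item)
    return results
```

Each stage only knows its inbox and its outbox, so replacing or inserting a stage means rewiring queues rather than touching the other stages.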

Components

The currently implemented pipeline components are:

  • Crawler: Parses newspaper article lists and extracts all URLs which lead to a story.
  • Parser: Parses each story and extracts a structured article with title, lead, body, date, etc.
  • Deduplicator: Finds duplicates and near-duplicates of past stories and handles them.
  • Geo classifier: Attaches a list of Swiss towns to the article based on parsing the body.
  • Region classifier: Attaches a list of regions to the article (usually a city with some towns around it).
  • Image extractor: Gets the images which are used by the article and stores them under local.ch control.
  • Indexer: Puts the article into an index from where it’s then served to the live site.
  • Purger: This process is responsible for cleaning up some garbage when an article is to be deleted.

With the exception of the parser and the geo classifier these are all very small programs, well under a hundred lines of code each.

Amazon Web services

A goal of the pipeline system was to be able to run it on Amazon’s EC2 infrastructure. So we use SQS, S3 and SimpleDB heavily.

We do queueing with the SQS queue service. I like the service a lot with one small caveat. You have to poll the service to check if there are new messages. Ideally I’d be able to keep a connection open and be notified when a new message is available.
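To make the polling caveat concrete, here is a small Python sketch of a polling loop that backs off while the queue is empty. It does not use the SQS API: FakeQueue, its receive() method and all the timing parameters are hypothetical stand-ins chosen for the example.

```python
import time
from collections import deque


class FakeQueue:
    """In-memory stand-in for a remote queue service (hypothetical)."""

    def __init__(self, messages=()):
        self._messages = deque(messages)

    def receive(self):
        """Return the next message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None


def poll(q, handle, max_empty_polls=3, base_delay=0.01, max_delay=0.1,
         sleep=time.sleep):
    """Poll q and call handle() per message; give up after too many empty polls."""
    empty_polls = 0
    delay = base_delay
    while empty_polls < max_empty_polls:
        message = q.receive()
        if message is None:
            # Nothing to do: wait, then double the delay up to max_delay.
            empty_polls += 1
            sleep(delay)
            delay = min(delay * 2, max_delay)
        else:
            handle(message)
            # Work arrived, so poll eagerly again.
            empty_polls = 0
            delay = base_delay
```

The backoff keeps an idle component from hammering the service, at the price of some latency when the first message arrives after a quiet period.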

To store the structured news articles we use S3. Each article is one XML document. Additionally the images are stored on S3 as well. The local.ch binarypool handles that part.

In SimpleDB we store small lookup tables for duplicate detection. We wanted to use SimpleDB also for the article storage, but unfortunately it has a maximum value size of just 1 KB.
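A lookup table for duplicate detection only needs small values, which is why it fits within such a limit. The following Python sketch shows one possible shape: a short fingerprint per article, mapped to the id of the first article seen with it. The normalisation, the SHA-1 fingerprint and the dict standing in for SimpleDB are assumptions for illustration, and this variant only catches exact duplicates after normalisation, not near-duplicates.

```python
import hashlib
import re


def fingerprint(text):
    """Return a 40-character hex digest of the normalised text."""
    # Lowercase and collapse punctuation/whitespace so trivial
    # differences do not change the fingerprint.
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DuplicateIndex:
    """Maps fingerprints to the id of the first article seen with them."""

    def __init__(self):
        self._seen = {}  # stands in for the remote lookup table

    def check_and_add(self, article_id, body):
        """Return the id of the original article, or None if body is new."""
        key = fingerprint(body)
        original = self._seen.get(key)
        if original is None:
            self._seen[key] = article_id
        return original
```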

Setting sail again

Posted by Patrice Neff Fri, 12 Dec 2008

Today is my last day at local.ch.

I have been involved with that company since the beginning of 2005 with the initial prototypes. After my time in Peru I then started as a full-time employee in May 2006. That was a few days after the very first version of local.ch had gone live. So I spent close to four years involved with the project and of that two years and seven months as an employee.

Now it’s time to sail to fresh territory again. Beginning on the first of January I’ll be working on Nektoon, a startup I’m co-founding with some friends. I’m not disclosing too much about its purpose yet. Just this: your information will become very fluid soon… I’ll stay in Zürich, where we’re renting some spare rooms from Liip.

I want to thank all the people at local.ch for the past years. It’s been an exciting trip to build that site from zero. I got to know many outstanding people and have formed some great friendships while working there. Also I advanced a lot professionally as I was able to work with and learn from true masters in various fields. I’ll always be thankful for that privilege.

Active positive feedback

Posted by Patrice Neff Thu, 20 Nov 2008

At local.ch we have a Wall of fame in our wiki. In my personal opinion this is one of the best ideas we’ve ever had when it comes to our team organisation. The reason is that negative feedback gets delivered very quickly and all the time. Strangely enough you have to invest more in making sure positive feedback also gets around.

So what happens now is that if somebody helps a colleague or the whole team in a special way, that person gets an entry in the Wall of fame. It’s a really encouraging sight to scroll through that list and see how much people are helping each other, and how thankful the recipients are.

This image shows an anonymized extract.

1 million free phone calls

Posted by Patrice Neff Tue, 28 Oct 2008

local.ch is offering one million free phone calls for Swiss SMEs. Now that the Swiss banks have received 68 billion, local.ch wants to give something to Swiss SMEs as well.

local.ch shows how it’s done

(Disclaimer: I work for local.ch ag)

Okapi 1.0

Posted by Patrice Neff Fri, 14 Mar 2008

Yesterday we released version 1.0 of Okapi, a web framework built with PHP and XSLT. I’ve spent a substantial amount of time during the last months working on that release. Okapi is the framework we use at local.ch for all our frontend needs and was originally developed by Silvan from Liip.

So far we have used a heavily modified fork of Okapi. New projects at local.ch no longer use that internal fork but the official central Okapi instead. To facilitate that I’ve merged some stuff from our fork into the main repository and I’ve also cleaned up the code base so that Okapi now sucks less.

The main features are these:

Because of its good XML and XSLT support it’s ideal for service-oriented architectures, especially when built with REST APIs.

Development on Okapi will continue. But the idea is to add new features as extensions, so that the core can stay small.

"Quiet time" at work

Posted by Patrice Neff Sat, 20 Oct 2007

Intel is doing a quiet time pilot with about 300 people. In that pilot they disconnect on Tuesday mornings. No mails, no IM, no phones, nobody allowed to walk into the office. The pilot launched at the end of August, but I only learned of it today through another story on the same blog: No Email Day (via NZZ). On Fridays they now encourage people not to use email but to talk face-to-face or use the phone instead.

I like the idea of quiet times a lot. In fact at local.ch we have been doing something similar for a few weeks. Every morning from 10:00 to 12:30 no interruptions are allowed at all. Thursday is the exception: it is our meeting day because most of our partners are around on that day. But for the other days it means that you can’t walk in on co-workers during that time, our Skype clients are switched to “Do not disturb” and many people shut off their mail clients. Skype is by the way quite intelligent about its “Do not disturb” mode: any message that comes in gets queued. While you can access the messages if you need to, they are not pushed into your face.

The times are not arbitrary. Most of us come to the office at 9 in the morning. Beginning the quiet time an hour later makes sure that everybody can first gather the notes and feedback needed for the morning’s work. Then when people switch into quiet mode at 10, they hopefully have all they need to get their work done. The end is defined by our lunch break which usually starts at about 12:30.

The other idea Intel is implementing doesn’t sound very interesting to me, though. While the concept of “no email” sounds good, their motivation strikes me as strange. Walking in on people or using the phone steals a lot more attention than sending a mail. You can turn off your mail client or switch off the alerts. And actually I’m doing that. I check my mail only about every second day. The rest of the time my mail client is turned off.

PHP Testing with SimpleTest

Posted by Patrice Neff Tue, 22 May 2007

Maarten’s post at Tillate finally brought the motivation to document the PHP testing approach we use at local.ch.

First let me give you a short introduction to our architecture at local.ch. We have a clear separation of frontend (presentation, user-visible parts) and backend (search logic and database accesses). The frontend is written in PHP and XSLT. The PHP part basically only orchestrates queries to our Java-based backend and passes the XML responses to XSLT. The bigger part of the frontend is the XSLT stylesheets. All this means that traditional unit tests don’t have much value for the frontend as there isn’t much traditional logic. But we need to do functional/integration testing.

We have only had a nice PHP-based testing infrastructure for a short time. Before that, we almost exclusively used Selenium Core; see for example my presentation of last year. Now we use SimpleTest, slightly extended and with a helper class for the Selenium testing (to be documented in a separate blog post).

This is the basic test.php file which we use to execute the tests:

<?php
require_once("common.php");

// "Configuration"
$GLOBALS['TLD'] = 'local.ch';
$GLOBALS['SELENIUM_SERVER'] = 'localhost';

// Optional per-developer overrides (file is not under version control)
if (file_exists('config_developer.php')) {
    include_once('config_developer.php');
}
// Environment variables take precedence over everything else
if (getenv('SELENIUM_SERVER')) {
    $GLOBALS['SELENIUM_SERVER'] = getenv('SELENIUM_SERVER');
}
if (getenv('TLD')) {
    $GLOBALS['TLD'] = getenv('TLD');
}

/**
 * $onlyCase: Only run this test case
 * $onlyTest: Only run this test within the case
 */
function runAllTests($onlyCase = false, $onlyTest = false) {
    $test = new TestSuite('All tests');
    $dirs = array("unit", "selenium", "selenium/*");

    // Collect every PHP file in the test directories
    foreach ($dirs as $dir) {
        foreach (glob($dir . '/*.php') as $file) {
            $test->addTestFile($file);
        }
    }

    if (!empty($onlyCase)) {
        $result = $test->run(new SelectiveReporter(new TextReporter(), $onlyCase, $onlyTest));
    } else {
        $result = $test->run(new XMLReporter());
    }
    return ($result ? 0 : 1);
}

// Both command line arguments are optional
$onlyCase = isset($argv[1]) ? $argv[1] : false;
$onlyTest = isset($argv[2]) ? $argv[2] : false;
// Use the result as the exit status so calling tools can detect failures
exit(runAllTests($onlyCase, $onlyTest));
?>

The top part sets up some configuration values we use for Selenium. There are two global variables: TLD, which defines the host name to test against, and SELENIUM_SERVER, which is the Selenium server to connect to. There are two ways to configure them: either with the config_developer.php file, which is excluded from version control and can be created by each developer, or by setting environment variables when calling the test script.
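The same precedence (built-in defaults, then an optional developer file, then environment variables) can be sketched in a few lines of Python. This is only an illustration of the layering, not part of our PHP setup; the JSON file name and the load_config function are made up for the example.

```python
import json
import os

DEFAULTS = {"TLD": "local.ch", "SELENIUM_SERVER": "localhost"}


def load_config(developer_file="config_developer.json", environ=None):
    """Layer the settings: defaults < developer file < environment."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)

    # Optional per-developer overrides (hypothetical JSON file).
    if os.path.exists(developer_file):
        with open(developer_file) as handle:
            config.update(json.load(handle))

    # Environment variables win over everything else.
    for key in DEFAULTS:
        if environ.get(key):
            config[key] = environ[key]
    return config
```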

After that the tests are run. The script includes the tests from a set of directories and then uses either the SelectiveReporter or our own XMLReporter to execute them. The SelectiveReporter only executes a given test class, or even only a given method (the first and second command-line parameters respectively). The XMLReporter produces JUnit-style parseable output that we use for our continuous integration tool (Bamboo in our case).

The included common.php file contains this:

<?php
error_reporting(E_ALL);
ini_set('log_errors', '0');

if (!defined('SIMPLE_TEST')) {
    define('SIMPLE_TEST', BX_PROJECT_DIR . 'inc/vendor/simpletest/');
}
require_once(SIMPLE_TEST . 'reporter.php');
require_once(SIMPLE_TEST . 'unit_tester.php');

class XMLReporter extends SimpleReporter {
    function XMLReporter() {
        $this->SimpleReporter();

        $this->doc = new DOMDocument();
        // The XML literal was lost in the original posting; an empty
        // <testsuite/> root is assumed here to match the JUnit-style output.
        $this->doc->loadXML('<testsuite/>');
        $this->root = $this->doc->documentElement;
    }

    function paintHeader($test_name) {
        $this->testsStart = microtime(true);
        $this->root->setAttribute('name', $test_name);
        $this->root->setAttribute('timestamp', date('c'));
        $this->root->setAttribute('hostname', 'localhost');
        echo "\n";
        // Assumed: open an XML comment around the progress output (it is
        // closed in paintFooter). The rest of this method was lost in the
        // original posting.
        echo "<!--\n";
    }

    /**
     * @param string $test_name Name class of test.
     * @access public
     */
    function paintFooter($test_name) {
        echo "-->\n";
        $duration = microtime(true) - $this->testsStart;
        $this->root->setAttribute('tests', $this->getPassCount() + $this->getFailCount() + $this->getExceptionCount());
        $this->root->setAttribute('failures', $this->getFailCount());
        $this->root->setAttribute('errors', $this->getExceptionCount());
        $this->root->setAttribute('time', $duration);
        $this->doc->formatOutput = true;
        $xml = $this->doc->saveXML();
        // Cut out XML declaration
        echo preg_replace('/<\?[^>]*\?>/', "", $xml);
        echo "\n";
    }

    function paintCaseStart($case) {
        echo "- case start $case\n";
        $this->currentCaseName = $case;
    }

    function paintCaseEnd($case) {
        // No output here
    }

    function paintMethodStart($test) {
        echo "  - test start: $test\n";
        $this->methodStart = microtime(true);
        $this->currCase = $this->doc->createElement('testcase');
    }

    function paintMethodEnd($test) {
        $duration = microtime(true) - $this->methodStart;
        $this->currCase->setAttribute('name', $test);
        $this->currCase->setAttribute('classname', $this->currentCaseName);
        $this->currCase->setAttribute('time', $duration);
        $this->root->appendChild($this->currCase);
    }

    function paintFail($message) {
        parent::paintFail($message);
        if (!$this->currCase) {
            error_log("!! currCase was not set.");
            return;
        }
        error_log("Failure: " . $message);

        // Attach a <failure> element to the current <testcase>
        $ch = $this->doc->createElement('failure');
        $breadcrumb = $this->getTestList();
        $ch->setAttribute('message', $breadcrumb[count($breadcrumb)-1]);
        $ch->setAttribute('type', $breadcrumb[count($breadcrumb)-1]);
        $message = implode(' -> ', $breadcrumb) . "\n\n\n" . $message;
        $content = $this->doc->createTextNode($message);
        $ch->appendChild($content);
        $this->currCase->appendChild($ch);
    }
}
?>

This file sets up SimpleTest by including the necessary files. Then follows the definition of the XMLReporter. It prints out some debugging output so we know where it’s at. That’s necessary for us because our Selenium tests take about 15 to 20 minutes. At the end follows the XML result which can be parsed by Bamboo. It should also work for other tools that expect JUnit XML output but I haven’t tested that.
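For readers who have not seen the JUnit-style format, the following Python sketch builds the same general shape: a testsuite root with one testcase child per test and a nested failure element where a test failed. It is an illustration only and not a port of the reporter above; the function name, the tuple layout and the three-decimal time format are assumptions.

```python
import xml.etree.ElementTree as ET


def junit_report(suite_name, results):
    """Build JUnit-style XML.

    results is a list of (classname, test name, seconds, failure) tuples,
    where failure is a message string or None for a passing test.
    """
    suite = ET.Element("testsuite", name=suite_name)
    failures = 0
    for classname, name, seconds, failure in results:
        case = ET.SubElement(suite, "testcase", classname=classname,
                             name=name, time="%.3f" % seconds)
        if failure is not None:
            failures += 1
            node = ET.SubElement(case, "failure", message=failure,
                                 type="failure")
            node.text = failure
    suite.set("tests", str(len(results)))
    suite.set("failures", str(failures))
    suite.set("errors", "0")  # errors are not tracked in this sketch
    return ET.tostring(suite, encoding="unicode")
```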

local.ch talk at Tweakfest

Posted by Patrice Neff Mon, 21 May 2007

At Tweakfest I will give a short Ajax workshop on Friday. During the workshop I will show how a small local.ch API can be used to autocomplete Swiss town names in a form. The aim is to give a brief introduction to Ajax with a practical example.

Before my talk, Marcel Vogt will present Apollo.

The workshop takes place on Friday, 25 May at 14:30 in room Newton 1011 at Technopark Zürich; Daniel Vogt starts at 13:30.

Get in touch with me if you would like to attend the workshop for free. Participation normally costs money.

Looking for a frontend developer

Posted by Patrice Neff Wed, 14 Feb 2007

At local.ch we’re looking for a developer in the area of PHP/XSLT development. You will take over work regarding all user-visible aspects with technologies such as PHP, XSLT or Javascript.

A job at local.ch gives you a lot of freedom to explore, find good solutions, learn new technologies, bring in your opinions and knowledge.

We want a developer who knows how to write clean XHTML and CSS, has experience in client-side Javascript and is well-versed in XML. We will gladly teach you XSLT on the job but if you already know how to program in XSLT, so much the better.

If you’re interested, head over to our blog to read more details. You can get in contact with me via e-mail (patrice [at] local.ch) or Skype (patriceneff).

market.local.ch

Posted by Patrice Neff Wed, 13 Dec 2006

In the past few weeks I was mainly busy creating the new local.ch market. It’s actually in cooperation with Fundgrueb, a very well known and traditional classifieds newspaper in the German part of Switzerland.

I created a longer posting on the local.ch blog which also contains some information about what’s different from other such platforms.

The nice upload is thanks to Chregu’s Upload Progress Meter PECL extension by the way (forgot to mention that on the local.ch blog post).

Mail testing with Selenium

Posted by Patrice Neff Thu, 23 Nov 2006

For the next phase of local.ch, e-mail processes will play a central role. So I wanted to include those processes in our Selenium tests. It’s actually quite easy to do.

First create an account where test mails can go to. That account should be accessible by one of your scripts. I use a normal IMAP account for that. Then write a script which always outputs the newest mail on that account. I include some of the important headers plus the body (body parts for multi-part mails). I also made that page refresh itself every two seconds.

Then writing the tests is easy. Write a test first that executes the action that sends a mail. Make sure the mail is sent to your test account.

Next write a test that opens the getmail script (using the Selenese command “open”). Follow that with a waitForTextPresent action to wait until the test mail has arrived, which never takes more than a few seconds in my environment. Then you can use the normal test commands such as verifyText, verifyTextPresent or even click etc. if you output HTML mails correctly.

Works like a charm around here. If there is interest I can publish my script to get the mails. It’s written in PHP and is basically an IMAP client using the two PEAR packages Net_IMAP and Mail_mimeDecode.
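Until then, here is a rough Python sketch of the same idea using only the standard library modules imaplib and email. It is not my PHP script: the function names and the header list are chosen for illustration, and fetch_newest needs a real IMAP account to run.

```python
import email
import imaplib

HEADERS = ("From", "To", "Subject", "Date")


def render_message(raw_bytes):
    """Return the important headers plus all text body parts as plain text."""
    msg = email.message_from_bytes(raw_bytes)
    lines = ["%s: %s" % (h, msg[h]) for h in HEADERS if msg[h]]
    # walk() also visits the parts of multi-part mails.
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            lines.append(payload.decode(charset, errors="replace"))
    return "\n".join(lines)


def fetch_newest(host, user, password):
    """Fetch the raw newest message from the test account, or None if empty."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select("INBOX", readonly=True)
        status, data = conn.search(None, "ALL")
        ids = data[0].split()
        if not ids:
            return None
        # Sequence numbers are ascending, so the last one is the newest.
        status, fetched = conn.fetch(ids[-1], "(RFC822)")
        return fetched[0][1]
    finally:
        conn.logout()
```

A page served to Selenium would call fetch_newest, pass the result to render_message and print the text, so that waitForTextPresent can match on the subject or body.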

Norton Ad Blocking and AJAX

Posted by Patrice Neff Tue, 26 Sep 2006

Management summary of this article: turn off Norton Ad Blocking as installed by Norton Personal Firewall. It sucks. If you are interested in details or you develop AJAX sites for the broad market, read on.

Some users were not able to navigate to page two, three, etc. on the result lists of local.ch. As that functionality is implemented with AJAX I first suspected a general XMLHttpRequest problem. But funnily enough people were still able to submit feedback, which is also sent to our server with an AJAX call. After seeing some of the JavaScript exceptions raised for our users I noticed that there was a function called SymOnLoad in the output. As we don’t have any function by that name I started investigating and found a connection to the Norton Personal Firewall.

It turns out that the Norton Personal Firewall inserts its own script tags into the delivered pages when the ad blocker is enabled. It does so only if there are script tags already in the original page as well. In our case we had a script tag in JSON responses. The addition by the Norton Firewall caused the resulting JSON to be invalid, which raised exceptions. Those exceptions in turn were silently ignored because of the Norton Firewall addition.

I fixed the issue by removing the script tag and putting the corresponding script in our general utility file. The other solution would have been to check for existence of the Norton Firewall and turn off AJAX in that case. It’s actually quite easy to detect Norton Firewall. Something like

if (typeof SymOnLoad !== 'undefined') {
    useAjax = false;
}

should do the trick.

local.ch is looking for you

Posted by Patrice Neff Mon, 04 Sep 2006

As you may already have seen on our blog, we at local.ch are looking for new co-workers. If you want to be part of a team of geeks working on one of the most exciting Web projects in Switzerland, join us.

We’re looking for exceptionally talented geeks. The currently open positions require Java programming, so you should already feel comfortable with that. You top it off if you already know Spring and Lucene – but that’s not required as we’re looking for people who learn quickly.

Behind the scenes of local.ch there is a lot of data crunching to be done. While creating a phonebook may not sound like much of a challenge at first glance, there are some exciting problems that have to be solved. And the currently available slots phonebook and guide are just the start. We also don’t only offer you exciting work but a job with a future. There is a solid business model behind local.ch and we are here to stay.

See the official job descriptions at our blog. You can write me at patrice [at] local.ch if you have questions or to send in your CV.

local.ch guide

Posted by Patrice Neff Mon, 12 Jun 2006

Last Friday we had a reason to celebrate at local.ch (my employer). We released the Guide aka Leisure. That's the place to go for events and locations like restaurants, bars, cinemas, etc. See the extended release notes on the team blog.

The data is not as broad yet as we'd like it to be. You'll currently only find real information in the bigger cities of Switzerland. Of course we want to improve that in the upcoming months.

Many other new features are upcoming this year. Stay tuned. Even better, subscribe to the local.ch Team Blog for the latest information.

Webtuesday Zurich

Posted by Patrice Neff Wed, 17 May 2006

I attended the Webtuesday yesterday, hosted by Jürg Stuker at the namics office in Zürich. The main talk was given by Urban Müller of search.ch about their clustering implementation. Statistics, performance, server redundancy, etc. were the topics. Bernhard Seefeld talked about his current solution for Endoxon. And Cédric added a few notes about our solution at local.ch. Silvan and Stefan of Tilllate had a few things to say about their layout as well. All in all it was a very interesting evening and I guess all of us learned a few things.

We then moved to El Lokal for a beer and pizza. We of local.ch had to accept being defeated by tel.search.ch at looking up the number of a pizza delivery service. But only because we don't have the mobile interface ready yet. It's already being implemented, though, and is one of the missing features I personally care most about.

It was once again interesting to meet a few people I had only met online so far, especially Denis De Mesmaeker and Alain Petignat.

Next time I'll talk about Ruby on Rails which I already defended at our table yesterday. We planned/are planning to have that event on June 13 but that date collides with the Swiss victory over France at the Football World Cup. Details are currently being negotiated and will be announced on the Webtuesday Zurich Web site.

Started at local.ch

Posted by Patrice Neff Thu, 11 May 2006

On Monday I officially started my new job at local.ch, the search engine for every village in Switzerland.

We currently offer access to the phone book but will add other services in the future. I'll keep you posted of course! There is also the local.ch Team Blog to which I'll contribute.

I'll mainly be working in system administration (we run a small server farm mainly on Linux) and programming (starting with PHP frontend work). I'll also be responsible for the community part of local.ch. We have a few ideas about contributing back to the Swiss blogosphere and that will be my job.

Thanks to the local.ch team for the great reception. I'm very excited to work on this project, especially having known it since its first days.

Tech job market in Switzerland

Posted by Patrice Neff Thu, 09 Mar 2006

It appears that the Swiss job market in the tech sector is not as bad as I thought for some time. If you are a well-qualified individual you should have no problem finding a job - at least in the Web sector.

Why do I say this? Well, recently Bitflux had a job opening (now filled), local.ch has also been looking, as is now search.ch. namics has a few job openings as I'm sure have other agencies.

All of them have one thing in common which was not always true for our sector: they look for highly skilled workers (though namics also has internships in Baar/Zug and in St. Gallen). A few years ago, companies in our sector were adding people to their workforce like wild and thus had to sacrifice on quality. Now it's the opposite.

I for one believe that is a very welcome change and so do many of my friends. But I also know a few people who profited from the "more liberal" practices during the dot-com boom.

Anyway, I'll add a few articles in the following weeks on how to improve your chances of getting a job in this area. It won't be rocket science but I hope I have a few tips to share. Those articles will be available in English and in German.

And I suggest you tag job openings on your weblogs with the "jobs" tag. That will allow job seekers to watch the tag (or subscribe to it).


Update 1: Forgot that KAYWA is also looking for talent.
Update 2: And Google is also looking. (Via relab.ch).