Last week I had to solve a problem that was non-trivial given my background: clustering content. I had to write a program that takes a bunch of search results and clusters them by content, so that similar results end up in the same group.
I started with Carrot2, a Java framework for exactly that purpose. The only available documentation is the API reference and some examples. The API documentation contains 796 classes. That’s no typo, count them if you must. I spent literally two working days trying to get it running. I got it running somehow, but got stuck when I had to customize the text distance function.
That’s when I started to search for other packages. I found python-cluster. It exposes two classes (for the two different clustering algorithms) with a constructor and one method each. All I have to pass it is the list of results and a distance function.
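To make that interface shape concrete, here is a minimal sketch of a clusterer built the same way: a constructor that takes the items and a distance function, plus one method that returns the groups. The class name `ThresholdClusterer`, the method `clusters`, and the `threshold` parameter are made up for illustration; this is not python-cluster's code or API, just a simple single-linkage grouping under those assumptions.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class ThresholdClusterer:
    """Hypothetical stand-in for a 'constructor plus one method' clustering API."""

    def __init__(self, items: Sequence[T], distance: Callable[[T, T], float]):
        self.items = list(items)
        self.distance = distance

    def clusters(self, threshold: float) -> List[List[T]]:
        """Group items so that anything within `threshold` of a group member joins it."""
        groups: List[List[T]] = []
        for item in self.items:
            near: List[List[T]] = []
            far: List[List[T]] = []
            # Split existing groups into those close to the new item and the rest.
            for group in groups:
                is_near = any(self.distance(item, other) <= threshold for other in group)
                (near if is_near else far).append(group)
            # Merge all nearby groups with the new item into a single group.
            merged = [member for group in near for member in group] + [item]
            groups = far + [merged]
        return groups


# Numbers keep the example short; any item type works with a matching distance.
numbers = [1, 2, 10, 11, 30]
clusterer = ThresholdClusterer(numbers, lambda a, b: abs(a - b))
result = clusterer.clusters(threshold=2)
print(result)
```

The whole surface is two calls, so the only real design decision left to the caller is the distance function.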
I was up and running literally in less than an hour. Most of that time went into writing a reasonable distance function.
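For a sense of what such a distance function can look like, here is a small sketch using Jaccard distance over word sets. It is an assumption chosen for illustration, not the function from this project: the name `jaccard_distance` and the word-splitting rule are hypothetical, and a real one would probably need stemming or stop-word handling.

```python
import re


def jaccard_distance(a: str, b: str) -> float:
    """Return 0.0 for identical word sets and 1.0 for texts sharing no words."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    union = words_a | words_b
    if not union:
        # Two empty texts are treated as identical (an arbitrary choice).
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


# Two of three words overlap, and the union holds four distinct words.
example = jaccard_distance("python clustering library", "Python clustering package")
print(example)
```

A function with this two-argument shape is the kind of thing that can be handed to a clustering class alongside the list of results.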
Not passing any judgment here. Both frameworks have their strengths. But I found it a very good example of the different philosophies in the Java and Python camps.