On Friday night a couple of my colleagues and I decided to participate in RailsRumble 2012, a competition for Rails developers in which a team of up to 4 people has 48 hours to create an app from scratch, configure a Linode VPS server and deploy the app. Then 65 judges select 10 winners, and there is also one people's choice award.
To pick an idea for the project we brainstormed: we took turns speaking for 30 seconds each, pointing out the good and bad qualities of an idea. After rejecting a personal checklist of useful todos and an aggregator of product reviews, we came up with GitFM. I also want to mention that our team is geographically distributed: we live in Minsk (Belarus), Moscow (Russia), Prague (Czech Republic) and Helsinki (Finland).
In the Minsk timezone the rumble started at 3 am. At 7 am we were discussing the architecture and tools for building a recommendation service. We spent a couple of hours reading about collaborative filtering algorithms, which are used by Amazon, LastFM and many others to recommend relevant items. That is how we found out about Apache Mahout, and since it is built on top of Hadoop we decided to use it. With Hadoop we can calculate recommendations really fast, which is a key factor for our app.
So far our server setup looks like this:
The core of our project is a Rails application. It is responsible for fetching user and repo data from GitHub. To make it really fast we chose EM-Synchrony by @igrigorik. Let me list a couple of useful links so you can start using it right away:
jRuby daemon and Apache Mahout
When we need to generate recommendations for a user, the Rails app pushes the user id to a Redis queue, which is processed by a jRuby daemon. Why jRuby? Apache Mahout is written in Java, so to make communication with it easy we wrote a jRuby script that pulls the user_id and then calls a method that writes the recommendations to a PostgreSQL database. To create the jRuby daemon we used this generator; it is really good, great respect to @junegunn.
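The hand-off between the web app and the daemon described above can be sketched roughly as follows. This is only an illustration, not the GitFM code: Ruby's built-in `Queue` stands in for the Redis list (a real setup would more likely use Redis list commands such as LPUSH and BRPOP), the `RESULTS` hash stands in for the PostgreSQL table, and `request_recommendations`, `run_worker` and `fake_recommender` are hypothetical names.

```ruby
require "thread"

# In-memory stand-in for the Redis list; Queue is Ruby's thread-safe FIFO.
JOBS = Queue.new
# Stand-in for the PostgreSQL table that stores recommendations.
RESULTS = {}

# Web-app side: enqueue a request for a user's recommendations.
def request_recommendations(user_id)
  JOBS.push(user_id.to_s)
end

# Daemon side: take user ids off the queue until it sees :stop.
def run_worker(recommender)
  loop do
    job = JOBS.pop # blocks until a job is available
    break if job == :stop
    RESULTS[job] = recommender.call(job)
  end
end

# Hypothetical recommender; the real one would call into Mahout.
fake_recommender = ->(user_id) { ["repo-for-#{user_id}"] }

worker = Thread.new { run_worker(fake_recommender) }
request_recommendations(42)
JOBS.push(:stop)
worker.join
```

Keeping the recommender behind a plain callable means the queue-handling loop does not need to know whether the work is done by Mahout or by a test double.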
Apache Mahout implements collaborative filtering algorithms, and there are many variations of them. For the first version of GitFM we used probably the simplest one: we use Tanimoto coefficients for similarity calculation and a binary model of user preference (i.e. a user gives each repo a mark of 0 or 1). Over these couple of days I fell in love with collaborative filtering techniques, so let me share some links with you:
- http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
- http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/
- http://jaydonnell.com/blog/2011/10/21/collaborative-filtering-using-jruby-and-mahout/
- http://code.google.com/p/unresyst/wiki/CreateMahoutRecommender
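To give a feel for the Tanimoto coefficient with binary preferences mentioned above, here is a minimal plain-Ruby sketch. It does not use Mahout and is not how GitFM computes recommendations. It only shows the idea: the similarity of two users is the size of the intersection of their repo sets divided by the size of the union. The `PREFS` data and the `tanimoto` and `recommend` helpers are made up for illustration, and scoring candidates by summed similarity is just one simple choice.

```ruby
require "set"

# Tanimoto (Jaccard) coefficient between two sets of repo ids:
# |A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty.
def tanimoto(a, b)
  union = (a | b).size
  return 0.0 if union.zero?
  (a & b).size.to_f / union
end

# Hypothetical binary preferences: user => set of repos marked 1.
PREFS = {
  "alice" => Set["rails", "sinatra", "redis-rb"],
  "bob"   => Set["rails", "redis-rb", "em-synchrony"],
  "carol" => Set["mahout", "hadoop"]
}

# Recommend repos liked by similar users, scored by summed similarity.
def recommend(user, prefs)
  own = prefs.fetch(user)
  scores = Hash.new(0.0)
  prefs.each do |other, repos|
    next if other == user
    sim = tanimoto(own, repos)
    next if sim.zero?
    (repos - own).each { |repo| scores[repo] += sim }
  end
  # Highest score first; ties broken alphabetically for stable output.
  scores.sort_by { |repo, score| [-score, repo] }.map(&:first)
end
```

With this sample data, alice and bob share 2 of 4 distinct repos, so their similarity is 0.5 and `recommend("alice", PREFS)` returns `["em-synchrony"]`. Carol shares nothing with anyone, so she gets no recommendations.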
In fact, GitFM is three applications communicating through Redis. To signal that recommendations are ready we wrote a special server: when a signed-in user visits the recommendations page, we open a connection to this server, which subscribes to Redis with the messages:* pattern and stores id-connection pairs. When the recommendations are ready, a message with the user id is sent and the user receives the recommendations via a web socket. To use these techniques you need to be familiar with Redis pub/sub and web sockets.
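The id-connection bookkeeping of that notification server could look roughly like the sketch below. It is an illustration under assumptions, not the actual server: it assumes the user id is carried in the channel name (`messages:<user_id>`), it leaves out the Redis subscription and the real web socket, and `ConnectionRegistry`, `FakeSocket` and its `write` method are hypothetical names.

```ruby
# Maps a user id to that user's open connection.
class ConnectionRegistry
  def initialize
    @connections = {}
  end

  # Remember the connection opened by a signed-in user.
  def register(user_id, connection)
    @connections[user_id.to_s] = connection
  end

  # Forget the connection, e.g. when the socket closes.
  def unregister(user_id)
    @connections.delete(user_id.to_s)
  end

  # Handle a message published on a channel matching "messages:*".
  # Returns true if a connected user was notified, false otherwise.
  def dispatch(channel, payload)
    prefix, user_id = channel.split(":", 2)
    return false unless prefix == "messages" && user_id
    connection = @connections[user_id]
    return false unless connection
    connection.write(payload)
    true
  end
end

# Hypothetical stand-in for a web socket connection; it only records
# what would have been sent to the browser.
class FakeSocket
  attr_reader :sent

  def initialize
    @sent = []
  end

  def write(data)
    @sent << data
  end
end
```

In a real server, `dispatch` would be called from the Redis pattern-subscription callback, and `register` and `unregister` from the socket's open and close handlers.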
At the moment we have 3128 users in our database (probably many more already) and more than 95000 repos. We have processed more than 4000 recommendation requests.
Here is the distribution of languages in GitFM:
To make our repo preference model smarter we need to consider forks, watched repos, followers and so on. Also, after RailsRumble we are going to set up another server for generating recommendations; we want relativistic-fast recommendations.
Vote for us!
If you like GitFM, please vote for us in RailsRumble:
- sign in with Twitter
- press Favorite