Data science time!

cohort analysis plot

Take a look at this cohort analysis plot. It shows that users which registered in 2008 were active for couple of years and all later users (2009+ year) quickly abandoned using forum. After reading this post there will be no obstacle for you to use this technique in your project.


A cohort is a group of people who share a common characteristic or experience within a defined period (e.g., are born, are exposed to a drug ) ... Widipedia

Cohort analysis could be used in many cases – in medicine it is used for disease investigation, in case of e-commerce observing different cohorts behavior gives us understanding of user retention or customer lifetime.


Let's consider #rails app with #mysql database. This application is a forum where users write replies. For now we will write a cohort analysis code out of rails app, just a simple ruby script. Let's begin with our model definitions:

require 'active_record'

  adapter: "mysql2",
  host: "localhost",
  database: "database_name",
  user: "username",
  password: "password",
  pool: 5,
  timeout: 5000

class Reply < ActiveRecord::Base
  belongs_to :user

class User < ActiveRecord::Base
  has_many :replies

Above code allows us to write active record queries so let's write our cohort analysis SQL query. So first we need to define a cohort period quantum, i.e. one step on horizontal axis. For now let it be year (cohort analysis with year period will be interesting for venture capytalist for example, because if project grows with each year users should post more and more comments, it is a sign of good user retention).

With determined cohort period we need to group users by their "cohort date" (in our case it is a date when user left first reply) and count all replies in each cohort period for each cohort date. That is it.

replies = Reply.joins(:user).
          (SELECT user_id,
                  Min(created_at) AS cohortDate
           FROM replies
           GROUP BY user_id) AS cohorts ON replies.user_id = cohorts.user_id
        AND replies.user_id NOT IN
          (SELECT user_id
           FROM site_admins)").
            DATE_FORMAT(cohorts.cohortDate, '%Y') AS cohort_date_symbol,
            TIMESTAMPDIFF(YEAR, cohorts.cohortDate, replies.created_at) AS cohort_period")

stats = replies.inject({}) do |stats, reply|
  stats[reply.cohort_period] ||= {}
  stats[reply.cohort_period][reply.cohort_date_symbol] ||= 0
  stats[reply.cohort_period][reply.cohort_date_symbol] += 1

# {0=>{"200808"=>2005, "200810"=>204, "200809"=>1007} ... }

Note that in this query we determine cohort period as a difference between cohort date and created_at field of comment. It is done via TIMESTAMPDIFF #mysql function:

TIMESTAMPDIFF(unit, datetime_expr1, datetime_expr2)

list of available units

In #postgresql there is no TIMESTAMPDIFF but such a date subtraction with period could be done via EXTRACT function.

!! Note, that we could convert created_at field to cohort date symbol using ruby's strftime, but benchmark shows that it is 10% slower on a 100k users database.

We are almost there, let's just output all the cohort data to csv:

require 'csv'
file ='cohort_data.csv', 'w')

min_year = Reply.minimum(:created_at).year
max_year = Reply.maximum(:created_at).year
cohort_dates = (min_year..max_year).map(&:to_s)

title_row = ['cohortPeriod'] + cohort_dates
file << title_row

stats.each do |cohortPeriod, dates|
  arr = [cohortPeriod]
  cohort_dates.each do |date|
    arr << dates[date].to_i

  file << arr


Open cohort_data.csv and make a plot:

cohort analysis data cohort analysis plot


Bingo! As you can see building such plots is fun :) There are many opportunities for us to go further – we could play with different cohort periods, we could normalize our data and so on but it is a lil bit beyond of the scope of current article. Since we understand how cohort analysis works you may want to check out one of #ruby gems for building cohort analysis plots easy:

Full script available at