Merging duplicates from broken unique validation in Rails by makaroni4
Have you ever faced a situation where uniqueness validations in your #rails app didn't work? Below is a short story about how this happened in a project I maintain, and how the broken records (duplicates) were fixed.
Uniqueness validation vs Race conditions
In your #rails app, when you want some field to be unique, you usually do this:
class User < ActiveRecord::Base
  validates :email, uniqueness: true
end
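But as the Rails docs for the uniqueness validation describe, ActiveRecord does not guarantee that there will be no duplicates, because of race conditions. The validation runs a SELECT to check for an existing row and then a separate INSERT, so two concurrent requests can both pass the check before either of them inserts. And this really does happen.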
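The race is easier to see with the database taken out of the picture. The sketch below is plain Ruby, not ActiveRecord: FakeUsersTable and simulate_two_requests are hypothetical names invented for illustration. It replays what a uniqueness validation does (check that no row exists, then insert) for two requests whose steps interleave.

```ruby
# A toy in-memory "table" with no unique index (hypothetical, for illustration).
class FakeUsersTable
  def initialize
    @emails = []
  end

  # What the uniqueness validation does: look for an existing row.
  def exists?(email)
    @emails.include?(email)
  end

  # A plain INSERT with nothing at the storage level to stop duplicates.
  def insert(email)
    @emails << email
  end

  def count(email)
    @emails.count(email)
  end
end

# Replays two requests whose validate and insert steps interleave.
def simulate_two_requests(email)
  table = FakeUsersTable.new

  # Both requests validate before either one has inserted.
  request_a_valid = !table.exists?(email)
  request_b_valid = !table.exists?(email)

  # Both validations passed, so both requests insert.
  table.insert(email) if request_a_valid
  table.insert(email) if request_b_valid

  table.count(email)
end

puts simulate_two_requests('blah@blah.blah') # prints 2
```

Neither request did anything wrong on its own; the duplicate appears only because the check and the write are two separate steps.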
To demonstrate this, let's create a simple application with a User model and imitate concurrent requests. We will add a create action for users and then make a number of requests using #em_synchrony.
rails g model User email:string
rake db:migrate
# app/controllers/users_controller.rb
class UsersController < ApplicationController
  def create
    User.create email: 'blah@blah.blah'
    render :nothing => true
  end
end
# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  # protect_from_forgery
end
# config/unicorn.rb
worker_processes 5
# run server with
RACK_ENV=none RAILS_ENV=development unicorn -c config/unicorn.rb -p 3000
NOTICE: since we will just make POST HTTP requests, I disabled the CSRF token check by commenting out protect_from_forgery. Never do this in production.
A good tool for imitating a lot of users making concurrent requests is #em_synchrony:
require "em-synchrony"
require "em-synchrony/em-http"
URL = 'http://localhost:3000/users'
EM.synchrony do
  CONCURRENCY = 10

  # Fire 500 POST requests, at most CONCURRENCY of them in flight at a time.
  results = EM::Synchrony::Iterator.new((1..500), CONCURRENCY).map do |index, iter|
    http = EventMachine::HttpRequest.new(URL).apost
    http.callback do
      puts "SUCCESS #{index}"
      iter.return(http)
    end
    http.errback do
      puts "ERROR #{http.response_header.status}"
      iter.return(http)
    end
  end

  EM.stop
end
After running this script, you can see that we have 5 users in our database with the same email.
User.count
# => 5
Fighting race conditions
To prevent duplicates, the recipe is quite simple: create a unique index and enforce uniqueness at the database level:
add_index :users, :email, unique: true
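With the index in place, the losing request no longer creates a row; the database rejects its INSERT, and in Rails that surfaces as an ActiveRecord::RecordNotUnique exception that the application has to handle. The sketch below imitates this in plain Ruby so it runs without a database. UniqueIndexedTable, DuplicateKeyError and create_user are hypothetical names, and the Mutex stands in for the atomicity a real unique index provides.

```ruby
require 'set'

# Stands in for the error the database raises on a unique index violation.
class DuplicateKeyError < StandardError; end

# A toy table whose insert enforces uniqueness itself, like a unique index.
class UniqueIndexedTable
  def initialize
    @emails = Set.new
    @lock = Mutex.new
  end

  def exists?(email)
    @lock.synchronize { @emails.include?(email) }
  end

  # Check and write happen atomically under one lock, so no interleaving
  # of callers can slip a second copy in.
  def insert(email)
    @lock.synchronize do
      raise DuplicateKeyError, "duplicate email: #{email}" unless @emails.add?(email)
    end
  end

  def size
    @lock.synchronize { @emails.size }
  end
end

# Validate first (cheap rejection in the common case), then rely on the
# index for the racy case and treat its error as a failed validation.
def create_user(table, email)
  return :invalid if table.exists?(email)
  table.insert(email)
  :created
rescue DuplicateKeyError
  :invalid
end

table = UniqueIndexedTable.new
# Ten threads all try to create the same user at once.
results = 10.times.map { Thread.new { create_user(table, 'blah@blah.blah') } }.map(&:value)

puts results.count(:created) # prints 1
puts table.size              # prints 1
```

The validation still earns its place here: it gives a friendly error in the common case, while the index is what actually guarantees uniqueness.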
Consequences of duplications
If User has a lot of associations, different records can end up belonging to different duplicates of the same user. It could be payments, comments, anything. And if you try to create the index, the migration will throw an exception because there are duplicates in your users table. What we need to do is move all associated records to one user and delete the duplicates. Checking how many duplicates you have is quite simple:
User.group(:email).count.select { |_email, count| count > 1 }
# => {"blah@blah.blah"=>10}
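What that query computes is a per-email row count filtered to the emails that occur more than once. As a rough illustration in plain Ruby (duplicate_email_counts is a hypothetical helper working on an array of strings, not on the database):

```ruby
# Returns { email => count } for every email that appears more than once.
def duplicate_email_counts(emails)
  counts = emails.each_with_object(Hash.new(0)) { |email, acc| acc[email] += 1 }
  counts.select { |_email, count| count > 1 }
end

emails = ['blah@blah.blah', 'a@example.com', 'blah@blah.blah', 'blah@blah.blah']
duplicate_email_counts(emails).each do |email, count|
  puts "#{email}: #{count}" # prints blah@blah.blah: 3
end
```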
Merging duplicates
The algorithm is pretty simple:
- create a fresh backup (always create backups :) )
- find each group of users with the same email
- choose one user to keep (let it be the most recently updated)
- find all associated records whose user_id belongs to the group
- replace that user_id with the id of the chosen user
- remove the duplicate users
- run the unique index migration
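The merge steps of that list (everything between the backup and the migration) can be sketched on in-memory data before touching a real database. This is a minimal illustration under stated assumptions: FakeUser, FakeComment and merge_duplicates! are hypothetical names, updated_at is a plain integer, and there is a single associated table.

```ruby
# Hypothetical in-memory stand-ins for the users table and one associated table.
FakeUser    = Struct.new(:id, :email, :updated_at)
FakeComment = Struct.new(:id, :user_id)

# Merges every group of same-email users into its most recently updated member.
# Mutates both arrays in place and returns the ids of the removed duplicates.
def merge_duplicates!(users, comments)
  removed_ids = []

  users.group_by(&:email).each_value do |group|
    next if group.size < 2

    keeper = group.max_by(&:updated_at)
    duplicate_ids = group.map(&:id) - [keeper.id]

    # Point every associated row at the user we keep.
    comments.each do |comment|
      comment.user_id = keeper.id if duplicate_ids.include?(comment.user_id)
    end

    removed_ids.concat(duplicate_ids)
  end

  # Only now remove the duplicate users themselves.
  users.reject! { |user| removed_ids.include?(user.id) }
  removed_ids
end

users = [
  FakeUser.new(1, 'blah@blah.blah', 100),
  FakeUser.new(2, 'blah@blah.blah', 300),
  FakeUser.new(3, 'other@example.com', 200)
]
comments = [FakeComment.new(10, 1), FakeComment.new(11, 2), FakeComment.new(12, 3)]

merge_duplicates!(users, comments)
puts users.map(&:id).inspect         # prints [2, 3]
puts comments.map(&:user_id).inspect # prints [2, 2, 3]
```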
Here is a rake task for has_one and has_many associations:
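namespace :users do
  task :merge_duplicates => :environment do
    # All has_one and has_many associations declared on User.
    # Assumes each of them uses user_id as its foreign key (no :through).
    reflections = [:has_one, :has_many].flat_map do |macro|
      User.reflect_on_all_associations(macro)
    end

    duplicate_emails = User.group(:email).count.select { |_email, count| count > 1 }.keys

    duplicate_emails.each do |email|
      users = User.where(:email => email)
      # Keep the most recently updated user of the group.
      current_user = users.order('updated_at DESC').first

      users.each do |user|
        next if user.id == current_user.id

        reflections.each do |reflection|
          target = user.send(reflection.name)
          next unless target

          if reflection.macro == :has_many
            # Rewrite the foreign key of every associated row in one query.
            target.update_all(:user_id => current_user.id)
          else
            # has_one returns a single record, which has no update_all.
            target.class.where(:id => target.id).update_all(:user_id => current_user.id)
          end
        end
      end

      # Load the duplicates again so destroy does not see stale cached associations.
      User.where(:email => email).where('id != ?', current_user.id).each(&:destroy)
    end
  end
end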
The only trouble you can run into is when your project is under high load: if you fix a user's associations while that user is active, new associated records can be created for a duplicate in the meantime. So it is probably better to first save the user_ids somewhere, delete all duplicates from users, run the unique index migration, and only then fix the associations.
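One way to "save user_ids somewhere" is a map from each duplicate id to the id of the user being kept, built before anything is deleted and applied to the associated tables afterwards. A minimal sketch in plain Ruby, assuming user rows come as [id, email, updated_at] arrays; build_merge_map and remap_user_ids are hypothetical helpers, not part of the rake task above.

```ruby
# Builds { duplicate_user_id => kept_user_id } from rows of [id, email, updated_at].
def build_merge_map(user_rows)
  user_rows.group_by { |_id, email, _updated_at| email }.each_with_object({}) do |(_email, rows), map|
    # The most recently updated row of each group is the one we keep.
    keeper_id = rows.max_by { |_id, _email, updated_at| updated_at }.first
    rows.each { |id, _email, _updated_at| map[id] = keeper_id unless id == keeper_id }
  end
end

# Rewrites user_id values using a previously saved map; unknown ids are left alone.
def remap_user_ids(user_ids, merge_map)
  user_ids.map { |user_id| merge_map.fetch(user_id, user_id) }
end

rows = [[1, 'blah@blah.blah', 100], [2, 'blah@blah.blah', 300], [3, 'other@example.com', 200]]
merge_map = build_merge_map(rows) # saved before any user is deleted

# Later, after the duplicates are gone and the index exists:
puts remap_user_ids([1, 2, 3, 1], merge_map).inspect # prints [2, 2, 3, 2]
```

Because the duplicate users no longer exist by the time the map is applied, no new records can be attached to them while the associations are being fixed.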
Hope that helps. The rule of thumb here: if you create a validation or an association, always create an index.