I have a board of adverts on my website, and there are quite a lot of them (30,000). And of course there are always some spammers who duplicate adverts or change them slightly, so I can't use a uniqueness validation to detect duplicates.

Since text indexes are pretty heavy, the solution that comes to my mind is to use a text similarity algorithm and highlight posts that are very similar to others (at the least, it will make the moderators' work much easier).

So check out the amatch gem, which contains implementations of several similarity algorithms. The link is to my fork of the gem, since the pull request is not merged yet.

Usage is quite simple:

require 'amatch'

"pattern language".jarowinkler_similar("language of patterns")
# => 0.672222222222222
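
To tie this back to moderation, here is a rough sketch of how a similarity score could be used to flag near-duplicate adverts. It is not the code running on my site. The helper names (`levenshtein_distance`, `similarity`, `similar_pairs`) and the 0.8 threshold are made up for illustration, and the score is a plain-Ruby normalized Levenshtein so the snippet runs without any gem. With amatch installed, a call like the `jarowinkler_similar` one above could presumably take the place of `similarity`.

```ruby
# Hypothetical stand-in for a gem-provided score: classic
# dynamic-programming Levenshtein edit distance between two strings.
def levenshtein_distance(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution
      ].min
    end
    previous = current
  end
  previous.last
end

# Score in 0.0..1.0, where 1.0 means the strings are identical.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_distance(a, b).to_f / longest
end

# Returns [index_a, index_b, score] for every pair of texts whose
# score reaches the threshold (0.8 is an arbitrary assumption).
def similar_pairs(texts, threshold: 0.8)
  # Lowercase and squeeze whitespace so trivial edits do not hide a copy.
  normalized = texts.map { |t| t.downcase.gsub(/\s+/, ' ').strip }

  normalized.each_index.to_a.combination(2).map { |i, j|
    score = similarity(normalized[i], normalized[j])
    [i, j, score] if score >= threshold
  }.compact
end

posts = [
  "Selling a red bike, almost new",
  "Selling a  RED bike, almost new!!",
  "Looking for a flat in the city centre"
]

similar_pairs(posts).each do |i, j, score|
  puts "Post #{i} and post #{j} look alike (score #{score.round(3)})"
end
```

Comparing every pair is quadratic, so for 30,000 adverts that would be roughly 450 million comparisons. It is probably more practical to compare only each newly submitted advert against the existing ones, or against recent ones from the same category.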

Performance tests: performance

#ruby #text_similarity #amatch