I really appreciate Matt and Bob showing me significantly better ways to compare strings. Yet another perk of working for this company :)
I've been spending time with grails lately. One of my biggest beefs is 100% accuracy required 100% of the time when typing parameters to the "grails" application. Not a big deal (esp. with emacs M-/)... but what an opportunity to do something useful with cosine similarity!
All code on this page is licensed under the Apache 2 license.
The heart of the code. Please ignore the lack of error handling and optimization.
static def similarity(l_seq, r_seq, degree=2) {
def l_histo = countNgramFrequency(l_seq, degree)
def r_histo = countNgramFrequency(r_seq, degree)
dotProduct(l_histo, r_histo) /
Math.sqrt(dotProduct(l_histo, l_histo) *
dotProduct(r_histo, r_histo))
}
This probably isn't ideal Groovy, but I sure appreciated the syntax improvements to arrays and maps.
static def countNgramFrequency(sequence, degree) {
def histo = [:]
def items = sequence.size()
for (int i = 0; i + degree <= items; i++)
{
def gram = sequence[i..<(i + degree)]
histo[gram] = 1 + histo.get(gram, 0)
}
histo
}
Still nothing specific to strings here...
static def dotProduct(l_histo, r_histo) {
def sum = 0
l_histo.each { key, value ->
sum = sum + l_histo[key] * r_histo.get(key, 0)
}
sum
}
Finally, here's something useful:
static def stringSimilarity (l_str, r_str, degree=2) {
similarity(l_str.toLowerCase().toCharArray(),
r_str.toLowerCase().toCharArray(),
degree)
}
... okay, how to apply this to grails? Just one more bit of code:
static def mostSimilar(pattern, candidates, threshold=0) {
def topScore = 0
def bestFit = null
candidates.each { candidate ->
def score = stringSimilarity(pattern, candidate)
if (score > topScore) {
topScore = score
bestFit = candidate
}
}
if (topScore < threshold)
bestFit = null
bestFit
}
Drop this into GrailsScriptRunner.groovy and now this:
- grails create-domain-class
... is equivalent to all of these:
- grails craete-domain-class
- grails domain
- grails dom
To everyone who read this far, thank you :) And... one other observation. Strings are ordered sequences (of chars). What about file contents (lines) and directory contents (file names)?
static def fileSimilarity (l_file, r_file, degree=3) {
similarity(new File(l_file).readLines(),
new File(r_file).readLines(),
degree)
}
static def dirSimilarity(l_dir, r_dir, degree=3) {
def l_files = []
def r_files = []
new File(l_dir).eachFileRecurse { l_files << it.name }
new File(r_dir).eachFileRecurse { r_files << it.name }
similarity(l_files, r_files, degree)
}
edit 2007-12-05 -- size() method works for arrays and lists; easier than length and size properties. Contribution much appreciated!
-- seth.schroeder _at_ nearinfinity.com
0 TrackBacks
Listed below are links to blogs that reference this entry: Groovy cosine similarity in Grails.
TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/441



Leave a comment