Groovy cosine similarity in Grails

| | Comments (0) | TrackBacks (0)

I really appreciate Matt and Bob showing me significantly better ways to compare strings. Yet another perk of working for this company :)

I've been spending time with grails lately. One of my biggest beefs is 100% accuracy required 100% of the time when typing parameters to the "grails" application. Not a big deal (esp. with emacs M-/)... but what an opportunity to do something useful with cosine similarity!

All code on this page is licensed under the Apache 2 license.

The heart of the code. Please ignore the lack of error handling and optimization.

    static def similarity(l_seq, r_seq, degree=2) {
        def l_histo = countNgramFrequency(l_seq, degree)
        def r_histo = countNgramFrequency(r_seq, degree)
        
        dotProduct(l_histo, r_histo) /
            Math.sqrt(dotProduct(l_histo, l_histo) *
                      dotProduct(r_histo, r_histo))
    }

This probably isn't ideal Groovy, but I sure appreciated the syntax improvements to arrays and maps.

    static def countNgramFrequency(sequence, degree) {
        def histo = [:]
        def items = sequence.size()

        for (int i = 0; i + degree <= items; i++)
        {
            def gram = sequence[i..<(i + degree)]
            histo[gram] = 1 + histo.get(gram, 0)
        }
        histo
    }

Still nothing specific to strings here...

    static def dotProduct(l_histo, r_histo) {
        def sum = 0
        l_histo.each { key, value ->
            sum = sum + l_histo[key] * r_histo.get(key, 0)
        }
        sum
    }

Finally, here's something useful:

    static def stringSimilarity (l_str, r_str, degree=2) {
        similarity(l_str.toLowerCase().toCharArray(),
                   r_str.toLowerCase().toCharArray(),
                   degree)
    }

... okay, how to apply this to grails? Just one more bit of code:

    static def mostSimilar(pattern, candidates, threshold=0) {
        def topScore = 0
        def bestFit = null

        candidates.each { candidate ->
            def score = stringSimilarity(pattern, candidate)
            if (score > topScore) {
                topScore = score
                bestFit = candidate
            }
        }

        if (topScore < threshold)
            bestFit = null

        bestFit
    }

Drop this into GrailsScriptRunner.groovy and now this:

  • grails create-domain-class

... is equivalent to all of these:

  • grails craete-domain-class
  • grails domain
  • grails dom

To everyone who read this far, thank you :) And... one other observation. Strings are ordered sequences (of chars). What about file contents (lines) and directory contents (file names)?

    static def fileSimilarity (l_file, r_file, degree=3) {
        similarity(new File(l_file).readLines(),
                   new File(r_file).readLines(),
                   degree)
    }

    static def dirSimilarity(l_dir, r_dir, degree=3) {
        def l_files = []
        def r_files = []

        new File(l_dir).eachFileRecurse { l_files << it.name }
        new File(r_dir).eachFileRecurse { r_files << it.name }

        similarity(l_files, r_files, degree)
    }

edit 2007-12-05 -- size() method works for arrays and lists; easier than length and size properties. Contribution much appreciated!

-- seth.schroeder _at_ nearinfinity.com

Leave a comment


Type the characters you see in the picture above.

0 TrackBacks

Listed below are links to blogs that reference this entry: Groovy cosine similarity in Grails.

TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/441