Recently by Seth Schroeder

A whirlwind tour of Bloom filters

| | Comments (0) | TrackBacks (0)

What's a Bloom filter?

A Bloom filter is a memory and cpu efficient way to remember which things have been seen and which haven't. They are more familiar than programmers realize; the key set of a hash map is loosely a simple Bloom filter. I haven't had a need for this level of efficiency, but it's a neat tool to keep in mind.

A hash map works by applying a hash function to data, and associating the output with the input data. Considering a simple hash map which stores only values, and not buckets of values:

Traits of a hash map key set

  1. After an item has been stored, its existance will be known. It will never go missing. There is no chance of saying an item has not been stored after it has been stored.
  2. After an item has been stored, it may be identified as a different item. This happens when the hash function generates the same value for distinct data. So there is a risk of falsely confirming item B has been stored after only storing item A.
  3. The space required to hold these keys will be the product of the key size and key count. While key size is probably fixed, key count is variable and will lead to more space needed as more items are added.

Introducing Bloom filters

A Bloom filter shares traits 1 and 2 of a hash map key set, but not trait 3. The space required by a Bloom filter does not increase as more items are stored. Rather than processing data with one hash function and using the result as a lookup value, a Bloom filter uses multiple hash functions to identify booleans in a fixed size array.

Bloom filter traits

  1. Same as hash map key set trait 1
  2. Same as hash map key set trait 2
  3. Storage space is small, but depends on the desired false positive rate
  4. Lookup time depends on number of hash functions; can be run in parallel
  5. Items cannot be removed
These traits are thoroughly discussed by the articles in the reference section. They are required reading for serious implementations of a Bloom filter. That's outside the scope of this blog post -- so onto a trivial implementation:

Code

This tacky, easily improvable code was released to the public domain and promises nothing.
 0: class MediocreBloomFilter
 1:   M = 16                # number of boolean fields in filter
 2:   K = 3                 # number of hash functions to find boolean fields
 3: 
 4:   attr_accessor :v      # the filter, a vector of boolean fields
 5: 
 6:   def initialize()
 7:     @v = Array.new(M, 0)
 8:   end
 9: 
10:   def track_item(item)
11:     indicies_for?(item).each do |i|
12:       @v[i] = 1
13:     end
14:   end
15: 
16:   def item_tracked?(item)
17:     hits = 0
18:     indicies_for?(item).each do |i|
19:       hits = hits + @v[i]
20:     end
21:     hits == K
22:   end
23: 
24:   def indicies_for?(item)
25:     indicies = []
26:     md5 = Digest::MD5.new
27:     hash = ""
28: 
29:     md5.update(item.to_s)
30:     hash = md5.hexdigest
31: 
32:     K.times do |k|
33:       indicies << HEX.index(hash[k].chr.to_s.upcase)
34:     end
35: 
36:     indicies
37:   end
38: 
39:   HEX = [ "0", "1", "2", "3", "4", "5", "6", "7",
40:           "8", "9", "A", "B", "C", "D", "E", "F" ]
41: end
      
  • Line 7 sets the filter storage to 16 integers. It could have been two bytes if I wanted to do bitwise operations.
  • Line 24 runs the hash functions against the target data. Since this is tacky sample code only one hash function is executed (line 30), and multiple slices are taken from the value and converted into array indicies (at line 33).
  • Using integers rather than booleans has a nice side effect. It permits removal of items by using the integers as reference counts. I haven't tested it, but minor changes should be needed around lines 12 and 19, and a removal function to decrement the counts.
So that's all well and good, but let's take a quick look at the filter as it goes through some testing. Here is the filter immediately after creation: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] Now let's add the string "abc" and inspect things:
  • item: "abc"
  • md5: 900150983cd24fb0d6963f7d28e17f72
  • indicies: [9, 0, 0]
  • filter: [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Adding "abd"...
  • item: "abd"
  • md5: 4911e516e5aa21d327512e0c8b197616
  • indicies: [4, 9, 1]
  • filter: [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Finally, adding my name:
  • item: "seth schroeder"
  • md5: 2c38f5fd13873c139567551ca1b7496e
  • indicies: [2, 12, 3]
  • filter: [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
This code is obviously weak, but enough to demonstrate the approach. One thing that stands out is how many hash functions should be used? False positives will be high when too few are used, and when too many are used. Fortunately smarter people have thought this out -- that's my final plea to please peruse the references.

References

  1. Using Bloom Filters
  2. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
  3. Burton Bloom, Space/time trade-offs in hash coding with allowable errors, CACM, 13(7):422-426, July 1970.
  4. Paul E. Black, "Bloom filter", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 16 May 2008. Available from: http://www.nist.gov/dads/HTML/bloomFilter.html

P.S.: If you've read this far THANK YOU. This entry has also been posted to my personal blog.

Problem 10 of Project Euler is to sum the prime numbers less than two million. Solutions are supposed to take no more than one minute, so a decent prime number detector is mandatory. I'm sure lots of free, high quality solutions are available, but I have this problem with DIY / NIH syndrome. Somehow I found the sieve of Eratosthenes and a quick, tacky swipe at it was good enough.

This process is to make a list of numbers from 2 through some limit. Start with the lowest number in the list, and remove all multiples from the list. Next time through, pick the next smallest number that hasn't been crossed off yet and eliminate its multiples. Stop when your current number meets the square root of the limit. After that, the remaining numbers are prime:

def eratosthenes_sieve(ceiling)
  vals = (0..ceiling).to_a
  vals[0..1] = [nil, nil]

  Math.sqrt(ceiling).floor.times do |val|
    next unless vals[val]
    
    ((val ** 2)..ceiling).step(val) do |mult|
      vals[mult] = nil
    end
  end
  
  vals
end

A big hash like this is useful for O(1) checking for prime/not prime, but the solution requires the sum of the primes:

def primes_through(ceiling)
  eratosthenes_sieve(ceiling).compact
end

sum = 0
primes_through(ARGV[0].to_i).each { |val| sum = sum + val }
puts sum

I'm sure there's a more Ruby-esque way to write that. I wanted to use Python for xrange goodness, but I'm still not at peace with its quirks. Groovy would have been nice, but the free editor support is still quite young. Ruby fit right in the gap.

Updated 2008-05-02: new Python implementation using sum() -- thanks Pete!

My blog post about the first problem in Project Euler went strange places. I was just trying to learn Python by solving a barely non-trivial problem. Maybe it was a full moon, not sure, but it wound up in a tangle of gcc, groovyc, and .elc. Happily, some good performance improvements were suggested and I've tried to incorporate them below.  

This is a really long post. Here's the skeleton:

  1. Runtime environment
  2. Results
  3. Code (alphabetically by language)
    1. C
    2. Groovy
    3. Lisp
    4. Perl
    5. Python
    6. Ruby
  4. Big O(rly)?
  5. Wrap up
  6. Extras

 

Runtime environment

  • Machine: Powerbook Pro, 10.5.2, 2x 2.16 GHz Core 2 Duo
  • Userspace programs: Finder, Terminal
  • Battery conservation mode: Better Performance (plugged in)
  • C: GCC 4.0.1 (Apple Inc. build 5465)
  • Groovy: 1.5.5 JVM: 1.5.0_13-119
  • Lisp: GNU Emacs 22.1.1
  • Perl: v5.8.8, built for darwin-thread-multi-2level
  • Python: 2.5.1
  • Ruby: 1.8.6 (2007-09-24 patchlevel 111) [universal-darwin9.0]
(top)  

Results

No cheesy graphs this time, just better organized numbers. Each row has one implementation of the problem. The 10**x columns show how many milliseconds it took the implementation to sum the multiples of three and five less than that number. Each implementation was run seven times for each column, and the average runtime is shown below. A test harness was written to execute the scripts automatically; I disabled the screensaver, started the tests, and went to bed.

Code 10**1 10**2 10**3 10**4 10**5 10**6 10**7 Comments
first.c55551165648
first.groovy928929106516432155799852732
second.groovy9201004118913531806324319620
third.groovy91488093611811421254512835
first.pl881026192188618845
second.pl88811363783062
first.py22222335160150614939
second.py147222124534224242
third.py22222224514054081
fourth.py21222122261561580Credit goes to Pete below
first.rb98920134142014288
second.rb98917100107510818
first.el2014110^1...6 not available
second.el8172Actually first.elc
instant.rb8888888

 

Code

 

C

#include <stdio.h>
#include <math.h>

#define multiple_of(a, b) ((a == 0 || b == 0) ? 0 : a % b == 0)

int main(int argc, char** argv)
{
    int exp = atoi(argv[1]);
    double sum = 0;
    int i;

    for (i = 3; i < pow(10, exp); i++)
        if (multiple_of(i, 3) || multiple_of(i, 5))
            sum += i;

    printf("%0.0f\n", sum);

    return 0;
}

 

Groovy

 
// first.groovy
exp = Integer.parseInt(args[0])

mof = { val, test -> val && test && test % val == 0 }
mof3 = mof.curry(3)
mof5 = mof.curry(5)

sum = 0 as BigDecimal

(3..<(10**exp)).each {
    if (mof3(it) || mof5(it)) {
        sum += it
    }
}

println sum

 
// second.groovy
def exp = Integer.parseInt(args[0])

def mof = { val, test -> val && test && test % val == 0 }
def mof3 = mof.curry(3)
def mof5 = mof.curry(5)

def sum = 0 as BigInteger

(3..<(10**exp)).each {
    if (mof3(it) || mof5(it)) {
        sum += it
    }
}

println sum

 
// third.groovy
def exp = Integer.parseInt(args[0])

def sum = 0 as BigInteger

for (int i = 3; i < 10 ** exp; i++)
    if (i % 3 == 0 || i % 5 == 0)
        sum += i

println sum
 

Lisp

; first.el
(defun multiple_of (a b)
  (= 0 (% a b))
)

(defun val-to-add (val)
  (cond
   ((multiple_of val 3) val)
   ((multiple_of val 5) val)
   (0)
   )
  )

(defun sum-of-3or5-multiples (exp)
  (let ((num 3) (sum 0.0))
    (while (< num (expt 10 exp))
      (setq sum (+ sum (val-to-add num)))
      (setq num (1+ num))
      )
    sum
    )
  )

(sum-of-3or5-multiples 7)
 

Perl

 
#!/usr/bin/env perl -w
# first.pl
use strict;

my $exp = $ARGV[0];
my $sum = 0;

sub multiple_of {
    my ($val, $tgt) = @_;
    return $val % $tgt == 0;
}

for my $i (3 ... 10 ** $exp - 1) {
    if (multiple_of($i, 3) || multiple_of($i, 5)) {
        $sum += $i
    }
}

print "$sum\n"

 
#!/usr/bin/env perl -w
# second.pl

use strict;

my $exp = $ARGV[0];
my $sum = 0;

for my $i (3 ... 10 ** $exp - 1) {
    if ($i % 3 == 0 || $i % 5 == 0) {
        $sum += $i
    }
}

print "$sum\n"
 

Python

 
# first.py
import sys

def multiple_checker(val):
    return lambda test: test and val and test % val == 0

mof3 = multiple_checker(3)
mof5 = multiple_checker(5)

sum = 0

for i in range(10 ** int(sys.argv[1])):
    if mof3(i) or mof5(i):
        sum = sum + i

print sum

 
# second.py
import sys

exp = int(sys.argv[1])
sum = 0

for mof3or5 in set(range(3, 10**exp, 3)) | set(range(5, 10**exp, 5)):
    sum = sum + mof3or5

print sum

 
# third.py
import sys

exp = int(sys.argv[1])
sum = 0

for mof3or5 in set(xrange(3, 10**exp, 3)) | set(xrange(5, 10**exp, 5)):
    sum = sum + mof3or5

print sum

 
# fourth.py
import sys

exp = int(sys.argv[1])
total = sum(xrange(3, 10**exp, 3)) + sum(xrange(5, 10**exp, 5)) - sum(xrange(15, 10**exp, 15))

print total
 

Ruby

 
# first.rb
exp = ARGV[0].to_i
sum = 0

def is_mult(large, small)
  large % small == 0
end

for i in 3..(10 ** exp - 1)
  next unless is_mult(i, 3) or is_mult(i, 5)
  sum = sum + i
end

puts sum

 
# second.rb
exp = ARGV[0].to_i
sum = 0

for i in 3..(10 ** exp - 1)
  next unless i % 3 == 0 or i % 5 == 0
  sum = sum + i
end

puts sum
(top)

 

Big O(rly)?

Well... sigh... all of the code above works correctly but is slow and naïve. All of the approaches above take more time as the largest multiple increases. I was ignorantly, happily tweaking code down this road until fortunately Bob clued me in. The best approach takes the same amount of time as the largest multiple increases. Why? There's a simple equation for this, calculating an arithmetic series.  

# instant.rb
ceil = 10 ** ARGV[0].to_i - 1

def series_of_multiples(val, ceiling)
  diff = val
  vals = ceiling / val
  last_val = val + diff * (vals - 1)
  (val + last_val) / 2.0 * vals
end

def som(v, c) series_of_multiples(v, c).to_i end

puts som(3, ceil) + som(5, ceil) - som(15, ceil)
 

Wrap up

  • I got a lot more out of this than a quick glance at Python. It was good to calculate a series or two, but much better to build a little test harness.
  • Groovy performance tested out well here. I would avoid using some of its features in tight loops. I'm confident that Monsieur LaForge et son amis will optimize that concern away eventually.
  • I thought I could love it, I really wanted to love it, it's loved by many for many good reasons, but Python started to chafe me. It's so hard to nail down exactly why... it's chock full of goodness, but just seems to need more Kool-Aid than Groovy and especially Ruby. Perl's Kool-Aid threshold is a well-known constant, at least until the Parrot changes things drastically.
  • Cosine similarity + arithmetic series = it's good to work with a friendly, helpful PhD. in math. Thanks Bob :-)

Extras

The Python test harness. Probably needs tweaking for Wintel boxes.

import sys
import re
from datetime import timedelta, datetime
from os import system

#
# prefix used to run scripts. Trailing whitespace matters!
#
vms = { 
        'el':           'emacs --script ',
        'elc':          'emacs --script ',
        'groovy':       'groovy ',
        'out':          './',
        'pl':           'perl ',
        'py':           'python ',
        'rb':           'ruby '
        }

# Configurable parameters
# 
DEFAULT_MIN = 1
DEFAULT_MAX = 7
min = DEFAULT_MIN       # min "max" val = 10 ** min
max = DEFAULT_MAX       # max "max" val = 10 ** max
reps = 7                # times to repeat each test, then average the results


# Helper method
def td_to_ms(delta):
    val = delta.microseconds / 1000.0
    val = val + delta.seconds * 1000
    val = val + delta.days * 24 * 60 * 60 * 1000
    return val

#
# "main"
#
for script in sys.argv[1:]:
    ext = script.split('.').pop()
    if not ext in vms:
        print 'ignoring', script, 'unrecognized extension'
        continue

    vm = vms[ext]

    # the lisp files hardcode the max values :(
    if re.search('^el', ext):
        max = min
    else: # reset to defaults, needed when running multiple types of tests
        max = DEFAULT_MAX
        min = DEFAULT_MIN

    # gotta like list comprehensions
    runs = [0 for x in range(max - min + 1)]

    for rep in xrange(reps):
        for exp in xrange(min, max + 1):
            start = datetime.now()
            system('%s%s %s >/dev/null' % (vm, script, exp))
            end = datetime.now()
            runs[exp - min] = runs[exp - min] + td_to_ms(end - start)

    for i in xrange(len(runs)):
        runs[i] = int(runs[i] / reps)

    print script.split('/').pop(), runs

    # see, isn't it strange how python leaves you hanging at the end
    # of the last scope in the file?

Nice, Python.

| | Comments (4) | TrackBacks (0)

Edit 2008-04-29: more and better results available here.

I've been waffling about glancing at Python. It comes off as a little ... dry? Indentation drives scope? I was afraid it would be another Objective-C; popular, modeled after successful languages, but just gets stuck in my mental parser. Even Visual Basic and Pascal (q.v. Inside Macintosh) passed that gate.

Anyways, Project Euler seemed like a good way to learn a new language. Problem #1 is summing multiples of 3 or 5 less than one thousand. Some basic iteration handles the task pretty easily -- might as well be all trendy and mix in lambda functions. And, while the hood is open, why not try it in a couple of languages and generate an unscientific performance graph?

So back to the implementation -- the iterative approach is easy. But Python 2.5-ish natively supports set operations like union and intersection. Combine that with a built-in method to generate numeric sequences and the problem is basically solved. For the win -- it outperformed the iterative approach! Oh, and every language tested except Python showed negative numbers when the sum scrolled past 2**31.

Groovy, Python, C, and Emacs Lisp competed for best performance. And by competed, I mean quick and dirty code (esp. Lisp) run a few times on my MacBook Pro. Graphics first, then the code.


perf graph


perf stats

The lisp numbers are the sketchiest -- (current-time) only spit out seconds.

Basic info: Groovy 1.5.5 binary release, Python 2.5.1 shipped with Leopard, GCC 4.0.1, Carbon Emacs Spring 2008


Python, iterative

import sys

def multiple_checker(val):
    return lambda test: test and val and test % val == 0

mof3 = multiple_checker(3)
mof5 = multiple_checker(5)
sum = 0

for i in range(10 ** int(sys.argv[1])):
    if mof3(i) or mof5(i):
        sum = sum + i

print sum

Python, set oriented
import sys

exp = int(sys.argv[1])
sum = 0

for mof3or5 in set(range(3, 10**exp, 3)) | set(range(5, 10**exp, 5)):
    sum = sum + mof3or5

print sum

Groovy, iterative
exp = Integer.parseInt(args[0])

mof = { val, test -> val && test && test % val == 0 }
mof3 = mof.curry(3)
mof5 = mof.curry(5)

sum = 0 as BigDecimal

(3..<(10**exp)).each {
    if (mof3(it) || mof5(it)) {
        sum += it
    }
}

println sum

Emacs Lisp, iterative (be nice, I'm really weak here)
(defun multiple_of (a b)
  (= 0 (% a b))
)

(defun val-to-add (val)
  (cond
   ((multiple_of val 3) val)
   ((multiple_of val 5) val)
   (0)
   )
  )

(defun sum-of-3or5-multiples (exp)
  (let ((num 3) (sum 0.0))
    (while (< num (expt 10 exp))
      (setq sum (+ sum (val-to-add num)))
      (setq num (1+ num))
      )
    sum
    )
  )

C, iterative. Not very much code here, would have probably taken more lines of Java!
#include <stdio.h>
#include <math.h>

#define multiple_of(a, b) ((a == 0 || b == 0) ? 0 : a % b == 0)

int main(int argc, char** argv)
{
    int exp = atoi(argv[1]);
    double sum = 0;
    int i;

    for (i = 3; i < pow(10, exp); i++)
        if (multiple_of(i, 3) || multiple_of(i, 5))
            sum += i;

    printf("%0.0f\n", sum);

    return 0;
}

A Grails plugin for fuzzy string matching

| | Comments (0) | TrackBacks (0)
Long ago at a company which shall not be named, I made the mistake of shunning contemporary technology. Never again! Since Near Infinity honors its training commitment I was able to attend and enjoy the Groovy / Grails NFJS conference.

Apache Commons Codec implemented soundex, double metaphone, base64, and more. in the spirit of Grails why not stitch them into java.lang.String?

import org.apache.commons.codec.language.*
import org.apache.commons.codec.net.*

class FuzzstrGrailsPlugin {
    def version = 0.1
    def dependsOn = [core:grails.util.GrailsUtil.getGrailsVersion()]
	
    def doWithDynamicMethods = {
        def encodingClosure = { it.encode(delegate) }
        def decodingClosure = { it.decode(delegate) }

        String.metaClass.toSoundex = encodingClosure.curry(new Soundex())
        String.metaClass.toRefinedSoundex = encodingClosure.curry(new RefinedSoundex())
        String.metaClass.toMetaphone = encodingClosure.curry(new Metaphone())
        String.metaClass.toDoubleMetaphone = encodingClosure.curry(new DoubleMetaphone())

        String.metaClass.toBase64 = encodingClosure.curry(new BCodec())
        String.metaClass.fromBase64 = decodingClosure.curry(new BCodec())

        String.metaClass.toQPrintable = encodingClosure.curry(new QCodec())
        String.metaClass.fromQPrintable = decodingClosure.curry(new QCodec())

        String.metaClass.toQuotedPrintable = encodingClosure.curry(new QuotedPrintableCodec())
        String.metaClass.fromQuotedPrintable = decodingClosure.curry(new QuotedPrintableCodec())

        String.metaClass.toUrlEncoded = encodingClosure.curry(new URLCodec())
        String.metaClass.fromUrlEncoded = decodingClosure.curry(new URLCodec())

        String.metaClass.similarTo = { Cosine.stringSimilarity(delegate, it) }
        String.metaClass.mostSimilarTo = { Cosine.mostSimilar(delegate, it) }
        String.metaClass.rankedSimilarity = { Cosine.horseShoes(delegate, it) }
    }
}

So what? Well, see for yourself:

Loading with installed plug-ins: ["fuzzstr"] ...
Groovy Shell (1.5.4, JVM: 1.5.0_13-119)
Type 'help' or '\h' for help.
-------------------------------------------------------------------------------
groovy:000> "literal".toSoundex()
===> L364
groovy:000> "literal".toDoubleMetaphone()
===> LTRL
groovy:000> "literal".toBase64().fromBase64()
===> literal
groovy:000> "literal".similarTo("litteral")
===> 93
groovy:000> "litteral".mostSimilarTo(['literature', 'litter', 'literal']) 
===> literal

The plugin's not ready to be released yet but hopefully within a few weeks; localization and caching need some love. Pondering whether GORM might benefit from some of these methods.

seth _dot_ schroeder _at_ nearinfinity _dot_ com

I believe that most software tests favor code over data. It might be an issue of tools; code testing tools are easy to use, have aggressive ide and framework support, and produce coverage metrics. Maybe better data testing tools would increase data coverage. Here's some code to demonstrate one need when testing lots of data: catch wrong results and absent results.

One approach to testing data involves these parts:

  1. Define the input data
  2. Define the expected output data
  3. Implement code to process the input data
  4. Generate the actual output data by evaluating #1 with #3
  5. Compare the results of #4 with #2

Easy as pie, right?

I think #5 is more interesting than it seems. The challenge is to report every difference between the expected output and the actual output. Detecting wrong results is easy. Detecting absent results is harder.

Blue results were found, but not expected. Yellow results were expected, but not found. Green results were found and expected. I expected this to work out easily as differences and intersections of sets, but hit a little snag in the tool implementation.

About this huge chunk of code:

  • Lines 6-8 define the input and expected output. The values are stored in a map, where the keys are the input and the values the output.
  • Thank Groovy! It makes passing code around so easy. Line 12 defines the test logic inline.
  • test1 demonstrates all good, no bad results
  • test2 demonstrates wrong results
  • test3 demonstrates absent results
  • licensed with the Apache 2 license

And putting the cart before the horse, here are the results of the code:

--Output from test1--
--Output from test2--
a: wanted A, got B
--Output from test3--
caught exception java.lang.Exception: c!?!?
c was ignored!

Enough, here's the code:

 1: class TestStuff extends GroovyTestCase {
 2:
 3:     def inputOutput = [:]
 4:  
 5:     void setUp() {
 6:         inputOutput['a'] = 'A'
 7:         inputOutput['b'] = 'B'
 8:         inputOutput['c'] = 'C'
 9:     }
10:    
11:     void test1() {
12:         assert verifyAll(inputOutput, { it.toUpperCase() } )
13:     }
14: 
15:     void test2() {
16:         assert verifyAll(inputOutput, {
17:             if (it == 'a') ++it
18:             it.toUpperCase()
19:         })
20:     }
21: 
22:     void test3() {
23:         assert verifyAll(inputOutput, {
24:             if (it == 'c') throw new Exception("c!?!?")
25:             it.toUpperCase()
26:         })
27:     }
28: 
29:     def verifyAll(expected, method) {
30:         def actual = runTests(expected, method)
31:         def missed = diffSets(expected.keySet(), actual.keySet())
32:         def unexpected = diffMaps(actual, expected)
33: 
34:         missed.each { println("${it} was ignored!") }
35: 
36:         unexpected.each { key, val ->
37:             println("${key}: wanted ${expected[key]}, got ${val}")
38:         }
39: 
40:         return actual == expected
41:     }
42: 
43:     def runTests(control, method) {
44:         def test = [:]
45: 
46:         control.each { input, value ->
47:             try { test[input] = method(input) }
48:             catch (e) { println ("caught exception ${e}") }
49:         }
50:             
51:         return test
52:     }
53: 
54:     def diffSets(a, b) {
55:         def diff = []
56:         a.each { if (!b.contains(it)) diff << it }
57:         return diff
58:     }
59: 
60:     def diffMaps(a, b) {
61:         def diff = [:]
62:         a.each { key, val -> if (a[key] != b[key]) diff[key] = val }
63:         return diff
64:     }

Wait wait wait -- real programs are built with classes!

// ... imagine Ahhbject.groovy (say it with a Boston accent :)
class Ahhbject {
    Object val
    String toString() { "${this.class.name}@${hashCode()} ${val}" }
}

65:     void test4() {
66:         inputOutput.a = new Ahhbject(val:'A')
67:         inputOutput.b = new Ahhbject(val:'B')
68:         inputOutput.c = new Ahhbject(val:'C')
69: 
70:         assert verifyAll(inputOutput, {
71:             new Ahhbject(val: it.toUpperCase())
72:         })
73:     }

--Output from test4--
a: wanted Ahhbject@3260319 A, got Ahhbject@1806345 A
b: wanted Ahhbject@9259311 B, got Ahhbject@12565457 B
c: wanted Ahhbject@12823094 C, got Ahhbject@14416801 C
    

Curses! It doesn't just work. The cause is one of my pet peeves -- a class which can't identify distinct but equivalent instances. Grails makes this easy to fix by bundling the Apache commons lang library. HashCodeBuilder and EqualsBuilder make quick work of implementing hashCode and equals:

    int hashCode() { HashCodeBuilder.reflectionHashCode(this) }
    boolean equals(Object obj) { EqualsBuilder.reflectionEquals(this, obj) }

Add those two lines and ta-da! Fit for data testing!

Groovy cosine similarity in Grails

| | Comments (0) | TrackBacks (0)

I really appreciate Matt and Bob showing me significantly better ways to compare strings. Yet another perk of working for this company :)

I've been spending time with grails lately. One of my biggest beefs is 100% accuracy required 100% of the time when typing parameters to the "grails" application. Not a big deal (esp. with emacs M-/)... but what an opportunity to do something useful with cosine similarity!

All code on this page is licensed under the Apache 2 license.

The heart of the code. Please ignore the lack of error handling and optimization.

    static def similarity(l_seq, r_seq, degree=2) {
        def l_histo = countNgramFrequency(l_seq, degree)
        def r_histo = countNgramFrequency(r_seq, degree)
        
        dotProduct(l_histo, r_histo) /
            Math.sqrt(dotProduct(l_histo, l_histo) *
                      dotProduct(r_histo, r_histo))
    }

This probably isn't ideal Groovy, but I sure appreciated the syntax improvements to arrays and maps.

    static def countNgramFrequency(sequence, degree) {
        def histo = [:]
        def items = sequence.size()

        for (int i = 0; i + degree <= items; i++)
        {
            def gram = sequence[i..<(i + degree)]
            histo[gram] = 1 + histo.get(gram, 0)
        }
        histo
    }

Still nothing specific to strings here...

    static def dotProduct(l_histo, r_histo) {
        def sum = 0
        l_histo.each { key, value ->
            sum = sum + l_histo[key] * r_histo.get(key, 0)
        }
        sum
    }

Finally, here's something useful:

    static def stringSimilarity (l_str, r_str, degree=2) {
        similarity(l_str.toLowerCase().toCharArray(),
                   r_str.toLowerCase().toCharArray(),
                   degree)
    }

... okay, how to apply this to grails? Just one more bit of code:

    static def mostSimilar(pattern, candidates, threshold=0) {
        def topScore = 0
        def bestFit = null

        candidates.each { candidate ->
            def score = stringSimilarity(pattern, candidate)
            if (score > topScore) {
                topScore = score
                bestFit = candidate
            }
        }

        if (topScore < threshold)
            bestFit = null

        bestFit
    }

Drop this into GrailsScriptRunner.groovy and now this:

  • grails create-domain-class

... is equivalent to all of these:

  • grails craete-domain-class
  • grails domain
  • grails dom

To everyone who read this far, thank you :) And... one other observation. Strings are ordered sequences (of chars). What about file contents (lines) and directory contents (file names)?

    static def fileSimilarity (l_file, r_file, degree=3) {
        similarity(new File(l_file).readLines(),
                   new File(r_file).readLines(),
                   degree)
    }

    static def dirSimilarity(l_dir, r_dir, degree=3) {
        def l_files = []
        def r_files = []

        new File(l_dir).eachFileRecurse { l_files << it.name }
        new File(r_dir).eachFileRecurse { r_files << it.name }

        similarity(l_files, r_files, degree)
    }

edit 2007-12-05 -- size() method works for arrays and lists; easier than length and size properties. Contribution much appreciated!

-- seth.schroeder _at_ nearinfinity.com

Earlier I looked at soundex as a more robust way of comparing strings. Soundex was more robust than case & space normalizing... but it missed somethings I wanted and found some things I didn't want. Here's an example based on a rather limited wine rack:

CREATE TABLE winerack (
      wine_type TEXT
    , ready DATE
    , soundex_val TEXT
);

INSERT INTO winerack (wine_type, ready, soundex_val)
    VALUES ('cabernet sauvingon', now() + interval '5 years', 'C165');

INSERT INTO winerack (wine_type, ready, soundex_val)
    VALUES ('cabermet sauvingon', now() + interval '6 years', 'C165');

INSERT INTO winerack (wine_type, ready, soundex_val)
    VALUES ('caberet sauvingon', now() + interval '7 years', 'C163');

INSERT INTO winerack (wine_type, ready, soundex_val)
    VALUES ('cabernet franc', now() + interval '8 years', 'C165');

Now I'd like to find all of the Cabernets Sauvignon. Using a literal match one bottle shows up. Three bottles show up using soundex... but not the right three bottles:

bigrams=# SELECT * FROM winerack WHERE soundex_val = 'C165';
     wine_type      |   ready    | soundex_val 
--------------------+------------+-------------
 cabernet sauvingon | 2012-10-08 | C165
 cabermet sauvingon | 2013-10-08 | C165
 cabernet franc     | 2015-10-08 | C165
(3 rows)

What happened? Well, soundex picked up the "cabermet" which was good. But it also picked up the Cabernet Franc which was bad. Also, it missed the "cabaret sauvingon".

Cabernet Franc was picked up because a soundex value contains at most four consonants. 'C', 'b', 'r', 'n' in this case. Nothing else matters afterwards... shudder to imagine "cabernet sausage" showing up in the search results!

Caberet Sauvignon was ignored because soundex doesn't compare strings. It's almost like a really primitive string hashing function. String in, string out. Easy to use though.

I approached Rob with this and he suggested I give character-level bigrams a try. Basically this finds how similar words are based on how often pairs of characters occur. Rob is being very patient with me as I work through the gory details, but this is my general understanding:

similarity(left_str, right_str)
    left_sum = dot_product(left_str, left_str);
    right_sum = dot_product(right_str, right_str);
    pair_sum = dot_product(left_str, right_str);
    return pair_sum / square_root(left_sum * right_sum);

The moment of truth -- did it outperform soundex?

bigrams=# select * from bigram_similarities;
      term_a       |      term_b       |    similarity     
-------------------+-------------------+-------------------
 cabernetsauvignon | cabernetsauvignon |                 1
 cabernetsauvignon | cabermetsauvignon |             0.875
 cabernetsauvignon | caberetsauvignon  | 0.903696114115064
 cabernetsauvignon | cabernetfranc     | 0.505181485540923

In this contrived case it worked out better. No clue further than that. It is easy to use like soundex:

bigrams=# select bigram_similarity('cabernet sauvignon', 'cabernet franc');
 bigram_similarity 
-------------------
 0.505181485540923

The infrastructure behind this is pretty gory. The code's not really polished and I have some concerns about performance. But here's a really neat way to generate a dot product using SQL:

CREATE TABLE bigrams ( term TEXT, bigram CHAR(2), instances INT );

SELECT
    SUM(lhs.instances * rhs.instances) AS dot_product
FROM
    bigrams lhs, bigrams rhs
WHERE
    lhs.bigram = rhs.bigram
        AND
    lhs.term = 'cabernet sauvignon'
        AND
    rhs.term = 'cabernet franc'

Views keep your SQL queries DRY

| | Comments (0) | TrackBacks (0)

A SQL view is a SELECT statement stored in the database which can be used like a table. So... what? Well, some of the benefits include:

  1. Syntax verified before deployment -- avoids errors from queries built at runtime.
  2. The RDBMS knows to expect it and could prepare an execution plan.
  3. Force users to use a view which excludes sensitive table data.
  4. Centralize important, common logic.
Number 4 is the strongest argument. It is the Don't Repeat Yourself argument for data. Duplicate code stinks! The fix is defining and calling "a single, unambiguous, authoritative representation within a system" (src). So why not treat data that well? Here's a silly example (in PostgreSQLish):

CREATE TABLE pantry (food TEXT, course TEXT, expiry DATE);
INSERT INTO pantry (food, course, expiry)
    VALUES ('crumpets', 'breakfast', now() + interval '2 weeks');
INSERT INTO pantry (food, course, expiry)
    VALUES ('canned ravioli', 'lunch', now() + interval '2 years');
INSERT INTO pantry (food, course, expiry)
    VALUES ('instant noodles', 'dinner', now() + interval '2 years');
INSERT INTO pantry (food, course, expiry)
    VALUES ('fruitcake', 'snack', now() + interval '2000 years');

Imagine these meal planning queries:

  • SELECT food FROM pantry WHERE course = 'breakfast' AND expiry > now();
  • SELECT food FROM pantry WHERE course = 'lunch' AND expiry > now();
  • SELECT food FROM pantry WHERE course = 'dinner' AND expiry > now();
  • SELECT food FROM pantry WHERE course = 'snack' AND expiry > now();

Well, that's pretty gory. No abstraction at all -- mystery meat queries which need unnecessary work to be understood. Two things need to be abstracted: course and expiry. No one wants stale food, so build that logic first:

CREATE VIEW fresh_food AS
    SELECT *
    FROM pantry
    WHERE expiry < now();

Voici! The per-course queries look like: SELECT food FROM fresh_food where course = 'foo'. But, why stop abstracting now? Why not this?

CREATE VIEW breakfast_menu AS
    SELECT *
    FROM fresh_food
    WHERE course = 'breakfast';

This example is almost too silly, but one more good point can be squeezed out yet. A key benefit of centralized logic is having one place to make bug fixes. For example, the typo in fresh_food which returns stale food. Fix the view DML and all client code immediately benefits without change.

Unfortunately in practice performance can easily suffer when using (and especially layering) views.

Coming soon -- part 2: Views made my queries SLOW!

One of the unpleasant tasks on my to-do list is improve the process of searching by name in a database. A parade of errors are just around the bend: transposed letters, stray punctuation, and even random flecks of whitespace. Given these silly tables and sample data: create table cafe_du_monde (     name varchar2(128),     beignets integer ); create table jackson_square (     name varchar2(128),     pictures_taken integer ); a.) insert into cafe_du_monde (name, beignets) values ('seth', 3); b.) insert into cafe_du_monde (name, beignets) values ('seth ', 6); c.) insert into jackson_square (name, pictures_taken) values ('Seth', 127); d.) insert into jackson_square (name, pictures_taken) values ('seht', 10); Ideally the results would be me eating 9 beignets and taking 137 pictures (let's not mention the cafe au laits :)

Method A: compare as-is

select name, sum(beignets), sum(pictures_taken)
from cafe_du_monde as cdm, jackson_square as js
where cdm.name = js.name
group by name;
Bzzzt. Not good. Look at all those rows! Selected: a, b, c, d

Method B: normalize case

select name, sum(beignets), sum(pictures_taken)
from cafe_du_monde as cdm, jackson_square as js
where lower(cdm.name) = lower(js.name)
group by name;
Selected: a+c, b, d

Method C: trim whitespace

select name, sum(beignets), sum(pictures_taken)
from cafe_du_monde as cdm, jackson_square as js
where trim(both from cdm.name) = trim(both from js.name)
group by name;
Selected: a+b, c, d

Methods B & C

(obvious query omitted) Selected: a+b+c, d

Method D: soundex

select name, sum(beignets), sum(pictures_taken)
from cafe_du_monde as cdm, jackson_square as js
where soundex(cdm.name) = soundex(js.name)
group by name;
Soundex value: S300 Selected: a+b+c+d Soundex works perfectly for this example. When I tried it with more realistic data I wasn't quite as happy. Matt introduced me to the Levenshtein algorithm for counting the number of edits needed to convert one string into another. (ex: cat -> mat = 1, cat -> dot = 2, cat -> dog = 3, cat -> gold = 4)

Method E: Comparing edit distance

levenshtein('seth', 'seth') => 0
levenshtein('seth', 'Seth') => 1
levenshtein('seth', 'seth ') => 1
levenshtein('seth', 'seht') => 2
damerau-levenshtein('seth', 'seht') => 1
The Levenshtein distance algorithm treats the 'ht' in 'seht' as two substitions. The Damerau-Levenshtein algorithm treats it as one swap. Otherwise it's the same, so Damerau-Levenshtein seems like the better method. Unhappily the initial edit distance results seem weaker than the case normalizing and whitespace trimming. Since the algorithms stumble over case and whitespace, so why not take care of those first?

Method F: normalizing case, trimming whitespace, & comparing edit distance

damerau-levenshtein(lower(trim('seth', 'seth'))) => 0
damerau-levenshtein(lower(trim('seth', 'Seth'))) => 0
damerau-levenshtein(lower(trim('seth', 'seth '))) => 0
damerau-levenshtein(lower(trim('seth', 'seht'))) => 1
I think this is better. An edit distance of 25% seems a little high, but the string is so short in this example. At this point I've got enough to start experimenting. The Levenshtein implementation on Wikipedia has to be seen to be believed. Here's a quick and dirty Ruby version. (ObFun: Since Ruby supports jagged arrays I decided to emulate two dimensional arrays with a set of [row,col] keys. A little strange looking but kinda neat). def levenshtein(s1, s2)   d = {}   (0..s1.size).each do |row|     d[[row, 0]] = row   end   (0..s2.size).each do |col|     d[[0, col]] = col    end   (1..s1.size).each do |i|     (1..s2.size).each do |j|       cost = 0       if (s1[i-1] != s2[j-1])         cost = 1       end       d[[i, j]] = [d[[i - 1, j]] + 1,                    d[[i, j - 1]] + 1,                    d[[i - 1, j - 1]] + cost                   ].min       next unless @@damerau       if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])         d[[i, j]] = [d[[i,j]],                      d[[i-2, j-2]] + cost                     ].min       end     end   end   return d[[s1.size, s2.size]] end

Programming challenge #1: 3n + 1

| | Comments (0) | TrackBacks (0)

One of the reasons I joined Near Infinity is the annual allowance of $500 for books, subscriptions, and software. This made it a no-brainer to get a copy of "Programming Challenges" by Steven Skiena and Miguel Revilla. I just wanted a challenge which wasn't following ornate business logic or deploying said software. The book is full solid but (hopefully :) tractable technical challenges. I really like Ruby and decided to use it despite the supported languages being C, C++, Java, and Pascal.

The first problem is called "The 3n + 1 problem." The core idea is to reduce a positive number to one by evaluating these steps:

  • one: stop
  • even number: divide by two
  • odd number: multiply by three and then add one
def step(val)
  if (val == 1)
    val
  elsif (val % 2 == 0)
    val / 2
  else
    val * 3 + 1
  end
end
The next piece is to count the number of steps taken. The rules are clear that the initial number and one are both to be counted as steps. Basically, count the posts in the fence, not the gaps.
def countSteps(val)
  steps = 1
  until (val == 1) do
    val = step(val)
    steps = steps + 1
  end
  steps
end
The final piece is to run a series of numbers through the code, and report which number required the most steps.
def maxSteps(range)
  max = 0
  range.each do |val|
    max = [max, countSteps(val)].max
  end
  max
end
Large numbers passing through this code won't have much effect on memory or stack usage; it looks CPU bound. I wondered how easy it would be to balance CPU usage with memory usage. Since each positive number always needs the same number of steps to reduce to one, it's safe to save the number with its step count in a map. While processing steps, check each result to see if it has been cached. A cache hit results in no further processing, except to cache the step count of the argument. The memory usage will grow with the number of unique arguments.
# elsewhere: @chain = { 1 => 1 }
def computeCycles(val)
  spin = 0
  origVal = val

  until (@chain[val]) do
    val = step(val)
    spin = spin + 1
  end

  unless (origVal == 1)
    @chain[origVal] = @chain[val] + spin 
  end

  @chain[origVal]
end
Here are some tacky metrics produced running on a cold cache while processing 1..200.
[...]
10 => spins used/avoided, %avoided, cache size = 28/49 -- 64, 10
[...]
100 => spins used/avoided, %avoided, cache size = 803/2439 -- 75, 100
[...]
200 => spins used/avoided, %avoided, cache size = 1393/7225 -- 84, 200
So, to wrap up Ruby was a mixed blessing here. Fortunately I was able to do non-trivial things without wasting time & tendons on basic infrastructure. Unfortunately the processing website doesn't accept Ruby :( Soon I hope to tackle more of these challenges ... maybe using Groovy and Lua.

Unicode code points vs code units

| | Comments (0) | TrackBacks (0)
What's the difference between a Unicode code point and a code unit? Roughly, code points are for people and code units are for computers. It takes five code points to say Hello, and they are: 72, 101, 108, 108, 111

Each code unit format has its own structure. Note that UTF-16LE needs an extra code unit.

UTF-8: 0x48, 0x65, 0x6c, 0x6c, 0x6f
UTF-16: 0x0048, 0x0065, 0x006c, 0x006c, 0x006f
UTF-16LE: 0xfffe, 0x4800, 0x6500, 0x6c00, 0x6c00, 0x6f00
UTF-32: 0x00000048, 0x00000065, 0x0000006c, 0x0000006c, 0x0000006f

So... what? Well, someone had a thinking cap on when they defined UTF-8. It's optimal for storing American English and safe for legacy software libraries. My favorite part is how the first bits in the first byte count the number of bytes remaining in the code unit.

UTF-16 is kinda strange when you really look at it. Storing it requires 100% more space than UTF-8 requires for American English. It's big/network endian by default... meaning most computers have to juggle bytes per character before using them. And it's totally incompatible with old ASCII libraries. Finally, it's not possible to rely on 1 utf-16 code unit per code point due to surrogate characters.

The Unicode 5.0 spec is gigantic. It's always stocked on the top shelf at Borders with the other oversized books. Read it and you may believe that ZWNBSP means something. :)

Grails adds layers -- cake or onion?

| | Comments (1) | TrackBacks (0)
Grails has big hopes. It hopes to find a place for Groovy in every tier of a Java-based webapp. It hopes to vanquish configuration files by enforcing implicit requirements / conventions. It hopes to simplify starting a respectable Java webapp by providing and configuring Hibernate, Spring, and multi-environment build files.

I hope the Grails dev team succeeds. JRuby is a well-intentioned, well-heeled, but misbegotten creation (q.v. V. Frankenstein). Should Groovy succeed, it should be the ultimate Java webapp framework. Ultimate in this sense: "being the last or concluding element of a series." Consider this diagram of a minimal Grails application -- can you find room for more layers?

structure of Grails webapp framework

All of those jar files are in the classpath and configured for immediate coding. Did I mention the grails commandline app? It's fantastic. It creates classes & test classes, runs tests, runs apps, and is a good engine for the development workflow. So what's the catch? Adding Grails into an existing project doesn't make much sense. The Grails skeleton would duplicate much of the existing infrastructure. Migrating existing code into a Grails skeleton can be done, but some refactoring will be needed. Here's my quick take based on some tinkering.

Controllers

Rewrite controllers. A controller in Grails lives for one request, unlike a long-lived Spring MVC controller. Grails has done a good job of improving the process of implementing a controller: autobinding the query string key:value pairs to instance fields, simple url mapping conventions, and more.

Services

Consider migrating .java business logic to Grails .groovy services. A Grails service class instance is long lived as in Spring MVC; it can also be injected into a Grails controller very easily. Java service classes with complicated transaction requirements should not be migrated. Transaction / rollback is applied to all Grails service methods by default. Disabling rollback affects all methods in a service class.

Views and layouts

Grails supports Groovy Server Pages (.gsp) and Java Server Pages (.jsp); Sitemesh can be used with both. Unfortunately support for custom JSP tag libraries will wait until version 0.6. GSP isn't bad. The conditional control flow tags are nice, but I really see no strong reason to prefer a .gsp to a .jsp. One notable item on the roadmap is support for Spring Web Flow.

Model / DAO / domain classes

Hibernate mapping files and annotations are well understood ways to control ORM. GORM configuration uses static class variables. Here is a partial list: optionals, transients, belongsTo, hasMany, mappedBy, embedded, constraints. Grails supports Groovy SQL but has plans to fork it eventually. GORM also injects id and version properties to domain objects at runtime.

Conclusion

Grails adds layers, but more cake-like than onion-like. Programmers can look forward to writing Groovy code for a multi-tier webapp with the dependencies already in place. Much existing code can be migrated in by adapting it to the Grails conventions; other parts may be included as .java files. The largest learning curve will probably be GORM. Finally, it's probably a good idea to let Grails mature for 3-6 months before deploying it into production.

Learning to code Groovy Java

| | Comments (0) | TrackBacks (0)

My introduction to Groovy

    Contents:
  1. Intro
  2. Code
  3. Results
  4. Code review
  5. Tools
  6. Conclusion
  7. Side notes

 

Idiomatic Groovy code looks and works like a mash-up of Java and Ruby... JRuby would be an intuitive name. After all, the "J" in the real JRuby has more to do with deployment and execution than look and feel of the code. Which do developers spend more time with?

The proof of the pudding is in the eating -- how Groovy is it? I decided to learn Groovy by writing a minimal RSS reader. It's no big chore in Java, but I was happily surprised how my Java and Ruby skills came into play. Please remember that this is my first shot at Groovy:


 

Grouping rss items by keyword

 1:  def groupByKeyword(groups) {
 2:      def stream;
 3:      def map = [:]
 4:
 5:      try {
 6:          stream = url.openStream();
 7:          def slurper = new XmlSlurper();
 8:          def rss = slurper.parse(stream);
 9:
10:          groups.each {
11:              def group = it.toUpperCase();
12:              map[group] = [];
13:
14:              rss.channel.item.each {
15:                  if (it.title.text().toUpperCase() =~ group) {
16:                      map[group] += it.title;
17:                  }
18:              };
19:          }
20:      } catch (e) {
21:          println e;
22:      } finally {
23:          if (stream)
24:              stream.close();
25:      }
26:
27:      return map;
28:  }

collecting parameters and printing the results

29:    static void main(args) {
30:
31:        def url = args[0];
32:
33:        def keywords = args[1..<args.length];
34:    
35:        def reader = new RssReader(url);
36:
37:        def groups = reader.groupByKeyword(keywords);
38:
39:        groups.entrySet().each {
40:            println "$it.key"
41:
42:            it.value.each {
43:                println "\t* $it"
44:            }
45:        }
46:    }

 

Results

Input:

http://digg.com/rss/indexprogramming.xml java[s] java[^s] groovy haskell erlang microsoft emacs lisp

Output:

GROOVY
ERLANG
JAVA[^S]
EMACS
LISP
JAVA[S]
    * Top 5 javascript frameworks
    * The Javascript Programming Language
HASKELL
MICROSOFT

Input:

http://digg.com/rss/indexprogramming.xml java[s] java[^s] groovy haskell erlang microsoft emacs lisp

Output:

GROOVY
ERLANG
JAVA[^S]
EMACS
LISP
    * pregexp: Portable Regular Expressions for Scheme and Common Lisp
JAVA[S]
    * JavaScript - the world's most ubiquitous computing runtime... and it's not going away soon.
HASKELL
    *  Haskell's HTTP library: how it sucks, how to fix it
MICROSOFT
    *  Microsoft funds questionable study attacking GPL 3 draft process

 

A mini code review

  1. Line #1 demonstrates several Groovy-isms. The line starts a non-static method definition. The return type and parameter type were omitted.
  2. An empty map is defined at Line #3. I forgot to end the statement with a semicolon, but that is optional in Groovy.
  3. Lines 6..9 are more Java-ish than Groovy. I was inclined to work this way due to immature tool support. The Groovy plugin for Eclipse reported syntax errors when the parens were removed, but the code ran fine. The java-mode in emacs couldn't indent code lacking parens and semicolons.
  4. Ruby coders will recognize line 10. Notice the "it" variable on line 11. Groovy automatically defined "it" is a reference to the item being iterated over.
  5. Groovy packs a lot of code into line 14. The Groovy chunk "rss.channel.item" directly matches the RSS structure of <rss><channel><item>. That shrink-wrapped coupling was minimal and readable. I couldn't find an obvious reason to prefer xpath or other xml navigation methods.
  6. Line 15 would have been merged into line 14, but I ran out of patience getting findAll to work.
  7. Line 16 uses += to add an item to an array. -= works as expected.
  8. The fading C/C++ programmer inside me was THRILLED to use if (reference) ... at line 23.
  9. This cheesy code depended on positional parameters. Given that, it was nice to slice them so easily at line 33. The ..< syntax defines a half-open interval [min, max) where min is included and max is excluded.
  10. Notice how "it" is defined in both the inner and outer loop near line 39. The C/C++ guy inside of me got all worked up about shadowing stack variables... that line of thinking is dead wrong about closures in Groovy.

 

Tools

Groovy is a very young language; its tools can't be judged fairly against Java's tools. The most significant problem I had was during debugging. Groovy is very involved in dispatching method calls, so a lot of debugging time will be spent stepping over / out of Groovy internals. That didn't work for me; the app seemed to resume processing. Other issues I noticed were lack of auto package import / cleanup, auto method completion, and the mistaken compile errors on missing parens.

The Groovy plugin did make it easy to start a new project. Installation worked great, and the dependencies were automatically added to the project after I created a Groovy class.


 

Conclusion

Pragmatic: The Groovy developers made a lot of welcome improvements to Java's syntax. It does leverage existing developer skills and would integrate tightly with existing Java code. The tools need a lot of features and some fixes.

Personal: I had bipolar Java/Ruby style swings while coding Groovy. It feels like almost an even compromise between the two... almost too flexible for my personality :). I would consider using it in new projects after Groovy 1.1 (annotations + more) has been released and the tools have matured.


 

Side notes:

  • Wow, that's a lot of .class files per lines of Groovy:
    $ wc -l *.groovy
          24 Hola.groovy
          90 RssReader.groovy
         114 total
    $ echo .groovy && ls *.groovy | wc -l; echo .class && ls *.class | wc - l;
    .groovy
           2
    .class
          14
    
  • Google Trends comparison of Groovy and JRuby
  • Groovy was started in early 2004! Version 1.0 was released almost three years later. This article describes the waning and waxing of progress implementing the language.

Quick and dirty SQL histogram

| | Comments (0) | TrackBacks (0)

Sometimes you really want a quick & dirty histogram while looking through a database:

  • when you suspect the mean value is misleading
  • when you want to understand how the values are distributed
  • ... and easily switch between different sources of values
  • ... without exporting data & switching applications

Here are the story scores from a recent front page of reddit.com: 175, 456, 140, 191, 230, 186, 134, 215, 171, 83, 102, 171, 182, 322, 193, 310, 338, 345, 174, 134, 92, 109, 241, 256, 132

A basic statistical query returns:

+-----+-----+-----+--------+
| max | min | avg | stddev |
+-----+-----+--------------+
| 456 |  83 | 203 |   90   |
+-----+-----+-----+--------+

The standard deviation seems awfully large. Maybe not many of the scores are close to the mean score of 203? What if another query could show the distribution?

+--------+----------+--------+----------+         +--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling |         | bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+         +--------+----------+--------+----------+
|      1 |        8 |     83 |      157 |         |      1 |        4 |     83 |      119 |
|      2 |       10 |    158 |      232 |         |      2 |        4 |    120 |      156 |
|      3 |        2 |    233 |      307 |         |      3 |        8 |    157 |      193 |
|      4 |        4 |    308 |      382 |         |      4 |        2 |    194 |      230 |
|      5 |        1 |    383 |      457 |         |      5 |        2 |    231 |      267 |
+--------+----------+--------+----------+         |      6 |        0 |    268 |      304 |
                                                  |      7 |        3 |    305 |      341 |
                                                  |      8 |        1 |    342 |      378 |
                                                  |      9 |        0 |    379 |      415 |
                                                  |     10 |        1 |    416 |      452 |
                                                  +--------+----------+--------+----------+

The histogram on the left has fewer, larger buckets. This is a lot more informative than the mean & stddev. The histogram on the right uses more, smaller buckets. Maybe this is too verbose? What if you wanted seven buckets?

        update dhg.bucket_count set num_buckets = 7;
        select * from dhg.results;

        +--------+----------+--------+----------+
        | bucket | contents | _floor | _ceiling |
        +--------+----------+--------+----------+
        |      1 |        7 |     83 |      135 |
        |      2 |        7 |    136 |      188 |
        |      3 |        5 |    189 |      241 |
        |      4 |        1 |    242 |      294 |
        |      5 |        4 |    295 |      347 |
        |      6 |        0 |    348 |      400 |
        |      7 |        1 |    401 |      453 |
        +--------+----------+--------+----------+

Instructions:

  1. Insert numbers to be analyzed:
  2. class="prettyprint"INSERT INTO dhg.source SELECT foo FROM bar;
  3. Choose how many buckets in the histogram
  4. UPDATE dhg.bucket_count SET num_buckets = 10;
  5. Read the results!
  6. SELECT * FROM dhg.results_full;

Materials:

All views, tables, and functions will live in a dynamic histogram (dhg) schema. The SQL is pretty minimal yet hopefully reasonably structured and commented. The MySQL flavor is larger due to an implementation of width_bucket.

WARNING: The implementation suffers from a variety of rounding errors and poor error handling. This is a quick and dirty solution for rough estimates only.

NOTE: Using more than 20 buckets will need a small tweak. Grep the SQL for empty_buckets.

I'd love to hear criticisms, comments, & suggestions!

"Data elements" are the focus of Section 5.2.1 of Dr. Fielding's dissertation. I think the focus is who does how much work to render the data. He lists three options: mostly server, mixed client & server, and mostly client:  Server                                                          Client |----------------------------------------------------------------------|       |                    |                                 |  1. Fixed format       2. encapsulated data & code    3. raw data & metadata     (jpeg)                 (json, javascript)             (html <img> tag?)

This is an important decision for an online spreadsheet. Should the server send only literal values (option one)? Should the client evaluate functions and resolve references? (option two)?

Option 1:

    pro: parsing cell values into a tree and traversing it is tricky. Writing it once in Java seems like a reasonable approach.

    con: client has no idea which cells are related to each other. That would really complicate features like copying and pasting cell references.

Option 2:

    con: reimplement much of cell parsing & traversal in Javascript. Using Rhino to test it outside a browser mitigates this a little.

    pro: client knows which cells are related.

Option 3:

    con: no idea how to apply this approach for a spreadsheet.

I chose option two. Eventually the client will probably need to work directly with cell references. Better to get that work out of the way ahead of time. Not all of the logic needs to be rewritten. Some of it is only needed on one side:

common:

  • build parse tree from cell values
  • iterate over parse tree

server specific:

  • update reference values: A1 => B1, $G3 => $G4
  • collapse the parse tree back into a flat string

client specific:

  • resolve reference values (A1 = 123.45, G3 = A1 = 123.45).
  • evaluate functions (=SUM(G3,A1) => 246.9).

I didn't expect so many neat challenges. Even the prototype implementation needed parsing, recursion, evaluating simple math expressions, tree traversal, base10 & base26 conversion, and more. But all that is meat for another post.

NFJS in general and JRuby in specific

| | Comments (0) | TrackBacks (0)
Several of the NearInfinity developers went to the Reston stop of the No Fluff Just Stuff conference. Some of the conferences were especially interesting. It was thrilling to hear strong, detailed skepticism of SOA from the expert panel. The arms race with .Net to host other programming languages has gone wild!
  • JHaskell.
  • Sleep. Quoted from the site: Sleep is heavily inspired by Perl with bits of Objective-C thrown in
  • Dynamic Java.
  • Groovy seems like Java "lite." The groovy compiler reportedly accepts most Java source, but also permits a more flexible yet terse syntax. This gained a lot of traction at the conference. Some people complained that the name sounded too much like "Ruby." Which leads to...
  • JRuby. The overall goal is to mate the slick Ruby syntax with the mature Java VM. I'm very hesitant to seriously consider it for a few reasons:
    1. The interpreter will variously introduce bugs and lack features. This will be an ongoing issue as Ruby is an advanced, relatively young language; 2.0 is under active development.
    2. The Ruby VM will get much better over time. When it does, the argument for using the excellent Java VM will be much weaker.
  • For better or worse, I didn't see a competitor to Fortran.NET

REST Network API for an online spreadsheet

| | Comments (0) | TrackBacks (0)

Every spreadsheet application needs to support the creating, reading, updating, and deleting of sheets, columns, rows, and cells. The network protocol for an online spreadsheet could easily treat sheets, columns, etc. as resources and offer CRUD operations for manipulating them.

The requests below were hand generated and sent via telnet. The responses were copy and pasted from the console window. A real frontend might use this API with XmlHttpRequest.

operation request response
Create a sheet
POST /fauxcel/finances HTTP/1.1
Host: 192.168.113.115:8080
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Length: 0
Date: Wed, 25 Apr 2007 19:26:39 GMT
    
Populate a cell
POST /fauxcel/finances/A/1 HTTP/1.1
Host: 192.168.113.115:8080
Content-Type: text/text;charset=utf-8
Content-Length: 34

{"value":"123.45", "type":"value"}
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Length: 0
Date: Wed, 25 Apr 2007 19:36:37 GMT
Insert a column
POST /fauxcel/finances/A HTTP/1.1
Host: 192.186.113.115:8080
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Length: 0
Date: Wed, 25 Apr 2007 19:36:37 GMT
Read a column
GET /fauxcel/finances/B HTTP/1.1
Host: 192.168.113.115:8080
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked
Date: Wed, 25 Apr 2007 19:39:38 GMT

53
{"name":"finances","cells":[{"col":"B","row":"1","type":"value","value":"123.45"}]}
0
Delete a row
DELETE /fauxcel/finances/1 HTTP/1.1
Host: 192.168.113.115:8080
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Length: 0
Date: Wed, 25 Apr 2007 19:41:26 GMT

Note that the back end moved the value of cell A1 into cell B1 when a column was inserted before A. More details on that in a following post!

REST vs. SOAP vs. POX vs. TBD

| | Comments (0) | TrackBacks (0)
I have jumped into the fray of trying to sort out REST from SOAP from POX. Here are my initial opinions:

REST: Fielding's introduction [1] lays out constraints and why they are important. I was hoping the subsequent chapter [2] focusing on REST and HTTP would have implementation level details on "what" to do. What I got out of it was Fielding reminiscing on the good, bad, and the ugly of HTTP 1.0 and 1.1. It was a serious mistake for me to expect a doctoral thesis to also be "REST for dummies." Still, the voice from on high seems to have defined the goal without clearing a trail to follow.

This site [3] was very helpful. I'm looking forward to adding this book [4] to my Safari bookshelf.

SOAP & WS-*: SOAP has a wealth of concrete documentation and libraries. Developers have spent a lot of time building tools to help the development process. All this sounds great! But... something about all the flux [5] and xml all the way down made my spidey-sense tingle.

I have written client code for an efficient, robust, and scalable family of network applications [6]. That protocol family exists because other parts of the business subsidised its architects, developers, testers, hardware, and bandwidth. Also necessary was the hands-off management style which gave the architecture time and space to grow naturally. This leads me to doubt success for the volunteer cooks [7] in the WS-I kitchen. Hopefully their contributions will improve the general state of web services.

POX: Plain ol' XML. Not sure this term means more than that... SOAP/WS-* are XML all the way down. XML may also be part of a REST solution -- ROX anyone? :)


[1] Chapter 5 of Fielding's thesis

[2] Chapter 6 of Fielding's thesis

[3] RestWiki

[4]