What's a Bloom filter?
A hash map works by applying a hash function to data, and associating the output with the input data. Considering a simple hash map which stores only values, and not buckets of values:
Traits of a hash map key set
- After an item has been stored, its existance will be known. It will never go missing. There is no chance of saying an item has not been stored after it has been stored.
- After an item has been stored, it may be identified as a different item. This happens when the hash function generates the same value for distinct data. So there is a risk of falsely confirming item B has been stored after only storing item A.
- The space required to hold these keys will be the product of the key size and key count. While key size is probably fixed, key count is variable and will lead to more space needed as more items are added.
Introducing Bloom filters
A Bloom filter shares traits 1 and 2 of a hash map key set, but not trait 3. The space required by a Bloom filter does not increase as more items are stored. Rather than processing data with one hash function and using the result as a lookup value, a Bloom filter uses multiple hash functions to identify booleans in a fixed size array.Bloom filter traits
- Same as hash map key set trait 1
- Same as hash map key set trait 2
- Storage space is small, but depends on the desired false positive rate
- Lookup time depends on number of hash functions; can be run in parallel
- Items cannot be removed
Code
This tacky, easily improvable code was released to the public domain and promises nothing.0: class MediocreBloomFilter 1: M = 16 # number of boolean fields in filter 2: K = 3 # number of hash functions to find boolean fields 3: 4: attr_accessor :v # the filter, a vector of boolean fields 5: 6: def initialize() 7: @v = Array.new(M, 0) 8: end 9: 10: def track_item(item) 11: indicies_for?(item).each do |i| 12: @v[i] = 1 13: end 14: end 15: 16: def item_tracked?(item) 17: hits = 0 18: indicies_for?(item).each do |i| 19: hits = hits + @v[i] 20: end 21: hits == K 22: end 23: 24: def indicies_for?(item) 25: indicies = [] 26: md5 = Digest::MD5.new 27: hash = "" 28: 29: md5.update(item.to_s) 30: hash = md5.hexdigest 31: 32: K.times do |k| 33: indicies << HEX.index(hash[k].chr.to_s.upcase) 34: end 35: 36: indicies 37: end 38: 39: HEX = [ "0", "1", "2", "3", "4", "5", "6", "7", 40: "8", "9", "A", "B", "C", "D", "E", "F" ] 41: end
- Line 7 sets the filter storage to 16 integers. It could have been two bytes if I wanted to do bitwise operations.
- Line 24 runs the hash functions against the target data. Since this is tacky sample code only one hash function is executed (line 30), and multiple slices are taken from the value and converted into array indicies (at line 33).
- Using integers rather than booleans has a nice side effect. It permits removal of items by using the integers as reference counts. I haven't tested it, but minor changes should be needed around lines 12 and 19, and a removal function to decrement the counts.
- item: "abc"
- md5: 900150983cd24fb0d6963f7d28e17f72
- indicies: [9, 0, 0]
- filter: [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
- item: "abd"
- md5: 4911e516e5aa21d327512e0c8b197616
- indicies: [4, 9, 1]
- filter: [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
- item: "seth schroeder"
- md5: 2c38f5fd13873c139567551ca1b7496e
- indicies: [2, 12, 3]
- filter: [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
References
- Using Bloom Filters
- Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
- Burton Bloom, Space/time trade-offs in hash coding with allowable errors, CACM, 13(7):422-426, July 1970.
- Paul E. Black, "Bloom filter", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 16 May 2008. Available from: http://www.nist.gov/dads/HTML/bloomFilter.html
P.S.: If you've read this far THANK YOU. This entry has also been posted to my personal blog.
Problem 10 of Project Euler is to sum the prime numbers less than two million. Solutions are supposed to take no more than one minute, so a decent prime number detector is mandatory. I'm sure lots of free, high quality solutions are available, but I have this problem with DIY / NIH syndrome. Somehow I found the sieve of Eratosthenes and a quick, tacky swipe at it was good enough.
This process is to make a list of numbers from 2 through some limit. Start with the lowest number in the list, and remove all multiples from the list. Next time through, pick the next smallest number that hasn't been crossed off yet and eliminate its multiples. Stop when your current number meets the square root of the limit. After that, the remaining numbers are prime:
def eratosthenes_sieve(ceiling)
vals = (0..ceiling).to_a
vals[0..1] = [nil, nil]
Math.sqrt(ceiling).floor.times do |val|
next unless vals[val]
((val ** 2)..ceiling).step(val) do |mult|
vals[mult] = nil
end
end
vals
end
A big hash like this is useful for O(1) checking for prime/not prime, but the solution requires the sum of the primes:
def primes_through(ceiling)
eratosthenes_sieve(ceiling).compact
end
sum = 0
primes_through(ARGV[0].to_i).each { |val| sum = sum + val }
puts sum
I'm sure there's a more Ruby-esque way to write that. I wanted to use Python for xrange goodness, but I'm still not at peace with its quirks. Groovy would have been nice, but the free editor support is still quite young. Ruby fit right in the gap.
Updated 2008-05-02: new Python implementation using sum() -- thanks Pete!
My blog post about the first problem in Project Euler went strange places. I was just trying to learn Python by solving a barely non-trivial problem. Maybe it was a full moon, not sure, but it wound up in a tangle of gcc, groovyc, and .elc. Happily, some good performance improvements were suggested and I've tried to incorporate them below.
This is a really long post. Here's the skeleton:
- Runtime environment
- Results
- Code (alphabetically by language)
- Big O(rly)?
- Wrap up
- Extras
Runtime environment
- Machine: Powerbook Pro, 10.5.2, 2x 2.16 GHz Core 2 Duo
- Userspace programs: Finder, Terminal
- Battery conservation mode: Better Performance (plugged in)
- C: GCC 4.0.1 (Apple Inc. build 5465)
- Groovy: 1.5.5 JVM: 1.5.0_13-119
- Lisp: GNU Emacs 22.1.1
- Perl: v5.8.8, built for darwin-thread-multi-2level
- Python: 2.5.1
- Ruby: 1.8.6 (2007-09-24 patchlevel 111) [universal-darwin9.0]
Results
No cheesy graphs this time, just better organized numbers. Each row has one implementation of the problem. The 10**x columns show how many milliseconds it took the implementation to sum the multiples of three and five less than that number. Each implementation was run seven times for each column, and the average runtime is shown below. A test harness was written to execute the scripts automatically; I disabled the screensaver, started the tests, and went to bed.
| Code | 10**1 | 10**2 | 10**3 | 10**4 | 10**5 | 10**6 | 10**7 | Comments |
|---|---|---|---|---|---|---|---|---|
| first.c | 5 | 5 | 5 | 5 | 11 | 65 | 648 | |
| first.groovy | 928 | 929 | 1065 | 1643 | 2155 | 7998 | 52732 | |
| second.groovy | 920 | 1004 | 1189 | 1353 | 1806 | 3243 | 19620 | |
| third.groovy | 914 | 880 | 936 | 1181 | 1421 | 2545 | 12835 | |
| first.pl | 8 | 8 | 10 | 26 | 192 | 1886 | 18845 | |
| second.pl | 8 | 8 | 8 | 11 | 36 | 378 | 3062 | |
| first.py | 22 | 22 | 23 | 35 | 160 | 1506 | 14939 | |
| second.py | 147 | 22 | 21 | 24 | 53 | 422 | 4242 | |
| third.py | 22 | 22 | 22 | 24 | 51 | 405 | 4081 | |
| fourth.py | 21 | 22 | 21 | 22 | 26 | 156 | 1580 | Credit goes to Pete below |
| first.rb | 9 | 8 | 9 | 20 | 134 | 1420 | 14288 | |
| second.rb | 9 | 8 | 9 | 17 | 100 | 1075 | 10818 | |
| first.el | 20141 | 10^1...6 not available | ||||||
| second.el | 8172 | Actually first.elc | ||||||
| instant.rb | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
Code
C
#include <stdio.h>
#include <math.h>
#define multiple_of(a, b) ((a == 0 || b == 0) ? 0 : a % b == 0)
int main(int argc, char** argv)
{
int exp = atoi(argv[1]);
double sum = 0;
int i;
for (i = 3; i < pow(10, exp); i++)
if (multiple_of(i, 3) || multiple_of(i, 5))
sum += i;
printf("%0.0f\n", sum);
return 0;
}
Groovy
// first.groovy
exp = Integer.parseInt(args[0])
mof = { val, test -> val && test && test % val == 0 }
mof3 = mof.curry(3)
mof5 = mof.curry(5)
sum = 0 as BigDecimal
(3..<(10**exp)).each {
if (mof3(it) || mof5(it)) {
sum += it
}
}
println sum
// second.groovy
def exp = Integer.parseInt(args[0])
def mof = { val, test -> val && test && test % val == 0 }
def mof3 = mof.curry(3)
def mof5 = mof.curry(5)
def sum = 0 as BigInteger
(3..<(10**exp)).each {
if (mof3(it) || mof5(it)) {
sum += it
}
}
println sum
// third.groovy
def exp = Integer.parseInt(args[0])
def sum = 0 as BigInteger
for (int i = 3; i < 10 ** exp; i++)
if (i % 3 == 0 || i % 5 == 0)
sum += i
println sum
Lisp
; first.el
(defun multiple_of (a b)
(= 0 (% a b))
)
(defun val-to-add (val)
(cond
((multiple_of val 3) val)
((multiple_of val 5) val)
(0)
)
)
(defun sum-of-3or5-multiples (exp)
(let ((num 3) (sum 0.0))
(while (< num (expt 10 exp))
(setq sum (+ sum (val-to-add num)))
(setq num (1+ num))
)
sum
)
)
(sum-of-3or5-multiples 7)
Perl
#!/usr/bin/env perl -w
# first.pl
use strict;
my $exp = $ARGV[0];
my $sum = 0;
sub multiple_of {
my ($val, $tgt) = @_;
return $val % $tgt == 0;
}
for my $i (3 ... 10 ** $exp - 1) {
if (multiple_of($i, 3) || multiple_of($i, 5)) {
$sum += $i
}
}
print "$sum\n"
#!/usr/bin/env perl -w
# second.pl
use strict;
my $exp = $ARGV[0];
my $sum = 0;
for my $i (3 ... 10 ** $exp - 1) {
if ($i % 3 == 0 || $i % 5 == 0) {
$sum += $i
}
}
print "$sum\n"
Python
# first.py
import sys
def multiple_checker(val):
return lambda test: test and val and test % val == 0
mof3 = multiple_checker(3)
mof5 = multiple_checker(5)
sum = 0
for i in range(10 ** int(sys.argv[1])):
if mof3(i) or mof5(i):
sum = sum + i
print sum
# second.py
import sys
exp = int(sys.argv[1])
sum = 0
for mof3or5 in set(range(3, 10**exp, 3)) | set(range(5, 10**exp, 5)):
sum = sum + mof3or5
print sum
# third.py
import sys
exp = int(sys.argv[1])
sum = 0
for mof3or5 in set(xrange(3, 10**exp, 3)) | set(xrange(5, 10**exp, 5)):
sum = sum + mof3or5
print sum
# fourth.py import sys exp = int(sys.argv[1]) total = sum(xrange(3, 10**exp, 3)) + sum(xrange(5, 10**exp, 5)) - sum(xrange(15, 10**exp, 15)) print total
Ruby
# first.rb exp = ARGV[0].to_i sum = 0 def is_mult(large, small) large % small == 0 end for i in 3..(10 ** exp - 1) next unless is_mult(i, 3) or is_mult(i, 5) sum = sum + i end puts sum
# second.rb exp = ARGV[0].to_i sum = 0 for i in 3..(10 ** exp - 1) next unless i % 3 == 0 or i % 5 == 0 sum = sum + i end puts sum(top)
Big O(rly)?
Well... sigh... all of the code above works correctly but is slow and naïve. All of the approaches above take more time as the largest multiple increases. I was ignorantly, happily tweaking code down this road until fortunately Bob clued me in. The best approach takes the same amount of time as the largest multiple increases. Why? There's a simple equation for this, calculating an arithmetic series.
# instant.rb ceil = 10 ** ARGV[0].to_i - 1 def series_of_multiples(val, ceiling) diff = val vals = ceiling / val last_val = val + diff * (vals - 1) (val + last_val) / 2.0 * vals end def som(v, c) series_of_multiples(v, c).to_i end puts som(3, ceil) + som(5, ceil) - som(15, ceil)
Wrap up
- I got a lot more out of this than a quick glance at Python. It was good to calculate a series or two, but much better to build a little test harness.
- Groovy performance tested out well here. I would avoid using some of its features in tight loops. I'm confident that Monsieur LaForge et son amis will optimize that concern away eventually.
- I thought I could love it, I really wanted to love it, it's loved by many for many good reasons, but Python started to chafe me. It's so hard to nail down exactly why... it's chock full of goodness, but just seems to need more Kool-Aid than Groovy and especially Ruby. Perl's Kool-Aid threshold is a well-known constant, at least until the Parrot changes things drastically.
- Cosine similarity + arithmetic series = it's good to work with a friendly, helpful PhD. in math. Thanks Bob :-)
Extras
The Python test harness. Probably needs tweaking for Wintel boxes.
import sys
import re
from datetime import timedelta, datetime
from os import system
#
# prefix used to run scripts. Trailing whitespace matters!
#
vms = {
'el': 'emacs --script ',
'elc': 'emacs --script ',
'groovy': 'groovy ',
'out': './',
'pl': 'perl ',
'py': 'python ',
'rb': 'ruby '
}
# Configurable parameters
#
DEFAULT_MIN = 1
DEFAULT_MAX = 7
min = DEFAULT_MIN # min "max" val = 10 ** min
max = DEFAULT_MAX # max "max" val = 10 ** max
reps = 7 # times to repeat each test, then average the results
# Helper method
def td_to_ms(delta):
val = delta.microseconds / 1000.0
val = val + delta.seconds * 1000
val = val + delta.days * 24 * 60 * 60 * 1000
return val
#
# "main"
#
for script in sys.argv[1:]:
ext = script.split('.').pop()
if not ext in vms:
print 'ignoring', script, 'unrecognized extension'
continue
vm = vms[ext]
# the lisp files hardcode the max values :(
if re.search('^el', ext):
max = min
else: # reset to defaults, needed when running multiple types of tests
max = DEFAULT_MAX
min = DEFAULT_MIN
# gotta like list comprehensions
runs = [0 for x in range(max - min + 1)]
for rep in xrange(reps):
for exp in xrange(min, max + 1):
start = datetime.now()
system('%s%s %s >/dev/null' % (vm, script, exp))
end = datetime.now()
runs[exp - min] = runs[exp - min] + td_to_ms(end - start)
for i in xrange(len(runs)):
runs[i] = int(runs[i] / reps)
print script.split('/').pop(), runs
# see, isn't it strange how python leaves you hanging at the end
# of the last scope in the file?
Edit 2008-04-29: more and better results available here.
I've been waffling about glancing at Python. It comes off as a little ... dry? Indentation drives scope? I was afraid it would be another Objective-C; popular, modeled after successful languages, but just gets stuck in my mental parser. Even Visual Basic and Pascal (q.v. Inside Macintosh) passed that gate.
Anyways, Project Euler seemed like a good way to learn a new language. Problem #1 is summing multiples of 3 or 5 less than one thousand. Some basic iteration handles the task pretty easily -- might as well be all trendy and mix in lambda functions. And, while the hood is open, why not try it in a couple of languages and generate an unscientific performance graph?
So back to the implementation -- the iterative approach is easy. But Python 2.5-ish natively supports set operations like union and intersection. Combine that with a built-in method to generate numeric sequences and the problem is basically solved. For the win -- it outperformed the iterative approach! Oh, and every language tested except Python showed negative numbers when the sum scrolled past 2**31.
Groovy, Python, C, and Emacs Lisp competed for best performance. And by competed, I mean quick and dirty code (esp. Lisp) run a few times on my MacBook Pro. Graphics first, then the code.
The lisp numbers are the sketchiest -- (current-time) only spit out seconds.
Basic info: Groovy 1.5.5 binary release, Python 2.5.1 shipped with Leopard, GCC 4.0.1, Carbon Emacs Spring 2008
Python, iterative
import sys
def multiple_checker(val):
return lambda test: test and val and test % val == 0
mof3 = multiple_checker(3)
mof5 = multiple_checker(5)
sum = 0
for i in range(10 ** int(sys.argv[1])):
if mof3(i) or mof5(i):
sum = sum + i
print sum
Python, set oriented
import sys
exp = int(sys.argv[1])
sum = 0
for mof3or5 in set(range(3, 10**exp, 3)) | set(range(5, 10**exp, 5)):
sum = sum + mof3or5
print sum
Groovy, iterative
exp = Integer.parseInt(args[0])
mof = { val, test -> val && test && test % val == 0 }
mof3 = mof.curry(3)
mof5 = mof.curry(5)
sum = 0 as BigDecimal
(3..<(10**exp)).each {
if (mof3(it) || mof5(it)) {
sum += it
}
}
println sum
Emacs Lisp, iterative (be nice, I'm really weak here)
(defun multiple_of (a b)
(= 0 (% a b))
)
(defun val-to-add (val)
(cond
((multiple_of val 3) val)
((multiple_of val 5) val)
(0)
)
)
(defun sum-of-3or5-multiples (exp)
(let ((num 3) (sum 0.0))
(while (< num (expt 10 exp))
(setq sum (+ sum (val-to-add num)))
(setq num (1+ num))
)
sum
)
)
C, iterative. Not very much code here, would have probably taken more lines of Java!
#include <stdio.h>
#include <math.h>
#define multiple_of(a, b) ((a == 0 || b == 0) ? 0 : a % b == 0)
int main(int argc, char** argv)
{
int exp = atoi(argv[1]);
double sum = 0;
int i;
for (i = 3; i < pow(10, exp); i++)
if (multiple_of(i, 3) || multiple_of(i, 5))
sum += i;
printf("%0.0f\n", sum);
return 0;
}
Apache Commons Codec implemented soundex, double metaphone, base64, and more. in the spirit of Grails why not stitch them into java.lang.String?
import org.apache.commons.codec.language.*
import org.apache.commons.codec.net.*
class FuzzstrGrailsPlugin {
def version = 0.1
def dependsOn = [core:grails.util.GrailsUtil.getGrailsVersion()]
def doWithDynamicMethods = {
def encodingClosure = { it.encode(delegate) }
def decodingClosure = { it.decode(delegate) }
String.metaClass.toSoundex = encodingClosure.curry(new Soundex())
String.metaClass.toRefinedSoundex = encodingClosure.curry(new RefinedSoundex())
String.metaClass.toMetaphone = encodingClosure.curry(new Metaphone())
String.metaClass.toDoubleMetaphone = encodingClosure.curry(new DoubleMetaphone())
String.metaClass.toBase64 = encodingClosure.curry(new BCodec())
String.metaClass.fromBase64 = decodingClosure.curry(new BCodec())
String.metaClass.toQPrintable = encodingClosure.curry(new QCodec())
String.metaClass.fromQPrintable = decodingClosure.curry(new QCodec())
String.metaClass.toQuotedPrintable = encodingClosure.curry(new QuotedPrintableCodec())
String.metaClass.fromQuotedPrintable = decodingClosure.curry(new QuotedPrintableCodec())
String.metaClass.toUrlEncoded = encodingClosure.curry(new URLCodec())
String.metaClass.fromUrlEncoded = decodingClosure.curry(new URLCodec())
String.metaClass.similarTo = { Cosine.stringSimilarity(delegate, it) }
String.metaClass.mostSimilarTo = { Cosine.mostSimilar(delegate, it) }
String.metaClass.rankedSimilarity = { Cosine.horseShoes(delegate, it) }
}
}
So what? Well, see for yourself:
Loading with installed plug-ins: ["fuzzstr"] ...
Groovy Shell (1.5.4, JVM: 1.5.0_13-119)
Type 'help' or '\h' for help.
-------------------------------------------------------------------------------
groovy:000> "literal".toSoundex()
===> L364
groovy:000> "literal".toDoubleMetaphone()
===> LTRL
groovy:000> "literal".toBase64().fromBase64()
===> literal
groovy:000> "literal".similarTo("litteral")
===> 93
groovy:000> "litteral".mostSimilarTo(['literature', 'litter', 'literal'])
===> literal
The plugin's not ready to be released yet but hopefully within a few weeks; localization and caching need some love. Pondering whether GORM might benefit from some of these methods.
seth _dot_ schroeder _at_ nearinfinity _dot_ com
I believe that most software tests favor code over data. It might be an issue of tools; code testing tools are easy to use, have aggressive ide and framework support, and produce coverage metrics. Maybe better data testing tools would increase data coverage. Here's some code to demonstrate one need when testing lots of data: catch wrong results and absent results.
One approach to testing data involves these parts:
- Define the input data
- Define the expected output data
- Implement code to process the input data
- Generate the actual output data by evaluating #1 with #3
- Compare the results of #4 with #2
Easy as pie, right?

I think #5 is more interesting than it seems. The challenge is to report every difference between the expected output and the actual output. Detecting wrong results is easy. Detecting absent results is harder.

Blue results were found, but not expected. Yellow results were expected, but not found. Green results were found and expected. I expected this to work out easily as differences and intersections of sets, but hit a little snag in the tool implementation.
About this huge chunk of code:
- Lines 6-8 define the input and expected output. The values are stored in a map, where the keys are the input and the values the output.
- Thank Groovy! It makes passing code around so easy. Line 12 defines the test logic inline.
- test1 demonstrates all good, no bad results
- test2 demonstrates wrong results
- test3 demonstrates absent results
- licensed with the Apache 2 license
And putting the cart before the horse, here are the results of the code:
--Output from test1-- --Output from test2-- a: wanted A, got B --Output from test3-- caught exception java.lang.Exception: c!?!? c was ignored!
Enough, here's the code:
1: class TestStuff extends GroovyTestCase { 2: 3: def inputOutput = [:] 4: 5: void setUp() { 6: inputOutput['a'] = 'A' 7: inputOutput['b'] = 'B' 8: inputOutput['c'] = 'C' 9: } 10: 11: void test1() { 12: assert verifyAll(inputOutput, { it.toUpperCase() } ) 13: } 14: 15: void test2() { 16: assert verifyAll(inputOutput, { 17: if (it == 'a') ++it 18: it.toUpperCase() 19: }) 20: } 21: 22: void test3() { 23: assert verifyAll(inputOutput, { 24: if (it == 'c') throw new Exception("c!?!?") 25: it.toUpperCase() 26: }) 27: } 28: 29: def verifyAll(expected, method) { 30: def actual = runTests(expected, method) 31: def missed = diffSets(expected.keySet(), actual.keySet()) 32: def unexpected = diffMaps(actual, expected) 33: 34: missed.each { println("${it} was ignored!") } 35: 36: unexpected.each { key, val -> 37: println("${key}: wanted ${expected[key]}, got ${val}") 38: } 39: 40: return actual == expected 41: } 42: 43: def runTests(control, method) { 44: def test = [:] 45: 46: control.each { input, value -> 47: try { test[input] = method(input) } 48: catch (e) { println ("caught exception ${e}") } 49: } 50: 51: return test 52: } 53: 54: def diffSets(a, b) { 55: def diff = [] 56: a.each { if (!b.contains(it)) diff << it } 57: return diff 58: } 59: 60: def diffMaps(a, b) { 61: def diff = [:] 62: a.each { key, val -> if (a[key] != b[key]) diff[key] = val } 63: return diff 64: }
Wait wait wait -- real programs are built with classes!
// ... imagine Ahhbject.groovy (say it with a Boston accent :)
class Ahhbject {
Object val
String toString() { "${this.class.name}@${hashCode()} ${val}" }
}
65: void test4() {
66: inputOutput.a = new Ahhbject(val:'A')
67: inputOutput.b = new Ahhbject(val:'B')
68: inputOutput.c = new Ahhbject(val:'C')
69:
70: assert verifyAll(inputOutput, {
71: new Ahhbject(val: it.toUpperCase())
72: })
73: }
--Output from test4--
a: wanted Ahhbject@3260319 A, got Ahhbject@1806345 A
b: wanted Ahhbject@9259311 B, got Ahhbject@12565457 B
c: wanted Ahhbject@12823094 C, got Ahhbject@14416801 C
Curses! It doesn't just work. The cause is one of my pet peeves -- a class which can't identify distinct but equivalent instances. Grails makes this easy to fix by bundling the Apache commons lang library. HashCodeBuilder and EqualsBuilder make quick work of implementing hashCode and equals:
int hashCode() { HashCodeBuilder.reflectionHashCode(this) }
boolean equals(Object obj) { EqualsBuilder.reflectionEquals(this, obj) }
Add those two lines and ta-da! Fit for data testing!
I really appreciate Matt and Bob showing me significantly better ways to compare strings. Yet another perk of working for this company :)
I've been spending time with grails lately. One of my biggest beefs is 100% accuracy required 100% of the time when typing parameters to the "grails" application. Not a big deal (esp. with emacs M-/)... but what an opportunity to do something useful with cosine similarity!
All code on this page is licensed under the Apache 2 license.
The heart of the code. Please ignore the lack of error handling and optimization.
static def similarity(l_seq, r_seq, degree=2) {
def l_histo = countNgramFrequency(l_seq, degree)
def r_histo = countNgramFrequency(r_seq, degree)
dotProduct(l_histo, r_histo) /
Math.sqrt(dotProduct(l_histo, l_histo) *
dotProduct(r_histo, r_histo))
}
This probably isn't ideal Groovy, but I sure appreciated the syntax improvements to arrays and maps.
static def countNgramFrequency(sequence, degree) {
def histo = [:]
def items = sequence.size()
for (int i = 0; i + degree <= items; i++)
{
def gram = sequence[i..<(i + degree)]
histo[gram] = 1 + histo.get(gram, 0)
}
histo
}
Still nothing specific to strings here...
static def dotProduct(l_histo, r_histo) {
def sum = 0
l_histo.each { key, value ->
sum = sum + l_histo[key] * r_histo.get(key, 0)
}
sum
}
Finally, here's something useful:
static def stringSimilarity (l_str, r_str, degree=2) {
similarity(l_str.toLowerCase().toCharArray(),
r_str.toLowerCase().toCharArray(),
degree)
}
... okay, how to apply this to grails? Just one more bit of code:
static def mostSimilar(pattern, candidates, threshold=0) {
def topScore = 0
def bestFit = null
candidates.each { candidate ->
def score = stringSimilarity(pattern, candidate)
if (score > topScore) {
topScore = score
bestFit = candidate
}
}
if (topScore < threshold)
bestFit = null
bestFit
}
Drop this into GrailsScriptRunner.groovy and now this:
- grails create-domain-class
... is equivalent to all of these:
- grails craete-domain-class
- grails domain
- grails dom
To everyone who read this far, thank you :) And... one other observation. Strings are ordered sequences (of chars). What about file contents (lines) and directory contents (file names)?
static def fileSimilarity (l_file, r_file, degree=3) {
similarity(new File(l_file).readLines(),
new File(r_file).readLines(),
degree)
}
static def dirSimilarity(l_dir, r_dir, degree=3) {
def l_files = []
def r_files = []
new File(l_dir).eachFileRecurse { l_files << it.name }
new File(r_dir).eachFileRecurse { r_files << it.name }
similarity(l_files, r_files, degree)
}
edit 2007-12-05 -- size() method works for arrays and lists; easier than length and size properties. Contribution much appreciated!
-- seth.schroeder _at_ nearinfinity.com
Earlier I looked at soundex as a more robust way of comparing strings. Soundex was more robust than case & space normalizing... but it missed somethings I wanted and found some things I didn't want. Here's an example based on a rather limited wine rack:
CREATE TABLE winerack (
wine_type TEXT
, ready DATE
, soundex_val TEXT
);
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabernet sauvingon', now() + interval '5 years', 'C165');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabermet sauvingon', now() + interval '6 years', 'C165');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('caberet sauvingon', now() + interval '7 years', 'C163');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabernet franc', now() + interval '8 years', 'C165');
Now I'd like to find all of the Cabernets Sauvignon. Using a literal match one bottle shows up. Three bottles show up using soundex... but not the right three bottles:
bigrams=# SELECT * FROM winerack WHERE soundex_val = 'C165';
wine_type | ready | soundex_val
--------------------+------------+-------------
cabernet sauvingon | 2012-10-08 | C165
cabermet sauvingon | 2013-10-08 | C165
cabernet franc | 2015-10-08 | C165
(3 rows)
What happened? Well, soundex picked up the "cabermet" which was good. But it also picked up the Cabernet Franc which was bad. Also, it missed the "cabaret sauvingon".
Cabernet Franc was picked up because a soundex value contains at most four consonants. 'C', 'b', 'r', 'n' in this case. Nothing else matters afterwards... shudder to imagine "cabernet sausage" showing up in the search results!
Caberet Sauvignon was ignored because soundex doesn't compare strings. It's almost like a really primitive string hashing function. String in, string out. Easy to use though.
I approached Rob with this and he suggested I give character-level bigrams a try. Basically this finds how similar words are based on how often pairs of characters occur. Rob is being very patient with me as I work through the gory details, but this is my general understanding:
similarity(left_str, right_str)
left_sum = dot_product(left_str, left_str);
right_sum = dot_product(right_str, right_str);
pair_sum = dot_product(left_str, right_str);
return pair_sum / square_root(left_sum * right_sum);
The moment of truth -- did it outperform soundex?
bigrams=# select * from bigram_similarities;
term_a | term_b | similarity
-------------------+-------------------+-------------------
cabernetsauvignon | cabernetsauvignon | 1
cabernetsauvignon | cabermetsauvignon | 0.875
cabernetsauvignon | caberetsauvignon | 0.903696114115064
cabernetsauvignon | cabernetfranc | 0.505181485540923
In this contrived case it worked out better. No clue further than that. It is easy to use like soundex:
bigrams=# select bigram_similarity('cabernet sauvignon', 'cabernet franc');
bigram_similarity
-------------------
0.505181485540923
The infrastructure behind this is pretty gory. The code's not really polished and I have some concerns about performance. But here's a really neat way to generate a dot product using SQL:
CREATE TABLE bigrams ( term TEXT, bigram CHAR(2), instances INT );
SELECT
SUM(lhs.instances * rhs.instances) AS dot_product
FROM
bigrams lhs, bigrams rhs
WHERE
lhs.bigram = rhs.bigram
AND
lhs.term = 'cabernet sauvignon'
AND
rhs.term = 'cabernet franc'
A SQL view is a SELECT statement stored in the database which can be used like a table. So... what? Well, some of the benefits include:
- Syntax verified before deployment -- avoids errors from queries built at runtime.
- The RDBMS knows to expect it and could prepare an execution plan.
- Force users to use a view which excludes sensitive table data.
- Centralize important, common logic.
CREATE TABLE pantry (food TEXT, course TEXT, expiry DATE);
INSERT INTO pantry (food, course, expiry)
VALUES ('crumpets', 'breakfast', now() + interval '2 weeks');
INSERT INTO pantry (food, course, expiry)
VALUES ('canned ravioli', 'lunch', now() + interval '2 years');
INSERT INTO pantry (food, course, expiry)
VALUES ('instant noodles', 'dinner', now() + interval '2 years');
INSERT INTO pantry (food, course, expiry)
VALUES ('fruitcake', 'snack', now() + interval '2000 years');
Imagine these meal planning queries:
SELECT food FROM pantry WHERE course = 'breakfast' AND expiry > now();SELECT food FROM pantry WHERE course = 'lunch' AND expiry > now();SELECT food FROM pantry WHERE course = 'dinner' AND expiry > now();SELECT food FROM pantry WHERE course = 'snack' AND expiry > now();
Well, that's pretty gory. No abstraction at all -- mystery meat queries which need unnecessary work to be understood. Two things need to be abstracted: course and expiry. No one wants stale food, so build that logic first:
CREATE VIEW fresh_food AS
SELECT *
FROM pantry
WHERE expiry < now();
Voici! The per-course queries look like: SELECT food FROM fresh_food where course = 'foo'. But, why stop abstracting now? Why not this?
CREATE VIEW breakfast_menu AS
SELECT *
FROM fresh_food
WHERE course = 'breakfast';
This example is almost too silly, but one more good point can be squeezed out yet. A key benefit of centralized logic is having one place to make bug fixes. For example, the typo in fresh_food which returns stale food. Fix the view DML and all client code immediately benefits without change.
Unfortunately in practice performance can easily suffer when using (and especially layering) views.
Coming soon -- part 2: Views made my queries SLOW!
create table cafe_du_monde (
name varchar2(128),
beignets integer
);
create table jackson_square (
name varchar2(128),
pictures_taken integer
);
a.) insert into cafe_du_monde (name, beignets) values ('seth', 3);
b.) insert into cafe_du_monde (name, beignets) values ('seth ', 6);
c.) insert into jackson_square (name, pictures_taken) values ('Seth', 127);
d.) insert into jackson_square (name, pictures_taken) values ('seht', 10);
Ideally the results would be me eating 9 beignets and taking 137 pictures (let's not mention the cafe au laits :)
Method A: compare as-is
select name, sum(beignets), sum(pictures_taken) from cafe_du_monde as cdm, jackson_square as js where cdm.name = js.name group by name;Bzzzt. Not good. Look at all those rows! Selected: a, b, c, d
Method B: normalize case
select name, sum(beignets), sum(pictures_taken) from cafe_du_monde as cdm, jackson_square as js where lower(cdm.name) = lower(js.name) group by name;Selected: a+c, b, d
Method C: trim whitespace
select name, sum(beignets), sum(pictures_taken) from cafe_du_monde as cdm, jackson_square as js where trim(both from cdm.name) = trim(both from js.name) group by name;Selected: a+b, c, d
Methods B & C
(obvious query omitted) Selected: a+b+c, dMethod D: soundex
select name, sum(beignets), sum(pictures_taken) from cafe_du_monde as cdm, jackson_square as js where soundex(cdm.name) = soundex(js.name) group by name;Soundex value: S300 Selected: a+b+c+d Soundex works perfectly for this example. When I tried it with more realistic data I wasn't quite as happy. Matt introduced me to the Levenshtein algorithm for counting the number of edits needed to convert one string into another. (ex: cat -> mat = 1, cat -> dot = 2, cat -> dog = 3, cat -> gold = 4)
Method E: Comparing edit distance
levenshtein('seth', 'seth') => 0
levenshtein('seth', 'Seth') => 1
levenshtein('seth', 'seth ') => 1
levenshtein('seth', 'seht') => 2
damerau-levenshtein('seth', 'seht') => 1
The Levenshtein distance algorithm treats the 'ht' in 'seht' as two substitions. The Damerau-Levenshtein algorithm treats it as one swap. Otherwise it's the same, so Damerau-Levenshtein seems like the better method.
Unhappily the initial edit distance results seem weaker than the case normalizing and whitespace trimming. Since the algorithms stumble over case and whitespace, so why not take care of those first?
Method F: normalizing case, trimming whitespace, & comparing edit distance
damerau-levenshtein(lower(trim('seth', 'seth'))) => 0
damerau-levenshtein(lower(trim('seth', 'Seth'))) => 0
damerau-levenshtein(lower(trim('seth', 'seth '))) => 0
damerau-levenshtein(lower(trim('seth', 'seht'))) => 1
I think this is better. An edit distance of 25% seems a little high, but the string is so short in this example. At this point I've got enough to start experimenting.
The Levenshtein implementation on Wikipedia has to be seen to be believed. Here's a quick and dirty Ruby version. (ObFun: Since Ruby supports jagged arrays I decided to emulate two dimensional arrays with a set of [row,col] keys. A little strange looking but kinda neat).
def levenshtein(s1, s2)
d = {}
(0..s1.size).each do |row|
d[[row, 0]] = row
end
(0..s2.size).each do |col|
d[[0, col]] = col
end
(1..s1.size).each do |i|
(1..s2.size).each do |j|
cost = 0
if (s1[i-1] != s2[j-1])
cost = 1
end
d[[i, j]] = [d[[i - 1, j]] + 1,
d[[i, j - 1]] + 1,
d[[i - 1, j - 1]] + cost
].min
next unless @@damerau
if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])
d[[i, j]] = [d[[i,j]],
d[[i-2, j-2]] + cost
].min
end
end
end
return d[[s1.size, s2.size]]
end
One of the reasons I joined Near Infinity is the annual allowance of $500 for books, subscriptions, and software. This made it a no-brainer to get a copy of "Programming Challenges" by Steven Skiena and Miguel Revilla. I just wanted a challenge which wasn't following ornate business logic or deploying said software. The book is full solid but (hopefully :) tractable technical challenges. I really like Ruby and decided to use it despite the supported languages being C, C++, Java, and Pascal.
The first problem is called "The 3n + 1 problem." The core idea is to reduce a positive number to one by evaluating these steps:
- one: stop
- even number: divide by two
- odd number: multiply by three and then add one
def step(val)
if (val == 1)
val
elsif (val % 2 == 0)
val / 2
else
val * 3 + 1
end
end
The next piece is to count the number of steps taken. The rules are clear that the initial number and one are both to be counted as steps. Basically, count the posts in the fence, not the gaps.
def countSteps(val)
steps = 1
until (val == 1) do
val = step(val)
steps = steps + 1
end
steps
end
The final piece is to run a series of numbers through the code, and report which number required the most steps.
def maxSteps(range)
max = 0
range.each do |val|
max = [max, countSteps(val)].max
end
max
end
Large numbers passing through this code won't have much effect on memory or stack usage; it looks CPU bound. I wondered how easy it would be to balance CPU usage with memory usage. Since each positive number always needs the same number of steps to reduce to one, it's safe to save the number with its step count in a map. While processing steps, check each result to see if it has been cached. A cache hit results in no further processing, except to cache the step count of the argument. The memory usage will grow with the number of unique arguments.
# elsewhere: @chain = { 1 => 1 }
def computeCycles(val)
spin = 0
origVal = val
until (@chain[val]) do
val = step(val)
spin = spin + 1
end
unless (origVal == 1)
@chain[origVal] = @chain[val] + spin
end
@chain[origVal]
end
Here are some tacky metrics produced running on a cold cache while processing 1..200.
[...] 10 => spins used/avoided, %avoided, cache size = 28/49 -- 64, 10 [...] 100 => spins used/avoided, %avoided, cache size = 803/2439 -- 75, 100 [...] 200 => spins used/avoided, %avoided, cache size = 1393/7225 -- 84, 200So, to wrap up Ruby was a mixed blessing here. Fortunately I was able to do non-trivial things without wasting time & tendons on basic infrastructure. Unfortunately the processing website doesn't accept Ruby :( Soon I hope to tackle more of these challenges ... maybe using Groovy and Lua.
72, 101, 108, 108, 111
Each code unit format has its own structure. Note that UTF-16LE needs an extra code unit.
UTF-8: 0x48, 0x65, 0x6c, 0x6c, 0x6f
UTF-16: 0x0048, 0x0065, 0x006c, 0x006c, 0x006f
UTF-16LE: 0xfffe, 0x4800, 0x6500, 0x6c00, 0x6c00, 0x6f00
UTF-32: 0x00000048, 0x00000065, 0x0000006c, 0x0000006c, 0x0000006f
So... what? Well, someone had a thinking cap on when they defined UTF-8. It's optimal for storing American English and safe for legacy software libraries. My favorite part is how the first bits in the first byte count the number of bytes remaining in the code unit.
UTF-16 is kinda strange when you really look at it. Storing it requires 100% more space than UTF-8 requires for American English. It's big/network endian by default... meaning most computers have to juggle bytes per character before using them. And it's totally incompatible with old ASCII libraries. Finally, it's not possible to rely on 1 utf-16 code unit per code point due to surrogate characters.
The Unicode 5.0 spec is gigantic. It's always stocked on the top shelf at Borders with the other oversized books. Read it and you may believe that ZWNBSP means something. :)
I hope the Grails dev team succeeds. JRuby is a well-intentioned, well-heeled, but misbegotten creation (q.v. V. Frankenstein). Should Groovy succeed, it should be the ultimate Java webapp framework. Ultimate in this sense: "being the last or concluding element of a series." Consider this diagram of a minimal Grails application -- can you find room for more layers?
All of those jar files are in the classpath and configured for immediate coding. Did I mention the grails commandline app? It's fantastic. It creates classes & test classes, runs tests, runs apps, and is a good engine for the development workflow. So what's the catch? Adding Grails into an existing project doesn't make much sense. The Grails skeleton would duplicate much of the existing infrastructure. Migrating existing code into a Grails skeleton can be done, but some refactoring will be needed. Here's my quick take based on some tinkering.
Controllers
Rewrite controllers. A controller in Grails lives for one request, unlike a long-lived Spring MVC controller. Grails has done a good job of improving the process of implementing a controller: autobinding the query string key:value pairs to instance fields, simple url mapping conventions, and more.Services
Consider migrating .java business logic to Grails .groovy services. A Grails service class instance is long lived as in Spring MVC; it can also be injected into a Grails controller very easily. Java service classes with complicated transaction requirements should not be migrated. Transaction / rollback is applied to all Grails service methods by default. Disabling rollback affects all methods in a service class.Views and layouts
Grails supports Groovy Server Pages (.gsp) and Java Server Pages (.jsp); Sitemesh can be used with both. Unfortunately support for custom JSP tag libraries will wait until version 0.6. GSP isn't bad. The conditional control flow tags are nice, but I really see no strong reason to prefer a .gsp to a .jsp. One notable item on the roadmap is support for Spring Web Flow.Model / DAO / domain classes
Hibernate mapping files and annotations are well understood ways to control ORM. GORM configuration uses static class variables. Here is a partial list:optionals, transients, belongsTo, hasMany, mappedBy, embedded, constraints. Grails supports Groovy SQL but has plans to fork it eventually. GORM also injects id and version properties to domain objects at runtime.
Conclusion
Grails adds layers, but more cake-like than onion-like. Programmers can look forward to writing Groovy code for a multi-tier webapp with the dependencies already in place. Much existing code can be migrated in by adapting it to the Grails conventions; other parts may be included as .java files. The largest learning curve will probably be GORM. Finally, it's probably a good idea to let Grails mature for 3-6 months before deploying it into production.My introduction to Groovy
- Contents:
- Intro
- Code
- Results
- Code review
- Tools
- Conclusion
- Side notes
Idiomatic Groovy code looks and works like a mash-up of Java and Ruby... JRuby would be an intuitive name. After all, the "J" in the real JRuby has more to do with deployment and execution than look and feel of the code. Which do developers spend more time with?
The proof of the pudding is in the eating -- how Groovy is it? I decided to learn Groovy by writing a minimal RSS reader. It's no big chore in Java, but I was happily surprised how my Java and Ruby skills came into play. Please remember that this is my first shot at Groovy:
Grouping rss items by keyword
1: def groupByKeyword(groups) { 2: def stream; 3: def map = [:] 4: 5: try { 6: stream = url.openStream(); 7: def slurper = new XmlSlurper(); 8: def rss = slurper.parse(stream); 9: 10: groups.each { 11: def group = it.toUpperCase(); 12: map[group] = []; 13: 14: rss.channel.item.each { 15: if (it.title.text().toUpperCase() =~ group) { 16: map[group] += it.title; 17: } 18: }; 19: } 20: } catch (e) { 21: println e; 22: } finally { 23: if (stream) 24: stream.close(); 25: } 26: 27: return map; 28: }
collecting parameters and printing the results
29: static void main(args) { 30: 31: def url = args[0]; 32: 33: def keywords = args[1..<args.length]; 34: 35: def reader = new RssReader(url); 36: 37: def groups = reader.groupByKeyword(keywords); 38: 39: groups.entrySet().each { 40: println "$it.key" 41: 42: it.value.each { 43: println "\t* $it" 44: } 45: } 46: }
Results
Input:
http://digg.com/rss/indexprogramming.xml java[s] java[^s] groovy haskell erlang microsoft emacs lisp
Output:
GROOVY
ERLANG
JAVA[^S]
EMACS
LISP
JAVA[S]
* Top 5 javascript frameworks
* The Javascript Programming Language
HASKELL
MICROSOFT
Input:
http://digg.com/rss/indexprogramming.xml java[s] java[^s] groovy haskell erlang microsoft emacs lisp
Output:
GROOVY
ERLANG
JAVA[^S]
EMACS
LISP
* pregexp: Portable Regular Expressions for Scheme and Common Lisp
JAVA[S]
* JavaScript - the world's most ubiquitous computing runtime... and it's not going away soon.
HASKELL
* Haskell's HTTP library: how it sucks, how to fix it
MICROSOFT
* Microsoft funds questionable study attacking GPL 3 draft process
A mini code review
- Line #1 demonstrates several Groovy-isms. The line starts a non-static method definition. The return type and parameter type were omitted.
- An empty map is defined at Line #3. I forgot to end the statement with a semicolon, but that is optional in Groovy.
- Lines 6..9 are more Java-ish than Groovy. I was inclined to work this way due to immature tool support. The Groovy plugin for Eclipse reported syntax errors when the parens were removed, but the code ran fine. The java-mode in emacs couldn't indent code lacking parens and semicolons.
- Ruby coders will recognize line 10. Notice the "it" variable on line 11. Groovy automatically defined "it" is a reference to the item being iterated over.
- Groovy packs a lot of code into line 14. The Groovy chunk "
rss.channel.item" directly matches the RSS structure of <rss><channel><item>. That shrink-wrapped coupling was minimal and readable. I couldn't find an obvious reason to prefer xpath or other xml navigation methods. - Line 15 would have been merged into line 14, but I ran out of patience getting
findAllto work. - Line 16 uses
+=to add an item to an array.-=works as expected. - The fading C/C++ programmer inside me was THRILLED to use
if (reference) ...at line 23. - This cheesy code depended on positional parameters. Given that, it was nice to slice them so easily at line 33. The
..<syntax defines a half-open interval [min, max) where min is included and max is excluded. - Notice how "
it" is defined in both the inner and outer loop near line 39. The C/C++ guy inside of me got all worked up about shadowing stack variables... that line of thinking is dead wrong about closures in Groovy.
Tools
Groovy is a very young language; its tools can't be judged fairly against Java's tools. The most significant problem I had was during debugging. Groovy is very involved in dispatching method calls, so a lot of debugging time will be spent stepping over / out of Groovy internals. That didn't work for me; the app seemed to resume processing. Other issues I noticed were lack of auto package import / cleanup, auto method completion, and the mistaken compile errors on missing parens.
The Groovy plugin did make it easy to start a new project. Installation worked great, and the dependencies were automatically added to the project after I created a Groovy class.
Conclusion
Pragmatic: The Groovy developers made a lot of welcome improvements to Java's syntax. It does leverage existing developer skills and would integrate tightly with existing Java code. The tools need a lot of features and some fixes.
Personal: I had bipolar Java/Ruby style swings while coding Groovy. It feels like almost an even compromise between the two... almost too flexible for my personality :). I would consider using it in new projects after Groovy 1.1 (annotations + more) has been released and the tools have matured.
Side notes:
- Wow, that's a lot of .class files per lines of Groovy:
$ wc -l *.groovy 24 Hola.groovy 90 RssReader.groovy 114 total $ echo .groovy && ls *.groovy | wc -l; echo .class && ls *.class | wc - l; .groovy 2 .class 14 - Google Trends comparison of Groovy and JRuby
- Groovy was started in early 2004! Version 1.0 was released almost three years later. This article describes the waning and waxing of progress implementing the language.
Sometimes you really want a quick & dirty histogram while looking through a database:
- when you suspect the mean value is misleading
- when you want to understand how the values are distributed
- ... and easily switch between different sources of values
- ... without exporting data & switching applications
Here are the story scores from a recent front page of reddit.com: 175, 456, 140, 191, 230, 186, 134, 215, 171, 83, 102, 171, 182, 322, 193, 310, 338, 345, 174, 134, 92, 109, 241, 256, 132
A basic statistical query returns:
+-----+-----+-----+--------+ | max | min | avg | stddev | +-----+-----+--------------+ | 456 | 83 | 203 | 90 | +-----+-----+-----+--------+
The standard deviation seems awfully large. Maybe not many of the scores are close to the mean score of 203? What if another query could show the distribution?
+--------+----------+--------+----------+ +--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling | | bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+ +--------+----------+--------+----------+
| 1 | 8 | 83 | 157 | | 1 | 4 | 83 | 119 |
| 2 | 10 | 158 | 232 | | 2 | 4 | 120 | 156 |
| 3 | 2 | 233 | 307 | | 3 | 8 | 157 | 193 |
| 4 | 4 | 308 | 382 | | 4 | 2 | 194 | 230 |
| 5 | 1 | 383 | 457 | | 5 | 2 | 231 | 267 |
+--------+----------+--------+----------+ | 6 | 0 | 268 | 304 |
| 7 | 3 | 305 | 341 |
| 8 | 1 | 342 | 378 |
| 9 | 0 | 379 | 415 |
| 10 | 1 | 416 | 452 |
+--------+----------+--------+----------+
The histogram on the left has fewer, larger buckets. This is a lot more informative than the mean & stddev. The histogram on the right uses more, smaller buckets. Maybe this is too verbose? What if you wanted seven buckets?
update dhg.bucket_count set num_buckets = 7;
select * from dhg.results;
+--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+
| 1 | 7 | 83 | 135 |
| 2 | 7 | 136 | 188 |
| 3 | 5 | 189 | 241 |
| 4 | 1 | 242 | 294 |
| 5 | 4 | 295 | 347 |
| 6 | 0 | 348 | 400 |
| 7 | 1 | 401 | 453 |
+--------+----------+--------+----------+
Instructions:
- Insert numbers to be analyzed:
- Choose how many buckets in the histogram
- Read the results!
class="prettyprint"INSERT INTO dhg.source SELECT foo FROM bar;
UPDATE dhg.bucket_count SET num_buckets = 10;
SELECT * FROM dhg.results_full;
Materials:
All views, tables, and functions will live in a dynamic histogram (dhg) schema. The SQL is pretty minimal yet hopefully reasonably structured and commented. The MySQL flavor is larger due to an implementation of width_bucket.
WARNING: The implementation suffers from a variety of rounding errors and poor error handling. This is a quick and dirty solution for rough estimates only.
NOTE: Using more than 20 buckets will need a small tweak. Grep the SQL for empty_buckets.
- PostgreSQL DDL (may work with Oracle 9+)
- MySQL DDL
- starter data: scores from reddit, digg, and comment counts from slashdot
I'd love to hear criticisms, comments, & suggestions!
Server Client
|----------------------------------------------------------------------|
| | |
1. Fixed format 2. encapsulated data & code 3. raw data & metadata
(jpeg) (json, javascript) (html <img> tag?)
This is an important decision for an online spreadsheet. Should the server send only literal values (option one)? Should the client evaluate functions and resolve references? (option two)?
Option 1:
pro: parsing cell values into a tree and traversing it is tricky. Writing it once in Java seems like a reasonable approach.
con: client has no idea which cells are related to each other. That would really complicate features like copying and pasting cell references.
Option 2:
con: reimplement much of cell parsing & traversal in Javascript. Using Rhino to test it outside a browser mitigates this a little.
pro: client knows which cells are related.
Option 3:
con: no idea how to apply this approach for a spreadsheet.
I chose option two. Eventually the client will probably need to work directly with cell references. Better to get that work out of the way ahead of time. Not all of the logic needs to be rewritten. Some of it is only needed on one side:
common:
- build parse tree from cell values
- iterate over parse tree
server specific:
- update reference values: A1 => B1, $G3 => $G4
- collapse the parse tree back into a flat string
client specific:
- resolve reference values (A1 = 123.45, G3 = A1 = 123.45).
- evaluate functions (=SUM(G3,A1) => 246.9).
I didn't expect so many neat challenges. Even the prototype implementation needed parsing, recursion, evaluating simple math expressions, tree traversal, base10 & base26 conversion, and more. But all that is meat for another post.
- JHaskell.
- Sleep. Quoted from the site: Sleep is heavily inspired by Perl with bits of Objective-C thrown in
- Dynamic Java.
- Groovy seems like Java "lite." The groovy compiler reportedly accepts most Java source, but also permits a more flexible yet terse syntax. This gained a lot of traction at the conference. Some people complained that the name sounded too much like "Ruby." Which leads to...
- JRuby. The overall goal is to mate the slick Ruby syntax with the mature Java VM. I'm very hesitant to seriously consider it for a few reasons:
- The interpreter will variously introduce bugs and lack features. This will be an ongoing issue as Ruby is an advanced, relatively young language; 2.0 is under active development.
- The Ruby VM will get much better over time. When it does, the argument for using the excellent Java VM will be much weaker.
- For better or worse, I didn't see a competitor to Fortran.NET
Every spreadsheet application needs to support the creating, reading, updating, and deleting of sheets, columns, rows, and cells. The network protocol for an online spreadsheet could easily treat sheets, columns, etc. as resources and offer CRUD operations for manipulating them.
The requests below were hand generated and sent via telnet. The responses were copy and pasted from the console window. A real frontend might use this API with XmlHttpRequest.
| operation | request | response |
|---|---|---|
| Create a sheet |
POST /fauxcel/finances HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Length: 0
Date: Wed, 25 Apr 2007 19:26:39 GMT
|
| Populate a cell |
POST /fauxcel/finances/A/1 HTTP/1.1
Host: 192.168.113.115:8080
Content-Type: text/text;charset=utf-8
Content-Length: 34
{"value":"123.45", "type":"value"}
|
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:36:37 GMT |
| Insert a column |
POST /fauxcel/finances/A HTTP/1.1 Host: 192.186.113.115:8080 |
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:36:37 GMT |
| Read a column |
GET /fauxcel/finances/B HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked
Date: Wed, 25 Apr 2007 19:39:38 GMT
53
{"name":"finances","cells":[{"col":"B","row":"1","type":"value","value":"123.45"}]}
0
|
| Delete a row |
DELETE /fauxcel/finances/1 HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:41:26 GMT |
Note that the back end moved the value of cell A1 into cell B1 when a column was inserted before A. More details on that in a following post!
REST: Fielding's introduction [1] lays out constraints and why they are important. I was hoping the subsequent chapter [2] focusing on REST and HTTP would have implementation level details on "what" to do. What I got out of it was Fielding reminiscing on the good, bad, and the ugly of HTTP 1.0 and 1.1. It was a serious mistake for me to expect a doctoral thesis to also be "REST for dummies." Still, the voice from on high seems to have defined the goal without clearing a trail to follow.
This site [3] was very helpful. I'm looking forward to adding this book [4] to my Safari bookshelf.
SOAP & WS-*: SOAP has a wealth of concrete documentation and libraries. Developers have spent a lot of time building tools to help the development process. All this sounds great! But... something about all the flux [5] and xml all the way down made my spidey-sense tingle.
I have written client code for an efficient, robust, and scalable family of network applications [6]. That protocol family exists because other parts of the business subsidised its architects, developers, testers, hardware, and bandwidth. Also necessary was the hands-off management style which gave the architecture time and space to grow naturally. This leads me to doubt success for the volunteer cooks [7] in the WS-I kitchen. Hopefully their contributions will improve the general state of web services.
POX: Plain ol' XML. Not sure this term means more than that... SOAP/WS-* are XML all the way down. XML may also be part of a REST solution -- ROX anyone? :)
[1] Chapter 5 of Fielding's thesis

