Saturday December 01, 2007 Seth Schroeder
Earlier I looked at soundex as a more robust way of comparing strings. Soundex was more robust than case & space normalizing... but it missed somethings I wanted and found some things I didn't want. Here's an example based on a rather limited wine rack:
CREATE TABLE winerack (
wine_type TEXT
, ready DATE
, soundex_val TEXT
);
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabernet sauvingon', now() + interval '5 years', 'C165');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabermet sauvingon', now() + interval '6 years', 'C165');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('caberet sauvingon', now() + interval '7 years', 'C163');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabernet franc', now() + interval '8 years', 'C165');
Now I'd like to find all of the Cabernets Sauvignon. Using a literal match one bottle shows up. Three bottles show up using soundex... but not the right three bottles:
bigrams=# SELECT * FROM winerack WHERE soundex_val = 'C165';
wine_type | ready | soundex_val
--------------------+------------+-------------
cabernet sauvingon | 2012-10-08 | C165
cabermet sauvingon | 2013-10-08 | C165
cabernet franc | 2015-10-08 | C165
(3 rows)
What happened? Well, soundex picked up the "cabermet" which was good. But it also picked up the Cabernet Franc which was bad. Also, it missed the "cabaret sauvingon".
Cabernet Franc was picked up because a soundex value contains at most four consonants. 'C', 'b', 'r', 'n' in this case. Nothing else matters afterwards... shudder to imagine "cabernet sausage" showing up in the search results!
Caberet Sauvignon was ignored because soundex doesn't compare strings. It's almost like a really primitive string hashing function. String in, string out. Easy to use though.
I approached Rob with this and he suggested I give character-level bigrams a try. Basically this finds how similar words are based on how often pairs of characters occur. Rob is being very patient with me as I work through the gory details, but this is my general understanding:
similarity(left_str, right_str)
left_sum = dot_product(left_str, left_str);
right_sum = dot_product(right_str, right_str);
pair_sum = dot_product(left_str, right_str);
return pair_sum / square_root(left_sum * right_sum);
The moment of truth -- did it outperform soundex?
bigrams=# select * from bigram_similarities;
term_a | term_b | similarity
-------------------+-------------------+-------------------
cabernetsauvignon | cabernetsauvignon | 1
cabernetsauvignon | cabermetsauvignon | 0.875
cabernetsauvignon | caberetsauvignon | 0.903696114115064
cabernetsauvignon | cabernetfranc | 0.505181485540923
In this contrived case it worked out better. No clue further than that. It is easy to use like soundex:
bigrams=# select bigram_similarity('cabernet sauvignon', 'cabernet franc');
bigram_similarity
-------------------
0.505181485540923
The infrastructure behind this is pretty gory. The code's not really polished and I have some concerns about performance. But here's a really neat way to generate a dot product using SQL:
CREATE TABLE bigrams ( term TEXT, bigram CHAR(2), instances INT );
SELECT
SUM(lhs.instances * rhs.instances) AS dot_product
FROM
bigrams lhs, bigrams rhs
WHERE
lhs.bigram = rhs.bigram
AND
lhs.term = 'cabernet sauvignon'
AND
rhs.term = 'cabernet franc'
A SQL view is a SELECT statement stored in the database which can be used like a table. So... what? Well, some of the benefits include:
- Syntax verified before deployment -- avoids errors from queries built at runtime.
- The RDBMS knows to expect it and could prepare an execution plan.
- Force users to use a view which excludes sensitive table data.
- Centralize important, common logic.
CREATE TABLE pantry (food TEXT, course TEXT, expiry DATE);
INSERT INTO pantry (food, course, expiry)
VALUES ('crumpets', 'breakfast', now() + interval '2 weeks');
INSERT INTO pantry (food, course, expiry)
VALUES ('canned ravioli', 'lunch', now() + interval '2 years');
INSERT INTO pantry (food, course, expiry)
VALUES ('instant noodles', 'dinner', now() + interval '2 years');
INSERT INTO pantry (food, course, expiry)
VALUES ('fruitcake', 'snack', now() + interval '2000 years');
Imagine these meal planning queries:
SELECT food FROM pantry WHERE course = 'breakfast' AND expiry > now();SELECT food FROM pantry WHERE course = 'lunch' AND expiry > now();SELECT food FROM pantry WHERE course = 'dinner' AND expiry > now();SELECT food FROM pantry WHERE course = 'snack' AND expiry > now();
Well, that's pretty gory. No abstraction at all -- mystery meat queries which need unnecessary work to be understood. Two things need to be abstracted: course and expiry. No one wants stale food, so build that logic first:
CREATE VIEW fresh_food AS
SELECT *
FROM pantry
WHERE expiry < now();
Voici! The per-course queries look like: SELECT food FROM fresh_food where course = 'foo'. But, why stop abstracting now? Why not this?
CREATE VIEW breakfast_menu AS
SELECT *
FROM fresh_food
WHERE course = 'breakfast';
This example is almost too silly, but one more good point can be squeezed out yet. A key benefit of centralized logic is having one place to make bug fixes. For example, the typo in fresh_food which returns stale food. Fix the view DML and all client code immediately benefits without change.
Unfortunately in practice performance can easily suffer when using (and especially layering) views.
Coming soon -- part 2: Views made my queries SLOW!
72, 101, 108, 108, 111
Each code unit format has its own structure. Note that UTF-16LE needs an extra code unit.
UTF-8: 0x48, 0x65, 0x6c, 0x6c, 0x6f
UTF-16: 0x0048, 0x0065, 0x006c, 0x006c, 0x006f
UTF-16LE: 0xfffe, 0x4800, 0x6500, 0x6c00, 0x6c00, 0x6f00
UTF-32: 0x00000048, 0x00000065, 0x0000006c, 0x0000006c, 0x0000006f
So... what? Well, someone had a thinking cap on when they defined UTF-8. It's optimal for storing American English and safe for legacy software libraries. My favorite part is how the first bits in the first byte count the number of bytes remaining in the code unit.
UTF-16 is kinda strange when you really look at it. Storing it requires 100% more space than UTF-8 requires for American English. It's big/network endian by default... meaning most computers have to juggle bytes per character before using them. And it's totally incompatible with old ASCII libraries. Finally, it's not possible to rely on 1 utf-16 code unit per code point due to surrogate characters.
The Unicode 5.0 spec is gigantic. It's always stocked on the top shelf at Borders with the other oversized books. Read it and you may believe that ZWNBSP means something. :)
Grails has big hopes. It hopes to find a place for Groovy in every tier of a Java-based webapp. It hopes to vanquish configuration files by enforcing implicit requirements / conventions. It hopes to simplify starting a respectable Java webapp by providing and configuring Hibernate, Spring, and multi-environment build files.
I hope the Grails dev team succeeds. JRuby is a well-intentioned, well-heeled, but misbegotten creation (q.v. V. Frankenstein). Should Groovy succeed, it should be the ultimate Java webapp framework. Ultimate in this sense: "being the last or concluding element of a series." Consider this diagram of a minimal Grails application -- can you find room for more layers?

All of those jar files are in the classpath and configured for immediate coding. Did I mention the grails commandline app? It's fantastic. It creates classes & test classes, runs tests, runs apps, and is a good engine for the development workflow.
So what's the catch? Adding Grails into an existing project doesn't make much sense. The Grails skeleton would duplicate much of the existing infrastructure. Migrating existing code into a Grails skeleton can be done, but some refactoring will be needed. Here's my quick take based on some tinkering.
Controllers
Rewrite controllers. A controller in Grails lives for one request, unlike a long-lived Spring MVC controller. Grails has done a good job of improving the process of implementing a controller: autobinding the query string key:value pairs to instance fields, simple url mapping conventions, and more.Services
Consider migrating .java business logic to Grails .groovy services. A Grails service class instance is long lived as in Spring MVC; it can also be injected into a Grails controller very easily. Java service classes with complicated transaction requirements should not be migrated. Transaction / rollback is applied to all Grails service methods by default. Disabling rollback affects all methods in a service class.Views and layouts
Grails supports Groovy Server Pages (.gsp) and Java Server Pages (.jsp); Sitemesh can be used with both. Unfortunately support for custom JSP tag libraries will wait until version 0.6. GSP isn't bad. The conditional control flow tags are nice, but I really see no strong reason to prefer a .gsp to a .jsp. One notable item on the roadmap is support for Spring Web Flow.Model / DAO / domain classes
Hibernate mapping files and annotations are well understood ways to control ORM. GORM configuration uses static class variables. Here is a partial list:optionals, transients, belongsTo, hasMany, mappedBy, embedded, constraints. Grails supports Groovy SQL but has plans to fork it eventually. GORM also injects id and version properties to domain objects at runtime.
Conclusion
Grails adds layers, but more cake-like than onion-like. Programmers can look forward to writing Groovy code for a multi-tier webapp with the dependencies already in place. Much existing code can be migrated in by adapting it to the Grails conventions; other parts may be included as .java files. The largest learning curve will probably be GORM. Finally, it's probably a good idea to let Grails mature for 3-6 months before deploying it into production.My introduction to Groovy
- Contents:
- Intro
- Code
- Results
- Code review
- Tools
- Conclusion
- Side notes
Idiomatic Groovy code looks and works like a mash-up of Java and Ruby... JRuby would be an intuitive name. After all, the "J" in the real JRuby has more to do with deployment and execution than look and feel of the code. Which do developers spend more time with?
The proof of the pudding is in the eating -- how Groovy is it? I decided to learn Groovy by writing a minimal RSS reader. It's no big chore in Java, but I was happily surprised how my Java and Ruby skills came into play. Please remember that this is my first shot at Groovy:
Grouping rss items by keyword
1: def groupByKeyword(groups) { 3: def map = [:] 4: 5: try { 6: stream = url.openStream(); 7: def slurper = new XmlSlurper(); 8: def rss = slurper.parse(stream); 9: 10: groups.each { 11: def group = it.toUpperCase(); 12: map[group] = []; 13: 14: rss.channel.item.each { 15: if (it.title.text().toUpperCase() =~ group) { 16: map[group] += it.title; 17: } 18: }; 19: } 20: } catch (e) { 21: println e; 22: } finally { 23: if (stream) 24: stream.close(); 25: } 26: 27: return map; 28: }
collecting parameters and printing the results
29: static void main(args) { 30: 31: def url = args[0]; 32: 33: def keywords = args[1..<args.length]; 34: 35: def reader = new RssReader(url); 36: 37: def groups = reader.groupByKeyword(keywords); 38: 39: groups.entrySet().each { 40: println "$it.key" 41: 42: it.value.each { 43: println "\t* $it" 44: } 45: } 46: }
Results
Input:
http://digg.com/rss/indexprogramming.xml java[s] java[^s] groovy haskell erlang microsoft emacs lisp
Output:
GROOVY
ERLANG
JAVA[^S]
EMACS
LISP
JAVA[S]
* Top 5 javascript frameworks
* The Javascript Programming Language
HASKELL
MICROSOFT
Input:
http://digg.com/rss/indexprogramming.xml java[s] java[^s] groovy haskell erlang microsoft emacs lisp
Output:
GROOVY
ERLANG
JAVA[^S]
EMACS
LISP
* pregexp: Portable Regular Expressions for Scheme and Common Lisp
JAVA[S]
* JavaScript - the world's most ubiquitous computing runtime... and it's not going away soon.
HASKELL
* Haskell's HTTP library: how it sucks, how to fix it
MICROSOFT
* Microsoft funds questionable study attacking GPL 3 draft process
A mini code review
- Line #1 demonstrates several Groovy-isms. The line starts a non-static method definition. The return type and parameter type were omitted.
- An empty map is defined at Line #3. I forgot to end the statement with a semicolon, but that is optional in Groovy.
- Lines 6..9 are more Java-ish than Groovy. I was inclined to work this way due to immature tool support. The Groovy plugin for Eclipse reported syntax errors when the parens were removed, but the code ran fine. The java-mode in emacs couldn't indent code lacking parens and semicolons.
- Ruby coders will recognize line 10. Notice the "it" variable on line 11. Groovy automatically defined "it" is a reference to the item being iterated over.
- Groovy packs a lot of code into line 14. The Groovy chunk "
rss.channel.item" directly matches the RSS structure of <rss><channel><item>. That shrink-wrapped coupling was minimal and readable. I couldn't find an obvious reason to prefer xpath or other xml navigation methods. - Line 15 would have been merged into line 14, but I ran out of patience getting
findAllto work. - Line 16 uses
+=to add an item to an array.-=works as expected. - The fading C/C++ programmer inside me was THRILLED to use
if (reference) ...at line 23. - This cheesy code depended on positional parameters. Given that, it was nice to slice them so easily at line 33. The
..<syntax defines a half-open interval [min, max) where min is included and max is excluded. - Notice how "
it" is defined in both the inner and outer loop near line 39. The C/C++ guy inside of me got all worked up about shadowing stack variables... that line of thinking is dead wrong about closures in Groovy.
Tools
Groovy is a very young language; its tools can't be judged fairly against Java's tools. The most significant problem I had was during debugging. Groovy is very involved in dispatching method calls, so a lot of debugging time will be spent stepping over / out of Groovy internals. That didn't work for me; the app seemed to resume processing. Other issues I noticed were lack of auto package import / cleanup, auto method completion, and the mistaken compile errors on missing parens.
The Groovy plugin did make it easy to start a new project. Installation worked great, and the dependencies were automatically added to the project after I created a Groovy class.
Conclusion
Pragmatic: The Groovy developers made a lot of welcome improvements to Java's syntax. It does leverage existing developer skills and would integrate tightly with existing Java code. The tools need a lot of features and some fixes.
Personal: I had bipolar Java/Ruby style swings while coding Groovy. It feels like almost an even compromise between the two... almost too flexible for my personality :). I would consider using it in new projects after Groovy 1.1 (annotations + more) has been released and the tools have matured.
Side notes:
- Wow, that's a lot of .class files per lines of Groovy:
$ wc -l *.groovy 24 Hola.groovy 90 RssReader.groovy 114 total $ echo .groovy && ls *.groovy | wc -l; echo .class && ls *.class | wc - l; .groovy 2 .class 14 - Google Trends comparison of Groovy and JRuby
- Groovy was started in early 2004! Version 1.0 was released almost three years later. This article describes the waning and waxing of progress implementing the language.
Sometimes you really want a quick & dirty histogram while looking through a database:
- when you suspect the mean value is misleading
- when you want to understand how the values are distributed
- ... and easily switch between different sources of values
- ... without exporting data & switching applications
Here are the story scores from a recent front page of reddit.com: 175, 456, 140, 191, 230, 186, 134, 215, 171, 83, 102, 171, 182, 322, 193, 310, 338, 345, 174, 134, 92, 109, 241, 256, 132
A basic statistical query returns:
+-----+-----+-----+--------+ | max | min | avg | stddev | +-----+-----+--------------+ | 456 | 83 | 203 | 90 | +-----+-----+-----+--------+
The standard deviation seems awfully large. Maybe not many of the scores are close to the mean score of 203? What if another query could show the distribution?
+--------+----------+--------+----------+ +--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling | | bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+ +--------+----------+--------+----------+
| 1 | 8 | 83 | 157 | | 1 | 4 | 83 | 119 |
| 2 | 10 | 158 | 232 | | 2 | 4 | 120 | 156 |
| 3 | 2 | 233 | 307 | | 3 | 8 | 157 | 193 |
| 4 | 4 | 308 | 382 | | 4 | 2 | 194 | 230 |
| 5 | 1 | 383 | 457 | | 5 | 2 | 231 | 267 |
+--------+----------+--------+----------+ | 6 | 0 | 268 | 304 |
| 7 | 3 | 305 | 341 |
| 8 | 1 | 342 | 378 |
| 9 | 0 | 379 | 415 |
| 10 | 1 | 416 | 452 |
+--------+----------+--------+----------+
The histogram on the left has fewer, larger buckets. This is a lot more informative than the mean & stddev. The histogram on the right uses more, smaller buckets. Maybe this is too verbose? What if you wanted seven buckets?
update dhg.bucket_count set num_buckets = 7;
select * from dhg.results;
+--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+
| 1 | 7 | 83 | 135 |
| 2 | 7 | 136 | 188 |
| 3 | 5 | 189 | 241 |
| 4 | 1 | 242 | 294 |
| 5 | 4 | 295 | 347 |
| 6 | 0 | 348 | 400 |
| 7 | 1 | 401 | 453 |
+--------+----------+--------+----------+
Instructions:
- Insert numbers to be analyzed:
- Choose how many buckets in the histogram
- Read the results!
INSERT INTO dhg.source SELECT foo FROM bar;
UPDATE dhg.bucket_count SET num_buckets = 10;
SELECT * FROM dhg.results_full;
Materials:
All views, tables, and functions will live in a dynamic histogram (dhg) schema. The SQL is pretty minimal yet hopefully reasonably structured and commented. The MySQL flavor is larger due to an implementation of width_bucket.
WARNING: The implementation suffers from a variety of rounding errors and poor error handling. This is a quick and dirty solution for rough estimates only.
NOTE: Using more than 20 buckets will need a small tweak. Grep the SQL for empty_buckets.
- PostgreSQL DDL (may work with Oracle 9+)
- MySQL DDL
- starter data: scores from reddit, digg, and comment counts from slashdot
I'd love to hear criticisms, comments, & suggestions!
"Data elements" are the focus of Section 5.2.1 of Dr. Fielding's dissertation. I think the focus is who does how much work to render the data. He lists three options: mostly server, mixed client & server, and mostly client:
Server Client
|----------------------------------------------------------------------|
| | |
1. Fixed format 2. encapsulated data & code 3. raw data & metadata
(jpeg) (json, javascript) (html <img> tag?)
This is an important decision for an online spreadsheet. Should the server send only literal values (option one)? Should the client evaluate functions and resolve references? (option two)?
Option 1:
pro: parsing cell values into a tree and traversing it is tricky. Writing it once in Java seems like a reasonable approach.
con: client has no idea which cells are related to each other. That would really complicate features like copying and pasting cell references.
Option 2:
con: reimplement much of cell parsing & traversal in Javascript. Using Rhino to test it outside a browser mitigates this a little.
pro: client knows which cells are related.
Option 3:
con: no idea how to apply this approach for a spreadsheet.
I chose option two. Eventually the client will probably need to work directly with cell references. Better to get that work out of the way ahead of time. Not all of the logic needs to be rewritten. Some of it is only needed on one side:
common:
* build parse tree from cell values
* iterate over parse tree
server specific:
* update reference values: A1 => B1, $G3 => $G4
* collapse the parse tree back into a flat string
client specific:
* resolve reference values (A1 = 123.45, G3 = A1 = 123.45).
* evaluate functions (=SUM(G3,A1) => 246.9).
I didn't expect so many neat challenges. Even the prototype implementation needed parsing, recursion, evaluating simple math expressions, tree traversal, base10 & base26 conversion, and more. But all that is meat for another post.
- JHaskell.
- Sleep. Quoted from the site: Sleep is heavily inspired by Perl with bits of Objective-C thrown in
- Dynamic Java.
- Groovy seems like Java "lite." The groovy compiler reportedly accepts most Java source, but also permits a more flexible yet terse syntax. This gained a lot of traction at the conference. Some people complained that the name sounded too much like "Ruby." Which leads to...
- JRuby. The overall goal is to mate the slick Ruby syntax with the mature Java VM. I'm very hesitant to seriously consider it for a few reasons:
- The interpreter will variously introduce bugs and lack features. This will be an ongoing issue as Ruby is an advanced, relatively young language; 2.0 is under active development.
- The Ruby VM will get much better over time. When it does, the argument for using the excellent Java VM will be much weaker.
- For better or worse, I didn't see a competitor to Fortran.NET
Every spreadsheet application needs to support the creating, reading, updating, and deleting of sheets, columns, rows, and cells. The network protocol for an online spreadsheet could easily treat sheets, columns, etc. as resources and offer CRUD operations for manipulating them.
The requests below were hand generated and sent via telnet. The responses were copy and pasted from the console window (w/ a little pretty printing). A real frontend might use this API with XmlHttpRequest.
| operation | request | response |
|---|---|---|
| Create a sheet |
POST /fauxcel/finances HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Length: 0
Date: Wed, 25 Apr 2007 19:26:39 GMT
|
| Populate a cell |
POST /fauxcel/finances/A/1 HTTP/1.1
Host: 192.168.113.115:8080
Content-Type: text/text;charset=utf-8
Content-Length: 34
{"value":"123.45", "type":"value"}
|
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:36:37 GMT |
| Insert a column |
POST /fauxcel/finances/A HTTP/1.1 Host: 192.186.113.115:8080 |
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:36:37 GMT |
| Read a column |
GET /fauxcel/finances/B HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked
Date: Wed, 25 Apr 2007 19:39:38 GMT
53
{"name":"finances","cells": [
{
"col":"B",
"row":"1",
"type":"value",
"value":"123.45"
}
]}
0
|
| Delete a row |
DELETE /fauxcel/finances/1 HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:41:26 GMT |
Note that the back end moved the value of cell A1 into cell B1 when a column was inserted before A. More details on that in a following post!

