Tuesday May 15, 2007 Seth Schroeder
Sometimes you really want a quick & dirty histogram while looking through a database:
- when you suspect the mean value is misleading
- when you want to understand how the values are distributed
- ... and easily switch between different sources of values
- ... without exporting data & switching applications
Here are the story scores from a recent front page of reddit.com: 175, 456, 140, 191, 230, 186, 134, 215, 171, 83, 102, 171, 182, 322, 193, 310, 338, 345, 174, 134, 92, 109, 241, 256, 132
A basic statistical query returns:
+-----+-----+-----+--------+ | max | min | avg | stddev | +-----+-----+--------------+ | 456 | 83 | 203 | 90 | +-----+-----+-----+--------+
The standard deviation seems awfully large. Maybe not many of the scores are close to the mean score of 203? What if another query could show the distribution?
+--------+----------+--------+----------+ +--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling | | bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+ +--------+----------+--------+----------+
| 1 | 8 | 83 | 157 | | 1 | 4 | 83 | 119 |
| 2 | 10 | 158 | 232 | | 2 | 4 | 120 | 156 |
| 3 | 2 | 233 | 307 | | 3 | 8 | 157 | 193 |
| 4 | 4 | 308 | 382 | | 4 | 2 | 194 | 230 |
| 5 | 1 | 383 | 457 | | 5 | 2 | 231 | 267 |
+--------+----------+--------+----------+ | 6 | 0 | 268 | 304 |
| 7 | 3 | 305 | 341 |
| 8 | 1 | 342 | 378 |
| 9 | 0 | 379 | 415 |
| 10 | 1 | 416 | 452 |
+--------+----------+--------+----------+
The histogram on the left has fewer, larger buckets. This is a lot more informative than the mean & stddev. The histogram on the right uses more, smaller buckets. Maybe this is too verbose? What if you wanted seven buckets?
update dhg.bucket_count set num_buckets = 7;
select * from dhg.results;
+--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+
| 1 | 7 | 83 | 135 |
| 2 | 7 | 136 | 188 |
| 3 | 5 | 189 | 241 |
| 4 | 1 | 242 | 294 |
| 5 | 4 | 295 | 347 |
| 6 | 0 | 348 | 400 |
| 7 | 1 | 401 | 453 |
+--------+----------+--------+----------+
Instructions:
- Insert numbers to be analyzed:
- Choose how many buckets in the histogram
- Read the results!
INSERT INTO dhg.source SELECT foo FROM bar;
UPDATE dhg.bucket_count SET num_buckets = 10;
SELECT * FROM dhg.results_full;
Materials:
All views, tables, and functions will live in a dynamic histogram (dhg) schema. The SQL is pretty minimal yet hopefully reasonably structured and commented. The MySQL flavor is larger due to an implementation of width_bucket.
WARNING: The implementation suffers from a variety of rounding errors and poor error handling. This is a quick and dirty solution for rough estimates only.
NOTE: Using more than 20 buckets will need a small tweak. Grep the SQL for empty_buckets.
- PostgreSQL DDL (may work with Oracle 9+)
- MySQL DDL
- starter data: scores from reddit, digg, and comment counts from slashdot
I'd love to hear criticisms, comments, & suggestions!
"Data elements" are the focus of Section 5.2.1 of Dr. Fielding's dissertation. I think the focus is who does how much work to render the data. He lists three options: mostly server, mixed client & server, and mostly client:
Server Client
|----------------------------------------------------------------------|
| | |
1. Fixed format 2. encapsulated data & code 3. raw data & metadata
(jpeg) (json, javascript) (html <img> tag?)
This is an important decision for an online spreadsheet. Should the server send only literal values (option one)? Should the client evaluate functions and resolve references? (option two)?
Option 1:
pro: parsing cell values into a tree and traversing it is tricky. Writing it once in Java seems like a reasonable approach.
con: client has no idea which cells are related to each other. That would really complicate features like copying and pasting cell references.
Option 2:
con: reimplement much of cell parsing & traversal in Javascript. Using Rhino to test it outside a browser mitigates this a little.
pro: client knows which cells are related.
Option 3:
con: no idea how to apply this approach for a spreadsheet.
I chose option two. Eventually the client will probably need to work directly with cell references. Better to get that work out of the way ahead of time. Not all of the logic needs to be rewritten. Some of it is only needed on one side:
common:
* build parse tree from cell values
* iterate over parse tree
server specific:
* update reference values: A1 => B1, $G3 => $G4
* collapse the parse tree back into a flat string
client specific:
* resolve reference values (A1 = 123.45, G3 = A1 = 123.45).
* evaluate functions (=SUM(G3,A1) => 246.9).
I didn't expect so many neat challenges. Even the prototype implementation needed parsing, recursion, evaluating simple math expressions, tree traversal, base10 & base26 conversion, and more. But all that is meat for another post.
- JHaskell.
- Sleep. Quoted from the site: Sleep is heavily inspired by Perl with bits of Objective-C thrown in
- Dynamic Java.
- Groovy seems like Java "lite." The groovy compiler reportedly accepts most Java source, but also permits a more flexible yet terse syntax. This gained a lot of traction at the conference. Some people complained that the name sounded too much like "Ruby." Which leads to...
- JRuby. The overall goal is to mate the slick Ruby syntax with the mature Java VM. I'm very hesitant to seriously consider it for a few reasons:
- The interpreter will variously introduce bugs and lack features. This will be an ongoing issue as Ruby is an advanced, relatively young language; 2.0 is under active development.
- The Ruby VM will get much better over time. When it does, the argument for using the excellent Java VM will be much weaker.
- For better or worse, I didn't see a competitor to Fortran.NET
Every spreadsheet application needs to support the creating, reading, updating, and deleting of sheets, columns, rows, and cells. The network protocol for an online spreadsheet could easily treat sheets, columns, etc. as resources and offer CRUD operations for manipulating them.
The requests below were hand generated and sent via telnet. The responses were copy and pasted from the console window (w/ a little pretty printing). A real frontend might use this API with XmlHttpRequest.
| operation | request | response |
|---|---|---|
| Create a sheet |
POST /fauxcel/finances HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Length: 0
Date: Wed, 25 Apr 2007 19:26:39 GMT
|
| Populate a cell |
POST /fauxcel/finances/A/1 HTTP/1.1
Host: 192.168.113.115:8080
Content-Type: text/text;charset=utf-8
Content-Length: 34
{"value":"123.45", "type":"value"}
|
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:36:37 GMT |
| Insert a column |
POST /fauxcel/finances/A HTTP/1.1 Host: 192.186.113.115:8080 |
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:36:37 GMT |
| Read a column |
GET /fauxcel/finances/B HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked
Date: Wed, 25 Apr 2007 19:39:38 GMT
53
{"name":"finances","cells": [
{
"col":"B",
"row":"1",
"type":"value",
"value":"123.45"
}
]}
0
|
| Delete a row |
DELETE /fauxcel/finances/1 HTTP/1.1 Host: 192.168.113.115:8080 |
HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Content-Length: 0 Date: Wed, 25 Apr 2007 19:41:26 GMT |
Note that the back end moved the value of cell A1 into cell B1 when a column was inserted before A. More details on that in a following post!

