Sometimes you really want a quick & dirty histogram while looking through a database:
- when you suspect the mean value is misleading
- when you want to understand how the values are distributed
- ... and easily switch between different sources of values
- ... without exporting data & switching applications
Here are the story scores from a recent front page of reddit.com: 175, 456, 140, 191, 230, 186, 134, 215, 171, 83, 102, 171, 182, 322, 193, 310, 338, 345, 174, 134, 92, 109, 241, 256, 132
A basic statistical query returns:
+-----+-----+-----+--------+ | max | min | avg | stddev | +-----+-----+--------------+ | 456 | 83 | 203 | 90 | +-----+-----+-----+--------+
The standard deviation seems awfully large. Maybe not many of the scores are close to the mean score of 203? What if another query could show the distribution?
+--------+----------+--------+----------+ +--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling | | bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+ +--------+----------+--------+----------+
| 1 | 8 | 83 | 157 | | 1 | 4 | 83 | 119 |
| 2 | 10 | 158 | 232 | | 2 | 4 | 120 | 156 |
| 3 | 2 | 233 | 307 | | 3 | 8 | 157 | 193 |
| 4 | 4 | 308 | 382 | | 4 | 2 | 194 | 230 |
| 5 | 1 | 383 | 457 | | 5 | 2 | 231 | 267 |
+--------+----------+--------+----------+ | 6 | 0 | 268 | 304 |
| 7 | 3 | 305 | 341 |
| 8 | 1 | 342 | 378 |
| 9 | 0 | 379 | 415 |
| 10 | 1 | 416 | 452 |
+--------+----------+--------+----------+
The histogram on the left has fewer, larger buckets. This is a lot more informative than the mean & stddev. The histogram on the right uses more, smaller buckets. Maybe this is too verbose? What if you wanted seven buckets?
update dhg.bucket_count set num_buckets = 7;
select * from dhg.results;
+--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+
| 1 | 7 | 83 | 135 |
| 2 | 7 | 136 | 188 |
| 3 | 5 | 189 | 241 |
| 4 | 1 | 242 | 294 |
| 5 | 4 | 295 | 347 |
| 6 | 0 | 348 | 400 |
| 7 | 1 | 401 | 453 |
+--------+----------+--------+----------+
Instructions:
- Insert numbers to be analyzed:
- Choose how many buckets in the histogram
- Read the results!
class="prettyprint"INSERT INTO dhg.source SELECT foo FROM bar;
UPDATE dhg.bucket_count SET num_buckets = 10;
SELECT * FROM dhg.results_full;
Materials:
All views, tables, and functions will live in a dynamic histogram (dhg) schema. The SQL is pretty minimal yet hopefully reasonably structured and commented. The MySQL flavor is larger due to an implementation of width_bucket.
WARNING: The implementation suffers from a variety of rounding errors and poor error handling. This is a quick and dirty solution for rough estimates only.
NOTE: Using more than 20 buckets will need a small tweak. Grep the SQL for empty_buckets.
- PostgreSQL DDL (may work with Oracle 9+)
- MySQL DDL
- starter data: scores from reddit, digg, and comment counts from slashdot
I'd love to hear criticisms, comments, & suggestions!
0 TrackBacks
Listed below are links to blogs that reference this entry: Quick and dirty SQL histogram.
TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/497



Leave a comment