Quick and dirty SQL histogram

| | Comments (0) | TrackBacks (0)

Sometimes you really want a quick & dirty histogram while looking through a database:

  • when you suspect the mean value is misleading
  • when you want to understand how the values are distributed
  • ... and easily switch between different sources of values
  • ... without exporting data & switching applications

Here are the story scores from a recent front page of reddit.com: 175, 456, 140, 191, 230, 186, 134, 215, 171, 83, 102, 171, 182, 322, 193, 310, 338, 345, 174, 134, 92, 109, 241, 256, 132

A basic statistical query returns:

+-----+-----+-----+--------+
| max | min | avg | stddev |
+-----+-----+--------------+
| 456 |  83 | 203 |   90   |
+-----+-----+-----+--------+

The standard deviation seems awfully large. Maybe not many of the scores are close to the mean score of 203? What if another query could show the distribution?

+--------+----------+--------+----------+         +--------+----------+--------+----------+
| bucket | contents | _floor | _ceiling |         | bucket | contents | _floor | _ceiling |
+--------+----------+--------+----------+         +--------+----------+--------+----------+
|      1 |        8 |     83 |      157 |         |      1 |        4 |     83 |      119 |
|      2 |       10 |    158 |      232 |         |      2 |        4 |    120 |      156 |
|      3 |        2 |    233 |      307 |         |      3 |        8 |    157 |      193 |
|      4 |        4 |    308 |      382 |         |      4 |        2 |    194 |      230 |
|      5 |        1 |    383 |      457 |         |      5 |        2 |    231 |      267 |
+--------+----------+--------+----------+         |      6 |        0 |    268 |      304 |
                                                  |      7 |        3 |    305 |      341 |
                                                  |      8 |        1 |    342 |      378 |
                                                  |      9 |        0 |    379 |      415 |
                                                  |     10 |        1 |    416 |      452 |
                                                  +--------+----------+--------+----------+

The histogram on the left has fewer, larger buckets. This is a lot more informative than the mean & stddev. The histogram on the right uses more, smaller buckets. Maybe this is too verbose? What if you wanted seven buckets?

        update dhg.bucket_count set num_buckets = 7;
        select * from dhg.results;

        +--------+----------+--------+----------+
        | bucket | contents | _floor | _ceiling |
        +--------+----------+--------+----------+
        |      1 |        7 |     83 |      135 |
        |      2 |        7 |    136 |      188 |
        |      3 |        5 |    189 |      241 |
        |      4 |        1 |    242 |      294 |
        |      5 |        4 |    295 |      347 |
        |      6 |        0 |    348 |      400 |
        |      7 |        1 |    401 |      453 |
        +--------+----------+--------+----------+

Instructions:

  1. Insert numbers to be analyzed:
  2. class="prettyprint"INSERT INTO dhg.source SELECT foo FROM bar;
  3. Choose how many buckets in the histogram
  4. UPDATE dhg.bucket_count SET num_buckets = 10;
  5. Read the results!
  6. SELECT * FROM dhg.results_full;

Materials:

All views, tables, and functions will live in a dynamic histogram (dhg) schema. The SQL is pretty minimal yet hopefully reasonably structured and commented. The MySQL flavor is larger due to an implementation of width_bucket.

WARNING: The implementation suffers from a variety of rounding errors and poor error handling. This is a quick and dirty solution for rough estimates only.

NOTE: Using more than 20 buckets will need a small tweak. Grep the SQL for empty_buckets.

I'd love to hear criticisms, comments, & suggestions!

Leave a comment


Type the characters you see in the picture above.

0 TrackBacks

Listed below are links to blogs that reference this entry: Quick and dirty SQL histogram.

TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/497