Finding similar strings using character level bigrams

| | Comments (1) | TrackBacks (0)

Earlier I looked at soundex as a more robust way of comparing strings. Soundex was more robust than case & space normalizing... but it missed somethings I wanted and found some things I didn't want. Here's an example based on a rather limited wine rack:

CREATE TABLE winerack (
      wine_type TEXT
    , ready DATE
    , soundex_val TEXT
);

INSERT INTO winerack (wine_type, ready, soundex_val)
    VALUES ('cabernet sauvingon', now() + interval '5 years', 'C165');

INSERT INTO winerack (wine_type, ready, soundex_val)
    VALUES ('cabermet sauvingon', now() + interval '6 years', 'C165');

INSERT INTO winerack (wine_type, ready, soundex_val)
    VALUES ('caberet sauvingon', now() + interval '7 years', 'C163');

INSERT INTO winerack (wine_type, ready, soundex_val)
    VALUES ('cabernet franc', now() + interval '8 years', 'C165');

Now I'd like to find all of the Cabernets Sauvignon. Using a literal match one bottle shows up. Three bottles show up using soundex... but not the right three bottles:

bigrams=# SELECT * FROM winerack WHERE soundex_val = 'C165';
     wine_type      |   ready    | soundex_val 
--------------------+------------+-------------
 cabernet sauvingon | 2012-10-08 | C165
 cabermet sauvingon | 2013-10-08 | C165
 cabernet franc     | 2015-10-08 | C165
(3 rows)

What happened? Well, soundex picked up the "cabermet" which was good. But it also picked up the Cabernet Franc which was bad. Also, it missed the "cabaret sauvingon".

Cabernet Franc was picked up because a soundex value contains at most four consonants. 'C', 'b', 'r', 'n' in this case. Nothing else matters afterwards... shudder to imagine "cabernet sausage" showing up in the search results!

Caberet Sauvignon was ignored because soundex doesn't compare strings. It's almost like a really primitive string hashing function. String in, string out. Easy to use though.

I approached Rob with this and he suggested I give character-level bigrams a try. Basically this finds how similar words are based on how often pairs of characters occur. Rob is being very patient with me as I work through the gory details, but this is my general understanding:

similarity(left_str, right_str)
    left_sum = dot_product(left_str, left_str);
    right_sum = dot_product(right_str, right_str);
    pair_sum = dot_product(left_str, right_str);
    return pair_sum / square_root(left_sum * right_sum);

The moment of truth -- did it outperform soundex?

bigrams=# select * from bigram_similarities;
      term_a       |      term_b       |    similarity     
-------------------+-------------------+-------------------
 cabernetsauvignon | cabernetsauvignon |                 1
 cabernetsauvignon | cabermetsauvignon |             0.875
 cabernetsauvignon | caberetsauvignon  | 0.903696114115064
 cabernetsauvignon | cabernetfranc     | 0.505181485540923

In this contrived case it worked out better. No clue further than that. It is easy to use like soundex:

bigrams=# select bigram_similarity('cabernet sauvignon', 'cabernet franc');
 bigram_similarity 
-------------------
 0.505181485540923

The infrastructure behind this is pretty gory. The code's not really polished and I have some concerns about performance. But here's a really neat way to generate a dot product using SQL:

CREATE TABLE bigrams ( term TEXT, bigram CHAR(2), instances INT );

SELECT
    SUM(lhs.instances * rhs.instances) AS dot_product
FROM
    bigrams lhs, bigrams rhs
WHERE
    lhs.bigram = rhs.bigram
        AND
    lhs.term = 'cabernet sauvignon'
        AND
    rhs.term = 'cabernet franc'

1 Comments

Anurag said:

Hi Seth,

Thanks for your post on 'Finding similar strings using character level bigrams'.

I am also working on the similar project and would like to know how did you populate similarity column (what SQL did you use) in bigram_similarities table?

Thanks in advance ,
Anurag

Leave a comment


Type the characters you see in the picture above.

0 TrackBacks

Listed below are links to blogs that reference this entry: Finding similar strings using character level bigrams.

TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/430