Earlier I looked at soundex as a more robust way of comparing strings. Soundex was more robust than case & space normalizing... but it missed somethings I wanted and found some things I didn't want. Here's an example based on a rather limited wine rack:
CREATE TABLE winerack (
wine_type TEXT
, ready DATE
, soundex_val TEXT
);
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabernet sauvingon', now() + interval '5 years', 'C165');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabermet sauvingon', now() + interval '6 years', 'C165');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('caberet sauvingon', now() + interval '7 years', 'C163');
INSERT INTO winerack (wine_type, ready, soundex_val)
VALUES ('cabernet franc', now() + interval '8 years', 'C165');
Now I'd like to find all of the Cabernets Sauvignon. Using a literal match one bottle shows up. Three bottles show up using soundex... but not the right three bottles:
bigrams=# SELECT * FROM winerack WHERE soundex_val = 'C165';
wine_type | ready | soundex_val
--------------------+------------+-------------
cabernet sauvingon | 2012-10-08 | C165
cabermet sauvingon | 2013-10-08 | C165
cabernet franc | 2015-10-08 | C165
(3 rows)
What happened? Well, soundex picked up the "cabermet" which was good. But it also picked up the Cabernet Franc which was bad. Also, it missed the "cabaret sauvingon".
Cabernet Franc was picked up because a soundex value contains at most four consonants. 'C', 'b', 'r', 'n' in this case. Nothing else matters afterwards... shudder to imagine "cabernet sausage" showing up in the search results!
Caberet Sauvignon was ignored because soundex doesn't compare strings. It's almost like a really primitive string hashing function. String in, string out. Easy to use though.
I approached Rob with this and he suggested I give character-level bigrams a try. Basically this finds how similar words are based on how often pairs of characters occur. Rob is being very patient with me as I work through the gory details, but this is my general understanding:
similarity(left_str, right_str)
left_sum = dot_product(left_str, left_str);
right_sum = dot_product(right_str, right_str);
pair_sum = dot_product(left_str, right_str);
return pair_sum / square_root(left_sum * right_sum);
The moment of truth -- did it outperform soundex?
bigrams=# select * from bigram_similarities;
term_a | term_b | similarity
-------------------+-------------------+-------------------
cabernetsauvignon | cabernetsauvignon | 1
cabernetsauvignon | cabermetsauvignon | 0.875
cabernetsauvignon | caberetsauvignon | 0.903696114115064
cabernetsauvignon | cabernetfranc | 0.505181485540923
In this contrived case it worked out better. No clue further than that. It is easy to use like soundex:
bigrams=# select bigram_similarity('cabernet sauvignon', 'cabernet franc');
bigram_similarity
-------------------
0.505181485540923
The infrastructure behind this is pretty gory. The code's not really polished and I have some concerns about performance. But here's a really neat way to generate a dot product using SQL:
CREATE TABLE bigrams ( term TEXT, bigram CHAR(2), instances INT );
SELECT
SUM(lhs.instances * rhs.instances) AS dot_product
FROM
bigrams lhs, bigrams rhs
WHERE
lhs.bigram = rhs.bigram
AND
lhs.term = 'cabernet sauvignon'
AND
rhs.term = 'cabernet franc'
1 Comment
Leave a comment
0 TrackBacks
Listed below are links to blogs that reference this entry: Finding similar strings using character level bigrams.
TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/430



Hi Seth,
Thanks for your post on 'Finding similar strings using character level bigrams'.
I am also working on the similar project and would like to know how did you populate similarity column (what SQL did you use) in bigram_similarities table?
Thanks in advance ,
Anurag