One of the unpleasant tasks on my to-do list is improve the process of searching by name in a database. A parade of errors are just around the bend: transposed letters, stray punctuation, and even random flecks of whitespace.
Given these silly tables and sample data:
create table cafe_du_monde (
name varchar2(128),
beignets integer
);
create table jackson_square (
name varchar2(128),
pictures_taken integer
);
a.) insert into cafe_du_monde (name, beignets) values ('seth', 3);
b.) insert into cafe_du_monde (name, beignets) values ('seth ', 6);
c.) insert into jackson_square (name, pictures_taken) values ('Seth', 127);
d.) insert into jackson_square (name, pictures_taken) values ('seht', 10);
Ideally the results would be me eating 9 beignets and taking 137 pictures (let's not mention the cafe au laits :)
Method A: compare as-is
select name, sum(beignets), sum(pictures_taken) from cafe_du_monde as cdm, jackson_square as js where cdm.name = js.name group by name;Bzzzt. Not good. Look at all those rows! Selected: a, b, c, d
Method B: normalize case
select name, sum(beignets), sum(pictures_taken) from cafe_du_monde as cdm, jackson_square as js where lower(cdm.name) = lower(js.name) group by name;Selected: a+c, b, d
Method C: trim whitespace
select name, sum(beignets), sum(pictures_taken) from cafe_du_monde as cdm, jackson_square as js where trim(both from cdm.name) = trim(both from js.name) group by name;Selected: a+b, c, d
Methods B & C
(obvious query omitted) Selected: a+b+c, dMethod D: soundex
select name, sum(beignets), sum(pictures_taken) from cafe_du_monde as cdm, jackson_square as js where soundex(cdm.name) = soundex(js.name) group by name;Soundex value: S300 Selected: a+b+c+d Soundex works perfectly for this example. When I tried it with more realistic data I wasn't quite as happy. Matt introduced me to the Levenshtein algorithm for counting the number of edits needed to convert one string into another. (ex: cat -> mat = 1, cat -> dot = 2, cat -> dog = 3, cat -> gold = 4)
Method E: Comparing edit distance
levenshtein('seth', 'seth') => 0
levenshtein('seth', 'Seth') => 1
levenshtein('seth', 'seth ') => 1
levenshtein('seth', 'seht') => 2
damerau-levenshtein('seth', 'seht') => 1
The Levenshtein distance algorithm treats the 'ht' in 'seht' as two substitions. The Damerau-Levenshtein algorithm treats it as one swap. Otherwise it's the same, so Damerau-Levenshtein seems like the better method.
Unhappily the initial edit distance results seem weaker than the case normalizing and whitespace trimming. Since the algorithms stumble over case and whitespace, so why not take care of those first?
Method F: normalizing case, trimming whitespace, & comparing edit distance
damerau-levenshtein(lower(trim('seth', 'seth'))) => 0
damerau-levenshtein(lower(trim('seth', 'Seth'))) => 0
damerau-levenshtein(lower(trim('seth', 'seth '))) => 0
damerau-levenshtein(lower(trim('seth', 'seht'))) => 1
I think this is better. An edit distance of 25% seems a little high, but the string is so short in this example. At this point I've got enough to start experimenting.
The Levenshtein implementation on Wikipedia has to be seen to be believed. Here's a quick and dirty Ruby version. (ObFun: Since Ruby supports jagged arrays I decided to emulate two dimensional arrays with a set of [row,col] keys. A little strange looking but kinda neat).
def levenshtein(s1, s2)
d = {}
(0..s1.size).each do |row|
d[[row, 0]] = row
end
(0..s2.size).each do |col|
d[[0, col]] = col
end
(1..s1.size).each do |i|
(1..s2.size).each do |j|
cost = 0
if (s1[i-1] != s2[j-1])
cost = 1
end
d[[i, j]] = [d[[i - 1, j]] + 1,
d[[i, j - 1]] + 1,
d[[i - 1, j - 1]] + cost
].min
next unless @@damerau
if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])
d[[i, j]] = [d[[i,j]],
d[[i-2, j-2]] + cost
].min
end
end
end
return d[[s1.size, s2.size]]
end
0 TrackBacks
Listed below are links to blogs that reference this entry: Counting beignets: Soundex, Levenshtein, or case & space bashing?.
TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/404



Leave a comment