Levenshtein distanceIn information theory, the Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution. It is named after the Russian scientist Vladimir Levenshtein, who considered this distance in 1965. It is useful in applications that need to determine how similar two strings are, such as spell checkers. For example, the Levenshtein distance between "kitten" and "sitting" is 3, since these three edits change one into the other, and there's no way to do it with less than three edits:
It can be considered a generalization of the Hamming distance, which is used for strings of the same length and only considers substitution edits. There are also further generalizations of the Levenshtein distance that consider, for example, exchanging two characters as an operation.
The AlgorithmA commonly-used algorithm for computing the Levenshtein distance involves the use of an (n + 1) × (m + 1) matrix, where n and m are the lengths of the two strings. Here is pseudocode for it:
int LevenshteinDistance(char s[1..n], char t[1..m])
declare int d[0..n,0..m]
declare int i, j, cost
for i := 0 to n
d[i,0] := i
for j := 0 to m
d[0,j] := j
for i := 1 to n
for j := 1 to m
if s[i] = t[j] then cost := 0
else cost := 1
d[i,j] := minimum(d[i-1,j ] + 1, // insertion
d[i, j-1] + 1, // deletion
d[i-1,j-1] + cost) // substitution
return d[n,m]
Possible ImprovementsPossible improvements to this algorithm include:
Proof of CorrectnessThe invariant is that we can transform the initial segment
This proof fails to validate that the number placed in Upper and lower boundsThe Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
Sample implementationsHaskell
min3 :: Int->Int->Int->Int
min3 x y z = min x (min y z)
cost :: Char->Char->Int
cost a b
| a == b = 0
| otherwise = 1
sumCosts :: String->Char->Int
sumCosts a = 0
sumCosts (s:ss) a = cost s a + sumCosts ss a
editDistance :: String->String->Int
editDistance = 0
editDistance s = sumCosts s '-'
editDistance t = sumCosts t '-'
editDistance (s:ss) (t:ts) = min3 (cost s t + editDistance ss ts)
(cost s '-' + editDistance ss (t:ts))
(cost '-' t + editDistance (s:ss) ts)
External links
de:Lewenstein-Distanz
Categories: Algorithms on strings |
|
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia article. Browse Wikipedia for more information. |