d ← L ##.dist R ⍝ Levenshtein distance.
From Jay Foad, [dist] returns the "distance" between simple vectors L and R.
The O(m×n) algorithm compares sequences, calculating the minimum number of edits
required to transform one into the other. The notional costs of the edit operat-
ions are:
Insert item: 1
Delete item: 1
Replace item: 1
Copy item: 0
For example, comparing vectors 'Sunday' and 'Saturday' implies the following
table, where each item is:
0 + its NW neighbour, if the corresponding items match.
1 + the minimum of its N, W and NW neighbours, otherwise.
S a t u r d a y
S 0 1 2 3 4 5 6 7
u 1 1 2 2 3 4 5 6
n 2 2 2 3 3 4 5 6
d 3 3 3 3 4 3 4 5
a 4 3 4 4 4 4 3 4
y 5 4 4 5 5 5 4 3
Each number in the table is the edit distance between the sub-strings above and
to its left. For example, the 2 in the second row, fourth column, is the dist-
ance from 'Su' to 'Satu': two copies and two insertions.
Note that, as the cost of insertion and deletion is the same, the function is
commutative. 'Satu' → 'Su' : two copies and two deletions.
The Levenshtein distance between the two sequences is thus the lower-right table
item (3 in the above example).
See: http://en.wikipedia.org/wiki/Levenshtein_distance
Technical note:
dist←{ ⍝ Levenshtein distance.
a←(n+1)⍴(⍴⍺)+n←⍴⍵ ⍝ first row of matrix
f←⍵{⌊\⍵⌊(⊃⍵),(¯1↓⍵)-1+⍺=⍺⍺} ⍝ iteration step
z←⊃f/(⌽⍺),⊂a ⍝ last row of matrix
⊃⌽z
}
[dist] calculates a row of the matrix at a time. The / iterates over successive
characters from ⍺, generating a new matrix row each time. The ⌊\ is key to calc-
ulating a whole row quickly.
Examples
'Sunday' dist 'Saturday' ⍝ distance between strings.
3
'sitting' dist 'kitten' ⍝ examples from Wikipedia.
3
mons ← 'January' 'February' 'March'
mons,← 'April' 'May' 'June'
mons,← 'July' 'August' 'September'
mons,← 'October' 'November' 'December'
mons ∘.dist mons ⍝ distances between months
0 4 6 7 5 5 4 6 9 7 8 8
4 0 7 7 6 7 6 7 8 7 8 7
6 7 0 4 3 5 5 6 9 7 8 8
7 7 4 0 5 5 5 5 8 7 8 8
5 6 3 5 0 4 3 6 9 7 8 8
5 7 5 5 4 0 2 5 8 6 7 7
4 6 5 5 3 2 0 5 9 7 8 8
6 7 6 5 6 5 5 0 9 7 8 8
9 8 9 8 9 8 9 9 0 5 4 3
7 7 7 7 7 6 7 7 5 0 5 4
8 8 8 8 8 7 8 8 4 5 0 3
8 7 8 8 8 7 8 8 3 4 3 0
fuzzy←{({⍵⍳⌊/⍵}(lcase ⍺)∘dist∘lcase¨⍵)⊃⍵} ⍝ fuzzy selection
fuzzy∘mons¨ 'dcmbr' 'marching' 'febury' ⍝ fuzzy matches
December March February
See also: lcase
Back to: contents
Back to: Workspaces