String similarity metrics in Python

This video explains
String similarity metrics in Python

--

Become part of the top 3% of the developers by applying to Toptal
https://topt.al/25cXVn

--

Music by Eric Matyas
https://www.soundimage.org
Track title: Music Box Puzzles

--

Chapters
00:00 Question
01:10 Accepted answer (Score 26)
01:52 Answer 2 (Score 97)
02:16 Answer 3 (Score 18)
03:02 Answer 4 (Score 9)
04:26 Thank you

--

Full question
https://stackoverflow.com/questions/1471...

Question links:
[en.wikipedia]: http://en.m.wikipedia.org/wiki/Category:...
[Levenshtein distance]: https://code.google.com/p/pylevenshtein/

Accepted answer links:
[http://web.archive.org/web/2008122423435...]: http://web.archive.org/web/2008122423435...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #string #algorithm #levenshteindistance #editdistance

#avk47

ANSWER 1

Score 99

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can make this as a function

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False

ACCEPTED ANSWER

Score 26

There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

Hamming distance
Levenshtein distance
Needleman-Wunch distance or Sellers Algorithm
and many more...

ANSWER 3

Score 9

I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account) rather than the difflib stuff for two reasons (1) "fast enough" (dynamic programming algo) and "whoooosh" (bit-bashing) C code is available and (2) well-understood behaviour e.g. Levenshtein satisfies the triangle inequality and thus can be used in e.g. a Burkhard-Keller tree.

Threshold: you should treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)) and adjust X (the similarity factor) to suit yourself. One way of choosing X is to get a sample of matches, calculate X for each, ignore cases where X < say 0.8 or 0.9, then sort the remainder in descending order of X and eye-ball them and insert the correct result and calculate some cost-of-mistakes measure for various levels of X.

N.B. Your ape/apple example has distance 2, so X is 0.6 ... I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty

ANSWER 4

Score 6

Is that what you mean?

>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

look at http://docs.python.org/library/difflib.html#difflib.get_close_matches