String similarity metrics in Python
--
Become part of the top 3% of the developers by applying to Toptal
https://topt.al/25cXVn
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Music Box Puzzles
--
Chapters
00:00 Question
01:10 Accepted answer (Score 26)
01:52 Answer 2 (Score 97)
02:16 Answer 3 (Score 18)
03:02 Answer 4 (Score 9)
04:26 Thank you
--
Full question
https://stackoverflow.com/questions/1471...
Question links:
[en.wikipedia]: http://en.m.wikipedia.org/wiki/Category:...
[Levenshtein distance]: https://code.google.com/p/pylevenshtein/
Accepted answer links:
[http://web.archive.org/web/2008122423435...]: http://web.archive.org/web/2008122423435...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #string #algorithm #levenshteindistance #editdistance
#avk47
ANSWER 1
Score 99
I realize it's not the same thing, but this is close enough:
>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095
You can wrap this up as a function:
def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9
>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False
ACCEPTED ANSWER
Score 26
There's a great resource for string similarity metrics at the University of Sheffield. It lists a variety of metrics (beyond just Levenshtein) and provides open-source implementations of them. Many of them look easy to adapt into Python (see the sketch after the list below).
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Here's a bit of the list:
- Hamming distance
- Levenshtein distance
- Needleman-Wunsch distance or Sellers Algorithm
- and many more...
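Some of these really are tiny to port by hand. As a minimal illustration (the function name is just for this example), Hamming distance, which only applies to strings of equal length, can be sketched as:
def hamming(s1, s2):
    # Count positions at which two equal-length strings differ.
    if len(s1) != len(s2):
        raise ValueError('Hamming distance requires equal-length strings')
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
>>> hamming('karolin', 'kathrin')
3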
ANSWER 3
Score 9
I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account), rather than the difflib stuff, for two reasons: (1) "fast enough" (dynamic programming algo) and "whoooosh" (bit-bashing) C code is available, and (2) well-understood behaviour, e.g. Levenshtein satisfies the triangle inequality and thus can be used in e.g. a Burkhard-Keller tree.
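For reference, the dynamic-programming algorithm alluded to here is short enough to sketch in pure Python (the function name is illustrative; C implementations such as the pylevenshtein package linked in the question do the same thing much faster):
def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance:
    # prev[j] holds the distance between the first i-1 chars of s1 and the first j chars of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
>>> levenshtein('ape', 'apple')
2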
Threshold: you should treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)), and adjust X (the similarity factor) to suit yourself. One way of choosing X: get a sample of matches, calculate X for each, ignore cases where X < (say) 0.8 or 0.9, then sort the remainder in descending order of X, eyeball them, record the correct result for each, and calculate some cost-of-mistakes measure for various levels of X.
N.B. Your ape/apple example has distance 2, so X is 0.6; I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty.
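A minimal sketch of that threshold rule, assuming the python-Levenshtein package from the question links is installed (the is_match name and the default X of 0.75 are just for illustration):
import Levenshtein

def is_match(s1, s2, x=0.75):
    # Positive only when distance < (1 - X) * max(len(s1), len(s2)).
    return Levenshtein.distance(s1, s2) < (1 - x) * max(len(s1), len(s2))
>>> is_match('ape', 'apple')         # distance 2 vs (1 - 0.75) * 5 = 1.25
False
>>> is_match('ape', 'apple', x=0.5)  # distance 2 vs (1 - 0.5) * 5 = 2.5
True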
ANSWER 4
Score 6
Is that what you mean?
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
See http://docs.python.org/library/difflib.html#difflib.get_close_matches
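get_close_matches also takes n (the maximum number of matches returned) and cutoff (the minimum SequenceMatcher ratio, 0.6 by default), so you can tune how strict it is; for example:
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], n=1)
['apple']
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], cutoff=0.9)
[]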