The Python Oracle

Python ordered list search versus searching sets for list of objects

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Horror Game Menu Looping

--

Chapters
00:00 Python Ordered List Search Versus Searching Sets For List Of Objects
01:30 Accepted Answer Score 3
03:22 Thank you

--

Full question
https://stackoverflow.com/questions/5172...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #list #search #set

#avk47



ACCEPTED ANSWER

Score 3


TL;DR: when using fuzzy comparison techniques, sets and sorting can be very difficult to work with without some normalization method. You can try to be smart about reducing the search space as much as possible, but take care to do it consistently.

If a class defines __eq__ but not __hash__, its instances are not hashable.

For instance, consider the following class:

class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __repr__(self):
        return f'{self.first} {self.last}'

    def __eq__(self, other):
        return (self.first == other.first) and (self.last == other.last)

Now, if you were to try to create a set with these elements:

>>> {Name('Neil', 'Stackoverflow-user')}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Name'

So, in the case of Name, you would simply define a __hash__ method. In your case, however, this is more difficult, since you have fuzzy equality semantics. The only way I can think of to get around this is to have a normalization function that you can prove will be consistent, and to use the normalized string instead of the actual string as part of your hash. See Floats as dictionary keys for an example of needing to normalize in order to use a "fuzzy" type like floats as keys.
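As a rough sketch of that idea, assuming the fuzzy matching is string-based and that a simple strip-and-lowercase normalization is acceptable for your data (normalize below is a hypothetical helper, not something from the question):

def normalize(s):
    # Hypothetical normalization; substitute whatever canonical form you
    # can prove is consistent with your fuzzy equality.
    return s.strip().lower()

class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __repr__(self):
        return f'{self.first} {self.last}'

    def __eq__(self, other):
        return (normalize(self.first) == normalize(other.first)
                and normalize(self.last) == normalize(other.last))

    def __hash__(self):
        # Hash the normalized fields so that objects that compare equal
        # also hash equal.
        return hash((normalize(self.first), normalize(self.last)))

The caveat is the same one the normalization argument rests on: if two objects are "equal" under your fuzzy comparison but normalize differently, they will hash differently and the set will happily keep both.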

For sorting and binary searching, you still need to be careful, because fuzzy equality does not give you a consistent ordering. As an example, assume equality is determined by being within a certain Levenshtein distance. Then book and hook will be similar to each other (distance = 1), but hack, at a distance of 2 from hook, will be closer to hook than to book (distance = 3). So how will you define a good sorting algorithm for fuzzy searching in this case?
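To make that concrete, here is a small sketch using a plain dynamic-programming Levenshtein distance (not anything from the question) showing why "within distance k" does not behave like an ordering:

def levenshtein(a, b):
    # Single-row dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

print(levenshtein('book', 'hook'))  # 1 -- "equal" under a threshold of 1
print(levenshtein('hook', 'hack'))  # 2
print(levenshtein('book', 'hack'))  # 3 -- so this "equality" is not transitive

Because the similarity is not transitive, there is no total order that keeps all "equal" items adjacent, which is exactly what sorting plus binary search would need.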

One thing to try would be some form of group-by/bucketing, like a dictionary of type Dict[int, List[MyObj]], where instances of MyObj are classified by their one constant, the self.integer field. Then you can compare within the smaller sub-lists. This would at least reduce the search space by clustering, as in the sketch below.
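A minimal sketch of that bucketing idea, assuming a hypothetical MyObj with an exact integer field and a caller-supplied fuzzy predicate (both are placeholders, not names from the question):

from collections import defaultdict
from typing import Callable, Dict, List

class MyObj:
    def __init__(self, integer, text):
        self.integer = integer   # the one exact, constant field
        self.text = text         # the fuzzily compared field

def bucket_by_integer(objs: List[MyObj]) -> Dict[int, List[MyObj]]:
    buckets: Dict[int, List[MyObj]] = defaultdict(list)
    for obj in objs:
        buckets[obj.integer].append(obj)
    return buckets

def fuzzy_find(target: MyObj,
               buckets: Dict[int, List[MyObj]],
               is_similar: Callable[[MyObj, MyObj], bool]) -> List[MyObj]:
    # Only run the expensive fuzzy comparison inside the bucket that
    # shares the exact integer field with the target.
    return [obj for obj in buckets.get(target.integer, [])
            if is_similar(target, obj)]

Each lookup then only pays the fuzzy-comparison cost for the objects that already agree on the exact field.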