The Python Oracle

Finding Intersection Between a Matrix and a Vector, by Row

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Secret Catacombs

--

Chapters
00:00 Finding Intersection Between A Matrix And A Vector, By Row
01:06 Accepted Answer Score 2
01:53 Answer 2 Score 2
02:27 Answer 3 Score 1
02:52 Answer 4 Score 1
03:39 Thank you

--

Full question
https://stackoverflow.com/questions/5517...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #numpy

#avk47



ACCEPTED ANSWER

Score 2


For matr and vec as arrays, here's one with np.searchsorted -

def count_in_rowwise(matr,vec):
    sidx = vec.argsort()
    idx = np.searchsorted(vec,matr,sorter=sidx)
    idx[idx==len(vec)] = 0
    return (vec[sidx[idx]] == matr).sum(1)

With a comparatively smaller vec, we can pre-sort it and use, to give us an alternative one to compute the row-counts, like so -

def count_in_rowwise_v2(matr,vec,assume_sorted=False):
    if assume_sorted==1:
        sorted_vec = vec
    else:
        sorted_vec = np.sort(vec)
    idx = np.searchsorted(sorted_vec,matr)
    idx[idx==len(sorted_vec)] = 0
    return (sorted_vec[idx] == matr).sum(1)

The above solution(s) works on generic inputs(numbers or strings alike). To solve our specific case of strings, we could optimize it further by converting the strings to numbers by using np.unique and then re-using count_in_rowwise/count_in_rowwise_v2 and that will give us our second approach, like so -

u,ids = np.unique(matr, return_inverse=True)
out = count_in_rowwise(ids.reshape(matr.shape),ids[np.searchsorted(u,vec)])



ANSWER 2

Score 2


You could use set intersection to speed things up a bit. Here's a comparison:

Your present solution with list comprehensions:

%%timeit
print([sum([y in vec for y in row]) for row in matr])
#Output
[3,2,1]
20 µs ± 1.9 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Proposed solution with set intersection in list comprehension:

%%timeit
print([len(set(row).intersection(vec)) for row in matr])
#Output:
[3,2,1]
17.8 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And if the vec is also a set, we get even better efficiency:

%%timeit
vec = {'a', 'c', 'f', 'b'}
print([len(set(row).intersection(vec)) for row in matr])
#Output:
[3, 2, 1]
16.6 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)



ANSWER 3

Score 1


Here's a simple readable solution with np.isin() (docs):

np.sum(np.isin(matr, vec), axis=1)

As a bonus, you can just use np.isin() without the summing if you want to get which elements of the matrix are in the vectors:

>>> np.isin(matr, vec)
array([[ True,  True,  True, False, False],
       [ True, False, False,  True, False],
       [ True, False, False, False, False]])

which shows why summing along the rows produces the desired output.




ANSWER 4

Score 1


Let's look at the speed of your current algorithm. According to the python wiki, checking if an item is in an array like y in vec is O(n), meaning worst case, it has to go through every element in vec. Since you're doing that check for every element of your matrix, your total number of operations is numRows * numCols * vecLen, which is O(n^3).

A faster way to do it is to construct a dictionary for vec to optimize lookups because dictionaries are O(1) instead of O(n), meaning they can do your check in 1 operation, no matter how long vec is:

vecDict = dict([(x, 1) for x in vec])

So, your new time complexity is (numRows * numCols) + vecLen, which is O(n^2), which I think is as fast as you can get for your data.

[sum([y in vecDict for y in row]) for row in matr]