Computing Jaccard Similarity in Python
Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Cosmic Puzzle
--
Chapters
00:00 Question
01:24 Accepted answer (Score 6)
03:27 Answer 2 (Score 3)
03:46 Thank you
--
Full question
https://stackoverflow.com/questions/4057...
Accepted answer links:
[np.einsum]: https://docs.scipy.org/doc/numpy/referen...
[related Q&A]: https://stackoverflow.com/questions/9939...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #performance #numpy #vectorization #datamining
#avk47
--
ACCEPTED ANSWER
Score 6
Here's a vectorized approach -
# Get the row, col indices that are to be set in output array
r,c = np.tril_indices(ndocs,-1)
# Use those indices to slice out respective columns
p1 = rawdata[:,c]
p2 = rawdata[:,r]
# Perform n11 and n00 vectorized computations across all indexed columns
n11v = ((p1==1) & (p2==1)).sum(0)
n00v = ((p1==0) & (p2==0)).sum(0)
# Finally, setup output array and set final division computations
out = np.eye(ndocs)
out[c,r] = n11v / (nfeats-n00v)
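To see the recipe end to end, here is a minimal self-contained sketch on a tiny random 0/1 matrix (assumptions: `rawdata` has shape `(nfeats, ndocs)` with one document per column, and the seed/sizes are illustrative only), cross-checking one pair against the textbook |A ∩ B| / |A ∪ B| definition:

```python
import numpy as np

# Tiny reproducible inputs (assumption: 0/1 data, one document per column)
rng = np.random.RandomState(0)
nfeats, ndocs = 6, 4
rawdata = (rng.rand(nfeats, ndocs) > 0.5).astype(int)

# The vectorized recipe from above
r, c = np.tril_indices(ndocs, -1)
p1 = rawdata[:, c]
p2 = rawdata[:, r]
n11v = ((p1 == 1) & (p2 == 1)).sum(0)
n00v = ((p1 == 0) & (p2 == 0)).sum(0)
out = np.eye(ndocs)
out[c, r] = n11v / (nfeats - n00v)

# Cross-check one pair by hand: |A & B| / |A | B| for columns 0 and 1
a = rawdata[:, 0].astype(bool)
b = rawdata[:, 1].astype(bool)
manual = (a & b).sum() / (a | b).sum()
print(np.isclose(out[0, 1], manual))  # → True
```

Note that since `r > c` for lower-triangle indices, `out[c, r]` fills the upper triangle; mirror it with `out[r, c] = out[c, r]` if you need the full symmetric matrix.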
Alternative way to compute n11v and n00v with np.einsum -
n11v = np.einsum('ij,ij->j',(p1==1),(p2==1).astype(int))
n00v = np.einsum('ij,ij->j',(p1==0),(p2==0).astype(int))
If rawdata consists of 0s and 1s only, a simpler way to get them would be -
n11v = np.einsum('ij,ij->j',p1,p2)
n00v = np.einsum('ij,ij->j',1-p1,1-p2)
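As a quick sanity check that the einsum variants match the boolean-mask counts (a sketch on stand-in random 0/1 arrays, since for 0/1 data the elementwise product of two columns is 1 exactly where both entries are 1):

```python
import numpy as np

# Stand-ins for the sliced column blocks p1 and p2 (assumption: 0/1 ints)
rng = np.random.RandomState(1)
p1 = (rng.rand(8, 5) > 0.5).astype(int)
p2 = (rng.rand(8, 5) > 0.5).astype(int)

# Boolean-mask route
n11_mask = ((p1 == 1) & (p2 == 1)).sum(0)
n00_mask = ((p1 == 0) & (p2 == 0)).sum(0)

# einsum route: products of 0/1 entries count co-occurrences directly
n11_ein = np.einsum('ij,ij->j', p1, p2)
n00_ein = np.einsum('ij,ij->j', 1 - p1, 1 - p2)

print(np.array_equal(n11_mask, n11_ein))  # → True
print(np.array_equal(n00_mask, n00_ein))  # → True
```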
Benchmarking
Function definitions -
def original_app(rawdata, ndocs, nfeats):
    tru_sim = np.zeros((ndocs,ndocs))
    for i in range(0,ndocs):
        tru_sim[i,i] = 1
        for j in range(i+1,ndocs):
            tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])
    return tru_sim
def vectorized_app(rawdata, ndocs, nfeats):
    r,c = np.tril_indices(ndocs,-1)
    p1 = rawdata[:,c]
    p2 = rawdata[:,r]
    n11v = ((p1==1) & (p2==1)).sum(0)
    n00v = ((p1==0) & (p2==0)).sum(0)
    out = np.eye(ndocs)
    out[c,r] = n11v / (nfeats-n00v)
    return out
Verification and timings -
In [6]: # Setup inputs
...: rawdata = (np.random.rand(20,10000)>0.2).astype(int)
...: rawdata = np.transpose(rawdata)
...: ndocs = rawdata.shape[1]
...: nwords = rawdata.shape[0]
...: nfeats = 5
...:
In [7]: # Verify results
...: out1 = original_app(rawdata, ndocs, nfeats)
...: out2 = vectorized_app(rawdata, ndocs, nfeats)
...: print(np.allclose(out1,out2))
...:
True
In [8]: %timeit original_app(rawdata, ndocs, nfeats)
1 loops, best of 3: 8.72 s per loop
In [9]: %timeit vectorized_app(rawdata, ndocs, nfeats)
10 loops, best of 3: 27.6 ms per loop
Some magical 300x+ speedup there!
So, why is it so fast? Many factors are involved, the most important being that NumPy arrays are built for performance and optimized for vectorized computations. The proposed approach harnesses that quite well, hence the speedup.
Here's one related Q&A that talks about these performance criteria in detail.
ANSWER 2
Score 3
To compute Jaccard, use:
def jaccard(x, y):
    x = np.asarray(x, dtype=bool)  # Not necessary, if you keep your data
    y = np.asarray(y, dtype=bool)  # in a boolean array already!
    return np.double(np.bitwise_and(x, y).sum()) / np.double(np.bitwise_or(x, y).sum())
print(jaccard([1,1,0,0,0], [0,1,0,0,1]))
>>> 0.3333333333333333
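The two answers compute the same quantity: for 0/1 vectors, |A ∩ B| is the n11 count and |A ∪ B| is nfeats − n00, so this per-pair function agrees with the accepted answer's formula. A quick numeric check (a sketch assuming NumPy only):

```python
import numpy as np

def jaccard(x, y):
    # Cast to boolean so bitwise ops act as set intersection/union
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    return np.bitwise_and(x, y).sum() / np.bitwise_or(x, y).sum()

x = np.array([1, 1, 0, 0, 0])
y = np.array([0, 1, 0, 0, 1])
nfeats = len(x)
n11 = ((x == 1) & (y == 1)).sum()   # both present
n00 = ((x == 0) & (y == 0)).sum()   # both absent
print(jaccard(x, y), n11 / (nfeats - n00))  # → 0.3333... 0.3333...
```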