The Python Oracle

doing better than numpy's in1d mask function: ordered arrays?

--------------------------------------------------
Hire the world's top talent on demand or became one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Realization

--

Chapters
00:00 Doing Better Than Numpy'S In1d Mask Function: Ordered Arrays?
01:42 Answer 1 Score 1
02:30 Accepted Answer Score 3
03:00 Thank you

--

Full question
https://stackoverflow.com/questions/1527...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #performance #sorting #numpy #mask

#avk47



ACCEPTED ANSWER

Score 3


I suggest you use DataFrame in Pandas. the index of the DataFrame is the totalIDs, and you can select subsampleIDs by: df.ix[subsampleIDs].

Create some test data first:

import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))

subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)

Then convert you data in to a DataFrame:

import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs) 
df.ix[subsampleIDs]

DataFrame use a hashtable to map the index to it's location, it's very fast.




ANSWER 2

Score 1


Often this kind of indexing is best performed using a DB (with proper column-indexing).

Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.

EDIT:

Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs, and M the size of sampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs to buckets only once, which gives you a nifty O(N+M1+M2+...).

Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python