Filter rows of a numpy array?

This video explains
Filter rows of a numpy array?

--

Become part of the top 3% of the developers by applying to Toptal
https://topt.al/25cXVn

--

Music by Eric Matyas
https://www.soundimage.org
Track title: City Beneath the Waves Looping

--

Chapters
00:00 Question
00:52 Accepted answer (Score 41)
02:02 Thank you

--

Full question
https://stackoverflow.com/questions/2615...

Accepted answer links:
[boolean indexing]: http://docs.scipy.org/doc/numpy/user/bas...
[np.apply_along_axis]: http://docs.scipy.org/doc/numpy/referenc...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #numpy #filter

#avk47

ACCEPTED ANSWER

Score 41

Ideally, you would be able to implement a vectorized version of your function and use that to do boolean indexing. For the vast majority of problems this is the right solution. Numpy provides quite a few functions that can act over various axes as well as all the basic operations and comparisons, so most useful conditions should be vectorizable.

import numpy as np

x = np.random.randn(20, 3)
x_new = x[np.sum(x, axis=1) > .5]

If you are absolutely sure that you can't do the above, I would suggest using a list comprehension (or np.apply_along_axis) to create an array of bools to index with.

def myfunc(row):
    return sum(row) > .5

bool_arr = np.array([myfunc(row) for row in x])
x_new = x[bool_arr]

This will get the job done in a relatively clean way, but will be significantly slower than a vectorized version. An example:

x = np.random.randn(5000, 200)

%timeit x[np.sum(x, axis=1) > .5]
# 100 loops, best of 3: 5.71 ms per loop

%timeit x[np.array([myfunc(row) for row in x])]
# 1 loops, best of 3: 217 ms per loop

ANSWER 2

Score 0

As @Roger Fan mentioned, applying a function row-wise should really be done in a vectorized fashion on the entire array. The canonical way to filter is to construct a boolean mask and apply it on the array. That said, if it happens that the function is so complex that vectorization is not possible, it's better/faster to convert the array into a Python list (especially if it uses Python functions such as sum()) and apply the function on it.

msk = arr.sum(axis=1)>10                # best way to create a boolean mask

msk = [f(row) for row in arr.tolist()]  # second best way
#                            ^^^^^^^^   <---- convert to list

filtered_arr = arr[msk]                 # filtered via boolean indexing

A working example and a performance test

As you can see from the timeit test below, looping over a list (arr.tolist()) is much faster than looping over a numpy array (arr), partly because Python's sum() and not np.sum() is called in the function f(). That said, the vectorized method is much faster than both.

def f(row):
    if sum(row)>10: return True
    else: return False
    
arr = np.random.rand(10000, 200)

%timeit arr[[f(row) for row in arr]]
# 260 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr[[f(row) for row in arr.tolist()]]
# 114 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr[arr.sum(axis=1)>10]
# 10.8 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)