The Python Oracle

How to filter Pandas dataframe using 'in' and 'not in' like in SQL

Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn

--

Music by Eric Matyas
https://www.soundimage.org
Track title: Mysterious Puzzle

--

Chapters
00:00 Question
00:51 Accepted answer (Score 1370)
01:30 Answer 2 (Score 144)
01:48 Answer 3 (Score 88)
06:13 Answer 4 (Score 18)
06:29 Thank you

--

Full question
https://stackoverflow.com/questions/1996...

Accepted answer links:
[pd.Series.isin]: https://pandas.pydata.org/pandas-docs/st...

Answer 2 links:
[.query()]: http://pandas.pydata.org/pandas-docs/sta...

Answer 3 links:
[Series.isin]: https://pandas.pydata.org/pandas-docs/st...
[DataFrame.isin]: https://pandas.pydata.org/pandas-docs/st...
[numpy.isin]: https://docs.scipy.org/doc/numpy/referen...
[There is a lot of evidence to suggest that list comprehensions will be faster here.]: https://stackoverflow.com/questions/5402...
[this answer]: https://stackoverflow.com/a/45190397/490...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #dataframe #sqlfunction

#avk47



ACCEPTED ANSWER

Score 1524


You can use pd.Series.isin.

For "IN" use: something.isin(somewhere)

Or for "NOT IN": ~something.isin(somewhere)

As a worked example:

>>> df
    country
0        US
1        UK
2   Germany
3     China
>>> countries_to_keep
['UK', 'China']
>>> df.country.isin(countries_to_keep)
0    False
1     True
2    False
3     True
Name: country, dtype: bool
>>> df[df.country.isin(countries_to_keep)]
    country
1        UK
3     China
>>> df[~df.country.isin(countries_to_keep)]
    country
0        US
2   Germany



ANSWER 2

Score 172


Alternative solution that uses .query() method:

In [5]: df.query("country in @countries_to_keep")
Out[5]:
  countries
1        UK
3     China

In [6]: df.query("country not in @countries_to_keep")
Out[6]:
  countries
0        US
2   Germany



ANSWER 3

Score 97


How to implement 'in' and 'not in' for a pandas DataFrame?

Pandas offers two methods: Series.isin and DataFrame.isin for Series and DataFrames, respectively.


Filter DataFrame Based on ONE Column (also applies to Series)

The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3     China

c1 = ['UK', 'China']             # list
c2 = {'Germany'}                 # set
c3 = pd.Series(['China', 'US'])  # Series
c4 = np.array(['US', 'UK'])      # array

Series.isin accepts various types as inputs. The following are all valid ways of getting what you want:

df['countries'].isin(c1)

0    False
1     True
2    False
3    False
4     True
Name: countries, dtype: bool

# `in` operation
df[df['countries'].isin(c1)]

  countries
1        UK
4     China

# `not in` operation
df[~df['countries'].isin(c1)]

  countries
0        US
2   Germany
3       NaN

# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]

  countries
2   Germany

# Filter with another Series
df[df['countries'].isin(c3)]

  countries
0        US
4     China

# Filter with array
df[df['countries'].isin(c4)]

  countries
0        US
1        UK

Filter on MANY Columns

Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns,

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2

   A    B  C
0  x    w  0
1  y    a  1
2  z  NaN  2
3  q    x  3

c1 = ['x', 'w', 'p']

To apply the isin condition to both columns "A" and "B", use DataFrame.isin:

df2[['A', 'B']].isin(c1)

      A      B
0   True   True
1  False  False
2  False  False
3  False   True

From this, to retain rows where at least one column is True, we can use any along the first axis:

df2[['A', 'B']].isin(c1).any(axis=1)

0     True
1    False
2    False
3     True
dtype: bool

df2[df2[['A', 'B']].isin(c1).any(axis=1)]

   A  B  C
0  x  w  0
3  q  x  3

Note that if you want to search every column, you'd just omit the column selection step and do

df2.isin(c1).any(axis=1)

Similarly, to retain rows where ALL columns are True, use all in the same manner as before.

df2[df2[['A', 'B']].isin(c1).all(axis=1)]

   A  B  C
0  x  w  0

Notable Mentions: numpy.isin, query, list comprehensions (string data)

In addition to the methods described above, you can also use the numpy equivalent: numpy.isin.

# `in` operation
df[np.isin(df['countries'], c1)]

  countries
1        UK
4     China

# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]

  countries
0        US
2   Germany
3       NaN

Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.

Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here.. We resort to an in check now.

c1_set = set(c1) # Using `in` with `sets` is a constant time operation... 
                 # This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]

  countries
1        UK
4     China

# `not in` operation
df[[x not in c1_set for x in df['countries']]]

  countries
0        US
2   Germany
3       NaN

It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.

Lastly, there's also DataFrame.query which has been covered in this answer. numexpr FTW!




ANSWER 4

Score 20


I've been usually doing generic filtering over rows like this:

criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]