how do you filter pandas dataframes by multiple columns
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Switch On Looping
--
Chapters
00:00 How Do You Filter Pandas Dataframes By Multiple Columns
00:41 Accepted Answer Score 298
01:06 Answer 2 Score 29
01:32 Answer 3 Score 41
01:45 Answer 4 Score 38
02:29 Thank you
--
Full question
https://stackoverflow.com/questions/2208...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #filter #pandas
#avk47
ACCEPTED ANSWER
Score 298
Using & operator, don't forget to wrap the sub-statements with ():
males = df[(df[Gender]=='Male') & (df[Year]==2014)]
To store your DataFrames in a dict using a for loop:
from collections import defaultdict
dic={}
for g in ['male', 'female']:
    dic[g]=defaultdict(dict)
    for y in [2013, 2014]:
        dic[g][y]=df[(df[Gender]==g) & (df[Year]==y)] #store the DataFrames to a dict of dict
A demo for your getDF:
def getDF(dic, gender, year):
    return dic[gender][year]
print genDF(dic, 'male', 2014)
ANSWER 2
Score 41
Start from pandas 0.13, this is the most efficient way.
df.query('Gender=="Male" & Year=="2014" ')
ANSWER 3
Score 38
In case somebody wonders what is the faster way to filter (the accepted answer or the one from @redreamality):
import pandas as pd
import numpy as np
length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)
%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]
Results for 100,000 rows:
6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results for 10,000,000 rows:
326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So results depend on the size and the data. On my laptop, query() gets faster after 500k rows. Further, the string search in Year=="2014" has an unnecessary overhead (Year==2014 is faster).
ANSWER 4
Score 29
For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:
df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]
where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).