count occurrences in each dataframe row then create column with most frequent
--------------------------------------------------
Hire the world's top talent on demand or become one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Thinking It Over
--
Chapters
00:00 Count Occurrences In Each Dataframe Row Then Create Column With Most Frequent
00:57 Answer 1 Score 2
02:06 Accepted Answer Score 5
02:26 Thank you
--
Full question
https://stackoverflow.com/questions/4781...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #vectorization
#avk47
ACCEPTED ANSWER
Score 5
We can use scipy.stats.mode:
from scipy import stats

# keepdims=True is required in SciPy >= 1.9 to get the (n, 1) arrays shown below
value, count = stats.mode(df.values, axis=1, keepdims=True)
value
Out[180]:
array([[1],
[3],
[3],
[4],
[5]], dtype=int64)
count
Out[181]:
array([[2],
[3],
[2],
[2],
[2]])
Then assign it back:
df['new']=value
df
Out[183]:
a b c new
0 1 1 2 1
1 3 3 3 3
2 1 3 3 3
3 4 5 4 4
3 4 5 5 5
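For reference, a runnable end-to-end sketch of this answer; the sample frame is reconstructed from the outputs shown above (note the duplicated index 3), and `keepdims=True` is assumed for SciPy >= 1.9:

```python
import pandas as pd
from scipy import stats

# Sample frame reconstructed from the outputs shown above
df = pd.DataFrame({'a': [1, 3, 1, 4, 4],
                   'b': [1, 3, 3, 5, 5],
                   'c': [2, 3, 3, 4, 5]},
                  index=[0, 1, 2, 3, 3])

# Row-wise mode; keepdims=True preserves the (n, 1) shape on SciPy >= 1.9
value, count = stats.mode(df.values, axis=1, keepdims=True)

# Flatten to 1-D before assigning as a column
df['new'] = value.ravel()
```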
ANSWER 2
Score 2
Here's a fast approach I learnt from @coldspeed:
dummies = pd.get_dummies(df.astype(str)).groupby(by=lambda x: x.split('_')[1], axis=1).sum()
df['new'] = dummies.idxmax(1)
a b c new
0 1 1 2 1
1 3 3 3 3
2 1 3 3 3
3 4 5 4 4
3 4 5 5 5
Explanation:
We can get the one-hot encoding of the items present in each column using pd.get_dummies; since get_dummies won't take numbers, we have to convert them to strings first.
pd.get_dummies(df.astype(str))
a_1 a_3 a_4 b_1 b_3 b_5 c_2 c_3 c_4 c_5
0 1 0 0 1 0 0 1 0 0 0
1 0 1 0 0 1 0 0 1 0 0
2 1 0 0 0 1 0 0 1 0 0
3 0 0 1 0 0 1 0 0 1 0
3 0 0 1 0 0 1 0 0 0 1
Now if you group the columns by the number after the underscore and sum them, we get the value counts for each row:
1 2 3 4 5
0 2 1 0 0 0
1 0 0 3 0 0
2 1 0 2 0 0
3 0 0 0 2 1
3 0 0 0 1 2
Using idxmax(axis=1) on these counts gives you the column names, i.e. the most frequently repeated number in each row.
0 1
1 3
2 3
3 4
3 5
dtype: object
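Note that `groupby(..., axis=1)` is deprecated in recent pandas; the same column-suffix grouping can be done by transposing first. A sketch under that assumption, using the same sample frame:

```python
import pandas as pd

# Same sample frame as in the question (duplicated index 3 as shown)
df = pd.DataFrame({'a': [1, 3, 1, 4, 4],
                   'b': [1, 3, 3, 5, 5],
                   'c': [2, 3, 3, 4, 5]},
                  index=[0, 1, 2, 3, 3])

dummies = pd.get_dummies(df.astype(str))
# Group columns by the value after the underscore via a transpose,
# avoiding the deprecated groupby(..., axis=1)
counts = dummies.T.groupby(lambda c: c.split('_')[1]).sum().T

# idxmax yields the column labels, which are strings here, so cast back to int
df['new'] = counts.idxmax(axis=1).astype(int).values
```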
Edit:
If you have strings in your dataframe, go for get_dummies; it will be faster than anything else. If you have numbers, use scipy mode or pandas mode.
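For the numeric case, the pandas mode mentioned here is a one-liner; a minimal sketch, again assuming the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 1, 4, 4],
                   'b': [1, 3, 3, 5, 5],
                   'c': [2, 3, 3, 4, 5]},
                  index=[0, 1, 2, 3, 3])

# DataFrame.mode(axis=1) returns one column per tied mode; take the first
df['new'] = df.mode(axis=1).iloc[:, 0].values
```

Unlike the get_dummies route, this keeps the values numeric, at the cost of being slower on wide string data.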