The Python Oracle

count occurences in each dataframe row then create column with most frequent

--------------------------------------------------
Hire the world's top talent on demand or became one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Thinking It Over

--

Chapters
00:00 Count Occurences In Each Dataframe Row Then Create Column With Most Frequent
00:57 Answer 1 Score 2
02:06 Accepted Answer Score 5
02:26 Thank you

--

Full question
https://stackoverflow.com/questions/4781...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #vectorization

#avk47



ACCEPTED ANSWER

Score 5


We can use mode...

from scipy import stats


value,count=stats.mode(df.values,axis=1)
value
Out[180]: 
array([[1],
       [3],
       [3],
       [4],
       [5]], dtype=int64)


count
Out[181]: 
array([[2],
       [3],
       [2],
       [2],
       [2]])

After assign it back

df['new']=value
df
Out[183]: 
   a  b  c  new
0  1  1  2    1
1  3  3  3    3
2  1  3  3    3
3  4  5  4    4
3  4  5  5    5



ANSWER 2

Score 2


Here's a fast approach I learnt from @coldspeed i.e

dummies = pd.get_dummies(df.astype(str)).groupby(by=lambda x: x.split('_')[1], axis=1).sum()

df['new'] = dummies.idxmax(1)

   a  b  c new
0  1  1  2   1
1  3  3  3   3
2  1  3  3   3
3  4  5  4   4
3  4  5  5   5

Explanation :

We can get the one hot encoding of the items present in each column using pd.get_dummies, since get_dummies wont take numbers we have to convert them to strings.

pd.get_dummies(df.astype(str))

   a_1  a_3  a_4  b_1  b_3  b_5  c_2  c_3  c_4  c_5
0    1    0    0    1    0    0    1    0    0    0
1    0    1    0    0    1    0    0    1    0    0
2    1    0    0    0    1    0    0    1    0    0
3    0    0    1    0    0    1    0    0    1    0
3    0    0    1    0    0    1    0    0    0    1

Now if you group only the numbers in the column and sum them we can get the value counts for each row. i.e

   1  2  3  4  5
0  2  1  0  0  0
1  0  0  3  0  0
2  1  0  2  0  0
3  0  0  0  2  1
3  0  0  0  1  2

Using idxmax(axis=1) on the one hot encoding will get you the column names which is the required maximum repeated number in the row.

0    1
1    3
2    3
3    4
3    5
dtype: object

Edit:

If you have strings in your dataframe, then go for get_dummies that would be faster than anything, if you have numbers then you have to go for scipy mode or pandas mode