The Python Oracle

Remove duplicates by columns A, keeping the row with the highest value in column B

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Life in a Drop

--

Chapters
00:00 Remove Duplicates By Columns A, Keeping The Row With The Highest Value In Column B
00:37 Accepted Answer Score 381
01:02 Answer 2 Score 162
01:36 Answer 3 Score 58
01:58 Answer 4 Score 34
02:19 Answer 5 Score 12
02:29 Thank you

--

Full question
https://stackoverflow.com/questions/1249...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #duplicates #pandas

#avk47



ACCEPTED ANSWER

Score 383


This takes the last. Not the maximum though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10



ANSWER 2

Score 165


The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()




ANSWER 3

Score 60


Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')



ANSWER 4

Score 12


Try this:

df.groupby(['A']).max()