How do I get a list of all the duplicate items using pandas in python?
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Puzzling Curiosities
--
Chapters
00:00 Question
01:47 Accepted answer (Score 279)
02:36 Answer 2 (Score 247)
03:09 Answer 3 (Score 232)
03:43 Answer 4 (Score 26)
04:12 Thank you
--
Full question
https://stackoverflow.com/questions/1465...
Question links:
[duplicated method]: http://pandas.pydata.org/pandas-docs/dev...
Answer 1 links:
[duplicated]: http://pandas.pydata.org/pandas-docs/sta...
Answer 2 links:
[documentation]: http://pandas.pydata.org/pandas-docs/sta...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #duplicates
#avk47
--
ACCEPTED ANSWER
Score 303
Method #1: print all rows where the ID is one of the IDs flagged by duplicated():
>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
       ID  ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR  FIRST_VISIT_DATE
24  11795        27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH         19-Jun-12
6   11795         3-Jul-12  0649597-White River VT  0649597-White River VT         30-Mar-12
18   8096        19-Dec-11  0649597-White River VT  0649597-White River VT          9-Apr-12
2    8096         8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH         25-Jun-12
12   A036        30-Nov-11     063B208-Randolph VT     063B208-Randolph VT               NaN
3    A036         1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH          9-Aug-12
26   A036        11-Aug-12      06D3206-Hanover NH                     NaN         19-Jun-12
This works, but I couldn't think of a nice way to avoid repeating ids so many times. I prefer method #2: group by the ID and keep only the groups with more than one row.
>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID  ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR  FIRST_VISIT_DATE
6   11795         3-Jul-12  0649597-White River VT  0649597-White River VT         30-Mar-12
24  11795        27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH         19-Jun-12
2    8096         8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH         25-Jun-12
18   8096        19-Dec-11  0649597-White River VT  0649597-White River VT          9-Apr-12
3    A036         1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH          9-Aug-12
12   A036        30-Nov-11     063B208-Randolph VT     063B208-Randolph VT               NaN
26   A036        11-Aug-12      06D3206-Hanover NH                     NaN         19-Jun-12
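Both methods can be tried without dup.csv. The following is a minimal sketch on a made-up frame (the IDs and the VISIT column are invented for illustration, not taken from the original data). It also shows groupby(...).filter as a possible one-step alternative to the concat; that variant is an addition here, and it leaves rows in their original order instead of grouping them by ID.

```python
import pandas as pd

# Made-up stand-in for dup.csv; only the ID column matters here.
df = pd.DataFrame({
    "ID": ["11795", "8096", "A036", "B777", "8096", "A036", "11795", "A036"],
    "VISIT": [1, 2, 3, 4, 5, 6, 7, 8],
})

ids = df["ID"]

# Method #1: keep rows whose ID is among the IDs that occur more than once.
method1 = df[ids.isin(ids[ids.duplicated()])].sort_values("ID")

# Method #2: concatenate every group that has more than one row.
method2 = pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)

# Alternative (not from the answer): filter groups by size.
# Rows keep their original order rather than being grouped by ID.
filtered = df.groupby("ID").filter(lambda g: len(g) > 1)

print(method2)
```

All three select the same rows; only the ordering differs. The unique ID "B777" is dropped in every case.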
ANSWER 2
Score 286
From pandas 0.17 onward, you can pass keep=False to duplicated() to flag every occurrence of a duplicated item, not just the repeats.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])
In [3]: df
Out[3]:
   0
0  a
1  b
2  c
3  d
4  a
5  b
In [4]: df[df.duplicated(keep=False)]
Out[4]:
   0
0  a
1  b
4  a
5  b
ANSWER 3
Score 270
df[df.duplicated(['ID'], keep=False)]
This returns every row whose ID appears more than once, including the first occurrence.
According to the documentation:
keep: {‘first’, ‘last’, False}, default ‘first’
- 'first' : Mark duplicates as True except for the first occurrence.
- 'last' : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
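The three keep options above can be compared side by side. This is a small sketch on a made-up Series (the values are arbitrary), printing the boolean mask each option produces.

```python
import pandas as pd

# Made-up data: "a" occurs three times, "b" twice, "c" once.
s = pd.Series(["a", "b", "c", "a", "b", "a"])

first = s.duplicated(keep="first")  # first occurrence of each value is not flagged
last = s.duplicated(keep="last")    # last occurrence of each value is not flagged
every = s.duplicated(keep=False)    # every occurrence of a repeated value is flagged

print(first.tolist())  # [False, False, False, True, True, True]
print(last.tolist())   # [True, True, False, True, False, False]
print(every.tolist())  # [True, True, False, True, True, True]
```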
ANSWER 4
Score 19
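df[df['ID'].duplicated()]
This worked for me. Note that with the default keep='first' it returns only the second and later occurrences of each ID, not the first one.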
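To see which rows this expression actually selects, here is a short sketch on hypothetical data (the ID and VALUE columns are invented) that compares it with the keep=False form.

```python
import pandas as pd

# Hypothetical data: ID 1 appears three times, ID 2 twice, ID 3 once.
df = pd.DataFrame({"ID": [1, 2, 1, 3, 2, 1], "VALUE": list("abcdef")})

# Default keep="first": only the repeats after the first occurrence.
repeats_only = df[df["ID"].duplicated()]

# keep=False: every row whose ID occurs more than once.
all_copies = df[df["ID"].duplicated(keep=False)]

print(repeats_only.index.tolist())  # [2, 4, 5]
print(all_copies.index.tolist())    # [0, 1, 2, 4, 5]
```

If the goal is the full list of duplicated rows, the second form is likely the one wanted.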