Using Pandas how do I deduplicate a file being read in chunks?
Hire the world's top talent on demand or become one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: City Beneath the Waves Looping
--
Chapters
00:00 Using Pandas How Do I Deduplicate A File Being Read In Chunks?
01:03 Accepted Answer Score 6
02:11 Thank you
--
Full question
https://stackoverflow.com/questions/3065...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #chunking
#avk47
ACCEPTED ANSWER
Score 6
My solution was to bring in just the columns needed to find the duplicates I want to drop and build a bitmask from that information. Then, knowing the chunksize and which chunk I'm on, I reindex the current chunk so its index matches the positions it represents in the bitmask. Finally I filter the chunk through the bitmask and the duplicate rows are dropped.
Bring in the entire column to deduplicate on, in this case 'id'. Then create a bitmask of the rows that AREN'T duplicates. DataFrame.duplicated() returns a boolean Series marking the rows that are duplicates, and the ~ inverts it. Now we have our 'dupemask'.
dupemask = ~df.duplicated(subset=['id'])
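A minimal sketch of that first pass, assuming the data lives in a CSV file (the file name and the pd.read_csv call are illustrative, not part of the original answer):

import pandas as pd

# Read only the deduplication key so the whole column fits in memory;
# 'data.csv' is a placeholder path, not from the original answer.
df = pd.read_csv('data.csv', usecols=['id'])

# True for the first occurrence of each id, False for later duplicates.
dupemask = ~df.duplicated(subset=['id'])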
Then create an iterator to read the file in chunks. Once that is done, loop over the iterator and create a new index for each chunk. This new index aligns the small chunk DataFrame with its position in the 'dupemask' bitmask, which we can then use to keep only the rows that aren't duplicates.
for i, df in enumerate(chunked_data_iterator):
    # Shift the chunk's index to its absolute position in the full file.
    df.index = range(i * chunksize, i * chunksize + len(df.index))
    # Keep only the rows the bitmask marks as non-duplicates.
    df = df[dupemask]
This approach only works in this case because the data is large only because it is so wide. It still has to read an entire column into memory in order to work.
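Putting the pieces together, here is a hedged end-to-end sketch. It assumes a CSV input and output ('data.csv' and 'deduped.csv' are placeholder paths), and it assumes chunked_data_iterator comes from pd.read_csv with chunksize; none of those specifics are stated in the original answer.

import pandas as pd

chunksize = 100_000
infile, outfile = 'data.csv', 'deduped.csv'  # illustrative paths

# Pass 1: read only the 'id' column and mark the rows to keep.
ids = pd.read_csv(infile, usecols=['id'])
dupemask = ~ids.duplicated(subset=['id'])

# Pass 2: stream the full file in chunks, align each chunk with the mask,
# and append only the non-duplicate rows to the output file.
chunked_data_iterator = pd.read_csv(infile, chunksize=chunksize)
for i, df in enumerate(chunked_data_iterator):
    # Reindex so the chunk's rows line up with their positions in dupemask.
    df.index = range(i * chunksize, i * chunksize + len(df.index))
    df = df[dupemask]
    df.to_csv(outfile, mode='w' if i == 0 else 'a', header=(i == 0), index=False)

Boolean indexing with df[dupemask] works here because pandas aligns the mask on the chunk's reindexed labels, so only the positions belonging to the current chunk are consulted.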