Merge two DataFrames with some equal columns


Music by Eric Matyas
https://www.soundimage.org
Track title: The Builders

--

Chapters
00:00 Merge Two Dataframes With Some Equal Columns
02:08 Answer 1 Score 3
03:57 Accepted Answer Score 3
04:16 Thank you

--

Full question
https://stackoverflow.com/questions/2474...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #csv #pandas

ANSWER 1

Score 3

Edited so that it adds rows and columns as well as updating data, merging efficiently on the indexes.

Code to update your df1 with the data from df2:

from io import StringIO  # Python 3; the original answer used the Python 2 StringIO module
import pandas as pd

df1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

df2 = """id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

# Set the index in read_csv rather than afterwards
df1 = pd.read_csv(StringIO(df1), sep=",", index_col='id')
df2 = pd.read_csv(StringIO(df2), sep=",", index_col='id')

SOLUTION

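df = pd.merge(df2, df1, how='outer', on=df1.columns.tolist(), left_index=True, right_index=True)
# joined on the indexes for speed
# Note: recent pandas releases raise a MergeError when on is combined with
# left_index/right_index, so this call needs the older pandas the answer was written for.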

OUTPUT

>>> print(df)

        noteId                   text other
id                                         
id1   idNote12  This is my old text 1   NaN
id2   idNote19          My new text 2  Pre8
id22  idNote22         My new text 22    M1
id3   idNote10          My new text 3   On1
id4   idNote11  This is my old text 4   NaN
id5        NaN          My new text 2   Hl0
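
For pandas versions that do not accept the merge call above, the following is a minimal sketch that should reproduce the same table. It swaps in a different technique (pd.concat followed by dropping duplicated index labels) and is not the method from the answer. The names old_csv, new_csv and stacked are made up for illustration.

```python
from io import StringIO
import pandas as pd

old_csv = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

new_csv = """id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(StringIO(old_csv), index_col="id")
df2 = pd.read_csv(StringIO(new_csv), index_col="id")

# Assumption: a row in df2 replaces the whole df1 row with the same id,
# including any NaN it carries (as in the id5 row of the output above).
# Stack the newer rows first, then keep only the first row seen per id.
stacked = pd.concat([df2, df1])
df = stacked[~stacked.index.duplicated(keep="first")].sort_index()
print(df)
```

If NaN values in df2 should instead fall back to the old df1 value, df2.combine_first(df1) may be the better fit.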

Reason it works...

pd.merge has a couple of multipurpose parameters. The on argument is only used as the join key when left_index and right_index are left at False, the default. Otherwise the join is done on the indexes, and on just names the identically named columns to combine, in this case 'text' and 'noteId'. (I made it more general by passing df1.columns.tolist(): any identically named column in df2 overwrites the data from df1 instead of being suffixed as text_y.)

With the more general on value (df1.columns.tolist()) you can loop through a bunch of CSVs, updating the data in the dataframe as you go.
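
Looping over several CSVs could look roughly like the sketch below. The overlay helper and the inline CSV strings are hypothetical stand-ins (real code would read file paths), and the helper uses concat plus index de-duplication rather than the merge call from this answer, on the assumption that later files should win.

```python
from io import StringIO
import pandas as pd

def overlay(current, newer):
    """Return current with rows from newer replacing rows that share an index label."""
    stacked = pd.concat([newer, current])
    return stacked[~stacked.index.duplicated(keep="first")].sort_index()

# Hypothetical update files, oldest first.
csv_updates = [
    "id,noteId,text\nid1,idNote12,old 1\nid2,idNote19,old 2",
    "id,noteId,text,other\nid2,idNote19,new 2,Pre8",
    "id,noteId,text,other\nid3,idNote10,new 3,On1\nid1,idNote12,newer 1,M1",
]

result = None
for raw in csv_updates:
    frame = pd.read_csv(StringIO(raw), index_col="id")
    # The first file seeds the result; every later file is laid over it.
    result = frame if result is None else overlay(result, frame)

print(result)
```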

About 3x faster than the accepted solution:

In [25]: %timeit pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
1000 loops, best of 3: 1.11 ms per loop

Accepted solution:

In [30]: %timeit pd.concat([df1, df2]).groupby('noteId').last().fillna(value='None')
100 loops, best of 3: 3.29 ms per loop

ACCEPTED ANSWER

Score 3

Usually you can solve this with the proper index:

df1.set_index(['id', 'noteId'], inplace=True)
df1.update(df2)

(And if you don't want that index after, just df1.reset_index(inplace=True))
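
A small end-to-end sketch of this approach, on a trimmed version of the sample data. It assumes df2 is given the same ['id', 'noteId'] index as df1 so that update can align the rows; the answer only shows the index being set on df1. The names old_csv and new_csv are illustrative.

```python
from io import StringIO
import pandas as pd

old_csv = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1"""

new_csv = """id,noteId,text,other
id2,idNote19,My new text 2,Pre8
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(StringIO(old_csv))
df2 = pd.read_csv(StringIO(new_csv))

# Index both frames on the shared key columns so update can align rows.
df1 = df1.set_index(["id", "noteId"])
df2 = df2.set_index(["id", "noteId"])

# update works in place and only touches rows and columns already in df1.
df1.update(df2)

df1 = df1.reset_index()
print(df1)
```

Because update only overwrites existing cells, the new id22 row and the other column are not added to df1 here. That is the main difference from the merge-based answer.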
