Merge two DataFrames with some equal columns
Full question
https://stackoverflow.com/questions/2474...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #csv #pandas
ANSWER 1
Score 3
Edited to add rows and columns as well as update data, merging efficiently on the indexes.
Code to update your df1 with df2 data...
import pandas as pd
from io import StringIO  # Python 3 replacement for the Python 2 StringIO.StringIO

df1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""
df2 = """id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""
# Set the index in read_csv rather than afterwards
df1 = pd.read_csv(StringIO(df1), sep=",", index_col='id')
df2 = pd.read_csv(StringIO(df2), sep=",", index_col='id')
**SOLUTION**
df = pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
#joined on indexes for speed
OUTPUT
>>> print(df)
      noteId    text                   other
id
id1   idNote12  This is my old text 1  NaN
id2   idNote19  My new text 2          Pre8
id22  idNote22  My new text 22         M1
id3   idNote10  My new text 3          On1
id4   idNote11  This is my old text 4  NaN
id5   NaN       My new text 2          Hl0
Reason it works...
pd.merge has a couple of multipurpose params. The on key is only used to join the two dataframes when left_index and right_index are set to False, the default value. Otherwise it just matches up the identically named columns listed in the on value, in this case the two columns 'text' and 'noteId'. (I made it more general by using df1.columns.tolist() as the param; this means any identically named columns in df2 will overwrite the data from df1 instead of being marked text_y.)
Using the more general on key (df1.columns.tolist()) you can loop through a bunch of CSVs, updating the data in the dataframe as you go.
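A minimal sketch of that loop, with two caveats. First, newer pandas releases appear to reject on combined with left_index/right_index (raising a MergeError), so this sketch swaps in a different technique: pd.concat plus an index filter, not the merge call above. Second, the helper name overlay and the split of the df2 data into two batches are assumptions made for illustration only; the batches stand in for separate CSV files.

```python
from io import StringIO

import pandas as pd


def overlay(base, new):
    """Hypothetical helper: rows in `new` replace rows in `base` that share
    an index label; all other rows and any extra columns are kept."""
    kept = base[~base.index.isin(new.index)]
    return pd.concat([kept, new]).sort_index()


base_csv = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

# Stand-ins for a series of CSV files (the df2 rows, split in two).
update_csvs = [
    """id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8""",
    """id,noteId,text,other
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1""",
]

df = pd.read_csv(StringIO(base_csv), index_col='id')
for csv_text in update_csvs:
    # Each pass overwrites matching ids and appends new ids and columns.
    df = overlay(df, pd.read_csv(StringIO(csv_text), index_col='id'))

print(df)
```

Whole rows from the newer frame win here, which is why id5 ends up with a NaN noteId, as in the output above. If missing values in the newer data should instead fall back to the older values, DataFrame.combine_first may be a better fit.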
**3X faster than accepted solution**
In [25]: %timeit pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
1000 loops, best of 3: 1.11 ms per loop
accepted solution
In [30]: %timeit pd.concat([df1, df2]).groupby('noteId').last().fillna(value='None')
100 loops, best of 3: 3.29 ms per loop
ACCEPTED ANSWER
Score 3
Usually you can solve this with the proper index:
df1.set_index(['id', 'noteId'], inplace=True)
df1.update(df2)
(And if you don't want that index after, just df1.reset_index(inplace=True))
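As a rough illustration of the update approach, here is a sketch under a few assumptions: df2 is given the same ['id', 'noteId'] index as df1 (update aligns on index labels, so both frames need matching indexes), and the df2 row with the missing noteId is left out to keep the example simple. Note that update only changes rows and columns that already exist in df1; it does not add id22 or the 'other' column.

```python
from io import StringIO

import pandas as pd

df1 = pd.read_csv(StringIO("""id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""))

# Assumption: the row with a missing noteId is omitted from this sketch.
df2 = pd.read_csv(StringIO("""id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8
id22,idNote22,My new text 22,M1"""))

keys = ['id', 'noteId']
df1 = df1.set_index(keys)
df2 = df2.set_index(keys)

# In-place update: matching (id, noteId) rows take the text from df2.
df1.update(df2)

# Restore the key columns as ordinary columns.
df1 = df1.reset_index()
print(df1)
```

If new rows and columns also need to be carried over, update alone is not enough and a merge- or concat-based approach is needed instead.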