Merge two DataFrames with some equal columns
Full question
https://stackoverflow.com/questions/2474...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #csv #pandas
ANSWER 1
Score 3
Edited to add rows and columns and update data, efficiently merging on the indexes.
Code to update your df1 with df2 data:
import pandas as pd
from io import StringIO  # Python 3 home of StringIO

df1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""
df2 = """id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""
# Set the index in read_csv rather than afterwards
df1 = pd.read_csv(StringIO(df1), sep=",", index_col='id')
df2 = pd.read_csv(StringIO(df2), sep=",", index_col='id')
**SOLUTION**
df = pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
#joined on indexes for speed
OUTPUT
>>> print(df)
        noteId                   text other
id                                         
id1   idNote12  This is my old text 1   NaN
id2   idNote19          My new text 2  Pre8
id22  idNote22         My new text 22    M1
id3   idNote10          My new text 3   On1
id4   idNote11  This is my old text 4   NaN
id5        NaN          My new text 2   Hl0
Reason it works...
pd.merge has a couple of multipurpose parameters. The on key is only used to join the two DataFrames when left_index and right_index are set to False (the default). Otherwise the join happens on the indexes, and on just names the identically named columns to combine, in this case 'text' and 'noteId'. (I made it more general by passing df1.columns.tolist() as the parameter, which means any identically named columns in df2 will overwrite the data from df1 instead of being suffixed as text_y.)
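The merge call above depends on pandas accepting on together with left_index/right_index, and recent pandas releases appear to reject that combination with a MergeError. A minimal sketch that reproduces the same output table another way, by concatenating both frames and keeping the last row per index label, might look like this. It assumes df2 carries every column of df1 (true for the sample data), because whole rows from df2 replace the matching df1 rows. The names csv1, csv2, combined and result are illustrative and not part of the answer.

```python
from io import StringIO

import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""
csv2 = """id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(StringIO(csv1), index_col="id")
df2 = pd.read_csv(StringIO(csv2), index_col="id")

# Stack df1 on top of df2; columns missing from df1 ("other") become NaN.
combined = pd.concat([df1, df2])

# Keep only the last row for each id, so df2 rows win over df1 rows.
# Assumption: df2 has all of df1's columns, so no df1 value is lost by accident.
result = combined[~combined.index.duplicated(keep="last")].sort_index()

print(result)
```

Rows that exist only in df1 (id1, id4) keep their old text with NaN in other, and rows that exist only in df2 (id22) are appended.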
Using the more general on key (df1.columns.tolist()), you can loop through a bunch of CSVs, updating the data in the DataFrame as you go.
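The idea of looping over several CSVs could be sketched as follows. This is only an illustration under assumptions: apply_update is a hypothetical helper (it uses concat plus index de-duplication rather than the merge call from this answer), the CSV contents are small in-memory strings standing in for files, and the "My newest text 2" row is made-up sample data.

```python
from io import StringIO

import pandas as pd


def apply_update(base, update):
    """Hypothetical helper: rows in `update` replace rows of `base` that share an index label."""
    combined = pd.concat([base, update])
    # keep="last" makes the most recently concatenated row win
    return combined[~combined.index.duplicated(keep="last")]


# In-memory stand-ins for CSV files (assumption: real code would pass file paths)
base_csv = """id,noteId,text
id1,idNote12,This is my old text 1
id2,idNote19,This is my old text 2"""
update_csvs = [
    """id,noteId,text
id2,idNote19,My new text 2""",
    # Made-up second update, only to show that later files win
    """id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My newest text 2,Pre8""",
]

current = pd.read_csv(StringIO(base_csv), index_col="id")
for source in update_csvs:
    update = pd.read_csv(StringIO(source), index_col="id")
    current = apply_update(current, update)

current = current.sort_index()
print(current)
```

Each pass replaces matching ids, appends new ones, and widens the frame when an update introduces a new column.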
**3X faster than accepted solution**
In [25]: %timeit pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
1000 loops, best of 3: 1.11 ms per loop
accepted solution
In [30]: %timeit pd.concat([df1, df2]).groupby('noteId').last().fillna(value='None')
100 loops, best of 3: 3.29 ms per loop
ACCEPTED ANSWER
Score 3
Usually you can solve this with the proper index:
df1.set_index(['id', 'noteId'], inplace=True)
df1.update(df2)
(And if you don't want that index after, just df1.reset_index(inplace=True))
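A minimal sketch of the update approach, using a trimmed copy of the sample data (the df2 row with a missing noteId is left out to keep the index clean). It assumes df2 is given the same ['id', 'noteId'] index as df1 so that update can align the rows; the answer does not show that step, and the name df2_indexed is illustrative.

```python
from io import StringIO

import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""
csv2 = """id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(StringIO(csv1))
df2 = pd.read_csv(StringIO(csv2))

df1.set_index(["id", "noteId"], inplace=True)
# Assumption: df2 needs the same index so update() can match rows by label
df2_indexed = df2.set_index(["id", "noteId"])

# update() overwrites df1 in place with the non-NA values from df2_indexed
df1.update(df2_indexed)

# Drop the helper index again
df1.reset_index(inplace=True)
print(df1)
```

Note that update only touches rows and columns df1 already has: id22 and the other column are not added, which is the main difference from the outer merge in the other answer.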