How to find the intersection of a pair of columns in multiple pandas dataframes with pairs in any order?
--------------------------------------------------
Chapters
00:00 How To Find The Intersection Of A Pair Of Columns In Multiple Pandas Dataframes With Pairs In Any Or
01:23 Accepted Answer Score 5
02:19 Answer 2 Score 1
03:24 Thank you
--
Full question
https://stackoverflow.com/questions/5352...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #python3x #pandas #dataframe
#avk47
ACCEPTED ANSWER
Score 5
You can create a list of DataFrames and, in a list comprehension, sort the values within each row and remove the duplicate rows:
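import numpy as np
import pandas as pd
dfs = [df1, df2, df3]
# Sort the values within each row so (B, A) becomes (A, B), then drop repeated pairs
L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates()
     for x in dfs]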
print (L)
[ col1 col2
0 A B
1 C D
3 E F, col1 col2
0 A B
1 C D
2 M N
3 E F, col1 col2
0 A B
1 C D
2 M N
3 E F]
Then merge the list of DataFrames on all shared columns (the default when the on parameter is omitted):
from functools import reduce
df = reduce(lambda left,right: pd.merge(left,right), L)
print (df)
col1 col2
0 A B
1 C D
2 E F
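For reference, the two steps above can be put together as a self-contained sketch. It assumes the sample frames from Answer 2's test data without the id column; the helper name normalize_pairs and its cols parameter are hypothetical, introduced only for illustration.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Sample data assumed from Answer 2's test frames, without the id column
df1 = pd.DataFrame({"col1": ["A", "C", "B", "E"], "col2": ["B", "D", "A", "F"]})
df2 = pd.DataFrame({"col1": ["B", "D", "M", "F"], "col2": ["A", "C", "N", "E"]})
df3 = pd.DataFrame({"col1": ["A", "D", "N", "E"], "col2": ["B", "C", "M", "F"]})


def normalize_pairs(df, cols=("col1", "col2")):
    """Hypothetical helper: sort each row's pair, then drop repeated pairs."""
    values = df[list(cols)].to_numpy(dtype=object)
    ordered = np.sort(values, axis=1)  # (B, A) and (A, B) both become (A, B)
    return pd.DataFrame(ordered, columns=list(cols)).drop_duplicates()


normalized = [normalize_pairs(df) for df in (df1, df2, df3)]

# With no `on` argument, pd.merge does an inner join on all shared columns
common = reduce(lambda left, right: pd.merge(left, right), normalized)
print(common)
```

Sorting within each row works here because both columns hold values that can be compared with each other; for columns of mixed types, a frozenset key is probably the safer choice.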
Another solution for @pygo:
Set a frozenset of each row as the index, then join the DataFrames with concat using an inner join. Finally, drop repeated index labels with duplicated and boolean indexing, and use iloc to keep only the first 2 columns:
df = pd.concat([x.set_index(x.apply(frozenset, axis=1)) for x in dfs], axis=1, join='inner')
df = df.iloc[~df.index.duplicated(), :2]
print (df)
col1 col2
(B, A) A B
(C, D) C D
(F, E) E F
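The reason frozenset works as a key is that it ignores order and is hashable, so frozenset(('A', 'B')) equals frozenset(('B', 'A')) and can sit in a set or an index. The sketch below is a swapped-in variant, not the code above: it uses plain Python set intersection instead of concat, and assumes the same sample frames as Answer 2 without the id column. The helper name pair_keys is hypothetical.

```python
import pandas as pd

# Sample data assumed from Answer 2's test frames, without the id column
df1 = pd.DataFrame({"col1": ["A", "C", "B", "E"], "col2": ["B", "D", "A", "F"]})
df2 = pd.DataFrame({"col1": ["B", "D", "M", "F"], "col2": ["A", "C", "N", "E"]})
df3 = pd.DataFrame({"col1": ["A", "D", "N", "E"], "col2": ["B", "C", "M", "F"]})


def pair_keys(df):
    """Hypothetical helper: one order-insensitive key per row."""
    return [frozenset(pair) for pair in zip(df["col1"], df["col2"])]


# Pairs present in every dataframe, regardless of column order
common = set.intersection(*(set(pair_keys(df)) for df in (df1, df2, df3)))

# Keep the first row of df1 for each common pair
seen = set()
mask = []
for key in pair_keys(df1):
    mask.append(key in common and key not in seen)
    seen.add(key)

result = df1[pd.Series(mask, index=df1.index)]
print(result)
```

One caveat that applies to any frozenset key: a row whose two values are equal, such as (A, A), collapses to a one-element set.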
ANSWER 2
Score 1
Somewhat similar to some of the earlier answers.
import pandas as pd
from io import StringIO
# Test data
df1 = pd.read_table(StringIO ("""
id col1 col2
id1 A B
id2 C D
id3 B A
id4 E F
"""), sep=r"\s+")  # delim_whitespace is deprecated in recent pandas; sep=r"\s+" is the equivalent
df2 = pd.read_table(StringIO ("""
id col1 col2
id1 B A
id2 D C
id3 M N
id4 F E
"""), sep=r"\s+")
df3 = pd.read_table(StringIO("""
id col1 col2
id1 A B
id2 D C
id3 N M
id4 E F
"""), sep=r"\s+")
# List of n dataframes
dfs = [df1, df2, df3]
# Use frozenset to define the column values without regard for order
# apply with axis=1 calls the lambda once per row
# the list comprehension iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name = 'combined') for df in dfs]
print(combined_columns)
# Results in a list of Series named 'combined'
#[0 (B, A)
# 1 (D, C)
# 2 (B, A)
# 3 (F, E)
# Name: combined, dtype: object,
# 0 (B, A)
# 1 (D, C)
# 2 (N, M)
# 3 (E, F)
# Name: combined, dtype: object,
# 0 (B, A)
# 1 (D, C)
# 2 (M, N)
# 3 (F, E)
# Name: combined, dtype: object]
dfs_combined = [pd.concat([dfs[i], combined_columns[i]], axis = 1) for i in range(len(dfs))]
print(dfs_combined)
# Results in a list of dataframes with the extra 'combined' column
#[ id col1 col2 combined
# 0 id1 A B (B, A)
# 1 id2 C D (D, C)
# 2 id3 B A (B, A)
# 3 id4 E F (F, E),
# id col1 col2 combined
# 0 id1 B A (B, A)
# 1 id2 D C (D, C)
# 2 id3 M N (N, M)
# 3 id4 F E (E, F),
# id col1 col2 combined
# 0 id1 A B (B, A)
# 1 id2 D C (D, C)
# 2 id3 N M (M, N)
# 3 id4 E F (F, E)]
# reduce applies the lambda pairwise; the previous result is passed as the first argument
from functools import reduce
# Keep rows whose pair also appears in the next dataframe, then drop repeated pairs
result = reduce(lambda left, right: left[left['combined'].isin(right['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
# id col1 col2 combined
#0 id1 A B (B, A)
#1 id2 C D (D, C)
#3 id4 E F (F, E)
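To make the reduce step easier to follow, here is a small sketch that traces the same fold on plain Python lists instead of dataframes. The frozenset keys are written out by hand from the test data above, and the function name keep_shared is hypothetical; it stands in for the isin filter.

```python
from functools import reduce
from itertools import accumulate

# Order-insensitive keys for the (col1, col2) rows of df1, df2 and df3
keys1 = [frozenset("AB"), frozenset("CD"), frozenset("BA"), frozenset("EF")]
keys2 = [frozenset("BA"), frozenset("DC"), frozenset("MN"), frozenset("FE")]
keys3 = [frozenset("AB"), frozenset("DC"), frozenset("NM"), frozenset("EF")]


def keep_shared(left, right):
    """Keep the items of left that also occur in right, in left's order."""
    return [key for key in left if key in right]


# reduce(f, [a, b, c]) evaluates f(f(a, b), c)
folded = reduce(keep_shared, [keys1, keys2, keys3])
nested = keep_shared(keep_shared(keys1, keys2), keys3)

# accumulate yields every intermediate result; the last one matches reduce
running = list(accumulate([keys1, keys2, keys3], keep_shared))

# The filter keeps both copies of {A, B} from keys1, so a final
# de-duplication step is still needed (drop_duplicates in the answer)
unique = list(dict.fromkeys(folded))
print(unique)
```

Because the first list is the one being filtered at every step, the surviving rows and their order always come from the first dataframe, which matches the id values in the result above.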