How to find the intersection of a pair of columns in multiple pandas dataframes with pairs in any order?

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Fantascape Looping

--

Chapters
00:00 How To Find The Intersection Of A Pair Of Columns In Multiple Pandas Dataframes With Pairs In Any Or
01:18 Accepted Answer Score 5
02:03 Answer 2 Score 1
02:56 Thank you

--

Full question
https://stackoverflow.com/questions/5352...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #python3x #pandas #dataframe

#avk47

ACCEPTED ANSWER

Score 5

You can create list of DataFrames and in list comprehension sorting per rows with removing duplicates:

dfs = [df1,df2,df3]

L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates() 
     for x in dfs]
print (L)
[  col1 col2
0    A    B
1    C    D
3    E    F,   col1 col2
0    A    B
1    C    D
2    M    N
3    E    F,   col1 col2
0    A    B
1    C    D
2    M    N
3    E    F]

And then merge list of DataFrames by all columns (no parameter on):

from functools import reduce
df = reduce(lambda left,right: pd.merge(left,right), L)
print (df)
  col1 col2
0    A    B
1    C    D
2    E    F

Another solution for @pygo:

Create index by frozensets and join together by concat with inner join, last remove duplicates by index by duplicated with boolean indexing and iloc for get first 2 columns:

df = pd.concat([x.set_index(x.apply(frozenset, axis=1)) for x in dfs], axis=1, join='inner')
df = df.iloc[~df.index.duplicated(), :2]
print (df)
       col1 col2
(B, A)    A    B
(C, D)    C    D
(F, E)    E    F

ANSWER 2

Score 1

Somewhat similar to some of the earlier answers.

import pandas as pd
from io import StringIO 

# Test data
df1 = pd.read_table(StringIO ("""
id col1 col2
id1  A     B
id2  C     D
id3  B     A
id4  E     F
"""), delim_whitespace = True)
df2 = pd.read_table(StringIO ("""
id col1 col2
id1  B     A  
id2  D     C  
id3  M     N  
id4  F     E  
"""), delim_whitespace = True)
df3 = pd.read_table(StringIO("""
id col1 col2
id1  A     B  
id2  D     C  
id3  N     M  
id4  E     F 
"""), delim_whitespace = True)

# List of n dataframes
dfs = [df1, df2, df3]

# Use frozenset to define the column values without regard for order 
# pandas apply iterates over each row
# list expression iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name = 'combined') for df in dfs]
print(combined_columns)
# Results in  alist of Series named 'combined'
#[0    (B, A)
# 1    (D, C)
# 2    (B, A)
# 3    (F, E)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (N, M)
# 3    (E, F)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (M, N)
# 3    (F, E)
# Name: combined, dtype: object]

dfs_combined = [pd.concat([dfs[i], combined_columns[i]], axis = 1) for i in range(len(dfs))]
print(dfs_combined)
# Result in a list of dataframes with the extra columns
#[    id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    C    D   (D, C)
# 2  id3    B    A   (B, A)
# 3  id4    E    F   (F, E),     
#     id col1 col2 combined
# 0  id1    B    A   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    M    N   (N, M)
# 3  id4    F    E   (E, F),
#     id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    N    M   (M, N)
# 3  id4    E    F   (F, E)]

# The reduce function operates on pairs, with previous result as the first argument 
from functools import reduce
result = reduce(lambda df1, df2: df1[df1['combined'].isin(df2['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
#    id col1 col2 combined
#0  id1    A    B   (B, A)
#1  id2    C    D   (D, C)
#3  id4    E    F   (F, E)