Find entries that do not match between columns and iterate through columns

Full question
https://stackoverflow.com/questions/6045...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #numpy

ACCEPTED ANSWER

Score 6

With a bit more comprehensive regex:

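from itertools import groupby
import re

# Key each column by its prefix if it ends in _1 or _2, otherwise by None.
def prefix(col):
    return col[:-2] if re.match(r".+_(1|2)$", col) else None

for k, cols in groupby(sorted(df.columns), prefix):
    cols = list(cols)
    # Only compare when exactly two columns share a (non-None) prefix.
    if k and len(cols) == 2:
        df[f"{k}_check"] = df[cols[0]].eq(df[cols[1]])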

This pairs only columns whose names end in _1 and _2, regardless of what precedes the suffix, and computes a _check column only when both the _1 and the _2 column exist (assuming no two columns share the same name).

For the sample data:

       A_1      A_2   B_1    B_2  A_check  B_check
0  charlie  charlie  beta  cappa     True    False
1  charlie  charlie  beta  delta     True    False
2  charlie  charlie  beta   beta     True     True
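
As a variation on the same idea, here is a minimal sketch that looks up each _1 column's _2 partner directly, so it does not rely on the two columns being adjacent after sorting. It is not the answer's exact code: the helper name add_pair_checks and the mismatches variable are hypothetical, and the sample frame is reconstructed from the output table above.

```python
import re
import pandas as pd

# Sample data reconstructed from the output table above.
df = pd.DataFrame(
    {
        "A_1": ["charlie", "charlie", "charlie"],
        "A_2": ["charlie", "charlie", "charlie"],
        "B_1": ["beta", "beta", "beta"],
        "B_2": ["cappa", "delta", "beta"],
    }
)

def add_pair_checks(frame):
    """Return a copy with a <prefix>_check column for each <prefix>_1/<prefix>_2 pair."""
    out = frame.copy()
    for col in frame.columns:
        m = re.fullmatch(r"(.+)_1", col)
        # Skip columns that are not *_1 or whose *_2 partner is missing.
        if m and f"{m.group(1)}_2" in frame.columns:
            prefix = m.group(1)
            out[f"{prefix}_check"] = frame[col].eq(frame[f"{prefix}_2"])
    return out

checked = add_pair_checks(df)

# Rows in which at least one pair of columns disagrees.
mismatches = checked[~checked.filter(regex=r"_check$").all(axis=1)]
```

Filtering on the _check columns is one way to get back to the entries that do not match; here mismatches holds the rows where B_1 and B_2 differ.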

ANSWER 2

Score 4

You can use wide_to_long if you know the first part of the column names (the stub names), i.e. A, B, ...:

(pd.wide_to_long(df.reset_index(), stubnames=['A', 'B'], i='index', j='part', sep='_')
   .groupby('index').nunique().eq(1)
   .add_suffix('_check')
)

Output:

       A_check  B_check
index                  
0         True    False
1         True    False
2         True     True
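
If the stub names are not known in advance, they can probably be derived from the column names first. The sketch below assumes every column of interest looks like <stub>_<number>; the names stubs, long_df and checks are illustrative, and the sample frame is the one used elsewhere in this thread.

```python
import pandas as pd

df = pd.DataFrame(
    [["charlie", "charlie", "beta", "cappa"],
     ["charlie", "charlie", "beta", "delta"],
     ["charlie", "charlie", "beta", "beta"]],
    columns=["A_1", "A_2", "B_1", "B_2"],
)

# Assumption: the stub is everything before the last underscore.
stubs = sorted({c.rsplit("_", 1)[0] for c in df.columns if "_" in c})

# Reshape to long form: one row per (index, part), one column per stub.
long_df = pd.wide_to_long(
    df.reset_index(), stubnames=stubs, i="index", j="part", sep="_"
)

# A pair matches when its stub has exactly one distinct value per original row.
checks = long_df.groupby("index").nunique().eq(1).add_suffix("_check")
```

Note that nunique ignores missing values by default, so a pair consisting of one value and one NaN would count as matching under this approach.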

ANSWER 3

Score 3

Another way is to reshape the DataFrame using a pd.MultiIndex on the columns:

import pandas as pd

df = pd.DataFrame([['charlie', 'charlie', 'beta', 'cappa'],
                   ['charlie', 'charlie', 'beta', 'delta'],
                   ['charlie', 'charlie', 'beta', 'beta']],
                  columns=['A_1', 'A_2', 'B_1', 'B_2'])

# Split 'A_1' into ('A', '1'), giving a two-level MultiIndex column header.
df.columns = df.columns.str.split('_', expand=True)
# Move the first level ('A', 'B', ...) into the row index.
dfs = df.stack(0)
# Compare column '1' with column '2', then move 'A', 'B' back to columns.
df_out = (dfs == dfs.shift(-1, axis=1))['1'].unstack()
print(df_out)

Output:

      A      B
0  True  False
1  True  False
2  True   True
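
The same MultiIndex header can also be compared without stacking, by taking one cross-section per suffix. This is a sketch of an alternative rather than the answer's own code; the variable names wide, left and right are illustrative.

```python
import pandas as pd

df = pd.DataFrame([['charlie', 'charlie', 'beta', 'cappa'],
                   ['charlie', 'charlie', 'beta', 'delta'],
                   ['charlie', 'charlie', 'beta', 'beta']],
                  columns=['A_1', 'A_2', 'B_1', 'B_2'])

wide = df.copy()
# 'A_1' becomes ('A', '1'): level 0 is the prefix, level 1 the suffix.
wide.columns = wide.columns.str.split('_', expand=True)

# One cross-section per suffix; each has plain columns 'A', 'B'.
left = wide.xs('1', axis=1, level=1)
right = wide.xs('2', axis=1, level=1)

# Element-wise comparison, aligned on the shared prefix labels.
df_out = left.eq(right)
```

Because both cross-sections are labelled by prefix, the comparison lines up A with A and B with B regardless of the original column order.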

ANSWER 4

Score 3

You can split the column names, group along axis=1 by the first element of each split result, and call agg to compare the first and last column of each group:

i_cols = df.columns.str.split('_')
df_check = (df.groupby(i_cols.str[0], axis=1).agg(lambda x: x.iloc[:,0] == x.iloc[:,-1])
              .add_suffix('_check'))

In [69]: df_check
Out[69]:
   A_check  B_check
0     True    False
1     True    False
2     True     True
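
Newer pandas releases deprecate the axis=1 argument of DataFrame.groupby, so the snippet above may warn or fail depending on the installed version. A minimal sketch of the same comparison without groupby follows; the names prefixes, checks and group are illustrative.

```python
import pandas as pd

df = pd.DataFrame([['charlie', 'charlie', 'beta', 'cappa'],
                   ['charlie', 'charlie', 'beta', 'delta'],
                   ['charlie', 'charlie', 'beta', 'beta']],
                  columns=['A_1', 'A_2', 'B_1', 'B_2'])

# First element of each split column name: 'A', 'A', 'B', 'B'.
prefixes = df.columns.str.split('_').str[0]

checks = {}
for prefix in prefixes.unique():
    # All columns sharing this prefix, selected with a boolean mask.
    group = df.loc[:, prefixes == prefix]
    # Compare the first and last column of the group, as in the agg lambda.
    checks[f"{prefix}_check"] = group.iloc[:, 0].eq(group.iloc[:, -1])

df_check = pd.DataFrame(checks)
```

As in the original, a prefix with only one column is compared with itself, so its check column is True wherever the value is not missing.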