The Python Oracle

Condensing columns in a Pandas Dataframe

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Puzzle Game 3

--

Chapters
00:00 Condensing Columns In A Pandas Dataframe
00:55 Accepted Answer Score 4
02:18 Answer 2 Score 0
02:47 Thank you

--

Full question
https://stackoverflow.com/questions/6808...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #list #dataframe #dictionary

#avk47



ACCEPTED ANSWER

Score 4


One way:

  1. Set ['event_id', 'target'] as the index.
  2. Split and expand the columns to create hierarchical columns (we'll need these for the dot product).
  3. Extract the level-1 values from the hierarchical columns and take the dot product with the dataframe.
  4. Do some string manipulation and rename the columns (I've added two ways to rename the columns) to get the desired output.
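The snippets below operate on the question's frame df, which isn't reproduced in this description; a minimal stand-in, assumed here purely for illustration and consistent with the output shown further down, would be:

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's input frame (the real one is
# in the linked question): one boolean flag column per action/actor label.
df = pd.DataFrame({
    'event_id': [1, 2],
    'target': ['Microsoft', 'Bank of America'],
    'action.hacking': [False, True],
    'action.malware': [True, False],
    'actor.internal.employee': [True, False],
    'actor.external.competitor': [False, True],
})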
df1 = df.set_index(['event_id', 'target'])
df1.columns = df1.columns.str.split('.', expand=True, n=1)
df = (
    df1.dot(df1.columns.get_level_values(1) + ',')
    .str.strip(',')
    .str.split(',', expand=True)
    .rename(columns={0: 'Action', 1: 'Actor'})
    .reset_index()
)

OUTPUT:

   event_id           target   Action                Actor
0         1        Microsoft  malware    internal.employee
1         2  Bank of America  hacking  external.competitor
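The dot product is doing string concatenation here: every True cell contributes its comma-suffixed level-1 label, every False cell contributes an empty string, and the trailing comma is stripped afterwards. A tiny standalone illustration of the idiom (not the question's data):

import pandas as pd

flags = pd.DataFrame({'hacking': [False, True], 'malware': [True, False]})
# True * 'label,' -> 'label,' and False * 'label,' -> '', so the dot product
# concatenates the labels of the True cells in each row.
labels = flags.dot(flags.columns + ',').str.strip(',')
print(labels.tolist())   # ['malware', 'hacking']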
NOTE:

Instead of renaming columns via rename, you can also use the level-0 column values like this:

df1 = df.set_index(['event_id', 'target'])
df1.columns = df1.columns.str.split('.', expand=True, n=1)
df = (
    df1.dot(df1.columns.get_level_values(1) + ',')
    .str.strip(',')
    .str.split(',', expand=True)
)
# dict.fromkeys de-duplicates the level-0 labels while keeping their order,
# yielding ['action', 'actor'] as the new column names.
df.columns = dict.fromkeys(df1.columns.get_level_values(0)).keys()
df = df.reset_index()

One more alternative is to use mul:

df1 = df.set_index(['event_id', 'target'])
df1.columns = df1.columns.str.split('.', expand=True, n=1)
df = (
    df1.mul(df1.columns.get_level_values(1))   # True -> label, False -> ''
    .replace('', np.nan)                       # empty strings become NaN
    .droplevel(1, axis=1)
    .stack()                                   # drops the NaNs
    .unstack()
)
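mul relies on the same bool-times-string behaviour, but element-wise: each flag is multiplied by its own level-1 label, so True keeps the label and False becomes an empty string (which the replace call above turns into NaN so that stack can drop it). A small standalone illustration (not the question's data):

import pandas as pd

flags = pd.DataFrame({'hacking': [False, True], 'malware': [True, False]})
# Multiply every boolean column by its own label: True -> label, False -> ''.
print(flags.mul(flags.columns).values.tolist())
# [['', 'malware'], ['hacking', '']]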

NOTE:

In case you have multiple values for action and actor for one target, you can use pivot_table instead of stack/unstack (unstack cannot reshape an index with duplicate entries):

df1 = df.set_index(['event_id', 'target'])
df1.columns = df1.columns.str.split('.', expand=True, n=1)
(
    df1.mul(df1.columns.get_level_values(1))
    .replace('', np.nan)
    .droplevel(1, axis=1)
    .stack()
    .reset_index()
    .pivot_table(index=['event_id', 'target'], columns='level_2',
                 values=0, aggfunc=', '.join)
)
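For instance, with a hypothetical event that carries two True action flags (reusing the imports from the sketch above), stack produces two 'action' rows for the same (event_id, target) pair, and the ', '.join aggfunc collapses them into a single cell:

# Hypothetical event with two True action flags.
df2 = pd.DataFrame({
    'event_id': [1],
    'target': ['Microsoft'],
    'action.hacking': [True],
    'action.malware': [True],
    'actor.internal.employee': [True],
    'actor.external.competitor': [False],
})
df2 = df2.set_index(['event_id', 'target'])
df2.columns = df2.columns.str.split('.', expand=True, n=1)
out = (
    df2.mul(df2.columns.get_level_values(1))
    .replace('', np.nan)
    .droplevel(1, axis=1)
    .stack()
    .reset_index()
    .pivot_table(index=['event_id', 'target'], columns='level_2',
                 values=0, aggfunc=', '.join)
)
# out.loc[(1, 'Microsoft'), 'action'] == 'hacking, malware'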



ANSWER 2

Score 0


You can flip the dataframe from wide to long, keep only the rows where a True appears, and make a few modifications to get your output (a sketch of the intermediate long frame follows the output below).

# flip from wide to long
(pd.wide_to_long(df,
                 stubnames=['action', 'actor'],
                 i=['event_id', 'target'],
                 j='subs',
                 sep='.',
                 suffix='.+')
  # keep only rows that contain a `True`
  .loc[lambda df: df.any(axis=1)]
  # replace every non-null cell with the
  # corresponding label from the `subs` index level
  .where(lambda df: df.isna(),
         lambda df: df.index.get_level_values('subs'))
  .fillna('')
  .droplevel('subs')  # served its purpose, let it go
  # use groupby with agg to reduce to single rows
  .groupby(['event_id', 'target'])
  .agg("".join)
  .reset_index()
)

OUTPUT:

   event_id           target   action                actor
0         1        Microsoft  malware    internal.employee
1         2  Bank of America  hacking  external.competitor
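
For reference, the intermediate long frame that pd.wide_to_long builds (before the .loc filter) has one row per (event_id, target, subs) combination, with only one of the action/actor columns filled for any given suffix. A rough sketch, assuming df is the original wide frame as reconstructed at the top of the accepted answer:

long = pd.wide_to_long(df,
                       stubnames=['action', 'actor'],
                       i=['event_id', 'target'],
                       j='subs',
                       sep='.',
                       suffix='.+')
print(long)
# For event 1, for example:
#   subs='hacking'           -> action=False, actor=NaN
#   subs='malware'           -> action=True,  actor=NaN
#   subs='internal.employee' -> action=NaN,   actor=True
# The .loc filter keeps only the rows containing a True, and the .where call
# then swaps every non-null cell for its `subs` label.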