How to unnest (explode) a column in a pandas DataFrame, into multiple rows
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Darkness Approaches Looping
--
Chapters
00:00 Question
01:11 Accepted answer (Score 319)
06:37 Answer 2 (Score 54)
09:12 Answer 3 (Score 17)
09:34 Answer 4 (Score 11)
10:39 Thank you
--
Full question
https://stackoverflow.com/questions/5321...
Question links:
[Pandas column of lists, create a row for each list element]: https://stackoverflow.com/questions/2726...
[When should I (not) want to use pandas apply() in my code?]: https://stackoverflow.com/questions/5443...
Accepted answer links:
[user3483203's answer]: https://stackoverflow.com/questions/5321...
[Cython]: https://cython.org/
[numba]: http://numba.pydata.org/
[pandas 0.25]: https://pandas.pydata.org/pandas-docs/ve...
[pandas.DataFrame.explode]: https://pandas.pydata.org/pandas-docs/st...
Answer 2 links:
[image]: https://i.stack.imgur.com/SWzSN.png
Answer 3 links:
[simplified significantly in pandas 0.25]: https://pandas.pydata.org/pandas-docs/ve...
Answer 4 links:
[meshgrid recipe]: https://stackoverflow.com/questions/1208...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #dataframe #pandasexplode
#avk47
ACCEPTED ANSWER
Score 337
I know object dtype columns makes the data hard to convert with pandas functions. When I receive data like this, the first thing that came to mind was to "flatten" or unnest the columns.
I am using pandas and Python functions for this type of question. If you are worried about the speed of the above solutions, check out user3483203's answer, since it's using numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.
Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:
df.explode('B')
A B
0 1 1
1 1 2
0 2 1
1 2 2
Given a dataframe with an empty list or a NaN in the column. An empty list will not cause an issue, but a NaN will need to be filled with a list
df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
Method 1
apply + pd.Series (easy to understand but in terms of performance not recommended . )
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
Method 2
Using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )
df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .
Solution : join or merge with the index after 'unnest' the single columns
s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',1),how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
If you need the column order exactly the same as before, add reindex at the end.
s.join(df.drop('B',1),how='left').reindex(columns=df.columns)
Method 3
recreate the list
pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
If more than two columns, use
s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
Method 4
using reindex or loc
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5 when the list only contains unique values:
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
Method 6
using numpy for high performance:
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Method 7
using base function itertools cycle and chain: Pure python solution just for fun
from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Generalizing to multiple columns
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
Self-def function:
def unnesting(df, explode):
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
Column-wise Unnesting
All above method is talking about the vertical unnesting and explode , If you do need expend the list horizontal, Check with pd.DataFrame constructor
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
Updated function
def unnesting(df, explode, axis):
if axis==1:
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
else :
df1 = pd.concat([
pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')
Test Output
unnesting(df, ['B','C'], axis=0)
Out[36]:
B0 B1 C0 C1 A
0 1 2 1 2 1
1 3 4 3 4 2
Update 2021-02-17 with original explode function
def unnesting(df, explode, axis):
if axis==1:
df1 = pd.concat([df[x].explode() for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')
else :
df1 = pd.concat([
pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')
ANSWER 2
Score 54
Option 1
If all of the sublists in the other column are the same length, numpy can be an efficient option here:
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 2
If the sublists have different length, you need an additional step:
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)
pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 3
I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:
df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})
A B C D
0 1 [1, 2] [1, 2, 3] A
1 2 [1, 2, 3] [1, 2] B
2 3 [1] [1, 2] C
def unnest(df, tile, explode):
vals = df[explode].sum(1)
rs = [len(r) for r in vals]
a = np.repeat(df[tile].values, rs, axis=0)
b = np.concatenate(vals.values)
d = np.column_stack((a, b))
return pd.DataFrame(d, columns = tile + ['_'.join(explode)])
unnest(df, ['A', 'D'], ['B', 'C'])
A D B_C
0 1 A 1
1 1 A 2
2 1 A 1
3 1 A 2
4 1 A 3
5 2 B 1
6 2 B 2
7 2 B 3
8 2 B 1
9 2 B 2
10 3 C 1
11 3 C 1
12 3 C 2
Functions
def wen1(df):
return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})
def wen2(df):
return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
def wen3(df):
s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
return s.join(df.drop('B', 1), how='left')
def wen4(df):
return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
def chris1(df):
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
def chris2(df):
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A.values, rs)
return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
Timings
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit
res = pd.DataFrame(
index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
columns=[10, 50, 100, 500, 1000, 5000, 10000],
dtype=float
)
for f in res.index:
for c in res.columns:
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
df = pd.concat([df]*c)
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
Performance
ANSWER 3
Score 18
Exploding a list-like column has been simplified significantly in pandas 0.25 with the addition of the explode() method:
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
df.explode('B')
Out:
A B
0 1 1
0 1 2
1 2 1
1 2 2
ANSWER 4
Score 12
One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:
import numpy as np
import pandas as pd
def unnest(frame, explode):
def mesh(values):
return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))
data = np.vstack(mesh(row) for row in frame[explode].values)
return pd.DataFrame(data=data, columns=explode)
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
print(unnest(df, ['A', 'B'])) # base
print()
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
print(unnest(df, ['A', 'B', 'C'])) # multiple columns
print()
df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})
print(unnest(df, ['A', 'B'])) # uneven length lists
print()
print(unnest(df, ['D', 'B'])) # different types
print()
Output
A B
0 1 1
1 1 2
2 2 1
3 2 2
A B C
0 1 1 1
1 1 2 1
2 1 1 2
3 1 2 2
4 2 3 3
5 2 4 3
6 2 3 4
7 2 4 4
A B
0 1 1
1 1 2
2 2 1
3 2 2
4 2 3
5 3 1
D B
0 A 1
1 A 2
2 B 1
3 B 2
4 B 3
5 C 1
