How to expand/flatten pandas dataframe efficiently
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Puzzle Game 5
--
Chapters
00:00 Question
01:26 Accepted answer (Score 1)
02:13 Answer 2 (Score 4)
03:36 Answer 3 (Score 2)
06:53 Thank you
--
Full question
https://stackoverflow.com/questions/4216...
Answer 1 links:
[numpy.repeat]: https://docs.scipy.org/doc/numpy/referen...
[numpy.repeat]: https://docs.scipy.org/doc/numpy/referen...
[numpy.hstack]: https://docs.scipy.org/doc/numpy/referen...
[numpy.column_stack]: https://docs.scipy.org/doc/numpy/referen...
Answer 2 links:
[leads to quadratic copying]: https://stackoverflow.com/a/36489724/190...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas
#avk47
ANSWER 1
Score 4
You can use numpy for better performance:
Both solutions use mainly numpy.repeat.
from itertools import chain
vals = df.fields.str.len()
df1 = pd.DataFrame({
"id": np.repeat(df.id.values,vals),
"name": np.repeat(df.name.values, vals),
"fields": list(chain.from_iterable(df.fields))})
df1 = df1.reindex_axis(df.columns, axis=1)
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
Another solution:
df[['id','name']].values converts columns to numpy array and duplicate them by numpy.repeat, then stack values in lists by numpy.hstack and add it by numpy.column_stack.
df1 = pd.DataFrame(np.column_stack((df[['id','name']].values.
repeat(list(map(len,df.fields)),axis=0),np.hstack(df.fields))),
columns=df.columns)
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
More general solution is filter out column fields and then add it to DataFrame constructor, because always last column:
cols = df.columns[df.columns != 'fields'].tolist()
print (cols)
['id', 'name']
df1 = pd.DataFrame(np.column_stack((df[cols].values.
repeat(list(map(len,df.fields)),axis=0),np.hstack(df.fields))),
columns=cols + ['fields'])
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
ANSWER 2
Score 2
If your CSV is many thousands of lines long, then using_string_methods (below)
may be faster than using_iterrows or using_repeat:
With
csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)
In [210]: %timeit using_string_methods(csv)
10 loops, best of 3: 100 ms per loop
In [211]: %timeit using_itertuples(csv)
10 loops, best of 3: 119 ms per loop
In [212]: %timeit using_repeat(csv)
10 loops, best of 3: 126 ms per loop
In [213]: %timeit using_iterrows(csv)
1 loop, best of 3: 1min 7s per loop
So for a 10000-line CSV, using_string_methods is over 600x faster than using_iterrows, and marginally faster than using_repeat.
import pandas as pd
try: from cStringIO import StringIO # for Python2
except ImportError: from io import StringIO # for Python3
def using_string_methods(csv):
df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
other_columns = df.columns.difference(['fields']).tolist()
fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
.str.split(r',', expand=True))
df = pd.concat([df.drop('fields', axis=1), fields], axis=1)
result = (pd.melt(df, id_vars=other_columns, value_name='field')
.drop('variable', axis=1))
result = result.dropna(subset=['field'])
return result
def using_iterrows(csv):
df = pd.read_csv(StringIO(csv), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
new_df = pd.DataFrame(index=[], columns=df.columns)
for _, i in df.iterrows():
flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
new_df = new_df.append(flattened_d )
return new_df
def using_repeat(csv):
df = pd.read_csv(StringIO(csv), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
cols = df.columns[df.columns != 'fields'].tolist()
df1 = pd.DataFrame(np.column_stack(
(df[cols].values.repeat(list(map(len,df.fields)),axis=0),
np.hstack(df.fields))), columns=cols + ['fields'])
return df1
def using_itertuples(csv):
df = pd.read_csv(StringIO(csv), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
other_columns = df.columns.difference(['fields']).tolist()
data = []
for tup in df.itertuples():
data.extend([[getattr(tup, col) for col in other_columns]+[field]
for field in tup.fields])
return pd.DataFrame(data, columns=other_columns+['field'])
csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)
Generally, fast NumPy/Pandas operations are possible only when the data is in a
native NumPy dtype (such as int64 or float64, or strings.) Once you place
lists (a non-native NumPy dtype) in a DataFrame the jig is up -- you are forced
to use Python-speed loops to process the lists.
So to improve performance, you need to avoid placing lists in a DataFrame.
using_string_methods loads the fields data as strings:
df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
and avoid using the apply method (which is generally as slow as a plain Python loop):
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
Instead, it uses faster vectorized string methods to break the strings up into separate columns:
fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
.str.split(r',', expand=True))
Once you have the fields in separate columns, you can use pd.melt to reshape
the DataFrame into the desired format.
pd.melt(df, id_vars=['id', 'name'], value_name='field')
By the way, you might be interested to see that with a slight modification using_iterrows can be just as fast as using_repeat. I show the changes in using_itertuples.
df.itertuples tends to be slightly faster than df.iterrows, but the difference is minor. The majority of the speed gain is achieved by avoiding calling df.append in a for-loop since that leads to quadratic copying.
ACCEPTED ANSWER
Score 1
You can break the lists in the fields column into multiple columns by applying pandas.Series to fields and then merging to id and name like so:
cols = df.columns[df.columns != 'fields'].tolist() # adapted from @jezrael
df = df[cols].join(df.fields.apply(pandas.Series))
Then you can melt the resulting new columns using set_index and stack, and then reseting the index:
df = df.set_index(cols).stack().reset_index()
Finally, drop the redundant column generated by reset_index and rename the generated column to "field":
df = df.drop(df.columns[-2], axis=1).rename(columns={0: 'field'})