Expand a list-like column in dask DF across several columns
--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Music Box Puzzles
--
Chapters
00:00 Expand A List-Like Column In Dask Df Across Several Columns
01:36 Accepted Answer Score 1
01:54 Answer 2 Score 0
02:27 Answer 3 Score 0
02:48 Thank you
--
Full question
https://stackoverflow.com/questions/6936...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #dask
#avk47
ACCEPTED ANSWER
Score 1
In this case dask doesn't know what to expect from the outcome, so it's best to specify meta explicitly:
# this is a shortcut that reuses the existing pandas df;
# in actual code it is sufficient to provide an
# empty series with the expected dtype
meta = df['a'].apply(pd.Series)
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()
ANSWER 2
Score 0
I got a working solution. My original function returned a list, which produced the column of lists shown above. Changing the applied function to return a dask bag seems to do the trick:
import random
import numpy as np
import pandas as pd
import dask.bag as db
import dask.dataframe as dd

def create_df_row(x):
    vals = np.random.randint(2, size=4)
    return db.from_sequence([vals], partition_size=2).to_dataframe()

test_df = dd.from_pandas(
    pd.DataFrame({'a': [random.choice(['a', 'b', 'c']) for _ in range(20)]}),
    chunksize=10)
test_df.head()

mini_dfs = [*test_df.groupby('a')['a'].apply(lambda x: create_df_row(x))]
result = dd.concat(mini_dfs)
result.compute().head()
But I'm not sure this solves the in-memory issue, since I'm now holding a list of groupby results.
ANSWER 3
Score 0
Here's how to expand a list-like column across multiple columns manually:
dask_df["a0"] = dask_df["a"].str[0]
dask_df["a1"] = dask_df["a"].str[1]
dask_df["a2"] = dask_df["a"].str[2]
dask_df["a3"] = dask_df["a"].str[3]
print(dask_df.head())
a a0 a1 a2 a3
0 [71, 16, 0, 10] 71 16 0 10
1 [59, 65, 99, 74] 59 65 99 74
2 [83, 26, 33, 38] 83 26 33 38
3 [70, 5, 19, 37] 70 5 19 37
4 [0, 59, 4, 80] 0 59 4 80
SultanOrazbayev's answer seems more elegant.
