The Python Oracle

Expand a list-like column in dask DF across several columns

Full question
https://stackoverflow.com/questions/6936...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #dask



ACCEPTED ANSWER

Score 1


Dask cannot infer the structure of the result (column names and dtypes) here without running the function, so it is best to specify meta explicitly:


import pandas as pd

# Shortcut: build meta from the existing pandas DataFrame df.
# In actual code it is sufficient to provide an empty DataFrame
# with the expected column names and dtypes.
meta = df['a'].apply(pd.Series)

# Expand each list in column 'a' into its own set of columns.
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()
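
As a minimal sketch of the "empty meta" idea mentioned in the comments above: the block below builds meta by hand instead of deriving it from real data. It assumes every list holds exactly four integers, so the expanded frame has integer column labels 0 to 3 with dtype int64. The sample frame and the name n_items are hypothetical and only serve to check the assumption on a tiny pandas example.

```python
import pandas as pd

# Assumption: each list in column 'a' has exactly 4 integer items.
n_items = 4

# Empty DataFrame describing the expected output: columns 0..3, dtype int64.
meta = pd.DataFrame({i: pd.Series(dtype="int64") for i in range(n_items)})

# Hypothetical small sample used only to confirm that meta matches
# what pandas produces for the same operation.
sample = pd.DataFrame({"a": [[1, 2, 3, 4], [5, 6, 7, 8]]})
expanded = sample["a"].apply(pd.Series)

same_columns = list(expanded.columns) == list(meta.columns)
same_dtypes = expanded.dtypes.tolist() == meta.dtypes.tolist()

# With a dask DataFrame, meta would then be passed as in the answer:
# new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
```

If the lists can have different lengths or mixed types, this hand-written meta would no longer describe the output, and the dtypes or column count would need adjusting.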



ANSWER 2

Score 0


I found a working solution. My original function returned a list, which produced the column of lists described in the question. Changing the applied function to return a dask DataFrame built from a dask bag seems to do the trick:

import random
import numpy as np
import pandas as pd
import dask.bag as db
import dask.dataframe as dd

def create_df_row(x):
    vals = np.random.randint(2, size=4)
    return db.from_sequence([vals], partition_size=2).to_dataframe()

test_df = dd.from_pandas(pd.DataFrame({'a':[random.choice(['a', 'b', 'c']) for _ in range(20)]}), chunksize=10)
test_df.head()

mini_dfs = [*test_df.groupby('a')['a'].apply(lambda x: create_df_row(x))]
result = dd.concat(mini_dfs)
result.compute().head()

However, I'm not sure this solves the in-memory issue, since I'm now holding a list of groupby results.




ANSWER 3

Score 0


Here's how to expand a list-like column across multiple columns manually:

dask_df["a0"] = dask_df["a"].str[0]
dask_df["a1"] = dask_df["a"].str[1]
dask_df["a2"] = dask_df["a"].str[2]
dask_df["a3"] = dask_df["a"].str[3]

print(dask_df.head())
                  a  a0  a1  a2  a3
0   [71, 16, 0, 10]  71  16   0  10
1  [59, 65, 99, 74]  59  65  99  74
2  [83, 26, 33, 38]  83  26  33  38
3   [70, 5, 19, 37]  70   5  19  37
4    [0, 59, 4, 80]   0  59   4  80
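
The four assignments above follow one pattern, so they can be written as a loop. The sketch below shows that pattern on a plain pandas DataFrame to keep it small; the answer applies the same .str[i] indexing to a dask DataFrame. It assumes every list has the same known length, and the name n_items and the two sample rows (taken from the output above) are for illustration only.

```python
import pandas as pd

# Two rows taken from the printed output above.
df = pd.DataFrame({"a": [[71, 16, 0, 10], [59, 65, 99, 74]]})

# Assumption: every list has exactly n_items elements.
n_items = 4

# .str[i] picks the i-th element of each list; one new column per position.
for i in range(n_items):
    df[f"a{i}"] = df["a"].str[i]

print(df)
```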

SultanOrazbayev's answer seems more elegant.