How to loop over grouped Pandas dataframe?
--
Full question
https://stackoverflow.com/questions/2740...
Accepted answer links:
[here]: http://pandas.pydata.org/pandas-docs/sta...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #pandas #dataframe #iteration #pandasgroupby
ACCEPTED ANSWER
Score 399
df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)) already returns a DataFrame, so there are no groups left to loop over.
In general:
df.groupby(...) returns a GroupBy object (a DataFrameGroupBy or SeriesGroupBy), and with this you can iterate through the groups (as explained in the docs here). You can do something like:
grouped = df.groupby('A')
for name, group in grouped:
    ...
When you apply a function on the groupby, in your example df.groupby(...).agg(...) (but this can also be transform, apply, mean, ...), you combine the results of applying the function to the different groups into one dataframe (the apply and combine steps of the 'split-apply-combine' paradigm of groupby). So the result will always be a DataFrame again (or a Series, depending on the applied function).
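A minimal sketch of that difference, assuming a small made-up frame (the `item` column and the customer values are illustrative and not taken from the question):

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy

# Hypothetical data: two customers, three purchased items
df = pd.DataFrame({
    "l_customer_id_i": ["c1", "c1", "c2"],
    "item": ["a", "b", "c"],
})

# Before aggregation: a GroupBy object that yields (name, sub-frame) pairs
grouped = df.groupby("l_customer_id_i")
is_groupby = isinstance(grouped, DataFrameGroupBy)

group_sizes = {}
for name, group in grouped:
    group_sizes[name] = len(group)

# After aggregation: a plain DataFrame indexed by the group key
aggregated = grouped.agg(lambda x: ",".join(x))
print(type(aggregated).__name__)
print(aggregated)
```

Once `agg` has run, the groups have been combined, so any per-group loop has to happen on `grouped`, not on `aggregated`.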
ANSWER 2
Score 141
Here is an example of iterating over a pd.DataFrame grouped by the column atable. For this sample, "create" statements for an SQL database are generated within the for loop:
import pandas as pd

def main():
    df1 = pd.DataFrame({
        'atable': ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
        'column': ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
        'column_type': ['varchar', 'varchar', 'int', 'varchar', 'varchar'],
        'is_null': ['No', 'No', 'Yes', 'No', 'Yes'],
    })
    print(df1)
    df1_grouped = df1.groupby('atable')
    # iterate over each group
    for group_name, df_group in df1_grouped:
        print("\n-- Group with {} rows(s)".format(len(df_group)))
        print('CREATE TABLE {}('.format(group_name))
        x = 0
        for row_index, row in df_group.iterrows():
            x = x + 1
            col = row['column']
            column_type = row['column_type']
            is_null = 'NOT NULL' if row['is_null'] == 'No' else ''
            # every column line except the last one ends with a comma
            if x < len(df_group):
                print('\t{} {} {},'.format(col, column_type, is_null))
            else:
                print('\t{} {} {}'.format(col, column_type, is_null))
        print(");")

if __name__ == '__main__':
    main()
Output:
    atable  column  column_type  is_null
0    Users   col_1      varchar       No
1    Users   col_2      varchar       No
2  Domains   col_a          int      Yes
3  Domains   col_b      varchar       No
4    Locks     col      varchar      Yes

-- Group with 2 rows(s)
CREATE TABLE Domains(
    col_a int ,
    col_b varchar NOT NULL
);

-- Group with 1 rows(s)
CREATE TABLE Locks(
    col varchar
);

-- Group with 2 rows(s)
CREATE TABLE Users(
    col_1 varchar NOT NULL,
    col_2 varchar NOT NULL
);
ANSWER 3
Score 38
If the aggregated dataframe has already been created, you can iterate over its index values instead:
df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))
for name in df.index:
    print(name)
    print(df.loc[name])
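As a variation on this, `DataFrame.iterrows()` yields the index label and the row together, which avoids the separate `.loc` lookup. A small sketch, assuming made-up data (the `item` column and the customer values are hypothetical):

```python
import pandas as pd

# Hypothetical data standing in for the question's dataframe
df = pd.DataFrame({
    "l_customer_id_i": ["c1", "c1", "c2"],
    "item": ["a", "b", "c"],
})
aggregated = df.groupby("l_customer_id_i").agg(lambda x: ",".join(x))

# Each iteration gives the group key (index label) and the aggregated row
items_by_customer = {}
for name, row in aggregated.iterrows():
    items_by_customer[name] = row["item"]

print(items_by_customer)
```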
ANSWER 4
Score 2
Loop over groupby object
When you group a DataFrame with groupby, you create a pandas.core.groupby.generic.DataFrameGroupBy object (a SeriesGroupBy for a Series). It defines the __iter__() method, so it can be iterated over like any other object that defines this method, and it can be converted into a list, tuple, iterator, etc. Each iteration returns a tuple whose first element is the grouper key and whose second element is the dataframe for that group; you can think of it as iterating over dict_items, where each item is a key-value tuple. Unless you select a column or columns on the groupby object, each group contains all columns of the dataframe. The output of the following code illustrates this point.
import pandas as pd
from IPython.display import display

df = pd.DataFrame({
    'A': ['g1', 'g1', 'g2', 'g1'],
    'B': [1, 2, 3, 4],
    'C': ['a', 'b', 'c', 'd']
})
grouped = df.groupby('A')

list(grouped)         # OK
dict(iter(grouped))   # OK

for x in grouped:
    print(f" Type of x: {type(x).__name__}\n Length of x: {len(x)}")
    print(f"Value of x[0]: {x[0]}\n Type of x[1]: {type(x[1]).__name__}")
    display(x[1])
A useful application of looping over a groupby object is splitting a dataframe into separate files. For example, the following creates two CSV files (g_0.csv and g_1.csv) from a single dataframe.
for i, (k, g) in enumerate(df.groupby('A')):
    g.to_csv(f"g_{i}.csv")
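If the file names should carry the group key rather than a counter, the key from each iteration can go into the path. This is only a sketch: the `group_<key>.csv` naming scheme and the temporary output directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "A": ["g1", "g1", "g2", "g1"],
    "B": [1, 2, 3, 4],
    "C": ["a", "b", "c", "d"],
})

# Write into a throwaway directory so the example leaves no files behind
out_dir = Path(tempfile.mkdtemp())

written = []
for key, group in df.groupby("A"):
    path = out_dir / f"group_{key}.csv"  # hypothetical naming scheme
    group.to_csv(path, index=False)      # drop the original row index
    written.append(path.name)

print(written)
```

This works cleanly only when the keys are safe to use in file names; keys containing path separators or other special characters would need sanitizing first.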
Loop over grouped dataframe
As mentioned above, a groupby object splits a dataframe into smaller dataframes by key, so you can loop over each group like any other dataframe. See this answer for a comprehensive list of ways to iterate over a dataframe. The most performant way is probably itertuples(). The following example builds a nested dictionary using a loop over the grouped dataframe:
out = {}
for k, g in grouped:             # loop over groupby
    out[k] = {}
    for row in g.itertuples():   # loop over dataframe
        out[k][row.B] = row.C
print(out)
# {'g1': {1: 'a', 2: 'b', 4: 'd'}, 'g2': {3: 'c'}}
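For a simple two-column mapping like this one, the inner row loop can also be replaced by zipping the two columns of each group. This is an alternative sketch, not the approach used in the answer above, and it rebuilds the same example frame so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["g1", "g1", "g2", "g1"],
    "B": [1, 2, 3, 4],
    "C": ["a", "b", "c", "d"],
})

# One inner dict per group, pairing each B value with its C value
out = {k: dict(zip(g["B"], g["C"])) for k, g in df.groupby("A")}
print(out)
```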
