How to loop over grouped Pandas dataframe?

Full question
https://stackoverflow.com/questions/2740...

Accepted answer links:
[here]: http://pandas.pydata.org/pandas-docs/sta...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #dataframe #iteration #pandasgroupby

ACCEPTED ANSWER

Score 399

df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)) already returns a DataFrame, so you can no longer loop over the groups.

In general:

df.groupby(...) returns a GroupBy object (a DataFrameGroupBy or SeriesGroupBy), which you can iterate over group by group (as explained in the docs here). For example:
```
grouped = df.groupby('A')

for name, group in grouped:
    ...
```
When you apply a function to the GroupBy object, as in your example df.groupby(...).agg(...) (the same holds for transform, apply, mean, ...), the results from the individual groups are combined into a single object (the apply and combine steps of the 'split-apply-combine' paradigm of groupby). The result is therefore always a DataFrame again (or a Series, depending on the applied function).
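
To make the difference concrete, here is a minimal sketch. The sample data and the column names customer and item are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "customer": ["c1", "c1", "c2"],
    "item": ["apple", "pear", "plum"],
})

# Split: this only builds a GroupBy object, which can be iterated.
grouped = df.groupby("customer")
print(type(grouped).__name__)   # DataFrameGroupBy

# Iterating yields (group key, sub-DataFrame) pairs.
group_sizes = {name: len(group) for name, group in grouped}

# Apply + combine: the per-group results are merged into one ordinary
# DataFrame indexed by the group key, so there are no groups left to loop over.
combined = grouped.agg(lambda x: ",".join(x))
print(type(combined).__name__)  # DataFrame
print(combined)
```

If you need both the per-group loop and the aggregated result, keep a reference to the GroupBy object (grouped above) instead of overwriting it with the output of agg.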

ANSWER 2

Score 141

Here is an example of iterating over a pd.DataFrame grouped by the column atable. For this sample, "create" statements for an SQL database are generated within the for loop:

import pandas as pd

def main():

    df1 = pd.DataFrame({
        'atable':     ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
        'column':     ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
        'column_type': ['varchar', 'varchar', 'int', 'varchar', 'varchar'],
        'is_null':    ['No', 'No', 'Yes', 'No', 'Yes'],
    })

    print(df1)

    df1_grouped = df1.groupby('atable')

    # iterate over each group
    for group_name, df_group in df1_grouped:
        print("\n-- Group with {} rows(s)".format(len(df_group)))
        print('CREATE TABLE {}('.format(group_name))


        x = 0
        for row_index, row in df_group.iterrows():
            x = x + 1
            col = row['column']
            column_type = row['column_type']
            is_null = 'NOT NULL' if row['is_null'] == 'No' else ''

            if x < len(df_group):
                print('\t{} {} {},'.format(col, column_type, is_null))
            else:
                print('\t{} {} {}'.format(col, column_type, is_null))

        print(");")

if __name__ == '__main__':
    main()

output

    atable column column_type is_null
0    Users  col_1     varchar      No
1    Users  col_2     varchar      No
2  Domains  col_a         int     Yes
3  Domains  col_b     varchar      No
4    Locks    col     varchar     Yes

-- Group with 2 rows(s)
CREATE TABLE Domains(
    col_a int ,
    col_b varchar NOT NULL
);

-- Group with 1 rows(s)
CREATE TABLE Locks(
    col varchar
);

-- Group with 2 rows(s)
CREATE TABLE Users(
    col_1 varchar NOT NULL,
    col_2 varchar NOT NULL
);
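
As a possible variation on the loop above, the statements can be assembled with str.join instead of a manual row counter, which also avoids the stray space before the comma for nullable columns. This is only a sketch: create_statement is a hypothetical helper name, and the data is copied from the example above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "atable": ["Users", "Users", "Domains", "Domains", "Locks"],
    "column": ["col_1", "col_2", "col_a", "col_b", "col"],
    "column_type": ["varchar", "varchar", "int", "varchar", "varchar"],
    "is_null": ["No", "No", "Yes", "No", "Yes"],
})


def create_statement(table_name, df_group):
    """Build one CREATE TABLE statement for a group (hypothetical helper)."""
    lines = []
    for row in df_group.itertuples(index=False):
        parts = [row.column, row.column_type]
        if row.is_null == "No":
            parts.append("NOT NULL")
        lines.append("\t" + " ".join(parts))
    # Joining with ",\n" puts a comma after every column except the last.
    return "CREATE TABLE {}(\n{}\n);".format(table_name, ",\n".join(lines))


# One statement per group; groupby sorts the group keys by default.
statements = [create_statement(name, group) for name, group in df1.groupby("atable")]
print("\n\n".join(statements))
```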

ANSWER 3

Score 38

If the aggregated dataframe has already been created, you can iterate over its index values.

df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))
for name in df.index:
    print(name)
    print(df.loc[name])
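
A self-contained sketch of the same idea follows. Only the column name l_customer_id_i comes from the question; the product column and all values are assumptions made for illustration. It uses iterrows() on the aggregated frame, which yields the index value and the row together.

```python
import pandas as pd

# Hypothetical data; only the name l_customer_id_i is taken from the question.
df = pd.DataFrame({
    "l_customer_id_i": ["c1", "c1", "c2"],
    "product": ["a", "b", "c"],
})

# After agg, the group keys become the index of an ordinary DataFrame.
agg = df.groupby("l_customer_id_i").agg(lambda x: ",".join(x))

# Each index value identifies one former group; row holds its aggregated values.
rows = {name: row["product"] for name, row in agg.iterrows()}
print(rows)
```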

ANSWER 4

Score 2

Loop over groupby object

When you call groupby on a DataFrame, you create a pandas.core.groupby.generic.DataFrameGroupBy object (a SeriesGroupBy for a Series). It defines the __iter__() method, so it can be iterated over like any other object that defines this method, and it can be passed to list(), tuple(), iter(), and so on. Each iteration returns a tuple whose first element is the group key and whose second element is the dataframe for that group; you can think of it like iterating over dict_items, where each item is a key-value tuple. Unless you select a column or columns on the groupby object, each group contains all columns of the dataframe. The output of the following code illustrates this point.

import pandas as pd
from IPython.display import display

df = pd.DataFrame({
    'A': ['g1', 'g1', 'g2', 'g1'],
    'B': [1, 2, 3, 4],
    'C': ['a', 'b', 'c', 'd']
})

grouped = df.groupby('A')

list(grouped)         # OK
dict(iter(grouped))   # OK

for x in grouped:
    print(f"    Type of x: {type(x).__name__}\n  Length of x: {len(x)}")
    print(f"Value of x[0]: {x[0]}\n Type of x[1]: {type(x[1]).__name__}")
    display(x[1])
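
The remark about selecting columns on the groupby object can be sketched as follows, using the same sample frame. Selecting a single column label yields a Series per group. For a list of labels, recent pandas releases yield a dataframe restricted to those columns; this sketch assumes, without having confirmed it, that older releases may behave differently here.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["g1", "g1", "g2", "g1"],
    "B": [1, 2, 3, 4],
    "C": ["a", "b", "c", "d"],
})

# Single column label: each group is a Series holding only that column.
series_types = {key: type(group).__name__ for key, group in df.groupby("A")["B"]}
series_values = {key: group.tolist() for key, group in df.groupby("A")["B"]}

# List of labels: each group is a DataFrame; on recent pandas it carries
# only the selected columns (assumption: older versions may differ).
frame_cols = {key: list(group.columns) for key, group in df.groupby("A")[["B"]]}

print(series_types)
print(series_values)
print(frame_cols)
```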

A useful application of looping over a groupby object is splitting a dataframe into separate files. For example, the following creates two csv files (g_0.csv and g_1.csv) from a single dataframe.

for i, (k, g) in enumerate(df.groupby('A')):
    g.to_csv(f"g_{i}.csv")
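
A variation that may be easier to trace back is to name each file after its group key rather than a running index. The sketch below writes into a temporary directory, which is an assumption made so the example does not clutter the working directory; index=False is likewise a choice made here, not part of the original snippet.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "A": ["g1", "g1", "g2", "g1"],
    "B": [1, 2, 3, 4],
    "C": ["a", "b", "c", "d"],
})

# Hypothetical output location: a fresh temporary directory.
out_dir = Path(tempfile.mkdtemp())

written = []
for key, group in df.groupby("A"):
    path = out_dir / f"{key}.csv"     # file named after the group key
    group.to_csv(path, index=False)   # index=False drops the row index column
    written.append(path.name)

print(written)
```

This assumes the group keys are safe to use as file names; keys containing path separators or other special characters would need sanitizing first.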

Loop over grouped dataframe

As mentioned above, a groupby object splits a dataframe into dataframes by key, so you can loop over each group's dataframe like any other dataframe. See this answer for a comprehensive overview of ways to iterate over a dataframe. The most performant way is probably itertuples(). The following example builds a nested dictionary with a loop over the grouped dataframe:

out = {}
for k, g in grouped:            # loop over groupby
    out[k] = {}
    for row in g.itertuples():  # loop over dataframe
        out[k][row.B] = row.C
print(out)
# {'g1': {1: 'a', 2: 'b', 4: 'd'}, 'g2': {3: 'c'}}
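
For this particular case, where each inner dictionary maps one column to another, the inner loop can probably be replaced by zip. The following sketch, using the same sample frame, should build the same mapping as the nested loop above.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["g1", "g1", "g2", "g1"],
    "B": [1, 2, 3, 4],
    "C": ["a", "b", "c", "d"],
})

# Outer comprehension loops over the groups; zip pairs B with C row by row.
out = {k: dict(zip(g["B"], g["C"])) for k, g in df.groupby("A")}
print(out)
```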