The Python Oracle

How to create an incrementing group counter column

--

Full question
https://stackoverflow.com/questions/7419...

Accepted answer links:
[diff]: https://pandas.pydata.org/docs/reference...

Answer 2 links:
[mozway]: https://stackoverflow.com/users/16343464...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #dataframe



ACCEPTED ANSWER

Score 2


You can use diff to select only the first item of each stretch of True:

df['ExpectedGroup'] = (df['case_id'].diff()
                       & df['case_id']
                       ).cumsum().where(df['case_id'])

If you don't want the intermediate case_id column, build the boolean mask directly:

s = (df.FromState == 'O') & (df.ToState == 'O')
# or
# s = df[['FromState', 'ToState']].eq('O').all(axis=1)

df['ExpectedGroup'] = (s.diff() & s).cumsum().where(s)
# or
# df.loc[s, 'ExpectedGroup'] = (s.diff() & s).cumsum()

Output:

  ID FromState ToState  Hours  ExpectedGroup  case_id
0  A         P       O      2            NaN    False
1  A         O       O      5            1.0     True
2  A         O       O     10            1.0     True
3  A         O       P      4            NaN    False
4  A         P       P    300            NaN    False
5  B         P       O      2            NaN    False
6  B         O       O      5            2.0     True
7  B         O       O     10            2.0     True
8  B         O       P      4            NaN    False
9  B         P       P    300            NaN    False
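
For readers who want to run the idea end to end, here is a minimal self-contained sketch. It is a variant of the approach above rather than the answer's exact code: it marks the start of each run with shift(fill_value=False) instead of diff, so a run that begins on the first row is also counted as a start. The sample frame is an assumption, modeled on the output tables in this thread, and the name `starts` is only illustrative.

```python
import pandas as pd

# Sample data (assumed; modeled on the output tables in this thread)
df = pd.DataFrame({
    "ID": ["A"] * 10 + ["B"] * 5,
    "FromState": ["P", "O", "O", "O", "P"] * 3,
    "ToState": ["O", "O", "O", "P", "P"] * 3,
    "Hours": [2, 5, 10, 4, 300] * 3,
})

# True on rows where both states are 'O'
s = df["FromState"].eq("O") & df["ToState"].eq("O")

# A run starts where the current row is True and the previous row is not.
# fill_value=False treats the row "before" the first row as False.
starts = s & ~s.shift(fill_value=False)

# Running count of run starts, kept only on the True rows (NaN elsewhere)
df["ExpectedGroup"] = starts.cumsum().where(s)
print(df)
```

Because the counter only increases on the first row of each run, every row inside the same run receives the same group number.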



ANSWER 2

Score 1


Let's use cumsum to create a counter, then re-encode the counter using factorize:

m = df['case_id']
df.loc[m, 'ExpectedGroup'] = (~m).cumsum()[m].factorize()[0] + 1

   ID FromState ToState  Hours  ExpectedGroup  case_id
0   A         P       O      2            NaN    False
1   A         O       O      5            1.0     True
2   A         O       O     10            1.0     True
3   A         O       P      4            NaN    False
4   A         P       P    300            NaN    False
5   A         P       O      2            NaN    False
6   A         O       O      5            2.0     True
7   A         O       O     10            2.0     True
8   A         O       P      4            NaN    False
9   A         P       P    300            NaN    False
10  B         P       O      2            NaN    False
11  B         O       O      5            3.0     True
12  B         O       O     10            3.0     True
13  B         O       P      4            NaN    False
14  B         P       P    300            NaN    False
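
To see why this works, it may help to look at the intermediate values. The sketch below breaks the one-liner into steps on a small boolean Series; the sample mask and the names `blocks`, `labels`, `codes` and `group` are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for df['case_id'] (assumed sample values)
m = pd.Series([False, True, True, False, False, False, True, True, False])

# (~m).cumsum() increases on every False row, so it stays constant
# across each run of True rows: 1 1 1 2 3 4 4 4 5
blocks = (~m).cumsum()

# Keep only the True rows: 1 1 4 4
labels = blocks[m]

# factorize re-encodes the distinct labels as 0, 1, 2, ... in order of appearance
codes, uniques = pd.factorize(labels)

# Write the 1-based codes back onto the True rows, leaving NaN elsewhere
group = pd.Series(float("nan"), index=m.index)
group.loc[m] = codes + 1
print(group)
```

The factorize step is what turns the gappy block labels (1, 4, ...) into consecutive group numbers (1, 2, ...).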



ANSWER 3

Score 1


Similar to mozway's brilliant approach:

s = df['case_id']
# Note: this assumes every run of True is exactly two rows long, as in the data below
df['ExpectedGroup'] = (s.shift(-1, fill_value=False) & s).cumsum().mask(~s)
df

   ID FromState ToState  Hours  ExpectedGroup  case_id
0   A         P       O      2            NaN    False
1   A         O       O      5            1.0     True
2   A         O       O     10            1.0     True
3   A         O       P      4            NaN    False
4   A         P       P    300            NaN    False
5   A         P       O      2            NaN    False
6   A         O       O      5            2.0     True
7   A         O       O     10            2.0     True
8   A         O       P      4            NaN    False
9   A         P       P    300            NaN    False
10  B         P       O      2            NaN    False
11  B         O       O      5            3.0     True
12  B         O       O     10            3.0     True
13  B         O       P      4            NaN    False
14  B         P       P    300            NaN    False