How to count a boolean in grouped Spark data frame
--
Full question
https://stackoverflow.com/questions/3549...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #sql #apachespark #pyspark #apachesparksql
#avk47
ACCEPTED ANSWER
Score 40
Probably the simplest solution is a plain CAST (C style where TRUE -> 1, FALSE -> 0) with SUM:
from pyspark.sql import functions as F

(data
    .groupby("Region")
    .agg(F.avg("Salary"), F.sum(F.col("IsUnemployed").cast("long"))))
A slightly more general and idiomatic solution is CASE WHEN with COUNT:
(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))
but here it is clearly overkill.
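To make the difference between the two aggregations concrete, here is a plain-Python model of their SQL semantics. It is only a sketch and not Spark code: the sample rows and the helper names (cast_to_long, sql_sum, sql_count, when_true) are hypothetical, and None stands in for SQL NULL. It shows that both approaches agree whenever a group has at least one non-NULL flag, and that they differ for a group whose flags are all NULL, where SUM yields NULL and COUNT yields 0.

```python
from collections import defaultdict

# Hypothetical sample rows: (Region, Salary, IsUnemployed); None models SQL NULL.
rows = [
    ("North", 50000.0, False),
    ("North", 0.0, True),
    ("North", 42000.0, None),
    ("South", 0.0, True),
    ("South", 0.0, True),
    ("East", 61000.0, None),
]


def cast_to_long(flag):
    # CAST(boolean AS long): TRUE -> 1, FALSE -> 0, NULL stays NULL.
    return None if flag is None else int(flag)


def sql_sum(values):
    # SUM skips NULLs and returns NULL when no non-NULL input remains.
    kept = [v for v in values if v is not None]
    return sum(kept) if kept else None


def sql_count(values):
    # COUNT(expr) counts non-NULL inputs only, so it never returns NULL.
    return sum(1 for v in values if v is not None)


def when_true(flag):
    # WHEN without OTHERWISE yields NULL unless the condition is TRUE.
    return flag if flag is True else None


# Group the flags by region, like groupby("Region").
groups = defaultdict(list)
for region, _salary, is_unemployed in rows:
    groups[region].append(is_unemployed)

sum_of_cast = {r: sql_sum([cast_to_long(f) for f in fs]) for r, fs in groups.items()}
count_of_when = {r: sql_count([when_true(f) for f in fs]) for r, fs in groups.items()}

print(sum_of_cast)    # {'North': 1, 'South': 2, 'East': None}
print(count_of_when)  # {'North': 1, 'South': 2, 'East': 0}
```

If a NULL result for an all-NULL group is unwanted, the COUNT form avoids it without extra handling.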
ANSWER 2
Score 1
count_if function
PySpark 3.5 introduced pyspark.sql.functions.count_if, documented as "Returns the number of TRUE values for the col."
So for your example, you could do:
from pyspark.sql import functions as F

results = (
    data
    .groupby("Region")
    .agg(
        F.avg("Salary").alias("AverageSalary"),
        # new count_if function: number of rows where IsUnemployed is TRUE
        F.count_if("IsUnemployed").alias("CountUnemployed"),
        # old casting method still required for getting proportion of true values
        F.avg(F.col("IsUnemployed").cast("integer")).alias("ProportionUnemployed"),
    )
)
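The arithmetic behind the proportion column can be checked in plain Python. This is a sketch under stated assumptions, not Spark code: the flags list is hypothetical sample data for a single group, and None models SQL NULL. It illustrates that averaging the 0/1 cast gives the number of TRUE values divided by the number of non-NULL values, since AVG ignores NULLs.

```python
# Hypothetical flags for one group; None models SQL NULL.
flags = [True, False, True, None, True]

# What a count of TRUE values (count_if) would report for this group.
true_count = sum(1 for f in flags if f is True)

# What COUNT(col) would report: non-NULL values only.
non_null = sum(1 for f in flags if f is not None)

# AVG(CAST(col AS integer)): NULLs are skipped, TRUE -> 1, FALSE -> 0.
ints = [int(f) for f in flags if f is not None]
avg_of_cast = sum(ints) / len(ints) if ints else None

# The same proportion, derived from the two counts.
proportion = true_count / non_null if non_null else None

print(true_count, non_null, avg_of_cast, proportion)  # 3 4 0.75 0.75
```

Note that the denominator is the non-NULL count, not the total row count, so groups containing NULL flags report the proportion among known values only.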