How to count a boolean in grouped Spark data frame

Full question
https://stackoverflow.com/questions/3549...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #sql #apachespark #pyspark #apachesparksql

ACCEPTED ANSWER

Score 40

Probably the simplest solution is a plain CAST (C style, where TRUE -> 1 and FALSE -> 0) combined with SUM:

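from pyspark.sql import functions as F

# Casting the boolean to long maps TRUE to 1 and FALSE to 0, so SUM counts the TRUE rows
(data
    .groupby("Region")
    .agg(F.avg("Salary"), F.sum(F.col("IsUnemployed").cast("long"))))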
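
The cast works because a boolean converted to an integer type becomes 1 for TRUE and 0 for FALSE, so summing it counts the TRUE rows. A plain-Python model of the same per-group arithmetic is sketched below; the sample rows and the helper name `summarize` are illustrative assumptions, and nothing here runs on Spark:

```python
from collections import defaultdict

# Hypothetical sample rows: (Region, Salary, IsUnemployed)
rows = [
    ("North", 50000.0, False),
    ("North", 0.0, True),
    ("South", 42000.0, False),
    ("South", 0.0, True),
    ("South", 0.0, True),
]


def summarize(rows):
    """Return {region: (average salary, number of TRUE flags)}."""
    salaries = defaultdict(list)
    unemployed = defaultdict(int)
    for region, salary, is_unemployed in rows:
        salaries[region].append(salary)
        # int(True) == 1 and int(False) == 0, mirroring cast("long")
        unemployed[region] += int(is_unemployed)
    return {
        region: (sum(values) / len(values), unemployed[region])
        for region, values in salaries.items()
    }


summary = summarize(rows)
```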

A slightly more general and idiomatic solution is CASE WHEN with COUNT:

(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))

but here it is clearly overkill.
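
The COUNT variant works because `F.when` without `otherwise` yields NULL for rows that fail the condition, and COUNT skips NULLs. The two approaches differ on a group whose flags are all NULL: SUM returns NULL there, while COUNT returns 0. A plain-Python sketch of those semantics, with `None` standing in for NULL (the helper names are hypothetical and this only models the SQL behaviour):

```python
def sql_sum_of_cast(flags):
    """Model SUM(CAST(flag AS LONG)): NULLs are skipped, all-NULL gives NULL."""
    values = [int(f) for f in flags if f is not None]
    return sum(values) if values else None


def sql_count_when(flags):
    """Model COUNT(CASE WHEN flag THEN flag END): counts only TRUE rows."""
    # CASE WHEN without ELSE maps FALSE and NULL rows to NULL (None here)
    kept = [f if f else None for f in flags]
    return sum(1 for f in kept if f is not None)
```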

ANSWER 2

Score 1

`count_if` function

PySpark 3.5 introduced `pyspark.sql.functions.count_if`, documented as "Returns the number of TRUE values for the col."

So for your example, you could do:

from pyspark.sql import functions as F

results = (
    data
    .groupby("Region")
    .agg(
        F.avg("Salary").alias("AverageSalary"),
        # count_if (PySpark 3.5+) counts the rows where IsUnemployed is TRUE
        F.count_if("IsUnemployed").alias("CountUnemployed"),
        # the cast is still needed to get the proportion of TRUE values
        F.avg(F.col("IsUnemployed").cast("integer")).alias("ProportionUnemployed"),
    )
)
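
The averaged cast is the number of TRUE rows divided by the number of non-NULL rows, since AVG skips NULLs just as the other aggregates do. A small plain-Python check of that arithmetic, with `None` standing in for NULL (the helper name `proportion_true` is hypothetical and this is a model, not Spark code):

```python
def proportion_true(flags):
    """Model AVG(CAST(flag AS INT)): TRUE rows over non-NULL rows."""
    values = [int(f) for f in flags if f is not None]
    # An all-NULL group has nothing to average, so the result is NULL
    return sum(values) / len(values) if values else None
```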
