The Python Oracle

How to standardize ONE column in Spark using StandardScaler?

Full question
https://stackoverflow.com/questions/4762...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #apachespark #pyspark #scale


ACCEPTED ANSWER

Score 12


Just use plain aggregation:

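from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev, mean, col

# Reuse the active session if one exists (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()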

sample17 = spark.createDataFrame([(1, ), (2, ), (3, )]).toDF("age")

(sample17
  .select(mean("age").alias("mean_age"), stddev("age").alias("stddev_age"))
  .crossJoin(sample17)
  .withColumn("age_scaled" , (col("age") - col("mean_age")) / col("stddev_age")))

# +--------+----------+---+----------+
# |mean_age|stddev_age|age|age_scaled|
# +--------+----------+---+----------+
# |     2.0|       1.0|  1|      -1.0|
# |     2.0|       1.0|  2|       0.0|
# |     2.0|       1.0|  3|       1.0|
# +--------+----------+---+----------+

or collect the statistics to the driver and use them as plain Python values:

mean_age, stddev_age = sample17.select(mean("age"), stddev("age")).first()
sample17.withColumn("age_scaled", (col("age") - mean_age) / stddev_age)

# +---+----------+
# |age|age_scaled|
# +---+----------+
# |  1|      -1.0|
# |  2|       0.0|
# |  3|       1.0|
# +---+----------+
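
As a sanity check, the same arithmetic can be reproduced without Spark. This is only a sketch: it relies on `stddev` in `pyspark.sql.functions` being the sample standard deviation (an alias for `stddev_samp`), which is what `statistics.stdev` computes. The list `ages` is a stand-in for the "age" column.

```python
from statistics import mean, stdev

# Stand-in for the "age" column of sample17
ages = [1, 2, 3]

mean_age = mean(ages)      # arithmetic mean
stddev_age = stdev(ages)   # sample standard deviation (n - 1 in the denominator)

# z-score: subtract the mean, divide by the standard deviation
age_scaled = [(a - mean_age) / stddev_age for a in ages]
print(age_scaled)
```

If the population standard deviation is wanted instead, the Spark counterpart is `stddev_pop`, and the scaled values will differ.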

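If you want to use a Transformer such as StandardScaler instead, assemble the column into a vector, scale it, and then split the resulting vector back into columns.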