How to standardize ONE column in Spark using StandardScaler?
Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Puddle Jumping Looping
--
Chapters
00:00 Question
01:26 Accepted answer (Score 11)
02:09 Thank you
--
Full question
https://stackoverflow.com/questions/4762...
Accepted answer links:
[split vector into columns]: https://stackoverflow.com/q/38384347/837...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #apachespark #pyspark #scale
#avk47
ACCEPTED ANSWER
Score 12
Just use plain aggregation:
from pyspark.sql.functions import stddev, mean, col
sample17 = spark.createDataFrame([(1, ), (2, ), (3, )]).toDF("age")
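# compute the global mean / stddev once, then attach them to every row via a cross join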
(sample17
  .select(mean("age").alias("mean_age"), stddev("age").alias("stddev_age"))
  .crossJoin(sample17)
  .withColumn("age_scaled" , (col("age") - col("mean_age")) / col("stddev_age")))
# +--------+----------+---+----------+
# |mean_age|stddev_age|age|age_scaled|
# +--------+----------+---+----------+
# |     2.0|       1.0|  1|      -1.0|
# |     2.0|       1.0|  2|       0.0|
# |     2.0|       1.0|  3|       1.0|
# +--------+----------+---+----------+
or, collecting the statistics to the driver first and using them as literals:
mean_age, stddev_age = sample17.select(mean("age"), stddev("age")).first()
sample17.withColumn("age_scaled", (col("age") - mean_age) / stddev_age)
# +---+----------+
# |age|age_scaled|
# +---+----------+
# |  1|      -1.0|
# |  2|       0.0|
# |  3|       1.0|
# +---+----------+
If you want a Transformer, you can split the vector into columns.
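For reference, a minimal sketch of that Transformer route, assuming Spark ML is available (3.0+ for vector_to_array) and reusing the same sample17 DataFrame; the column names age_vec and age_scaled_vec are just placeholders:

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.functions import vector_to_array

# StandardScaler only works on vector columns, so wrap the single column first
assembler = VectorAssembler(inputCols=["age"], outputCol="age_vec")
scaler = StandardScaler(inputCol="age_vec", outputCol="age_scaled_vec",
                        withMean=True, withStd=True)

assembled = assembler.transform(sample17)
scaled = scaler.fit(assembled).transform(assembled)

# vector_to_array turns the one-element vector back into a plain numeric column
scaled.withColumn("age_scaled", vector_to_array("age_scaled_vec")[0]).show()

This should give the same age_scaled values as the aggregation approach above, at the cost of the extra assemble/fit/split steps.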