How to standardize ONE column in Spark using StandardScaler?
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Puzzle Game 5 Looping
--
Chapters
00:00 How To Standardize One Column In Spark Using Standardscaler?
01:05 Accepted Answer Score 12
01:34 Thank you
--
Full question
https://stackoverflow.com/questions/4762...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #apachespark #pyspark #scale
#avk47
ACCEPTED ANSWER
Score 12
Just use plain aggregation:
from pyspark.sql.functions import stddev, mean, col

sample17 = spark.createDataFrame([(1, ), (2, ), (3, )]).toDF("age")

# Compute the statistics once, attach them to every row, then scale.
(sample17
    .select(mean("age").alias("mean_age"), stddev("age").alias("stddev_age"))
    .crossJoin(sample17)
    .withColumn("age_scaled", (col("age") - col("mean_age")) / col("stddev_age"))
    .show())
# +--------+----------+---+----------+
# |mean_age|stddev_age|age|age_scaled|
# +--------+----------+---+----------+
# |     2.0|       1.0|  1|      -1.0|
# |     2.0|       1.0|  2|       0.0|
# |     2.0|       1.0|  3|       1.0|
# +--------+----------+---+----------+
or collect the two statistics to the driver and use them as plain Python values:

mean_age, stddev_age = sample17.select(mean("age"), stddev("age")).first()

sample17.withColumn("age_scaled", (col("age") - mean_age) / stddev_age).show()
# +---+----------+
# |age|age_scaled|
# +---+----------+
# |  1|      -1.0|
# |  2|       0.0|
# |  3|       1.0|
# +---+----------+
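Both variants compute the usual z-score, (age - mean) / stddev. Spark's stddev is the sample standard deviation (an alias for stddev_samp), which is why the values 1, 2, 3 give a standard deviation of exactly 1.0. A quick sanity check of the arithmetic in plain Python, no Spark needed; the variable names here are only for illustration:

```python
from statistics import mean, stdev  # stdev is the sample (n - 1) standard deviation

ages = [1, 2, 3]

mean_age = mean(ages)      # 2
stddev_age = stdev(ages)   # 1.0

# Same formula as the Spark expression: (age - mean) / stddev
age_scaled = [(a - mean_age) / stddev_age for a in ages]
print(age_scaled)  # [-1.0, 0.0, 1.0]
```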
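If you want to use a Transformer such as StandardScaler instead, assemble the column into a vector, scale it, and then split the resulting vector back into columns.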