Can anyone explain StandardScaler to me?

--

Full question
https://stackoverflow.com/questions/4075...

Question links:
[page]: http://scikit-learn.org/stable/modules/g...

Answer 1 links:
[How and why to Standardize your data: A python tutorial]: https://towardsdatascience.com/how-and-w...
[image]: https://i.stack.imgur.com/obywE.png
[StandardScaler difference between “with_std=False or True” and “with_mean=False or True”]: https://stackoverflow.com/a/57381708/502...

Answer 3 links:
[image]: https://i.stack.imgur.com/Z7ATR.png
http://sebastianraschka.com/Articles/201...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

ANSWER 1

Score 186

Intro

I assume that you have a matrix X where each row is a sample/observation and each column is a variable/feature. This is the input layout that scikit-learn estimators and transformers expect: X.shape should be (number_of_samples, number_of_features).

Core of method

The main idea is to standardize the features/variables/columns of X, individually, so that each one has μ = 0 and σ = 1, before applying any machine learning model.

StandardScaler() standardizes the features, i.e. each column of X, INDIVIDUALLY, so that each column/feature/variable has μ = 0 and σ = 1.

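P.S.: I find the most upvoted answer on this page wrong. It states that "each value in the dataset will have the sample mean value subtracted". This is not correct as written: the mean is computed and subtracted separately for each column (feature), not once over the whole dataset.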

Example with code

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(data)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis = 0)
array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis = 0)
array([1., 1.])
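
The statistics used above are stored on the fitted scaler, so the same transformation can be reused on data that was not seen during fitting. The following is a minimal sketch of that idea; the variable names and the extra sample `new_sample` are assumptions made up for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same 4 samples / 2 features as in the example above
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
scaler = StandardScaler().fit(data)

# Per-column statistics learned during fit
col_means = scaler.mean_    # mean of each column
col_scales = scaler.scale_  # standard deviation of each column

# Hypothetical unseen sample, scaled with the statistics learned above
new_sample = np.array([[2.0, 0.5]])
new_scaled = scaler.transform(new_sample)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(new_scaled)
```

Because the scaler is fitted once and then only applied, new data is shifted and scaled with the original column means and standard deviations rather than with its own.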

Appendix: The maths
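
The figure that belonged to this appendix is not reproduced here. As a stand-in, the standard definition of standardization for one feature (column) with N samples can be written as below; the symbols x_i, z_i and N are notation chosen for this sketch. StandardScaler uses the population standard deviation (division by N, not N - 1).

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
\qquad
z_i = \frac{x_i - \mu}{\sigma}
```

Here μ and σ are computed separately for each column of X, and z_i is the standardized value of x_i.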

UPDATE 08/2020: Concerning the effect of setting the input parameters with_mean and with_std to False/True, I have provided an answer here: StandardScaler difference between “with_std=False or True” and “with_mean=False or True”
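
As a rough illustration of those two parameters, the sketch below turns each one off in turn. The data values and variable names are made up for this example; the linked answer remains the place for the full discussion.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 samples, 2 features on different scales
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# with_std=False: only subtract the column means (no division by std)
centered_only = StandardScaler(with_std=False).fit_transform(data)

# with_mean=False: only divide by the column std (no centering)
scaled_only = StandardScaler(with_mean=False).fit_transform(data)

print(centered_only.mean(axis=0))  # column means are now 0
print(scaled_only.std(axis=0))     # column stds are now 1
```

With with_std=False the columns keep their original spread, and with with_mean=False they keep a non-zero mean.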

ACCEPTED ANSWER

Score 145

The idea behind StandardScaler is that it will transform your data such that its distribution has a mean of 0 and a standard deviation of 1.
In the case of multivariate data, this is done feature-wise (in other words, independently for each column of the data).
Each value has the mean subtracted and is then divided by the standard deviation, where the mean and standard deviation are those of the whole dataset in the univariate case, or of the value's own feature in the multivariate case.
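
The feature-wise behaviour described above can be sketched with plain NumPy. This is an illustration under assumed data, not the library's implementation; the array `X` and the names `Z` and `Z_global` are made up here. The second computation deliberately uses one mean and one standard deviation for the whole array, to show why that is not the same thing.

```python
import numpy as np

# Made-up data: 4 samples, 2 features on very different scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])

# Feature-wise standardization: one mean and one std per column
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population std (ddof=0)
Z = (X - mu) / sigma

# Whole-array standardization: a single mean and std for all values
Z_global = (X - X.mean()) / X.std()

print(Z.mean(axis=0), Z.std(axis=0))  # per-column mean 0 and std 1
print(Z_global.mean(axis=0))          # per-column means are not 0
```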

ANSWER 3

Score 31

How to calculate it:
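
The image for this answer is not reproduced here. As an assumed stand-in, here is the calculation worked by hand for a single column with the values 0, 1, 0, 1 (the first column of the example data in Answer 1), using the usual definition z = (x - μ) / σ with the population standard deviation:

```latex
\mu = \frac{0 + 1 + 0 + 1}{4} = 0.5
\qquad
\sigma = \sqrt{\frac{4 \cdot (0.5)^2}{4}} = 0.5
\qquad
z = \frac{0 - 0.5}{0.5} = -1 \ \text{(for } x = 0\text{)},
\quad
z = \frac{1 - 0.5}{0.5} = 1 \ \text{(for } x = 1\text{)}
```

These values match the first column of scaled_data printed in Answer 1.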

You can read more here:

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html#standardization-and-min-max-scaling

ANSWER 4

Score 12

This is useful when you want to compare data measured in different units. In that case, you want to remove the units. To do that consistently across all the data, you transform each series so that its variance is 1 and its mean is 0.
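
A small sketch of the unit-removal idea, using made-up height and weight values and a hypothetical helper `standardize` (neither comes from the original answer): the same heights expressed in centimetres and in metres give identical standardized values, and heights and weights end up on the same unitless scale.

```python
import numpy as np

def standardize(column):
    """Hypothetical helper: shift to mean 0 and scale to std 1."""
    return (column - column.mean()) / column.std()

# Made-up measurements in different units
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
heights_m = heights_cm / 100.0
weights_kg = np.array([50.0, 65.0, 70.0, 95.0])

z_cm = standardize(heights_cm)
z_m = standardize(heights_m)
z_kg = standardize(weights_kg)

# The choice of unit no longer matters after standardization
print(np.allclose(z_cm, z_m))
```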