The Python Oracle

Normalize columns of a dataframe

--

Full question
https://stackoverflow.com/questions/2641...

Accepted answer links:
[documentation]: http://scikit-learn.org/stable/modules/p...

Answer 3 links:
[Wikipedia: Unbiased Estimation of Standard Deviation]: https://en.wikipedia.org/wiki/Unbiased_e...
[sklearn.preprocessing.scale]: https://scikit-learn.org/stable/modules/...

Answer 4 links:
https://stats.stackexchange.com/question...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #dataframe #normalize


ANSWER 1

Score 825


One easy way using pandas (here with mean normalization, i.e. subtracting the mean and dividing by the standard deviation):

normalized_df = (df - df.mean()) / df.std()

To use min-max normalization instead:

normalized_df = (df - df.min()) / (df.max() - df.min())

Edit: To address some concerns: pandas automatically applies these functions column-wise in the code above.
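
As a small illustration of the column-wise behavior, the sketch below applies both formulas to a made-up two-column frame (the column names and values are assumptions, chosen only for the example). Note that df.std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

# Mean/std scaling: each column ends up with mean 0 and sample std 1
zscore_df = (df - df.mean()) / df.std()

# Min-max scaling: each column ends up spanning [0, 1]
minmax_df = (df - df.min()) / (df.max() - df.min())

print(zscore_df)
print(minmax_df)
```

Each column is scaled with its own mean, standard deviation, minimum and maximum, because pandas aligns the per-column statistics with the frame's column labels.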




ACCEPTED ANSWER

Score 421


You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

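x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# pass the original labels along, since the NumPy array does not carry them
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)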

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
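
A minimal end-to-end sketch of the scikit-learn approach, assuming a small made-up frame (the data and the names scaler, scaled and scaled_df are illustrative). It shows that fit_transform returns a plain NumPy array by default, so the column names and index have to be re-attached if you want them.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data, for illustration only
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [-5.0, 0.0, 5.0]},
    index=["x", "y", "z"],
)

scaler = MinMaxScaler()            # default feature_range is (0, 1)
scaled = scaler.fit_transform(df)  # NumPy array, labels are not carried over

# Re-attach the original column names and index
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(scaled_df)
```

Keeping the fitted scaler object around is useful if the same scaling should later be applied to other data with scaler.transform.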




ANSWER 3

Score 80


Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

You don't need to worry about whether your values are negative or positive: each column is rescaled so that its minimum maps to 0 and its maximum maps to 1.
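
One edge case the function above does not cover: if a column is constant, max_value - min_value is 0 and the division produces NaN. A possible guard is sketched below. The helper name normalize_safe and the choice to map constant columns to 0.0 are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

def normalize_safe(df):
    """Min-max scale each column; constant columns become 0.0 (an assumption)."""
    col_min = df.min()
    span = df.max() - col_min
    # Replace a zero span with 1 so the division yields 0 instead of NaN
    span = span.replace(0, 1)
    return (df - col_min) / span

# Hypothetical data: one ordinary column, one constant column
df = pd.DataFrame({"a": [-2.0, 0.0, 2.0], "const": [7.0, 7.0, 7.0]})
result = normalize_safe(df)
print(result)
```

Whether a constant column should become 0.0, be dropped, or be left untouched depends on the downstream use, so treat the replacement value as a design choice.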




ANSWER 4

Score 66


Your problem is actually a simple transform acting on the columns:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

Or, more tersely:

frame.apply(lambda x: x/x.max(), axis=0)
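
Note that dividing by the column maximum only keeps values within [0, 1] when the data are non-negative. The sketch below uses made-up values (the column names pos and mixed are assumptions) to show that the same result can be written without apply, and what happens when a column contains a negative entry.

```python
import pandas as pd

# Hypothetical data: one non-negative column, one with a negative value
frame = pd.DataFrame({"pos": [1.0, 2.0, 4.0], "mixed": [-4.0, 0.0, 2.0]})

via_apply = frame.apply(lambda x: x / x.max(), axis=0)
via_broadcast = frame / frame.max()  # pandas aligns the maxima on column labels

print(via_apply.equals(via_broadcast))
print(via_broadcast)
```

The mixed column ends up with a minimum below 0, so if a strict [0, 1] range is required, min-max scaling is the safer choice.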