Python Pandas Fillna Median not working

--

Full question
https://stackoverflow.com/questions/4912...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #python3x #python27 #pandas #dataframe

ACCEPTED ANSWER

Score 21

As @thesilkworm suggested, convert your series to numeric first. Below is a minimal example:

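import numpy as np
import pandas as pd

# dtype=object mimics a frame whose columns were read in as generic objects
df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan],
                   [5, 1, 2, 'hello'],
                   [1, 4, 3, 4],
                   [9, 8, 7, 6]], dtype=object)

df = df.fillna(df.median())  # fails: the columns are object dtype, not numeric

# Convert every column to numeric; values that cannot be parsed become NaN
df[df.columns] = df[df.columns].apply(pd.to_numeric, errors='coerce')

df = df.fillna(df.median())  # works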
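
To see why the first call fails and the second works, it can help to inspect the dtypes before and after the conversion. The following is a minimal sketch modeled on the example above; the names raw, numeric, medians and filled are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd

# Data modeled on the example above, stored as generic Python objects.
raw = pd.DataFrame(
    [[np.nan, np.nan, np.nan, np.nan],
     [5, 1, 2, "hello"],
     [1, 4, 3, 4],
     [9, 8, 7, 6]],
    dtype=object,
)
print(raw.dtypes)      # every column is object

# errors="coerce" turns unparseable values such as "hello" into NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)  # every column is float64

# median() skips NaN by default, so each column gets a usable fill value.
medians = numeric.median()
filled = numeric.fillna(medians)
print(filled)
```

Working on a copy (numeric) rather than overwriting the original frame keeps the raw values around, which makes it easier to check afterwards which entries were coerced.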

ANSWER 2

Score 0

You can use np.nanmedian plus a dict comprehension to rename the resulting index to the column names.

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, np.nan, 3], "col2": [5, np.nan, 10, np.nan]})

# Map positional index to column name, e.g. {0: "col1", 1: "col2"}
col_map = {df.columns.get_loc(col): col for col in df.columns}
median_series = pd.Series(np.nanmedian(df, axis=0)).rename(col_map)
df = df.fillna(median_series)

>> df
   col1  col2
0   1.0   5.0
1   2.0   7.5
2   2.0  10.0
3   3.0   7.5

You can see that the intermediate step in jpp's answer, calling df.median() after .apply(), gives the same result as the median_series I've defined above.

print(df.median()) # after .apply()
col1    2.0
col2    7.5
dtype: float64

print(median_series)
col1    2.0
col2    7.5
dtype: float64
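
The rename step is not strictly required. As a variation that is not part of the original answer, the column labels can be passed as the index when the Series is built. A minimal sketch, with medians_by_column and filled as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, np.nan, 3], "col2": [5, np.nan, 10, np.nan]})

# Label the medians with the column names directly instead of renaming afterwards.
medians_by_column = pd.Series(
    np.nanmedian(df.to_numpy(), axis=0),  # one NaN-ignoring median per column
    index=df.columns,
)

# fillna matches the Series index against the column names.
filled = df.fillna(medians_by_column)
print(filled)
```

This gives the same fill values as the dict-comprehension version; which one reads better is a matter of taste.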

Note: np.nanmedian raises an error if one of the values in your df is a string like "hi", which is often the behavior we want in production.

Explanation: I still like jpp's answer for most cases. But when my data is enormous, or when I'm sending data to an ML API endpoint, I don't want errors='coerce' to paper over everything; I want an error. Imagine you're in production six months later and some of your col1 values start arriving as strings: [1, 2, 'hi', 'bye']. With errors='coerce', 'hi' and 'bye' are silently set to NaN. You won't know that strings are leaking into the column from an upstream change you didn't make, and your ML algorithm keeps returning scores as usual while in reality it is scoring a bunch of NaNs. Model performance degrades and you don't know why, even though the model itself is fine; it's just not scoring correct data. The company loses money and you're fired. I realize I've digressed down an unlikely slippery slope, but I wanted to drive the point home.

This is the error we want to see in production:

df = pd.DataFrame({"col1": [1,2,np.nan,'hi'], "col2": [5, np.nan, 10, np.nan]})
col_map = {df.columns.get_loc(col):col for col in df.columns} # {0: "col1", 1: "col2"}
median_series = pd.Series(np.nanmedian(df, axis=0)).rename(col_map)
>> TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
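
If you want the same fail-loudly behavior with a more readable message, one option is to check explicitly for values that would be coerced. The sketch below is an assumption-laden illustration, not part of the original answer: strict_to_numeric is a hypothetical helper, not a pandas function. Note that pd.to_numeric with its default errors='raise' also rejects unparseable values; the helper only adds a report of which values were at fault.

```python
import numpy as np
import pandas as pd


def strict_to_numeric(series):
    """Convert to numeric, raising if any non-missing value cannot be parsed.

    Hypothetical helper for illustration; not part of pandas.
    """
    converted = pd.to_numeric(series, errors="coerce")
    # Present before conversion but NaN afterwards means the value was coerced.
    coerced = series[series.notna() & converted.isna()]
    if not coerced.empty:
        raise ValueError(f"non-numeric values found: {coerced.tolist()}")
    return converted


clean = pd.Series([1, 2, np.nan, 3], dtype=object)
dirty = pd.Series([1, 2, np.nan, "hi"], dtype=object)

print(strict_to_numeric(clean))  # converts without complaint

try:
    strict_to_numeric(dirty)
except ValueError as exc:
    print(exc)  # reports the offending values
```

Genuinely missing values still pass through as NaN and can be filled with the median afterwards; only values that were present but unparseable trigger the error.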