sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
--
Track title: CC G Dvoks String Quartet No 12 Ame 2
--
Chapters
00:00 Question
00:59 Accepted answer (Score 164)
01:52 Answer 2 (Score 74)
02:19 Answer 3 (Score 63)
02:42 Answer 4 (Score 20)
03:11 Thank you
--
Full question
https://stackoverflow.com/questions/3132...
Answer 2 links:
[this]: https://stackoverflow.com/questions/1525...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #python27 #scikitlearn #valueerror
#avk47
ACCEPTED ANSWER
Score 174
This might happen inside scikit, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one which depends e.g. on your matrix being positive definite and not fulfilling that criteria.
EDIT: How could I miss that:
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
is obviously wrong. Right would be:
np.any(np.isnan(mat))
and
np.all(np.isfinite(mat))
You want to check whether any of the elements are NaN, and not whether the return value of the any function is a number...
ANSWER 2
Score 74
I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:
df = df.reset_index()
I encountered this issue many times when I removed some entries in my df, such as
df = df[df.label=='desired_one']
ANSWER 3
Score 69
This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):
import pandas as pd
import numpy as np
def clean_dataset(df):
assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
df.dropna(inplace=True)
indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
return df[indices_to_keep].astype(np.float64)
ANSWER 4
Score 17
This is the check on which it fails:
Which says
def _assert_all_finite(X):
"""Like assert_all_finite, but only for ndarray."""
X = np.asanyarray(X)
# First try an O(n) time, O(1) space solution for the common case that
# everything is finite; fall back to O(n) space np.isfinite to prevent
# false positives from overflow in sum method.
if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
and not np.isfinite(X).all()):
raise ValueError("Input contains NaN, infinity"
" or a value too large for %r." % X.dtype)
So make sure that you have non NaN values in your input. And all those values are actually float values. None of the values should be Inf either.