The Python Oracle

sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn

--

Track title: CC G Dvoks String Quartet No 12 Ame 2

--

Chapters
00:00 Question
00:59 Accepted answer (Score 164)
01:52 Answer 2 (Score 74)
02:19 Answer 3 (Score 63)
02:42 Answer 4 (Score 20)
03:11 Thank you

--

Full question
https://stackoverflow.com/questions/3132...

Answer 2 links:
[this]: https://stackoverflow.com/questions/1525...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #python27 #scikitlearn #valueerror

#avk47



ACCEPTED ANSWER

Score 174


This might happen inside scikit, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one which depends e.g. on your matrix being positive definite and not fulfilling that criteria.

EDIT: How could I miss that:

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

is obviously wrong. Right would be:

np.any(np.isnan(mat))

and

np.all(np.isfinite(mat))

You want to check whether any of the elements are NaN, and not whether the return value of the any function is a number...




ANSWER 2

Score 74


I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:

df = df.reset_index()

I encountered this issue many times when I removed some entries in my df, such as

df = df[df.label=='desired_one']



ANSWER 3

Score 69


This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):

import pandas as pd
import numpy as np

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)



ANSWER 4

Score 17


This is the check on which it fails:

Which says

def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)

So make sure that you have non NaN values in your input. And all those values are actually float values. None of the values should be Inf either.