The Python Oracle

How to convert a Scikit-learn dataset to a Pandas dataset

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Book End

--

Chapters
00:00 How To Convert A Scikit-Learn Dataset To A Pandas Dataset
00:18 Accepted Answer Score 208
00:54 Answer 2 Score 127
01:08 Answer 3 Score 88
01:39 Answer 4 Score 19
01:55 Thank you

--

Full question
https://stackoverflow.com/questions/3810...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #scikitlearn #dataset

#avk47



ACCEPTED ANSWER

Score 208


Manually, you can use pd.DataFrame constructor, giving a numpy array (data) and a list of the names of the columns (columns). To have everything in one DataFrame, you can concatenate the features and the target into one numpy array with np.c_[...] (note the []):

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# save load_iris() sklearn dataset to iris
# if you'd like to check dataset type use: type(load_iris())
# if you'd like to view list of attributes use: dir(load_iris())
iris = load_iris()

# np.c_ is the numpy concatenate function
# which is used to concat iris['data'] and iris['target'] arrays 
# for pandas column argument: concat iris['feature_names'] list
# and string list (in this case one string); you can make this anything you'd like..  
# the original dataset would probably call this ['Species']
data1 = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])



ANSWER 2

Score 127


from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df.head()

This tutorial maybe of interest: http://www.neural.cz/dataset-exploration-boston-house-pricing.html




ANSWER 3

Score 88


TOMDLt's solution is not generic enough for all the datasets in scikit-learn. For example it does not work for the boston housing dataset. I propose a different solution which is more universal. No need to use numpy as well.

from sklearn import datasets
import pandas as pd

boston_data = datasets.load_boston()
df_boston = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df_boston['target'] = pd.Series(boston_data.target)
df_boston.head()

As a general function:

def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

df_boston = sklearn_to_df(datasets.load_boston())



ANSWER 4

Score 19


Took me 2 hours to figure this out

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
##iris.keys()


df= pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                 columns= iris['feature_names'] + ['target'])

df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

Get back the species for my pandas