The Python Oracle

Implementing ROC Curves for the K-NN machine learning algorithm using Python and scikit-learn

--

Full question
https://stackoverflow.com/questions/5291...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #machinelearning #scikitlearn #knn


ACCEPTED ANSWER

Score 19


If you look at the documentation for roc_curve(), you will see the following regarding the y_score parameter:

y_score : array, shape = [n_samples] Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).

You can get probability estimates with the predict_proba() method of sklearn's KNeighborsClassifier. For binary classification it returns a NumPy array with two columns, one for the negative class and one for the positive class. cross_val_score(), by contrast, returns one evaluation score per fold (accuracy by default for a classifier), not one score per test sample, so its output cannot serve as y_score. roc_curve() needs the probability estimates of the positive class, so you can replace your:

y_scores = cross_val_score(knn_cv, X, y, cv=76)
fpr, tpr, threshold = roc_curve(y_test, y_scores)

with:

y_scores = knn.predict_proba(X_test)
fpr, tpr, threshold = roc_curve(y_test, y_scores[:, 1])
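
To make the shape of the predict_proba() output concrete, here is a minimal sketch. It assumes the Wisconsin breast cancer dataset as a stand-in for the original data, and the fixed random_state and the names proba, pos_col and pos_scores are illustrative choices, not part of the question's code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): labels are 0 and 1, with 1 treated as positive
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0  # seed chosen only for repeatability
)

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

proba = knn.predict_proba(X_test)

# One row per test sample, one column per class
print(proba.shape)
# The columns follow the order of knn.classes_
print(knn.classes_)

# Look up the positive-class column instead of hard-coding its index
pos_col = list(knn.classes_).index(1)
pos_scores = proba[:, pos_col]

# With 10 uniformly weighted neighbors each score is a vote fraction,
# so there are at most 11 distinct values (0.0, 0.1, ..., 1.0)
print(np.unique(pos_scores))
```

Because the scores take so few distinct values, the resulting ROC curve for k-NN has only a handful of threshold points and looks piecewise linear rather than smooth.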

Note that [:, 1] takes every row of the second column, which selects only the probability estimates of the positive class. Here's a minimal reproducible example using the Wisconsin breast cancer dataset, since I don't have your autoimmune.csv:

from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Load the data and hold out 20% of it as a test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit k-NN on the training set only
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

# Column 1 of predict_proba() holds the positive-class probabilities
y_scores = knn.predict_proba(X_test)
fpr, tpr, threshold = roc_curve(y_test, y_scores[:, 1])
roc_auc = auc(fpr, tpr)

# Plot the ROC curve, with the chance diagonal for reference
plt.title('ROC Curve of kNN')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

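This produces a ROC curve like the following (the exact curve and AUC vary between runs because train_test_split is not seeded):

[Figure: KNN ROC curve]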
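
If the intent behind the original cross_val_score() call was a cross-validated ROC curve, one possible approach is cross_val_predict() with method="predict_proba", which returns an out-of-fold probability estimate for every sample. The sketch below is an assumption about that intent, not part of the accepted answer. It again uses the breast cancer dataset as a stand-in, and cv=5 and the *_cv variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): replace with your own X and y
X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=10)

# Each sample is scored by a model fitted on the folds that exclude it
cv_proba = cross_val_predict(knn, X, y, cv=5, method="predict_proba")

# Use the positive-class column as y_score, exactly as with a single split
fpr_cv, tpr_cv, thresholds_cv = roc_curve(y, cv_proba[:, 1])
roc_auc_cv = auc(fpr_cv, tpr_cv)
print("Cross-validated AUC: %0.2f" % roc_auc_cv)
```

The fpr_cv and tpr_cv arrays can be passed to the same plotting calls as in the example above. Pooling scores from all folds into one curve is a simplification. Computing one curve per fold and averaging them is another common choice.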