The Python Oracle

ROC curve for Isolation Forest

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Over Ancient Waters Looping

--

Chapters
00:00 ROC Curve For Isolation Forest
01:35 Accepted Answer Score 2
02:17 Answer 2 Score 4
03:12 Answer 3 Score 1
04:07 Thank you

--

Full question
https://stackoverflow.com/questions/5510...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #scikitlearn #roc #outliers #auc

#avk47



ANSWER 1

Score 4


The confusion_matrix() function only tells you how many points were correctly or incorrectly classified; it does not tell you how confident the model was when it misclassified a datapoint.

That confidence information is exactly what an ROC curve is built from: the curve measures a model's ability to rank each datapoint by its likelihood of belonging to a particular class.

Instead, use the decision_function() or score_samples() functions to calculate the model's confidence that each data point is (or is not) an anomaly. Then, use roc_curve() to get the points necessary to plot the curve itself.

Here is an example using the breast cancer dataset.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
# relabel so that malignant tumours are the positive class (1)
y = (y == 0).astype(int)

clf = IsolationForest(max_samples=100,
                      random_state=0, contamination='auto')
clf.fit(X)
# per-sample scores used to rank the points for the ROC curve
y_pred = clf.score_samples(X)

fpr, tpr, thresholds = roc_curve(y, y_pred)
plt.plot(fpr, tpr, 'k-', lw=2)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

[Plot: ROC curve for the breast cancer example]
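
Side note (my addition, based on the scikit-learn implementation, not part of the original answer): decision_function() and score_samples() differ only by the fitted constant offset_, so either one works as the ranking score for roc_curve() and both produce the same curve. A quick check, reusing clf and X from above:

import numpy as np

# the two scoring methods differ only by the constant offset_,
# so they induce the same ranking and the same ROC curve
assert np.allclose(clf.decision_function(X),
                   clf.score_samples(X) - clf.offset_)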




ACCEPTED ANSWER

Score 2


The confusion matrix essentially gives you a single point on the ROC curve. To construct a 'full' ROC curve you need a list of scores or probabilities for every instance; the curve is then traced out by varying the threshold above which an instance is predicted to belong to the positive class.
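
To make the threshold sweep concrete, here is a minimal sketch (with made-up scores and labels, not taken from the question) that computes one (FPR, TPR) point per threshold; roc_curve() automates exactly this:

import numpy as np

# hypothetical scores and binary labels for illustration
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])

# each threshold turns the scores into class predictions,
# which yields one (FPR, TPR) point on the curve
for t in np.sort(scores)[::-1]:
    y_hat = (scores >= t).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    print(f"threshold={t:.2f}  TPR={tp/(tp+fn):.2f}  FPR={fp/(fp+tn):.2f}")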

In your simple case (you have only one point of the ROC curve) you can approximate the curve by extrapolating through the origin and the point (1, 1):

import matplotlib.pyplot as plt

# compare to your confusion matrix to see the values
TP = 180
FN = 21

tpr = TP / (TP + FN)
fpr = 1 - tpr

# extrapolate through the origin and the point (1, 1)
tpr_line = [0, tpr, 1]
fpr_line = [0, fpr, 1]

plt.plot(fpr_line, tpr_line, 'k-', lw=2)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()

and the ROC curve looks like:

[Plot: ROC curve extrapolated from a single point]




ANSWER 3

Score 1


Everyone seems to give wrong answers for plotting an ROC curve for IsolationForest. That's because decision_function/score_samples return the opposite of what people expect: hits/positives (the anomalies) get low values and negatives get high values, which flips the ROC curve.

You must negate the results from decision_function/score_samples to get a correct ROC curve.

Additionally, this specific dataset (breast_cancer) also needs the positive label set to 0.

Note that these are two unrelated errors in the top answer, and they do not cancel each other out.

A comparison of the correct and incorrect ROC curves:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)

clf = IsolationForest(max_samples=100,
                      random_state=0, contamination='auto')
clf.fit(X)
y_pred = clf.score_samples(X)

# wrong: raw scores with the default positive label
fpr_wrong, tpr_wrong, _ = roc_curve(y, y_pred)
# correct: negated scores, with class 0 (malignant) as the positive label
fpr_correct, tpr_correct, _ = roc_curve(y, -y_pred, pos_label=0)

plt.plot(fpr_correct, tpr_correct, 'green', lw=1)
plt.plot(fpr_wrong, tpr_wrong, 'red', lw=1)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

[Plot: correct ROC curve (green) vs. flipped ROC curve (red)]
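
A quick numeric follow-up (my addition, not part of the original answer): fixing the positive class to the malignant tumours, the AUC with and without the negation are exact complements, so the flip shows up in a single number. roc_auc_score has no pos_label parameter, so the label choice is expressed by recoding y:

from sklearn.metrics import roc_auc_score

# reusing y and y_pred from the snippet above;
# (y == 0) marks the malignant tumours as the positive class
auc_without_negation = roc_auc_score(y == 0, y_pred)
auc_with_negation = roc_auc_score(y == 0, -y_pred)
# the two values sum to 1: negating the scores mirrors the curve
print(auc_without_negation, auc_with_negation)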