The Python Oracle

Isolation Forest vs Robust Random Cut Forest in outlier detection

--

Full question
https://stackoverflow.com/questions/6311...

Question links:
[Paper]: https://cs.nju.edu.cn/zhouzh/zhouzh.file...?
[Tutorial]: https://towardsdatascience.com/outlier-d...
[Paper]: http://proceedings.mlr.press/v48/guha16....
[Tutorial]: https://freecontent.manning.com/the-rand.../

Accepted answer links:
[sklearn.ensemble.IsolationForest]: https://scikit-learn.org/stable/modules/...
[Amazon Kinesis]: https://docs.aws.amazon.com/kinesisanaly...
[Amazon SageMaker]: https://docs.aws.amazon.com/sagemaker/la...
https://github.com/kLabUM/rrcf
[one machine or multiple machines]: https://docs.aws.amazon.com/sagemaker/la...
[SageMaker doc]: https://docs.aws.amazon.com/sagemaker/la...
[image]: https://i.stack.imgur.com/3FXmE.png

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #scikitlearn #amazonsagemaker #outliers #anomalydetection



ACCEPTED ANSWER

Score 12


In parts of this answer I assume you are referring to Sklearn's Isolation Forest. I believe these are the four main differences:

  1. Code availability: Isolation Forest has a popular open-source implementation in Scikit-Learn (sklearn.ensemble.IsolationForest), while both AWS implementations of Robust Random Cut Forest (RRCF), in Amazon Kinesis and Amazon SageMaker, are closed-source. There is an interesting third-party open-source RRCF implementation on GitHub, though: https://github.com/kLabUM/rrcf ; I am unsure how popular it is yet

  2. Training design: RRCF can work on streams, as highlighted in the paper and as exposed in the streaming analytics service Kinesis Data Analytics. By contrast, the absence of a partial_fit method suggests to me that Sklearn's Isolation Forest is a batch-only algorithm that cannot readily work on data streams

  3. Scalability: SageMaker RRCF is more scalable. Sklearn's Isolation Forest is single-machine code, which can nonetheless be parallelized over CPUs with the n_jobs parameter. On the other hand, SageMaker RRCF can be used over one machine or multiple machines. Also, it supports SageMaker Pipe mode (streaming data via unix pipes) which makes it able to learn on much bigger data than what fits on disk

  4. Feature sampling at each recursive isolation: RRCF gives more weight to dimensions with higher variance (according to the SageMaker doc), while I think Isolation Forest samples features uniformly at random. This is one reason why RRCF is expected to perform better in high-dimensional spaces (see the picture from the RRCF paper, linked above as [image])
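
To make point 4 concrete, here is a minimal standard-library sketch that compares the two ways of picking the cut dimension. It is not taken from either library. It assumes the weighting described in the RRCF paper, where a dimension is chosen with probability proportional to the side length (max minus min) of the bounding box, which stands in here for the "variance" mentioned in the SageMaker doc. The function names and the toy data are hypothetical.

```python
import random

def bounding_ranges(points):
    """Per-dimension range (max - min) of a list of equal-length points."""
    dims = range(len(points[0]))
    return [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]

def pick_dimension_uniform(points, rng):
    """Isolation Forest style: every feature is equally likely to be cut."""
    return rng.randrange(len(points[0]))

def pick_dimension_by_range(points, rng):
    """RRCF style (assumed from the paper): probability proportional to range."""
    ranges = bounding_ranges(points)
    return rng.choices(range(len(ranges)), weights=ranges, k=1)[0]

rng = random.Random(0)
# Hypothetical data: feature 0 spans a wide range, features 1 and 2 are
# near-constant noise that carries little information.
points = [(rng.uniform(0, 100), rng.uniform(0, 1), rng.uniform(0, 1))
          for _ in range(200)]

trials = 10_000
uniform_share = sum(pick_dimension_uniform(points, rng) == 0
                    for _ in range(trials)) / trials
range_share = sum(pick_dimension_by_range(points, rng) == 0
                  for _ in range(trials)) / trials

print(f"uniform sampling cuts feature 0 in {uniform_share:.0%} of trials")
print(f"range-weighted sampling cuts feature 0 in {range_share:.0%} of trials")
```

With uniform sampling, about one cut in three lands on the wide feature, whereas range-weighted sampling spends almost all cuts on it. When many dimensions are irrelevant, this is the behaviour that would be expected to help.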




ANSWER 2

Score 3


I believe they also differ in how they assign the anomaly score. IF's score is based on the point's distance from the root node, i.e. the depth at which it gets isolated: the shallower, the more anomalous. RRCF's score is based on how much a new point changes the tree structure (i.e., the shift in tree size caused by including the new point). This makes RRCF less sensitive to the sample size.
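
The depth-based score on the Isolation Forest side can be sketched in a few lines of standard-library Python. This is a simplified 1-D illustration, not Sklearn's implementation. It assumes the score from the Isolation Forest paper, s = 2^(-E[h] / c(n)), where h is the path length and c(n) is the normalizing average path length. The function names, sample size, and toy data are hypothetical, and the RRCF displacement score is not covered here.

```python
import math
import random

def c(n):
    """Normalizer: average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def path_length(x, sample, rng, depth=0, limit=8):
    """Depth at which x is isolated; grows only the branch x follows in one random tree."""
    if depth >= limit or len(sample) <= 1:
        return depth + c(len(sample))
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth + c(len(sample))
    cut = rng.uniform(lo, hi)
    # Keep the points that fall on the same side of the cut as x.
    side = [v for v in sample if (v < cut) == (x < cut)]
    return path_length(x, side, rng, depth + 1, limit)

def anomaly_score(x, data, rng, n_trees=100, sample_size=64):
    """Average path length over many random trees, mapped to a score in (0, 1)."""
    limit = math.ceil(math.log2(sample_size))
    total = 0.0
    for _ in range(n_trees):
        sample = rng.sample(data, sample_size)
        total += path_length(x, sample, rng, limit=limit)
    return 2.0 ** (-(total / n_trees) / c(sample_size))

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]  # hypothetical "normal" data

inlier_score = anomaly_score(0.0, data, rng)
outlier_score = anomaly_score(8.0, data, rng)
print(f"score at 0.0: {inlier_score:.2f}")
print(f"score at 8.0: {outlier_score:.2f}")
```

The point far from the bulk of the data is cut off after only a few splits, so its average depth is small and its score is closer to 1. The point in the dense centre usually reaches the depth limit and scores lower.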