Can sklearn Random Forest classifier adjust sample size by tree, to handle class imbalance?
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Techno Bleepage Open
--
Chapters
00:00 Question
01:32 Accepted answer (Score 2)
01:53 Answer 2 (Score 3)
02:16 Answer 3 (Score 0)
02:49 Answer 4 (Score 0)
03:27 Thank you
--
Full question
https://stackoverflow.com/questions/2025...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #r #scikitlearn #classification #randomforest
#avk47
ANSWER 1
Score 3
In version 0.16-dev, you can now use class_weight="auto" to get something close to what you want. This still uses all samples, but it reweights them so that the classes become balanced.
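A minimal sketch of that reweighting approach, assuming a toy imbalanced dataset; note that in later scikit-learn releases the option is spelled "balanced" rather than "auto":

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced dataset: 90 negatives, 10 positives (illustrative only).
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = np.array([0] * 90 + [1] * 10)

# class_weight="balanced" reweights samples inversely to class frequency,
# so the minority class contributes as much to the splits as the majority.
clf = RandomForestClassifier(n_estimators=50,
                             class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
```

Every tree still sees a full-size bootstrap; only the per-sample weights change, which is the "close to what you want" caveat in the answer.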
ACCEPTED ANSWER
Score 2
After reading over the documentation, I think that the answer is definitely no. Kudos to anyone who adds the functionality, though. As mentioned above, the R package randomForest contains this functionality.
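Since scikit-learn exposes no equivalent of R randomForest's sampsize/strata arguments, the behaviour can be roughly emulated by hand: fit each tree on a bootstrap that draws the same number of rows from every class. A hedged sketch (the helper names are invented for illustration, not a scikit-learn API):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_forest(X, y, n_trees=25, per_class=None, seed=0):
    """Rough emulation of R randomForest's sampsize/strata: every tree is
    fit on a bootstrap containing `per_class` rows from each class."""
    rng = np.random.RandomState(seed)
    classes = np.unique(y)
    if per_class is None:
        # Default to the size of the smallest class.
        per_class = min(int(np.sum(y == c)) for c in classes)
    trees = []
    for _ in range(n_trees):
        # Balanced bootstrap: sample with replacement within each class.
        idx = np.concatenate([
            rng.choice(np.where(y == c)[0], size=per_class, replace=True)
            for c in classes
        ])
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Majority vote across the individually fitted trees."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```

This loses some RandomForestClassifier conveniences (out-of-bag scoring, parallel fitting), but it reproduces the per-tree balanced sampling the question asks about.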
ANSWER 3
Score 0
As far as I am aware, the scikit-learn forests employ bootstrapping, i.e. the sample sets each tree is trained on are always the same size and are drawn from the original training set by random sampling with replacement.
Assuming you have a large enough set of training samples, why not balance the training set yourself to hold 50/50 positive/negative samples? You will achieve the desired effect. scikit-learn provides functionality for this.
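The 50/50 balancing step can be done with plain NumPy before fitting; a small sketch (the helper name is hypothetical, and this assumes binary 0/1 labels):

```python
import numpy as np

def undersample_balanced(X, y, seed=0):
    """Randomly undersample the majority class so that positives and
    negatives are represented 50/50 (binary labels assumed)."""
    rng = np.random.RandomState(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    n = min(len(pos), len(neg))
    # Keep n rows of each class, sampled without replacement.
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

The balanced arrays returned here can then be passed straight to `RandomForestClassifier.fit`, so every bootstrap is drawn from an already balanced pool.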
ANSWER 4
Score 0
Workaround in R only: for classification, the R randomForest package can simply use all cores of the machine at 100% CPU utilization.
This matches the time and speed of the sklearn RandomForest classifier.
For regression there is also a package, RandomforestParallel, on GitHub, which is much faster than the Python sklearn regressor.
Classification: I have tested it and it works well.