The Python Oracle

Decision Tree generating terminal leaves with same classes


--

Music by Eric Matyas
https://www.soundimage.org
Track title: City Beneath the Waves Looping

--

Chapters
00:00 Question
02:31 Accepted answer (Score 4)
04:11 Thank you

--

Full question
https://stackoverflow.com/questions/5122...

Question links:
https://medium.com/@haydar_ai/learning-d...
[Small Decision Tree Example]: https://i.stack.imgur.com/7A6so.png
[enter image description here]: https://i.stack.imgur.com/FCbOG.png

Accepted answer links:
[export_graphviz()]: http://scikit-learn.org/stable/modules/g...
[Gini impurity]: https://stats.stackexchange.com/a/339514...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #scikitlearn #decisiontree

#avk47



ACCEPTED ANSWER

Score 4


The class attribute you are referring to is the majority class at that particular node, and the colors come from the filled = True parameter you pass to export_graphviz().
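As a minimal sketch (using a made-up imbalanced dataset in place of yours), you can see filled = True producing those per-node colors in the generated DOT source:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical stand-in for the asker's data: roughly 78% class2, 22% class1
X, y = make_classification(n_samples=672, weights=[0.22, 0.78], random_state=0)

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# With filled=True, each node in the DOT source gets a fillcolor attribute:
# the hue encodes the majority class and the intensity encodes its purity
dot = export_graphviz(clf, out_file=None, filled=True,
                      class_names=["class1", "class2"])
print("fillcolor" in dot)  # → True
```

Without filled=True the nodes are left uncolored, and the class line in each node simply reports the majority class from the value counts.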

Now, looking at your dataset, you have 147 samples of class1 and 525 samples of class2, which is a fairly imbalanced ratio. It just so happens that the optimal splits for your particular dataset at this depth produce children where the majority class is class2. This is normal behaviour and a product of your data, and it's not altogether surprising given that class2 outnumbers class1 by roughly 3.5:1.

As to why the tree doesn't stop when the majority class is the same for both children of a split, that's simply how the algorithm works. Left unbounded, with no max depth, it will keep splitting until it produces only pure leaf nodes, each containing a single class exclusively (i.e. with a Gini impurity of 0). You've set max_depth = 2 in your example, so the tree stops before it can yield all pure nodes.
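To illustrate with synthetic data (not your dataset): an unbounded tree keeps splitting until every leaf is pure, while max_depth = 2 leaves mixed leaves behind:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=672, weights=[0.22, 0.78], random_state=0)

# No max_depth: the tree grows until every leaf has Gini impurity 0
full = DecisionTreeClassifier(random_state=0).fit(X, y)
full_leaves = full.tree_.children_left == -1  # -1 marks a leaf node
print(np.allclose(full.tree_.impurity[full_leaves], 0.0))  # → True

# max_depth=2: the tree stops early, so some leaves are still mixed
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
shallow_leaves = shallow.tree_.children_left == -1
print(float(shallow.tree_.impurity[shallow_leaves].max()))  # some impurity left
```

The fitted tree_ attribute exposes per-node impurities directly, which is handy for checking exactly where the depth cap forced the tree to stop.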

You'll notice that in the split you've boxed in red, the node on the right is almost 100% class2, with 54 instances of class2 and only 2 of class1. If the algorithm had stopped one level earlier, it would have produced the node above instead, with 291 instances of class2 and 45 of class1, which is far less pure and therefore less useful.
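Plugging the counts from your figure into the Gini formula (a quick sketch; gini here is a small helper, not a library function) shows why that extra split is worth making even though the majority class doesn't change:

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# The node above the red-boxed split: 291 class2 vs 45 class1
print(round(gini([291, 45]), 3))  # → 0.232

# The right-hand child after the split: 54 class2 vs only 2 class1
print(round(gini([54, 2]), 3))    # → 0.069
```

The split drives impurity down from about 0.23 to about 0.07 on that branch, which is exactly the improvement the algorithm is greedily optimizing for.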

Perhaps you could increase the max depth of your tree and see if you can separate out the classes further.
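One way to check this (sketched on synthetic data; mean_leaf_gini is a helper defined here, not a scikit-learn function): the sample-weighted average leaf impurity can only go down as you allow more depth, since Gini is concave and every split is chosen to reduce it:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=672, weights=[0.22, 0.78], random_state=0)

def mean_leaf_gini(depth):
    """Sample-weighted average Gini impurity over the tree's leaves."""
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y).tree_
    leaves = t.children_left == -1  # -1 marks a leaf node
    weights = t.n_node_samples[leaves] / t.n_node_samples[0]
    return float((weights * t.impurity[leaves]).sum())

# Deeper trees separate the classes further, so average impurity keeps dropping
for depth in (2, 4, 8):
    print(depth, round(mean_leaf_gini(depth), 3))
```

Just keep an eye on overfitting as you do this: a deeper tree will always look purer on the training data, so validate the added depth on held-out data.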