Decision Tree generating terminal leaves with same classes
--
Full question
https://stackoverflow.com/questions/5122...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
ACCEPTED ANSWER
Score 4
The class attribute you are referring to is simply the majority class at that particular node, and the colors come from the filled=True argument you pass to export_graphviz().
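For concreteness, here is a minimal sketch of that plotting call. The data is a hypothetical imbalanced toy set standing in for the 147/525 counts in the question; the variable names and class names are assumptions, not taken from your code.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    # Imbalanced toy data standing in for the question's 147 vs. 525 counts.
    X, y = make_classification(n_samples=672, weights=[0.22, 0.78],
                               random_state=0)

    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # filled=True is what colors each node by its majority class;
    # class_names controls the "class = ..." label printed in each box.
    dot_data = export_graphviz(clf, out_file=None, filled=True,
                               class_names=["class1", "class2"])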
Now, looking at your dataset, you have 147 samples of class1 and 525 samples of class2, which is a fairly imbalanced ratio. It just so happens that the optimal splits for your particular dataset at this depth all produce children whose majority class is class2. This is normal behaviour and a product of your data, and it is not altogether surprising given that class2 outnumbers class1 by roughly 3.5:1.
As to why the tree doesn't stop when the two children of a split share the same majority class: the stopping criteria never look at majority classes at all. Splits are chosen to reduce impurity, and a split can reduce impurity substantially even when both children end up with the same majority class. Left unbounded, with no max depth, the algorithm continues until every leaf is pure, i.e. contains a single class exclusively (Gini impurity 0). You've set max_depth=2 in your example, so the tree simply stops before it can reach all-pure leaves.
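A quick sketch of that behaviour, reusing the assumed X and y from the toy example above: a tree capped at depth 2 still has impure leaves, while an unconstrained tree keeps splitting until every leaf has zero impurity.

    from sklearn.tree import DecisionTreeClassifier

    capped = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    unbounded = DecisionTreeClassifier(random_state=0).fit(X, y)

    def leaf_impurities(clf):
        t = clf.tree_
        return t.impurity[t.children_left == -1]  # -1 marks a leaf node

    print(leaf_impurities(capped))            # typically all > 0 at depth 2
    print(leaf_impurities(unbounded).max())   # 0.0 -- every leaf is pure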
You'll notice that in the split you've boxed in red in your example, the node on the right is almost 100% class2, with 54 instances of class2 and only 2 of class1. If the algorithm had stopped before making that split, you'd be left with the parent node, which holds 291 instances of class2 and 45 of class1 and is far less useful.
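You can put a number on "far less useful" by computing the Gini impurity of each node from the counts visible in your plot:

    # Gini impurity from raw class counts: 1 - sum(p_k^2).
    def gini(*counts):
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    print(gini(291, 45))  # parent node:       ~0.232
    print(gini(54, 2))    # right child node:  ~0.069, much closer to pure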
Perhaps you could increase the max depth of your tree and see if you can separate out the classes further.
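As a rough sketch of that experiment (again using the assumed X and y from above), you could sweep max_depth and watch the leaf count and training accuracy:

    from sklearn.tree import DecisionTreeClassifier

    for depth in (2, 3, 4, None):  # None lets the tree grow until pure
        clf = DecisionTreeClassifier(max_depth=depth,
                                     random_state=0).fit(X, y)
        print(depth, clf.get_n_leaves(), clf.score(X, y))

Bear in mind that training accuracy can only go up as the tree deepens, so checking the same scores on held-out data is the usual guard against overfitting.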