Prune Unnecessary Leaves in sklearn DecisionTreeClassifier
Solution 1:
Using ncfirth's link, I was able to modify the code there so that it fits my problem:
from sklearn.tree._tree import TREE_LEAF

def is_leaf(inner_tree, index):
    # Check whether node is a leaf node
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Start pruning from the bottom - if we start from the top, we might miss
    # nodes that become leaves during pruning.
    # Do not use this directly - use prune_duplicate_leaves instead.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])

    # Prune children if both children are leaves now and make the same decision:
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        (decisions[index] == decisions[inner_tree.children_left[index]]) and
        (decisions[index] == decisions[inner_tree.children_right[index]])):
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        # print("Pruned {}".format(index))

def prune_duplicate_leaves(mdl):
    # Remove sibling leaves that make the same decision
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist()  # Decision for each node
    prune_index(mdl.tree_, decisions)
Using this on a DecisionTreeClassifier clf:
prune_duplicate_leaves(clf)
Edit: Fixed a bug for more complex trees
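For completeness, here is a self-contained, runnable sketch of this solution. The dataset and the deliberately shallow `max_depth=4` are illustrative choices of mine, not part of the original answer; the check at the end relies on the fact that merging sibling leaves which make the same decision can never change the model's predictions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._tree import TREE_LEAF

def is_leaf(inner_tree, index):
    # A node is a leaf when both child pointers are TREE_LEAF (-1)
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Recurse bottom-up so nodes that become leaves during pruning are seen
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])
    # Merge sibling leaves that predict the same class as their parent
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        decisions[index] == decisions[inner_tree.children_left[index]] and
        decisions[index] == decisions[inner_tree.children_right[index]]):
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF

def prune_duplicate_leaves(mdl):
    # Majority class (argmax over the class-count vector) at every node
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist()
    prune_index(mdl.tree_, decisions)

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
before = clf.predict(X)
prune_duplicate_leaves(clf)
after = clf.predict(X)
assert np.array_equal(before, after)  # pruning never changes predictions
```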
Solution 2:
DecisionTreeClassifier(max_leaf_nodes=8)
specifies (max) 8 leaves, so unless the tree builder has another reason to stop it will hit the max.
In the example shown, 5 of the 8 leaves contain very few samples (<=3) compared to the other 3 leaves (>50), a possible sign of over-fitting.
Instead of pruning the tree after training, one can specify either min_samples_leaf
or min_samples_split
to better guide the training, which will likely get rid of the problematic leaves. For instance, use the value 0.05
to require at least 5% of the samples in each leaf.
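A minimal sketch of that approach (the dataset is an illustrative choice; a float value of min_samples_leaf is interpreted by scikit-learn as a fraction of the training set):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# min_samples_leaf=0.05 forces every leaf to contain at least
# ceil(0.05 * n_samples) training samples, so tiny leaves never form.
clf = DecisionTreeClassifier(min_samples_leaf=0.05, random_state=0)
clf.fit(X, y)
```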
Solution 3:
I had a problem with the code posted here, so I revised it and had to add a small section (it deals with the case where both sides are the same but a comparison is still present):
from sklearn.tree._tree import TREE_LEAF, TREE_UNDEFINED

def is_leaf(inner_tree, index):
    # Check whether node is a leaf node
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Start pruning from the bottom - if we start from the top, we might miss
    # nodes that become leaves during pruning.
    # Do not use this directly - use prune_duplicate_leaves instead.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])

    # Prune children if both children are leaves now and make the same decision:
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        (decisions[index] == decisions[inner_tree.children_left[index]]) and
        (decisions[index] == decisions[inner_tree.children_right[index]])):
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        # also clear the now-meaningless split feature on the new leaf
        inner_tree.feature[index] = TREE_UNDEFINED
        # print("Pruned {}".format(index))

def prune_duplicate_leaves(mdl):
    # Remove sibling leaves that make the same decision
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist()  # Decision for each node
    prune_index(mdl.tree_, decisions)