
Prune Unnecessary Leaves In Sklearn Decisiontreeclassifier

I use sklearn.tree.DecisionTreeClassifier to build a decision tree. With the optimal parameter settings, I get a tree that has unnecessary leaves (see the example picture below).

Solution 1:

Using ncfirth's link, I was able to modify the code there so that it fits my problem:

from sklearn.tree._tree import TREE_LEAF

def is_leaf(inner_tree, index):
    # Check whether node is leaf node
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Start pruning from the bottom - if we start from the top, we might miss
    # nodes that become leaves during pruning.
    # Do not use this directly - use prune_duplicate_leaves instead.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])

    # Prune children if both children are leaves now and make the same decision:
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        (decisions[index] == decisions[inner_tree.children_left[index]]) and
        (decisions[index] == decisions[inner_tree.children_right[index]])):
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        # print("Pruned {}".format(index))

def prune_duplicate_leaves(mdl):
    # Remove leaves if both children make the same decision as the parent
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist()  # Decision for each node
    prune_index(mdl.tree_, decisions)

Using this on a DecisionTreeClassifier clf:

prune_duplicate_leaves(clf)

Edit: Fixed a bug for more complex trees
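The helpers above work directly on sklearn's internal tree_ arrays. As a quick orientation, here is a self-contained sketch of the quantities the pruning code reads (the iris data and model settings are just stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._tree import TREE_LEAF

# Stand-in model (assumption): any fitted DecisionTreeClassifier works.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

tree = clf.tree_
# children_left/children_right hold child node ids; TREE_LEAF (-1) marks a
# missing child, so a node is a leaf iff both entries equal TREE_LEAF.
leaf_mask = (tree.children_left == TREE_LEAF) & (tree.children_right == TREE_LEAF)
print("nodes:", tree.node_count, "leaves:", leaf_mask.sum())

# tree.value holds the per-node class distribution; argmax over the class
# axis is the class each node would predict - the "decision" being compared.
decisions = tree.value.argmax(axis=2).flatten()
print("decision per node:", decisions)
```

Since sklearn's trees are strictly binary, the node count always equals twice the leaf count minus one, which is a handy sanity check when mutating these arrays.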

Solution 2:

DecisionTreeClassifier(max_leaf_nodes=8) specifies at most 8 leaves, so unless the tree builder has another reason to stop, it will hit the max.

In the example shown, 5 of the 8 leaves have a very small number of samples (<= 3) compared to the other 3 leaves (> 50), a possible sign of over-fitting. Instead of pruning the tree after training, one can specify either min_samples_leaf or min_samples_split to better guide the training, which will likely get rid of the problematic leaves. For instance, use the value 0.05 to require at least 5% of the samples in each leaf.
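A small sketch of this approach (the synthetic data is an assumption; the point is how a float min_samples_leaf acts as a fraction of the training set):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption) - 400 samples, so 0.05 means >= 20 per leaf.
X, y = make_classification(n_samples=400, random_state=0)

loose = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)
# A float min_samples_leaf is interpreted as a fraction: every leaf must
# contain at least ceil(0.05 * n_samples) training samples.
strict = DecisionTreeClassifier(max_leaf_nodes=8, min_samples_leaf=0.05,
                                random_state=0).fit(X, y)

print("leaves without constraint:", loose.get_n_leaves())
print("leaves with min_samples_leaf=0.05:", strict.get_n_leaves())
```

The constrained tree can never produce the tiny 1–3-sample leaves from the question, because any split that would create one is simply never considered during training.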

Solution 3:

I had a problem with the code posted here, so I revised it and had to add a small section (it deals with the case where both sides are the same but a comparison is still present):

from sklearn.tree._tree import TREE_LEAF, TREE_UNDEFINED

def is_leaf(inner_tree, index):
    # Check whether node is leaf node
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Start pruning from the bottom - if we start from the top, we might miss
    # nodes that become leaves during pruning.
    # Do not use this directly - use prune_duplicate_leaves instead.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])

    # Prune children if both children are leaves now and make the same decision:
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        (decisions[index] == decisions[inner_tree.children_left[index]]) and
        (decisions[index] == decisions[inner_tree.children_right[index]])):
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        # also clear the split feature so the node no longer compares anything
        inner_tree.feature[index] = TREE_UNDEFINED
        # print("Pruned {}".format(index))

def prune_duplicate_leaves(mdl):
    # Remove leaves if both children make the same decision as the parent
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist()  # Decision for each node
    prune_index(mdl.tree_, decisions)
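To see the pruning in action end to end, here is a self-contained sketch (the hand-built toy dataset and the count_leaves helper are assumptions for illustration). It repeats the functions above, fits an unconstrained tree in which the region x <= 2 splits into several leaves that all predict class 0, prunes it, and shows that predictions are unchanged while the number of reachable leaves shrinks:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._tree import TREE_LEAF, TREE_UNDEFINED

def is_leaf(inner_tree, index):
    return (inner_tree.children_left[index] == TREE_LEAF and
            inner_tree.children_right[index] == TREE_LEAF)

def prune_index(inner_tree, decisions, index=0):
    # Bottom-up: prune children first so parents can become prunable too.
    if not is_leaf(inner_tree, inner_tree.children_left[index]):
        prune_index(inner_tree, decisions, inner_tree.children_left[index])
    if not is_leaf(inner_tree, inner_tree.children_right[index]):
        prune_index(inner_tree, decisions, inner_tree.children_right[index])
    if (is_leaf(inner_tree, inner_tree.children_left[index]) and
        is_leaf(inner_tree, inner_tree.children_right[index]) and
        decisions[index] == decisions[inner_tree.children_left[index]] and
        decisions[index] == decisions[inner_tree.children_right[index]]):
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
        inner_tree.feature[index] = TREE_UNDEFINED

def prune_duplicate_leaves(mdl):
    decisions = mdl.tree_.value.argmax(axis=2).flatten().tolist()
    prune_index(mdl.tree_, decisions)

def count_leaves(tree, index=0):
    # Helper (assumption): count only leaves reachable from the root -
    # pruned nodes stay in the arrays but become unreachable.
    if tree.children_left[index] == TREE_LEAF:
        return 1
    return (count_leaves(tree, tree.children_left[index]) +
            count_leaves(tree, tree.children_right[index]))

# Toy data (assumption): x == 3 is pure class 1; x <= 2 is impure but every
# sub-group there has majority class 0, so its leaves are all redundant.
X = [[0]] * 10 + [[1]] * 10 + [[2]] * 10 + [[3]] * 10
y = [0] * 10 + [0] * 6 + [1] * 4 + [0] * 9 + [1] * 1 + [1] * 10

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
before = count_leaves(clf.tree_)
preds_before = clf.predict(X)

prune_duplicate_leaves(clf)
after = count_leaves(clf.tree_)
print("leaves before:", before, "after:", after)
```

Note that only predict is guaranteed to be unchanged: samples that previously landed in a pruned leaf now stop at its ancestor, which by construction makes the same class decision, though predict_proba may differ.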
