Machine-learning · June 13, 2026

Decision Trees: A Complete Guide with Hand-Worked Examples

How information gain, Gini impurity, and recursive splitting turn data into interpretable rules

by Perivitta 22 mins read Intermediate
Share
Back to all posts

Decision Trees: A Complete Guide with Hand-Worked Examples

Introduction

A decision tree is one of the most intuitive models in machine learning. It makes predictions by asking a sequence of yes/no questions about the input features, branching left or right at each node until it reaches a leaf that gives the prediction. Every path from root to leaf is a human-readable rule: "if area > 2000 and bedrooms > 3, predict price > $450k."

That interpretability is why decision trees are used in medical diagnosis, credit risk, and fraud detection — domains where you need to explain the reasoning behind a decision, not just report a number. They are also the building block of the most powerful ensemble methods: Random Forest trains hundreds of trees in parallel; XGBoost trains them in sequence. Understanding a single tree is therefore a prerequisite for understanding both.


1. The Core Idea: Find the Best Split

Building a tree is a recursive process. At each node, the algorithm asks: which feature, and which threshold, produces the most informative split? "Most informative" means the two resulting groups are as pure as possible, where pure means predominantly one class.

There are two standard measures of impurity that define what "best" means:

  • Gini impurity — used by CART (the algorithm behind scikit-learn's implementation)
  • Entropy / Information Gain — used by ID3 and C4.5

Both measures agree in most practical situations. We will derive and use both.


2. Gini Impurity

Gini impurity measures the probability that a randomly chosen element from a node would be misclassified if it were labelled according to the class distribution at that node.

\[ \text{Gini}(S) = 1 - \sum_{k=1}^{K} p_k^2 \]

where \(p_k\) is the proportion of class \(k\) at node \(S\), and \(K\) is the number of classes.

  • A perfectly pure node (all one class) has Gini = 0
  • A maximally impure binary node (50/50 split) has Gini = 0.5

When we evaluate a split, we compute the weighted Gini of the two child nodes:

\[ \text{Gini}_{\text{split}} = \frac{n_L}{n} \cdot \text{Gini}(L) + \frac{n_R}{n} \cdot \text{Gini}(R) \]

We choose the split that minimises \(\text{Gini}_{\text{split}}\).


3. Information Gain and Entropy

Entropy measures the average unpredictability of a node's class distribution:

\[ H(S) = - \sum_{k=1}^{K} p_k \log_2 p_k \]

A pure node has entropy 0. A 50/50 binary split has entropy 1 bit. Information Gain is the reduction in entropy achieved by the split:

\[ \text{IG}(S, f) = H(S) - \left[ \frac{n_L}{n} H(L) + \frac{n_R}{n} H(R) \right] \]

We choose the feature and threshold that maximises information gain.


4. Worked Example: Building a Tree by Hand

Suppose we have 10 loan applicants and want to predict whether they default (D = Yes) or repay (D = No) based on two features: income level (High / Low) and credit score (Good / Poor).

# Income Credit Score Default?
1HighGoodNo
2HighGoodNo
3HighPoorNo
4LowGoodNo
5LowPoorYes
6LowPoorYes
7LowPoorYes
8HighPoorNo
9LowGoodNo
10LowPoorYes

The root node has 6 No and 4 Yes. Its entropy is:

\[ H(\text{root}) = -\tfrac{6}{10}\log_2\tfrac{6}{10} - \tfrac{4}{10}\log_2\tfrac{4}{10} \approx 0.971 \text{ bits} \]

Step 1: Evaluate the split on Income

Income = High: rows {1,2,3,8} → 4 No, 0 Yes → \(H = 0\)

Income = Low: rows {4,5,6,7,9,10} → 2 No, 4 Yes → \(H = -\tfrac{2}{6}\log_2\tfrac{2}{6} - \tfrac{4}{6}\log_2\tfrac{4}{6} \approx 0.918\)

\[ \text{IG}(\text{Income}) = 0.971 - \left[\tfrac{4}{10}(0) + \tfrac{6}{10}(0.918)\right] \approx 0.971 - 0.551 = 0.420 \text{ bits} \]

Step 2: Evaluate the split on Credit Score

Credit = Good: rows {1,2,4,9} → 4 No, 0 Yes → \(H = 0\)

Credit = Poor: rows {3,5,6,7,8,10} → 2 No, 4 Yes → \(H \approx 0.918\)

\[ \text{IG}(\text{Credit}) = 0.971 - \left[\tfrac{4}{10}(0) + \tfrac{6}{10}(0.918)\right] \approx 0.420 \text{ bits} \]

Both splits give the same information gain here. We pick Income arbitrarily (ties are broken by index). The High Income branch is already pure (all No). On the Low Income branch, we recurse.

Step 3: Recurse on the Low Income branch

6 samples: rows {4,5,6,7,9,10}. Credit Score = Good → {4,9} both No (pure). Credit Score = Poor → {5,6,7,10} all Yes (pure). The second split on Credit Score produces two pure leaves. The tree is done.

RulePrediction
Income = HighNo Default
Income = Low AND Credit = GoodNo Default
Income = Low AND Credit = PoorDefault

5. Splitting on Continuous Features

Real datasets have continuous features like age, income as a dollar amount, or transaction size. For a continuous feature, the algorithm tests every possible threshold (midpoints between adjacent sorted values) and computes information gain for each. The threshold that produces the highest information gain is chosen.

For a feature with \(n\) unique values, there are \(n-1\) candidate thresholds. This is why tree training on large datasets can be slow: for each node, every feature's thresholds must be evaluated. XGBoost's histogram-based split finding is a direct optimization of this step.

Worked Example: Choosing a Threshold

Suppose we have 6 applicants with a continuous Age feature and a loan default label:

#AgeDefault?
122No
225No
330Yes
435Yes
538Yes
642No

Root entropy: 3 No, 3 Yes → \(H = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.0\) bit.

Candidate thresholds (midpoints between consecutive sorted ages): 23.5, 27.5, 32.5, 36.5, 40.

Evaluating Age ≤ 27.5:
Left (Age ≤ 27.5): rows {1,2} → 2 No, 0 Yes → \(H = 0\)
Right (Age > 27.5): rows {3,4,5,6} → 1 No, 3 Yes → \(H = -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4} \approx 0.811\)

\[ \text{IG}(\text{Age} \leq 27.5) = 1.0 - \left[\tfrac{2}{6}(0) + \tfrac{4}{6}(0.811)\right] \approx 1.0 - 0.541 = 0.459 \text{ bits} \]

Evaluating Age ≤ 36.5:
Left: rows {1,2,3,4,5} → 2 No, 3 Yes → \(H \approx 0.971\)
Right: rows {6} → 1 No, 0 Yes → \(H = 0\)

\[ \text{IG}(\text{Age} \leq 36.5) = 1.0 - \left[\tfrac{5}{6}(0.971) + \tfrac{1}{6}(0)\right] \approx 1.0 - 0.809 = 0.191 \text{ bits} \]

The threshold Age ≤ 27.5 gives the highest information gain (0.459 bits) and is selected as the best split. The algorithm repeats this process for every feature at every node, always choosing the globally best split.


6. Decision Trees for Regression

When the target is continuous, impurity is replaced by variance reduction (or equivalently, minimizing mean squared error). Each leaf predicts the mean of training targets that fell into it.

\[ \text{MSE}(S) = \frac{1}{n} \sum_{i \in S} (y_i - \bar{y}_S)^2 \]

The split that most reduces the weighted MSE of the two child nodes is chosen. This is exactly how regression trees in Random Forest and gradient boosting work.

Worked Example: Regression Split

Suppose we want to predict house price from house size:

Size (sq ft)Price ($k)
900150
1100200
1400280
1800350
2200420

Root mean: \(\bar{y} = (150+200+280+350+420)/5 = 280\). Root MSE = \(\frac{1}{5}[(150-280)^2 + (200-280)^2 + (280-280)^2 + (350-280)^2 + (420-280)^2] = \frac{1}{5}[16900 + 6400 + 0 + 4900 + 19600] = 9560\).

Evaluating Size ≤ 1250 (splitting {900,1100} from {1400,1800,2200}):
Left mean = 175, Left MSE = \(\frac{1}{2}[(150-175)^2 + (200-175)^2] = 625\)
Right mean = 350, Right MSE = \(\frac{1}{3}[(280-350)^2 + (350-350)^2 + (420-350)^2] \approx 3267\)

\[ \text{Weighted MSE} = \tfrac{2}{5}(625) + \tfrac{3}{5}(3267) = 250 + 1960 = 2210 \]

Variance reduction = 9560 − 2210 = 7350. The algorithm compares this against all other candidate thresholds and picks the one with the largest variance reduction. At prediction time, a new house with Size 1050 falls in the left leaf and gets predicted price $175k (the mean of that leaf's training targets).


7. Controlling Tree Complexity

An unconstrained tree will grow until every leaf is pure, perfectly fitting the training set and badly overfitting. Several hyperparameters control this:

Hyperparameter Effect sklearn name
Max depth Limits the number of splits from root to leaf. Depth 1 = a single decision stump. max_depth
Min samples per leaf A leaf must have at least this many training samples. Prevents splits on tiny subgroups. min_samples_leaf
Min samples to split A node must have at least this many samples before it can be split further. min_samples_split
Max features At each split, only consider a random subset of features. Reduces correlation between trees in ensembles. max_features
Min impurity decrease A split is only made if it reduces impurity by at least this amount. min_impurity_decrease

Practical Guidance on Choosing Values

max_depth is the most important lever. Start with 3–5 for most tabular datasets. A depth-3 tree has at most 8 leaves, which is interpretable and often surprisingly effective. Only increase depth once you have confirmed that the model is underfitting (high training error, not just high test error).

min_samples_leaf directly prevents the tree from memorizing noise. A common rule of thumb: set it to at least 1% of your training set size. For a 10,000-row dataset, min_samples_leaf=100 means no leaf can represent fewer than 100 examples — small enough to be specific, large enough to be reliable. For regression, larger values (5%+) are often better since noise in continuous targets is harder to filter.

min_samples_split is typically set to 2 * min_samples_leaf. There is rarely a reason to tune it independently.

max_features matters most in ensemble contexts. For a single tree used for interpretability, leave it at None (use all features). In Random Forest, max_features="sqrt" (scikit-learn default) decorrelates trees effectively. For gradient boosting, a value around 0.5–0.8 acts as column subsampling.

The general tuning strategy: fix max_depth first, then adjust min_samples_leaf to reduce overfitting, then use cost-complexity pruning (Section 8) to fine-tune.


8. Cost-Complexity Pruning

Post-pruning (also called cost-complexity pruning) builds the full tree first, then removes branches that provide little benefit. It adds a regularization term \(\alpha\) that penalizes tree complexity:

\[ R_\alpha(T) = R(T) + \alpha |T| \]

where \(R(T)\) is the training error and \(|T|\) is the number of leaves. Higher \(\alpha\) produces a smaller, more regularized tree. The optimal \(\alpha\) is found by cross-validation. In scikit-learn this is controlled by ccp_alpha.

How to Find the Right ccp_alpha

Scikit-learn exposes the full pruning path via cost_complexity_pruning_path(), which returns the effective alphas and corresponding impurities at each pruning step. You cross-validate over this set of alphas to find the one that maximises validation accuracy:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Build the full tree first to get candidate alphas
full_tree = DecisionTreeClassifier(random_state=42)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # exclude the last (trivial root node)

# Cross-validate each alpha
cv_scores = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

# Pick the alpha with the best CV score
best_alpha = ccp_alphas[np.argmax(cv_scores)]
print(f"Best ccp_alpha: {best_alpha:.5f}")

# Train final model
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
final_tree.fit(X_train, y_train)

A typical result: the full unpruned tree has 40+ leaves and 72% test accuracy; after pruning with the optimal alpha it has 6 leaves and 87% test accuracy. The pruned tree is both more accurate and more interpretable — a rare win in both directions.

When to use pruning vs pre-pruning hyperparameters: Use max_depth and min_samples_leaf when you have a clear interpretability requirement and want a fixed-size tree. Use ccp_alpha when you want to let the data determine the tree shape — it finds the pruning level that optimally trades off training fit and tree complexity.


9. Advantages and Limitations

Advantages Limitations
Fully interpretable — every prediction has a traceable rule path High variance — small changes in training data produce very different trees
Handles both categorical and continuous features natively Prone to overfitting without depth or leaf constraints
Requires no feature scaling or normalization Axis-aligned splits cannot capture diagonal decision boundaries efficiently
Handles missing values with surrogate splits Biased toward features with more unique values when using information gain
Fast to train and predict on tabular data A single tree rarely achieves competitive accuracy on its own

The high variance problem is the most practically important limitation. If you bootstrap-resample your training data (take a random 80% sample) and retrain the tree, you often get a completely different structure. This instability means single trees are unreliable for most production use cases. It is the primary motivation for ensemble methods.

The axis-aligned splits limitation means decision trees struggle with features that only matter in combination. If a class boundary is "x + y > 5", a tree needs many splits to approximate a diagonal boundary, while logistic regression captures it in one linear term. For these patterns, trees need to be deep (and therefore overfit) to compete.

The feature bias in information gain is a real concern: features with many unique values (like a customer ID) appear highly informative simply because they can partition the data into many small pure groups. The solution is to use the Gain Ratio (C4.5) or Gini (CART), which normalise for the number of distinct values. Scikit-learn uses Gini by default, which handles this better than raw information gain.

Despite these limitations, a decision tree is the right choice when the model's decisions must be explained to a non-technical audience, when regulatory requirements demand auditability (credit scoring, medical diagnosis), or when you want a fast interpretable baseline before moving to an ensemble.


10. From Trees to Ensembles

The high variance of a single decision tree is its main weakness. The insight behind ensemble methods is that averaging many high-variance, low-bias models can dramatically reduce variance without sacrificing bias:

  • Random Forest trains many trees on bootstrap samples of the data and averages their predictions. Each tree sees a random subset of features at each split, decorrelating the trees so that averaging produces a meaningful reduction in variance.
  • Gradient Boosting (XGBoost, LightGBM) trains shallow trees sequentially, each correcting the residuals of the ensemble so far. The depth constraint keeps individual trees weak (high bias), and the boosting process reduces bias at the ensemble level.

Both approaches work because of properties of the individual tree: interpretable splits, no feature scaling requirement, and good handling of tabular data with mixed feature types.


11. Python Implementation

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train with depth constraint and cost-complexity pruning
clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=4,
    min_samples_leaf=5,
    ccp_alpha=0.01,
    random_state=42
)
clf.fit(X_train, y_train)

print(f"Train accuracy: {accuracy_score(y_train, clf.predict(X_train)):.3f}")
print(f"Test accuracy:  {accuracy_score(y_test,  clf.predict(X_test)):.3f}")

# Visualize
fig, ax = plt.subplots(figsize=(20, 8))
plot_tree(clf, feature_names=load_iris().feature_names,
          class_names=load_iris().target_names,
          filled=True, ax=ax)
plt.tight_layout()
plt.savefig("decision_tree.png", dpi=150)

Frequently Asked Questions

What is the difference between Gini impurity and information gain?

Both measure node purity but use different formulas. Gini impurity measures the probability of misclassifying a randomly chosen sample and is used by CART (scikit-learn). Information gain measures the entropy reduction from a split and is used by ID3 and C4.5. In practice they produce very similar trees; Gini is slightly faster to compute since it avoids a logarithm.

How do you prevent a decision tree from overfitting?

The main controls are max_depth, min_samples_split, and min_samples_leaf. Limiting depth is the simplest approach. Post-pruning with cost-complexity pruning (the ccp_alpha parameter in scikit-learn) removes branches whose removal does not hurt validation accuracy. Cross-validation is used to tune the right pruning strength.

When should you use a decision tree over other models?

Use decision trees when interpretability is a hard requirement — medical diagnosis, credit decisions, and compliance contexts where you must explain the prediction. They also require no feature scaling and handle mixed data types natively. For pure predictive accuracy, Random Forest or XGBoost almost always outperform a single tree.

How does a decision tree relate to Random Forest and XGBoost?

Both ensemble methods are built on decision trees. Random Forest trains many deep trees in parallel on random data and feature subsets, then averages their predictions (bagging). XGBoost trains shallow trees in sequence, where each tree corrects the residual errors of the previous ones (boosting). Understanding a single tree is the prerequisite for understanding both.


Key Takeaways

  • A decision tree recursively splits data by choosing the feature and threshold that maximises information gain (entropy reduction) or minimises Gini impurity at each node.
  • Unpruned trees overfit. Control complexity with max_depth, min_samples_leaf, and ccp_alpha.
  • For regression, variance reduction (MSE) replaces impurity as the split criterion; each leaf predicts the mean of its training targets.
  • A single tree has high variance. Random Forest and gradient boosting methods reduce this variance through ensembling while retaining the tree's interpretable split structure.

References

  • Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth.
  • Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81–106.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). Springer.
  • Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Related Articles

Knowledge Distillation: How Small Models Learn from Big Ones
Knowledge Distillation: How Small Models Learn from Big Ones
Knowledge distillation trains a small student model to learn from a large...
Read More →
Mixture of Experts (MoE): The Architecture Behind GPT-4, Mixtral, and Grok
Mixture of Experts (MoE): The Architecture Behind GPT-4, Mixtral, and Grok
Mixture of Experts scales model capacity without scaling compute. Instead of activating...
Read More →
Found this useful?