Random Forest: A Complete Beginner's Guide

Before We Start: The Core Idea in One Sentence

Random Forest builds a large number of deliberately different decision trees on random subsets of your data and random subsets of your features, then combines their predictions. A crowd of imperfect but diverse experts reliably outperforms any single expert.

1. Why Ensembles Work: Wisdom of Crowds

In 1906, statistician Francis Galton visited a county fair and observed an ox-weight-guessing competition. About 800 villagers each wrote down a guess. No single person was accurate, but the median of all guesses was 1,207 lb. The actual weight was 1,198 lb. The crowd was within 0.8%.

This phenomenon, that aggregating many independent imperfect estimates beats any individual estimate, is called the wisdom of crowds. Random Forest is its machine learning incarnation.

Formally, suppose you have $B$ classifiers, each making an independent error with probability $p < 0.5$. The probability that more than half of them are wrong (i.e., the majority vote is wrong) follows the binomial distribution:

P(\text{majority wrong}) = \sum_{k=\lceil B/2 \rceil}^{B} \binom{B}{k} p^k (1-p)^{B-k}

For $B = 100$ trees each with error rate $p = 0.30$, the majority-vote error is approximately 0.0009%, compared to 30% for any single tree. That dramatic drop is the power of ensembles.

Figure 1: The Random Forest pipeline. Each tree trains on an independent bootstrap sample using a random subset of features at every split. Predictions are combined via majority vote (classification) or averaging (regression).

2. Prerequisites: Decision Trees in 5 Minutes

Random Forest is built on decision trees, so you need to understand them first. A decision tree recursively splits the training data at the feature value that best separates the classes.

Example CART decision tree for Titanic survival prediction — **Figure 2:** A CART decision tree predicting Titanic survival. Each internal node tests one feature; leaves give the predicted class (survived / died). Source: Wikipedia (CC BY-SA 3.0)

2.1 Gini Impurity

The most common split criterion in Random Forest (and the default in scikit-learn) is Gini impurity, which measures how often a randomly chosen element would be incorrectly labelled if it were randomly labelled according to the class distribution at that node.

For a node containing $n$ samples across $K$ classes, where $p_k$ is the fraction of samples belonging to class $k$:

G = 1 - \sum_{k=1}^{K} p_k^2

Interpretation of extremes:

$G = 0$: the node is pure. All samples belong to one class, so no more splitting is needed.
$G = 0.5$: for a binary problem, samples are split 50/50. This is maximum impurity. Knowing the node label tells you nothing.

Example: A node has 10 samples: 7 class A, 3 class B.

p_A = 0.7,\quad p_B = 0.3 \] \[ G = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 1 - 0.58 = 0.42

2.2 Entropy and Information Gain

Entropy (used in the ID3 / C4.5 algorithms) measures uncertainty using Shannon's information theory:

H = -\sum_{k=1}^{K} p_k \log_2(p_k)

For the same node (7A, 3B):

H = -(0.7\log_2 0.7 + 0.3\log_2 0.3) = -(0.7 \cdot (-0.515) + 0.3 \cdot (-1.737)) \approx 0.881 \text{ bits}

Information Gain is the reduction in entropy achieved by a split:

\text{IG}(S, f) = H(S) - \sum_{v \in \text{values}(f)} \frac{|S_v|}{|S|} H(S_v)

where $S$ is the current node, $f$ is the feature being tested, and $S_v$ is the subset where feature $f$ takes value $v$.

Gini vs Entropy: Both produce very similar trees in practice. Gini is slightly faster to compute (no logarithm) and is the scikit-learn default. Entropy can produce slightly more balanced trees.

2.3 How a Tree Grows

The CART (Classification and Regression Trees) algorithm builds a binary tree greedily:

At the current node, consider every feature and every possible split threshold.
Choose the split that minimises the weighted Gini impurity of the two child nodes: $\text{Gini}_{\text{split}} = \frac{n_L}{n} G_L + \frac{n_R}{n} G_R$
Recursively repeat for each child node.
Stop when a stopping condition is met: pure node, minimum samples, maximum depth, etc.

The critical weakness of a single tree: Grown to full depth, a decision tree memorises the training data perfectly (zero training error) but generalises poorly. It overfits. Random Forest solves this by averaging many trees, each trained on slightly different data.

3. Bagging: Bootstrap Aggregating

Bagging (short for Bootstrap Aggregating), proposed by Leo Breiman in 1996, is the foundation of Random Forest. The idea is:

Bootstrap: From your training set of $n$ samples, draw $n$ samples with replacement to create a new dataset. Some original samples appear multiple times; others are omitted entirely. On average, ~36.8% of samples are left out and form the Out-of-Bag set (covered in Section 7).
Train: Fit a full decision tree on the bootstrap sample.
Repeat: Do steps 1–2 $B$ times, producing $B$ trees.
Aggregate: For a new input, collect the prediction of each tree and return the majority vote (classification) or mean (regression).

Bagging diagram showing bootstrap samples, individual models, and aggregated prediction — **Figure 3:** Bagging: each model is trained on a different bootstrap sample drawn from the original training set. Final prediction combines all models. Source: Wikipedia (CC BY-SA 4.0)

Why does bootstrap sampling help?

Each bootstrap sample is a slightly different view of the data. Trees trained on different samples make different errors, and different errors cancel out when you average. This reduces variance without increasing bias, which is the key insight.

Mathematically, if each tree has variance $\sigma^2$ and the trees are uncorrelated ($\rho = 0$), the variance of the average prediction is:

\text{Var}\!\left(\frac{1}{B}\sum_{b=1}^B \hat{f}_b(x)\right) = \frac{\sigma^2}{B}

In practice, trees are correlated (they all train on the same features, so strong features appear in most trees), so the variance reduction is:

\text{Var} = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2

As $B \to \infty$, this approaches $\rho \sigma^2$. The remaining variance is entirely due to inter-tree correlation $\rho$. This is exactly the problem that Random Forest's feature randomness solves.

4. Random Forest = Bagging + Feature Randomness

Random Forest, introduced by Leo Breiman in 2001, adds one crucial modification to bagging: at each split, only a random subset of $m$ features is considered, rather than all $p$ features.

Algorithm	Bootstrap samples	Feature subsampling at each split
Single Decision Tree	No	All $p$ features
Bagging	Yes	All $p$ features
Random Forest	Yes	Random $m \ll p$ features

Recommended defaults for $m$:

Classification: $m = \lfloor\sqrt{p}\rfloor$
Regression: $m = \lfloor p/3 \rfloor$

Why this extra step? Without feature subsampling, all trees in a bagged ensemble tend to use the same dominant features near the root. This makes the trees highly correlated ($\rho$ stays high), limiting variance reduction. By forcing each tree to consider only a random subset of features at each split, we decorrelate the trees. Even if one feature is overwhelmingly predictive, many trees will not have access to it at key splits and will find alternative patterns. The ensemble correlation $\rho$ drops, and the ensemble variance falls further.

5. The Complete Math

5.1 Bootstrap Sampling

Given a training set $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, bootstrap sample $b$ is drawn as:

\mathcal{D}^{*b} = \{(\mathbf{x}_{i_1}, y_{i_1}), \ldots, (\mathbf{x}_{i_n}, y_{i_n})\}, \quad i_j \sim \text{Uniform}\{1,\ldots,n\} \text{ with replacement}

The probability that any given sample is excluded from bootstrap $b$:

P(\text{excluded}) = \left(1 - \frac{1}{n}\right)^n \xrightarrow{n \to \infty} e^{-1} \approx 0.368

So about 36.8% of training points form the Out-of-Bag (OOB) set for each tree.

5.2 Tree Construction (Formal)

For bootstrap sample $b$, build tree $\hat{f}^b$ using the CART algorithm, with one modification: at every split node, sample $m$ features uniformly without replacement from all $p$ features, then find the best split among only those $m$ features.

5.3 Aggregation

Classification (majority vote):

\hat{y}(\mathbf{x}) = \text{mode}\!\left(\hat{f}^1(\mathbf{x}),\, \hat{f}^2(\mathbf{x}),\, \ldots,\, \hat{f}^B(\mathbf{x})\right)

Soft vote (class probabilities): Each tree outputs a class probability vector. These are averaged, and the class with the highest mean probability is returned:

\hat{P}(y=k \mid \mathbf{x}) = \frac{1}{B} \sum_{b=1}^B \hat{P}^b(y=k \mid \mathbf{x})

Regression (averaging):

\hat{y}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^B \hat{f}^b(\mathbf{x})

5.4 Gini-Based Feature Importance (Mean Decrease Impurity)

The importance of feature $j$ in tree $b$ is the total Gini impurity decrease across all nodes $t$ where feature $j$ was used to split, weighted by the fraction of samples reaching each node:

\text{MDI}_j^b = \sum_{t:\, v(t)=j} \frac{n_t}{n} \!\left[G(t) - \frac{n_{t_L}}{n_t}G(t_L) - \frac{n_{t_R}}{n_t}G(t_R)\right]

The forest-level importance averages over all $B$ trees and normalises to sum to 1:

\text{MDI}_j = \frac{\frac{1}{B}\sum_{b=1}^B \text{MDI}_j^b}{\sum_{j'=1}^p \frac{1}{B}\sum_{b=1}^B \text{MDI}_{j'}^b}

6. Step-by-Step Worked Example

Let's build a tiny Random Forest by hand to see every step concretely. We have 8 training samples with two features (Weather, Wind) and one label (Play tennis?).

ID	Weather	Wind	Play?
1	Sunny	Weak	No
2	Sunny	Strong	No
3	Overcast	Weak	Yes
4	Rain	Weak	Yes
5	Rain	Weak	Yes
6	Rain	Strong	No
7	Overcast	Strong	Yes
8	Sunny	Weak	No

Step 1: Bootstrap Sample for Tree 1

Draw 8 samples with replacement. Suppose we get IDs: 1, 3, 3, 4, 5, 6, 7, 8 (ID 3 appears twice; ID 2 is excluded, so it is OOB for Tree 1).

Bootstrap set composition (counting from the original dataset labels): IDs 3, 3, 4, 5, 7 all have Play=Yes (5 samples); IDs 1, 6, 8 have Play=No (3 samples):

IDs in bootstrap	Play=Yes	Play=No
1,3,3,4,5,6,7,8	5 (3,3,4,5,7)	3 (1,6,8)

Root node Gini: $G = 1 - (5/8)^2 - (3/8)^2 = 1 - 25/64 - 9/64 = 30/64 \approx 0.469$

Step 2: Feature Sampling at the Root Split

We have $p=2$ features. With $m = \lfloor\sqrt{2}\rfloor = 1$, we randomly pick 1 feature to consider. Suppose we pick Weather.

Split on Weather:

Weather	Samples (bootstrap)	Yes	No	Gini
Sunny	1, 8	0	2	0.0
Overcast	3, 3, 7	3	0	0.0
Rain	4, 5, 6	2	1	$1-(2/3)^2-(1/3)^2 = 0.444$

Weighted Gini after splitting on Weather:

G_{\text{split}} = \frac{2}{8}(0) + \frac{3}{8}(0) + \frac{3}{8}(0.444) = 0.167

Gini reduction: $0.469 - 0.167 = 0.302$. Since we only had one feature to check, Weather is the split. Tree 1's root node splits on Weather.

Step 3: Grow Tree 1 Fully

After splitting on Weather, the Sunny branch (2 samples, both No) and Overcast branch (3 samples, all Yes) are already pure. No further splits are needed for those branches. The Rain branch (3 samples: Yes, Yes, No) gets another random feature draw and continues recursively.

Step 4: Repeat for Trees 2 through B

Each tree gets its own bootstrap sample and its own random feature draws at every node. For $B=3$ trees (a minimal forest), we end up with 3 different trees, each with slightly different structure.

Step 5: Predict a New Sample

New sample: Weather=Rain, Wind=Weak. Each tree votes independently:

Tree	Prediction
Tree 1	Yes
Tree 2	Yes
Tree 3	No

Majority vote: Yes (2 vs 1). Final prediction: Play tennis = Yes.

7. Out-of-Bag (OOB) Error

One of the most elegant properties of Random Forest is that it comes with a built-in validation estimate for free, without needing a separate validation set.

Because each tree is trained on a bootstrap sample (roughly 63.2% of the data), the remaining 36.8% are genuine unseen samples for that tree. These are the Out-of-Bag (OOB) samples. For each training sample $i$, we collect all trees whose bootstrap did not include sample $i$, and use only those trees to predict $y_i$:

\hat{y}_i^{\text{OOB}} = \text{mode}\!\left\{ \hat{f}^b(\mathbf{x}_i) : i \notin \mathcal{D}^{*b} \right\}

Aggregate across all training samples to get the OOB error:

\text{OOB error} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\!\left[\hat{y}_i^{\text{OOB}} \neq y_i\right]

Breiman showed that OOB error is an approximately unbiased estimate of the test error, roughly equivalent to leave-one-out cross-validation (the approximation tightens as $B$ grows), but at a tiny fraction of the cost. In scikit-learn, set oob_score=True to compute it automatically.

8. Feature Importance

Random Forest provides two methods to measure which features matter most.

8.1 Mean Decrease Impurity (MDI): built-in, fast

This is the Gini-based importance from Section 5.4: how much each feature reduces impurity across all splits in all trees. It is available via rf.feature_importances_.

Caveat: MDI is biased toward high-cardinality features (features with many unique values, especially continuous ones) because they offer more candidate split points. Do not trust MDI alone for feature selection.

8.2 Permutation Importance: unbiased, slower

Shuffle the values of feature $j$ in the validation set (or OOB set), breaking any relationship between that feature and the target. Measure how much the model's accuracy drops:

\text{PI}_j = \text{score}(\mathcal{D}_{\text{val}}) - \text{score}(\mathcal{D}_{\text{val}}^{\text{perm}_j})

A large drop means the model relied heavily on feature $j$. Features the model ignores show near-zero drops. Permutation importance is less biased than MDI and works on any estimator. Use sklearn.inspection.permutation_importance.

Bar chart of Random Forest feature importances on the Iris dataset — **Figure 4:** MDI feature importances from a Random Forest trained on the Iris dataset. Petal width and petal length dominate, consistent with how strongly they separate species. Source: scikit-learn documentation (BSD License)

9. Hyperparameter Guide

Parameter	scikit-learn name	Default	Effect	Tuning tip
Number of trees	`n_estimators`	100	More trees → lower variance, never overfits. Diminishing returns beyond ~200–500.	Start at 100; increase until OOB error plateaus. Time and memory cost scale linearly.
Features per split	`max_features`	`"sqrt"`	Lower → more decorrelated trees, higher bias per tree. Higher → more correlated trees, lower bias per tree.	Try `"sqrt"`, `"log2"`, and a few integer values. Most impactful for accuracy.
Max tree depth	`max_depth`	`None` (full)	Limiting depth reduces variance and training time. Very shallow trees increase bias.	Use `None` (full trees) when you have enough data. Set 5–15 for noisy datasets.
Min samples to split	`min_samples_split`	2	Higher → simpler trees, less overfitting.	Try 2, 5, 10. Especially useful for noisy tabular data.
Min samples in leaf	`min_samples_leaf`	1	Ensures leaf nodes have a minimum population; smooths predictions.	Set 1–5 for classification; higher (5–20) for regression.
Bootstrap sampling	`bootstrap`	`True`	`False` uses the full dataset for every tree (no OOB error available).	Keep `True` unless your dataset is very small.
OOB score	`oob_score`	`False`	Computes the free validation estimate using OOB samples.	Always set to `True`; it costs nothing extra.
Class weights	`class_weight`	`None`	Adjusts influence of each class during training.	Use `"balanced"` for imbalanced datasets.

10. Bias–Variance Tradeoff

Every model's generalisation error decomposes as:

\text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}

A single deep decision tree has low bias (it can fit any boundary) but high variance (it changes dramatically with the training set). Random Forest fixes the variance problem while preserving the low bias of deep trees:

Model	Bias	Variance	Notes
Shallow tree (max_depth=2)	High	Low	Underfits
Deep tree (full)	Low	High	Overfits
Random Forest	Low	Low	Averaging reduces variance
Gradient Boosting	Very low	Medium	Sequentially reduces bias too, but more prone to overfit

Adding more trees to a Random Forest never increases bias and only reduces variance (asymptotically). There is no overfitting problem from adding more trees; the only cost is time and memory.

11. Classification vs Regression

Classification

Each tree votes for a class label (hard vote) or outputs a probability vector (soft vote).
Split criterion: Gini impurity or entropy.
Final prediction: majority label or highest mean class probability.
scikit-learn: RandomForestClassifier

Regression

Each tree outputs a real-valued prediction.
Split criterion: Mean Squared Error (MSE) reduction or Mean Absolute Error (MAE) reduction.
Final prediction: mean of all tree predictions.
scikit-learn: RandomForestRegressor

The split criterion for regression uses MSE:

\text{MSE}(S) = \frac{1}{|S|}\sum_{i \in S}(y_i - \bar{y}_S)^2, \quad \bar{y}_S = \frac{1}{|S|}\sum_{i \in S} y_i

The best split minimises the weighted MSE of the two children, identical in structure to the Gini criterion for classification.

12. Pros, Cons, and When to Use It

Advantages

Strong out-of-the-box performance on tabular data. It often beats logistic regression and single trees without any tuning.
Handles mixed feature types. Continuous and ordinal features work directly. Categorical features must be numerically encoded first (e.g. OrdinalEncoder or OneHotEncoder) because scikit-learn's implementation requires numeric inputs.
Robust to outliers and noisy features. Averaging across many trees dilutes the effect of any single bad split.
Built-in feature importance. MDI scores give a fast, interpretable signal about which features matter most.
Free validation estimate (OOB). You do not need to set aside a separate validation set.
No feature scaling required. Tree-based splits are invariant to monotone transformations of features.
Parallelisable. Each tree is trained independently, so training scales linearly across cores using n_jobs=-1.
Hard to overfit with more trees. Adding more n_estimators only reduces variance and never increases it.

Disadvantages

Black box. Individual trees are interpretable, but a forest of 500 trees is not. Use SHAP values for post-hoc explanations.
Slow to predict at scale. Each prediction must traverse all $B$ trees. For latency-critical deployments, consider model distillation.
Memory intensive. The model stores all $B$ full trees in memory.
MDI importance is biased. High-cardinality features are systematically overrated. Use permutation importance or SHAP for reliable feature selection.
Cannot extrapolate. Like all tree-based methods, Random Forest cannot predict beyond the range it saw during training. If the largest target value in training was $y_{\max}$, it will never predict above $y_{\max}$.
Not ideal for very high-dimensional sparse data. On text or NLP features, gradient boosting or linear models often win.

When to choose Random Forest

Situation	Recommendation
Tabular data, mixed types, quick strong baseline needed	Great choice
Dataset has noisy features or outliers	Very robust
Need built-in feature importance	Use MDI + permutation importance
Maximum accuracy on structured data, tuning budget available	Consider XGBoost / LightGBM instead
Image, audio, or text data	Use CNNs / Transformers
Very large dataset (>10M rows), latency-critical	Consider LightGBM for speed
Interpretability is a hard requirement	Use a shallow tree or logistic regression + SHAP

13. Python Implementation

13.1 Building a Random Forest from Scratch

This section builds a working Random Forest one piece at a time. Each block is explained before the code so you know what it does and why it is written that way.

Part 1: Represent a tree node

A decision tree is a network of nodes. An internal node holds a feature index and a threshold value: samples where feature <= threshold go left; all others go right. A leaf node holds the final predicted class label and has no children. The is_leaf() method lets the prediction function know when to stop traversing and return an answer.


import numpy as np
from collections import Counter

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature   = feature     # which column to split on
        self.threshold = threshold   # split at this value
        self.left      = left        # subtree for samples where feature <= threshold
        self.right     = right       # subtree for samples where feature > threshold
        self.value     = value       # class label; only set on leaf nodes

    def is_leaf(self):
        return self.value is not None

Part 2: Measure node purity with Gini impurity

Before we can split, we need a way to measure how mixed the labels are at a given node. Gini impurity does this job. It squares the proportion of each class and subtracts from 1. The result is 0 when all samples share the same label (perfectly pure), and rises toward 0.5 as the mix becomes more even.

For example, a node with 6 positives and 4 negatives gives: 1 - (0.6² + 0.4²) = 0.48.


def gini(y):
    counts = Counter(y)
    n = len(y)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

Part 3: Find the best split at a node

At each node we want the split that reduces impurity the most. This function samples max_features columns at random (the feature-randomness step that decorrelates trees), then tries every unique value in each selected column as a candidate threshold. For each candidate it calculates the Gini gain: the impurity of the parent node minus the weighted average impurity of the two children. The threshold that gives the highest gain is returned.


def best_split(X, y, max_features):
    n_features = X.shape[1]
    # Pick m features at random (feature randomness step that decorrelates trees)
    feature_indices = np.random.choice(n_features, size=max_features, replace=False)

    best_gain, best_feat, best_thresh = -1, None, None

    for feat in feature_indices:
        for thresh in np.unique(X[:, feat]):
            left_mask  = X[:, feat] <= thresh
            right_mask = ~left_mask
            if left_mask.sum() == 0 or right_mask.sum() == 0:
                continue  # skip a threshold that doesn't actually divide the data
            n, n_l, n_r = len(y), left_mask.sum(), right_mask.sum()
            gain = gini(y) - (n_l / n) * gini(y[left_mask]) - (n_r / n) * gini(y[right_mask])
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feat, thresh

    return best_feat, best_thresh

Part 4: Grow one complete tree

grow_tree calls best_split at each node and recurses into both child branches until a stopping condition is met. A node becomes a leaf in three situations: all its samples share the same label (the node is already pure), the tree has reached its maximum allowed depth, or no valid split could be found. The depth argument is incremented on every recursive call so the tree knows how deep it has gone.


def grow_tree(X, y, max_features, max_depth, depth=0):
    # Become a leaf if the data is pure, we hit max depth, or no split exists
    if len(set(y)) == 1 or depth == max_depth:
        return Node(value=Counter(y).most_common(1)[0][0])

    feat, thresh = best_split(X, y, max_features)
    if feat is None:
        return Node(value=Counter(y).most_common(1)[0][0])

    mask  = X[:, feat] <= thresh
    left  = grow_tree(X[mask],  y[mask],  max_features, max_depth, depth + 1)
    right = grow_tree(X[~mask], y[~mask], max_features, max_depth, depth + 1)
    return Node(feature=feat, threshold=thresh, left=left, right=right)

Part 5: Predict with one tree

To predict a single sample, start at the root and walk down the tree. At each internal node, check whether the sample's value for the split feature is less than or equal to the threshold. If yes, go left; if no, go right. Keep going until you reach a leaf, then return its stored class label.


def predict_tree(node, x):
    if node.is_leaf():
        return node.value
    if x[node.feature] <= node.threshold:
        return predict_tree(node.left, x)
    return predict_tree(node.right, x)

Part 6: Assemble the forest

fit loops n_estimators times. Each iteration draws a bootstrap sample (n random row indices with replacement), then calls grow_tree on that sample. Because each tree sees a different bootstrap sample and different random feature subsets at each split, the trees end up different from one another. predict runs every tree on the input, collects all the votes into a matrix, and returns the majority class for each row.


class RandomForestScratch:
    def __init__(self, n_estimators=100, max_features="sqrt", max_depth=None):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.max_depth    = max_depth or 999
        self.trees        = []

    def fit(self, X, y):
        n = len(y)
        m = int(np.sqrt(X.shape[1])) if self.max_features == "sqrt" else self.max_features
        for _ in range(self.n_estimators):
            idx  = np.random.choice(n, n, replace=True)  # bootstrap: sample with replacement
            tree = grow_tree(X[idx], y[idx], m, self.max_depth)
            self.trees.append(tree)

    def predict(self, X):
        # votes[i][b] = prediction of tree b for sample i
        votes = np.array([[predict_tree(t, x) for t in self.trees] for x in X])
        # For each sample, return whichever class received the most votes
        return np.array([Counter(row).most_common(1)[0][0] for row in votes])

Part 7: Test on the Iris dataset

We test on the Iris dataset: 150 flower samples, 4 numeric features (sepal and petal length and width), and 3 species to predict. We hold out 20% as the test set. With 50 trees and max depth 8, this scratch forest typically reaches 95 to 100% accuracy on the held-out samples.


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestScratch(n_estimators=50, max_depth=8)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f"Scratch RF accuracy: {accuracy_score(y_test, y_pred):.3f}")
# Expected output: Scratch RF accuracy: 0.967  (exact value varies with random state)

13.2 Using scikit-learn for Real Projects

For any real project, use scikit-learn's implementation. It is optimised, parallelised, and handles edge cases that the scratch version ignores. The example below uses the Breast Cancer Wisconsin dataset: 569 samples, 30 numeric features, and a binary label (malignant vs benign). Each step is shown separately so you can see what it produces.

Step 1: Load and split the data

We split 80% for training and 20% for testing. The stratify=y argument ensures both classes appear in the same proportion in both splits, which matters when one class is rarer than the other.


from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}   Test: {X_test.shape}")
# Output: Train: (455, 30)   Test: (114, 30)

Step 2: Train the forest

We build 200 trees. With 30 features, max_features="sqrt" means each split considers approximately 5 randomly chosen features. Setting oob_score=True computes a free accuracy estimate from the out-of-bag samples during training, so we can check model quality without touching the test set at all. n_jobs=-1 uses every available CPU core.


rf = RandomForestClassifier(
    n_estimators=200,      # build 200 independent trees
    max_features="sqrt",   # at each split, consider sqrt(30) ~ 5 features
    max_depth=None,        # grow each tree until all leaves are pure
    min_samples_leaf=1,
    oob_score=True,        # estimate accuracy from out-of-bag samples for free
    n_jobs=-1,             # use all CPU cores in parallel
    random_state=42
)
rf.fit(X_train, y_train)

print(f"OOB accuracy (no test data used): {rf.oob_score_:.4f}")
print(f"Test accuracy:                    {rf.score(X_test, y_test):.4f}")
# Typical output:
# OOB accuracy (no test data used): 0.9648
# Test accuracy:                    0.9649

Step 3: Read the classification report

The classification report gives per-class detail. Precision answers: of all samples the model predicted as class X, how many actually were X? Recall answers: of all samples that truly belong to class X, how many did the model correctly identify? F1 is the harmonic mean of precision and recall and summarises both in one number.


print(classification_report(y_test, rf.predict(X_test), target_names=data.target_names))
# Example output:
#               precision    recall  f1-score   support
#    malignant       0.95      0.95      0.95        42
#       benign       0.97      0.97      0.97        72
#     accuracy                           0.96       114

Step 4: Cross-validation for a reliable accuracy estimate

A single train/test split can be lucky or unlucky depending on which samples end up where. Five-fold cross-validation trains and evaluates the model five times on different non-overlapping folds and reports the mean and spread. A tight standard deviation (small number after the +/-) means the score is stable across different data splits.


cv_scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print(f"5-fold CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
# Typical output: 5-fold CV accuracy: 0.9665 +/- 0.0097

Step 5: MDI feature importance

rf.feature_importances_ holds the Mean Decrease Impurity score for each of the 30 features. We sort them from highest to lowest and plot the top 10. A high score means that feature was used frequently at high-impact splits across all 200 trees. Note that MDI tends to overrate continuous features with many unique values, so treat this chart as a rough guide rather than a definitive ranking.


mdi_importance = rf.feature_importances_
sorted_idx = np.argsort(mdi_importance)[::-1][:10]

plt.figure(figsize=(10, 5))
plt.bar(range(10), mdi_importance[sorted_idx])
plt.xticks(range(10), feature_names[sorted_idx], rotation=45, ha="right")
plt.title("Top 10 Features by MDI Importance")
plt.tight_layout()
plt.show()
# Top features are typically: worst radius, worst perimeter, worst area

Step 6: Permutation importance (more reliable)

Permutation importance works differently from MDI. It takes the trained model and the test set, then randomly shuffles the values of one feature at a time, breaking any relationship between that feature and the target. It measures how much accuracy drops. A large drop means the model genuinely relied on that feature. A near-zero drop means the model can manage without it. We repeat the shuffle 10 times per feature and display the spread as a boxplot, which gives a sense of how stable each importance score is.


perm_result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)
perm_idx = np.argsort(perm_result.importances_mean)[::-1][:10]

plt.figure(figsize=(10, 5))
plt.boxplot(perm_result.importances[perm_idx].T, vert=True, patch_artist=True)
plt.xticks(range(1, 11), feature_names[perm_idx], rotation=45, ha="right")
plt.title("Top 10 Features by Permutation Importance")
plt.tight_layout()
plt.show()

Step 7: Confusion matrix

The confusion matrix shows the full breakdown of predictions. Rows represent the true class; columns represent the predicted class. Numbers on the diagonal are correct predictions. Numbers off the diagonal are errors, and their position tells you exactly which class was confused with which. A well-trained model shows high numbers on the diagonal and values close to zero everywhere else.


ConfusionMatrixDisplay.from_estimator(
    rf, X_test, y_test, display_labels=data.target_names
)
plt.title("Random Forest Confusion Matrix")
plt.show()

13.3 Hyperparameter Tuning with GridSearchCV

Once you have a working model, the next step is finding the best combination of hyperparameters. GridSearchCV tries every combination of values you specify and evaluates each with cross-validation. It then tells you which combination performed best.

The grid below has 3 values of n_estimators, 3 of max_features, 3 of max_depth, and 3 of min_samples_leaf. That is 3 × 3 × 3 × 3 = 81 combinations, each evaluated with 5-fold CV, meaning 405 model fits in total. On the breast cancer dataset this runs in a few minutes.


from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators":     [100, 200, 500],      # more trees = more stable but slower
    "max_features":     ["sqrt", "log2", 0.3], # 0.3 means use 30% of features per split
    "max_depth":        [None, 10, 20],        # None = grow until pure
    "min_samples_leaf": [1, 3, 5],             # higher = simpler trees, less overfitting
}

grid_search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1  # prints progress so you can see it is running
)
grid_search.fit(X_train, y_train)

print("Best hyperparameters:", grid_search.best_params_)
print("Best CV accuracy:    ", round(grid_search.best_score_, 4))
print("Test accuracy:       ", round(grid_search.best_estimator_.score(X_test, y_test), 4))
# Example output:
# Best hyperparameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'n_estimators': 200}
# Best CV accuracy:     0.967
# Test accuracy:        0.9649

For large datasets where 400+ fits would take too long, replace GridSearchCV with RandomizedSearchCV. Instead of trying all combinations, it samples a fixed number at random, giving a good result in a fraction of the time.

14. Key Takeaways

Wisdom of crowds: Many imperfect, decorrelated trees outperform any single tree because their errors cancel out.
Two sources of randomness: (1) bootstrap sampling of rows, (2) random feature subsets at each split. Both are needed. Bootstrapping alone (bagging) leaves trees too correlated.
Variance falls, bias stays. Adding more trees only helps. There is no overfitting risk from increasing n_estimators.
OOB error is a free, approximately unbiased test estimate. Always enable it with oob_score=True.
MDI importance is fast but biased. Use permutation importance or SHAP for reliable feature selection.
No feature scaling needed. Works with missing values (via imputation) and mixed numeric types.
Cannot extrapolate beyond the training range. This is a known weakness for regression on out-of-distribution inputs.

References

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. The original paper introducing Random Forest.
Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140. Introduces bootstrap aggregating.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), Ch. 15. Springer. Rigorous mathematical treatment of Random Forests.
Louppe, G. (2014). Understanding Random Forests: From Theory to Practice. PhD thesis, University of Liège. Definitive analysis of feature importance and variance reduction.
Strobl, C. et al. (2007). Bias in random forest variable importance measures. BMC Bioinformatics, 8, 25. Documents MDI bias toward high-cardinality features.
scikit-learn: Random Forests
Wikipedia: Random Forest

K-Nearest Neighbors (KNN) — Part 1: Classification

KNN is simple enough to fit in one equation but deceptively tricky...

A Beginner's Guide to Elastic Net Regression (L1 + L2 Regularization)

Elastic Net blends L1 and L2 penalties to get Lasso's feature selection...

Algorithm	Bootstrap samples	Feature subsampling at each split
Single Decision Tree	No	All \(p\) features
Bagging	Yes	All \(p\) features
Random Forest	Yes	Random \(m \ll p\) features