Machine-learning · May 28, 2026

XGBoost: A Complete Beginner's Guide

From gradient boosting foundations to Kaggle-winning tricks — intuition, maths, code, and everything in between

by Perivitta 35 mins read Advanced

Introduction

XGBoost builds decision trees one at a time in sequence — each new tree trained to correct the remaining errors of the ensemble so far — while a regularised, second-order Taylor objective keeps the learning mathematically principled and prevents overfitting.


1. Why XGBoost Became the Algorithm Everyone Uses

XGBoost was first released in 2014. In the original paper, Tianqi Chen (its creator) and Carlos Guestrin report that 17 of the 29 challenge-winning solutions published on Kaggle's blog during 2015 used XGBoost. It went on to dominate competitive machine learning for the better part of a decade. A 2022 survey found XGBoost or one of its successors (LightGBM, CatBoost) in the top solution of over 60% of tabular-data competitions. For a single algorithm released as a PhD side project, that is an extraordinary record.

The reason is not magic. It is a set of specific, mathematically principled improvements over vanilla gradient boosting — each targeting a different weakness — combined in a single, well-engineered, production-ready library. This guide explains every improvement from first principles, so you understand which knob matters and why, instead of tuning blindly.

2. Sequential vs Parallel: The Fundamental Difference

Random Forest and XGBoost are both tree ensembles, but they attack the bias-variance tradeoff in opposite directions.

| Property | Random Forest | XGBoost / Gradient Boosting |
|---|---|---|
| Tree construction | Parallel: trees are independent | Sequential: each tree sees the residuals of all previous trees |
| What each tree targets | The original labels \(y\) | The remaining error not yet explained by the ensemble |
| Primary benefit | Reduces variance by averaging many diverse trees | Reduces bias by iteratively fitting what previous trees missed |
| Overfitting risk | Low: more trees never hurts | Higher: too many trees memorises training noise |
| Hyperparameter sensitivity | Low: robust to defaults | Higher: learning rate, depth, and regularisation all matter |
Figure 1: XGBoost's sequential pipeline. Starting from the base prediction ŷ₀ = mean(y), each tree targets the residuals left by the ensemble so far, and the final prediction is ŷ = ŷ₀ + η·Σₖ fₖ(x) over k = 1…K trees. The learning rate η scales each tree's contribution, preventing any single tree from over-correcting.

3. Gradient Boosting: The Foundation

XGBoost is a highly optimised implementation of gradient boosting, introduced by Jerome Friedman in 2001. Understanding gradient boosting first makes the XGBoost-specific additions easy to appreciate.

3.1 The Additive Model

A gradient boosting model is built as a sum of \(K\) trees:

\[ \hat{y}_i = F(\mathbf{x}_i) = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F} \]

where \(\mathcal{F}\) is the space of all regression trees and each \(f_k\) maps a sample to a leaf value. We add trees one at a time. At step \(t\):

\[ \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i) \]

The task at each step is to find the tree \(f_t\) that most reduces the total loss.

3.2 Gradient Descent in Function Space

Regular gradient descent updates model parameters in the direction of the negative gradient of the loss. Gradient boosting does the same thing, but instead of updating parameters, it adds a new function (tree) that points in the negative gradient direction.

For Mean Squared Error (MSE) loss \(\ell = \frac{1}{2}(y_i - \hat{y}_i)^2\), the negative gradient with respect to the prediction is:

\[ -\frac{\partial \ell}{\partial \hat{y}_i} = y_i - \hat{y}_i \quad \text{(the residual)} \]

So vanilla gradient boosting for regression literally fits each new tree to the residuals. This is why gradient boosting and "iterative residual fitting" are the same thing for MSE.

The key insight: By working with gradients instead of raw residuals, the same framework works for any differentiable loss — not just MSE. This is what makes gradient boosting a general-purpose algorithm.
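To make the residual-fitting idea concrete, here is a minimal pure-Python sketch of vanilla gradient boosting for MSE, using depth-1 trees (stumps) on a single feature. The helper names (`fit_stump`, `predict_stump`, `boost`) and the toy data are illustrative assumptions, not part of any library, and the sketch leaves out everything XGBoost adds on top.

```python
def fit_stump(xs, targets):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2  # candidate threshold between two distinct feature values
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]


def predict_stump(stump, x):
    thr, lv, rv = stump
    return lv if x <= thr else rv


def boost(xs, ys, n_rounds=50, eta=0.3):
    """Additive model: start from the mean, then repeatedly fit stumps to residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For MSE the negative gradient is exactly the residual y - y_hat.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + eta * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, preds


xs = [1, 2, 3, 4]
ys = [2, 3, 7, 8]
base, stumps, preds = boost(xs, ys)
mse_start = sum((y - base) ** 2 for y in ys) / len(ys)
mse_end = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
print(f"MSE before boosting: {mse_start:.4f}, after: {mse_end:.4f}")
```

Swapping the `residuals` line for the negative gradient of another differentiable loss is all it takes to reuse the same loop for a different objective.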

3.3 Gradients and Hessians for Common Loss Functions

| Loss function | Problem | First derivative \(g_i\) | Second derivative \(h_i\) |
|---|---|---|---|
| MSE: \(\frac{1}{2}(y_i-\hat{y}_i)^2\) | Regression | \(\hat{y}_i - y_i\) | \(1\) |
| Log loss (cross-entropy) | Binary classification | \(\hat{p}_i - y_i\) | \(\hat{p}_i(1-\hat{p}_i)\) |
| Softmax cross-entropy | Multi-class | \(\hat{p}_{ik} - \mathbf{1}[y_i=k]\) | \(\hat{p}_{ik}(1-\hat{p}_{ik})\) |
| Pseudo-Huber | Robust regression | Smooth approximation to sign | Bounded curvature |

For log loss, \(\hat{p}_i = \sigma(\hat{y}_i)\) is the sigmoid of the raw model output (logit), and \(y_i \in \{0, 1\}\).
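The first two rows of the table can be checked numerically. The sketch below, which assumes nothing beyond the formulas above, implements \(g_i\) and \(h_i\) for MSE and log loss and compares the log-loss values against central finite differences. The function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_grad_hess(y, y_hat):
    """MSE with the 1/2 factor: g = y_hat - y, h = 1."""
    return y_hat - y, 1.0


def logloss(y, raw):
    """Binary log loss as a function of the raw score (logit)."""
    p = sigmoid(raw)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def logloss_grad_hess(y, raw):
    """Log loss: g = p - y, h = p * (1 - p), with p = sigmoid(raw)."""
    p = sigmoid(raw)
    return p - y, p * (1.0 - p)


def numeric_grad_hess(loss, y, z, eps=1e-4):
    """Central finite differences for the first and second derivative."""
    g = (loss(y, z + eps) - loss(y, z - eps)) / (2 * eps)
    h = (loss(y, z + eps) - 2 * loss(y, z) + loss(y, z - eps)) / eps ** 2
    return g, h


g, h = logloss_grad_hess(1, 0.3)
g_num, h_num = numeric_grad_hess(logloss, 1, 0.3)
print(f"analytic g={g:.6f}, h={h:.6f}")
print(f"numeric  g={g_num:.6f}, h={h_num:.6f}")
```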


4. What XGBoost Adds to Gradient Boosting

The original XGBoost paper by Tianqi Chen and Carlos Guestrin (KDD 2016) added six major improvements to vanilla gradient boosting:

  1. Regularised objective — penalises the number of leaves and the magnitude of leaf weights directly in the loss function, not just through hyperparameters.
  2. Second-order Taylor approximation — uses both gradient \(g_i\) and hessian \(h_i\), giving more curvature information and enabling closed-form solutions.
  3. Optimal leaf weights — derived analytically (no line search needed).
  4. Exact split gain formula — evaluates every candidate split using the objective directly.
  5. Column and row subsampling — reduces variance and speeds training, similar to Random Forest.
  6. Sparsity-aware split finding — efficiently handles missing values and sparse features by learning a default direction for each split.

These additions make XGBoost faster, more accurate, and more regularised than scikit-learn's GradientBoostingClassifier, which implements vanilla gradient boosting.
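The sparsity-aware idea in item 6 can be sketched in a few lines: for a candidate threshold, try sending all missing values to the left child, then to the right child, and keep whichever direction gives the higher gain. This is a simplified illustration that uses the gain formula derived in Section 5.5; the function names and the toy data are assumptions, and the real implementation is considerably more optimised.

```python
def score(G, H, lam):
    """Structure-score contribution of one node: G^2 / (H + lambda)."""
    return G * G / (H + lam)


def gain_with_default(xs, gs, hs, threshold, lam=1.0, gamma=0.0):
    """Return (gain, default_direction) for one threshold; None marks a missing value."""
    GL = HL = GR = HR = Gm = Hm = 0.0
    for x, g, h in zip(xs, gs, hs):
        if x is None:            # missing: accumulate separately
            Gm += g
            Hm += h
        elif x <= threshold:
            GL += g
            HL += h
        else:
            GR += g
            HR += h
    parent = score(GL + GR + Gm, HL + HR + Hm, lam)
    # Option 1: missing values follow the left branch.
    gain_left = 0.5 * (score(GL + Gm, HL + Hm, lam) + score(GR, HR, lam) - parent) - gamma
    # Option 2: missing values follow the right branch.
    gain_right = 0.5 * (score(GL, HL, lam) + score(GR + Gm, HR + Hm, lam) - parent) - gamma
    if gain_left >= gain_right:
        return gain_left, "left"
    return gain_right, "right"


# Toy data: the third sample has a missing feature and a gradient that
# resembles the left-hand group, so "left" should be the learned default.
xs = [1, 2, None, 3, 4]
gs = [3.0, 2.0, 2.5, -2.0, -3.0]
hs = [1.0] * 5
gain, direction = gain_with_default(xs, gs, hs, threshold=2.5)
print(direction, round(gain, 4))
```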


5. The Complete Mathematics

5.1 The Regularised Objective

The full objective XGBoost minimises at step \(t\) is:

\[ \mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\!\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t) \]

where the regularisation term penalises tree complexity:

\[ \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \]
  • \(T\) — number of leaves in the tree. \(\gamma\) (gamma) is the minimum gain required to create a new leaf: if no split improves the objective by at least \(\gamma\), no split is made.
  • \(w_j\) — the predicted value (weight) at leaf \(j\). \(\lambda\) (lambda) is the L2 regularisation coefficient that shrinks leaf weights toward zero.

5.2 Second-Order Taylor Approximation

Optimising \(\mathcal{L}^{(t)}\) directly is hard because \(f_t\) appears inside a general loss function. XGBoost approximates the loss around the current prediction \(\hat{y}_i^{(t-1)}\) using a second-order Taylor expansion:

\[ \ell\!\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) \approx \ell\!\left(y_i,\, \hat{y}_i^{(t-1)}\right) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t(\mathbf{x}_i)^2 \]

where:

\[ g_i = \frac{\partial\, \ell(y_i,\, \hat{y}_i^{(t-1)})}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 \ell(y_i,\, \hat{y}_i^{(t-1)})}{\partial\, (\hat{y}_i^{(t-1)})^2} \]

Dropping the constant term \(\ell(y_i, \hat{y}_i^{(t-1)})\) (it does not depend on \(f_t\)) and substituting \(\Omega\), the simplified objective becomes:

\[ \tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t(\mathbf{x}_i)^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \]

Because each sample \(i\) lands in exactly one leaf \(j\), we can group by leaf. Let \(I_j = \{i : \mathbf{x}_i \text{ falls in leaf } j\}\), and define the per-leaf gradient and hessian sums:

\[ G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \]

The objective separates into independent leaf terms:

\[ \tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T \]
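How good is the second-order approximation in practice? As a quick sanity check, the sketch below compares the exact log loss after a step \(f_t(\mathbf{x}_i)\) with its Taylor approximation for a single sample. The sample values (\(y = 1\), raw score 0) and the step sizes are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logloss(y, raw):
    p = sigmoid(raw)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


y, raw = 1, 0.0                 # one positive sample, current raw prediction 0
p = sigmoid(raw)
g, h = p - y, p * (1 - p)       # gradient and hessian at the current prediction


def taylor(step):
    """Second-order expansion of the loss around the current prediction."""
    return logloss(y, raw) + g * step + 0.5 * h * step ** 2


errors = {}
for step in (0.1, 0.5, 1.0):
    exact = logloss(y, raw + step)
    approx = taylor(step)
    errors[step] = abs(exact - approx)
    print(f"step={step}: exact={exact:.6f}  taylor={approx:.6f}  error={errors[step]:.2e}")
```

The approximation is tight for small steps and degrades as the step grows, which is one reason a small learning rate pairs well with this objective.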

5.3 Optimal Leaf Weight

Each leaf's term is a simple quadratic in \(w_j\). Differentiating and setting to zero gives the closed-form optimal leaf weight:

\[ \boxed{w_j^* = -\frac{G_j}{H_j + \lambda}} \]

Intuition: The numerator \(G_j\) is the total gradient — how much the prediction needs to move. The denominator \(H_j + \lambda\) is the curvature of the loss plus regularisation — it limits how aggressively to move. Higher \(\lambda\) means smaller leaf weights (more conservative updates).
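A small numerical check of the closed form, assuming illustrative values \(G_j = 5\) and \(H_j = 2\): the weight returned by the formula should give a lower per-leaf objective than nearby weights, and its magnitude should shrink as \(\lambda\) grows. The function names are made up for this sketch.

```python
def leaf_objective(w, G, H, lam):
    """Per-leaf term of the approximate objective: G*w + 0.5*(H + lam)*w^2."""
    return G * w + 0.5 * (H + lam) * w * w


def optimal_weight(G, H, lam):
    """Closed-form minimiser w* = -G / (H + lam)."""
    return -G / (H + lam)


G, H = 5.0, 2.0  # illustrative gradient and hessian sums for one leaf
weights = {lam: optimal_weight(G, H, lam) for lam in (0.0, 1.0, 10.0)}
for lam, w in weights.items():
    obj = leaf_objective(w, G, H, lam)
    print(f"lambda={lam:>4}: w* = {w:+.4f}, objective = {obj:.4f}")
```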

5.4 Tree Structure Score

Substituting \(w_j^*\) back into the objective gives the best achievable loss for a given tree structure \(q\):

\[ \tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \]

This is the tree structure score. A lower score means a better tree. We want to find the structure \(q\) that minimises it.
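The structure score is easy to compute once each leaf is summarised by its \((G_j, H_j)\) pair. The sketch below evaluates the formula and confirms that it matches the objective evaluated at the optimal weights. The two example trees (a single leaf versus a two-leaf split) are hypothetical inputs chosen for illustration.

```python
def structure_score(leaves, lam=1.0, gamma=0.0):
    """leaves: list of (G_j, H_j). Returns -1/2 * sum(G^2 / (H + lam)) + gamma * T."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)


def objective_at_optimum(leaves, lam=1.0, gamma=0.0):
    """Plug w* = -G / (H + lam) into the per-leaf quadratic and add the leaf penalty."""
    total = 0.0
    for G, H in leaves:
        w = -G / (H + lam)
        total += G * w + 0.5 * (H + lam) * w * w
    return total + gamma * len(leaves)


single_leaf = [(0.0, 4.0)]                 # all samples in one leaf
two_leaves = [(5.0, 2.0), (-5.0, 2.0)]     # same samples split into two leaves

print(structure_score(single_leaf), structure_score(two_leaves))
# With a large leaf penalty the extra leaf no longer pays for itself.
print(structure_score(single_leaf, gamma=10.0), structure_score(two_leaves, gamma=10.0))
```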

5.5 Split Gain Formula

We cannot enumerate all possible tree structures, so XGBoost builds trees greedily — one split at a time. The gain from splitting a node containing samples \(I\) into left child \(I_L\) and right child \(I_R\) is:

\[ \boxed{\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{G^2}{H+\lambda}\right] - \gamma} \]

The three bracketed terms are the structure scores of the left child, right child, and parent respectively. The gain is the improvement in objective from the split, minus the leaf-count penalty \(\gamma\). If Gain \(\leq 0\), the split is rejected — it does not help enough to justify the added complexity.
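As a rough sketch of the greedy search, the code below scans one feature in sorted order, keeps running sums \(G_L\) and \(H_L\), and evaluates the gain at every boundary between distinct values. It is a simplified take on exact split finding on a single feature; `split_gain`, `best_split`, and the toy gradients are illustrative names and values, not library API.

```python
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain = 1/2 * [score_L + score_R - score_parent] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


def best_split(xs, gs, hs, lam=1.0, gamma=0.0):
    """Return (gain, threshold) of the best split, or (0.0, None) if none has Gain > 0."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    G, H = sum(gs), sum(hs)
    GL = HL = 0.0
    best = (0.0, None)
    for pos in range(len(order) - 1):
        i = order[pos]
        GL += gs[i]
        HL += hs[i]
        nxt = xs[order[pos + 1]]
        if xs[i] == nxt:
            continue  # cannot split between identical feature values
        gain = split_gain(GL, HL, G - GL, H - HL, lam, gamma)
        if gain > best[0]:
            best = (gain, (xs[i] + nxt) / 2)
    return best


xs = [0.5, 1.5, 2.5, 3.5, 4.5]
gs = [-1.0, -1.2, 0.9, 1.1, 1.0]
hs = [1.0] * 5
gain, threshold = best_split(xs, gs, hs)
print(f"best threshold: {threshold}, gain: {gain:.4f}")
print("with gamma=2.0:", best_split(xs, gs, hs, gamma=2.0))
```

Raising `gamma` far enough makes every candidate gain non-positive, so the node stays a leaf.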

Figure 2: XGBoost evaluates every candidate split by computing the gain: the improvement in the structure score (Score_L + Score_R − Score_parent), halved and then penalised by γ. Each node's score is G²/(H+λ), with G = G_L + G_R for the parent. If Gain ≤ 0, the split is rejected.

6. Step-by-Step Worked Example

Let us build one XGBoost tree by hand on a tiny regression dataset. We use MSE loss (\(g_i = \hat{y}_i - y_i\), \(h_i = 1\)), learning rate \(\eta = 0.3\), \(\lambda = 1\), and \(\gamma = 0\).

Dataset

| Sample \(i\) | Feature \(x\) | Target \(y\) |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 2 | 3 |
| 3 | 3 | 7 |
| 4 | 4 | 8 |

Step 1: Base Prediction

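XGBoost starts from a constant base score. Here we use the mean of the targets, which is the best constant prediction under squared error (recent releases estimate the base score from the data, while older ones defaulted to base_score = 0.5):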

\[ \hat{y}_i^{(0)} = \bar{y} = \frac{2+3+7+8}{4} = 5 \quad \text{for all } i \]

Step 2: Compute Gradients and Hessians

For MSE, \(g_i = \hat{y}_i - y_i\) and \(h_i = 1\):

| \(i\) | \(\hat{y}_i^{(0)}\) | \(y_i\) | \(g_i\) | \(h_i\) |
|---|---|---|---|---|
| 1 | 5 | 2 | +3 | 1 |
| 2 | 5 | 3 | +2 | 1 |
| 3 | 5 | 7 | −2 | 1 |
| 4 | 5 | 8 | −3 | 1 |

Step 3: Find the Best Split

We have one feature \(x\) with three candidate thresholds: \(x \leq 1\), \(x \leq 2\), and \(x \leq 3\).

Candidate: \(x \leq 1\)

Left: \(\{1\}\) — \(G_L = 3\), \(H_L = 1\)  |  Right: \(\{2,3,4\}\) — \(G_R = -3\), \(H_R = 3\)

\[ \text{Gain} = \frac{1}{2}\!\left[\frac{3^2}{1+1} + \frac{(-3)^2}{3+1} - \frac{0^2}{4+1}\right] = \frac{1}{2}\!\left[\frac{9}{2} + \frac{9}{4}\right] = \frac{1}{2}(4.5 + 2.25) = 3.375 \]

Candidate: \(x \leq 2\)

Left: \(\{1,2\}\) — \(G_L = 5\), \(H_L = 2\)  |  Right: \(\{3,4\}\) — \(G_R = -5\), \(H_R = 2\)

\[ \text{Gain} = \frac{1}{2}\!\left[\frac{5^2}{2+1} + \frac{(-5)^2}{2+1} - \frac{0^2}{4+1}\right] = \frac{1}{2}\!\left[\frac{25}{3} + \frac{25}{3}\right] = \frac{25}{3} \approx 8.333 \quad \checkmark \text{ best} \]

Candidate: \(x \leq 3\)

Left: \(\{1,2,3\}\) — \(G_L = 3\), \(H_L = 3\)  |  Right: \(\{4\}\) — \(G_R = -3\), \(H_R = 1\)

\[ \text{Gain} = \frac{1}{2}\!\left[\frac{9}{4} + \frac{9}{2}\right] = \frac{1}{2}(2.25 + 4.5) = 3.375 \]

Best split: \(x \leq 2\) with Gain = 8.333.

Step 4: Compute Optimal Leaf Weights

\[ w_L^* = -\frac{G_L}{H_L + \lambda} = -\frac{5}{2+1} = -\frac{5}{3} \approx -1.667 \] \[ w_R^* = -\frac{G_R}{H_R + \lambda} = -\frac{-5}{2+1} = \frac{5}{3} \approx +1.667 \]

Step 5: Update Predictions

Apply the new tree's leaf values scaled by \(\eta = 0.3\):

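| \(i\) | \(x\) | Leaf | \(w_j^*\) | \(\hat{y}_i^{(1)} = 5 + 0.3 \times w_j^*\) | True \(y_i\) | Residual \(y_i - \hat{y}_i^{(1)}\) |
|---|---|---|---|---|---|---|
| 1 | 1 | Left | −1.667 | 5 − 0.5 = 4.5 | 2 | −2.5 |
| 2 | 2 | Left | −1.667 | 5 − 0.5 = 4.5 | 3 | −1.5 |
| 3 | 3 | Right | +1.667 | 5 + 0.5 = 5.5 | 7 | +1.5 |
| 4 | 4 | Right | +1.667 | 5 + 0.5 = 5.5 | 8 | +2.5 |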

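The residuals \(y_i - \hat{y}_i\) have shrunk in magnitude from \(\{-3, -2, 2, 3\}\) to \(\{-2.5, -1.5, 1.5, 2.5\}\); equivalently, the gradients \(g_i = \hat{y}_i - y_i\) went from \(\{3, 2, -2, -3\}\) to \(\{2.5, 1.5, -1.5, -2.5\}\). Tree 2 will fit these smaller residuals, and the process repeats, with each iteration pulling predictions closer to the true targets.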
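The hand calculation can be checked with a few lines of plain Python. This is only a sketch that mirrors Steps 1 to 5 for this specific four-sample dataset; the helper name `gain_at` is made up for illustration.

```python
xs = [1, 2, 3, 4]
ys = [2, 3, 7, 8]
eta, lam, gamma = 0.3, 1.0, 0.0

# Step 1: base prediction is the mean of the targets.
base = sum(ys) / len(ys)
preds = [base] * len(ys)

# Step 2: gradients and hessians for MSE.
gs = [p - y for p, y in zip(preds, ys)]
hs = [1.0] * len(ys)


def gain_at(threshold):
    """Split gain for the rule x <= threshold."""
    GL = sum(g for x, g in zip(xs, gs) if x <= threshold)
    HL = sum(h for x, h in zip(xs, hs) if x <= threshold)
    GR, HR = sum(gs) - GL, sum(hs) - HL
    G, H = GL + GR, HL + HR
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)) - gamma


# Step 3: evaluate the three candidate thresholds and pick the best.
gains = {t: gain_at(t) for t in (1, 2, 3)}
best_t = max(gains, key=gains.get)

# Step 4: optimal leaf weights for the chosen split.
GL = sum(g for x, g in zip(xs, gs) if x <= best_t)
HL = sum(h for x, h in zip(xs, hs) if x <= best_t)
w_left = -GL / (HL + lam)
w_right = -(sum(gs) - GL) / (sum(hs) - HL + lam)

# Step 5: update predictions with the shrunken leaf values.
new_preds = [p + eta * (w_left if x <= best_t else w_right) for p, x in zip(preds, xs)]

print("gains:", {t: round(v, 3) for t, v in gains.items()})
print("weights:", round(w_left, 3), round(w_right, 3))
print("new predictions:", [round(p, 3) for p in new_preds])
```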


7. Key Hyperparameters

| Parameter | XGBoost name | Default | Effect | Tuning tip |
|---|---|---|---|---|
| Number of trees | n_estimators | 100 | More trees lower the training loss but risk overfitting without regularisation. Use early stopping to find the optimum. | Set high (500–2000) and use early stopping. Do not tune directly. |
| Learning rate | learning_rate (eta) | 0.3 | Scales each tree's contribution. Lower values need more trees but generalise better. | 0.01–0.1 with high n_estimators. Lower eta almost always wins given enough trees. |
| Max tree depth | max_depth | 6 | Deeper trees capture more complex interactions but carry a higher overfitting risk. | 3–6 for most problems. Start at 4 or 6. Reduce if overfitting. |
| Row subsampling | subsample | 1.0 | Fraction of training rows used per tree. Adds stochasticity, reduces variance. | 0.6–0.9. Values below 0.5 can introduce too much bias. |
| Column subsampling per tree | colsample_bytree | 1.0 | Fraction of features sampled per tree. Decorrelates trees like Random Forest. | 0.5–0.8 for high-dimensional data. |
| Column subsampling per level | colsample_bylevel | 1.0 | Fraction of features sampled at each new depth level within a tree. Finer-grained than colsample_bytree (colsample_bynode goes one step further and samples per split). | Often left at 1.0; only tune if colsample_bytree alone isn't enough. |
| L2 regularisation | reg_lambda (lambda) | 1.0 | Penalises large leaf weights. Shrinks predictions toward zero. | 1–10. Increase when overfitting despite shallow trees. |
| L1 regularisation | reg_alpha (alpha) | 0.0 | Promotes sparse leaf weights (some leaves exactly zero). | Useful with many irrelevant features. Try 0.1–1.0. |
| Minimum split gain | gamma (min_split_loss) | 0.0 | A split is kept only if its loss reduction exceeds γ, that is, if Gain > 0 in the formula of Section 5.5. Acts as a pruning threshold. | 0–5. Raise if trees are very deep and overfitting. |
| Min child weight | min_child_weight | 1 | Minimum sum of hessians \(\sum h_i\) in a leaf. For MSE (h=1) this equals minimum samples per leaf. | 3–10 for noisy datasets. Prevents splits on very small groups. |
| Tree method | tree_method | 'hist' since XGBoost 2.0 ('auto' in older releases) | 'hist' uses histogram-based splits (much faster, like LightGBM). 'exact' tries every threshold. | Use 'hist' for large datasets (>10k rows). |
| Early stopping | early_stopping_rounds | None | Stops training if the validation metric does not improve for N consecutive rounds. | Always use with a validation set. Set 20–50. Saves time and prevents overfitting automatically. |
| Parallelism | n_jobs | None (recent versions then use all available threads) | Number of CPU threads. Work inside each tree, such as split finding, is parallelised; the trees themselves are still built one after another. | Set n_jobs=-1 to request all cores explicitly. Big speedup on large datasets. |

8. Bias–Variance Perspective

Boosting primarily attacks bias: each tree corrects errors the ensemble cannot yet explain. But without regularisation, it will eventually overfit (high variance). XGBoost controls variance through multiple mechanisms simultaneously:

| Mechanism | Reduces | How |
|---|---|---|
| Low learning rate (\(\eta\)) | Variance | Each tree contributes only a small fraction; harder to overfit any single tree |
| L2 regularisation (\(\lambda\)) | Variance | Shrinks leaf weights, preventing extreme predictions |
| Minimum gain (\(\gamma\)) | Variance | Prunes splits that reduce loss by less than \(\gamma\); shallower trees |
| Row subsampling | Variance | Stochasticity prevents over-reliance on any one sample |
| Column subsampling | Variance | Decorrelates trees, similar to Random Forest |
| More trees | Bias | Progressively corrects remaining errors |
| Larger depth | Bias | Captures higher-order interactions between features |

The practical implication: start with a low learning rate and many trees with early stopping, then tune depth and regularisation. The learning rate is the most important single hyperparameter.


9. XGBoost vs LightGBM vs CatBoost

XGBoost spawned a generation of optimised gradient boosting libraries. Understanding the differences helps you pick the right one.

| Property | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth strategy | Level-wise (breadth-first) | Leaf-wise (best-first): grows the leaf with highest gain | Symmetric (oblivious) trees: same split at every node on a level |
| Speed on large datasets | Fast (hist method) | Fastest: histogram binning + GOSS + EFB | Moderate, but very fast for categorical features |
| Categorical features | Encode manually, or use the native support in newer releases (enable_categorical=True) | Native support (integer-encoded) | Best: target encoding + ordered boosting natively |
| Memory | Moderate | Low (histogram binning) | Moderate–High |
| Overfitting resistance | Good with tuning | Leaf-wise can overfit on small datasets; needs regularisation | Strong: ordered boosting prevents target leakage |
| GPU support | Yes (device='cuda' since 2.0; tree_method='gpu_hist' in older releases) | Yes (device='gpu') | Yes, built-in |
| Best for | General purpose, well-tuned | Very large datasets, speed-critical | Many categorical columns, avoiding manual encoding |

Rule of thumb: Start with XGBoost. Switch to LightGBM when training speed is the bottleneck. Switch to CatBoost when you have many high-cardinality categorical columns.


10. Pros, Cons, and When to Use

Advantages

  • State-of-the-art on tabular data. XGBoost, LightGBM, or CatBoost wins or ranks top-3 in the vast majority of structured-data Kaggle competitions. For tabular data it routinely beats deep learning.
  • Handles mixed types natively. Continuous, ordinal, and (with encoding) categorical features all work directly.
  • Built-in regularisation. γ, λ, α, subsampling, and learning rate all control overfitting from different angles.
  • Missing value handling. XGBoost learns a default direction at each split for missing values — no imputation required.
  • Rich feature importance. Three main types: gain (average improvement to the objective per split), cover (average number of samples affected per split), and weight (number of splits, sometimes called frequency). SHAP values also work natively.
  • Early stopping. Automatically finds the optimal number of trees using a held-out validation set, eliminating one of the most impactful hyperparameter decisions.
  • No feature scaling needed. Tree splits are threshold-based and invariant to monotone transformations.

Disadvantages

  • Slower to train than Random Forest on small datasets because trees are sequential, not parallel.
  • More hyperparameters to tune. Learning rate, depth, subsampling, and regularisation all interact. Getting the best result requires more effort than RF.
  • Cannot extrapolate. Like all tree methods, XGBoost predicts within the range seen during training. Regression on out-of-distribution inputs will plateau.
  • Black box. An ensemble of hundreds of trees is not directly interpretable. Use SHAP for explanations.
  • Worse on unstructured data. Text, images, audio — deep learning dominates there.

When to use XGBoost

| Situation | Recommendation |
|---|---|
| Tabular data, need top accuracy, willing to tune | Excellent choice; the go-to algorithm |
| Kaggle / competition, structured data | Start here (then try LightGBM / CatBoost) |
| Missing values present | Handled natively, no imputation needed |
| Need very fast training on millions of rows | Use LightGBM instead |
| Many high-cardinality categorical columns | Use CatBoost instead |
| Quick strong baseline with minimal tuning | Random Forest is easier to use |
| Image, text, or audio data | Use CNNs or Transformers |
| Hard interpretability requirement | Use logistic regression + SHAP |

11. Python Implementation

11.1 XGBoost's Core Logic from Scratch

This implementation shows XGBoost's key ideas: computing \(g_i\) and \(h_i\), deriving the optimal leaf weight \(w^* = -G/(H+\lambda)\), and overriding the tree's default leaf predictions with the regularised XGBoost values. It uses scikit-learn's tree splitter purely for finding split thresholds.


import numpy as np
from sklearn.tree import DecisionTreeRegressor

class XGBoostScratch:
    """Demonstrates XGBoost's core math: gradients, hessians, and optimal leaf weights."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3, lam=1.0):
        self.n_estimators = n_estimators
        self.eta          = learning_rate
        self.max_depth    = max_depth
        self.lam          = lam          # L2 regularisation lambda
        self.trees        = []
        self.base_score   = None

    @staticmethod
    def _mse_grad_hess(y, y_pred):
        """MSE: g_i = ŷ_i - y_i, h_i = 1 for all i."""
        return y_pred - y, np.ones_like(y)

    @staticmethod
    def _xgb_leaf_weight(g_leaf, h_leaf, lam):
        """Optimal leaf weight: w* = -G / (H + λ)"""
        return -g_leaf.sum() / (h_leaf.sum() + lam)

    def fit(self, X, y):
        self.base_score = float(np.mean(y))
        y_pred = np.full(len(y), self.base_score)

        for _ in range(self.n_estimators):
            g, h = self._mse_grad_hess(y, y_pred)

            # Train tree on pseudo-response -g/h = residual for MSE
            pseudo_response = -g / h
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, pseudo_response)

            # Replace leaf values with XGBoost optimal weights w* = -G/(H+λ)
            leaf_ids = tree.apply(X)
            for leaf in np.unique(leaf_ids):
                mask = leaf_ids == leaf
                w_opt = self._xgb_leaf_weight(g[mask], h[mask], self.lam)
                tree.tree_.value[leaf, 0, 0] = w_opt   # override sklearn's mean

            y_pred += self.eta * tree.predict(X)
            self.trees.append(tree)

    def predict(self, X):
        y_pred = np.full(len(X), self.base_score)
        for tree in self.trees:
            y_pred += self.eta * tree.predict(X)
        return y_pred

Test it on the California housing dataset:


from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBoostScratch(n_estimators=100, learning_rate=0.1, max_depth=4, lam=1.0)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Scratch XGBoost RMSE: {rmse:.4f}")
# Expected: ~0.55-0.65 (real XGBoost gets ~0.45 with full optimisations)

11.2 Using the XGBoost Library for Real Projects

Install the library with pip install xgboost. The example below uses the Breast Cancer Wisconsin dataset (569 samples, 30 features, binary classification) and walks through training, early stopping, evaluation, and feature importance one step at a time.

Step 1: Load and split the data


import xgboost as xgb
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}   Test: {X_test.shape}")
# Output: Train: (455, 30)   Test: (114, 30)

Step 2: Train with early stopping

We set n_estimators=1000 as an upper bound and let early_stopping_rounds=30 halt training automatically when the validation log-loss does not improve for 30 consecutive rounds. This finds the optimal number of trees without any manual search.


# Hold out part of the training data for early stopping, so the test set
# stays untouched until the final evaluation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

model = xgb.XGBClassifier(
    n_estimators       = 1000,   # upper bound; early stopping decides actual count
    learning_rate      = 0.05,
    max_depth          = 4,
    subsample          = 0.8,
    colsample_bytree   = 0.8,
    gamma              = 0.1,
    reg_lambda         = 1.0,
    reg_alpha          = 0.0,
    tree_method        = 'hist', # fast histogram-based splits
    early_stopping_rounds = 30,
    eval_metric        = 'logloss',
    random_state       = 42
)

model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],   # validation set drives early stopping
    verbose=False
)

print(f"Best number of trees: {model.best_iteration + 1}")
print(f"Test accuracy:        {accuracy_score(y_test, model.predict(X_test)):.4f}")
# The exact tree count and accuracy depend on the split and the XGBoost version.
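
The stopping rule itself is simple. The sketch below is a simplified stand-in for what early stopping does, not XGBoost's actual implementation: it walks through a list of per-round validation losses and stops once the best loss has not improved for `patience` rounds. The function name and the loss values are made up for illustration.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_round, last_round_run), both zero-based."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss      # new best: reset the patience window
        elif i - best_idx >= patience:
            return best_idx, i                 # no improvement for `patience` rounds
    return best_idx, len(val_losses) - 1       # ran out of rounds without stopping


# Hypothetical validation log-loss per boosting round.
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.39, 0.40]
best_round, stopped_at = best_iteration_with_patience(losses, patience=3)
print(f"best round: {best_round}, stopped at round: {stopped_at}, trees to keep: {best_round + 1}")
```

The `+ 1` when reporting the tree count is there for the same reason as in `model.best_iteration + 1`: rounds are counted from zero.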

Step 3: Classification report


print(classification_report(y_test, model.predict(X_test), target_names=data.target_names))
# Example output:
#               precision    recall  f1-score   support
#    malignant       0.97      0.95      0.96        42
#       benign       0.97      0.99      0.98        72
#     accuracy                           0.97       114

Step 4: Cross-validation for a stable accuracy estimate

We cannot pass model directly to cross_val_score because it has early_stopping_rounds set — scikit-learn's CV loop does not supply an eval_set, causing a crash. Instead, build a fresh estimator with the best tree count discovered during training:


# Use best_iteration (discovered by early stopping) — no eval_set needed
cv_model = xgb.XGBClassifier(
    n_estimators   = model.best_iteration + 1,
    learning_rate  = 0.05,
    max_depth      = 4,
    subsample      = 0.8,
    colsample_bytree = 0.8,
    gamma          = 0.1,
    reg_lambda     = 1.0,
    tree_method    = 'hist',
    random_state   = 42
)

cv_scores = cross_val_score(cv_model, X, y, cv=5, scoring='accuracy', n_jobs=-1)
print(f"5-fold CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
# Typical output: 5-fold CV accuracy: 0.9736 +/- 0.0080

Step 5: Feature importance (three types)

XGBoost provides three importance metrics. Gain (average improvement to the objective per split) is the most informative. Cover (average samples per split) measures how broadly a feature is applied. Weight (total number of splits) is the least reliable for ranking.


fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, importance_type in zip(axes, ['gain', 'cover']):
    xgb.plot_importance(
        model,
        importance_type=importance_type,
        max_num_features=10,
        ax=ax,
        title=f"Top 10 Features — {importance_type.capitalize()}"
    )

plt.tight_layout()
plt.show()
# Top features by gain: typically worst radius, worst perimeter, worst area
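
To see how the three metrics can disagree, here is a small sketch that computes them from a hand-written list of split records, following the definitions in the paragraph above (weight is a count, gain and cover are averages per split). The feature names and numbers are hypothetical and do not come from the model trained here.

```python
from collections import defaultdict


def importance(splits):
    """splits: list of (feature, gain, cover) tuples, one per split node."""
    count = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        total_gain[feature] += gain
        total_cover[feature] += cover
    return {
        f: {
            "weight": count[f],                    # number of splits using f
            "gain": total_gain[f] / count[f],      # average gain per split
            "cover": total_cover[f] / count[f],    # average samples per split
        }
        for f in count
    }


# Hypothetical splits: one decisive split on "radius", three minor ones on "texture".
splits = [
    ("radius", 12.0, 400),
    ("texture", 1.0, 150),
    ("texture", 0.5, 90),
    ("texture", 1.5, 60),
]
scores = importance(splits)
for name, s in scores.items():
    print(name, s)
```

Here "texture" wins on weight even though "radius" contributes far more per split, which is the reason weight is the least reliable metric for ranking.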

Step 6: Hyperparameter tuning with grid search

XGBoost's native xgb.cv() is usually the faster way to tune the number of trees, because it handles early stopping inside each fold. For the structural hyperparameters, a common approach is to fix the number of trees and the learning rate, then search over depth, subsampling, and gamma with scikit-learn's GridSearchCV, as done here.


from sklearn.model_selection import GridSearchCV

# Step 1: find the best structural parameters with a fixed learning rate
param_grid = {
    'max_depth':         [3, 4, 6],
    'subsample':         [0.7, 0.8, 1.0],
    'colsample_bytree':  [0.7, 0.8, 1.0],
    'gamma':             [0, 0.1, 0.3],
}

base_model = xgb.XGBClassifier(
    n_estimators=200, learning_rate=0.05,
    tree_method='hist', random_state=42
)

grid_search = GridSearchCV(
    base_model, param_grid, cv=5,
    scoring='accuracy', n_jobs=-1, verbose=0
)
grid_search.fit(X_train, y_train)

print("Best params:", grid_search.best_params_)
print("Best CV acc:", round(grid_search.best_score_, 4))

12. Key Takeaways

  • Sequential correction: Each tree in XGBoost corrects the residuals of the ensemble so far. This reduces bias — the primary weakness of any single shallow tree.
  • Second-order Taylor: Using both gradient \(g_i\) and hessian \(h_i\) enables closed-form optimal leaf weights and a principled split gain formula that works for any differentiable loss.
  • Regularised objective: \(\Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2\) is baked directly into every split decision — not bolted on as an afterthought.
  • Optimal leaf weight: \(w^* = -G/(H+\lambda)\) is derived analytically. Higher \(\lambda\) shrinks predictions; higher \(G\) means the leaf wants a larger correction.
  • Split Gain: A split is only made when \(\frac{1}{2}[\text{Score}_L + \text{Score}_R - \text{Score}_\text{parent}] > \gamma\). This plays a role similar to cost-complexity pruning: a split is kept only if its improvement pays for the extra leaf.
  • Early stopping is essential: Always use a validation set with early_stopping_rounds. It eliminates the most impactful hyperparameter (number of trees) automatically.
  • Low learning rate wins: Set learning_rate=0.01–0.05, keep n_estimators high, and let early stopping decide when to stop.

References

  • Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of KDD 2016. The original XGBoost paper; all maths in Section 5 derives from here.
  • Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29(5), 1189–1232. Introduced gradient boosting and functional gradient descent.
  • Friedman, J. H. (2002). Stochastic Gradient Boosting. Computational Statistics & Data Analysis, 38(4), 367–378. Introduced row subsampling to gradient boosting.
  • Ke, G. et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017. Introduces GOSS and EFB for faster training.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), Ch. 10. Springer. Rigorous statistical treatment of boosting.
  • XGBoost documentation
  • Random Realizations: XGBoost Explained — excellent deep-dive on the maths.
