XGBoost: A Complete Beginner's Guide

Introduction

XGBoost builds decision trees one at a time in sequence — each new tree trained to correct the remaining errors of the ensemble so far — while a regularised, second-order Taylor objective keeps the learning mathematically principled and prevents overfitting.

1. Why XGBoost Became the Algorithm Everyone Uses

In 2015, the year after XGBoost's public release, Tianqi Chen (its creator) estimated that it appeared in the winning solution of more than half of all structured-data competitions on Kaggle that year. It went on to dominate competitive machine learning for the better part of a decade. A 2022 survey found XGBoost or one of its direct descendants (LightGBM, CatBoost) in the top solution of over 60% of tabular-data competitions. For a single algorithm released as a PhD side project, that is an extraordinary record.

The reason is not magic. It is a set of specific, mathematically principled improvements over vanilla gradient boosting — each targeting a different weakness — combined in a single, well-engineered, production-ready library. This guide explains every improvement from first principles, so you understand which knob matters and why, instead of tuning blindly.

2. Sequential vs Parallel: The Fundamental Difference

Random Forest and XGBoost are both tree ensembles, but they attack the bias-variance tradeoff in opposite directions.

Property	Random Forest	XGBoost / Gradient Boosting
Tree construction	Parallel — trees are independent	Sequential — each tree sees the residuals of all previous trees
What each tree targets	The original labels \(y\)	The remaining error not yet explained by the ensemble
Primary benefit	Reduces variance by averaging many diverse trees	Reduces bias by iteratively fitting what previous trees missed
Overfitting risk	Low — more trees never hurts	Higher — too many trees memorises training noise
Hyperparameter sensitivity	Low — robust to defaults	Higher — learning rate, depth, and regularisation all matter

Figure 1: XGBoost's sequential pipeline. Each tree targets the residuals left by the ensemble so far. The learning rate η scales each tree's contribution, preventing any single tree from over-correcting.

3. Gradient Boosting: The Foundation

XGBoost is a highly optimised implementation of gradient boosting, introduced by Jerome Friedman in 2001. Understanding gradient boosting first makes the XGBoost-specific additions easy to appreciate.

3.1 The Additive Model

A gradient boosting model is built as a sum of \(K\) trees:

\hat{y}_i = F(\mathbf{x}_i) = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F}

where \(\mathcal{F}\) is the space of all regression trees and each \(f_k\) maps a sample to a leaf value. We add trees one at a time. At step \(t\):

\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)

The task at each step is to find the tree \(f_t\) that most reduces the total loss.

3.2 Gradient Descent in Function Space

Regular gradient descent updates model parameters in the direction of the negative gradient of the loss. Gradient boosting does the same thing, but instead of updating parameters, it adds a new function (tree) that points in the negative gradient direction.

For Mean Squared Error (MSE) loss \(\ell = \frac{1}{2}(y_i - \hat{y}_i)^2\), the negative gradient with respect to the prediction is:

-\frac{\partial \ell}{\partial \hat{y}_i} = y_i - \hat{y}_i \quad \text{(the residual)}

So vanilla gradient boosting for regression literally fits each new tree to the residuals. This is why gradient boosting and "iterative residual fitting" are the same thing for MSE.

The key insight: By working with gradients instead of raw residuals, the same framework works for any differentiable loss — not just MSE. This is what makes gradient boosting a general-purpose algorithm.

3.3 Gradients and Hessians for Common Loss Functions

Loss function	Problem	First derivative \(g_i\)	Second derivative \(h_i\)
MSE: \(\frac{1}{2}(y_i-\hat{y}_i)^2\)	Regression	\(\hat{y}_i - y_i\)	\(1\)
Log loss (cross-entropy)	Binary classification	\(\hat{p}_i - y_i\)	\(\hat{p}_i(1-\hat{p}_i)\)
Softmax cross-entropy	Multi-class	\(\hat{p}_{ik} - \mathbf{1}[y_i=k]\)	\(\hat{p}_{ik}(1-\hat{p}_{ik})\)
Pseudo-Huber	Robust regression	Smooth approximation to sign	Bounded curvature

For log loss, \(\hat{p}_i = \sigma(\hat{y}_i)\) is the sigmoid of the raw model output (logit), and \(y_i \in \{0, 1\}\).

4. What XGBoost Adds to Gradient Boosting

The original XGBoost paper by Tianqi Chen and Carlos Guestrin (KDD 2016) added six major improvements to vanilla gradient boosting:

Regularised objective — penalises the number of leaves and the magnitude of leaf weights directly in the loss function, not just through hyperparameters.
Second-order Taylor approximation — uses both gradient \(g_i\) and hessian \(h_i\), giving more curvature information and enabling closed-form solutions.
Optimal leaf weights — derived analytically (no line search needed).
Exact split gain formula — evaluates every candidate split using the objective directly.
Column and row subsampling — reduces variance and speeds training, similar to Random Forest.
Sparsity-aware split finding — efficiently handles missing values and sparse features by learning a default direction for each split.

These additions make XGBoost faster, more accurate, and more regularised than scikit-learn's GradientBoostingClassifier, which implements vanilla gradient boosting.

5. The Complete Mathematics

5.1 The Regularised Objective

The full objective XGBoost minimises at step \(t\) is:

\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\!\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t)

where the regularisation term penalises tree complexity:

\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

\(T\) — number of leaves in the tree. \(\gamma\) (gamma) is the minimum gain required to create a new leaf: if no split improves the objective by at least \(\gamma\), no split is made.
\(w_j\) — the predicted value (weight) at leaf \(j\). \(\lambda\) (lambda) is the L2 regularisation coefficient that shrinks leaf weights toward zero.

5.2 Second-Order Taylor Approximation

Optimising \(\mathcal{L}^{(t)}\) directly is hard because \(f_t\) appears inside a general loss function. XGBoost approximates the loss around the current prediction \(\hat{y}_i^{(t-1)}\) using a second-order Taylor expansion:

\ell\!\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) \approx \ell\!\left(y_i,\, \hat{y}_i^{(t-1)}\right) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t(\mathbf{x}_i)^2

where:

g_i = \frac{\partial\, \ell(y_i,\, \hat{y}_i^{(t-1)})}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 \ell(y_i,\, \hat{y}_i^{(t-1)})}{\partial\, (\hat{y}_i^{(t-1)})^2}

Dropping the constant term \(\ell(y_i, \hat{y}_i^{(t-1)})\) (it does not depend on \(f_t\)) and substituting \(\Omega\), the simplified objective becomes:

\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t(\mathbf{x}_i)^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

Because each sample \(i\) lands in exactly one leaf \(j\), we can group by leaf. Let \(I_j = \{i : \mathbf{x}_i \text{ falls in leaf } j\}\), and define the per-leaf gradient and hessian sums:

G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i

The objective separates into independent leaf terms:

\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T

5.3 Optimal Leaf Weight

Each leaf's term is a simple quadratic in \(w_j\). Differentiating and setting to zero gives the closed-form optimal leaf weight:

\boxed{w_j^* = -\frac{G_j}{H_j + \lambda}}

Intuition: The numerator \(G_j\) is the total gradient — how much the prediction needs to move. The denominator \(H_j + \lambda\) is the curvature of the loss plus regularisation — it limits how aggressively to move. Higher \(\lambda\) means smaller leaf weights (more conservative updates).

5.4 Tree Structure Score

Substituting \(w_j^*\) back into the objective gives the best achievable loss for a given tree structure \(q\):

\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T

This is the tree structure score. A lower score means a better tree. We want to find the structure \(q\) that minimises it.

5.5 Split Gain Formula

We cannot enumerate all possible tree structures, so XGBoost builds trees greedily — one split at a time. The gain from splitting a node containing samples \(I\) into left child \(I_L\) and right child \(I_R\) is:

\boxed{\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{G^2}{H+\lambda}\right] - \gamma}

The three bracketed terms are the structure scores of the left child, right child, and parent respectively. The gain is the improvement in objective from the split, minus the leaf-count penalty \(\gamma\). If Gain \(\leq 0\), the split is rejected — it does not help enough to justify the added complexity.

Figure 2: XGBoost evaluates every candidate split by computing the gain: the improvement in the structure score (Score_L + Score_R − Score_parent), halved and then penalised by γ. If Gain ≤ 0, the split is rejected.

6. Step-by-Step Worked Example

Let us build one XGBoost tree by hand on a tiny regression dataset. We use MSE loss (\(g_i = \hat{y}_i - y_i\), \(h_i = 1\)), learning rate \(\eta = 0.3\), \(\lambda = 1\), and \(\gamma = 0\).

Dataset

Sample \(i\)	Feature \(x\)	Target \(y\)
1	1	2
2	2	3
3	3	7
4	4	8

Step 1: Base Prediction

XGBoost initialises with the mean of the targets:

\hat{y}_i^{(0)} = \bar{y} = \frac{2+3+7+8}{4} = 5 \quad \text{for all } i

Step 2: Compute Gradients and Hessians

For MSE, \(g_i = \hat{y}_i - y_i\) and \(h_i = 1\):

\(i\)	\(\hat{y}_i^{(0)}\)	\(y_i\)	\(g_i\)	\(h_i\)
1	5	2	+3	1
2	5	3	+2	1
3	5	7	−2	1
4	5	8	−3	1

Step 3: Find the Best Split

We have one feature \(x\) with three candidate thresholds: \(x \leq 1\), \(x \leq 2\), and \(x \leq 3\).

Candidate: \(x \leq 1\)

Left: \(\{1\}\) — \(G_L = 3\), \(H_L = 1\) | Right: \(\{2,3,4\}\) — \(G_R = -3\), \(H_R = 3\)

\text{Gain} = \frac{1}{2}\!\left[\frac{3^2}{1+1} + \frac{(-3)^2}{3+1} - \frac{0^2}{4+1}\right] = \frac{1}{2}\!\left[\frac{9}{2} + \frac{9}{4}\right] = \frac{1}{2}(4.5 + 2.25) = 3.375

Candidate: \(x \leq 2\)

Left: \(\{1,2\}\) — \(G_L = 5\), \(H_L = 2\) | Right: \(\{3,4\}\) — \(G_R = -5\), \(H_R = 2\)

\text{Gain} = \frac{1}{2}\!\left[\frac{5^2}{2+1} + \frac{(-5)^2}{2+1} - \frac{0^2}{4+1}\right] = \frac{1}{2}\!\left[\frac{25}{3} + \frac{25}{3}\right] = \frac{25}{3} \approx 8.333 \quad \checkmark \text{ best}

Candidate: \(x \leq 3\)

Left: \(\{1,2,3\}\) — \(G_L = 3\), \(H_L = 3\) | Right: \(\{4\}\) — \(G_R = -3\), \(H_R = 1\)

\text{Gain} = \frac{1}{2}\!\left[\frac{9}{4} + \frac{9}{2}\right] = \frac{1}{2}(2.25 + 4.5) = 3.375

Best split: \(x \leq 2\) with Gain = 8.333.

Step 4: Compute Optimal Leaf Weights

w_L^* = -\frac{G_L}{H_L + \lambda} = -\frac{5}{2+1} = -\frac{5}{3} \approx -1.667 \] \[ w_R^* = -\frac{G_R}{H_R + \lambda} = -\frac{-5}{2+1} = \frac{5}{3} \approx +1.667

Step 5: Update Predictions

Apply the new tree's leaf values scaled by \(\eta = 0.3\):

\(i\)	\(x\)	Leaf	\(w_j^*\)	\(\hat{y}_i^{(1)} = 5 + 0.3 \times w_j^*\)	True \(y_i\)	Residual
1	1	Left	−1.667	5 − 0.5 = 4.5	2	−2.5
2	2	Left	−1.667	5 − 0.5 = 4.5	3	−1.5
3	3	Right	+1.667	5 + 0.5 = 5.5	7	+1.5
4	4	Right	+1.667	5 + 0.5 = 5.5	8	+2.5

The residuals have shrunk from \(\{3, 2, -2, -3\}\) to \(\{2.5, 1.5, -1.5, -2.5\}\). Tree 2 will fit these smaller residuals, and the process repeats — each iteration pulling predictions closer to the true targets.

7. Key Hyperparameters

Parameter	XGBoost name	Default	Effect	Tuning tip
Number of trees	`n_estimators`	100	More trees = lower training loss, but risk of overfitting without regularisation. Use early stopping to find optimal.	Set high (500–2000) and use early stopping. Do not tune directly.
Learning rate	`learning_rate` (`eta`)	0.3	Scales each tree's contribution. Lower = more trees needed but better generalisation.	0.01–0.1 with high `n_estimators`. Lower eta almost always wins given enough trees.
Max tree depth	`max_depth`	6	Deeper trees → more complex interactions captured but higher overfitting risk.	3–6 for most problems. Start at 4 or 6. Reduce if overfitting.
Row subsampling	`subsample`	1.0	Fraction of training rows used per tree. Adds stochasticity, reduces variance.	0.6–0.9. Values below 0.5 can introduce too much bias.
Column subsampling per tree	`colsample_bytree`	1.0	Fraction of features sampled per tree. Decorrelates trees like Random Forest.	0.5–0.8 for high-dimensional data.
Column subsampling per split	`colsample_bylevel`	1.0	Fraction of features sampled at every split within a tree. Finer-grained than `colsample_bytree`.	Often left at 1.0; only tune if `colsample_bytree` alone isn't enough.
L2 regularisation	`reg_lambda` (`lambda`)	1.0	Penalises large leaf weights. Shrinks predictions toward zero.	1–10. Increase when overfitting despite shallow trees.
L1 regularisation	`reg_alpha` (`alpha`)	0.0	Promotes sparse leaf weights (some leaves exactly zero).	Useful with many irrelevant features. Try 0.1–1.0.
Minimum split gain	`gamma` (`min_split_loss`)	0.0	A split is only made if Gain > γ. Acts as a pruning threshold.	0–5. Raise if trees are very deep and overfitting.
Min child weight	`min_child_weight`	1	Minimum sum of hessians \(\sum h_i\) in a leaf. For MSE (h=1) this equals minimum samples per leaf.	3–10 for noisy datasets. Prevents splits on very small groups.
Tree method	`tree_method`	`'auto'`	`'hist'` uses histogram-based splits (much faster, like LightGBM). `'exact'` tries every threshold.	Always use `'hist'` for large datasets (>10k rows).
Early stopping	`early_stopping_rounds`	None	Stops training if validation metric does not improve for N consecutive rounds.	Always use with a validation set. Set 20–50. Saves time and prevents overfitting automatically.
Parallelism	`n_jobs`	1	Number of CPU threads for split finding (not tree construction, which is sequential).	Set `n_jobs=-1` to use all cores. Big speedup on large datasets.

8. Bias–Variance Perspective

Boosting primarily attacks bias: each tree corrects errors the ensemble cannot yet explain. But without regularisation, it will eventually overfit (high variance). XGBoost controls variance through multiple mechanisms simultaneously:

Mechanism	Reduces	How
Low learning rate (\(\eta\))	Variance	Each tree contributes only a small fraction; harder to overfit any single tree
L2 regularisation (\(\lambda\))	Variance	Shrinks leaf weights, preventing extreme predictions
Minimum gain (\(\gamma\))	Variance	Prunes splits that reduce loss by less than \(\gamma\); shallower trees
Row subsampling	Variance	Stochasticity prevents over-reliance on any one sample
Column subsampling	Variance	Decorrelates trees, similar to Random Forest
More trees	Bias	Progressively corrects remaining errors
Larger depth	Bias	Captures higher-order interactions between features

The practical implication: start with a low learning rate and many trees with early stopping, then tune depth and regularisation. The learning rate is the most important single hyperparameter.

9. XGBoost vs LightGBM vs CatBoost

XGBoost spawned a generation of optimised gradient boosting libraries. Understanding the differences helps you pick the right one.

Property	XGBoost	LightGBM	CatBoost
Tree growth strategy	Level-wise (breadth-first)	Leaf-wise (best-first) — grows the leaf with highest gain	Symmetric (oblivious) trees — same split at every node on a level
Speed on large datasets	Fast (hist method)	Fastest — histogram binning + GOSS + EFB	Moderate — but very fast for categorical features
Categorical features	Must encode manually	Native support (integer-encoded)	Best — target encoding + ordered boosting natively
Memory	Moderate	Low (histogram binning)	Moderate–High
Overfitting resistance	Good with tuning	Leaf-wise can overfit on small datasets; needs regularisation	Strong — ordered boosting prevents target leakage
GPU support	Yes (`tree_method='gpu_hist'`)	Yes (`device='gpu'`)	Yes, built-in
Best for	General purpose, well-tuned	Very large datasets, speed-critical	Many categorical columns, avoiding manual encoding

Rule of thumb: Start with XGBoost. Switch to LightGBM when training speed is the bottleneck. Switch to CatBoost when you have many high-cardinality categorical columns.

10. Pros, Cons, and When to Use

Advantages

State-of-the-art on tabular data. XGBoost, LightGBM, or CatBoost wins or ranks top-3 in the vast majority of structured-data Kaggle competitions. For tabular data it routinely beats deep learning.
Handles mixed types natively. Continuous, ordinal, and (with encoding) categorical features all work directly.
Built-in regularisation. γ, λ, α, subsampling, and learning rate all control overfitting from different angles.
Missing value handling. XGBoost learns a default direction at each split for missing values — no imputation required.
Rich feature importance. Three types: gain (improvement to objective), cover (number of samples), and frequency (number of splits). SHAP values also work natively.
Early stopping. Automatically finds the optimal number of trees using a held-out validation set, eliminating one of the most impactful hyperparameter decisions.
No feature scaling needed. Tree splits are threshold-based and invariant to monotone transformations.

Disadvantages

Slower to train than Random Forest on small datasets because trees are sequential, not parallel.
More hyperparameters to tune. Learning rate, depth, subsampling, and regularisation all interact. Getting the best result requires more effort than RF.
Cannot extrapolate. Like all tree methods, XGBoost predicts within the range seen during training. Regression on out-of-distribution inputs will plateau.
Black box. An ensemble of hundreds of trees is not directly interpretable. Use SHAP for explanations.
Worse on unstructured data. Text, images, audio — deep learning dominates there.

When to use XGBoost

Situation	Recommendation
Tabular data, need top accuracy, willing to tune	Excellent choice — the go-to algorithm
Kaggle / competition, structured data	Start here (then try LightGBM / CatBoost)
Missing values present	Handle natively, no imputation needed
Need very fast training on millions of rows	Use LightGBM instead
Many high-cardinality categorical columns	Use CatBoost instead
Quick strong baseline with minimal tuning	Random Forest is easier to use
Image, text, or audio data	Use CNNs or Transformers
Hard interpretability requirement	Use logistic regression + SHAP

11. Python Implementation

10.1 XGBoost's Core Logic from Scratch

This implementation shows XGBoost's key ideas: computing \(g_i\) and \(h_i\), deriving the optimal leaf weight \(w^* = -G/(H+\lambda)\), and overriding the tree's default leaf predictions with the regularised XGBoost values. It uses scikit-learn's tree splitter purely for finding split thresholds.


import numpy as np
from sklearn.tree import DecisionTreeRegressor

class XGBoostScratch:
    """Demonstrates XGBoost's core math: gradients, hessians, and optimal leaf weights."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3, lam=1.0):
        self.n_estimators = n_estimators
        self.eta          = learning_rate
        self.max_depth    = max_depth
        self.lam          = lam          # L2 regularisation lambda
        self.trees        = []
        self.base_score   = None

    @staticmethod
    def _mse_grad_hess(y, y_pred):
        """MSE: g_i = ŷ_i - y_i, h_i = 1 for all i."""
        return y_pred - y, np.ones_like(y)

    @staticmethod
    def _xgb_leaf_weight(g_leaf, h_leaf, lam):
        """Optimal leaf weight: w* = -G / (H + λ)"""
        return -g_leaf.sum() / (h_leaf.sum() + lam)

    def fit(self, X, y):
        self.base_score = float(np.mean(y))
        y_pred = np.full(len(y), self.base_score)

        for _ in range(self.n_estimators):
            g, h = self._mse_grad_hess(y, y_pred)

            # Train tree on pseudo-response -g/h = residual for MSE
            pseudo_response = -g / h
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, pseudo_response)

            # Replace leaf values with XGBoost optimal weights w* = -G/(H+λ)
            leaf_ids = tree.apply(X)
            for leaf in np.unique(leaf_ids):
                mask = leaf_ids == leaf
                w_opt = self._xgb_leaf_weight(g[mask], h[mask], self.lam)
                tree.tree_.value[leaf, 0, 0] = w_opt   # override sklearn's mean

            y_pred += self.eta * tree.predict(X)
            self.trees.append(tree)

    def predict(self, X):
        y_pred = np.full(len(X), self.base_score)
        for tree in self.trees:
            y_pred += self.eta * tree.predict(X)
        return y_pred

Test it on the California housing dataset:


from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBoostScratch(n_estimators=100, learning_rate=0.1, max_depth=4, lam=1.0)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Scratch XGBoost RMSE: {rmse:.4f}")
# Expected: ~0.55-0.65 (real XGBoost gets ~0.45 with full optimisations)

10.2 Using the XGBoost Library for Real Projects

Install the library with pip install xgboost. The example below uses the Breast Cancer Wisconsin dataset (569 samples, 30 features, binary classification) and walks through training, early stopping, evaluation, and feature importance one step at a time.

Step 1: Load and split the data


import xgboost as xgb
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}   Test: {X_test.shape}")
# Output: Train: (455, 30)   Test: (114, 30)

Step 2: Train with early stopping

We set n_estimators=1000 as an upper bound and let early_stopping_rounds=30 halt training automatically when the validation log-loss does not improve for 30 consecutive rounds. This finds the optimal number of trees without any manual search.


model = xgb.XGBClassifier(
    n_estimators       = 1000,   # upper bound; early stopping decides actual count
    learning_rate      = 0.05,
    max_depth          = 4,
    subsample          = 0.8,
    colsample_bytree   = 0.8,
    gamma              = 0.1,
    reg_lambda         = 1.0,
    reg_alpha          = 0.0,
    tree_method        = 'hist', # fast histogram-based splits
    early_stopping_rounds = 30,
    eval_metric        = 'logloss',
    random_state       = 42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

print(f"Best number of trees: {model.best_iteration + 1}")
print(f"Test accuracy:        {accuracy_score(y_test, model.predict(X_test)):.4f}")
# Typical output:
# Best number of trees: 187
# Test accuracy:        0.9737

Step 3: Classification report


print(classification_report(y_test, model.predict(X_test), target_names=data.target_names))
# Example output:
#               precision    recall  f1-score   support
#    malignant       0.97      0.95      0.96        42
#       benign       0.97      0.99      0.98        72
#     accuracy                           0.97       114

Step 4: Cross-validation for a stable accuracy estimate

We cannot pass model directly to cross_val_score because it has early_stopping_rounds set — scikit-learn's CV loop does not supply an eval_set, causing a crash. Instead, build a fresh estimator with the best tree count discovered during training:


# Use best_iteration (discovered by early stopping) — no eval_set needed
cv_model = xgb.XGBClassifier(
    n_estimators   = model.best_iteration + 1,
    learning_rate  = 0.05,
    max_depth      = 4,
    subsample      = 0.8,
    colsample_bytree = 0.8,
    gamma          = 0.1,
    reg_lambda     = 1.0,
    tree_method    = 'hist',
    random_state   = 42
)

cv_scores = cross_val_score(cv_model, X, y, cv=5, scoring='accuracy', n_jobs=-1)
print(f"5-fold CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
# Typical output: 5-fold CV accuracy: 0.9736 +/- 0.0080

Step 5: Feature importance (three types)

XGBoost provides three importance metrics. Gain (average improvement to the objective per split) is the most informative. Cover (average samples per split) measures how broadly a feature is applied. Weight (total number of splits) is the least reliable for ranking.


fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, importance_type in zip(axes, ['gain', 'cover']):
    xgb.plot_importance(
        model,
        importance_type=importance_type,
        max_num_features=10,
        ax=ax,
        title=f"Top 10 Features — {importance_type.capitalize()}"
    )

plt.tight_layout()
plt.show()
# Top features by gain: typically worst radius, worst perimeter, worst area

Step 6: Hyperparameter tuning with XGBoost's native cross-validation

XGBoost's xgb.cv() is faster than scikit-learn's GridSearchCV for tuning the number of trees because it handles early stopping natively. First fix the number of trees, then tune the structural hyperparameters.


from sklearn.model_selection import GridSearchCV

# Step 1: find the best structural parameters with a fixed learning rate
param_grid = {
    'max_depth':         [3, 4, 6],
    'subsample':         [0.7, 0.8, 1.0],
    'colsample_bytree':  [0.7, 0.8, 1.0],
    'gamma':             [0, 0.1, 0.3],
}

base_model = xgb.XGBClassifier(
    n_estimators=200, learning_rate=0.05,
    tree_method='hist', random_state=42
)

grid_search = GridSearchCV(
    base_model, param_grid, cv=5,
    scoring='accuracy', n_jobs=-1, verbose=0
)
grid_search.fit(X_train, y_train)

print("Best params:", grid_search.best_params_)
print("Best CV acc:", round(grid_search.best_score_, 4))

12. Key Takeaways

Sequential correction: Each tree in XGBoost corrects the residuals of the ensemble so far. This reduces bias — the primary weakness of any single shallow tree.
Second-order Taylor: Using both gradient \(g_i\) and hessian \(h_i\) enables closed-form optimal leaf weights and a principled split gain formula that works for any differentiable loss.
Regularised objective: \(\Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2\) is baked directly into every split decision — not bolted on as an afterthought.
Optimal leaf weight: \(w^* = -G/(H+\lambda)\) is derived analytically. Higher \(\lambda\) shrinks predictions; higher \(G\) means the leaf wants a larger correction.
Split Gain: A split is only made when \(\frac{1}{2}[\text{Score}_L + \text{Score}_R - \text{Score}_\text{parent}] > \gamma\). This is the mathematical equivalent of minimal-cost-complexity pruning.
Early stopping is essential: Always use a validation set with early_stopping_rounds. It eliminates the most impactful hyperparameter (number of trees) automatically.
Low learning rate wins: Set learning_rate=0.01–0.05, keep n_estimators high, and let early stopping decide when to stop.

References

Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of KDD 2016. The original XGBoost paper — all maths in Section 4 derives from here.
Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29(5), 1189–1232. Introduced gradient boosting and functional gradient descent.
Friedman, J. H. (2002). Stochastic Gradient Boosting. Computational Statistics & Data Analysis, 38(4), 367–378. Introduced row subsampling to gradient boosting.
Ke, G. et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017. Introduces GOSS and EFB for faster training.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), Ch. 10. Springer. Rigorous statistical treatment of boosting.
XGBoost documentation
Random Realizations: XGBoost Explained — excellent deep-dive on the maths.

Model Context Protocol (MCP): A Complete Beginner's Guide

MCP is the USB-C port for AI applications — one protocol that...

Random Forest: A Complete Beginner's Guide

Random Forest builds hundreds of deliberately different decision trees and takes a...

Found this useful?