A Beginner’s Guide to Lasso Regression (L1 Regularization)

Regularization Techniques I

Posted by Perivitta on October 03, 2025 · 9 mins read

1. Introduction

In previous posts, we explored Linear Regression, Ridge Regression, and evaluation metrics like RMSE and R². In this post, we’ll dive into Lasso Regression, short for Least Absolute Shrinkage and Selection Operator.

While Ridge Regression (L2 penalty) shrinks coefficients but keeps them nonzero, Lasso Regression (L1 penalty) can shrink some coefficients to exactly zero. This makes Lasso particularly powerful for feature selection, since it automatically drops irrelevant predictors.

2. The Lasso Cost Function

Ordinary Least Squares (OLS) minimizes squared errors:

$$ J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Ridge adds an L2 penalty:

$$ J_{ridge}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2 $$

Lasso instead uses an L1 penalty:

$$ J_{lasso}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j| $$
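
As a quick illustration, here is a minimal sketch that evaluates this Lasso cost for a candidate coefficient vector (the toy arrays, the candidate β, and the λ value below are arbitrary placeholders chosen for illustration, not taken from the post’s dataset):

import numpy as np

def lasso_cost(X, y, beta, lam):
    """Sum of squared errors plus the L1 penalty on the coefficients."""
    residuals = y - X @ beta
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(beta))

# Arbitrary toy inputs, just to show the two terms of the objective
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([2.0, 3.0, 5.0])
beta = np.array([1.2, 0.0])
print(lasso_cost(X, y, beta, lam=0.5))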

3. Why Does Lasso Set Coefficients to Zero?

Geometrically, Ridge constrains the coefficients to a circular region (the L2 ball), while Lasso constrains them to a diamond-shaped region (the L1 ball). The diamond’s corners lie exactly on the coordinate axes, so the constrained optimum often lands on a corner, forcing some coefficients to zero.

[Figure: Lasso vs Ridge constraint geometry]

Explanation: Notice how the loss contours tend to first touch the diamond (L1) at one of its sharp corners, which sit on the axes, so some coefficients become exactly zero. Ridge’s circle (L2) has no corners, so it only shrinks coefficients and rarely sets them exactly to zero.

Soft-thresholding

The Lasso solution is based on a process called soft-thresholding:

$$ S(z, \gamma) = \text{sign}(z) \cdot \max(|z| - \gamma, 0) $$

This pushes small coefficient values to exactly zero. Picture the graph of S(z, γ): it is flat at zero for every input with |z| ≤ γ (those coefficients vanish) and linear outside that range (larger coefficients are shrunk toward zero by γ).

[Figure: The soft-thresholding function]

Explanation: Any coefficient smaller than γ in magnitude becomes exactly zero. This is how Lasso performs automatic feature selection.
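
For instance, with γ = 0.125 (the threshold used in the worked example below), a value well above the threshold is merely shifted, while a small one is zeroed out:

$$ S(0.5,\ 0.125) = 0.375, \qquad S(0.08,\ 0.125) = 0 $$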

4. Example with a Tiny Dataset

Let’s use a small dataset so we can follow the math step by step:

| X | y |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
| 4 | 7 |
[Figure: OLS regression line on the dataset]

The OLS solution is:

$$ y = 0 + 1.7x $$
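
This follows from the usual least-squares formulas:

$$ \beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{8.5}{5} = 1.7, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x} = 4.25 - 1.7 \cdot 2.5 = 0 $$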

Step 1: Compute z

We first compute z, the average of the products x_i · y_i (the quantity that gets soft-thresholded in the update; this toy example skips standardization and the intercept to keep the arithmetic simple):

$$ z = \frac{1}{n} \sum x_i y_i = \frac{1}{4}(1 \cdot 2 + 2 \cdot 3 + 3 \cdot 5 + 4 \cdot 7) = \frac{51}{4} = 12.75 $$

Step 2: Apply Penalty

With λ = 1:

$$ \gamma = \frac{\lambda}{2n} = \frac{1}{8} = 0.125 $$

Step 3: Soft-thresholding

$$ \beta_1 = S(12.75, 0.125) = 12.625 $$

The coefficient shrinks slightly. A larger λ shrinks it further and eventually drives it to exactly zero.
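
With this toy update rule, the coefficient hits exactly zero once the threshold γ is at least as large as z:

$$ \gamma = \frac{\lambda}{2n} \ge 12.75 \quad\Longleftrightarrow\quad \lambda \ge 102 $$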

Step 4: Manual Python Demo


import numpy as np

# Tiny dataset from the table above
X = np.array([1, 2, 3, 4])
y = np.array([2, 3, 5, 7])
n = len(y)

# z = (1/n) * sum(x_i * y_i), the quantity that gets soft-thresholded
z = (1/n) * np.sum(X * y)
lam = 1
gamma = lam / (2 * n)

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    elif z < -gamma:
        return z + gamma
    else:
        return 0.0

beta1 = soft_threshold(z, gamma)
print("z =", z)
print("gamma =", gamma)
print("Updated coefficient β1 =", beta1)
print("Predictions:", beta1 * X)
        

5. Using Scikit-learn


import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 3, 5, 7])

# Standardize the feature so the penalty treats it on a unit scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# alpha plays the role of lambda in scikit-learn's Lasso
model = Lasso(alpha=1)
model.fit(X_scaled, y)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
        

Explanation: Unlike the manual demo, scikit-learn handles the iterative coordinate-descent updates and fits the intercept for you. Note that its objective is scaled as (1/(2n)) · RSS + α · Σ|β_j|, so α is not on exactly the same scale as the λ in the hand-worked example. Depending on α, coefficients may shrink all the way to zero.
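
With these four points and α = 1, the intercept should come out at the mean of y (4.25), and the coefficient on the standardized feature should shrink from its least-squares value of roughly 1.9 to roughly 0.9; raising α to about 1.9 or beyond would drive it to exactly zero.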

6. Visualizations

[Figure: Lasso coefficient paths as a function of λ]

Explanation: Each colored line is a coefficient path. As λ increases (moving right), more coefficients are pulled to zero. This demonstrates Lasso’s ability to perform feature selection.
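
As a rough sketch of how such a path plot can be produced (the synthetic data from make_regression and the α grid are illustrative choices, not the dataset behind the figure):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which are truly informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)
X = StandardScaler().fit_transform(X)

# Refit the Lasso for each alpha and record the coefficients
alphas = np.logspace(-2, 2, 50)
coefs = [Lasso(alpha=a, max_iter=10000).fit(X, y).coef_ for a in alphas]

plt.plot(alphas, coefs)
plt.xscale("log")
plt.xlabel("alpha (λ)")
plt.ylabel("coefficient value")
plt.title("Lasso coefficient paths")
plt.show()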

[Figure: Training and validation error versus λ]

Explanation: Training error always rises as λ increases (model becomes simpler). Validation error often decreases first (reducing overfitting), then rises again (underfitting). The “U-shape” shows why cross-validation is needed to find the optimal λ.
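
In practice, scikit-learn’s LassoCV automates exactly this search. A minimal sketch, reusing the synthetic data above (again an illustrative choice):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)
X = StandardScaler().fit_transform(X)

# 5-fold cross-validation over an automatically chosen grid of alphas
cv_model = LassoCV(cv=5, random_state=0).fit(X, y)

print("Best alpha:", cv_model.alpha_)
print("Nonzero coefficients:", np.sum(cv_model.coef_ != 0))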

7. Ridge vs Lasso

| Aspect | Ridge | Lasso |
|---|---|---|
| Penalty | L2 (squared) | L1 (absolute) |
| Effect on coefficients | Shrinks but never exactly zero | Can shrink to exactly zero |
| Use case | Handles multicollinearity | Feature selection |

8. Key Takeaways

  • Lasso adds an L1 penalty, unlike Ridge which uses L2.
  • Lasso can shrink coefficients to zero, performing automatic feature selection.
  • Always standardize features before applying Lasso (see the pipeline sketch after this list).
  • The choice of λ (alpha) is crucial: too small → overfitting, too large → underfitting.
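
One convenient way to honor the standardization advice is to wrap the scaler and the Lasso in a single pipeline. A minimal sketch (the α value is arbitrary, chosen only for illustration):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 3, 5, 7])

# The pipeline standardizes X before every fit/predict call
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)

print("Coefficient:", model.named_steps["lasso"].coef_)
print("Predictions:", model.predict(X))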

9. Conclusion

Lasso Regression is a powerful extension of linear regression. It combats overfitting through regularization, and simplifies models by removing irrelevant features. Ridge shrinks, Lasso selects.

10. References

  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B.
  • Scikit-learn documentation: sklearn.linear_model.Lasso.
  • Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning.
