How to Stop Overfitting: The Math of Regularization (Ridge & Lasso)
Machine Learning is a balancing act. If your model is too simple (like a straight line), it fails to capture the pattern. This is Underfitting (High Bias). If you give it too much power (like a 30th-degree polynomial), it memorizes the noise instead of the signal. This is Overfitting (High Variance).
So, how do we force a complex model to choose simplicity over chaos? We don't do it by manually removing features. We do it by changing the math itself.
In this post, we will decode Regularization. We'll derive the math behind Ridge (L2) and Lasso (L1) regression, understand why one stabilizes your matrix algebra while the other deletes features, and implement them from scratch in Python.
The Intuition: The "Complexity Penalty"
In standard Linear Regression, our model has only one goal: Minimize the Error (specifically, the Sum of Squared Errors).
Loss = Error
An overfitted model achieves this goal perfectly by inflating its weights (coefficients) to massive numbers—often in the thousands—just to twist and turn through every single data point.
To fix this, we change the rules. We add a Penalty Term to the loss function. We tell the model: "Minimize the error, BUT you have to pay a price for the size of your weights."
Loss = Error + (λ * Penalty)
The Greek letter Lambda (λ) is our control knob. If λ is high, the penalty is severe, and the model is forced to be simple.
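To make the idea concrete, here is a minimal sketch of the new objective in NumPy. The penalty function is left as a placeholder; the next two sections define it.
import numpy as np
def penalized_loss(w, X, y, lam, penalty):
    error = np.sum((y - X @ w) ** 2)   # Sum of Squared Errors
    return error + lam * penalty(w)    # plus the price paid for large weights
# Ridge (next section) plugs in:  penalty = lambda w: np.sum(w ** 2)
# Lasso (after that) plugs in:    penalty = lambda w: np.sum(np.abs(w))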
Ridge Regression (L2): The Stabilizer
In Ridge Regression, we define the "size" of the weights as their Squared Magnitude (L2 Norm).
The Loss Function:
J(w) = ||y - Xw||² + λ||w||²
The Mathematical Derivation
This is where the magic happens. When we take the derivative of this new loss function with respect to w and set it to zero, the penalty term transforms directly into a matrix operation.
- The derivative of the error term gives us our standard Normal Equation components.
- The derivative of the penalty term λ||w||² gives us 2λw.
When we rearrange the terms, this extra λ doesn't disappear. It gets added to the diagonal of our matrix.
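Writing the steps out in the same notation: take the gradient, set it to zero, and collect the w terms.
∂J/∂w = -2Xᵀ(y - Xw) + 2λw = 0
XᵀXw + λw = Xᵀy
(XᵀX + λI)w = Xᵀy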
The Ridge Solution:
w = (XᵀX + λI)⁻¹ Xᵀy
The Principal Engineer Insight: Why do we love Ridge? Not just because it stops overfitting, but because it fixes a fundamental flaw in linear algebra. In real-world datasets with correlated features, the matrix XᵀX is often singular (non-invertible). Adding λI (a value to the diagonal) guarantees that the matrix is invertible. It makes the math stable.
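Here is a quick toy demonstration of that claim (separate from the main example below): two nearly identical features make XᵀX effectively singular, and adding λI repairs its conditioning.
import numpy as np
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=1e-8, size=100)        # almost perfectly correlated feature
X = np.column_stack([x1, x2])
print(np.linalg.cond(X.T @ X))                    # astronomically large: effectively singular
print(np.linalg.cond(X.T @ X + 1.0 * np.eye(2)))  # small and healthy: safely invertible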
Lasso Regression (L1): The Feature Selector
In Lasso Regression, we penalize the Absolute Value of the weights (L1 Norm).
J(w) = ||y - Xw||² + λ∑|w|
Because the absolute value function has a sharp "corner" at zero, Lasso has a unique superpower: it can force weights to become exactly zero.
This acts as automatic Feature Selection. If you feed Lasso 100 features but only 3 are useful, it will crush the other 97 coefficients to zero, effectively deleting them from your model.
Note: Because of that sharp corner at zero, we cannot take a simple derivative. Lasso has no closed-form matrix formula like Ridge. We must use iterative solvers (like Coordinate Descent) to find the answer.
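For intuition, the core update inside coordinate descent is the soft-thresholding operator, which is where the exact zeros come from. A minimal sketch (the names here are illustrative, not scikit-learn's internals):
def soft_threshold(rho, lam):
    # The sharp corner at zero in action: anything within [-lam, lam] snaps to exactly 0
    if rho > lam:
        return rho - lam
    elif rho < -lam:
        return rho + lam
    else:
        return 0.0
print(soft_threshold(0.05, 0.1))   # 0.0  (crushed to exactly zero)
print(soft_threshold(0.80, 0.1))   # 0.7  (shrunk, but kept)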
Implementation: Manual vs. Scikit-Learn
Let's prove the math works. We'll generate a chaotic, overfitting model using a 30th-degree polynomial, and then tame it using Ridge.
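For reference, here is one minimal way to set up such an experiment (a sketch; the exact data generation in the Colab notebook may differ). It assumes scikit-learn's PolynomialFeatures and defines the X_poly and y used below.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)   # true signal + noise
# Expand to 30th-degree polynomial features; column 0 is the bias (all ones)
X_poly = PolynomialFeatures(degree=30, include_bias=True).fit_transform(x.reshape(-1, 1))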
1. Manual Ridge Implementation (NumPy)
We can implement the Ridge formula directly by manipulating the matrix diagonal.
import numpy as np
# Assume X_poly is our 30th-degree feature matrix and y is the target vector
lambda_val = 1.0
I = np.eye(X_poly.shape[1])
I[0, 0] = 0 # Crucial: Don't penalize the bias (intercept) term!
# The Stabilized Normal Equation
w_ridge = np.linalg.inv(X_poly.T @ X_poly + lambda_val * I) @ X_poly.T @ y
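One practical refinement: explicitly inverting a matrix is rarely the best numerical choice. np.linalg.solve computes the same weights more stably:
w_ridge = np.linalg.solve(X_poly.T @ X_poly + lambda_val * I, X_poly.T @ y)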
2. Production Implementation (Scikit-Learn)
In production, we use Scikit-Learn. For Lasso, this is mandatory since we need its iterative solver.
from sklearn.linear_model import Ridge, Lasso
# Ridge (L2) - Smooths the curve
ridge_model = Ridge(alpha=1.0).fit(X_poly, y)
# Lasso (L1) - Selects features
lasso_model = Lasso(alpha=0.1, max_iter=10000).fit(X_poly, y)
# Check Lasso's sparsity
print("Lasso Weights:", lasso_model.coef_)
# Result: mostly zeros!
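To quantify the sparsity instead of scanning the printout (the exact count depends on your data and alpha):
import numpy as np
print("Non-zero coefficients:", np.count_nonzero(lasso_model.coef_))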
Conclusion & The Next Step
We've successfully tamed overfitting.
- Ridge anchors the model, stabilizing the math and smoothing predictions.
- Lasso filters the model, removing noise and selecting features.
But this leads us to a deeper question. Lasso selects features by force. But what if the true pattern of our data isn't in any single feature, but in a hidden combination of them? To find these hidden structures, we need to look at the "DNA" of the matrix itself.
In the next post, we will decode one of the most powerful concepts in linear algebra: Eigenvectors and Eigenvalues.
Get the Code
Want to see the wiggly line smooth out in real-time? Check out the Google Colab Notebook to run this experiment yourself.
This post is part of the "Linear Algebra for Machine Learning" series. For the previous part, check out: The "Linear" Trick for Non-Linear Data (Polynomial Regression).