Least Squares Approximation Explained (The Engine of Linear Regression)
The Math Behind "Best Fit": A Guide to Least Squares Approximation
In a perfect mathematical world, every system of equations has a clean solution. But the real world is not perfect. Real-world data is noisy and inconsistent, and it often creates mathematical contradictions. So, what do we do when our system of equations `Ax=b` is unsolvable? Do we simply give up?
No. We find the best possible compromise. The way we define "best" is by finding a solution that makes the sum of the squared errors as small as possible. This powerful technique is the Least Squares Approximation, and it is the mathematical engine behind one of the most fundamental algorithms in machine learning: Linear Regression.
In this post, we'll decode the beautiful geometry that allows us to find this "best fit" solution, even when a perfect one doesn't exist.
Watch the video for the full visual story, then scroll down for the detailed derivation and code.
The Column Space
To find a solution, we must first understand why the problem is unsolvable. The equation `Ax=b` is asking: can we find a linear combination of the columns of A that perfectly equals the vector b?
Geometrically, this is asking: does our target vector b live inside the Column Space of A (the space spanned by A's columns)?
For an unsolvable system, the answer is no. The vector b is unreachable. So, if we can't get to b, what's the next best thing? It's the point on the plane of the column space that is closest to b. This closest point is the orthogonal projection of b onto the column space, which we'll call p.
Since p *is* in the column space, there must be a solution to this new, related problem:
A * x̂ = p
Our entire goal is to find this special vector x̂, the set of weights that produces the best possible approximation, p.
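Before we chase that projection, it's worth confirming the diagnosis. Here is a quick NumPy check (a minimal sketch, not part of the derivation itself) on the small 3×2 system we'll solve later in this post: if b were in the column space of A, appending it as an extra column would not raise the rank of the matrix.
import numpy as np
# The same 3x2 system we solve later in the post.
A = np.array([[1, 0], [0, 1], [1, 1]])
b = np.array([2, 3, 4])
# If b lay in the column space of A, adding it as a column would not increase the rank.
print(np.linalg.matrix_rank(A))                        # 2
print(np.linalg.matrix_rank(np.column_stack([A, b])))  # 3, so b is outside the column space and Ax = b has no solution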
The "Least Squares" Connection
The error in our approximation is the leftover vector, e = b - p. Our goal is to make this error vector as short as possible—to minimize its length, ||e||.
For mathematical convenience (to avoid dealing with square roots), we minimize the squared length ||e||² instead; a vector's length is smallest exactly when its squared length is smallest. And what is the squared length of a vector? It's simply the sum of the squares of its components.
||e||² = e₁² + e₂² + ... + eₙ²
This is the "sum of the squared errors" we are trying to minimize! This is precisely where the name "Least Squares" comes from. The geometric goal of finding the shortest error vector is identical to the statistical goal of minimizing the sum of squared errors.
The Normal Equations
The key to finding this shortest possible error vector is geometry. The error vector e will be at its minimum length when it is orthogonal to the column space of A. This means e must be orthogonal to every single column of A.
We can write this condition compactly using the transpose of A. The statement Aᵀe = 0 is the formal mathematical way of saying "the error vector e is orthogonal to all the columns of A."
This single clue is all we need. We can now derive the solution:
Aᵀ(b - p) = 0        (substitute e = b - p)
Aᵀ(b - Ax̂) = 0       (substitute p = Ax̂)
Aᵀb - AᵀAx̂ = 0       (using the distributive property)
AᵀAx̂ = Aᵀb           (rearrange the equation)
This final result is the master equation, known as the Normal Equations. We started with an unsolvable system and, through geometric reasoning, transformed it into a new, solvable system for our best-fit answer, x̂.
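One practical detail worth noting (a standard fact, not shown in the derivation above): as long as the columns of A are linearly independent, the square matrix AᵀA is invertible, so the Normal Equations have exactly one solution and can be written in closed form:
x̂ = (AᵀA)⁻¹ Aᵀb
In practice, though, we solve the system directly rather than forming the inverse, which is exactly what the code below does.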
In Code: The NumPy Way
Let's solve the unsolvable system from our video's opening using Python and NumPy. The code is a direct implementation of the Normal Equations.
import numpy as np
# Our unsolvable system:
# 1*x + 0*y = 2
# 0*x + 1*y = 3
# 1*x + 1*y = 4
A = np.array([[1, 0], [0, 1], [1, 1]])
b = np.array([2, 3, 4])
# Construct the Normal Equations: AᵀA * x̂ = Aᵀb
A_transpose = A.T
lhs = A_transpose @ A # The left-hand side: AᵀA
rhs = A_transpose @ b # The right-hand side: Aᵀb
# Solve the system for x_hat.
# np.linalg.solve is more numerically stable than calculating the inverse.
x_hat = np.linalg.solve(lhs, rhs)
print(f"The 'least squares' solution x_hat is: {x_hat}")
# Output: The 'least squares' solution x_hat is: [1.66666667 2.66666667]
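As an optional sanity check (continuing from the code above, not part of the original example), we can verify the two geometric facts we relied on: the error e = b - Ax̂ really is orthogonal to the columns of A, and NumPy's built-in least squares solver, np.linalg.lstsq, returns the same answer.
# Optional sanity check, continuing from the code above.
p = A @ x_hat            # p: the projection of b onto the column space
e = b - p                # e: the leftover error vector
print(A_transpose @ e)   # approximately [0, 0]: e is orthogonal to every column of A
# NumPy's built-in least squares solver agrees with the Normal Equations.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_lstsq)           # [1.66666667 2.66666667]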
Conclusion
We started with a contradiction and ended with a compromise. By reframing the problem geometrically, we understood that the 'best' solution corresponds to the shortest possible error vector. This insight, combined with the power of orthogonality, allowed us to derive the Normal Equations—a direct formula for the 'best fit' answer.
This is the beautiful bridge between pure linear algebra and applied machine learning. The Least Squares method is the engine that drives Linear Regression, and now that we've decoded it, we're ready to cross that bridge.
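To make that bridge concrete, here is a short, self-contained sketch (with made-up data, purely illustrative) that fits a line y ≈ c₀ + c₁x to noisy points using exactly the Normal Equations derived above.
import numpy as np
# Made-up noisy data that roughly follows y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
# Design matrix: a column of ones for the intercept, and a column of x values for the slope.
A = np.column_stack([np.ones_like(x), x])
# Solve the Normal Equations AᵀA c = Aᵀy for the best-fit coefficients.
c = np.linalg.solve(A.T @ A, A.T @ y)
print(f"Best-fit line: y = {c[0]:.3f} + {c[1]:.3f} * x")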
Your Turn...
The "line of best fit" is a classic example of a least squares problem. What's another real-world scenario where you have noisy data and need to find the best possible compromise? Share your thoughts in the comments!
This post is part of the "Linear Algebra for Machine Learning" series. For the previous part, check out: The "Purification" Algorithm: A Guide to the Gram-Schmidt Process.