Linear Regression from Scratch: Geometry, Math, and Code
To a human, a dataset is just a spreadsheet. It’s a list of inputs (like "Hours Studied") and outputs (like "Exam Score"). But to a computer, this isn't a list. It is geometry.
Predicting a value isn't about "learning" in the human sense. Mathematically, it is the problem of finding the best possible solution to a system of linear equations that is fundamentally unsolvable.
In this post, we will decode the math of Linear Regression. We will move beyond the "black box" of libraries like Scikit-Learn and build the algorithm from scratch using NumPy and Linear Algebra. We will also prove that our manual implementation yields the exact same results as the industry-standard libraries.
Watch the full visual breakdown in the video below, or scroll down for the detailed code and derivation.
The Geometry: Why "Least Squares"?
We start by organizing our data into a Design Matrix (X) and a Target Vector (y). Our goal is to find a set of weights w such that:
$$Xw = y$$
However, real-world data almost always has more data points (rows) than features (columns), which makes the system overdetermined. Geometrically, the target vector y does not live inside the Column Space of X, so no exact solution exists.
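Concretely, with the toy dataset used below (Hours Studied vs. Exam Score), asking for an exact fit means asking one intercept and one slope to satisfy five equations at once:
$$
\begin{aligned}
w_0 + 1\,w_1 &= 45 \\
w_0 + 2\,w_1 &= 50 \\
w_0 + 4\,w_1 &= 75 \\
w_0 + 5\,w_1 &= 80 \\
w_0 + 6\,w_1 &= 95
\end{aligned}
$$
The first two equations already force $w_1 = 5$ and $w_0 = 40$, which the third equation contradicts ($40 + 4 \cdot 5 = 60 \neq 75$), so no single line passes through all five points.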
So, we find a compromise. We look for the weights w that make the error vector $e = y - Xw$ as small as possible. Specifically, we minimize its squared length, $\|e\|^2 = \|y - Xw\|^2$ (hence, "Least Squares").
The geometric solution occurs when the error vector is orthogonal (perpendicular) to the column space of our data. This gives us the famous Normal Equations:
$$X^T X w = X^T y$$
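To see where this comes from: the error being orthogonal to every column of X means $X^T e = 0$, and substituting $e = y - Xw$ gives the Normal Equations directly:
$$
X^T (y - Xw) = 0 \quad\Longrightarrow\quad X^T X w = X^T y
$$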
Implementation from Scratch (NumPy)
Let's convert this math into Python. We will use a small toy dataset to verify our results.
1. Setup and The Bias Trick
First, we define our data. Note that we must add a column of 1s to our feature matrix. This corresponds to the intercept (or bias) term in the model ($y = mx + c$). Without this, our line would be forced to pass through the origin.
import numpy as np
# Feature: Hours Studied
X_raw = np.array([[1], [2], [4], [5], [6]])
# Target: Exam Score
y = np.array([[45], [50], [75], [80], [95]])
# The Bias Trick: Add a column of 1s for the intercept
X = np.c_[np.ones(X_raw.shape[0]), X_raw]
print("Design Matrix X:\n", X)
2. Solving the Normal Equations
Now we construct the terms for the Normal Equation: $X^T X$ and $X^T y$.
X_T = X.T
XTX = X_T @ X
XTy = X_T @ y
3. Inverse vs. Solve (Crucial Pro Tip)
Mathematically, we could find w by multiplying by the inverse: $w = (X^T X)^{-1} X^T y$. However, in production code, computing the explicit inverse is computationally expensive and numerically unstable.
Instead, we use np.linalg.solve, which uses a more efficient algorithm (like LU decomposition) to solve the system directly.
# Method 1: Explicit Inverse (Mathematically correct, computationally risky)
w_inv = np.linalg.inv(XTX) @ XTy
# Method 2: The "Solve" Method (Preferred for stability)
w_solve = np.linalg.solve(XTX, XTy)
print(f"Weights using Inverse: {w_inv.flatten()}")
print(f"Weights using Solve: {w_solve.flatten()}")
# Both will output approximately: [32.7907, 10.0581]
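As a quick sanity check that ties the code back to the geometry, we can confirm that the residual is (numerically) orthogonal to the columns of X. This is a minimal sketch using the variables defined above:
# The residual should be orthogonal to the Column Space of X,
# so X^T @ e should be zero up to floating-point error.
residual = y - X @ w_solve
print("X^T @ residual:", (X.T @ residual).flatten())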
Verification: Comparison with Scikit-Learn
Did we reinvent the wheel correctly? Let's check our results against LinearRegression from the industry-standard library, scikit-learn.
from sklearn.linear_model import LinearRegression
# Initialize and fit
# Note: sklearn handles the intercept automatically, so we pass X_raw
model = LinearRegression()
model.fit(X_raw, y)
print("--- Comparison ---")
print(f"Our Intercept: {w_solve[0][0]:.4f}")
print(f"Sklearn Intercept: {model.intercept_[0]:.4f}")
print(f"Our Slope: {w_solve[1][0]:.4f}")
print(f"Sklearn Slope: {model.coef_[0][0]:.4f}")
The Verdict: The numbers match. The "magic" inside Scikit-Learn's LinearRegression is simply the Linear Algebra we just derived and implemented: both approaches solve the exact same least-squares problem.
The Probabilistic View
Finally, why do we minimize the squared error rather than, say, the absolute error? It turns out that if you assume the noise in your data follows a Gaussian (Normal) distribution, the Maximum Likelihood Estimate (MLE) leads to exactly the same objective function as Least Squares.
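A quick sketch of why, assuming each observation is corrupted by independent Gaussian noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$:
$$
\max_w \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x_i^T w)^2}{2\sigma^2}\right)
\;\Longleftrightarrow\;
\min_w \sum_{i} (y_i - x_i^T w)^2
$$
Taking the log turns the product into a sum, and the constants and $\sigma^2$ do not depend on $w$, so maximizing the likelihood is exactly minimizing the sum of squared errors.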
Geometry and Probability tell the exact same story.
Next Steps...
We trained our model on all the data we had. But how do we know if it generalizes to new data? This leads to the problem of Overfitting and the concept of Train-Test splits, which we will cover next.
Get the Code: You can run all the code from this post directly in your browser using the Google Colab Notebook here.
This post is part of the "Linear Algebra for Machine Learning" series. For the previous part, check out: The Math Behind "Best Fit".