Why Linear Regression Actually Works: Simulating the Central Limit Theorem

In Linear Regression, we almost always use Mean Squared Error (MSE) as our loss function. When we apply Gradient Descent, we assume this function forms a nice, convex shape. We find the bottom of this bowl, minimize the error, and assume our parameters are optimal.

But there is a hidden assumption here: We assume the noise (error) in our data follows a Normal Distribution.

In the real world, data is messy. It rarely follows a perfect Bell Curve. It might be uniform, skewed, or exponential. This leads to a critical question: Why does minimizing squared errors work even when the original data is NOT normal?

The answer lies in a statistical law so powerful it feels like a cheat code: The Central Limit Theorem (CLT).

What is the Central Limit Theorem?

The Central Limit Theorem states that if you draw sufficiently large samples (a common rule of thumb is n ≥ 30) from almost any random distribution and average them, the distribution of those averages will tend toward a Normal Distribution.

The keyword here is Average. It doesn't matter if the source data is flat, spiked, or skewed. If you average enough samples, the math forces the result into a Bell Curve.
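In symbols (stated loosely, without the formal conditions): if X1, X2, ..., Xn are independent draws from a distribution with mean μ and finite variance σ², then the sample mean X̄ = (X1 + ... + Xn) / n is approximately Normal with mean μ and variance σ²/n once n is large.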

The Simulation: Creating Order from Chaos

When I first heard this, I was skeptical. Why would the shape of the output distribution be independent of the shape of the input? Let's prove it with a Python simulation.

1. The "Messy" Source Data

We start by generating 10,000 random numbers using a Uniform Distribution. Think of this like rolling a die—every number has an equal probability. When plotted, it looks like a flat rectangle.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate 10,000 samples of Uniform Data
population_data = np.random.uniform(0, 100, 10000)

# The plot looks like a flat rectangle
sns.histplot(population_data, stat="density")
plt.show()

2. The "Triangle" (Averaging 2 Samples)

Now, let's take 2 random samples at a time, average them, and repeat this process 1,000 times. When we plot the result, the shape changes. It’s no longer a rectangle; it forms a Triangle.
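Here is a minimal sketch of that experiment, reusing the population_data array from the snippet above (the 1,000 repetitions match the description):

# Take 2 samples at a time, average them, and repeat 1,000 times
pair_means = [np.mean(np.random.choice(population_data, 2)) for _ in range(1000)]

# The histogram of these averages forms a triangle, not a rectangle
sns.histplot(pair_means, kde=True)
plt.show()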

3. The Bell Curve (Averaging 30 Samples)

This is where the magic happens. We extract 30 random samples, calculate their average, and repeat this process 1,000 times.

def simulate_clt(data, n_samples, n_experiments=1000):
    sample_means = []
    for _ in range(n_experiments):
        # Extract n samples and Average them
        sample = np.random.choice(data, n_samples)
        sample_means.append(np.mean(sample))
    return sample_means

# Run the simulation with n = 30
means = simulate_clt(population_data, 30)
sns.histplot(means, kde=True)
plt.show()

The result? A clean Gaussian Bell Curve. We started with a flat uniform block, and simply by averaging, we forced the distribution of the averages to become Normal.
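If you want a quantitative check rather than an eyeball test, the CLT also predicts the parameters of that bell curve. A rough sanity check, assuming the means and population_data arrays from the snippets above:

# CLT prediction: the sample means should be centred on the population mean,
# with standard deviation roughly population std / sqrt(n)
print("Observed mean of means:", np.mean(means))
print("Predicted mean:        ", np.mean(population_data))
print("Observed std of means: ", np.std(means))
print("Predicted std:         ", np.std(population_data) / np.sqrt(30))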

Why This Matters for Machine Learning

Does this feel like a cheat code? Why does this relate to Linear Regression?

Think about the Loss Function (Mean Squared Error). We are calculating the average of squared errors. The error in a real-world prediction is rarely caused by one thing. It is the sum of thousands of tiny, independent factors (sensor noise, atmospheric changes, rounding errors).

Because the total error is the sum (or average) of these independent factors, the Central Limit Theorem tells us that the distribution of errors will be approximately Gaussian.
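Here is a sketch of that argument (the number of factors and their scale are made-up, illustrative values). Each observation's noise is built as the sum of 500 tiny, independent, non-Gaussian shocks:

# Illustrative values only: 5,000 observations, each with noise made of
# 500 tiny independent uniform shocks
rng = np.random.default_rng(42)
tiny_shocks = rng.uniform(-0.1, 0.1, size=(5000, 500))
noise = tiny_shocks.sum(axis=1)  # total error per observation

# Every individual factor is uniform, yet the summed noise is bell-shaped
sns.histplot(noise, kde=True)
plt.show()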

This explains why MSE aligns so well with reality and why Linear Regression is so robust. We aren't assuming the world is simple; we are relying on the statistical fact that the sum of complexity becomes simple.

What's Next?

We now know why we use the Bell Curve. But how do we find the perfect Bell Curve that fits our specific data? How do we use this for training?

This requires a technique called Maximum Likelihood Estimation (MLE). It is the engine behind everything from Logistic Regression to Neural Networks, and that is what we will decode in the next post.

Get the Code

Want to see the distribution morph from a rectangle to a bell curve yourself? Check out the Google Colab Notebook.

This post is part of the "Probability for Machine Learning" series. For the previous part, check out: Probability isn't about Coin Flips.
