Why We Use Mean Squared Error: Decoding Maximum Likelihood Estimation

In our previous exploration of the Central Limit Theorem, we established that the error in Linear Regression follows a Normal Distribution. This is a huge win—it gives us the shape of the error.

But there is no single distribution that fits every dataset. There are infinite Bell Curves—some wide, some narrow, some shifted to the left or right. This raises a critical question: How do we find the specific distribution parameters that best fit our data?

In simple words, we need to find the parameters (like the Mean or the Slope) that make our observed data most likely, which, as we will see, is the same as minimizing the error. We do this using a technique called Maximum Likelihood Estimation (MLE).

Likelihood vs. Probability: What's the Difference?

In English, "Likelihood" and "Probability" are synonyms. In Mathematics, they answer opposite questions.

  • Probability is about estimating how likely an event (or data) is to occur, given that we already have the model parameters. (Forward looking).
  • Likelihood is about estimating the model parameters, given that we already have the data. (Backward looking).

In Machine Learning, we start with the data. We need to find the model. Therefore, we calculate Likelihood.
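To make the contrast concrete, here is a tiny sketch using SciPy's Normal density (the specific numbers are purely illustrative): the same function answers both questions, depending on whether we hold the parameters or the data fixed.

from scipy.stats import norm

# Probability question: the parameters are fixed, we ask about the data.
# "Given mean=150 and std=10, how plausible is an observation of 155?"
print(norm.pdf(155, loc=150, scale=10))

# Likelihood question: the data is fixed, we ask about the parameters.
# "Given that we observed 155, is a mean of 140 or a mean of 150 more plausible?"
observation = 155
print(norm.pdf(observation, loc=140, scale=10))  # lower score -> 140 is a worse guess
print(norm.pdf(observation, loc=150, scale=10))  # higher score -> 150 is a better guess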

The "Underflow" Bug & The Log-Likelihood Fix

The Likelihood Function calculates a score for how well a model fits. Mathematically, it is the product of the probability densities of every point in the dataset.

However, probability densities are often small numbers (e.g., 0.01). If you have a dataset with thousands of points, you are multiplying thousands of small numbers together.

0.01 * 0.01 * 0.01 ... quickly approaches Zero.
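You can see this with plain Python floats (a toy illustration, independent of any dataset):

# A 64-bit float cannot represent anything smaller than roughly 5e-324.
# Multiplying 400 densities of 0.01 would need 1e-800, so the product underflows.
print(0.01 ** 400)   # prints 0.0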

This causes Numerical Underflow—the computer runs out of precision and rounds the result to 0.0. To avoid this, we use Log-Likelihood.

By taking the Logarithm, we convert multiplication into addition. Instead of a crashing product, we get a stable sum of logs.
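In symbols, assuming our n data points are independent and writing $\theta$ for the parameters we are guessing:

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta) \qquad\Longrightarrow\qquad \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$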

The Simulation: Finding the Hidden Mean

Let's simulate this in Python. We generate data from a Normal Distribution with a True Mean of 150 and a Standard Deviation of 10.

import numpy as np
from scipy.stats import norm

# 1. Generate Data (Truth = 150)
data = np.random.normal(150, 10, 1000)

# 2. Naive Approach (Product) -> Results in 0.0
def calculate_likelihood(data, mean, standard_deviation):
    # Get the probability density for every point
    probabilities = norm.pdf(data, loc=mean, scale=standard_deviation)
    # Multiply all the probabilities together (underflows to 0.0 for large datasets)
    return np.prod(probabilities)

# 3. Log Approach (Sum of Logs) -> Results in a stable negative number
def calculate_log_likelihood(data, mean, standard_deviation):
    probabilities = norm.pdf(data, loc=mean, scale=standard_deviation)
    # Take the Log first, then Sum (turns multiplication into addition)
    return np.sum(np.log(probabilities))

In the real world, we don't know the True Mean. So, we search for it. We generate a list of guesses (means ranging from 140 to 160) and calculate the Log-Likelihood for each.
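A minimal sketch of that search, reusing the data array and calculate_log_likelihood from above (the grid of guesses and the known standard deviation of 10 are assumptions of this toy example):

# 4. Try many candidate means and score each one
candidate_means = np.linspace(140, 160, 201)
log_likelihoods = [calculate_log_likelihood(data, m, 10) for m in candidate_means]

# The Maximum Likelihood estimate is the candidate with the highest score
best_mean = candidate_means[np.argmax(log_likelihoods)]
print(best_mean)   # lands very close to the true mean of 150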

When we plot the results, we see a curve forming a hill. The peak of this hill—the Maximum Likelihood—occurs at approximately 150.10. The math successfully recovered the hidden parameter.

The Grand Unification: MLE = MSE

How does this relate to Linear Regression? We use Mean Squared Error (MSE) as our loss function. Is this just a random choice?

Let's verify this visually. Using the same data, let's calculate the MSE for every guess in our list and plot it on the same graph as the Likelihood.
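One way to sketch that comparison, reusing data and candidate_means from above:

# 5. Mean Squared Error for each candidate mean
mse_values = [np.mean((data - m) ** 2) for m in candidate_means]

# The candidate that minimizes MSE...
best_mean_by_mse = candidate_means[np.argmin(mse_values)]
# ...is the same candidate that maximized the Log-Likelihood
print(best_mean_by_mse)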

The Result: The Mean Squared Error is at its Minimum at the exact same point where the Likelihood is at its Maximum.
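This is not a coincidence. For a Normal Distribution, the log-density of each point is just a negative scaled squared error plus a constant, so maximizing the Log-Likelihood and minimizing the squared error are the same optimization problem (written here for our single-mean example with a known σ):

$$\log L(\mu) = \sum_{i=1}^{n}\left[-\frac{(x_i-\mu)^2}{2\sigma^2} - \log\!\left(\sigma\sqrt{2\pi}\right)\right] = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 + \text{const}$$

Maximizing the left-hand side over μ is therefore identical to minimizing the sum of squared errors, which is just the MSE multiplied by n. In Linear Regression the same argument applies, with μ replaced by the model's prediction for each point.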

This is why Mean Squared Error is the statistically principled loss for Linear Regression: assuming Gaussian noise, minimizing MSE gives exactly the Maximum Likelihood solution.

Conclusion & What's Next?

Maximum Likelihood Estimation gives us the answer, but it requires a lot of data. What if we have very little data? Or what if we have prior knowledge about the world before we even look at the data?

For that, we need a different approach: Bayesian Inference. That is what we will decode in the next post.

Get the Code

Want to run the simulation yourself? Check out the Google Colab Notebook.

This post is part of the "Probability for Machine Learning" series. For the previous part, check out: Why Linear Regression Actually Works (Central Limit Theorem).
