Bayesian Inference Explained: Math, Intuition & Python Code
Why MLE Fails: Decoding Bayesian Inference (Python Simulation)
Traditional statistical analysis (like Maximum Likelihood Estimation) relies solely on the observed data. It does not consider any prior knowledge we might have about the environment. This works well when we have a massive dataset.
But in reality, data is expensive. We often have very little data, yet we do have some prior knowledge about the problem we are trying to solve. To build robust models, it is best to utilize what we already know along with the data we observe.
This is achieved using a concept called Bayesian Inference.
Understanding Bayes' Theorem
Before we code it, we need to understand the mathematics. Bayes' Theorem is a way of calculating the probability of an event based on prior knowledge of conditions related to that event.
P(A|B) = [P(B|A) * P(A)] / P(B)
Let's break down the terminology:
- P(A|B) (Posterior Probability): The probability of event A occurring given that B has already occurred.
- P(B|A) (Likelihood): The probability of event B occurring given that A has already occurred.
- P(A) (Prior Knowledge): Our belief about the probability of event A before we see the data.
- P(B) (Evidence): The overall probability of observing B, averaged over all possibilities for A. It acts as a normalizing constant.
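To make these terms concrete, here is a tiny worked example in Python. The numbers are made up purely for illustration: event A is "this coin is biased" and event B is "we observe Heads on a single flip".

```python
# Hypothetical numbers, purely for illustration.
# A = "the coin is biased", B = "we observe Heads on one flip".
p_A = 0.1               # Prior: we believe 10% of coins are biased
p_B_given_A = 0.9       # Likelihood: a biased coin lands Heads 90% of the time
p_B_given_not_A = 0.5   # A fair coin lands Heads 50% of the time

# Evidence P(B): total probability of seeing Heads under both hypotheses
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Posterior P(A|B) via Bayes' theorem
p_A_given_B = (p_B_given_A * p_A) / p_B
print(f"P(biased | Heads) = {p_A_given_B:.3f}")   # ~0.167
```

A single Heads nudges our belief that the coin is biased from 0.10 up to about 0.17, but it does not overturn the prior.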
Why do we need a Prior?
The question that naturally arises is: if we already have prior knowledge about event A, why do we need to calculate its probability again?
The answer is simple: The Prior is just a theory.
For example, the probability of Heads when we flip a fair coin is 0.5. This is true in theory (in the limit of infinitely many flips). However, if we only flip it 5 times, the outcome might look biased (e.g., 5 Heads in a row). Bayes' theorem helps us balance the Theory (Prior) against the Reality (Data).
Bayesian Inference in Machine Learning
Bayesian Inference is just Bayes' theorem in action. We simply rewrite it in Machine Learning notation, with model parameters (θ) and Data (D):
P(θ|D) ∝ P(D|θ) * P(θ)
We can drop $P(D)$ because it does not depend on θ; it only rescales the posterior. So, the formula becomes: Posterior ∝ Likelihood * Prior Knowledge.
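Written out in full, this is the step being taken:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} \propto P(D \mid \theta)\, P(\theta)$$

The denominator is the same for every candidate θ, so dropping it does not change which θ the posterior favors.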
The Python Simulation
Let’s look at an example in Python. We assume a coin is fair (Probability of Heads = 0.5).
We model our Prior Belief with a Beta distribution (alpha = beta = 10). Its probability density is a bell-shaped curve centered at 0.5, encoding our assumption that the coin is fair.
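The full notebook is linked at the end of the post. As a rough sketch of how such a prior could be set up with scipy (variable names here are mine, not necessarily the notebook's):

```python
import numpy as np
from scipy.stats import beta

# Prior belief: the coin is fair. Beta(10, 10) is symmetric and peaks at 0.5.
alpha_prior, beta_prior = 10, 10

theta = np.linspace(0, 1, 201)                       # candidate values for P(Heads)
prior_pdf = beta.pdf(theta, alpha_prior, beta_prior)

print("Prior mode:", theta[np.argmax(prior_pdf)])    # 0.5
```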
Scenario 1: Small Data (The Correction)
Let’s simulate the coin flip 5 times. Suppose we get 3 Heads and 2 Tails.
- Maximum Likelihood Estimate (MLE): 3/5 = 0.6. (It overreacts to the small data).
- Bayesian Estimate (MAP): Using Bayes' theorem with the Beta(10, 10) prior, the result is roughly 0.52.
Notice that 0.52 is much closer to our original belief (0.5). Bayesian Inference prevented us from overfitting to a small dataset.
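Because the Beta prior is conjugate to the coin-flip (Binomial) likelihood, the posterior is again a Beta distribution and the MAP estimate has a closed form. A minimal sketch of the calculation, under the Beta(10, 10) prior assumed above:

```python
# Scenario 1: 5 flips -> 3 Heads, 2 Tails
alpha_prior, beta_prior = 10, 10
heads, tails = 3, 2

# Maximum Likelihood Estimate: the raw observed frequency of Heads
mle = heads / (heads + tails)

# Conjugate update: Beta(a, b) prior + data -> Beta(a + heads, b + tails) posterior
alpha_post = alpha_prior + heads        # 13
beta_post = beta_prior + tails          # 12

# MAP estimate = mode of the Beta posterior: (a - 1) / (a + b - 2)
map_estimate = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(f"MLE: {mle:.2f}")                # 0.60
print(f"MAP: {map_estimate:.2f}")       # 0.52
```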
Scenario 2: Big Data (Evidence Wins)
Now, let’s simulate the coin flip 100 times. Suppose we get 60 Heads and 40 Tails.
- MLE: 0.6.
- Bayesian Estimate: 0.58.
Notice what happened: the estimate moved away from our Prior (0.5) and toward the Maximum Likelihood Estimate (0.6). As the dataset grows, the influence of the prior belief shrinks. The evidence overwhelms the theory.
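The same sketch with the larger sample shows the prior losing its grip:

```python
# Scenario 2: 100 flips -> 60 Heads, 40 Tails (same Beta(10, 10) prior)
alpha_prior, beta_prior = 10, 10
heads, tails = 60, 40

mle = heads / (heads + tails)

alpha_post = alpha_prior + heads        # 70
beta_post = beta_prior + tails          # 50
map_estimate = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(f"MLE: {mle:.2f}")                # 0.60
print(f"MAP: {map_estimate:.2f}")       # 0.58
```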
Conclusion & Next Steps
This simple mechanism—balancing the Prior against the Data—is the mathematical intuition behind Regularization. It explains how we prevent overfitting when data is scarce.
But so far, we have only looked at a simple coin flip. What if we want to classify something complex, like an email (Spam vs. Not Spam) based on thousands of words?
Calculating the dependencies between all these words is computationally impossible. So, we need to make a "Naive" assumption to solve it. This leads us to the Naive Bayes Classifier, which we will decode in the next post.
Get the Code
Want to run the simulation yourself? Check out the Google Colab Notebook.
This post is part of the "Probability for Machine Learning" series. For the previous part, check out: Why We Use Mean Squared Error (MLE).