Posts

How Spam Filters Work: Coding Naive Bayes from Scratch

Building a Spam Detector from Scratch: The Naive Bayes Classifier

Three billion emails are sent every hour, and nearly half of them are spam. "Congratulations! You won free money. Click here now." If we look at this email, we immediately recognize it as spam. But how does the machine recognize it? It's an engineering problem. An email has thousands of words, and each word changes the probability of it being spam or not spam. Modeling how all these words relate to each other is computationally intractable. We need to make a naive assumption to solve it. This assumption leads us to one of the most effective classifiers: the Naive Bayes Classifier.

The "Naive" Assumption

The Naive Bayes classifier assumes that all features are conditionally independent given the class. What does this mean in simple terms? It assumes that the presence of one word does not affect the probability of another wo...
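The independence assumption described above can be sketched in a few lines: multiply per-word probabilities per class (in log space) and pick the larger score. This is a minimal illustration with a made-up toy corpus and Laplace smoothing, not the post's full implementation.

```python
import math
from collections import Counter

# Toy training corpus (illustrative examples, not from the post)
train = [
    ("win free money now", "spam"),
    ("free money click here", "spam"),
    ("meeting schedule for tomorrow", "ham"),
    ("project report attached", "ham"),
]

# Count word frequencies per class and how often each class appears
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label):
    # log P(class) + sum of log P(word | class), with Laplace (+1) smoothing.
    # Summing logs instead of multiplying probabilities avoids underflow.
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score

def classify(text):
    return max(("spam", "ham"), key=lambda c: log_posterior(text, c))

print(classify("free money"))  # → spam
```

The "naive" part is the sum inside `log_posterior`: each word contributes independently, so we never have to model word-to-word interactions.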

Bayesian Inference Explained: Math, Intuition & Python Code

Why MLE Fails: Decoding Bayesian Inference (Python Simulation)

Traditional statistical analysis (like Maximum Likelihood Estimation) relies solely on data. It does not consider any prior knowledge we might have about the environment. This works perfectly when we have a massive dataset. But in reality, data is expensive. We often have very little data, yet we do have some prior knowledge about the problem we are trying to solve. To build robust models, it is best to combine what we already know with the data we observe. This is achieved using a concept called Bayesian Inference.

Understanding Bayes' Theorem

Before we code it, we need to understand the mathematics. Bayes' Theorem is a way of calculating the probability of an event based on prior knowledge of conditions related to that event.

P(A|B) = [P(B|A) * P(A)] / P(B)

Let's break down the terminology:

P(A|B) (Posterior Probability): ...
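The formula above can be worked through numerically. Here is a small sketch using the classic diagnostic-test setup; the specific probabilities are hypothetical numbers chosen for illustration, not figures from the post.

```python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
# A = "has the disease", B = "test is positive" (hypothetical numbers)
p_disease = 0.01            # P(A): prior probability
p_pos_given_disease = 0.95  # P(B|A): likelihood (true-positive rate)
p_pos_given_healthy = 0.05  # false-positive rate

# P(B): total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# P(A|B): posterior probability of disease given a positive test
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # → 0.161
```

Note how the prior dominates: even with a 95%-accurate test, a positive result only raises the probability of disease to about 16%, because the disease itself is rare.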

Why we use Mean Squared Error: Maximum Likelihood Estimation (Python Simulation)

Why We Use Mean Squared Error: Decoding Maximum Likelihood Estimation

In our previous exploration of the Central Limit Theorem, we established that the error in Linear Regression follows a Normal Distribution. This is a huge win: it gives us the shape of the error. But no single distribution fits every dataset. There are infinite Bell Curves: some wide, some narrow, some shifted to the left or right. This raises a critical question: how do we find the specific distribution parameters that best fit our data? In simple words, we need to find the parameters (like the Mean or Slope) that minimize the error. We do this using a technique called Maximum Likelihood Estimation (MLE).

Likelihood vs. Probability: What's the Difference?

In English, "Likelihood" and "Probability" are synonyms. In mathematics, they answer opposite questions. Probability is the art of estimating an event (or data) o...
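The connection claimed in the title can be checked directly: for a Normal error model, the negative log-likelihood is a scaled sum of squared errors, so maximizing likelihood and minimizing MSE pick the same parameter. A minimal sketch on synthetic data (the dataset and candidate grid are illustrative choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample from a Normal distribution with true mean 5.0
data = rng.normal(loc=5.0, scale=2.0, size=200)

# Candidate values for the mean parameter
candidates = np.linspace(3, 7, 401)

# Negative log-likelihood under a Normal model (sigma fixed at 2.0);
# constant terms dropped since they don't affect the argmin
nll = [np.sum((data - m) ** 2) / (2 * 2.0**2) for m in candidates]
# Mean squared error for the same candidates
mse = [np.mean((data - m) ** 2) for m in candidates]

best_mle = candidates[np.argmin(nll)]
best_mse = candidates[np.argmin(mse)]
print(best_mle == best_mse)              # → True: same winner
print(abs(best_mle - data.mean()) < 0.01)  # → True: both land on the sample mean
```

Both curves are monotone transforms of the same sum of squares, which is exactly why the argmin coincides.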

Why Linear Regression Actually Works: Simulating the Central Limit Theorem

Why Linear Regression Actually Works: Simulating the Central Limit Theorem

In Linear Regression, we almost always use Mean Squared Error (MSE) as our loss function. When we apply Gradient Descent, we assume this function forms a nice, convex shape. We find the bottom of this bowl, minimize the error, and assume our parameters are optimal. But there is a hidden assumption here: we assume the noise (error) in our data follows a Normal Distribution. In the real world, data is messy. It rarely follows a perfect Bell Curve. It might be uniform, skewed, or exponential. This leads to a critical question: why does minimizing squared errors work even when the original data is NOT normal? The answer lies in a statistical law so powerful it feels like a cheat code: the Central Limit Theorem (CLT).

What is the Central Limit Theorem?

The Central Limit Theorem states that if you extract sufficiently large samples (ideally n ≥ 30) fr...
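The simulation the title promises can be sketched in a few lines: draw many samples of size n = 30 from a distinctly non-normal (uniform) population and watch the sample means cluster around the population mean with the spread the CLT predicts. The sample sizes and seed here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 samples of size n=30 from a (non-normal) Uniform(0, 1) population
sample_means = rng.uniform(0, 1, size=(10_000, 30)).mean(axis=1)

# CLT prediction: the means are approximately Normal around the
# population mean 0.5, with std sigma/sqrt(n) = (1/sqrt(12))/sqrt(30) ≈ 0.053
print(round(sample_means.mean(), 2))  # → 0.5
print(round(sample_means.std(), 2))   # → 0.05
```

A histogram of `sample_means` would show the Bell Curve emerging even though every individual draw came from a flat distribution.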

Probability isn't about Coin Flips: The Math of Uncertainty (For AI)

Probability isn't about Coin Flips: The Math of Uncertainty (For AI)

When we talk about Probability, most people think of gambling: flipping coins, rolling dice, or picking cards. This is the Frequentist view: the idea that if you repeat an event infinitely many times, the frequency converges to a number (e.g., 50% Heads). But in Artificial Intelligence, we don't care about flipping coins. When a self-driving car sees a shadow on the road, it can't replay that scenario a million times to see if it crashes. It has to make a decision NOW, based on incomplete information. In Machine Learning, Probability isn't about frequency. It is a Calculus of Belief: a way to quantify how confident the machine is in its own view of the world. Welcome to Season 2 of Decoding Complexities. We have mastered the Geometry of Data (Linear Algebra); now we master the Logic of Uncertainty.

The Core Shift: Fixed Data vs. Uncertain ...

The "Fundamental Theorem" of Linear Algebra: SVD Explained

The Fundamental Theorem of Linear Algebra: SVD (Singular Value Decomposition)

We have spent the last few weeks mastering Eigenvectors and PCA. We learned how to find the hidden axes of data. But there was always a catch: Eigenvectors only work on Square Matrices. Now look at the real world. A dataset of Users vs. Movies is a rectangular matrix. An image is a rectangular matrix of pixels. A spreadsheet of stock prices is rectangular. If you try to calculate the Eigenvectors of a rectangular matrix, the math breaks down. So how do we find the hidden structure of any matrix, of any shape? The answer is the Singular Value Decomposition (SVD). It is arguably the most important theorem in all of Linear Algebra.

The Intuition: Breaking Down the Rectangle

SVD states that any matrix A can be broken down into three clean components:

A = U Σ Vᵀ

Let's use a "Netflix" analogy where Matrix A represents Use...
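The decomposition A = U Σ Vᵀ can be verified directly on a rectangular matrix. This sketch uses a small made-up "users × movies" ratings matrix (the numbers are illustrative) and NumPy's built-in SVD rather than the post's own derivation.

```python
import numpy as np

# A rectangular "users x movies" ratings matrix (illustrative numbers)
A = np.array([[5, 4, 0],
              [4, 5, 1],
              [0, 1, 5],
              [1, 0, 4]], dtype=float)

# SVD works for ANY shape — no square matrix required: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three components back together recovers A exactly
reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(A, reconstructed))  # → True

# Keeping only the largest singular value gives the best rank-1 approximation,
# i.e. the single strongest "taste pattern" in the ratings
rank1 = S[0] * np.outer(U[:, 0], Vt[0])
print(rank1.shape == A.shape)  # → True
```

Truncating S like this is the same move PCA makes, which is why SVD is the engine behind most dimensionality-reduction code.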

How to Crush Big Data: The Math of PCA (Principal Component Analysis)

How to Crush Big Data: The Math of PCA (Principal Component Analysis)

In the real world, data is massive. A single image has millions of pixels. A financial model has thousands of market indicators. We call this the Curse of Dimensionality. When you have 10,000 features, your data becomes sparse, distance calculations break down, and models become incredibly slow. Visualization? Impossible. So how do we fix this? How do we take a massive, high-dimensional monster and squash it down to just 2 or 3 dimensions without losing the important information? The answer is Principal Component Analysis (PCA). In this post, we will strip away the complexity and build PCA from scratch in Python using a 5-step linear algebra recipe.

The Intuition: Information = Variance

Before the math, we need to define our goal. What does it mean to "keep information"? In Data Science, Information is Variance (Spread). If data...
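A 5-step linear-algebra recipe like the one the post describes (center, covariance, eigendecomposition, sort, project) can be sketched as follows. The synthetic dataset here is an illustrative stand-in: 3-D points with most of their variance along a single direction.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 3-D data: variance concentrated along one direction, plus small noise
X = (rng.normal(size=(200, 1)) @ np.array([[3.0, 2.0, 1.0]])
     + rng.normal(scale=0.3, size=(200, 3)))

# Step 1: center the data (subtract the mean of each feature)
X_centered = X - X.mean(axis=0)
# Step 2: covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)
# Step 3: eigenvalues/eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
# Step 4: sort components by descending eigenvalue (variance explained)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Step 5: project onto the top 2 principal components
X_reduced = X_centered @ eigvecs[:, :2]

print(X_reduced.shape)                    # → (200, 2)
print(eigvals[0] / eigvals.sum() > 0.9)   # → True: PC1 keeps most of the "information"
```

Because information is variance, the ratio `eigvals[0] / eigvals.sum()` is exactly the "how much did we keep?" number people report when they reduce dimensions.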