How Spam Filters Work: Coding Naive Bayes from Scratch

Billions of emails are sent every hour, and nearly half of them are spam.

"Congratulations! You won free money. Click here now."

A human glancing at this email immediately recognizes it as spam. But how does a machine?

It's an engineering problem. An email contains thousands of words, and each word shifts the probability that the email is spam. Modeling how all of these words interact with one another is computationally intractable.

We need to make a naive assumption to solve it. This assumption leads us to one of the most effective classifiers: The Naive Bayes Classifier.

The "Naive" Assumption

The Naive Bayes classifier assumes that all features are conditionally independent given the class.

What does this mean in simple terms? It assumes that the presence of one word does not affect the probability of another word appearing, as long as we know the class.

For example, in a spam email, the algorithm assumes that the word "free" appearing is independent of the word "money" appearing, given that we know it's spam.

But Wait... That's Not True!

You're right! Words like "free" and "money" often appear together in spam emails. That's why it's called naive.

But here's the remarkable part: Despite this assumption being wrong, the classifier works incredibly well in practice. The naive assumption simplifies computation dramatically—instead of calculating dependencies between all possible word combinations, we just multiply individual probabilities.

How Bayes Theorem Applies to Classification

For classification, we want to find the probability of a class given the features. In our case: What is the probability that an email is spam, given the words it contains?

P(Spam|Words) = [P(Words|Spam) × P(Spam)] / P(Words)

Let's break down the terminology:

  • P(Spam|Words) (Posterior): The probability that the email is spam given the words.
  • P(Words|Spam) (Likelihood): The probability of seeing these words in a spam email.
  • P(Spam) (Prior): How common spam emails are overall.
  • P(Words) (Evidence): How common these words are in general.

The Naive Simplification

With the naive assumption, calculating P(Words|Spam) becomes much simpler. Instead of calculating the probability of all words together, we multiply the probabilities of each individual word:

P(Words|Spam) = P(word₁|Spam) × P(word₂|Spam) × P(word₃|Spam) × ...

To classify a new email, we calculate this probability for both spam and not spam. Whichever probability is higher—that's our prediction!
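The decision rule above can be sketched in a few lines of Python. The probability tables here are illustrative placeholders, not values learned from real data:

```python
# Illustrative probability tables -- placeholders, not learned values
p_word_spam = {"free": 0.08, "money": 0.06, "click": 0.07}
p_word_not_spam = {"free": 0.01, "money": 0.005, "click": 0.02}
prior_spam, prior_not_spam = 0.5, 0.5

def score(words, prior, likelihoods):
    # The naive product: prior times each word's individual likelihood
    prob = prior
    for w in words:
        prob *= likelihoods.get(w, 1.0)  # skip words we have no estimate for
    return prob

email = ["free", "money", "click"]
spam_score = score(email, prior_spam, p_word_spam)
not_spam_score = score(email, prior_not_spam, p_word_not_spam)
prediction = "SPAM" if spam_score > not_spam_score else "NOT SPAM"
print(prediction)  # SPAM
```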

A Simple Example: The Zero-Probability Problem

Let's walk through a simple example to see exactly how this works—and discover a critical problem.

We start with just four emails. Two are spam, two are not spam:

  • Email 1: "free money win" → SPAM
  • Email 2: "meeting tomorrow project" → NOT SPAM
  • Email 3: "free click win prize" → SPAM
  • Email 4: "report meeting deadline" → NOT SPAM

Step 1: Calculate Prior Probabilities

P(Spam) = 2/4 = 0.5 and P(Not Spam) = 2/4 = 0.5

Step 2: Count the Words

We count how frequently each word appears in spam vs. not spam emails. For example, "free" appears 2 times in spam emails, "meeting" appears 2 times in not spam emails.
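This counting step is a one-liner with Python's Counter, applied to the four toy emails above:

```python
from collections import Counter

# Word counts per class for the four toy training emails
spam_emails = ["free money win", "free click win prize"]
not_spam_emails = ["meeting tomorrow project", "report meeting deadline"]

spam_counts = Counter(word for email in spam_emails for word in email.split())
not_spam_counts = Counter(word for email in not_spam_emails for word in email.split())

print(spam_counts["free"])         # 2
print(not_spam_counts["meeting"])  # 2
```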

The Problem: A New Word

Now let's try to classify: "urgent free money"

We notice the word "urgent" never appeared in our training data. So:

  • P(urgent|spam) = 0
  • P(urgent|not spam) = 0

When we calculate: P(spam|email) = P(spam) × P(urgent|spam) × P(free|spam) × P(money|spam)

Since P(urgent|spam) = 0, the entire product becomes zero. The same happens for not spam.

Result: Zero probability for both classes. We cannot classify!
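A quick sketch makes the failure concrete. The word counts come from the two toy spam emails, with no smoothing applied yet:

```python
# Word counts from the two toy spam emails -- no smoothing yet
spam_word_counts = {"free": 2, "money": 1, "win": 2, "click": 1, "prize": 1}
total_spam_words = sum(spam_word_counts.values())  # 7

def p_word_given_spam(word):
    return spam_word_counts.get(word, 0) / total_spam_words

# "urgent" was never seen, so its probability is 0 and the whole product collapses
p_spam_score = (0.5 * p_word_given_spam("urgent")
                    * p_word_given_spam("free")
                    * p_word_given_spam("money"))
print(p_spam_score)  # 0.0
```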

The Solution: Laplace Smoothing

We add 1 to all word counts. This is called Laplace smoothing or add-one smoothing:

P(word|class) = (count of word in class + 1) / (total words in class + vocabulary size)

Now no word has zero probability. Even "urgent" gets a small non-zero probability, and we can classify successfully!

This ensures that a single unseen word doesn't break the entire classification.
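A minimal sketch of the smoothed estimate, using the word counts from the two toy spam emails (the toy dataset has 10 unique words across both classes):

```python
# Add-one smoothing over the toy dataset (10 unique words across both classes)
spam_word_counts = {"free": 2, "money": 1, "win": 2, "click": 1, "prize": 1}
total_spam_words = sum(spam_word_counts.values())  # 7
vocab_size = 10

def p_smoothed(word):
    return (spam_word_counts.get(word, 0) + 1) / (total_spam_words + vocab_size)

print(round(p_smoothed("free"), 3))    # (2 + 1) / (7 + 10) ≈ 0.176
print(round(p_smoothed("urgent"), 3))  # (0 + 1) / (7 + 10) ≈ 0.059, small but non-zero
```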

Building a Full Spam Detector

Now let's build a complete spam detector with a larger dataset of 30 emails (15 spam, 15 not spam).

Step 1: Preprocess the Text

First, we clean the data. For each email:

  • Convert everything to lowercase
  • Remove punctuation and special characters
  • Split into individual words (tokens)

import re

def preprocess_text(text):
    text = text.lower()                   # normalize case
    text = re.sub(r'[^a-z\s]', '', text)  # strip punctuation, digits, symbols
    return text.split()                   # tokenize on whitespace

Step 2: Calculate Prior Probabilities

From our training data:

  • P(Spam): 15/30 = 0.5
  • P(Not Spam): 15/30 = 0.5

Step 3: Calculate Word Probabilities with Laplace Smoothing

For each word in our vocabulary, we calculate how often it appears in spam vs. not spam emails, applying Laplace smoothing:

For example:

  • The word "free" might appear frequently in spam (P(free|spam) = 0.08) but rarely in not spam (P(free|not spam) = 0.01)
  • The word "meeting" has high probability in not spam, low probability in spam

Step 4: Use Log Probabilities

Multiplying many probabilities produces an extremely small number that eventually underflows to zero. Taking logarithms converts the long product into a sum of log probabilities, which is numerically stable and preserves the relative ordering. The "Why Log Probabilities?" section below covers this in more detail.

Step 5: Classification

When a new email arrives, we:

  1. Extract its words: "Congratulations! You won free money. Click here now."
  2. Calculate log probabilities for spam and not spam
  3. Compare: If P(spam|email) > P(not spam|email), classify as spam

import numpy as np

# Start each running score from the log prior (prior_spam, prior_not_spam,
# and the probability tables come from the earlier training steps)
log_prob_spam = np.log(prior_spam)
log_prob_not_spam = np.log(prior_not_spam)

# Add each word's log likelihood; skip words outside the training vocabulary
for word in email_words:
    if word in P_word_given_spam:
        log_prob_spam += np.log(P_word_given_spam[word])
        log_prob_not_spam += np.log(P_word_given_not_spam[word])

# Classify: 1 = spam, 0 = not spam
prediction = 1 if log_prob_spam > log_prob_not_spam else 0

A Classification Example

Let's classify: "free money click"

We start with the prior probabilities (in log space):

  • log P(spam) = log(0.5) = -0.69
  • log P(not spam) = log(0.5) = -0.69

For each word, we add the log probability:

  • "free": High in spam (0.08), low in not spam (0.01) → Strong spam indicator
  • "money": High in spam (0.06), low in not spam (0.005) → Strong spam indicator
  • "click": High in spam (0.07), low in not spam (0.02) → Strong spam indicator

After adding all log probabilities:

  • log P(spam|email) = log(0.5) + log(0.08) + log(0.06) + log(0.07) ≈ -8.69
  • log P(not spam|email) = log(0.5) + log(0.01) + log(0.005) + log(0.02) ≈ -14.51

Since -8.69 > -14.51, we predict SPAM.
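We can check the arithmetic directly, plugging in the priors and the per-word probabilities listed above:

```python
import math

# Priors (0.5 each) and the per-word probabilities from the example
log_spam = math.log(0.5) + math.log(0.08) + math.log(0.06) + math.log(0.07)
log_not_spam = math.log(0.5) + math.log(0.01) + math.log(0.005) + math.log(0.02)

print(round(log_spam, 2))      # -8.69
print(round(log_not_spam, 2))  # -14.51
print("SPAM" if log_spam > log_not_spam else "NOT SPAM")  # SPAM
```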

Why Log Probabilities?

When we multiply many small probabilities (like 0.001 × 0.002 × 0.003...), the result becomes too small for computers to handle accurately—this is called numerical underflow.

By taking the logarithm, we convert multiplication into addition:

log(a × b × c) = log(a) + log(b) + log(c)

This makes computation numerically stable while preserving the relative ordering of probabilities. More words with low probabilities won't cause the product to vanish—we just keep adding log values.
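A short experiment shows both the underflow and the log-space fix:

```python
import math

# Multiplying 1000 probabilities of 0.01 underflows to exactly 0.0...
product = 1.0
for _ in range(1000):
    product *= 0.01
print(product)  # 0.0

# ...while the equivalent sum of logs is a perfectly ordinary number
log_sum = 1000 * math.log(0.01)
print(round(log_sum, 2))  # -4605.17
```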

Visualizing Important Words

In our Python implementation, we visualize which words are most indicative of each class by calculating log probability ratios:

  • Top Spam Words: free, click, win, urgent, money, cash, offer
  • Top Not Spam Words: meeting, tomorrow, please, report, project, team, thanks

This gives us insight into what the classifier has learned and helps debug misclassifications.
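A minimal sketch of the ratio computation; the smoothed probability tables here are illustrative placeholders for the learned values:

```python
import math

# Illustrative smoothed probability tables -- placeholders, not learned values
p_word_spam = {"free": 0.08, "click": 0.07, "meeting": 0.004, "thanks": 0.003}
p_word_not_spam = {"free": 0.01, "click": 0.02, "meeting": 0.06, "thanks": 0.05}

# Log probability ratio: positive means "spam indicator", negative "not spam"
ratios = {w: math.log(p_word_spam[w] / p_word_not_spam[w]) for w in p_word_spam}
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)  # ['free', 'click', 'meeting', 'thanks']
```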

Why Does This Work Despite Being "Naive"?

Even though words are not truly independent, the classifier still works well because it's comparing two options. It doesn't need to get the exact numbers right—it just needs to rank spam higher than not spam for spam emails, and not spam higher than spam for legitimate emails.

  1. Relative Comparison: It doesn't need exact probabilities—just needs correct relative ranking
  2. Error Cancellation: The errors from the independence assumption often cancel out when comparing classes
  3. Simplification Benefits: We can train on less data and make predictions faster

The naive assumption is wrong, but it's useful. And in machine learning, useful often beats correct.

Strengths of Naive Bayes

  • Simple and Fast: Easy to implement and train, even with large datasets
  • Works with Small Data: Unlike deep learning models that need millions of examples, Naive Bayes works with just hundreds
  • Handles High Dimensions: With thousands of words, most classifiers struggle. Naive Bayes thrives
  • Provides Probabilities: Gives confidence scores, not just binary predictions
  • Robust to Irrelevant Features: If some words don't help distinguish classes, they don't hurt the classifier

Limitations

  • Independence Assumption: Cannot learn how words relate—won't learn that "free money" together is spammier than just "free" or just "money"
  • Needs Smoothing: Without Laplace smoothing, a single unseen word breaks the classifier
  • Probability Calibration: The relative ranking is good, but actual probability values might be off

Real-World Applications

Naive Bayes is used beyond spam detection:

  • Sentiment Analysis: Is this movie review positive or negative?
  • Document Classification: Which category does this news article belong to?
  • Medical Diagnosis: Given these symptoms, what disease is most likely?
  • Real-time Prediction Systems: Because it's so fast, it's used in production where speed matters

Validation: Comparing with Scikit-learn

In our Python notebook, we validate our from-scratch implementation by comparing it with scikit-learn's MultinomialNB classifier. Both produce the same results, confirming our implementation is correct!

from sklearn.naive_bayes import MultinomialNB

# Train scikit-learn's model on the same feature matrix
# (X_train and y_train come from the notebook's preprocessing step)
sklearn_nb = MultinomialNB(alpha=1.0)  # alpha=1.0 applies Laplace smoothing
sklearn_nb.fit(X_train, y_train)

# Our accuracy: 100%
# Scikit-learn accuracy: 100% ✓

The Beauty of Simplicity

The beauty of Naive Bayes is in its simplicity. It takes a complex problem (text classification with thousands of features), makes a bold simplifying assumption (independence), and still delivers excellent results.

It's a perfect example of how sometimes, the simplest approach is the best approach.

But We're Just Getting Started...

Naive Bayes treats every word equally. But think about it—the word "urgent" carries far more information than the word "the".

How do we measure the importance of words mathematically? How do we quantify information itself?

The answer: Entropy, the math of surprise. And it changes everything.

That's what we'll explore next in our journey through Information Theory.

What We Built in the Python Notebook

In the companion notebook, we implement the entire pipeline from scratch:

  • Simple 4-email toy example demonstrating the zero-probability problem
  • Laplace smoothing solution
  • Full 30-email spam detector implementation
  • Text preprocessing and tokenization
  • Prior probability calculation
  • Word likelihood calculation with smoothing
  • Log probability implementation to prevent underflow
  • Classification function
  • Word importance visualization
  • Comparison with scikit-learn validation

We explore the mathematical foundation in detail, showing how Bayes theorem transforms into the classification rule, why log probabilities prevent numerical underflow, and why Laplace smoothing is necessary.

Get the Code

Want to build your own spam detector from scratch? Check out the Python Notebook with complete implementation and visualizations.

This post is part of the "Probability for Machine Learning" series. For the previous part, check out: Why MLE Fails: Decoding Bayesian Inference.
