Overfitting in Machine Learning: Why 100% Accuracy is a Trap
In our last post, we built a Linear Regression model that fit our data perfectly using the Normal Equations. But in the real world, a "perfect" fit is often a disaster waiting to happen.
Imagine a student who scores 100% on every single practice exam. They seem like a genius. But then, you give them a final exam with questions they haven't seen before, and they fail miserably. Why? Because they didn't learn the concepts. They just memorized the answer key.
In machine learning, this phenomenon is called Overfitting. It is the most dangerous trap in data science. In this guide, we will learn how to detect it and how to prevent it using the "Golden Rule" of ML: The Train-Test Split.
The Intuition: Signal vs. Noise
To understand overfitting, we need to understand the nature of data. Real-world data is composed of two things:
- Signal: The true underlying pattern (e.g., "Studying more generally leads to higher scores").
- Noise: Random fluctuations (e.g., lucky guesses, bad sleep, measurement errors).
Our goal is to build a model that learns the Signal and ignores the Noise.
If we give our model too much complexity, it will start to connect every single dot, driving the error on the training data to zero. But in doing so, it learns the random noise specific to that dataset. When it sees new data, with different random noise, its predictions will be wildly wrong.
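To see this in numbers, here is a minimal sketch on a tiny synthetic dataset (not the exam data we use below): a straight line settles for the trend, while a degree-15 polynomial bends through nearly every point and drives its training error far lower, precisely by chasing the noise.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Synthetic data: a linear signal plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(0, 0.3, size=20)
# A straight line learns the trend; a degree-15 polynomial connects the dots
line = LinearRegression().fit(x, y)
x_poly = PolynomialFeatures(degree=15, include_bias=False).fit_transform(x)
wiggle = LinearRegression().fit(x_poly, y)
print("Straight line train MSE:", np.mean((line.predict(x) - y) ** 2))
print("Degree-15 poly train MSE:", np.mean((wiggle.predict(x_poly) - y) ** 2))
The lower training error of the polynomial tells us nothing good on its own. To know which model actually learned the signal, we need to test on data the model has never seen, which is exactly what the next section sets up.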
The Solution: The Train-Test Split
How do we stop our model from "cheating"? We use the Train-Test Split: we divide our dataset into two parts and lock one of them away from the model:
- The Training Set (80%): This is the study guide. We let the model see these points and learn from them.
- The Test Set (20%): This is the final exam. We hide this data from the model completely.
Once the model is trained, we unlock the Test Set and ask the model to predict those values. If the model performs well on the Training set but fails on the Test set, we have clear evidence that it has overfit.
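Conceptually the split is simple. Here is a hand-rolled sketch of the idea in NumPy (the Scikit-Learn helper we use below does the same shuffling and slicing for us):
import numpy as np
# Shuffle the row indices of 100 samples, then slice:
# the first 80 become the training set, the last 20 the test set
rng = np.random.default_rng(42)
indices = rng.permutation(100)
train_idx, test_idx = indices[:80], indices[80:]
print(len(train_idx), len(test_idx))  # 80 20
The shuffle matters: if the rows happened to be sorted by hours studied, slicing without shuffling would mean testing the model only on the heaviest studiers.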
Implementation in Python
Let's simulate a classroom of 100 students and implement this workflow using Scikit-Learn.
1. Generate Data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Data of 100 Students (Hours Studied vs Exam Scores)
# Hours Studied
hours = np.array([
0.5, 0.8, 1.1, 1.4, 1.6, 1.9, 2.1, 2.3, 2.5, 2.8,
3.0, 3.2, 3.5, 3.7, 3.9, 4.1, 4.3, 4.5, 4.8, 5.0,
5.2, 5.4, 5.6, 5.8, 6.0, 6.2, 6.5, 6.7, 6.9, 7.1,
7.3, 7.5, 7.8, 8.0, 8.2, 8.4, 8.6, 8.9, 9.1, 9.3,
9.5, 9.7, 0.6, 1.3, 2.2, 3.4, 4.6, 5.7, 6.8, 7.9,
8.8, 9.9, 0.9, 1.8, 2.7, 3.6, 4.9, 5.9, 6.6, 7.4,
8.3, 9.2, 0.7, 1.5, 2.4, 3.3, 4.2, 5.3, 6.4, 7.2,
8.5, 9.6, 0.4, 1.7, 2.6, 3.8, 4.7, 5.5, 6.3, 7.7,
8.7, 9.8, 0.3, 1.2, 2.9, 3.1, 4.4, 5.1, 6.1, 7.0,
8.1, 9.4, 2.0, 4.0, 6.0, 8.0, 3.5, 5.5, 7.5, 9.5
]).reshape(-1, 1)
# Exam Scores
scores = np.array([
38, 42, 45, 48, 50, 53, 55, 58, 60, 63,
65, 67, 70, 72, 74, 76, 78, 80, 83, 85,
87, 89, 91, 93, 95, 97, 98, 99, 99, 100,
99, 98, 97, 95, 93, 91, 89, 87, 85, 83,
81, 79, 39, 47, 56, 69, 79, 88, 96, 94,
86, 78, 43, 52, 61, 71, 82, 90, 95, 92,
84, 82, 41, 49, 58, 68, 77, 85, 93, 91,
83, 77, 37, 51, 60, 73, 81, 87, 92, 90,
85, 76, 36, 46, 64, 66, 79, 84, 91, 93,
82, 75, 54, 75, 91, 94, 70, 87, 92, 74
])
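Before we split anything, it is worth eyeballing the data. A quick scatter plot (matplotlib, not strictly required for the workflow) shows scores rising with study hours and then dipping at the very high end, a shape we will come back to at the close of this post.
import matplotlib.pyplot as plt
# Visualize the raw data before modeling
plt.scatter(hours, scores, alpha=0.7)
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")
plt.title("Hours Studied vs Exam Score")
plt.show()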
2. The Golden Rule: Split the Data
We use Scikit-Learn's `train_test_split` to randomly shuffle and divide the data.
# Split: 80% Train, 20% Test
X_train, X_test, y_train, y_test = train_test_split(
hours, scores, test_size=0.2, random_state=42
)
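The random_state=42 simply makes the shuffle reproducible, so everyone running this code gets the same split. A quick sanity check confirms the 80/20 division:
# 80 rows to learn from, 20 rows locked away for the exam
print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
print(y_train.shape, y_test.shape)  # (80,) (20,)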
3. Train the Model
Notice that we fit the model only on the training data.
model = LinearRegression()
model.fit(X_train, y_train)
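The fitted model now stores the line it learned, and peeking at it is a good habit (the exact numbers depend on the split):
# The learned line: predicted score ≈ slope * hours + intercept
print(f"Slope: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")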
4. Evaluating the Model
We use RMSE (Root Mean Squared Error) to evaluate performance. Roughly speaking, it tells us how far off our predictions are on average, measured in exam points.
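For the record, RMSE = sqrt(mean((y_true - y_pred)^2)); the square root is what brings the error back into the original units, i.e. exam points.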
# Generate predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
# Calculate RMSE
train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f"Training RMSE: {train_rmse:.2f}")
print(f"Test Set RMSE: {test_rmse:.2f}")
# Output might look like:
# Training RMSE: 10.59
# Test Set RMSE: 9.50
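What would a failing grade look like? As a contrast, here is a sketch using an unpruned decision tree, a model we have not covered in this series, chosen purely because it can effectively memorize the training set. Watch the gap between the two numbers:
from sklearn.tree import DecisionTreeRegressor
# An unpruned tree keeps splitting until it has memorized the training data
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)
tree_train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
tree_test_rmse = np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))
print(f"Tree Training RMSE: {tree_train_rmse:.2f}")  # near zero
print(f"Tree Test RMSE: {tree_test_rmse:.2f}")       # noticeably larger
A training error near zero paired with a much larger test error is exactly the overfitting signature we described above.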
Conclusion
In our example, the Training RMSE and the Test RMSE are very close. That is the sign of a healthy model: it has learned the general pattern (the Signal) rather than memorizing the training data (the Noise).
This concludes our arc on simple Linear Regression. Still, an average error of around 10 points hints that a straight line may be too simple for this data. So far, we've assumed the world is made of straight lines. What happens when the data is curved? In the next post, we will break the linearity assumption and enter the world of Polynomial Regression.
Resources
Want to run this code yourself? Click here to open the Google Colab Notebook.
This post is part of the "Linear Algebra for Machine Learning" series. For the previous part, check out: The Math of Linear Regression (From Geometry to Code).