Polynomial Regression Explained: Math, Python Code & Overfitting

The "Linear" Trick for Non-Linear Data: Understanding Polynomial Regression

Linear Regression is the workhorse of machine learning. It's simple, interpretable, and fast. But it has one fatal flaw: it assumes the world is a straight line.

Real-world data is messy. It curves, it fluctuates, and it rarely follows a simple y = mx + c relationship. When you try to fit a straight line to curved data, you get Underfitting—a model that is too simple to capture the underlying pattern.

So, do we need a complex non-linear algorithm to solve this? Surprisingly, no. We can use the exact same Linear Regression algorithm we already know. We just need to use a clever "engineering trick" on our data first.

This is the story of Polynomial Regression and the art of Feature Engineering.

The Core Insight: Change the Data, Not the Model

If a straight line y = w₀ + w₁x doesn't fit, our intuition is to change the equation to include curves, like y = w₀ + w₁x + w₂x².

This looks like a non-linear equation, and in terms of x it is. But look closely at the weights (the parameters w): the target y is still a linear combination of the weights, and that is all Linear Regression cares about.

To the model, x² isn't a mathematical operation; it's just another feature. If we treat x as "Feature 1" and x² as "Feature 2", the equation becomes:

y = w₀ + w₁(Feature 1) + w₂(Feature 2)

This is still Linear Regression! We haven't changed the math of the model; we have simply engaged in Feature Engineering. We expanded our input data to include powers of x, allowing a linear model to fit a curve.
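
To make the expansion concrete, here is a tiny sketch (the three input values are made up purely for illustration): we take a single column x and stack x and x² side by side. The model that consumes this matrix never knows the second column was derived from the first.

import numpy as np

# A toy input column (illustrative values only)
x = np.array([[1.0], [2.0], [3.0]])

# Feature Engineering: "Feature 1" is x itself, "Feature 2" is x squared
X_expanded = np.hstack([x, x**2])

print(X_expanded)
# [[1. 1.]
#  [2. 4.]
#  [3. 9.]]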

Implementation: The Matrix Expansion

Let's see how this works in code. We start by generating some synthetic data that follows a curve.

1. Generating the Data

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Hours Studied
hours = np.array([
    0.5, 0.8, 1.1, 1.4, 1.6, 1.9, 2.1, 2.3, 2.5, 2.8,
    3.0, 3.2, 3.5, 3.7, 3.9, 4.1, 4.3, 4.5, 4.8, 5.0,
    5.2, 5.4, 5.6, 5.8, 6.0, 6.2, 6.5, 6.7, 6.9, 7.1,
    7.3, 7.5, 7.8, 8.0, 8.2, 8.4, 8.6, 8.9, 9.1, 9.3,
    9.5, 9.7, 0.6, 1.3, 2.2, 3.4, 4.6, 5.7, 6.8, 7.9,
    8.8, 9.9, 0.9, 1.8, 2.7, 3.6, 4.9, 5.9, 6.6, 7.4,
    8.3, 9.2, 0.7, 1.5, 2.4, 3.3, 4.2, 5.3, 6.4, 7.2,
    8.5, 9.6, 0.4, 1.7, 2.6, 3.8, 4.7, 5.5, 6.3, 7.7,
    8.7, 9.8, 0.3, 1.2, 2.9, 3.1, 4.4, 5.1, 6.1, 7.0,
    8.1, 9.4, 2.0, 4.0, 6.0, 8.0, 3.5, 5.5, 7.5, 9.5
]).reshape(-1, 1)

# Exam Scores
scores = np.array([
    38, 42, 45, 48, 50, 53, 55, 58, 60, 63,
    65, 67, 70, 72, 74, 76, 78, 80, 83, 85,
    87, 89, 91, 93, 95, 97, 98, 99, 99, 100,
    99, 98, 97, 95, 93, 91, 89, 87, 85, 83,
    81, 79, 39, 47, 56, 69, 79, 88, 96, 94,
    86, 78, 43, 52, 61, 71, 82, 90, 95, 92,
    84, 82, 41, 49, 58, 68, 77, 85, 93, 91,
    83, 77, 37, 51, 60, 73, 81, 87, 92, 90,
    85, 76, 36, 46, 64, 66, 79, 84, 91, 93,
    82, 75, 54, 75, 91, 94, 70, 87, 92, 74
])

# Split: 75% Train, 25% Test
X_train, X_test, y_train, y_test = train_test_split(hours,
                                                    scores,
                                                    test_size=0.25,
                                                    random_state=42)

2. The Manual Way (NumPy)

To fit a polynomial curve, we manually add the x² column to our design matrix.

def polynomial_features(X, degree):
    """
    Builds polynomial expansion:
    degree=2  -> [1, x, x^2]
    degree=3  -> [1, x, x^2, x^3]
    """
    X_poly = np.hstack([X**d for d in range(1, degree+1)])
    return np.column_stack((np.ones(X.shape[0]), X_poly))

degree = 2

X_train_poly = polynomial_features(X_train, degree)
X_test_poly  = polynomial_features(X_test,  degree)

X_T_X = X_train_poly.T @ X_train_poly
X_T_X_Inv = np.linalg.inv(X_T_X)
X_T_y = X_train_poly.T @ y_train

# Solve Normal Equation
theta = X_T_X_Inv @ X_T_y

print("Learned Parameters:")
for i, w in enumerate(theta.flatten()):
    print(f"w{i} = {w:.4f}")

3. The Scikit-Learn Way

In production, we use Scikit-Learn's PolynomialFeatures transformer, which handles the expansion automatically for any degree.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2)

model = LinearRegression()
model.fit(poly.fit_transform(X_train), y_train)

print("Learned Parameters:")
print(f"w0 = {model.intercept_:.4f}")
for i, w in enumerate(model.coef_[1:], 1):
    print(f"w{i} = {w:.4f}")

The Danger Zone: Overfitting

If adding x² is good, is adding x¹⁰⁰ better? No.

As you increase the degree of the polynomial, the model becomes too flexible. It starts to wiggle wildly to hit every single data point, including the noise. This is called Overfitting (High Variance).

  • Degree 1: Underfitting (High Bias) - Too simple.
  • Degree 2: Good Fit - Captures the trend.
  • Degree 20: Overfitting (High Variance) - Memorizes the noise.

This tradeoff is fundamental to machine learning. We need a way to keep the model flexible enough to learn curves, but disciplined enough to avoid wiggling out of control.
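
One way to watch this tradeoff happen is to fit one model per degree and compare the training error with the test error. The sketch below reuses the imports and the train/test split from earlier; the exact numbers will vary, and very high degrees can be numerically unstable, but the pattern is the point: training error keeps shrinking while test error eventually gets worse.

from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

for d in [1, 2, 20]:
    # Chain the feature expansion and the linear model into one estimator
    pipeline = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    pipeline.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, pipeline.predict(X_train))
    test_mse = mean_squared_error(y_test, pipeline.predict(X_test))
    print(f"Degree {d:2d} | Train MSE: {train_mse:8.2f} | Test MSE: {test_mse:8.2f}")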

What's Next?

How do we restrain a powerful model? How do we force it to keep the curve smooth? The answer lies in a technique called Regularization (Ridge and Lasso Regression), which we will decode in the next post.

Your Turn...

Try running the code above with degree=50. What happens to the curve? Share your observations in the comments!

Click here to open the Google Colab Notebook.

This post is part of the "Linear Algebra for Machine Learning" series. For the previous part, check out: The Math of Linear Regression: From Geometry to Code.
