How to Crush Big Data: The Math of PCA (Principal Component Analysis)
In the real world, data is massive. A single image has millions of pixels. A financial model has thousands of market indicators.
We call this the Curse of Dimensionality. When you have 10,000 features, your data becomes sparse, distance calculations break down, and models become incredibly slow. Visualization? Impossible.
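Don't take my word for the distance claim. Here is a tiny experiment (my own illustration, separate from the PCA recipe below): sample random points in 3 dimensions and in 10,000 dimensions, then compare the nearest and farthest distances from a reference point. In high dimensions the two become almost identical, which is why "nearest neighbor" stops meaning much.
import numpy as np

# In high dimensions, the gap between the nearest and farthest point nearly vanishes
rng = np.random.default_rng(0)
for dims in (3, 10_000):
    points = rng.random((500, dims))                         # 500 random points in [0, 1]^dims
    dists = np.linalg.norm(points - points[0], axis=1)[1:]   # distances from the first point
    print(dims, dists.min() / dists.max())                   # ratio creeps toward 1 as dims grows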
So, how do we fix this? How do we take a massive, high-dimensional monster and squash it down to just 2 or 3 dimensions without losing the important information?
The answer is Principal Component Analysis (PCA). In this post, we will strip away the complexity and build PCA from scratch in Python using a 5-step linear algebra recipe.
The Intuition: Information = Variance
Before the math, we need to define our goal. What does it mean to "keep information"?
In Data Science, Information is Variance (Spread). If data is squashed together, it looks like noise. If it is spread out, we can distinguish patterns.
Our goal is to find new axes (Principal Components) that maximize this spread.
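To make "maximize the spread" concrete, here is a small sketch (a toy example of my own, separate from the height/weight data used below): project the same 2-D cloud onto two different directions and compare the variances. The direction with the larger variance preserves far more of the structure, and that is exactly what a principal component is.
import numpy as np

# Two candidate axes for the same 2-D cloud: which one keeps more "information"?
rng = np.random.default_rng(1)
data = rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=500)
diagonal = np.array([1, 1]) / np.sqrt(2)      # roughly along the cloud's spread
anti_diag = np.array([1, -1]) / np.sqrt(2)    # roughly across the spread
print(np.var(data @ diagonal))                # large variance: informative axis (about 5)
print(np.var(data @ anti_diag))               # small variance: mostly noise (about 1)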
The Algorithm: A 5-Step Recipe
PCA isn't magic; it is a deterministic algorithm. Here is the blueprint:
- Centering: Shift the data so the mean is zero.
- Standardization: Scale features so they are comparable.
- Covariance Matrix: Calculate how features vary together.
- Eigen-Decomposition: Find the axes (Eigenvectors) of that matrix.
- Projection: Rotate the data to align with the new axes.
Step 1 & 2: The Data Problem
Let's imagine a dataset of 200 people with two features: Height (cm) and Weight (kg).
import numpy as np

# Generate 200 people with correlated Height (cm) and Weight (kg)
n_samples = 200
height = np.random.normal(170, 10, n_samples)    # mean 170 cm, std 10
noise = np.random.normal(0, 5, n_samples)        # measurement noise
weight = (0.5 * height) - 20 + noise             # weight tracks height
X = np.column_stack((height, weight))            # shape: (200, 2)
Height is around 170; weight is around 65. The scales, and therefore the variances, are totally different. If we run PCA on the raw data, "Height" will dominate the math simply because its numbers, and its spread, are bigger.
We fix this by Standardizing (subtracting the mean and dividing by standard deviation).
X_centered = X - np.mean(X, axis=0)              # Step 1: shift each feature to zero mean
X_std = X_centered / np.std(X_centered, axis=0)  # Step 2: scale each feature to unit variance
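A quick sanity check, continuing from the snippet above, confirms the transformation did its job: every column should now have a mean of roughly 0 and a standard deviation of roughly 1.
# Sanity check: both features are now centered and on the same scale
print(np.mean(X_std, axis=0))    # approximately [0, 0]
print(np.std(X_std, axis=0))     # approximately [1, 1]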
Step 3 & 4: The Mathematical Engine
Now we need a map of how the data is shaped. We calculate the Covariance Matrix ($\Sigma$) of the standardized data matrix $X$ (with $n$ rows):
$$\Sigma = \frac{1}{n-1} X^\top X$$
(NumPy's np.cov below uses the same $n-1$ divisor; dividing by $n$ instead would only rescale the eigenvalues, not change the eigenvectors.)
To find the "Principal Axes" of this shape, we calculate the Eigenvectors and Eigenvalues.
# Step 3: Covariance
cov_matrix = np.cov(X_std.T)
# Step 4: Eigen Decomposition (np.linalg.eig does not return the pairs in sorted order)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
The Eigenvectors point in the directions of the spread (the dominant one here: Height and Weight increasing together). The Eigenvalues tell us how much variance lies along each of those directions.
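How much of the total spread does each component capture? Continuing from the snippet above, divide each eigenvalue by the sum of all eigenvalues (sorting first, because np.linalg.eig makes no promises about order). The exact numbers depend on your random sample, but for data like ours they land near the 86/14 split quoted below.
# Share of total variance captured by each principal component, largest first
explained_ratio = np.sort(eigenvalues)[::-1] / eigenvalues.sum()
print(explained_ratio)    # roughly [0.86, 0.14] for this data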
Step 5: The Projection (Rotation)
Finally, we sort the eigenvectors by their eigenvalues (largest spread first) and multiply our data by them. This effectively rotates the entire dataset onto the new axes.
order = np.argsort(eigenvalues)[::-1]     # biggest eigenvalue (most spread) first
X_pca = X_std @ eigenvectors[:, order]    # rotate the data into the principal-component basis
The Result:
- PC1 (Principal Component 1): Represents "Overall Size." It captures 86% of the variance.
- PC2 (Principal Component 2): Represents "Build" (Shape). It captures only 14% of the variance.
In a high-dimensional dataset, you might find that the top 3 components hold 95% of the information. You can throw away the other 9,997 dimensions and barely lose a thing.
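Here is what that cut looks like in code. This is a generic sketch of the idea (the function name, the 95% threshold, and the helper logic are mine, not from the post): sort the eigenvectors, keep just enough leading columns to reach your variance budget, and project.
def reduce_dimensions(X_std, eigenvalues, eigenvectors, threshold=0.95):
    # Project onto the fewest components whose combined variance reaches `threshold`
    order = np.argsort(eigenvalues)[::-1]
    cumulative = np.cumsum(eigenvalues[order]) / eigenvalues.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return X_std @ eigenvectors[:, order[:k]]     # shape: (n_samples, k)

X_reduced = reduce_dimensions(X_std, eigenvalues, eigenvectors)
print(X_reduced.shape)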
What's Next?
We have mastered PCA, but it leaned on eigen-decomposition, which only works for square matrices like the covariance matrix. In the real world, raw data matrices are often rectangular (rows ≠ columns).
How do we decompose any matrix of any shape? This leads us to the Fundamental Theorem of Linear Algebra: The Singular Value Decomposition (SVD). That is the topic of our next post.
Get the Code
Want to run this yourself? Check out the Google Colab Notebook to visualize the transformation step-by-step.
This post is part of the "Linear Algebra for Machine Learning" series. For the previous part, check out: The Hidden "DNA" of Matrices: Eigenvalues & Eigenvectors.