Finding the True Dimension of Your Data (Basis & Rank Explained)
From Chaos to Order: Finding the Basis of Your Data
Imagine you have a dataset with hundreds of features, each represented as a vector. This collection of vectors can describe a vast, high-dimensional world of possibilities. But is this the most efficient way to describe that world? Do we really need hundreds of vectors if the underlying structure is much simpler?
This leads to a fundamental engineering question: What is the smallest possible set of building blocks we need to describe our entire data space without losing any information? The answer lies in the elegant concepts of Span, Basis, and Dimension.
In this post, we'll decode these concepts to build a powerful mental model for understanding the true "shape" and complexity of your data.
Watch the video for the full visual explanation, then scroll down for the detailed definitions and examples.
The "Playground": Generating Sets and Span
Let's start with a collection of vectors. Any set of vectors S = {v₁, v₂, ..., vₙ} can be used to generate other vectors through linear combinations (i.e., by adding and scaling them).
Generating Set
A set of vectors is called a Generating Set for a vector space V if every vector in V can be written as a linear combination of the vectors in the set. In other words, the set can "build" the entire space.
The Span
The Span of a set of vectors is the set of all possible linear combinations of those vectors. It's the complete "playground" or world that can be reached using only the vectors in your set.
However, a generating set can be inefficient. For example, the vectors {[2,1], [-1,1], [1,2]} can generate the entire 2D plane (R²). But, as we know from the last video, these vectors are linearly dependent because `[1,2] = [2,1] + [-1,1]`. The third vector is redundant. This leads us to search for a better, more efficient generating set.
The "Ultimate Starter Kit": The Basis
A Basis is the ultimate generating set. It's the most efficient possible "Lego starter kit" for a vector space. A set of vectors B is a basis for a vector space V if it satisfies two strict conditions:
- The vectors in B are Linearly Independent. (There is no redundancy in the set.)
- The vectors in B span the entire space V. (You can still reach every point in the space.)
The most famous example is the standard basis for R², which consists of the vectors i = [1, 0] and j = [0, 1]. They are clearly linearly independent, and they span the entire 2D plane.
Importantly, a basis is not unique. Any set of n linearly independent vectors will form a valid basis for an n-dimensional space. For example, the vectors {[1,1], [1,-1]} also form a perfectly valid basis for R². They simply represent a different coordinate system for the same 2D world.
Dimension and Rank: The Consequences of a Basis
Once we understand the basis, the concepts of Dimension and Rank become incredibly simple and intuitive.
Dimension
The Dimension of a vector space is simply the number of vectors in any of its bases. While a space can have infinite different bases, they will all contain the exact same number of vectors. If the basis for a space has 2 vectors, its dimension is 2. If it has 3, its dimension is 3.
Rank of a Matrix
The Rank of a matrix A is the dimension of its output space (also known as its column space). It tells you the "true dimensionality" of the information that comes out of a matrix transformation. A matrix that squashes a 3D space down to a 2D plane has a rank of 2. A 'full rank' matrix preserves the dimension of the space, while a 'rank deficient' matrix loses dimension, meaning some information is irreversibly lost.
Calculation in Practice: The Math & The Code
The most reliable way to find the rank of a matrix is to perform Gaussian Elimination to find its Row Echelon Form (REF). The rank is then simply the number of non-zero rows (or pivots) in the resulting matrix.
# Consider this 3x3 matrix:
A = [[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
# After Gaussian Elimination, its REF is:
A_ref = [[1, 2, 3],
[0, -3, -6],
[0, 0, 0]]
# There are 2 non-zero rows, so the Rank(A) = 2.
In Python, we let NumPy handle this for us with a single, powerful function.
import numpy as np
# A rank-deficient matrix where the third row is a linear combination of the first two
# R3 = 2*R2 - R1
A = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
# Calculate the rank
rank = np.linalg.matrix_rank(A)
print(f"The rank of the matrix is: {rank}")
# Output: The rank of the matrix is: 2
Conclusion
We've completed the journey from a messy Generating Set to an efficient Basis. The size of that basis defined the Dimension of our space, and the dimension of a matrix's output defined its Rank. These concepts are the formal language we use to describe our data's world efficiently and are foundational for advanced techniques like PCA.
Your Turn...
Did the "Lego starter kit" analogy help clarify the concept of a Basis? What's the highest-dimensional data you've ever had to work with? Let me know in the comments!
This post is part of the "Linear Algebra: The Core of Machine Learning" series. For the previous part, check out: This is Why Your Machine Learning Model is Unstable (Linear Independence).
Comments
Post a Comment