Why LoRA Works: The SVD Math Behind Efficient Fine-Tuning
We keep talking about Gradient Descent when it comes to training a machine learning model. We calculate the gradient, adjust weights to minimize the loss. That sounds simple enough.
But what does it mean to train a large language model with billions of parameters? Gradient descent still works exactly as it should — you compute gradients with respect to every single parameter and store optimizer states for all of them.
What's the problem then?
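It works perfectly, but it needs a lot of compute and memory just to hold the training state. For a 7-billion-parameter model trained with Adam in 32-bit precision, the weights, gradients, and two optimizer moments take about 16 bytes per parameter: roughly 112 GB before a single activation is stored. That means GPUs, possibly a connection to a data center.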
What if you are a student or a researcher with just a laptop? Do you give up on fine-tuning a model with billions of parameters?
What if you could train a model where, instead of modifying billions of weights, you only tune a few parameters — and still get nearly the same result?
The answer is LoRA, short for Low-Rank Adaptation, and it is built on the idea of low-rank approximation. To understand it, we need to understand Singular Value Decomposition first.
What is a Matrix, Really?
You know matrices as arrays of numbers. But geometrically, a matrix is a transformation. It takes vectors and moves them — rotates them, stretches them, compresses them.
Singular Value Decomposition states that any real matrix can be decomposed into exactly three operations performed in sequence:
M = U Σ Vᵀ
Step 1 — Vᵀ: The First Rotation
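Vᵀ rotates your input vector into a new coordinate frame. (Strictly, it is an orthogonal transformation: a rotation, possibly combined with a reflection.) The columns of V are orthonormal: mutually perpendicular unit vectors that form a basis. Think of it as reorienting your perspective before any stretching happens.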
Step 2 — Σ: The Scaling
Σ is a diagonal matrix. Each diagonal entry is called a singular value, and it either stretches or compresses the vector along one axis of the new frame. Two things to note:
- Singular values are non-negative
- They are arranged in descending order — largest first
Step 3 — U: The Second Rotation
U rotates the scaled result into the final output orientation.
That is SVD. Rotation. Scale. Rotation. Every matrix, no exceptions.
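import numpy as np
# A simple 2×2 matrix
M = np.array([[3.0, 1.0],
              [0.5, 2.0]])
U, s, Vt = np.linalg.svd(M)
Sigma = np.diag(s)
# Assumed input: points on the unit circle, one per column (shape 2 × 200).
# The original snippet uses `circle` without defining it.
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])
# Intermediate stages of the transformation
stage1 = Vt @ circle       # First rotation (Vᵀ x)
stage2 = Sigma @ stage1    # Scaling (Σ Vᵀ x)
stage3 = U @ stage2        # Second rotation (U Σ Vᵀ x)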
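The properties listed in the three steps can be checked numerically. The sketch below reuses the same 2×2 matrix; the test vector `x` and the variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np

M = np.array([[3.0, 1.0],
              [0.5, 2.0]])
U, s, Vt = np.linalg.svd(M)

# U and V are orthogonal: their transposes are their inverses.
u_orthogonal = np.allclose(U.T @ U, np.eye(2))
v_orthogonal = np.allclose(Vt @ Vt.T, np.eye(2))

# Singular values are non-negative and sorted largest first.
s_sorted = bool(np.all(s >= 0) and np.all(np.diff(s) <= 0))

# Multiplying the three factors back together recovers M.
reconstructed = np.allclose(U @ np.diag(s) @ Vt, M)

# Applying the three stages one at a time matches applying M directly.
x = np.array([1.0, 1.0])
staged = U @ (np.diag(s) @ (Vt @ x))
same_result = np.allclose(staged, M @ x)

print(u_orthogonal, v_orthogonal, s_sorted, reconstructed, same_result)
```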
The Key Observation: Singular Values Are Not Equal
Look at those singular values. In almost every weight matrix in a trained neural network, they are not equal — not even close.
A small number of them are large. The rest decay rapidly toward zero.
What does this mean geometrically? Most of what this matrix does happens along a small number of directions. The rest contributes so little that it is effectively noise.
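This is not an accident. It is a pattern commonly observed in weight matrices that have been trained on real data. Hu et al. build on a related idea in the LoRA paper: they hypothesize that the change in weights during adaptation has a low "intrinsic rank."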
Low-Rank Approximation: Keep Only What Matters
If most of the information lives in the top few singular values, what happens if you just keep those and discard the rest?
You get an approximation of the original matrix:
M ≈ Uᵣ Σᵣ Vᵣᵀ
Where r is the number of singular values you keep. Not exact, but close. And the remarkable result, which the Eckart–Young theorem proves formally, is that this is the best possible rank-r approximation: no other matrix of rank at most r is closer to M in the Frobenius norm (or in the spectral norm). Not "a good one." The provably optimal one.
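A small numerical check makes the Eckart–Young claim concrete. This is a sketch on a random 40 × 30 matrix (an arbitrary choice for illustration): the Frobenius error of the truncated SVD equals the norm of the discarded singular values, and other rank-r candidates, built here by projecting onto random r-dimensional subspaces, never beat it.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(40, 30))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

r = 5
M_r = (U[:, :r] * s[:r]) @ Vt[:r, :]      # truncated SVD, rank r

# The Frobenius error equals the norm of the singular values we threw away.
err_svd = np.linalg.norm(M - M_r, 'fro')
err_predicted = np.sqrt(np.sum(s[r:] ** 2))

# Competing rank-r approximations: project M onto random r-dimensional subspaces.
errs_random = []
for _trial in range(100):
    Q, _R = np.linalg.qr(rng.normal(size=(40, r)))   # orthonormal basis, 40 × r
    candidate = Q @ (Q.T @ M)                        # rank at most r
    errs_random.append(np.linalg.norm(M - candidate, 'fro'))

print(f"truncated SVD error: {err_svd:.4f}")
print(f"best random-subspace error: {min(errs_random):.4f}")
```

Random subspaces are only one family of competitors; the theorem covers every matrix of rank at most r.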
Measuring it on DistilBERT
To see this concretely, we take weight matrices from DistilBERT — a real pre-trained transformer. Each attention layer has Q, K, V, and Out projection matrices, each of size 768 × 768. There are 6 layers, giving us 24 matrices total.
from transformers import AutoModel
model = AutoModel.from_pretrained('distilbert-base-uncased')
model.eval()
# Collect all attention projection matrices
weight_matrices = {}
for name, param in model.named_parameters():
    if any(k in name for k in ['q_lin', 'k_lin', 'v_lin', 'out_lin']) and param.ndim == 2:
        weight_matrices[name] = param.detach().numpy()
We compute SVD for each matrix, normalize the singular values against the largest one, and plot the decay curve. We also mark the "elbow": the rank at which a normalized singular value drops below 0.1. Energy scales with the square of a singular value, so a direction at that level carries less than 1% (0.1² = 0.01) of the energy of the leading one.
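The normalization and elbow marker can be sketched as follows. The helper name `elbow_rank` is hypothetical, and the test matrix is synthetic with a known spectrum, so the snippet runs without downloading DistilBERT. The elbow is reported here as the number of singular values that stay at or above the threshold.

```python
import numpy as np

def elbow_rank(W, threshold=0.1):
    """Return the normalized spectrum and how many values stay >= threshold."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values, largest first
    s_norm = s / s[0]                        # leading value becomes 1.0
    elbow = int(np.sum(s_norm >= threshold))
    return s_norm, elbow

# Synthetic 6×6 matrix with a known spectrum (illustrative values).
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.normal(size=(6, 6)))
Q2, _ = np.linalg.qr(rng.normal(size=(6, 6)))
true_s = np.array([10.0, 5.0, 2.0, 0.5, 0.1, 0.01])
W = (Q1 * true_s) @ Q2.T                     # W = Q1 · diag(true_s) · Q2ᵀ

s_norm, elbow = elbow_rank(W)
print(elbow)   # 3: only 1.0, 0.5 and 0.2 are at or above 0.1
```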
Singular Value Energy
To quantify how much information is retained at a given rank, we use singular value energy — the cumulative sum of squared singular values, divided by the total sum of squares:
energy = cumsum(σᵢ²) / sum(σᵢ²)
When we plot this for all 24 weight matrices and mark the average rank needed to hit three energy thresholds, the result is striking:
- 90% energy → average rank 315 (41% of full rank)
- 95% energy → average rank 399 (52% of full rank)
- 99% energy → average rank 537 (70% of full rank)
Even to capture 99% of the original matrix's information, you only need 70% of its dimensions. The tail is genuinely redundant.
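The rank needed for a given energy threshold can be computed with a helper along these lines. The function name `rank_for_energy` and the toy spectrum are assumptions made for illustration; with the DistilBERT matrices you would pass in each matrix's singular values instead.

```python
import numpy as np

def rank_for_energy(s, threshold):
    """Smallest rank whose top singular values hold at least `threshold` of the energy."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)       # cumulative energy curve
    rank = int(np.searchsorted(energy, threshold)) + 1
    return min(rank, len(s))                          # guard against round-off at 100%

# Toy spectrum: squares are 16, 9, 4, 1, so the energy curve is
# 16/30, 25/30, 29/30, 30/30.
s = np.array([4.0, 3.0, 2.0, 1.0])
for t in (0.50, 0.90, 0.99):
    print(f"{t:.0%} energy -> rank {rank_for_energy(s, t)}")
```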
Reconstructing at Different Ranks
We rebuild the weight matrix at ranks 1, 8, 64, 256, 512, and 768 (full rank) and compare:
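# Assumption: W0 is one of the matrices collected above (here, the first one)
W0 = next(iter(weight_matrices.values()))
U0, s0, Vt0 = np.linalg.svd(W0, full_matrices=False)
ranks = [1, 8, 64, 256, 512, W0.shape[1]]
frob_total = np.linalg.norm(W0, 'fro')**2
for r in ranks:
    # Rank-r reconstruction from the top r singular triplets
    Wr = (U0[:, :r] * s0[:r]) @ Vt0[:r, :]
    err = np.linalg.norm(W0 - Wr, 'fro')**2
    pct_captured = (1 - err / frob_total) * 100
    print(f"rank {r:>3}: {pct_captured:.1f}% of energy captured")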
At rank 1, the reconstruction looks nothing like the original. At rank 64, structure starts to emerge — yet we have only captured about 50% of the energy. At rank 512, the reconstruction is nearly identical to the full-rank matrix.
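But here is the problem: keeping 70% of the rank is nowhere near "billions down to thousands." It does not even save storage: the rank-537 factors hold 537 × (768 + 768) ≈ 825,000 numbers, more than the 589,824 in the original 768 × 768 matrix. That gap is where LoRA steps in.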
Enter LoRA: Don't Reconstruct — Update
Low-rank approximation tried to reconstruct the full weight matrix using fewer parameters. LoRA does something different — it does not touch the original matrix at all. It learns an update on top of it.
During fine-tuning:
- The original weight matrix W is frozen — its parameters do not change
- A separate update matrix ΔW is learned and added on top
W' = W + ΔW
The output of any layer becomes:
h = W'x = Wx + ΔWx
The LoRA Constraint: ΔW Must Be Low Rank
Instead of storing ΔW as a full d×k matrix, LoRA parameterizes it as the product of two small matrices:
ΔW = B × A
Where:
- B has shape d × r
- A has shape r × k
- r << min(d, k) — r is very small, typically 4–64
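A minimal NumPy sketch of this parameterization, with made-up sizes (d = 64, k = 48, r = 4) and random values standing in for trained ones. It shows that the layer output can be computed as Wx + B(Ax) without ever forming the full d × k matrix ΔW, and counts the trainable parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                 # illustrative sizes only

W = rng.normal(size=(d, k))         # frozen pretrained weight
B = rng.normal(size=(d, r))         # trainable factor (random here for the demo)
A = rng.normal(size=(r, k))         # trainable factor
x = rng.normal(size=k)

# Explicit update: build ΔW = BA, then apply W' = W + ΔW.
h_full = (W + B @ A) @ x

# LoRA-style forward pass: two thin matrix-vector products, no d × k update matrix.
h_lora = W @ x + B @ (A @ x)

trainable = B.size + A.size         # r * (d + k)
print(np.allclose(h_full, h_lora), trainable, W.size)
```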
Initialization
Two key choices at initialization:
- B is initialized to zero — so ΔW = BA = 0 at the start of training. The model begins from the pretrained weights unchanged.
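- A is initialized with small random values: because A is nonzero, B receives a nonzero gradient from the first step. If both factors started at zero, both gradients would be zero and training would never move.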
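The effect of this initialization can be seen by writing out the gradients by hand. The sketch below assumes a squared-error loss L = ½‖(W + BA)x − y‖² on a single example, chosen only because its gradients are easy to derive; the sizes and the 0.01 scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 2

W = rng.normal(size=(d, k))               # frozen
B = np.zeros((d, r))                      # zero init
A = rng.normal(scale=0.01, size=(r, k))   # small random init
x = rng.normal(size=k)
y = rng.normal(size=d)                    # arbitrary target for the demo

# With B = 0 the adapter contributes nothing: output equals the pretrained one.
h = W @ x + B @ (A @ x)
starts_at_pretrained = np.allclose(h, W @ x)

# Hand-derived gradients of L = 0.5 * ||h - y||^2
e = h - y
grad_B = np.outer(e, A @ x)               # dL/dB = e (Ax)ᵀ, nonzero because A is nonzero
grad_A = np.outer(B.T @ e, x)             # dL/dA = Bᵀ e xᵀ, zero while B is zero

print(starts_at_pretrained, np.abs(grad_B).max() > 0, np.abs(grad_A).max() == 0)
```

On the first step only B moves; once B is nonzero, A starts receiving gradient too.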
After Training
The update can be folded back into the original weights:
W_final = W + BA
Zero inference overhead. The model at deployment is a single weight matrix — no LoRA layers remain.
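Merging can be sketched in a few lines. Random matrices stand in for trained factors here, and the batch `X` is made up for the check. The merged matrix has the same shape as W, gives the same outputs as the adapter path, and subtracting BA restores the original weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4

W = rng.normal(size=(d, k))
B = rng.normal(size=(d, r))      # pretend these came out of training
A = rng.normal(size=(r, k))
X = rng.normal(size=(k, 10))     # a batch of 10 input columns

out_adapter = W @ X + B @ (A @ X)    # during training: frozen path plus adapter path
W_final = W + B @ A                  # fold the update into the weights
out_merged = W_final @ X             # at deployment: one matrix multiply

W_restored = W_final - B @ A         # merging is reversible
print(np.allclose(out_adapter, out_merged), np.allclose(W_restored, W))
```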
The Mathematics: Why rank(BA) ≤ r
The rank inequality for matrix products gives us a formal guarantee:
rank(BA) ≤ min(rank(B), rank(A)) ≤ r
By choosing r small, you are enforcing low-rank structure on ΔW. The product can only be as expressive as the smaller of its two factors.
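The rank bound is easy to confirm numerically. In this sketch B and A are random with arbitrary sizes; the product is a 50 × 40 matrix, yet its rank is at most r = 3 and every singular value past the third is zero up to round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 50, 40, 3

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta_W = B @ A                                  # full-size 50 × 40 matrix

rank = np.linalg.matrix_rank(delta_W)
s = np.linalg.svd(delta_W, compute_uv=False)     # 40 singular values

print(rank)                                      # at most r
print(s[:r])                                     # the only meaningful values
print(s[r:].max())                               # round-off level
```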
There is an elegant connection to SVD here. A full SVD of ΔW would give you UΣVᵀ. LoRA does not explicitly compute U, Σ, V — but by choosing B and A as the learnable parameters, you are parameterizing the same structure. Gradient descent discovers which low-rank subspace fits your task.
The singular value spectrum of the learned ΔW, examined after training, has at most r nonzero values by construction, and those values themselves tend to decay quickly. The model found what it needed in a small subspace.
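The link between the two forms can be made explicit: any truncated SVD can be rewritten as a product BA by setting B = UᵣΣᵣ and A = Vᵣᵀ. The sketch below does this for a random matrix `G` (a stand-in, not a real weight update). LoRA itself never runs this SVD; it learns B and A by gradient descent. The factorization is also not unique, as the last lines show.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(30, 20))
U, s, Vt = np.linalg.svd(G, full_matrices=False)

r = 4
G_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # rank-r truncated SVD

# Absorb the singular values into the left factor.
B = U[:, :r] * s[:r]                          # shape 30 × r
A = Vt[:r, :]                                 # shape r × 20
same = np.allclose(B @ A, G_r)

# Many (B, A) pairs give the same product: rescale with any invertible R.
R = np.diag([1.0, 2.0, 3.0, 4.0])
B2, A2 = B @ R, np.linalg.inv(R) @ A
also_same = np.allclose(B2 @ A2, G_r)

print(same, also_same)
```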
The Parameter Reduction: By the Numbers
For a weight matrix of shape d × k:
- Full fine-tuning requires d × k trainable parameters
- LoRA requires d×r + r×k = r × (d + k) parameters
The reduction factor is:
reduction = (d × k) / (r × (d + k))
For a square matrix (d = k), this simplifies to:
reduction = d / (2 × r)
With r = 8 and d = 768 (DistilBERT): 768 / (2 × 8) = 48× reduction.
With r = 8 and d = 4,096 (LLaMA-7B): 4,096 / (2 × 8) = 256× reduction.
configs = [
    (768, 768, "DistilBERT attention"),
    (4096, 4096, "LLaMA-7B attention"),
    (4096, 11008, "LLaMA-7B FFN"),
    (8192, 8192, "LLaMA-70B attention"),
]
ranks = [4, 8, 16, 32, 64]
for (d, k, name) in configs:
    full_params = d * k                    # trainable parameters under full fine-tuning
    for r in ranks:
        lora_params = r * (d + k)          # parameters in B (d×r) plus A (r×k)
        reduction = full_params / lora_params
        print(f"{name:<25} r={r:<3} full={full_params:>10,} LoRA={lora_params:>7,} reduction={reduction:.0f}x")
When you run peft.get_peft_model() and watch the trainable parameter count drop from 7 billion to about 4 million, that is low-rank factorization running in your terminal: each ΔW is stored as two thin matrices instead of one full one. That is a century of linear algebra, paying out.
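The "4 million" figure can be reproduced with plain arithmetic. The configuration below is an assumption, not something stated in the post: LLaMA-7B with 32 transformer layers and hidden size 4,096, rank 8, and adapters on only the query and value projections of each attention block. Other target-module choices give different totals.

```python
# Assumed setup (not from the post): LLaMA-7B, r = 8,
# LoRA applied to the query and value projections only.
num_layers = 32
hidden = 4096
r = 8
adapted_per_layer = 2                            # query and value projections

params_per_matrix = r * (hidden + hidden)        # B is 4096×8, A is 8×4096
lora_params = num_layers * adapted_per_layer * params_per_matrix

print(f"{params_per_matrix:,} per adapted matrix")   # 65,536
print(f"{lora_params:,} trainable in total")         # 4,194,304
```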
An Important Clarification: W₀'s Rank vs. ΔW's Rank
There is a distinction worth being precise about.
When we computed the optimal rank for DistilBERT's attention matrices at 99% energy, the result was around 537 — barely any parameter saving from reconstruction. That looks like a contradiction with LoRA using r = 8.
It is not a contradiction. They answer different questions:
- W₀'s singular value spectrum asks: how many dimensions do you need to reconstruct the pretrained matrix itself? W₀ is a dense, complex matrix carrying everything learned during pretraining. Reconstructing it faithfully requires hundreds of dimensions.
- ΔW's singular value spectrum asks: how many dimensions does the fine-tuning update need? ΔW is learned for a specific task — and empirically, it turns out that task-specific updates live in a very small subspace. r = 8 is often enough.
W₀'s spectrum tells you why the design makes sense — the weight space has low-dimensional structure. ΔW's spectrum tells you which r to use.
Choosing the Right Rank
Choosing the value of r is crucial.
- If r is too small, you miss the signal — not enough directions to represent the fine-tuning update.
- If r is too large, you pay for parameters you do not need, and the extra directions are free to fit noise rather than signal.
The original LoRA paper reports that ranks as small as 4 or 8 were enough for the models and tasks it tested, and values in that range remain common starting points.
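The trade-off can be illustrated on synthetic data. In this sketch the "true" update has rank 4 and is observed with a little noise; truncated SVD stands in for the best a rank-r adapter could represent. All sizes and the noise level are made up, and in a real fine-tuning run r is chosen by validation performance, not by this kind of oracle comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 64, 64, 4

signal = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))
noise = 0.01 * rng.normal(size=(d, k))
target = signal + noise                  # what we actually get to fit

U, s, Vt = np.linalg.svd(target, full_matrices=False)

rel_err = {}
for r in [1, 2, 4, 8, 16]:
    approx = (U[:, :r] * s[:r]) @ Vt[:r, :]
    # Error is measured against the clean signal, not the noisy target.
    rel_err[r] = np.linalg.norm(signal - approx) / np.linalg.norm(signal)

for r, err in rel_err.items():
    print(f"r={r:<2} relative error vs. clean signal: {err:.4f}")
```

Below the true rank the error is large because signal directions are missing; above it, each extra direction can only add fitted noise.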
One thing to be precise about: the 70% figure we measured for DistilBERT is the rank needed to reconstruct W₀ itself at 99% energy. LoRA does not reconstruct W₀; it learns ΔW, the fine-tuning update, which empirically has far lower intrinsic dimensionality.
Connecting the Full Arc
Let's bring this together cleanly.
- A matrix is a transformation — rotation, scale, rotation. That is SVD.
- The scaling step reveals low-rank structure — trained weight matrices do their real work along a small number of directions. The rest is nearly zero.
- Fine-tuning updates live in the same small subspace — you don't need to parameterize the full update. Two matrices whose product has the right rank is sufficient.
- That is LoRA.
LoRA is not a heuristic. It is not a trick. It is SVD, applied to the problem of efficient adaptation.
M = U Σ Vᵀ → ΔW = B A
One is the mathematical justification. The other is the engineering implementation. They are the same idea.
Now you know how fine-tuning a model with billions of parameters on a laptop is made possible. The weight matrices are large. The update is not. And SVD is the reason you can tell the difference.
What We Built in the Notebook
The companion notebook walks through every step from scratch:
- SVD on a toy 2×2 matrix with animated transformation stages (Vᵀ → Σ → U)
- Singular value spectra for 6 randomly sampled DistilBERT attention weight matrices — normalized and plotted with elbow markers
- Cumulative singular value energy across all 24 attention matrices — average rank at 90%, 95%, and 99% thresholds
- Low-rank reconstruction at r = 1, 8, 64, 256, 512, 768 — heatmaps and residual error maps on a shared colour scale
- Parameter count comparison table across DistilBERT, LLaMA-7B, LLaMA-7B FFN, and LLaMA-70B at ranks 4, 8, 16, 32, 64
Every visualization is reproducible, every step is annotated.
Get the Code
The full implementation is in the Python Notebook — including all visualizations, the rank diagnostic, and the parameter comparison tables.
What's Next
Whether you are training from scratch or fine-tuning with LoRA, Gradient Descent is the engine behind all of it. In the next episode we build it from scratch — derive the update rule, implement it in code, and watch it actually move down a loss surface.
No shortcuts. No library calls hiding the math. Just the algorithm, written from the ground up, so you understand exactly what happens every time a model takes a step.
This post is part of the Decoding Complexities — Season 3 series. For the previous episode, check out: The Gradient Vector — Direction of Steepest Ascent.