Why LoRA Works: The SVD Math Behind Efficient Fine-Tuning
Whenever we talk about training a machine learning model, we talk about gradient descent: compute the gradient of the loss, then adjust the weights to reduce it. That sounds simple enough. But what does it mean for a large language model with billions of parameters? Gradient descent still works exactly as it should: you compute a gradient for every single parameter and store optimizer state for each of them.

So what's the problem? The method works, but it demands a lot of compute and tens of gigabytes of memory or more just to hold the training state. That means GPUs, and possibly access to a data center. What if you are a student or a researcher with just a laptop? Do you give up on fine-tuning a model with billions of parameters?

What if, instead of modifying billions of weights, you could train only a small fraction as many parameters and still get nearly the same result? The answer is low-rank approximation, the idea behind LoRA (Low-Rank Adaptation). And to understand it, ...