The Gradient Vector: The One Mathematical Object That Trains Every Neural Network
If you are standing on a mountain and want to reach the bottom, you check the slope under your feet and start walking downhill.
How does the machine do it? If the input data contains just one feature, the machine adjusts a single weight to minimize the loss. That means moving the weight either left or right, depending on the sign of the slope.
What if there is more than one feature?
GPT-3, the model family that ChatGPT grew out of, has 175 billion parameters. Training it means searching for a low point of the loss in a 175-billion-dimensional space.
That is the mathematical problem every AI trainer faces, and there is a solution for it.
The Single Parameter Case
If we have just a single parameter, the problem is easy: we calculate the slope and decide which way to move.
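As a minimal sketch of this one-parameter case, assume a hypothetical loss L(w) = w², chosen only for illustration. The function names below are not from the original notebook. The sign of the slope alone tells us which way to move.

```python
# Hypothetical one-parameter loss, assumed for illustration: L(w) = w^2
def loss_1d(w):
    return w ** 2

# Its derivative (the slope) is dL/dw = 2w
def slope_1d(w):
    return 2 * w

def direction_to_move(w):
    """Return which way to move w so that the loss goes down."""
    s = slope_1d(w)
    if s > 0:
        return "left (decrease w)"   # loss rises to the right, so go left
    if s < 0:
        return "right (increase w)"  # loss rises to the left, so go right
    return "stay (slope is zero)"

print(direction_to_move(3.0))   # slope is +6, so move left
print(direction_to_move(-2.0))  # slope is -4, so move right
```

With one parameter there are only two possible directions, which is why a single number is enough.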
The moment we add a second parameter, a single slope is no longer enough. Imagine standing on a mountainside or somewhere inside a bowl: there are infinitely many directions to move in, and choosing the right one is the key.
We can't calculate slope in all directions at once. Then the obvious question is: how do we find the slope in a multi-dimensional space?
Partial Derivatives: One Direction at a Time
What if we calculate the slope for one direction at a time, and keep all others constant?
In this example, we have two parameters, weight and bias. Our loss function is the sum of their squares: L(w, b) = w² + b².
First, we freeze the bias, treat it as a constant, and differentiate with respect to the weight: ∂L/∂w = 2w. The curly symbol ∂ denotes a partial derivative. It is not the same as an ordinary derivative, which applies when there is only a single variable.
Next, we freeze the weight and calculate the partial derivative with respect to the bias: ∂L/∂b = 2b.
# Partial derivative of the loss with respect to the weight (bias held constant)
def partial_w(w):
    return 2 * w
# Partial derivative of the loss with respect to the bias (weight held constant)
def partial_b(b):
    return 2 * b
# Assume initial coordinates at weight 3 and bias 4
w, b = 3.0, 4.0
print(f"Slope along the weight axis: {partial_w(w)}") # → 6.0
print(f"Slope along the bias axis: {partial_b(b)}") # → 8.0
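One way to sanity-check hand-derived partial derivatives is to nudge one parameter slightly while holding the other fixed and see how much the loss changes. The sketch below does this with a central difference. The helper names and the nudge size `h` are assumptions made for this illustration, not part of the original notebook.

```python
# Loss from the text: sum of the squares of weight and bias
def loss(w, b):
    return w ** 2 + b ** 2

def numerical_partial_w(w, b, h=1e-6):
    # Nudge only w in both directions; b stays frozen
    return (loss(w + h, b) - loss(w - h, b)) / (2 * h)

def numerical_partial_b(w, b, h=1e-6):
    # Nudge only b in both directions; w stays frozen
    return (loss(w, b + h) - loss(w, b - h)) / (2 * h)

w, b = 3.0, 4.0
print(f"Numerical slope along weight: {numerical_partial_w(w, b):.4f}")  # close to 6
print(f"Numerical slope along bias:   {numerical_partial_b(w, b):.4f}")  # close to 8
```

The numerical estimates agree with the analytic results 2w and 2b, which is a quick check that "freeze everything else and differentiate" does what we expect.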
We now have slopes in two different directions.
If you are standing at a weight of 3 and a bias of 4, the slope along the weight axis is 6 and the slope along the bias axis is 8.
We have two numbers, one per axis, but we can only take one step at a time. We need a way to combine them into a single direction.
The Gradient Vector
We pack these slopes into a single vector, and that is the solution.
We take the partial derivatives, one per parameter, and stack them into a single vector. In multivariable calculus, that vector is called the gradient.
It is denoted by the upside-down triangle ∇, called nabla. Whenever you see it, it indicates the vector of all partial derivatives of a function (here, the loss), with one entry per parameter.
No matter how many dimensions we have, whether 2, 3 or 175 billion, a nonzero gradient vector always points in the direction of steepest uphill climb.
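Written out for the two-parameter loss used here, L(w, b) = w² + b², the gradient is just the two partial derivatives stacked. In the general form, θ₁ through θₙ are placeholder names for the parameters and are not notation from the original text.

```latex
\[
\nabla L(w, b) =
\begin{bmatrix}
  \dfrac{\partial L}{\partial w} \\[6pt]
  \dfrac{\partial L}{\partial b}
\end{bmatrix}
=
\begin{bmatrix} 2w \\ 2b \end{bmatrix},
\qquad
\nabla L(3, 4) = \begin{bmatrix} 6 \\ 8 \end{bmatrix}
\]

\[
\nabla L(\theta_1, \dots, \theta_n) =
\begin{bmatrix}
  \dfrac{\partial L}{\partial \theta_1} & \cdots & \dfrac{\partial L}{\partial \theta_n}
\end{bmatrix}^{\top}
\]
```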
import numpy as np
# Initialize the model weight and bias
w, b = 3.0, 4.0
# Combine the individual partial derivatives (partial_w and partial_b, defined earlier) into the gradient vector
gradient = np.array([partial_w(w), partial_b(b)])
print(f"Gradient vector: {gradient}") # → [6. 8.]
print("Direction: steepest uphill climb")
print(f"Magnitude: {np.linalg.norm(gradient):.4f}") # → 10.0000
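To see the "steepest uphill" claim in action, the sketch below samples unit directions around the full circle and measures the slope in each one. The slope in a unit direction u is the dot product of the gradient with u. The one-degree sampling grid and the variable names are assumptions for this illustration.

```python
import math

grad = (6.0, 8.0)              # gradient of w^2 + b^2 at (w, b) = (3, 4)
grad_norm = math.hypot(*grad)  # length of the gradient vector

best_deg, best_slope = None, -math.inf
for deg in range(360):
    theta = math.radians(deg)
    u = (math.cos(theta), math.sin(theta))     # unit vector at this angle
    slope = grad[0] * u[0] + grad[1] * u[1]    # slope of the loss along u
    if slope > best_slope:
        best_deg, best_slope = deg, slope

# Angle at which the gradient itself points
grad_deg = math.degrees(math.atan2(grad[1], grad[0]))

print(f"Steepest sampled direction: {best_deg} degrees, slope {best_slope:.4f}")
print(f"Gradient direction: {grad_deg:.2f} degrees, magnitude {grad_norm:.4f}")
```

The steepest sampled direction lands on the grid point closest to the gradient's own angle, and the slope there is essentially the gradient's magnitude. No other direction climbs faster.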
Gradient Descent: The Engine of Every Neural Network
Our goal is to reduce the loss, so we negate the gradient by multiplying it by negative one. Now the vector points in the steepest downhill direction.
We compute the gradient, take a step in the opposite direction, recompute the gradient, take another step, and repeat.
That loop is called Gradient Descent. In one variant or another, it is the engine that trains practically every modern neural network.
When we see a training loss curve drop smoothly toward zero, that is what is happening underneath, repeated millions of times. It's not magic, it's calculus.
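The loop described above can be sketched in a few lines for the two-parameter loss w² + b². The step size of 0.1 and the 50 iterations are arbitrary values assumed for this illustration; how to choose the step size is the subject of the next section.

```python
def loss(w, b):
    return w ** 2 + b ** 2

def gradient(w, b):
    # Partial derivatives of the loss, stacked as (dL/dw, dL/db)
    return (2 * w, 2 * b)

w, b = 3.0, 4.0
step_size = 0.1  # assumed value for illustration
initial_loss = loss(w, b)

for step in range(50):
    grad_w, grad_b = gradient(w, b)
    # Step against the gradient, i.e. downhill
    w = w - step_size * grad_w
    b = b - step_size * grad_b

final_loss = loss(w, b)
print(f"Loss before: {initial_loss:.4f}")
print(f"Loss after 50 steps: {final_loss:.3e}")
```

Each pass recomputes the gradient at the new position, so the direction of travel is corrected on every step.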
The Learning Rate: How Much to Step
Now we know which direction to step in, but not how large the step should be. Does it really matter? What happens if we take a big step or a tiny one?
Too Large: The Loss Explodes
If the step size is too large, we overshoot the minimum entirely. We land on the other side of the bowl, higher than where we started. The next gradient points back the other way. We overshoot again. The loss doesn't decrease — it explodes.
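For the loss L(w, b) = w² + b² used in this post, the overshoot can be worked out exactly. Here η is a symbol assumed for the step size and t counts the steps. Neither appears in the original text.

```latex
\[
w_{t+1} = w_t - \eta \,\frac{\partial L}{\partial w}(w_t, b_t)
        = w_t - 2\eta\, w_t
        = (1 - 2\eta)\, w_t ,
\qquad
b_{t+1} = (1 - 2\eta)\, b_t
\]

\[
L(w_{t+1}, b_{t+1}) = (1 - 2\eta)^2 \, L(w_t, b_t)
\]

\[
\eta = 1.2 \;\Rightarrow\; 1 - 2\eta = -1.4 ,
\qquad (1 - 2\eta)^2 = 1.96
\]
```

So with a step size of 1.2, each update flips the sign of both parameters and nearly doubles the loss. For this particular loss, any step size above 1 makes the loss grow, while values between 0.5 and 1 overshoot the minimum but still converge.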
In practice, the training loss starts printing NaN. Your model is dead.
Let's simulate how the loss function explodes when the learning rate is deliberately too large:
# Simulating gradient descent — with a step size problem
def loss(w, b):
    return w**2 + b**2
w, b = 3.0, 4.0
step_size = 1.2 # deliberately too large
print("Step | w | b | Loss")
print("─" * 45)
for i in range(8):
    L = loss(w, b)
    print(f" {i:2d} | {w:+.4f} | {b:+.4f} | {L:.4f}")
    grad_w = partial_w(w)
    grad_b = partial_b(b)
    w = w - step_size * grad_w
    b = b - step_size * grad_b
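Eight steps are enough to see the loss growing, but not to see the NaN. The sketch below keeps the same oversized step and runs the update until the numbers leave the range of a 64-bit float. The 5000-step cap and the variable names are assumptions for this illustration.

```python
import math

w, b = 3.0, 4.0
step_size = 1.2  # same deliberately oversized step as above

first_inf_step = None
first_nan_step = None
for step in range(5000):
    # Use w * w instead of w ** 2: float multiplication overflows to inf,
    # whereas float ** raises OverflowError once the result is too large.
    current_loss = w * w + b * b
    if first_inf_step is None and math.isinf(current_loss):
        first_inf_step = step
    if math.isnan(current_loss):
        first_nan_step = step
        break
    # Plain gradient descent update for the loss w^2 + b^2
    w = w - step_size * (2.0 * w)
    b = b - step_size * (2.0 * b)

print(f"Loss first became inf at step {first_inf_step}")
print(f"Loss first became NaN at step {first_nan_step}")
```

The loss first overflows to infinity. Later the parameters themselves become infinite, the update computes infinity minus infinity, and the result is NaN. From that point on, every value stays NaN.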
Too Small: Training Stalls
If the step size is too small, the loss barely decreases from one step to the next. Training takes far longer than necessary, and progress can look stuck because each update is so tiny.
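To put a number on "forever", the sketch below counts how many updates it takes for the loss w² + b² to fall below a small threshold. The two step sizes, the tolerance and the helper name are assumptions chosen for this illustration.

```python
def steps_to_converge(step_size, tolerance=1e-6, max_steps=100_000):
    """Count gradient descent steps until w^2 + b^2 drops below tolerance."""
    w, b = 3.0, 4.0
    for step in range(max_steps):
        if w ** 2 + b ** 2 < tolerance:
            return step
        w -= step_size * 2 * w
        b -= step_size * 2 * b
    return None  # did not converge within max_steps

tiny = steps_to_converge(0.001)     # very small step size
moderate = steps_to_converge(0.1)   # moderate step size

print(f"Steps needed with step size 0.001: {tiny}")
print(f"Steps needed with step size 0.1:   {moderate}")
```

On this toy loss the tiny step size needs thousands of updates where the moderate one needs a few dozen. On a real model, each of those updates is an expensive pass over data.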
Just Right: Efficient Convergence
With an optimal step size, the model converges efficiently. The loss drops smoothly toward zero.
Technically, the step size is called the Learning Rate. And choosing it correctly is one of the most critical decisions in training a model.
The learning rate is not just a hyperparameter. It is the mechanism by which the gradient's direction gets translated into an actual update. Choosing an incorrect learning rate makes the gradient useless.
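A side-by-side run makes the three regimes concrete. The sketch below applies the same gradient descent update to w² + b² with four learning rates and a fixed budget of 10 steps. The specific rates and the helper name are assumptions for this illustration, and 0.5 is only "just right" for this particular loss.

```python
def loss(w, b):
    return w ** 2 + b ** 2

def run_gradient_descent(learning_rate, num_steps=10, w=3.0, b=4.0):
    """Run plain gradient descent on w^2 + b^2 and return the final loss."""
    for _ in range(num_steps):
        grad_w, grad_b = 2 * w, 2 * b
        # The learning rate scales the gradient into the actual update
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return loss(w, b)

results = {}
for learning_rate in (0.001, 0.1, 0.5, 1.2):
    results[learning_rate] = run_gradient_descent(learning_rate)
    print(f"learning rate {learning_rate:<5} -> loss after 10 steps: "
          f"{results[learning_rate]:.6g}")
```

The gradient is identical in all four runs at the starting point. Only the learning rate differs, and it alone decides whether the loss crawls, converges, or blows up.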
Summary
- A derivative measures slope in one dimension. The moment we have another parameter, one derivative isn't enough.
- Partial derivatives let you measure the slope one variable at a time — freeze everything else and differentiate normally.
- Pack all those partial derivatives into a single vector and we get the gradient. It is the object that points in the direction of steepest ascent, irrespective of the number of dimensions.
- Flip the sign of the gradient, take a step and repeat — that is Gradient Descent.
But Do We Really Need to Update All the Weights?
We now understand how to modify the weights. But do we really need to update all of them?
The short answer is no, and the reason comes from a matrix decomposition: Singular Value Decomposition (SVD).
SVD underpins the low-rank idea behind Low-Rank Adaptation (LoRA), a technique that makes fine-tuning models with billions of parameters practical.
That is what we are going to see in the next part of the series.
What We Built in the Python Notebook
In the companion notebook, we implement everything from scratch:
- Partial derivative functions for a 2-parameter loss surface
- Gradient vector construction using NumPy
- Magnitude calculation using np.linalg.norm
- Gradient descent simulation with a deliberately oversized step size
- Comparison of convergence across small, large, and optimal learning rates
Get the Code
Want to experiment with gradient descent from scratch? Check out the Python Notebook with the complete implementation and visualizations.
This post is part of the "Mathematics for Machine Learning" series. For the next part, check out: Singular Value Decomposition and Low-Rank Adaptation (LoRA).