The Gradient Vector: The One Mathematical Object That Trains Every Neural Network

If you are standing on a mountain and want to reach the bottom, you check the slope under your feet and start walking downhill.

How does a machine do the same thing? If the input data contains just one feature, the model has a single weight to adjust in order to minimize the loss. It only has to decide whether to move that weight left or right, and the sign of the slope tells it which.

What if there is more than one feature?

GPT-3, a predecessor of the models behind ChatGPT, has 175 billion parameters. Training it means searching for the lowest point in a 175-billion-dimensional space.

That is the mathematical problem every AI trainer faces. And there is a solution for it.

The Single Parameter Case

If we just have a single parameter, the problem gets easy — we calculate the slope and decide which way to move.
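As a minimal sketch of the single-parameter case, assume a hypothetical one-weight loss L(w) = w² (chosen here only because it matches the bowl-shaped loss used later in the post). The names `slope_1d` and `direction_to_move` are illustrative, not from the companion notebook. The sign of the slope alone decides which way to move:

```python
# Hypothetical one-parameter loss for illustration: L(w) = w**2
def slope_1d(w):
    # Derivative of w**2 with respect to w
    return 2 * w

def direction_to_move(w):
    # A positive slope means the loss rises to the right, so we move left.
    # A negative slope means the loss rises to the left, so we move right.
    s = slope_1d(w)
    if s > 0:
        return "left"
    if s < 0:
        return "right"
    return "stay"

for w in (3.0, -2.0, 0.0):
    print(f"w = {w:+.1f} | slope = {slope_1d(w):+.1f} | move {direction_to_move(w)}")
```

With one parameter there are only two possible directions, which is why a single number is enough to choose between them.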

The moment we add a second parameter, a single slope is no longer enough. Imagine standing on a mountainside or somewhere inside a bowl: there are infinitely many directions you could move in, and choosing the right one is the key.

We can't calculate slope in all directions at once. Then the obvious question is: how do we find the slope in a multi-dimensional space?

Partial Derivatives: One Direction at a Time

What if we calculate the slope for one direction at a time, and keep all others constant?

Take an example with two parameters, a weight w and a bias b. Our loss function is the sum of their squares: L(w, b) = w² + b².

First, we freeze the bias, treat it as a constant, and differentiate with respect to the weight. The curly symbol ∂ denotes a partial derivative. It is not the same as a regular derivative, which applies when there is only a single parameter.

Next, we freeze the weight and calculate the partial derivative for the bias.
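Written out for the sum-of-squares loss described above, the two partial derivatives work out as follows. The symbols L, w, and b are shorthand assumed here for the loss, weight, and bias:

```latex
\begin{aligned}
L(w, b) &= w^2 + b^2 \\
\frac{\partial L}{\partial w} &= 2w \qquad \text{($b$ is frozen, so $b^2$ is a constant and its derivative is $0$)} \\
\frac{\partial L}{\partial b} &= 2b \qquad \text{($w$ is frozen, so $w^2$ is a constant and its derivative is $0$)}
\end{aligned}
```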

# Partial derivative of the loss w**2 + b**2 with respect to the weight (bias held constant)
def partial_w(w):
    return 2 * w

# Partial derivative of the loss with respect to the bias (weight held constant)
def partial_b(b):
    return 2 * b

# Assume initial coordinates at weight 3 and bias 4
w, b = 3.0, 4.0
print(f"Slope along the weight axis: {partial_w(w)}")   # → 6.0
print(f"Slope along the bias axis:   {partial_b(b)}")   # → 8.0

We have slopes in two different directions.

If you are standing at a weight of 3 and a bias of 4, then the slope along the weight axis is 6 and the slope along the bias axis is 8.
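One way to sanity-check these two slopes is to measure them numerically: nudge one parameter by a tiny amount, keep the other frozen, and see how much the loss changes. The sketch below uses a central difference with an assumed step `h = 1e-6`; the helper names are hypothetical and not part of the original code.

```python
def loss(w, b):
    return w ** 2 + b ** 2

def numeric_partial_w(w, b, h=1e-6):
    # Nudge only w in both directions; b stays frozen
    return (loss(w + h, b) - loss(w - h, b)) / (2 * h)

def numeric_partial_b(w, b, h=1e-6):
    # Nudge only b in both directions; w stays frozen
    return (loss(w, b + h) - loss(w, b - h)) / (2 * h)

w, b = 3.0, 4.0
approx_w = numeric_partial_w(w, b)
approx_b = numeric_partial_b(w, b)
print(f"Numerical slope along the weight axis: {approx_w:.4f}")
print(f"Numerical slope along the bias axis:   {approx_b:.4f}")
```

The numerical estimates should agree closely with the analytic values 2w and 2b, which is a useful habit whenever a derivative is written by hand.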

We have two numbers, one for each axis. But we cannot move in two different directions at once. We need a way to combine them.

The Gradient Vector

We pack these slopes into a single vector. And that is the solution.

We take the partial derivatives, one per parameter, and stack them into a single vector. In vector calculus, that vector is called the gradient.

It is denoted by the upside-down triangle ∇, called nabla. Wherever you see it applied to a loss function, it means the vector of all partial derivatives of that loss, one entry per parameter.

No matter how many dimensions we have — 2, 3 or 175 billion — the gradient vector always points in the exact direction of steepest uphill climb.

import numpy as np

# Reuses partial_w and partial_b from the previous snippet
# Initialize the model weight and bias
w, b = 3.0, 4.0

# Combine individual partial derivatives into the gradient vector
gradient = np.array([partial_w(w), partial_b(b)])
print(f"Gradient vector: {gradient}")        # → [6. 8.]
print("Direction: steepest uphill climb")
print(f"Magnitude: {np.linalg.norm(gradient):.4f}")   # → 10.0000
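The claim that the gradient points in the direction of steepest climb can be checked with a small experiment. The slope along any unit direction u is the dot product of the gradient with u, so the sketch below samples 360 unit directions around the circle (an arbitrary resolution chosen for illustration) and finds the one with the largest slope. The gradient `[6, 8]` is hard-coded from the example above.

```python
import numpy as np

gradient = np.array([6.0, 8.0])            # gradient of w**2 + b**2 at (3, 4)
unit_gradient = gradient / np.linalg.norm(gradient)

# Sample unit directions at 1-degree intervals around the full circle
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Slope along each unit direction u is the dot product gradient . u
slopes = directions @ gradient

best = directions[np.argmax(slopes)]
print(f"Steepest sampled direction: {best}")
print(f"Unit gradient direction:    {unit_gradient}")
print(f"Largest sampled slope:      {slopes.max():.4f}")
print(f"Gradient magnitude:         {np.linalg.norm(gradient):.4f}")
```

The steepest sampled direction should line up with the normalized gradient (up to the 1-degree sampling resolution), and the largest slope should approach the gradient's magnitude. That is also why the magnitude matters: it tells you how steep the steepest direction is.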

Gradient Descent: The Engine of Every Neural Network

Our goal is to reduce the loss, so we negate the gradient by multiplying it by negative one. Now the vector points in the steepest downhill direction.

You compute the gradient, take a step in the opposite direction, recompute the gradient, take another step and repeat.

That loop is called Gradient Descent. It is the engine that trains every neural network.
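Here is a minimal sketch of that loop on the same two-parameter bowl, L = w² + b². The step size of 0.1 and the 50 iterations are assumptions picked for illustration, and the function names are hypothetical.

```python
import numpy as np

def loss(params):
    # Sum of squares of all parameters: w**2 + b**2 for params = [w, b]
    return float(np.sum(params ** 2))

def gradient(params):
    # Vector of partial derivatives: [2w, 2b]
    return 2.0 * params

params = np.array([3.0, 4.0])   # start at weight 3, bias 4
step_size = 0.1                 # assumed value for this sketch
losses = []

for step in range(50):
    losses.append(loss(params))
    # Step in the direction opposite to the gradient
    params = params - step_size * gradient(params)

print(f"Starting loss: {losses[0]:.4f}")
print(f"Final loss:    {loss(params):.10f}")
print(f"Final params:  {params}")
```

Every pass through the loop does the same three things: compute the gradient, flip its sign, and take a step. Nothing in the loop depends on there being only two parameters, so the same code works for any length of `params`.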

When we see a training loss curve drop smoothly toward zero, that is what is happening underneath — repeated millions of times. It's not magic, it's calculus.

The Learning Rate: How Much to Step

Now we know which direction to step in, but we don't know how big the step should be. Does it really matter? What happens if we take a big step or a tiny step?

Too Large: The Loss Explodes

If the step size is too large, we overshoot the minimum entirely. We land on the other side of the bowl, higher than where we started. The next gradient points back the other way. We overshoot again. The loss doesn't decrease — it explodes.

In practice, the values keep growing until they overflow, and the training loss starts printing NaN. Your model is dead.

Let's simulate how the loss function explodes when the learning rate is deliberately too large:

# Simulating gradient descent with a deliberately oversized step size
# (reuses partial_w and partial_b from the earlier snippet)
def loss(w, b):
    return w**2 + b**2

w, b = 3.0, 4.0
step_size = 1.2          # deliberately too large

print("Step | w        | b        | Loss")
print("─" * 45)
for i in range(8):
    current_loss = loss(w, b)
    print(f"  {i:2d} | {w:+.4f}  | {b:+.4f}  | {current_loss:.4f}")
    grad_w = partial_w(w)
    grad_b = partial_b(b)
    # Step against the gradient; at this step size every update overshoots the minimum
    w = w - step_size * grad_w
    b = b - step_size * grad_b
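For this particular loss, the blow-up can be worked out by hand. Writing the step size as η (a symbol assumed here; the code calls it `step_size`), each update rescales both parameters by the same factor:

```latex
\begin{aligned}
w_{t+1} &= w_t - \eta \cdot 2 w_t = (1 - 2\eta)\, w_t \\
b_{t+1} &= b_t - \eta \cdot 2 b_t = (1 - 2\eta)\, b_t \\
L_{t+1} &= w_{t+1}^2 + b_{t+1}^2 = (1 - 2\eta)^2 \, L_t
\end{aligned}
```

With η = 1.2 the factor is 1 − 2.4 = −1.4, so the parameters flip sign and grow by 40% on every step, and the loss is multiplied by 1.96 each time. On this bowl the loss shrinks only when |1 − 2η| < 1, that is, when 0 < η < 1. That threshold is specific to this loss function and does not carry over to other models.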

Too Small: Training Stalls

If the step size is too small, the loss barely decreases from one step to the next. Training takes forever, and in practice it often stalls before reaching a good minimum.
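To get a feel for how slow "too small" can be, the sketch below counts how many updates it takes to push the loss below a target on the same bowl, starting from weight 3 and bias 4. The target of `1e-6` and the three step sizes are arbitrary values assumed for illustration, and `steps_to_reach` is a hypothetical helper.

```python
def steps_to_reach(target_loss, step_size, w=3.0, b=4.0, max_steps=1_000_000):
    # Run gradient descent on w**2 + b**2 until the loss drops below target_loss
    steps = 0
    while w * w + b * b > target_loss and steps < max_steps:
        w -= step_size * 2 * w
        b -= step_size * 2 * b
        steps += 1
    return steps

for step_size in (0.1, 0.01, 0.001):
    n = steps_to_reach(1e-6, step_size)
    print(f"step size {step_size:<6} | steps needed: {n}")
```

Each tenfold reduction in step size costs roughly ten times as many updates on this loss. On a model where a single update is expensive, that difference decides whether training finishes at all.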

Just Right: Efficient Convergence

With a well-chosen step size, the model converges efficiently. The loss drops smoothly toward zero.
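The three regimes can be put side by side with a small sketch. It runs 8 steps on the same bowl from weight 3 and bias 4 and reports the final loss for each step size. The values 0.001, 0.3, and 1.2 are assumptions chosen to represent "too small", "reasonable", and "too large" for this loss only; `run_descent` is a hypothetical helper.

```python
def run_descent(step_size, steps=8, w=3.0, b=4.0):
    # Gradient descent on w**2 + b**2; returns the loss after the given number of steps
    for _ in range(steps):
        w -= step_size * 2 * w
        b -= step_size * 2 * b
    return w * w + b * b

start_loss = 3.0 ** 2 + 4.0 ** 2   # 25.0
candidates = (("too small", 0.001), ("reasonable", 0.3), ("too large", 1.2))

print(f"Starting loss: {start_loss:.4f}")
for label, step_size in candidates:
    final_loss = run_descent(step_size)
    print(f"{label:<10} (step size {step_size}) | loss after 8 steps: {final_loss:.6f}")
```

The small step size should leave the loss almost where it started, the middle one should bring it close to zero, and the large one should end far above the starting loss. Which value counts as "reasonable" depends on the loss surface, so in practice it is found by experiment.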

Technically, the step size is called the Learning Rate. And choosing it correctly is one of the most critical decisions in training a model.

The learning rate is not just a hyperparameter. It is the mechanism by which the gradient's direction gets translated into an actual update. A badly chosen learning rate wastes even a correctly computed gradient.

Summary

A derivative measures slope in one dimension. The moment we have another parameter, one derivative isn't enough.
Partial derivatives let you measure the slope one variable at a time — freeze everything else and differentiate normally.
Pack all those partial derivatives into a single vector and we get the gradient. It is the object that points in the direction of steepest ascent, irrespective of the number of dimensions.
Flip the sign of the gradient, take a step and repeat — that is Gradient Descent.

But Do We Really Need to Update All the Weights?

We now understand how to modify the weights. But do we really need to update all the weights?

The short answer is no, and the reason comes from matrix decomposition.

The decomposition in question is Singular Value Decomposition, the mathematical foundation of Low-Rank Adaptation (LoRA), the technique that makes fine-tuning models with billions of parameters practical.

And that is what we are going to see in the next video.

What We Built in the Python Notebook

In the companion notebook, we implement everything from scratch:

Partial derivative functions for a 2-parameter loss surface
Gradient vector construction using NumPy
Magnitude calculation using np.linalg.norm
Gradient descent simulation with a deliberately oversized step size
Comparison of convergence across small, large, and optimal learning rates

Get the Code

Want to experiment with gradient descent from scratch? Check out the Python Notebook with the complete implementation and visualizations.

This post is part of the "Mathematics for Machine Learning" series. For the next part, check out: Singular Value Decomposition and Low-Rank Adaptation (LoRA).
