# Gradient Tracer: Building Autograd from First Principles
## The torch.autograd Trap
Here is how most ML tutorials teach gradients:
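A minimal version of that tutorial one-liner might look like this (a sketch using the same illustrative f(x) = x²·sin(x) from later in this lesson; any differentiable function works):

```python
import torch

# Define a scalar computation, call backward(), read off x.grad. Done.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 * torch.sin(x)   # f(x) = x^2 * sin(x)
y.backward()                # PyTorch walks its hidden graph in reverse

print(x.grad)               # dy/dx = 2x*sin(x) + x^2*cos(x), evaluated at 2.0
```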
One call. Done. The answer appears. But what *is* x.grad? Where did it come from?
PyTorch built an invisible directed acyclic graph (DAG) behind your back, walked
it in reverse, and applied the chain rule at every node — all hidden inside a C++
runtime you have never read. The gradient is correct, but you learned nothing about
how it was produced.
This matters because when your gradient is NaN, or when you accidentally share a
computation graph across training steps and PyTorch throws:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed).
…you have no mental model to debug it. You are poking a black box.
This lesson opens the box.
---
## The Failure Mode: What Naive from-Scratch Looks Like
The obvious first attempt at "compute gradients manually" is finite differences:
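A sketch of that naive approach, using a NumPy forward difference (the function and helper name here are illustrative):

```python
import numpy as np

def numerical_grad(f, x, h):
    """Naive forward-difference gradient: perturb one input at a time by h."""
    x = np.asarray(x, dtype=np.float64)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp.flat[i] += h
        grad.flat[i] = (f(xp) - f(x)) / h
    return grad

f = lambda x: float(x[0] ** 2 * np.sin(x[0]))   # f(x) = x^2 * sin(x)
print(numerical_grad(f, [2.0], h=1e-5))    # close to the true 1.9726...
print(numerical_grad(f, [2.0], h=1e-16))   # perturbation rounds away: [0.]
```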
Run this with h = 1e-16 on f(x) = x² · sin(x) at x = 2.0:

numerical_grad:  [ 0.        ]  ← all zeros!
analytical_grad: [ 1.97260236]  ← correct (2x·sin x + x²·cos x)

This is **catastrophic cancellation**. At h = 1e-16, the perturbation is smaller
than half the spacing between adjacent float64 values near x = 2.0, so x + h
rounds back to exactly x and f(x+h) − f(x) is identically zero. Even at slightly
larger h, subtracting two nearly-equal numbers destroys most of the ~15-16
significant digits float64 carries: the "gradient" becomes numerical noise
divided by a tiny number, which yields zero or garbage.
The stable zone is h ≈ 1e-5. This is not arbitrary: it balances truncation error
(the Taylor-series error of the forward difference scales as h) against
cancellation error (which scales as ε_machine / h). Their sum is minimized near
h = sqrt(ε_machine) ≈ 1e-8, where the two terms are equal, but 1e-5 gives a
comfortable margin in practice.
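The two error regimes are easy to see with a small sweep over h (same illustrative f as above):

```python
import numpy as np

f = lambda x: x ** 2 * np.sin(x)
# Analytic derivative of x^2 * sin(x) at x = 2.0: 2x*sin(x) + x^2*cos(x)
true_grad = 2 * 2.0 * np.sin(2.0) + 2.0 ** 2 * np.cos(2.0)

for h in [1e-1, 1e-3, 1e-5, 1e-8, 1e-11, 1e-13, 1e-15]:
    approx = (f(2.0 + h) - f(2.0)) / h
    print(f"h = {h:.0e}   error = {abs(approx - true_grad):.2e}")
# Error falls as h shrinks (truncation ~ h), bottoms out near h = 1e-8,
# then climbs again as cancellation (~ eps/h) takes over.
```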
This is lesson one: even before you touch a neural network, numerical precision
bites you. Knowing *why* makes you dangerous in the best way.
---
## The ScratchAI Architecture: A Tape-Based Gradient Engine
We implement **reverse-mode automatic differentiation** — the algorithm that powers
every major ML framework's backward() call.
### The Core Idea: Record, Then Replay in Reverse
Every time you perform an arithmetic operation (add, multiply, sin, exp…), we:
1. **Compute the forward value** using NumPy
2. **Record a backward closure** — a function that knows how to push gradient
*back* to the inputs of this operation
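For a single multiply, the record-then-replay idea looks like this (names are illustrative, not the engine's actual API):

```python
x, y = 3.0, 4.0
grads = {"x": 0.0, "y": 0.0}

# 1. Forward: compute the value.
z = x * y

# 2. Record: a closure that pushes dL/dz back to the inputs via the chain rule.
def z_backward(dz):
    grads["x"] += y * dz   # dz/dx = y
    grads["y"] += x * dz   # dz/dy = x

# Backward pass: seed with dL/dz = 1 and replay the recorded closure.
z_backward(1.0)
print(grads)   # {'x': 4.0, 'y': 3.0}
```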
The entire computation is a directed acyclic graph (DAG) of Tensor nodes. Each
Tensor stores its forward value (a NumPy array), a .grad accumulator, its
_backward closure, and references to the parent Tensors that produced it.
After the forward pass, calling backward() does a **topological sort** of the
DAG and walks the sorted list in reverse, so each node's _backward closure runs
with outputs before inputs. Each closure applies the local Jacobian (chain rule)
and accumulates gradient into its parents' .grad.
### Why Topological Sort?
You can only apply the chain rule correctly at a node once *all* gradient
contributions to that node have been accumulated; only then is it safe to
propagate further back. Reversed topological order guarantees this: we process
outputs before inputs, leaf nodes last.
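Putting the pieces together, a minimal micrograd-style scalar engine (a sketch, not ScratchAI's actual class) might look like this; the diamond-shaped graph at the bottom shows why accumulation in topological order matters:

```python
class Tensor:
    """Minimal scalar autograd node: value, grad, backward closure, parents."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._parents = set(parents)

    def __add__(self, other):
        out = Tensor(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # d(a+b)/da = 1
            other.grad += out.grad           # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Tensor(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort: parents are appended before children...
        topo, visited = [], set()
        def build(node):
            if node not in visited:
                visited.add(node)
                for p in node._parents:
                    build(p)
                topo.append(node)
        build(self)
        self.grad = 1.0   # seed dL/dL = 1
        # ...so the reversed list visits outputs before inputs.
        for node in reversed(topo):
            node._backward()

# Diamond graph: a feeds both branches, so its gradient must accumulate.
a = Tensor(3.0)
b = a * a          # b = a^2
c = a + b          # c = a + a^2
c.backward()
print(a.grad)      # dc/da = 1 + 2a = 7.0
```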
### The Jacobian: What We Are Actually Computing
For a scalar function L and a vector input x ∈ ℝⁿ, the gradient is:
∇L = [∂L/∂x₁, ∂L/∂x₂, ..., ∂L/∂xₙ]
For a vector-to-vector function f: ℝⁿ → ℝᵐ, this generalizes to the **Jacobian
matrix** J ∈ ℝᵐˣⁿ:
J[i,j] = ∂fᵢ/∂xⱼ
Finite-difference approximation computes each column of J independently by
perturbing one input at a time — O(n) forward passes for n inputs. Reverse-mode
AD computes the full gradient vector in **one backward pass**, regardless of n.
This is why reverse-mode (not forward-mode) dominates ML: you have millions of
parameters but only one scalar loss.
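The O(n) cost of the finite-difference route is visible directly in code (a sketch; the helper name and test function are illustrative):

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Approximate J[i, j] = df_i/dx_j, one forward-difference column per input."""
    x = np.asarray(x, dtype=np.float64)
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):      # n + 1 forward passes for n inputs
        xp = x.copy()
        xp[j] += h
        J[:, j] = (f(xp) - y) / h
    return J

# f: R^2 -> R^2, with known Jacobian [[x1, x0], [2*x0, 0]]
f = lambda x: np.array([x[0] * x[1], x[0] ** 2])
print(numerical_jacobian(f, [2.0, 3.0]).round(4))   # ~[[3. 2.] [4. 0.]]
```

Reverse-mode AD replaces the loop over columns with a single backward sweep, which is exactly the trade that makes backward() cheap for a million-parameter model with one scalar loss.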
---