Day 2 · Lesson 2 · 60 min

# Gradient Tracer: Building Autograd from First Principles

## Component Architecture

[Diagram: the ScratchAI pipeline, with this lesson's components highlighted. Data input (x, y arrays) → forward pass f(x) recorded on a tape → tape + backward() over the topologically sorted graph, computing ∂L/∂x per node → gradient check comparing the analytical vs. numerical Jacobian J → SGD optimizer step θ ← θ − lr·∇. The tape and backward pass are this lesson; the optimizer is an adjacent component.]

Reverse-mode AD (this lesson) computes ∇L in ONE backward pass for ALL parameters. Finite differences (the verification tool) need one forward pass per input dimension. That difference is why reverse-mode dominates ML at scale.

## The torch.autograd Trap

### State Machine

[Diagram: tensor lifecycle, from creation through convergence or divergence. Created (data=ndarray, grad=0) → arithmetic op → Forward (tape active, _backward closure registered) → final output → Loss computed (scalar output, .backward() ready) → Gradient computed (.grad populated in topological order; graph freed after backward) → either Converged ✓ (‖∇L‖ below a small threshold) or Diverged ✗ (‖∇L‖ → ∞). .zero_grad() plus the θ update loops back for the next epoch.]

### Flowchart

[Diagram: forward tape construction → topological sort → backward gradient accumulation for y = x²·sin(x) + eˣ, with inputs in blue, operations in green, outputs in coral; all nodes shape (n,), float64. Each op stores its forward value and a backward rule: x² → 2x · out.grad; sin(x) → cos(x) · out.grad; the multiply (x²)·sin(x) applies the product rule with += per parent; eˣ reuses the value stored at creation since ∂/∂x of eˣ = eˣ. .backward() seeds the output grad with 1; _topo_sort() visits children before parents, so outputs come first and inputs last; gradients always accumulate with +=, never =.]
Here is how most ML tutorials teach gradients:
```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x[0] ** 2) * torch.sin(x[1]) + torch.exp(x[0])
y.backward()
print(x.grad)  # tensor([ 7.9535, -3.9600])
```
One call. Done. The answer appears. But what *is* x.grad? Where did it come from?
PyTorch built an invisible directed acyclic graph (DAG) behind your back, walked
it in reverse, and applied the chain rule at every node — all hidden inside a C++
runtime you have never read. The gradient is correct, but you learned nothing about
how it was produced.

This matters because when your gradient is NaN, or when you accidentally share a
computation graph across training steps and PyTorch throws:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed).

…you have no mental model to debug it. You are poking a black box.

This lesson opens the box.

---

## The Failure Mode: What Naive from-Scratch Looks Like

The obvious first attempt at "compute gradients manually" is finite differences:
```python
import numpy as np

def numerical_grad(f, x, h=1e-15):  # ← THE BUG: h is far too small
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        grad[i] = (f(x_plus) - f(x)) / h  # forward difference
    return grad
```
Run this with h = 1e-15 on f(x) = x² * sin(x) at x = 2.0:

numerical_grad:  [-0.        ]  ← all zeros!
analytical_grad: [ 1.97260236]  ← correct (2x·sin x + x²·cos x at x = 2)

This is **catastrophic cancellation**. At h = 1e-15, f(x+h) and f(x) are
identical to float64 precision (which has ~15-16 significant digits). Subtracting
two nearly-equal numbers destroys all meaningful bits. The "gradient" becomes
numerical noise divided by a tiny number — which yields zero or garbage.

The stable zone is h ≈ 1e-5. This is not arbitrary: it balances truncation error
(the Taylor-series error of the forward difference scales as h) against cancellation
error (which scales as ε_machine / h). Their sum is minimized near
h = sqrt(ε_machine) ≈ 1e-8; h = 1e-5 trades a little truncation error for a
comfortable safety margin against cancellation. (Gradient checkers usually prefer
the central difference (f(x+h) − f(x−h)) / 2h, whose truncation error scales as h²,
which makes h ≈ 1e-5 nearly optimal.)
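You can watch this error curve directly. The sketch below sweeps h for the lesson's f(x) = x²·sin(x); it uses the central-difference variant rather than the buggy forward difference above, and the helper names (`numerical_grad`, `analytic`) are illustrative:

```python
import numpy as np

def numerical_grad(f, x, h):
    """Central-difference gradient: truncation error O(h²) instead of O(h)."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        grad[i] = (f(xp) - f(xm)) / (2 * h)
    return grad

f = lambda x: np.sum(x**2 * np.sin(x))            # f(x) = x²·sin(x)
analytic = lambda x: 2*x*np.sin(x) + x**2*np.cos(x)  # product rule by hand

x = np.array([2.0])
for h in (1e-1, 1e-5, 1e-15):
    err = abs(numerical_grad(f, x, h)[0] - analytic(x)[0])
    print(f"h = {h:.0e}   abs error = {err:.3e}")
```

At h = 1e-1 truncation dominates, at h = 1e-5 both error sources are tiny, and at h = 1e-15 cancellation destroys the result: a U-shaped error curve with its floor near the conventional choice.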

This is lesson one: even before you touch a neural network, numerical precision
bites you. Knowing *why* makes you dangerous in the best way.

---

## The ScratchAI Architecture: A Tape-Based Gradient Engine

We implement **reverse-mode automatic differentiation** — the algorithm that powers
every major ML framework's backward() call.

### The Core Idea: Record, Then Replay in Reverse

Every time you perform an arithmetic operation (add, multiply, sin, exp…), we:
1. **Compute the forward value** using NumPy
2. **Record a backward closure** — a function that knows how to push gradient
   *back* to the inputs of this operation

The entire computation is a directed acyclic graph (DAG) of Tensor nodes. Each
Tensor stores:
```python
self.data        # np.ndarray — the actual value
self.grad        # np.ndarray — accumulated gradient (∂L/∂self)
self._backward   # Callable   — "how to push grad to my parents"
self._children   # set        — nodes I was computed from
```
After the forward pass, calling backward() does a **topological sort** of the
DAG (children before parents, reversed), then calls each _backward closure in
that order. Each closure applies the local Jacobian (chain rule) and accumulates
into its parents' .grad.

### Why Topological Sort?

You can only apply the chain rule correctly if you have already accumulated *all*
gradient contributions to a node before propagating further back. Topological order
guarantees this: we process outputs before inputs, leaf nodes last.
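The record-then-replay loop can be sketched end to end. This is a minimal illustrative engine, not the full ScratchAI code: the attributes mirror the list above, but the method names and op coverage are assumptions for the sketch.

```python
import numpy as np

class Tensor:
    """Minimal reverse-mode AD node: value + backward closure + parents."""
    def __init__(self, data, children=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None   # closure installed by the op that made us
        self._children = set(children)  # nodes this one was computed from

    def __mul__(self, other):
        out = Tensor(self.data * other.data, (self, other))
        def _backward():
            self.grad  += other.data * out.grad  # product rule: += never =
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Tensor(self.data + other.data, (self, other))
        def _backward():
            self.grad  += out.grad   # local Jacobian of + is identity
            other.grad += out.grad
        out._backward = _backward
        return out

    def sin(self):
        out = Tensor(np.sin(self.data), (self,))
        def _backward():
            self.grad += np.cos(self.data) * out.grad  # d/dx sin(x) = cos(x)
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort: children appended before parents,
        # so reversed(topo) visits outputs first, inputs last.
        topo, visited = [], set()
        def build(node):
            if node not in visited:
                visited.add(node)
                for child in node._children:
                    build(child)
                topo.append(node)
        build(self)
        self.grad = np.ones_like(self.data)  # seed: ∂L/∂L = 1
        for node in reversed(topo):
            node._backward()

x = Tensor([2.0])
y = (x * x) * x.sin()   # y = x²·sin(x)
y.backward()
print(x.grad)           # 2x·sin(x) + x²·cos(x) at x = 2 ≈ 1.9726
```

Note that `x * x` pushes gradient into the same node twice through its two parent slots; the `+=` accumulation is what makes that correct.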

### The Jacobian: What We Are Actually Computing

For a scalar function L and a vector input x ∈ ℝⁿ, the gradient is:

∇L = [∂L/∂x₁, ∂L/∂x₂, ..., ∂L/∂xₙ]

For a vector-to-vector function f: ℝⁿ → ℝᵐ, this generalizes to the **Jacobian
matrix** J ∈ ℝᵐˣⁿ:

J[i,j] = ∂fᵢ/∂xⱼ

Finite-difference approximation computes each column of J independently by
perturbing one input at a time — O(n) forward passes for n inputs. Reverse-mode
AD computes the full gradient vector in **one backward pass**, regardless of n.
This is why reverse-mode (not forward-mode) dominates ML: you have millions of
parameters but only one scalar loss.
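To make the O(n) cost concrete, here is a sketch of a finite-difference Jacobian builder (illustrative, not the lesson's grad-check code). Each column requires its own pair of forward passes:

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-5):
    """Build J ∈ ℝ^(m×n) one column at a time: 2n forward passes total."""
    y0 = np.asarray(f(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):                    # one perturbation per input dim
        xp, xm = x.copy(), x.copy()
        xp[j] += h
        xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)    # central-difference column j
    return J

# f: ℝ² → ℝ², the same ops used throughout the lesson
f = lambda x: np.array([x[0]**2 * np.sin(x[1]), np.exp(x[0])])
print(numerical_jacobian(f, np.array([2.0, 3.0])))
# Analytic J = [[2x₀·sin x₁, x₀²·cos x₁],
#               [e^(x₀),     0         ]]
```

For n = 2 this is cheap; for n = 10⁷ parameters it is 2·10⁷ forward passes, while reverse-mode still needs exactly one backward pass.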

---