# Gradient Tracer: Building Autograd from First Principles
## The torch.autograd Trap
Here is how most ML tutorials teach gradients:
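A minimal version of that tutorial one-liner might look like this (a sketch using the same illustrative f(x) = x²·sin(x) from later in this lesson; any differentiable function works):

```python
import torch

# Define a scalar computation, call backward(), read off x.grad. Done.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 * torch.sin(x)   # f(x) = x^2 * sin(x)
y.backward()                # PyTorch walks its hidden graph in reverse

print(x.grad)               # dy/dx = 2x*sin(x) + x^2*cos(x), evaluated at 2.0
```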
One call. Done. The answer appears. But what *is* x.grad? Where did it come from?
PyTorch built an invisible directed acyclic graph (DAG) behind your back, walked
it in reverse, and applied the chain rule at every node — all hidden inside a C++
runtime you have never read. The gradient is correct, but you learned nothing about
how it was produced.
This matters because when your gradient is NaN, or when you accidentally share a
computation graph across training steps and PyTorch throws:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed).
…you have no mental model to debug it. You are poking a black box.
This lesson opens the box.
---
## The Failure Mode: What Naive from-Scratch Looks Like
The obvious first attempt at "compute gradients manually" is finite differences:
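A sketch of that naive approach, using a NumPy forward difference (the function and helper name here are illustrative):

```python
import numpy as np

def numerical_grad(f, x, h):
    """Naive forward-difference gradient: perturb one input at a time by h."""
    x = np.asarray(x, dtype=np.float64)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp.flat[i] += h
        grad.flat[i] = (f(xp) - f(x)) / h
    return grad

f = lambda x: float(x[0] ** 2 * np.sin(x[0]))   # f(x) = x^2 * sin(x)
print(numerical_grad(f, [2.0], h=1e-5))    # close to the true 1.9726...
print(numerical_grad(f, [2.0], h=1e-16))   # perturbation rounds away: [0.]
```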
Run this with h = 1e-16 on f(x) = x² · sin(x) at x = 2.0:

numerical_grad:  [ 0.        ]  ← all zeros!
analytical_grad: [ 1.97260236]  ← correct (2x·sin x + x²·cos x)

This is **catastrophic cancellation**. At h = 1e-16, the perturbation is smaller
than half the spacing between adjacent float64 values near x = 2.0, so x + h
rounds back to exactly x and f(x+h) − f(x) is identically zero. Even at slightly
larger h, subtracting two nearly-equal numbers destroys most of the ~15-16
significant digits float64 carries: the "gradient" becomes numerical noise
divided by a tiny number, which yields zero or garbage.
The stable zone is h ≈ 1e-5. This is not arbitrary: it balances truncation error
(the Taylor-series error of the forward difference scales as h) against
cancellation error (which scales as ε_machine / h). Their sum is minimized near
h = sqrt(ε_machine) ≈ 1e-8, where the two terms are equal, but 1e-5 gives a
comfortable margin in practice.
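The two error regimes are easy to see with a small sweep over h (same illustrative f as above):

```python
import numpy as np

f = lambda x: x ** 2 * np.sin(x)
# Analytic derivative of x^2 * sin(x) at x = 2.0: 2x*sin(x) + x^2*cos(x)
true_grad = 2 * 2.0 * np.sin(2.0) + 2.0 ** 2 * np.cos(2.0)

for h in [1e-1, 1e-3, 1e-5, 1e-8, 1e-11, 1e-13, 1e-15]:
    approx = (f(2.0 + h) - f(2.0)) / h
    print(f"h = {h:.0e}   error = {abs(approx - true_grad):.2e}")
# Error falls as h shrinks (truncation ~ h), bottoms out near h = 1e-8,
# then climbs again as cancellation (~ eps/h) takes over.
```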
This is lesson one: even before you touch a neural network, numerical precision
bites you. Knowing *why* makes you dangerous in the best way.
---
## The ScratchAI Architecture: A Tape-Based Gradient Engine
We implement **reverse-mode automatic differentiation** — the algorithm that powers
every major ML framework's backward() call.
### The Core Idea: Record, Then Replay in Reverse
Every time you perform an arithmetic operation (add, multiply, sin, exp…), we:
1. **Compute the forward value** using NumPy
2. **Record a backward closure** — a function that knows how to push gradient
*back* to the inputs of this operation
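For a single multiply, the record-then-replay idea looks like this (names are illustrative, not the engine's actual API):

```python
x, y = 3.0, 4.0
grads = {"x": 0.0, "y": 0.0}

# 1. Forward: compute the value.
z = x * y

# 2. Record: a closure that pushes dL/dz back to the inputs via the chain rule.
def z_backward(dz):
    grads["x"] += y * dz   # dz/dx = y
    grads["y"] += x * dz   # dz/dy = x

# Backward pass: seed with dL/dz = 1 and replay the recorded closure.
z_backward(1.0)
print(grads)   # {'x': 4.0, 'y': 3.0}
```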
The entire computation is a directed acyclic graph (DAG) of Tensor nodes. Each
Tensor stores its forward value (a NumPy array), a .grad accumulator, its
_backward closure, and references to the parent Tensors that produced it.
After the forward pass, calling backward() does a **topological sort** of the
DAG and walks the sorted list in reverse, so each node's _backward closure runs
with outputs before inputs. Each closure applies the local Jacobian (chain rule)
and accumulates gradient into its parents' .grad.
### Why Topological Sort?
You can only apply the chain rule correctly at a node once *all* gradient
contributions to that node have been accumulated; only then is it safe to
propagate further back. Reversed topological order guarantees this: we process
outputs before inputs, leaf nodes last.
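Putting the pieces together, a minimal micrograd-style scalar engine (a sketch, not ScratchAI's actual class) might look like this; the diamond-shaped graph at the bottom shows why accumulation in topological order matters:

```python
class Tensor:
    """Minimal scalar autograd node: value, grad, backward closure, parents."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._parents = set(parents)

    def __add__(self, other):
        out = Tensor(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # d(a+b)/da = 1
            other.grad += out.grad           # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Tensor(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort: parents are appended before children...
        topo, visited = [], set()
        def build(node):
            if node not in visited:
                visited.add(node)
                for p in node._parents:
                    build(p)
                topo.append(node)
        build(self)
        self.grad = 1.0   # seed dL/dL = 1
        # ...so the reversed list visits outputs before inputs.
        for node in reversed(topo):
            node._backward()

# Diamond graph: a feeds both branches, so its gradient must accumulate.
a = Tensor(3.0)
b = a * a          # b = a^2
c = a + b          # c = a + a^2
c.backward()
print(a.grad)      # dc/da = 1 + 2a = 7.0
```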
### The Jacobian: What We Are Actually Computing
For a scalar function L and a vector input x ∈ ℝⁿ, the gradient is:
∇L = [∂L/∂x₁, ∂L/∂x₂, ..., ∂L/∂xₙ]
For a vector-to-vector function f: ℝⁿ → ℝᵐ, this generalizes to the **Jacobian
matrix** J ∈ ℝᵐˣⁿ:
J[i,j] = ∂fᵢ/∂xⱼ
Finite-difference approximation computes each column of J independently by
perturbing one input at a time — O(n) forward passes for n inputs. Reverse-mode
AD computes the full gradient vector in **one backward pass**, regardless of n.
This is why reverse-mode (not forward-mode) dominates ML: you have millions of
parameters but only one scalar loss.
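The O(n) cost of the finite-difference route is visible directly in code (a sketch; the helper name and test function are illustrative):

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Approximate J[i, j] = df_i/dx_j, one forward-difference column per input."""
    x = np.asarray(x, dtype=np.float64)
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):      # n + 1 forward passes for n inputs
        xp = x.copy()
        xp[j] += h
        J[:, j] = (f(xp) - y) / h
    return J

# f: R^2 -> R^2, with known Jacobian [[x1, x0], [2*x0, 0]]
f = lambda x: np.array([x[0] * x[1], x[0] ** 2])
print(numerical_jacobian(f, [2.0, 3.0]).round(4))   # ~[[3. 2.] [4. 0.]]
```

Reverse-mode AD replaces the loop over columns with a single backward sweep, which is exactly the trade that makes backward() cheap for a million-parameter model with one scalar loss.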
---