Tensor Ops Lab: Building Autograd From Scratch
The torch.autograd Trap
Here is how most tutorials teach automatic differentiation:
One import. Three lines. You get a gradient. Tutorial complete.
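The canonical snippet looks something like this (a representative sketch, not a quote from any particular tutorial):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # one leaf tensor
loss = x ** 2 + 2 * x                      # forward pass silently builds a graph
loss.backward()                            # the magic button
print(x.grad)                              # tensor(8.) — d/dx(x² + 2x) at x = 3
```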
Except — you have no idea what just happened. You pressed a button labeled "compute all partial derivatives in a neural network with potentially 175 billion parameters" and it worked instantly, which means you have zero model for why it works or when it breaks. When your training loop produces NaN gradients at epoch 47, or you accidentally run .backward() twice and get garbage weights, or you use an in-place operation inside a computation graph and your loss mysteriously never decreases — none of your PyTorch intuition helps you. Because you never built any.
This lesson fixes that. We will implement autograd from scratch using only NumPy. By the time you finish, PyTorch's computation graph will not feel like magic — it will feel like a design decision you yourself could have made.
The Failure Mode: Silent Graph Corruption
Before we build the right thing, let's watch the wrong thing fail. Suppose you try to compute gradients using finite differences — the brute-force numerical approach:
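A minimal central-difference sketch (`finite_diff_grad` is our name for it, not a library function):

```python
import numpy as np

def finite_diff_grad(f, params, eps=1e-5):
    """Central differences: 2 forward passes per parameter."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        plus, minus = params.copy(), params.copy()
        plus.flat[i] += eps          # nudge one parameter up...
        minus.flat[i] -= eps         # ...and down
        grad.flat[i] = (f(plus) - f(minus)) / (2 * eps)
    return grad

# f(w) = sum(w^2)  ->  the exact gradient is 2w
w = np.array([1.0, 2.0, 3.0])
g = finite_diff_grad(lambda p: (p ** 2).sum(), w)   # ≈ [2., 4., 6.]
```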
This is mathematically correct. It is also catastrophically slow: for a network with N parameters, it requires 2N forward passes to compute one gradient update. A small MLP with 10,000 parameters needs 20,000 forward passes per training step. Reverse-mode autodiff (backpropagation) computes the exact same gradients in one forward pass + one backward pass, regardless of N. That is not a minor optimization — it is the reason deep learning became computationally feasible at all.
Now here's the silent error that kills beginners who do try to build a graph manually. Imagine you compute a forward pass, store intermediate results for the backward pass, then modify a tensor in-place before calling backward:
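A stripped-down illustration of the trap, assuming a hand-rolled backward pass that reads saved arrays (all names here are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.5, 0.5])

# Forward: y = w * x.  Save x for the backward pass -- by reference, not copy.
saved_x = x
y = w * x

x *= 10.0   # in-place mutation AFTER the forward pass

# Backward: dLoss/dw = saved_x * upstream_grad, but saved_x was mutated
grad_w = saved_x * np.ones_like(y)   # [10., 20., 30.] -- should be [1., 2., 3.]
```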
No error. No warning. Wrong gradient. This is the in-place mutation trap. PyTorch guards against it at runtime — for example, RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation — because engineers like us burned hours debugging this before the checks were added.
The ScratchAI Architecture: A Computation Graph in 150 Lines
Our autograd engine has one central class: Tensor. It wraps a NumPy array (data) and adds three fields: grad (the accumulated gradient), _backward (the chain-rule closure), and _children (the input tensors that produced it):
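A sketch of that skeleton (field names follow the conventions used throughout this lesson):

```python
import numpy as np

class Tensor:
    def __init__(self, data, _children=()):
        self.data = np.asarray(data, dtype=np.float64)  # the wrapped array
        self.grad = np.zeros_like(self.data)   # field 1: accumulated gradient
        self._backward = lambda: None          # field 2: chain-rule closure
        self._children = set(_children)        # field 3: inputs that produced this tensor
```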
Every operation (add, multiply, matmul, relu, etc.) creates a new Tensor and attaches a _backward closure that encodes the chain rule for that specific operation. The graph is built implicitly — no explicit "graph object" exists. The closure captures references to the input tensors, so following _children pointers traces the full computation path.
The data flow is: forward operations build the graph implicitly as they execute; calling backward() on the loss seeds its gradient with 1, topologically sorts the graph, then invokes each node's _backward closure in reverse order so gradients flow from the loss back to the leaves.
The topological sort is the key algorithmic insight: we must finish computing ∂Loss/∂node before we can compute ∂Loss/∂child for any of that node's children. Going in reverse topological order guarantees that. It's DFS post-order on a DAG, which is twelve lines of Python.
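A self-contained sketch of backward() with the topological sort, using a minimal Tensor and a single __add__ operation for demonstration:

```python
import numpy as np

class Tensor:
    def __init__(self, data, _children=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None
        self._children = set(_children)

    def __add__(self, other):
        out = Tensor(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def backward(self):
        # DFS post-order on the DAG yields a topological ordering
        topo, visited = [], set()
        def build(node):
            if node not in visited:
                visited.add(node)
                for child in node._children:
                    build(child)
                topo.append(node)
        build(self)

        self.grad = np.ones_like(self.data)  # seed: dLoss/dLoss = 1
        for node in reversed(topo):          # reverse topological order
            node._backward()
```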
Implementation Deep Dive
The __mul__ Operation — Tracing the Chain Rule as Code
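The method looks roughly like this (a sketch matching the lesson's design; the minimal Tensor skeleton is repeated so the snippet stands alone):

```python
import numpy as np

class Tensor:
    def __init__(self, data, _children=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None
        self._children = set(_children)

    def __mul__(self, other):
        out = Tensor(self.data * other.data, (self, other))

        def _backward():
            # chain rule for multiplication: d(ab)/da = b, d(ab)/db = a
            # += (not =) so shared tensors accumulate from every branch
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out
```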
The += is intentional and crucial. If a tensor appears in multiple branches of the computation graph (e.g., weight matrix shared across time steps), its gradient contributions from each branch must accumulate — that is what += enforces. This is exactly what PyTorch's gradient accumulation does, and it is also why you must call optimizer.zero_grad() before each training step — otherwise last iteration's gradients pollute this iteration's.
Broadcasting and Shape Consistency
NumPy broadcasting is a source of subtle gradient bugs. When you add a (batch, output) tensor to a bias of shape (output,), NumPy broadcasts silently. The backward pass must sum over the broadcast dimensions to return the gradient to the bias's original shape:
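One way to sketch this is a small unbroadcast helper (the name is ours) that the _backward closure of a broadcasting op would call:

```python
import numpy as np

def unbroadcast(grad, shape):
    """Sum `grad` down to `shape`, undoing NumPy broadcasting."""
    # sum away leading axes that broadcasting added
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # sum over axes that were size 1 in the original shape
    for axis, dim in enumerate(shape):
        if dim == 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# gradient flowing into a (4, 3) result must shrink to the (3,) bias
g = np.ones((4, 3))
print(unbroadcast(g, (3,)))   # [4. 4. 4.]
```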
Forgetting this sum produces shape mismatches or, worse, silently wrong gradients when shapes accidentally align.
ReLU — Piecewise Linearity Has a Simple Backward
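A sketch of the operation (again with the minimal Tensor skeleton repeated so the snippet runs on its own):

```python
import numpy as np

class Tensor:
    def __init__(self, data, _children=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None
        self._children = set(_children)

    def relu(self):
        out = Tensor(np.maximum(0.0, self.data), (self,))

        def _backward():
            # the mask (out.data > 0) is the entire ReLU derivative
            self.grad += (out.data > 0) * out.grad

        out._backward = _backward
        return out
```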
The (out.data > 0) mask is the entire derivative of ReLU. Where the input was negative, the gate is closed — the gradient is zeroed. Where it was positive, the gradient passes through unchanged. When this mask is zero for a large fraction of units, you have "dying ReLU", a training-stability problem you can observe directly in the gradient histograms our app shows.
detach() and no_grad — Stopping the Graph
Sometimes you want to use a tensor's values without involving them in gradient computation (e.g., when you're computing a metric, or detaching an RNN state). Our implementation:
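A sketch (the Tensor skeleton is repeated so the snippet stands alone):

```python
import numpy as np

class Tensor:
    def __init__(self, data, _children=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None
        self._children = set(_children)

    def detach(self):
        # a fresh leaf: same values, no _children, no _backward closure
        return Tensor(self.data.copy())
```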
Calling detach() creates a leaf tensor — no _children, no _backward. The autograd engine has no path to follow back through it. This is what PyTorch's .detach() and torch.no_grad() context manager implement under the hood.
Production Readiness — Metrics to Watch
Running python train.py logs four metrics every epoch. Here's what each one tells you:
Loss Curve Shape
A healthy loss curve decreases smoothly and flattens asymptotically. If it oscillates wildly, your learning rate is too high. If it's perfectly flat from epoch 1, your weights are initialized to zero (a catastrophic mistake — zero init means all neurons compute identical outputs and receive identical gradients, so the network never differentiates). If it drops then suddenly spikes to inf, you've hit a numerical instability — usually log(0) in a loss function or an exploding gradient.
Weight Histogram Per Epoch
Load best_weights.npy after training and plot np.histogram(weights). At initialization, weights should follow a rough normal distribution (we use Xavier init: std = sqrt(2 / (fan_in + fan_out))). As training progresses, the distribution should shift and spread — different neurons specializing. If all weights collapse to a similar value, your network has found a degenerate solution.
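For reference, the initialization and a quick histogram in one place (the layer sizes here are arbitrary examples):

```python
import numpy as np

fan_in, fan_out = 64, 32
std = np.sqrt(2.0 / (fan_in + fan_out))              # Xavier init std
weights = np.random.normal(0.0, std, (fan_in, fan_out))

counts, bin_edges = np.histogram(weights, bins=20)   # roughly bell-shaped counts
```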
Gradient Norm
We log np.linalg.norm(grad.flatten()) per layer each epoch. For stable training on this lesson's scale, expect values between 0.01 and 5.0. If a gradient norm exceeds 10, you're approaching an exploding gradient situation. The fix is gradient clipping: grad = grad * min(1.0, threshold / norm).
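As a runnable sketch (with a small epsilon guard against a zero norm, which the one-liner above omits):

```python
import numpy as np

def clip_by_norm(grad, threshold=5.0, eps=1e-12):
    norm = np.linalg.norm(grad.flatten())
    # rescale only when the norm exceeds the threshold
    return grad * min(1.0, threshold / (norm + eps))

g = np.full(100, 10.0)          # norm = 10 * sqrt(100) = 100
clipped = clip_by_norm(g)       # rescaled so its norm is 5.0
```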
Train vs. Validation Accuracy Gap
Any gap greater than 10 percentage points that widens across epochs is overfitting. With a model this size on synthetic data, you should not see it — but it's a habit worth developing now.
Step-by-Step Guide
Prerequisites
Generate and Launch
Verification
The Streamlit app opens at localhost:8501. Use the sliders to set values for tensors a, b, c. Click "Run Forward Pass" — the computation graph appears with each node's value. Click "Run Backward Pass" — gradient values populate each node. The leaf tensors (a, b, c) show ∂Loss/∂a in orange.
Open model.py and locate the Tensor.__mul__ method. That single closure is the chain rule for multiplication. Everything else is scaffolding around that same pattern applied to each operation.
Click "Simulate Error" to trigger the in-place mutation trap — watch the gradient become incorrect while the forward pass looks fine. That's the bug that has silently broken more training loops than almost anything else.
Homework — Production Challenge
Implement the sigmoid operation in model.py using only __mul__, __add__, and __pow__ — do not write a new _backward closure. Instead, express sigmoid as: σ(x) = 1 / (1 + exp(-x)) and decompose it into operations your engine already handles. Verify that the gradient σ(x) * (1 - σ(x)) emerges automatically from your existing graph machinery. If it does, you've proven that you don't need hand-coded derivatives for composite functions — the chain rule composes them for free. That is the entire philosophical foundation of modern deep learning frameworks.
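Before wiring it into the engine, you can sanity-check the target identity numerically — this does not spoil the decomposition exercise:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))                     # the claimed derivative
# the two agree to roughly 1e-10
```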
Next Lesson: Lesson 05 — Building a Linear Layer from scratch: from Tensor primitives to a fully vectorized forward pass with Xavier initialization and configurable activation functions.