Lesson 04 · 60 min

Tensor Ops Lab: Building Autograd From Scratch


The torch.autograd Trap

Here is how most tutorials teach automatic differentiation:

```python
import torch
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x
y.backward()
print(x.grad)  # tensor(8.)
```

One import. Three lines. You get a gradient. Tutorial complete.

Except — you have no idea what just happened. You pressed a button labeled "compute all partial derivatives in a neural network with potentially 175 billion parameters" and it worked instantly, which means you have zero model for why it works or when it breaks. When your training loop produces NaN gradients at epoch 47, or you accidentally run .backward() twice and get garbage weights, or you use an in-place operation inside a computation graph and your loss mysteriously never decreases — none of your PyTorch intuition helps you. Because you never built any.

This lesson fixes that. We will implement autograd from scratch using only NumPy. By the time you finish, PyTorch's computation graph will not feel like magic — it will feel like a design decision you yourself could have made.


The Failure Mode: Silent Graph Corruption

Before we build the right thing, let's watch the wrong thing fail. Suppose you try to compute gradients using finite differences — the brute-force numerical approach:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):          # loop over every parameter
        x_plus  = x.copy(); x_plus.flat[i]  += h
        x_minus = x.copy(); x_minus.flat[i] -= h
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad
```

This is mathematically correct. It is also catastrophically slow: for a network with N parameters, it requires 2N forward passes to compute one gradient update. A small MLP with 10,000 parameters needs 20,000 forward passes per training step. Reverse-mode autodiff (backpropagation) computes the exact same gradients in one forward pass + one backward pass, regardless of N. That is not a minor optimization — it is the reason deep learning became computationally feasible at all.
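Despite its cost, the finite-difference estimate is exactly what you want for *verifying* an autograd implementation on a handful of parameters. A self-contained sketch (numerical_gradient is redefined here so the snippet runs standalone), checking it against the known analytic gradient of f(x) = Σx², which is 2x:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Central-difference estimate of df/dx, one parameter at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus  = x.copy(); x_plus.flat[i]  += h
        x_minus = x.copy(); x_minus.flat[i] -= h
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

x = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(lambda v: np.sum(v ** 2), x)
ana = 2 * x                        # analytic gradient of sum(v^2)
max_err = np.abs(num - ana).max()  # should be tiny
```

Once your engine's backward pass agrees with this estimate to within ~1e-5 on small inputs, you can trust it on inputs too large to check numerically.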

Now here's the silent error that kills beginners who do try to build a graph manually. Imagine you compute a forward pass, store intermediate results for the backward pass, then modify a tensor in-place before calling backward:

```python
a = Tensor(np.array([2.0]))
b = a * a          # b._backward closure captures 'a.data'
a.data += 10.0     # in-place mutation — a.data is now 12.0
b.backward()       # chain rule uses 12.0 instead of 2.0
print(a.grad)      # prints 24.0 — WRONG: the true gradient is 4.0
```

No error. No warning. Wrong gradient. This is the in-place mutation trap, and PyTorch raises RuntimeError: a leaf Variable that requires grad is being used in an in-place operation because engineers like us burned hours debugging this before the check was added.
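One way to make the trap impossible is to snapshot operand values at operation time, so the closure never sees later mutations. This is not how PyTorch solves it (PyTorch tracks a version counter per tensor and raises an error instead, since copying every operand costs memory), but it is a useful minimal sketch:

```python
import numpy as np

class Tensor:
    """Mini Tensor with one defensive choice: snapshot operands at op time."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None

    def __mul__(self, other):
        out = Tensor(self.data * other.data)
        # Copy NOW, so a later in-place edit of .data cannot corrupt the closure
        self_snap, other_snap = self.data.copy(), other.data.copy()
        def _backward():
            self.grad  += other_snap * out.grad
            other.grad += self_snap  * out.grad
        out._backward = _backward
        return out

a = Tensor([2.0])
b = a * a
a.data += 10.0                  # the same in-place mutation as above
b.grad = np.ones_like(b.data)   # seed the gradient by hand
b._backward()                   # a.grad comes out 4.0; the snapshots survived
```

The trade-off is memory: every op now stores a copy of its inputs, which is why real frameworks prefer detection over defensive copying.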


The ScratchAI Architecture: A Computation Graph in 150 Lines

Component Architecture

[Pipeline diagram: ScratchAI Beginner architecture, with the Tensor Ops Lab highlighted. Data (raw arrays X, y as ndarrays) → Preprocessing (normalize, split) → Lesson 04: Tensor Ops Lab (autograd engine, forward + backward) → Loss (MSE / BCE, scalar) → Optimizer (SGD: w -= lr·∇w, parameter update) → Trained Model. Every arrow is a NumPy array; autograd builds the reverse of the forward path; no PyTorch involved.]

Our autograd engine has one central class: Tensor. It wraps a NumPy array and adds the bookkeeping fields below:

| Field | Type | Purpose |
| --- | --- | --- |
| data | np.ndarray | The actual numbers |
| grad | np.ndarray | Accumulated gradient (∂Loss/∂self) |
| _backward | Callable | Closure: computes grad contributions to children |
| _children | set[Tensor] | Nodes this tensor was computed from |
| _op | str | Label for graph visualization |

Every operation (add, multiply, matmul, relu, etc.) creates a new Tensor and attaches a _backward closure that encodes the chain rule for that specific operation. The graph is built implicitly — no explicit "graph object" exists. The closure captures references to the input tensors, so following _children pointers traces the full computation path.
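As a sketch, here is a constructor consistent with the field table (the defaults shown are assumptions; the lesson's model.py may differ in detail):

```python
import numpy as np

class Tensor:
    """Constructor sketch matching the field table above."""
    def __init__(self, data, _children=(), _op=''):
        self.data = np.asarray(data, dtype=float)  # the actual numbers
        self.grad = np.zeros_like(self.data)       # accumulated dLoss/dself
        self._backward = lambda: None              # replaced by each op's closure
        self._children = set(_children)            # tensors this one was computed from
        self._op = _op                             # label for graph visualization

x = Tensor([1.0, 2.0])   # a leaf: empty _children, no-op _backward
```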

The data flow is:

```text
Input Tensors (leaf nodes, no _children)
       ↓  [operation: matmul, add, relu…]
Intermediate Tensors (non-leaf, have _children)
       ↓  [chain of operations]
Loss Scalar (single number)
       ↓  loss.backward()
Topological Sort (reverse the graph, leaves last)
       ↓  call ._backward() on each node
Leaf Tensor .grad fields populated
       ↓  weight -= lr * weight.grad
Updated Weights
```

The topological sort is the key algorithmic insight: we must finish computing ∂Loss/∂node before propagating gradients onward to that node's children. Walking the graph in reverse topological order guarantees that. It's DFS post-order on a DAG, which is about a dozen lines of Python.
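Here is a minimal sketch of that backward() method, bundled with a bare __mul__ so it runs standalone (the real model.py may structure this differently):

```python
import numpy as np

class Tensor:
    def __init__(self, data, _children=(), _op=''):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None
        self._children = set(_children)
        self._op = _op

    def __mul__(self, other):
        out = Tensor(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # DFS post-order on the DAG yields a topological order; walk it reversed
        topo, visited = [], set()
        def build(node):
            if node not in visited:
                visited.add(node)
                for child in node._children:
                    build(child)
                topo.append(node)
        build(self)
        self.grad = np.ones_like(self.data)  # seed: dLoss/dLoss = 1
        for node in reversed(topo):
            node._backward()

x = Tensor([3.0])
y = x * x        # y = x^2, so dy/dx = 2x = 6
y.backward()
```

Note that backward() itself knows nothing about multiplication or ReLU; it only orders the nodes and fires each one's stored closure.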


Implementation Deep Dive

The __mul__ Operation — Tracing the Chain Rule as Code

```python
def __mul__(self, other: 'Tensor') -> 'Tensor':
    out = Tensor(self.data * other.data, _children=(self, other), _op='*')

    def _backward():
        # ∂(self * other)/∂self  = other  → chain rule: multiply by out.grad
        self.grad  += other.data * out.grad
        # ∂(self * other)/∂other = self
        other.grad += self.data  * out.grad

    out._backward = _backward
    return out
```

The += is intentional and crucial. If a tensor appears in multiple branches of the computation graph (e.g., weight matrix shared across time steps), its gradient contributions from each branch must accumulate — that is what += enforces. This is exactly what PyTorch's gradient accumulation does, and it is also why you must call optimizer.zero_grad() before each training step — otherwise last iteration's gradients pollute this iteration's.
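Both effects can be observed directly with a stripped-down Tensor (a sketch, not the lesson's full engine): w * w feeds w into both inputs, so its two contributions accumulate, and re-running backward without zeroing pollutes the result.

```python
import numpy as np

class Tensor:
    """Stripped-down Tensor: just enough to show gradient accumulation."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None

    def __mul__(self, other):
        out = Tensor(self.data * other.data)
        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

w = Tensor([3.0])
loss = w * w                         # w feeds BOTH inputs: two += contributions
loss.grad = np.ones_like(loss.data)  # seed dLoss/dLoss = 1
loss._backward()
first = float(w.grad[0])             # 6.0: 3.0 from each branch, accumulated

loss._backward()                     # "forgot zero_grad": stale grads pile up
stale = float(w.grad[0])             # 12.0; the last pass polluted this one

w.grad = np.zeros_like(w.data)       # the zero_grad() step
loss._backward()
fresh = float(w.grad[0])             # 6.0 again
```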

Broadcasting and Shape Consistency

NumPy broadcasting is a source of subtle gradient bugs. When you add a (batch, output) tensor to a bias of shape (output,), NumPy broadcasts silently. The backward pass must sum over the broadcast dimensions to return the gradient to the bias's original shape:

```python
def _backward():
    # Gradient of addition is identity — but must reduce over broadcast dims
    grad = out.grad
    # 1) sum away leading axes that broadcasting added…
    extra = grad.ndim - self.data.ndim
    if extra > 0:
        grad = grad.sum(axis=tuple(range(extra)))
    # 2) …and axes where self had size 1 but was stretched to match
    for i, dim in enumerate(self.data.shape):
        if dim == 1 and grad.shape[i] != 1:
            grad = grad.sum(axis=i, keepdims=True)
    self.grad += grad
```

Forgetting this sum produces shape mismatches or, worse, silently wrong gradients when shapes accidentally align.
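The reduction can be checked with plain NumPy, no engine required. Each bias element is reused once per batch row in the forward pass, so its gradient must sum one contribution per row:

```python
import numpy as np

# Forward: a (batch, out) activation + an (out,) bias broadcasts silently.
out_grad = np.ones((32, 4))   # pretend upstream gradient from the loss
bias_shape = (4,)

# Backward: sum over the leading broadcast axes to recover the bias's shape
extra = out_grad.ndim - len(bias_shape)
bias_grad = out_grad.sum(axis=tuple(range(extra)))
# Each bias element was reused 32 times, so each gradient entry is 32.0
```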

ReLU — Piecewise Linearity Has a Simple Backward

```python
def relu(self) -> 'Tensor':
    out = Tensor(np.maximum(0, self.data), _children=(self,), _op='ReLU')

    def _backward():
        # Gate: gradient flows only where forward pass was positive
        self.grad += (out.data > 0).astype(float) * out.grad

    out._backward = _backward
    return out
```

The (out.data > 0) mask is the entire derivative of ReLU. Where the input was negative, the gate is closed — the gradient is zeroed. Where positive, the gradient passes through unchanged. When this mask is zero across large portions of the network, you get "dying ReLU", a training-stability problem you can observe directly in the gradient histograms our app shows.
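A quick way to see a dying layer in numbers: shift the pre-activations negative (a hypothetical worst case, e.g. after a bad bias update) and measure what fraction of gates are closed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pre-activations whose distribution drifted negative
pre_activation = rng.normal(loc=-2.0, scale=1.0, size=(32, 64))

mask = (pre_activation > 0).astype(float)  # the same gate used in relu backward
dead_fraction = 1.0 - mask.mean()          # share of units whose gradient is zeroed
```

With a mean of -2 and std of 1, well over 90% of the units receive zero gradient, so they can never recover: that is the dying-ReLU failure mode in one number.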

detach() and no_grad — Stopping the Graph

Sometimes you want to use a tensor's values without involving them in gradient computation (e.g., when you're computing a metric, or detaching an RNN state). Our implementation:

```python
def detach(self) -> 'Tensor':
    """Return a new Tensor with same data but no graph connection."""
    return Tensor(self.data.copy())
```

Calling detach() creates a leaf tensor — no _children, no _backward. The autograd engine has no path to follow back through it. This is what PyTorch's .detach() and torch.no_grad() context manager implement under the hood.
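A minimal sketch of the severed graph in action (a mini engine with hand-seeded gradients, not the lesson's full backward):

```python
import numpy as np

class Tensor:
    """Mini Tensor with detach(), just enough to show the graph being severed."""
    def __init__(self, data, _children=(), _op=''):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)
        self._backward = lambda: None
        self._children = set(_children)
        self._op = _op

    def __mul__(self, other):
        out = Tensor(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

    def detach(self):
        return Tensor(self.data.copy())   # leaf: no _children, no _backward

x = Tensor([3.0])
y = (x * x).detach()            # y.data = 9.0, but the graph stops here
z = y * y                       # z.data = 81.0
z.grad = np.ones_like(z.data)   # seed and step backward by hand
z._backward()                   # fills y.grad = 18.0
y._backward()                   # no-op: y is a detached leaf
# x.grad is still zero: autograd has no path back through the detach
```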


Production Readiness — Metrics to Watch

[Flowchart: autograd data flow, forward → loss → backward, with shapes at each step. Input Tensor X (batch=32, features=2), a leaf node with no _children → Linear Layer 1: X @ W1 + b1, (32, 2) @ (2, 32) → (32, 32) → ReLU: max(0, x), same shape, gate = (out > 0).astype(float) → Linear Layer 2: h @ W2 + b2, (32, 32) @ (32, 1) → (32, 1) → Loss: mse_loss(σ(logit), y), a scalar with loss.grad seeded to 1.0. loss.backward() then fills W1.grad, b1.grad, W2.grad, b2.grad via the chain rule: matmul backward is grad_W = Xᵀ @ out.grad, ReLU backward is grad *= (out > 0), add backward is grad_b = sum(out.grad, axis=0).]

Running python train.py logs four metrics every epoch. Here's what each one tells you:

Loss Curve Shape
A healthy loss curve decreases smoothly and flattens asymptotically. If it oscillates wildly, your learning rate is too high. If it's perfectly flat from epoch 1, your weights are initialized to zero (a catastrophic mistake — zero init means all neurons compute identical outputs and receive identical gradients, so the network never differentiates). If it drops then suddenly spikes to inf, you've hit a numerical instability — usually log(0) in a loss function or an exploding gradient.

Weight Histogram Per Epoch
Load best_weights.npy after training and plot np.histogram(weights). At initialization, weights should follow a rough normal distribution (we use Xavier init: std = sqrt(2 / (fan_in + fan_out))). As training progresses, the distribution should shift and spread — different neurons specializing. If all weights collapse to a similar value, your network has found a degenerate solution.
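The quoted initialization can be sketched in a few lines (xavier_init is a hypothetical helper name; the lesson's code may organize this differently):

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Xavier (Glorot) normal init: std = sqrt(2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = xavier_init(256, 128, rng)
# The empirical std of W should sit close to sqrt(2 / 384) ≈ 0.072
```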

Gradient Norm
We log np.linalg.norm(grad.flatten()) per layer each epoch. For stable training on this lesson's scale, expect values between 0.01 and 5.0. If a gradient norm exceeds 10, you're approaching an exploding gradient situation. The fix is gradient clipping: grad = grad * min(1.0, threshold / norm).
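That clipping formula, sketched as a standalone helper with a zero-norm guard added (an assumption; the lesson's training loop may inline this):

```python
import numpy as np

def clip_gradient(grad, threshold=10.0):
    """Rescale grad so its global L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad.flatten())
    if norm == 0.0:
        return grad                          # nothing to clip
    return grad * min(1.0, threshold / norm) # direction preserved, norm capped

g = np.full(4, 10.0)          # norm = 20.0, an "exploding" gradient
clipped = clip_gradient(g)    # norm brought down to exactly the threshold
```

Because the whole vector is scaled by one factor, clipping changes the step size but never the step direction.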

Train vs. Validation Accuracy Gap
Any gap greater than 10 percentage points that widens across epochs is overfitting. With a model this size on synthetic data, you should not see it — but it's a habit worth developing now.


Step-by-Step Guide

[State machine: Tensor lifecycle, from creation to convergence or divergence; each arrow is one operation. Created (leaf tensor, .grad = zeros) → op (*, @, +, relu…; closure captured) → In Graph (_children linked, ._backward set) → forward pass reaches the loss scalar → Forward Done (loss.data is a scalar, loss.grad seeded to 1.0) → loss.backward() (topo sort, reverse walk) → Grad Computed (.grad ≠ zeros, ∂loss/∂w available, grad_norm logged) → w -= lr · grad, zero_grad(), next epoch, looping until Converged (loss plateau; save best_weights.npy). Failure edge: an in-place mutation or a forgotten zero_grad() leads to Diverged (NaN loss or wrong grads), silently, with no exception raised.]

Prerequisites

```bash
# Requires Python 3.11+
pip install "numpy>=1.26" "streamlit>=1.32" "plotly>=5.20"
```

(The quotes matter: in a shell, an unquoted >= is parsed as output redirection.)

Generate and Launch

```bash
python setup.py          # generates lesson_04/ workspace
cd lesson_04
streamlit run app.py     # opens at localhost:8501
```

Verification

The Streamlit app opens at localhost:8501. Use the sliders to set values for tensors a, b, c. Click "Run Forward Pass" — the computation graph appears with each node's value. Click "Run Backward Pass" — gradient values populate each node. The leaf tensors (a, b, c) show ∂Loss/∂a in orange.

Open model.py and locate the Tensor.__mul__ method. That single closure is the chain rule for multiplication. Everything else is scaffolding around that same pattern applied to each operation.

Click "Simulate Error" to trigger the in-place mutation trap — watch the gradient become incorrect while the forward pass looks fine. That's the bug that has silently broken more training loops than almost anything else.

Homework — Production Challenge

Implement the sigmoid operation in model.py without writing a new _backward closure. Express sigmoid as σ(x) = 1 / (1 + exp(-x)) and decompose it into operations your engine already handles — the reciprocal, for instance, is just __pow__ with exponent -1. Verify that the gradient σ(x) * (1 - σ(x)) emerges automatically from your existing graph machinery. If it does, you've proven that you don't need hand-coded derivatives for composite functions: the chain rule composes them for free. That is the entire philosophical foundation of modern deep learning frameworks.


Next Lesson: Lesson 05 — Building a Linear Layer from scratch: from Tensor primitives to a fully vectorized forward pass with Xavier initialization and configurable activation functions.
