# Net Architect: Building Neural Network Modules from Scratch
## The nn.Module Trap
Open any PyTorch tutorial and you'll see this pattern within the first ten lines:
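The snippet in question looks something like this (a representative sketch; exact tutorial code varies, but the shape is always the same):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 64)  # one short line; a lot happens here
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))
```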
It works. It's clean. And it hides everything you need to understand.
When you call nn.Linear(128, 64), PyTorch silently allocates a weight matrix
W ∈ ℝ^{64×128} (stored transposed, as out_features × in_features) and a bias
vector b ∈ ℝ^{64}, initializes them with Kaiming uniform sampling, registers
them as *tracked parameters* in the module's parameter tree, and wires up
automatic differentiation hooks. You typed 18 characters. You learned nothing
about what a "parameter" actually is.
This lesson rips the curtain back. We'll build the same system — Module,
Sequential, ModuleList, ModuleDict, Parameter, parameter sharing — in
pure NumPy. By the end, you'll understand exactly what PyTorch's nn.Module
does, why it needs to do it, and what breaks when the abstractions aren't there.
---
## The Failure Mode
Here's the naive approach a beginner takes:
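One plausible version of that attempt, with layer sizes chosen to match the traceback below (the crash is caught here so the snippet runs end to end):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 128))       # batch of 8 inputs

W1 = rng.normal(size=(128, 64))
b1 = np.zeros(64)
W2 = rng.normal(size=(32, 10))      # bug: first dim should be 64, not 32

h = np.maximum(x @ W1 + b1, 0.0)    # first layer works: shape (8, 64)
try:
    out = h @ W2                    # (8, 64) @ (32, 10): core dims clash
except ValueError as e:
    print(e)
```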
```
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 32 is different from 64)
```
This is the *best-case* failure: a crash you can see. The silent failure is
worse. Three layers happen to have compatible shapes, you train for 20 epochs,
check accuracy... 11%. Your network learned almost nothing because b1 should
have shape (64,) but you initialized it as np.zeros(1), and broadcasting
silently stretched that single value across all 64 units.
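The silent part is easy to reproduce; NumPy accepts the mismatched bias without complaint:

```python
import numpy as np

h = np.ones((8, 64))    # activations for a batch of 8
b1 = np.zeros(1)        # bug: should be np.zeros(64)
out = h + b1            # no error: (1,) broadcasts against (8, 64)
print(out.shape)        # (8, 64); the mistake is invisible
```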
With loose NumPy arrays there is no parameter inventory. You cannot ask "how
many trainable weights does this network have?" without counting by hand. You
cannot save and reload the model without writing custom serialization per layer.
You cannot share weights between layers without careful aliasing — and if you
copy instead of alias, you've broken parameter sharing silently.
The Module abstraction exists to solve exactly these problems.
---
## The ScratchAI Architecture
We implement four classes:
**Parameter** wraps a NumPy array and marks it as trainable. It holds .data
(the weight values) and .grad (accumulated gradients). That's it. This is the
atom of every neural network.
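A minimal sketch of such a Parameter (the zero_grad helper is our own convenience addition, not part of the description above):

```python
import numpy as np

class Parameter:
    """The atom of the network: weight values plus a slot for gradients."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)

    def zero_grad(self):
        self.grad[...] = 0.0
```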
**Module** is the base class for every layer. Its __setattr__ override
inspects every attribute being assigned: if it's a Parameter, register it in
self._parameters. If it's another Module, register it in self._modules.
This gives us a recursive parameter tree automatically — no bookkeeping required.
```
Module
├─ _parameters: dict[str, Parameter]  ← trainable weights
├─ _modules:    dict[str, Module]     ← sub-modules, each with their own trees
└─ _buffers:    dict[str, NDArray]    ← non-trainable state (e.g., running mean)
```
The distinction between parameters and buffers is critical: both travel with the
model (they're saved/loaded together), but only parameters receive gradient
updates. A batch norm layer's running mean is a buffer. Its scale and shift are
parameters.
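Putting the two paragraphs above together, here is a compact sketch of Module (helper names like _collect and register_buffer are our own choices; the registry layout matches the tree above):

```python
import numpy as np

class Parameter:                       # minimal stand-in from earlier
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)

class Module:
    def __init__(self):
        self._parameters = {}          # name -> Parameter
        self._modules = {}             # name -> sub-Module
        self._buffers = {}             # name -> non-trainable array

    def __setattr__(self, name, value):
        # Inspect every assignment and file it in the right registry.
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def register_buffer(self, name, array):
        # Travels with the model, but never receives gradient updates.
        self._buffers[name] = np.asarray(array, dtype=np.float64)
        object.__setattr__(self, name, self._buffers[name])

    def parameters(self):
        # Walk the tree depth-first, deduplicating shared Parameters by id().
        seen, out = set(), []
        self._collect(seen, out)
        return out

    def _collect(self, seen, out):
        for p in self._parameters.values():
            if id(p) not in seen:
                seen.add(id(p))
                out.append(p)
        for m in self._modules.values():
            m._collect(seen, out)

class Linear(Module):
    def __init__(self, nin, nout):
        super().__init__()
        self.W = Parameter(np.zeros((nin, nout)))
        self.b = Parameter(np.zeros(nout))

class MLP(Module):
    def __init__(self):
        super().__init__()
        self.l1 = Linear(128, 64)
        self.l2 = Linear(64, 10)

net = MLP()
print(sum(p.data.size for p in net.parameters()))  # 8906
```

Note how MLP does zero bookkeeping: assigning self.l1 is enough, because __setattr__ files the sub-module into _modules and parameters() walks the tree recursively.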
**Sequential** stores an ordered list of modules and implements forward() as
a simple loop: feed output of module i as input to module i+1. The entire
chain is assembled at construction time, and parameters() recursively walks
every sub-module to collect the full parameter set.
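A sketch of Sequential on top of that machinery (the Module/Parameter stubs are repeated, slightly trimmed, so the example runs on its own; layer names and the forward convention are our own choices):

```python
import numpy as np

class Parameter:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)

class Module:
    def __init__(self):
        self._parameters, self._modules = {}, {}
    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)
    def parameters(self):
        seen, out = set(), []
        self._collect(seen, out)
        return out
    def _collect(self, seen, out):
        for p in self._parameters.values():
            if id(p) not in seen:
                seen.add(id(p))
                out.append(p)
        for m in self._modules.values():
            m._collect(seen, out)

class Linear(Module):
    def __init__(self, nin, nout):
        super().__init__()
        self.W = Parameter(np.random.randn(nin, nout) * 0.01)
        self.b = Parameter(np.zeros(nout))
    def forward(self, x):
        return x @ self.W.data + self.b.data

class ReLU(Module):
    def forward(self, x):
        return np.maximum(x, 0.0)

class Sequential(Module):
    def __init__(self, *layers):
        super().__init__()
        for i, layer in enumerate(layers):
            setattr(self, str(i), layer)   # registers each as a sub-module
        self.layers = layers               # plain tuple; not re-registered

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)           # output of module i feeds i+1
        return x

net = Sequential(Linear(128, 64), ReLU(), Linear(64, 10))
out = net.forward(np.zeros((2, 128)))
print(out.shape, len(net.parameters()))
```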
**Parameter sharing**: two Module instances can hold a *reference to the same
Parameter object*. Because Python assignment binds names to objects (not
copies), layer_b.W = layer_a.W means both layers literally use the same weight
array. When we walk the parameter tree, we use id() to deduplicate — a shared
parameter is counted once.
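The aliasing and the id()-based deduplication can be seen in isolation, with plain Python standing in for the Module machinery (the dicts here are hypothetical "layers" for illustration only):

```python
import numpy as np

W = np.ones((4, 4))
layer_a = {"W": W}
layer_b = {"W": layer_a["W"]}          # binds the same object, no copy

assert layer_a["W"] is layer_b["W"]    # one array, two names

# Dedup by identity when collecting: the shared weight counts once.
params = [layer_a["W"], layer_b["W"]]
unique = list({id(p): p for p in params}.values())
print(len(unique))                     # 1

layer_b["W"] += 1                      # in-place update through one name...
print(layer_a["W"][0, 0])              # 2.0; visible through the other
```

Had we written layer_b["W"] = layer_a["W"].copy() instead, both prints would change and parameter sharing would be silently broken, exactly the failure mode described above.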
---