# Net Architect: Building Neural Network Modules from Scratch

Day 3 · Lesson 3 · 60 min

## Component Architecture

*(Diagram: the data layer takes NDArray input through preprocessing (normalize, batch) into the module tree of Parameter, Sequential, and ModuleDict (the Lesson 07 scope), then through a cross-entropy loss into an SGD optimizer that applies the weight-delta gradient update loop.)*

## The nn.Module Trap

*(State-machine diagram: a module's lifecycle. Defined: Module.__init__() runs and the __setattr__ hook registers parameters in the _parameters dict. Forward pass: model(X) computes X @ W + b and the activations. Loss computed: cross_entropy() produces a scalar and dlogits is ready. Updated: backward() runs, then W -= lr * grad; converged once the loss plateaus low. Diverged: grad_norm > 100, loss explodes or goes NaN; reduce lr or fix the init scale. Otherwise, on to the next epoch.)*

*(Flowchart: data flow through Sequential.)*

```
Input X (batch, 128)
  → Linear(128 → 64): X @ W + b,  W: (128, 64), b: (64,)   [128 weights per neuron]
  → ReLU: max(0, x), element-wise                            → (batch, 64)
  → Linear(64 → 10): X @ W + b,  W: (64, 10), b: (10,)      [64 weights per neuron]
  → Logits out, shape (batch, 10)
```
Open any PyTorch tutorial and you'll see this pattern within the first ten lines:
```python
import torch
import torch.nn as nn

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))
```

It works. It's clean. And it hides everything you need to understand.

When you call nn.Linear(128, 64), PyTorch silently allocates a weight matrix
W ∈ ℝ^{128×64} and a bias vector b ∈ ℝ^{64}, initializes them with Kaiming
uniform sampling, registers them as *tracked parameters* in a global parameter
tree, and wires up automatic differentiation hooks. You typed 17 characters. You
learned nothing about what a "parameter" actually is.

This lesson rips the curtain back. We'll build the same system — Module,
Sequential, ModuleList, ModuleDict, Parameter, parameter sharing — in
pure NumPy. By the end, you'll understand exactly what PyTorch's nn.Module
does, why it needs to do it, and what breaks when the abstractions aren't there.

---

## The Failure Mode

Here's the naive approach a beginner takes:
```python
import numpy as np

W1 = np.random.randn(128, 64) * 0.01
b1 = np.zeros(64)
W2 = np.random.randn(32, 10) * 0.01   # ← BUG: should be (64, 10)
b2 = np.zeros(10)

X = np.random.randn(32, 128)
h = np.maximum(0, X @ W1 + b1)        # shape: (32, 64) — correct
out = h @ W2 + b2                      # CRASH
```

```
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 32 is different from 64)
```

This is the *best-case* failure: a crash you can see. The silent failure is
worse: you have three layers with compatible shapes by accident, train for 20
epochs, check accuracy... 11%. Your network learned nothing because b1 should
have shape (64,), but you initialized it as np.zeros(1) and broadcasting
silently accepted the mismatch instead of raising an error.
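The silent version of the bug is easy to reproduce. In this small sketch, the mis-shaped bias produces no error at all, because NumPy happily broadcasts a length-1 array against the last axis:

```python
import numpy as np

X = np.random.randn(32, 128)
W1 = np.random.randn(128, 64) * 0.01
b1 = np.zeros(1)                 # BUG: should be np.zeros(64)

h = np.maximum(0, X @ W1 + b1)   # no error: (1,) broadcasts against (32, 64)
print(h.shape)                   # (32, 64), so the mistake is invisible
```

Nothing in the output shape betrays the problem; you only find out 20 epochs later.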

With loose NumPy arrays there is no parameter inventory. You cannot ask "how
many trainable weights does this network have?" without counting by hand. You
cannot save and reload the model without writing custom serialization per layer.
You cannot share weights between layers without careful aliasing — and if you
copy instead of alias, you've broken parameter sharing silently.

The Module abstraction exists to solve exactly these problems.

---

## The ScratchAI Architecture

We implement four classes:

**Parameter** wraps a NumPy array and marks it as trainable. It holds .data
(the weight values) and .grad (accumulated gradients). That's it. This is the
atom of every neural network.
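A minimal sketch of that atom, assuming the two-attribute design described above (the lesson's exact implementation may differ in details):

```python
import numpy as np

class Parameter:
    """Wraps an array and marks it trainable:
    .data holds the weight values, .grad accumulates gradients."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)

W = Parameter(np.random.randn(128, 64) * 0.01)
print(W.data.shape, W.grad.shape)   # (128, 64) (128, 64)
```

The gradient buffer is allocated eagerly with the same shape as the data, so an optimizer can always write into it without shape checks.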

**Module** is the base class for every layer. Its __setattr__ override
inspects every attribute being assigned: if it's a Parameter, register it in
self._parameters. If it's another Module, register it in self._modules.
This gives us a recursive parameter tree automatically — no bookkeeping required.

```
Module
├─ _parameters: dict[str, Parameter]
├─ _modules:    dict[str, Module]    ← sub-modules, each with their own trees
└─ _buffers:    dict[str, NDArray]   ← non-trainable state (e.g., running mean)
```

The distinction between parameters and buffers is critical: both travel with the
model (they're saved/loaded together), but only parameters receive gradient
updates. A batch norm layer's running mean is a buffer. Its scale and shift are
parameters.
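A sketch of the registration hook and the parameter/buffer split. This is a simplified assumption about the API: the method name register_buffer and the BatchNormish class are illustrative, not the lesson's exact code.

```python
import numpy as np

class Parameter:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)

class Module:
    def __init__(self):
        # Create the registries with object.__setattr__ so our own
        # __setattr__ hook below does not fire before they exist.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})
        object.__setattr__(self, "_buffers", {})

    def __setattr__(self, name, value):
        # Every attribute assignment is inspected and filed away.
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def register_buffer(self, name, array):
        # Buffers travel with the model but never get gradient updates.
        self._buffers[name] = array
        object.__setattr__(self, name, array)

    def parameters(self):
        # Recursively collect parameters from this module and its children.
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()

class BatchNormish(Module):   # hypothetical layer, just to show the split
    def __init__(self, dim):
        super().__init__()
        self.scale = Parameter(np.ones(dim))                  # trainable
        self.shift = Parameter(np.zeros(dim))                 # trainable
        self.register_buffer("running_mean", np.zeros(dim))   # not trainable

bn = BatchNormish(64)
print(len(bn._parameters), len(bn._buffers))   # 2 1
```

The only subtle part is the bootstrap: the registries must be created with object.__setattr__, otherwise __init__ would recurse into a hook that reads dictionaries that do not exist yet.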

**Sequential** stores an ordered list of modules and implements forward() as
a simple loop: feed output of module i as input to module i+1. The entire
chain is assembled at construction time, and parameters() recursively walks
every sub-module to collect the full parameter set.
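Putting the pieces together, a self-contained sketch of Sequential might look like this (class names mirror the description; details are assumptions, not the lesson's exact code):

```python
import numpy as np

class Parameter:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)

class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()

class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.W = Parameter(np.random.randn(n_in, n_out) * 0.01)
        self.b = Parameter(np.zeros(n_out))

    def forward(self, x):
        return x @ self.W.data + self.b.data

class ReLU(Module):
    def forward(self, x):
        return np.maximum(0, x)

class Sequential(Module):
    def __init__(self, *layers):
        super().__init__()
        for i, layer in enumerate(layers):
            self._modules[str(i)] = layer   # register children by position

    def forward(self, x):
        # Feed each module's output to the next one, in order.
        for layer in self._modules.values():
            x = layer.forward(x)
        return x

net = Sequential(Linear(128, 64), ReLU(), Linear(64, 10))
out = net.forward(np.random.randn(32, 128))
print(out.shape)                                   # (32, 10)
print(sum(p.data.size for p in net.parameters()))  # 8192+64 + 640+10 = 8906
```

Note what we gained: the question "how many trainable weights does this network have?" is now a one-line sum over parameters(), instead of hand counting.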

**Parameter sharing**: two Module instances can hold a *reference to the same
Parameter object*. Because Python assignment binds names to objects (not
copies), layer_b.W = layer_a.W means both layers literally use the same weight
array. When we walk the parameter tree, we use id() to deduplicate — a shared
parameter is counted once.
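Aliasing and id()-based deduplication fit in a compact, self-contained sketch (the Tied class and layer names are illustrative, not the lesson's exact code):

```python
import numpy as np

class Parameter:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)

class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()

class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.W = Parameter(np.random.randn(n_in, n_out) * 0.01)
        self.b = Parameter(np.zeros(n_out))

class Tied(Module):   # hypothetical model with a shared weight matrix
    def __init__(self):
        super().__init__()
        self.layer_a = Linear(64, 64)
        self.layer_b = Linear(64, 64)
        self.layer_b.W = self.layer_a.W   # alias, not copy: same object

model = Tied()
all_params = list(model.parameters())       # naive walk counts W twice
unique = {id(p): p for p in all_params}     # deduplicate by object identity
print(len(all_params), len(unique))         # 4 3
assert model.layer_a.W is model.layer_b.W   # one weight object, two owners
```

If the constructor had written self.layer_b.W = Parameter(self.layer_a.W.data.copy()) instead, both counts would be 4 and the tie would be silently broken, which is exactly the failure mode the id() check guards against.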

---