Implementation Deep Dive
Rank and Axes
A tensor in NumPy is simply an ndarray with a .ndim attribute
(the rank) and a .shape tuple (the size of each dimension):
| Rank | Example Shape | Real-World Meaning |
|---|---|---|
| 0D | () | A single scalar: one temperature reading |
| 1D | (128,) | A feature vector: one patient's 128 blood markers |
| 2D | (32, 128) | A batch of 32 feature vectors |
| 3D | (32, 28, 28) | A batch of 32 grayscale images |
| 4D | (32, 28, 28, 3) | A batch of 32 RGB images |
An axis's index is its position in .shape. Axis 0 is almost always
the batch axis — the dimension over which you iterate samples.
Axis -1 is almost always the feature or channel axis — the dimension
that describes what a single sample is made of.
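These conventions are easy to confirm in code. A quick sketch (the array contents are placeholders; only the shapes matter):

```python
import numpy as np

batch = np.zeros((32, 28, 28))   # rank 3: a batch of 32 grayscale images
features = np.zeros((32, 128))   # rank 2: 32 samples, 128 features each

print(batch.ndim)          # 3 — the rank
print(batch.shape[0])      # 32 — axis 0, the batch axis
print(features.shape[-1])  # 128 — axis -1, the feature axis
```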
Strides: The Hidden Geometry
Every NumPy array carries a .strides tuple alongside .shape. Strides
tell you how many bytes to jump in memory to move one step along each
axis. For a float64 array of shape (4, 3):
```python
import numpy as np

arr = np.zeros((4, 3), dtype=np.float64)
print(arr.strides)  # (24, 8) — 24 bytes per row step, 8 bytes per column step
```
Row-major (C-order) layout means the last axis has the smallest stride.
This matters enormously for cache performance. When you compute
arr.mean(axis=0) — reducing over rows — NumPy strides through memory
in large jumps (24 bytes each). When you compute arr.mean(axis=1) —
reducing over columns — it strides in 8-byte steps: sequential memory
access, maximally cache-friendly.
Understanding strides is how you explain why reducing along one axis
can be faster than reducing along another, and why some matrix-product
orderings outperform others for certain shapes: one traversal produces
cache misses, the other streams through memory sequentially. This is
not magic. It is geometry.
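A small follow-up sketch makes the geometry concrete: transposing an array swaps its strides tuple without moving a single byte, which is why a transposed array is no longer C-contiguous.

```python
import numpy as np

arr = np.zeros((4, 3), dtype=np.float64)
print(arr.strides)   # (24, 8): row-major layout

# Transposing swaps the strides; the underlying memory is untouched.
t = arr.T
print(t.strides)                 # (8, 24)
print(t.flags['C_CONTIGUOUS'])   # False: same bytes, different geometry
```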
Reshape vs View
```python
import numpy as np

a = np.arange(12)            # shape (12,)
b = a.reshape(3, 4)          # shape (3, 4) — view, no data copy
c = a.reshape(3, 4).copy()   # shape (3, 4) — new array, data copied

b[0, 0] = 999
print(a[0])      # 999 — b is a VIEW of a, same memory
print(c[0, 0])   # 0 — c is independent
```
reshape returns a view whenever the data is contiguous in memory.
It does not allocate new memory. Modifying b modifies a. This is
efficient but dangerous if you don't know it. reshape after a
non-contiguous operation (like a transpose) forces a copy. The rule:
call .copy() explicitly when you need independence; let NumPy give
you a view when you just need a different shape window onto the same data.
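The copy-after-transpose rule can be verified directly. A minimal sketch, using np.shares_memory to check whether two arrays overlap:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
t = a.T                          # non-contiguous view of a
r = t.reshape(6)                 # strides cannot describe this: forced copy

r[0] = -1
print(a[0, 0])                   # 0 — a is untouched; r owns fresh memory
print(np.shares_memory(a, r))    # False
```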
Vectorized Axis Statistics
Computing the mean of each column in a 2D array:
```python
# Python loop — one Python-level iteration and float allocation per column
col_means_loop = [arr[:, j].mean() for j in range(arr.shape[1])]

# Vectorized — one C-level loop over contiguous memory
col_means_vec = arr.mean(axis=0)  # shape: (n_cols,)
```
The loop version creates n_cols temporary Python float objects. The
vectorized version allocates one output array of shape (n_cols,) and
fills it in a single pass through compiled code. For a (10000, 512)
array — typical for a small embedding matrix — the loop takes ~200ms;
arr.mean(axis=0) takes ~2ms. Same arithmetic, a hundred times faster.
The loop is in Python; the reduction is in C.
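You can reproduce the comparison with time.perf_counter. Absolute numbers depend on your machine, so treat the timings below as illustrative, not as the lesson's benchmark:

```python
import time
import numpy as np

arr = np.random.default_rng(0).random((10_000, 512))

t0 = time.perf_counter()
col_means_loop = [arr[:, j].mean() for j in range(arr.shape[1])]
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
col_means_vec = arr.mean(axis=0)
t_vec = time.perf_counter() - t0

# Same arithmetic, same results, very different cost.
print(np.allclose(col_means_loop, col_means_vec))  # True
print(f"loop: {t_loop * 1e3:.1f} ms   vectorized: {t_vec * 1e3:.1f} ms")
```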
Production Readiness — Metrics to Watch
For this lesson, "production readiness" means trusting your array
operations. Check these before passing data to any model:
| Signal | What to Look For |
|---|---|
| Shape sanity | Does arr.shape[0] match your expected batch size? |
| Dtype | Is it float32 or float64? Mixed dtypes silently upcast. |
| NaN/Inf count | np.isnan(arr).sum() must be 0 before any training |
| Value range | Min/max per axis — unbounded inputs destabilize models |
| Memory footprint | arr.nbytes / 1e6 MB — fits in 4GB RAM? |
| Contiguity | arr.flags['C_CONTIGUOUS'] — non-contiguous arrays silently copy on reshape |
The train.py benchmark script measures slice and stat computation time
across tensor ranks and logs all six signals for every generated tensor.
Step-by-Step Guide
Prerequisites
- Python 3.11+
- `pip install "numpy>=1.26" "streamlit>=1.32" "plotly>=5.20"` (quote the version specifiers so the shell does not treat `>` as redirection)
- A terminal (any OS — macOS, Windows PowerShell, Linux bash)
- The lesson folder: `lesson_01/`
Execution
```bash
cd lesson_01
pip install -r requirements.txt
streamlit run app.py
```
Verification
The Streamlit app opens at http://localhost:8501. Use the sidebar to
select a tensor rank (0D through 4D) or upload a CSV file. The main
panel shows the shape badge, strides table, memory footprint, and a
Plotly heatmap of the selected 2D slice.
Open model.py and locate the function named compute_axis_stats —
that is the entire lesson in twenty lines: pure NumPy, no loops, every
reduction axis explicit.
Click "Simulate Error" to trigger a deliberate bad reshape: 13 elements
into shape (3, 4), which needs exactly 12. Watch the error panel, then
click Reset to restore valid defaults.
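The failure mode behind that button is easy to reproduce in plain NumPy: reshape raises a ValueError whenever the total element count would change.

```python
import numpy as np

data = np.arange(13)         # 13 elements — cannot tile into 3 x 4 = 12
try:
    data.reshape(3, 4)
except ValueError as e:
    print(type(e).__name__)  # ValueError: total size must stay constant
```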
Homework — Production Challenge
Extend extract_slice to support 4D tensors interactively.
Currently, extract_slice handles 2D and 3D arrays. Modify model.py
so that when a 4D array of shape (B, H, W, C) is loaded, the sidebar
exposes four sliders — one per axis — and the heatmap shows the
2D cross-section at arr[b_idx, :, :, c_idx] for any chosen batch
index and channel index. No new libraries. Add a second Plotly subplot
showing the per-channel histogram for the selected batch item.
This single change is how every real image-debugging tool works.
Build it from scratch and you will never be confused by a "wrong channel"
bug again.
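One possible shape for the slicing logic, to get you started. The function name extract_slice_4d and its signature are illustrative guesses, not the actual model.py API:

```python
import numpy as np

def extract_slice_4d(arr: np.ndarray, b_idx: int, c_idx: int) -> np.ndarray:
    """Fix the batch and channel axes; keep the full H x W cross-section."""
    assert arr.ndim == 4, "expects shape (B, H, W, C)"
    return arr[b_idx, :, :, c_idx]

batch = np.zeros((32, 28, 28, 3))
print(extract_slice_4d(batch, b_idx=0, c_idx=2).shape)  # (28, 28)
```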
Next Lesson: Matrix Multiplication from Scratch — why np.dot is not
the same as element-wise multiplication, and how the dot product is the
computational heart of every layer in every neural network.