How a systolic array computes a matrix multiply

The weight-stationary systolic array is the dataflow that Google's TPU-v1 uses. TinyTPU implements this dataflow in real synthesizable SystemVerilog, not a simulation shortcut. This page explains why the design works, what the diagonal skew is, and why TPUs use this structure.

The processing element (PE)

The systolic array is a grid of identical cells called processing elements. Each PE does one thing per clock: a multiply-accumulate (MAC).

psum_out ≤ psum_in + weight_reg × act_in

Three signals go in; two come out:

act_in: the activation value arriving from the left. The PE passes it rightward (act_out <= act_in, registered).
psum_in: the partial sum arriving from above. The PE adds its own product and sends the total downward.
weight_reg: the stationary weight. This does not change during a run; it is loaded once before streaming starts.

Because weight and activation each touch the PE on the same clock cycle, there is no memory access during computation. The memory bandwidth bottleneck that plagues conventional processors is completely absent.

Weight-stationary dataflow

"Weight-stationary" describes where the data stays during computation. Matrix B is loaded as stationary weights: PE[i][j] holds B[i][j] for the entire duration of one matrix multiply. Matrix A then streams in from the left edge, one column-element per PE-row per cycle.

PE 0,0 B[0][0]

PE 0,1 B[0][1]

PE 0,2 B[0][2]

PE 0,3 B[0][3]

PE 1,0 B[1][0]

PE 1,1 B[1][1]

PE 1,2 B[1][2]

PE 1,3 B[1][3]

PE 2,0 B[2][0]

PE 2,1 B[2][1]

PE 2,2 B[2][2]

PE 2,3 B[2][3]

PE 3,0 B[3][0]

PE 3,1 B[3][1]

PE 3,2 B[3][2]

PE 3,3 B[3][3]

This dataflow has a critical advantage: because B never moves, every PE can accumulate partial sums into a running total without waiting for memory. Only A and the growing partial sums move, in lockstep with the clock, the heartbeat that gives the systolic array its name.

Why Google chose this for TPU-v1: in a deep learning inference pass, model weights are loaded once per forward pass and reused for every input example in a batch. Weight-stationary dataflow perfectly exploits this reuse. You pay the memory bandwidth cost of loading B once, then multiply it against thousands of A matrices "for free" in subsequent passes.

The diagonal skew: why it exists

Here is the problem: PE[0][0] is adjacent to the left edge, but PE[0][3] is three hops to the right. For the matmul to be correct, element A[0][k] must arrive at column k on the cycle that the partial sum for output element C[0][j] is accumulating at PE[0][j].

Since activations travel one PE-width per cycle, row i of A must be delayed by i cycles relative to row 0. This diagonal stagger is called the skew:

STREAM cycle	Row 0 west input	Row 1 west input	Row 2 west input	Row 3 west input
0	A[0][0]	0	0	0
1	A[0][1]	A[1][0]	0	0
2	A[0][2]	A[1][1]	A[2][0]	0
3	A[0][3]	A[1][2]	A[2][1]	A[3][0]
4	0	A[1][3]	A[2][2]	A[3][1]
5	0	0	A[2][3]	A[3][2]
6	0	0	0	A[3][3]

The diagonal is not artistic; it is a timing requirement. If you send all rows simultaneously (no skew), A[1][0] arrives at PE[1][0] one cycle before B[0][0]'s partial sum has propagated down from PE[0][0], corrupting the output.

This is why the waveform from a real RTL simulation of TinyTPU shows a moving diagonal of non-zero act_west values. That diagonal is the skew, and it is the visual signature of a correctly operating systolic array.

Where the skew is applied: controller.sv, not systolic_array.sv. The array module wires act_west[i] directly to row i's left edge. The controller is responsible for presenting the right element at the right cycle. The array is a dumb grid, which is exactly what makes it cheap to implement in silicon.

Cycle budget for a 4×4 pass

Phase	Cycles	What happens
`LOAD_WEIGHTS`	4	B is loaded column-by-column into the PE grid
`STREAM`	7	A streams in with diagonal skew; MACs fire
`DRAIN`	3	Final partial sums propagate to the south edge
Total	14	Matches `golden.py expected_cycles()`

Tiling: matrices larger than the array

A 4×4 array holds 16 weights. An 8×8 matrix multiply requires 64 weights, more than the array can hold at once. The solution is tiling: split the large matmul into 4×4 tiles and run multiple passes, accumulating partial results in software between passes.

In TinyTPU's L3 view, the TypeScript orchestration layer calls the 4×4 WASM core repeatedly, once per tile pair, and accumulates the results. Each individual tile still runs on real RTL, so the MAC arithmetic is hardware-correct. Only the inter-tile accumulation happens in software, and that is documented clearly in the UI.

For an 8×8 matmul, the tiling schedule is 2×2 output tiles × 2 K-dimension tiles = 8 RTL passes total. The assembled output matches matmul_golden(A, B) exactly.

See it execute

The L2 tab in the visualizer shows the full 4×4 array running cycle-by-cycle. Watch the activation diagonal form, the partial sums accumulate downward, and the final results drain out the bottom. Every number is a real RTL signal.

Open visualizer

Previous ← How it works The RTL to WASM to browser pipeline Next Architecture → Monorepo layout, state contract, build flow