The weight-stationary systolic array is the dataflow that Google's TPU-v1 uses. TinyTPU implements this dataflow in real synthesizable SystemVerilog, not a simulation shortcut. This page explains why the design works, what the diagonal skew is, and why TPUs use this structure.
The systolic array is a grid of identical cells called processing elements. Each PE does one thing per clock: a multiply-accumulate (MAC).
Three signals go in; two come out:
act_in: the activation value arriving from the left.
The PE passes it rightward (act_out <= act_in, registered).
psum_in: the partial sum arriving from above.
The PE adds its own product and sends the total downward.
weight_reg: the stationary weight.
This does not change during a run; it is loaded once before streaming starts.
Because weight and activation each touch the PE on the same clock cycle, there is no memory access during computation. The memory bandwidth bottleneck that plagues conventional processors is completely absent.
"Weight-stationary" describes where the data stays during computation. Matrix B is loaded as stationary weights: PE[i][j] holds B[i][j] for the entire duration of one matrix multiply. Matrix A then streams in from the left edge, one column-element per PE-row per cycle.
This dataflow has a critical advantage: because B never moves, every PE can accumulate partial sums into a running total without waiting for memory. Only A and the growing partial sums move, in lockstep with the clock, the heartbeat that gives the systolic array its name.
Here is the problem: PE[0][0] is adjacent to the left edge, but PE[0][3] is three hops to the right. For the matmul to be correct, element A[0][k] must arrive at column k on the cycle that the partial sum for output element C[0][j] is accumulating at PE[0][j].
Since activations travel one PE-width per cycle, row i of A must be delayed by i cycles relative to row 0. This diagonal stagger is called the skew:
| STREAM cycle | Row 0 west input | Row 1 west input | Row 2 west input | Row 3 west input |
|---|---|---|---|---|
| 0 | A[0][0] | 0 | 0 | 0 |
| 1 | A[0][1] | A[1][0] | 0 | 0 |
| 2 | A[0][2] | A[1][1] | A[2][0] | 0 |
| 3 | A[0][3] | A[1][2] | A[2][1] | A[3][0] |
| 4 | 0 | A[1][3] | A[2][2] | A[3][1] |
| 5 | 0 | 0 | A[2][3] | A[3][2] |
| 6 | 0 | 0 | 0 | A[3][3] |
The diagonal is not artistic; it is a timing requirement. If you send all rows simultaneously (no skew), A[1][0] arrives at PE[1][0] one cycle before B[0][0]'s partial sum has propagated down from PE[0][0], corrupting the output.
This is why the waveform from a real RTL simulation of TinyTPU shows a moving
diagonal of non-zero act_west values. That diagonal is the skew,
and it is the visual signature of a correctly operating systolic array.
controller.sv, not
systolic_array.sv. The array module wires act_west[i]
directly to row i's left edge. The controller is responsible for
presenting the right element at the right cycle. The array is a dumb grid,
which is exactly what makes it cheap to implement in silicon.
| Phase | Cycles | What happens |
|---|---|---|
LOAD_WEIGHTS | 4 | B is loaded column-by-column into the PE grid |
STREAM | 7 | A streams in with diagonal skew; MACs fire |
DRAIN | 3 | Final partial sums propagate to the south edge |
| Total | 14 | Matches golden.py expected_cycles() |
A 4×4 array holds 16 weights. An 8×8 matrix multiply requires 64 weights, more than the array can hold at once. The solution is tiling: split the large matmul into 4×4 tiles and run multiple passes, accumulating partial results in software between passes.
In TinyTPU's L3 view, the TypeScript orchestration layer calls the 4×4 WASM core repeatedly, once per tile pair, and accumulates the results. Each individual tile still runs on real RTL, so the MAC arithmetic is hardware-correct. Only the inter-tile accumulation happens in software, and that is documented clearly in the UI.
For an 8×8 matmul, the tiling schedule is 2×2 output tiles × 2 K-dimension
tiles = 8 RTL passes total. The assembled output matches
matmul_golden(A, B) exactly.
The L2 tab in the visualizer shows the full 4×4 array running cycle-by-cycle. Watch the activation diagonal form, the partial sums accumulate downward, and the final results drain out the bottom. Every number is a real RTL signal.