Real hardware · Live in browser

A TPU you can
watch run.

TinyTPU is a real 4×4 weight-stationary systolic array written in synthesizable SystemVerilog and compiled to WebAssembly via Verilator and Emscripten. Every number on screen is a live signal from the compiled RTL. Nothing is faked.

4 × 4
Processing elements
int8
Signed activations
32-bit
Accumulator width
WASM
Runtime target

Live Telemetry

Real-time RTL
Execution Trace.

Every clock cycle of the 4×4 systolic array is captured directly from the WebAssembly-compiled SystemVerilog: weight loading, streaming inputs, drained outputs. Every signal is RTL.

Open full instrument →
Live capture: tiny_tpu_top · clk running · state source: debug output bus

Try your own matrices →

System Specs

Built for hardware-curious developers

Real synthesizable RTL. Real WASM execution. Real signals. No teaching animations, no JS reimplementations of the math.

01 / Spec Highlight

Synthesizable RTL

always_ff · always_comb · zero latches. Real SystemVerilog you can drop into any FPGA synthesis flow.

02 / Spec Highlight

Weight-Stationary

Matrix B loads as stationary weights into each PE. Authentic TPU-v1 dataflow, not a textbook approximation.
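
As a rough illustration of what "stationary" means for one cell, here is a minimal Python sketch of a single processing element. The class name `PE` and its methods are hypothetical and are not taken from the TinyTPU RTL; the sketch only assumes that a PE holds one weight, multiplies it by the incoming activation, adds the incoming partial sum, and forwards both values.

```python
class PE:
    """Behavioral sketch of one weight-stationary MAC cell (hypothetical names)."""

    def __init__(self):
        self.weight = 0    # loaded once, then held while activations stream past
        self.a_out = 0     # activation forwarded to the east neighbor
        self.psum_out = 0  # partial sum forwarded to the south neighbor

    def load_weight(self, w):
        # Weight-load phase: the value stays put for the whole multiply.
        self.weight = w

    def step(self, a_in, psum_in):
        # One clock edge: multiply-accumulate, then register both outputs.
        self.psum_out = psum_in + a_in * self.weight
        self.a_out = a_in
        return self.a_out, self.psum_out


pe = PE()
pe.load_weight(3)
print(pe.step(2, 10))  # activation 2 passes through, partial sum is 10 + 2*3
```

The weight never moves after loading; only activations and partial sums travel between neighbors.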

03 / Spec Highlight

WASM Live Execution

Verilator compiles the RTL to C++, and Emscripten compiles that to WASM. The browser runs the actual RTL as a compiled, cycle-accurate model.

04 / Spec Highlight

Golden-Checked

RTL output must bit-match a NumPy reference model before it earns the right to appear on screen.
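
A bit-match check of this kind might look like the sketch below. It is written in plain Python rather than NumPy to stay dependency-free, and every name in it (`reference_matmul`, `to_bits32`, `check_against_golden`, `rtl_out`) is hypothetical; the project's real reference model and test harness are not shown on this page. The sketch assumes the RTL result arrives as a 4×4 list of integers.

```python
INT8_MIN, INT8_MAX = -128, 127


def reference_matmul(a, b):
    """Plain integer matrix product used as the golden model."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def to_bits32(x):
    """Two's-complement bit pattern of x in a 32-bit accumulator."""
    return x & 0xFFFFFFFF


def check_against_golden(rtl_out, a, b):
    """Return a list of (row, col, rtl, golden) mismatches; empty means bit-match."""
    for row in a:
        for v in row:
            if not INT8_MIN <= v <= INT8_MAX:
                raise ValueError(f"activation {v} is outside int8 range")
    golden = reference_matmul(a, b)
    return [(i, j, rtl_out[i][j], golden[i][j])
            for i in range(len(golden))
            for j in range(len(golden[0]))
            if to_bits32(rtl_out[i][j]) != to_bits32(golden[i][j])]
```

Comparing 32-bit patterns rather than Python integers means a raw unsigned bus word and a signed reference value are treated as equal when their bits agree.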

05 / Spec Highlight

Debug Output Bus

PE weights, activations, partial sums, and FSM phase are exposed via a stable top-level debug bus. No hacks.

06 / Spec Highlight

Progressive Disclosure

L1 single MAC cell → L2 the 4×4 array → L3 tiling matrices larger than the hardware. One concept at a time.
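
The third level, tiling, can be sketched in a few lines of Python. This is only an illustration of the general idea under stated assumptions: the function names are hypothetical, `array_matmul_4x4` is a software stand-in for one pass through the 4×4 array, and edge tiles are zero-padded. The page does not say how TinyTPU itself schedules tiles.

```python
TILE = 4  # array dimension stated on the page


def tile_4x4(mat, r0, c0):
    """Cut a TILE×TILE block starting at (r0, c0), zero-padding past the edges."""
    rows, cols = len(mat), len(mat[0])
    return [[mat[r][c] if r < rows and c < cols else 0
             for c in range(c0, c0 + TILE)]
            for r in range(r0, r0 + TILE)]


def array_matmul_4x4(a, b):
    """Stand-in for one hardware pass: a plain 4×4 product."""
    return [[sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
            for i in range(TILE)]


def tiled_matmul(a, b):
    """Compute a @ b for arbitrary sizes using only 4×4 products."""
    m, inner, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, inner, TILE):
                # One weight tile of B is loaded, one tile of A streams through.
                part = array_matmul_4x4(tile_4x4(a, i0, k0), tile_4x4(b, k0, j0))
                for i in range(i0, min(i0 + TILE, m)):
                    for j in range(j0, min(j0 + TILE, n)):
                        c[i][j] += part[i - i0][j - j0]
    return c
```

Each output tile is the sum of several 4×4 products along the shared dimension, so the accumulation across `k0` happens outside the array in this sketch.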

Execution chain

From HDL to pixels.
No shadow model.

The RTL is the single source of truth. Verilator compiles it to a cycle-accurate C++ model. Emscripten compiles that to WebAssembly. The React island reads hardware state out of the compiled binary; it never reimplements the math in JavaScript.

  1. rtl/*.sv (SystemVerilog)
     PEs, controller FSM, debug output bus

  2. verilator --cc (C++ model)
     Cycle-accurate compiled hardware model

  3. em++ -O3 (WebAssembly)
     embind surface, MODULARIZE, ES6 export

  4. React island (Signal renderer)
     SVG pixels from live hardware state

Datapath

The diagonal
is the proof.

Matrix B loads as stationary weights. Matrix A streams in from the west edge with row skew: row i is delayed by i cycles so that each activation meets the correct weight on the right clock edge. Partial sums accumulate downward and drain from the south edge, also skewed.

If the diagonal is wrong, the multiply is wrong. That is why the interface foregrounds phase, flow, and per-PE state.
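
To make the skew concrete, here is a small cycle-stepped Python model of a weight-stationary array. It is a sketch under assumptions, not a transcription of the RTL: it assumes PE (k, c) holds B[k][c], that array row k is fed column k of A delayed by k cycles, and that every hop between neighbors costs one register stage. The function name `systolic_matmul` and the trace format are hypothetical.

```python
def systolic_matmul(a, b):
    """Step an n×n weight-stationary array cycle by cycle; return (a @ b, trace)."""
    n = len(b)       # array dimension; b is n×n and stays in place
    m_rows = len(a)  # number of rows of A streamed through
    a_reg = [[0] * n for _ in range(n)]  # activation register in each PE
    p_reg = [[0] * n for _ in range(n)]  # partial-sum register in each PE
    out = [[0] * n for _ in range(m_rows)]
    trace = []       # per cycle: list of PEs handling valid data

    for t in range(m_rows + 2 * (n - 1)):
        new_a = [[0] * n for _ in range(n)]
        new_p = [[0] * n for _ in range(n)]
        active = []
        for k in range(n):
            for c in range(n):
                if c == 0:
                    m = t - k  # row skew: array row k starts k cycles late
                    a_in = a[m][k] if 0 <= m < m_rows else 0
                else:
                    a_in = a_reg[k][c - 1]  # from the west neighbor
                p_in = p_reg[k - 1][c] if k > 0 else 0  # from the north neighbor
                new_a[k][c] = a_in
                new_p[k][c] = p_in + a_in * b[k][c]
                if 0 <= t - k - c < m_rows:
                    active.append((k, c))
        a_reg, p_reg = new_a, new_p
        trace.append(active)
        # Drain the south edge: column c lags column 0 by c cycles.
        for c in range(n):
            m = t - (n - 1) - c
            if 0 <= m < m_rows:
                out[m][c] = p_reg[n - 1][c]
    return out, trace
```

In this model the leading edge of the active region advances one anti-diagonal per cycle: `trace[0]` contains only PE (0, 0), and PEs (0, 1) and (1, 0) join on the next cycle. Removing the skew on the west inputs makes the activations and partial sums miss each other, and the result no longer matches a plain matrix product.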

■ Active MAC ■ Weight-loaded ■ Idle

Ready to watch it run?

Open the instrument. Enter two matrices. Watch actual RTL execute in your browser.