TinyTPU executes real SystemVerilog hardware in your browser via a four-stage pipeline. Every number on screen is a live signal read directly from the compiled RTL binary. Nothing is reimplemented in JavaScript.
The core insight of TinyTPU is that Verilator and Emscripten, chained together, turn a synthesizable hardware description into a WebAssembly module that any browser can execute. The React visualizer is downstream of this: it reads state out of the compiled binary and does not reproduce the math.
rtl/*.sv
The hardware is described in four synthesizable SystemVerilog files:
a processing element (pe.sv), a 4×4 grid of PEs
(systolic_array.sv), a control FSM
(controller.sv), and a top-level wrapper with a debug
output bus (tiny_tpu_top.sv).
The code uses only always_ff and always_comb
blocks. No #delays, no initial blocks in design modules,
no inferred latches. It is synthesizable: you can run it through any
FPGA synthesis tool without modification.
verilator --cc
Verilator is a free, open-source tool that compiles SystemVerilog into a cycle-accurate C++ model. "Cycle-accurate" means the generated model computes the same register and output values at every clock edge as the RTL itself defines. It evaluates the design's logic exactly and is not an approximation, although it does not model gate-level timing.
The C++ class Vtiny_tpu_top is the RTL translated into
software. The build script runs:
verilator --cc rtl/pe.sv rtl/systolic_array.sv \
rtl/controller.sv rtl/tiny_tpu_top.sv \
--top-module tiny_tpu_top \
--Mdir wasm/obj_dir -Wall
This step also catches synthesizability violations. Any
UNOPTFLAT, BLKANDNBLK, or inferred-latch
warning fails the build. The RTL stays honest.
em++ -O3 -lembind
Emscripten is a complete C/C++ to WebAssembly compiler. The C++ harness
in wasm/harness.cpp owns a Vtiny_tpu_top
instance and exposes a TinyTpuSim class with methods
loadA(), loadB(), start(),
step(), and run().
Emscripten's embind layer makes this class callable from JavaScript without any manual FFI or memory management. The build produces two files:
web/public/tiny_tpu.mjs: ES-module JavaScript loader
web/public/tiny_tpu.wasm: the compiled hardware binary
em++ -O3 -std=c++17 -lembind \
-I wasm/obj_dir -I "$VROOT/include" \
wasm/harness.cpp wasm/bindings.cpp \
wasm/obj_dir/Vtiny_tpu_top__ALL.cpp \
"$VROOT/include/verilated.cpp" \
-s MODULARIZE=1 -s EXPORT_ES6=1 \
-s ALLOW_MEMORY_GROWTH=1 \
-o web/public/tiny_tpu.mjs
The React island loads tiny_tpu.mjs inside a
useEffect, never at module top level, never during
server-side rendering. The Astro page uses
client:only="react" to guarantee the WASM never
runs at build time.
wasm-loader.ts wraps the embind API with TypeScript types
derived from the canonical state schema in
docs/STATE_SCHEMA.md. The useTinyTpu hook
calls sim.run() once and stores the full
CycleState[] array. The visualizer components render SVG
from the stored states. The WASM runs once, not on every animation frame.
Importing the WASM loader during server-side rendering fails with a
window is not defined error. The client:only
directive and the useEffect-gated import together prevent
this. Both guards must be present.
Reading internal signals from a compiled Verilator model is fragile. Signal names change across versions and synthesis optimizes them away. TinyTPU instead exposes a stable, explicit debug output bus on the top-level module:
// tiny_tpu_top.sv: top-level debug ports
output logic        [2:0]             dbg_fsm_state, // controller FSM
output logic signed [7:0]  [3:0][3:0] dbg_weight,    // PE weight registers
output logic signed [7:0]  [3:0][3:0] dbg_act,       // PE act_out (registered)
output logic signed [31:0] [3:0][3:0] dbg_psum,      // PE psum_out
output logic signed [7:0]  [3:0]      dbg_west,      // act_west inputs
output logic signed [31:0] [3:0]      dbg_south      // psum_south outputs
The harness reads these ports after each eval() call and
populates a CycleState object. One field, actIn
for each PE, is derived rather than directly read:
// actIn is not a direct register, derived in harness.cpp
actIn[i][j] = (j == 0) ? dbg_west[i] : dbg_act[i][j-1]
This derivation is correct: dbg_act[i][j-1] is the registered
act_out of the PE to the left, which is exactly the activation
signal entering PE[i][j] on the current cycle.
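The same derivation can be exercised in a small behavioral model. The sketch below is plain Python, not the project's harness or RTL. It assumes a weight-stationary dataflow: each PE holds one element of B, activations move west to east, partial sums move north to south, and both PE outputs are registered. The names step and matmul_via_array are hypothetical, and the input skew schedule is an assumption chosen so that this dataflow produces A @ B.

```python
N = 4  # array dimension (4x4 grid of PEs)

def step(weight, act, psum, west):
    """Advance one clock edge; return the new (act, psum) register grids."""
    new_act = [[0] * N for _ in range(N)]
    new_psum = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # Same rule as the harness: column 0 reads the west input,
            # every other PE reads its left neighbour's registered act_out.
            act_in = west[i] if j == 0 else act[i][j - 1]
            # Assumption: partial sums enter from the PE above (0 for row 0).
            psum_in = 0 if i == 0 else psum[i - 1][j]
            new_act[i][j] = act_in
            new_psum[i][j] = psum_in + weight[i][j] * act_in
    return new_act, new_psum

def matmul_via_array(A, B):
    """Compute A @ B by feeding skewed columns of A into an array holding B."""
    act = [[0] * N for _ in range(N)]
    psum = [[0] * N for _ in range(N)]
    C = [[0] * N for _ in range(N)]
    for t in range(3 * N):
        # Array row k receives A[t - k][k]: one extra cycle of delay per row.
        west = [A[t - k][k] if 0 <= t - k < N else 0 for k in range(N)]
        act, psum = step(B, act, psum, west)
        # After this edge the bottom PE of column j holds C[t - (N - 1) - j][j].
        for j in range(N):
            m = t - (N - 1) - j
            if 0 <= m < N:
                C[m][j] = psum[N - 1][j]
    return C
```

Calling matmul_via_array(A, B) on two 4×4 integer lists should return the same result as an ordinary matrix product. The act_in line is the only place the model reads an activation, which is why the harness can reconstruct actIn from dbg_west and dbg_act alone.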
The RTL is verified by comparing its output to a numpy reference model in
sim/golden.py. For 20+ random int8 matrix pairs across sizes
2×2 to 4×4, the cocotb test suite drives the RTL in a Verilator
simulation and asserts bit-exact equality with numpy's integer
matmul:
import numpy as np

def matmul_golden(A, B):
    # Widen to int64 before multiplying so int8 inputs cannot wrap.
    return (A.astype(np.int64) @ B.astype(np.int64)).astype(np.int64)

If the RTL output does not match numpy for every random test case, the pipeline does not proceed. A wrong matmul is a beautiful lie. TinyTPU refuses to tell it.
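The shape of such a randomized comparison can be sketched without cocotb or numpy. This is an illustration only, not the project's test suite: run_dut is a hypothetical callable standing in for whatever drives the RTL, matmul_reference is a pure-Python stand-in for the golden model, and the sketch assumes square matrices of size 2, 3, or 4.

```python
import random

def matmul_reference(A, B):
    """Pure-Python integer matmul; Python ints are arbitrary precision."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def random_int8_matrix(n, rng):
    """n x n matrix with entries drawn uniformly from the int8 range."""
    return [[rng.randint(-128, 127) for _ in range(n)] for _ in range(n)]

def check_against_reference(run_dut, trials=20, seed=0):
    """Compare run_dut(A, B) with the reference on random int8 inputs."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for trial in range(trials):
        n = rng.choice([2, 3, 4])
        A = random_int8_matrix(n, rng)
        B = random_int8_matrix(n, rng)
        got, want = run_dut(A, B), matmul_reference(A, B)
        assert got == want, f"trial {trial} (n={n}): {got} != {want}"
    return trials
```

Seeding the generator is a deliberate choice here: when a trial fails, the same matrices can be regenerated to debug the mismatch.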
numpy integer arithmetic is exact as long as values stay in range (no
floating-point rounding), and the int64 cast leaves ample headroom. An int8
matmul with 4 accumulation terms fits in int32 without overflow. The golden
model is simple enough to trust, so any mismatch means the RTL is wrong.
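The int32 claim can be checked with a few lines of arithmetic. The sketch below only evaluates the worst-case bounds for a dot product of four int8 pairs; the constant names are illustrative.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
TERMS = 4  # accumulation depth of a 4x4 matmul

# A product of two bounded values reaches its extremes at the interval corners.
products = [a * b for a in (INT8_MIN, INT8_MAX) for b in (INT8_MIN, INT8_MAX)]
worst_pos = TERMS * max(products)  # 4 * (-128 * -128) = 65536
worst_neg = TERMS * min(products)  # 4 * (-128 * 127) = -65024

assert INT32_MIN <= worst_neg and worst_pos <= INT32_MAX
```

The largest magnitude, 65536, needs only 18 signed bits, so a 32-bit accumulator has a wide margin at this array size.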