From RTL to pixels: no shadow model

TinyTPU runs real SystemVerilog RTL in your browser via a four-stage pipeline. Every number on screen is a signal value read from the compiled RTL binary. Nothing is reimplemented in JavaScript.

The four-stage pipeline

The core insight of TinyTPU is that Verilator and Emscripten, chained together, turn a synthesizable hardware description into a WebAssembly module that any browser can execute. The React visualizer is downstream of this: it reads state out of the compiled binary; it does not reproduce the math.

  1. SystemVerilog RTL: rtl/*.sv

    The hardware is described in four synthesizable SystemVerilog files: a processing element (pe.sv), a 4×4 grid of PEs (systolic_array.sv), a control FSM (controller.sv), and a top-level wrapper with a debug output bus (tiny_tpu_top.sv).

    The design modules use only always_ff and always_comb blocks: no #delays, no initial blocks, no inferred latches. The code is synthesizable, so an FPGA synthesis tool with SystemVerilog support should accept it without modification.

    Why synthesizability matters: a description that cannot be mapped to real gates is not hardware; it is a simulation script. TinyTPU uses no such shortcuts.
  2. Verilator: verilator --cc

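    Verilator is a free, open-source tool that compiles SystemVerilog into a cycle-accurate C++ model. "Cycle-accurate" here means the model computes the same register and output values as the RTL at every clock edge. It does not model gate delays or other sub-cycle timing, and it evaluates largely in two-state logic, so X and Z values are not propagated the way an event-driven simulator would propagate them.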

    The C++ class Vtiny_tpu_top is the RTL translated into software. The build script runs:

    verilator --cc rtl/pe.sv rtl/systolic_array.sv \
              rtl/controller.sv rtl/tiny_tpu_top.sv \
              --top-module tiny_tpu_top \
              --Mdir wasm/obj_dir -Wall

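    This step also acts as a lint gate that catches many synthesizability problems. With -Wall, Verilator treats warnings such as UNOPTFLAT, BLKANDNBLK, or LATCH (an inferred latch) as fatal unless -Wno-fatal is given, so any of them fails the build. The RTL stays honest.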

  3. Emscripten: em++ -O3 -lembind

    Emscripten is a complete C/C++ to WebAssembly compiler. The C++ harness in wasm/harness.cpp owns a Vtiny_tpu_top instance and exposes a TinyTpuSim class with methods loadA(), loadB(), start(), step(), and run().

    Emscripten's embind layer makes this class callable from JavaScript without hand-written FFI glue. Instances created from JavaScript are not garbage-collected, so they still have to be released with .delete(). The build produces two files:

    • web/public/tiny_tpu.mjs: ES-module JavaScript loader
    • web/public/tiny_tpu.wasm: the compiled hardware binary

    em++ -O3 -std=c++17 -lembind \
         -I wasm/obj_dir -I "$VROOT/include" \
         wasm/harness.cpp wasm/bindings.cpp \
         wasm/obj_dir/Vtiny_tpu_top__ALL.cpp \
         "$VROOT/include/verilated.cpp" \
         -s MODULARIZE=1 -s EXPORT_ES6=1 \
         -s ALLOW_MEMORY_GROWTH=1 \
         -o web/public/tiny_tpu.mjs
  4. React island: signal renderer

    The React island loads tiny_tpu.mjs inside a useEffect, never at module top level, never during server-side rendering. The Astro page uses client:only="react" to guarantee the WASM never runs at build time.

    wasm-loader.ts wraps the embind API with TypeScript types derived from the canonical state schema in docs/STATE_SCHEMA.md. The useTinyTpu hook calls sim.run() once and stores the full CycleState[] array. The visualizer components render SVG from the stored states. The WASM runs once, not on every animation frame.

    SSR safety: Astro renders the page at build time. Any attempt to import the WASM module during the build fails with a "window is not defined" error. The client:only directive and the useEffect-gated import together prevent this. Both guards must be present.

The debug output bus

Reading internal signals from a compiled Verilator model is fragile. The generated C++ names for internal signals can change between Verilator versions, and the optimizer may inline or remove signals entirely. TinyTPU instead exposes a stable, explicit debug output bus on the top-level module:

// tiny_tpu_top.sv: top-level debug ports
// Packed dimensions are ordered [row][col][bit], so dbg_act[i][j] selects one 8-bit value.
output logic        [2:0]            dbg_fsm_state, // controller FSM
output logic signed [3:0][3:0][7:0]  dbg_weight,    // PE weight registers
output logic signed [3:0][3:0][7:0]  dbg_act,       // PE act_out (registered)
output logic signed [3:0][3:0][31:0] dbg_psum,      // PE psum_out
output logic signed [3:0][7:0]       dbg_west,      // act_west inputs
output logic signed [3:0][31:0]      dbg_south      // psum_south outputs
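
On the C++ side, Verilator presents each packed port as one flat bit vector, so the harness has to slice it back into per-PE values. The Python sketch below illustrates that slicing for a 4×4 grid of int8 values. It assumes element [i][j] occupies the 8 bits starting at bit 8·(4i + j), which is the layout of a packed array declared [3:0][3:0][7:0]. The helper names unpack_int8_grid and pack_int8_grid are hypothetical, and the real harness.cpp may slice the bus differently.

```python
def unpack_int8_grid(flat: int, rows: int = 4, cols: int = 4) -> list[list[int]]:
    """Split a flat packed vector into a rows x cols grid of signed 8-bit values.

    Assumption: element [i][j] sits at bits 8*(i*cols + j) .. 8*(i*cols + j) + 7.
    """
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            byte = (flat >> (8 * (i * cols + j))) & 0xFF
            # Reinterpret the raw byte as a two's-complement int8.
            row.append(byte - 256 if byte >= 128 else byte)
        grid.append(row)
    return grid


def pack_int8_grid(grid: list[list[int]]) -> int:
    """Inverse of unpack_int8_grid; used here only to build example vectors."""
    cols = len(grid[0])
    flat = 0
    for i, row in enumerate(grid):
        for j, value in enumerate(row):
            flat |= (value & 0xFF) << (8 * (i * cols + j))
    return flat
```

Round-tripping a grid through pack_int8_grid and unpack_int8_grid is a cheap way to confirm that the bit layout and the sign handling agree before trusting any value shown on screen.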

The harness reads these ports after each eval() call and populates a CycleState object. One field, actIn for each PE, is derived rather than directly read:

// actIn is not a direct register; it is derived in harness.cpp
actIn[i][j] = (j == 0) ? dbg_west[i] : dbg_act[i][j-1];

This derivation is correct: dbg_act[i][j-1] is the registered act_out of the PE to the left, which is exactly the activation signal entering PE[i][j] on the current cycle.
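
The same rule can be written over whole grids. The Python sketch below is an illustration only: the function name derive_act_in is hypothetical, and the inputs are assumed to be already unpacked into plain lists (dbg_west as one value per row, dbg_act as a row-major grid). The actual derivation lives in harness.cpp.

```python
def derive_act_in(dbg_west: list[int], dbg_act: list[list[int]]) -> list[list[int]]:
    """Activation entering each PE on the current cycle.

    Column 0 is fed from the west edge of the array; every other column is
    fed by the registered act_out of its left-hand neighbour.
    """
    rows = len(dbg_act)
    cols = len(dbg_act[0])
    return [
        [dbg_west[i] if j == 0 else dbg_act[i][j - 1] for j in range(cols)]
        for i in range(rows)
    ]
```

Note that the last column of dbg_act never appears in the result: the act_out of the rightmost PE in each row leaves the array and feeds nothing.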

Golden-model verification

The RTL is verified by comparing its output to a numpy reference model in sim/golden.py. For more than 20 random int8 matrix pairs across sizes 2×2 to 4×4, the cocotb test suite runs the RTL under Verilator and asserts bit-exact equality with numpy's integer matmul:

import numpy as np

def matmul_golden(A, B):
    return A.astype(np.int64) @ B.astype(np.int64)

If the RTL output does not match numpy for every random test case, the pipeline does not proceed. A wrong matmul is a beautiful lie. TinyTPU refuses to tell it.
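
The shape of such a randomized check can be sketched without a simulator. In the Python below, matmul_golden repeats the reference model, while run_random_checks, reference_dut, the trial count, and the seed are all hypothetical. reference_dut is a plain multiply-accumulate loop standing in for the RTL; in the real suite that role is played by the design running under cocotb.

```python
import numpy as np


def matmul_golden(A, B):
    return A.astype(np.int64) @ B.astype(np.int64)


def reference_dut(A, B):
    """Stand-in for the RTL (an assumption): explicit multiply-accumulate loops."""
    n, k = A.shape
    m = B.shape[1]
    out = np.zeros((n, m), dtype=np.int64)
    for i in range(n):
        for j in range(m):
            acc = 0
            for t in range(k):
                acc += int(A[i, t]) * int(B[t, j])
            out[i, j] = acc
    return out


def run_random_checks(dut, trials=24, seed=0):
    """Compare dut against the golden model on random int8 square matrices."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        n = int(rng.integers(2, 5))  # matrix size 2, 3, or 4
        A = rng.integers(-128, 128, size=(n, n), dtype=np.int8)
        B = rng.integers(-128, 128, size=(n, n), dtype=np.int8)
        expected = matmul_golden(A, B)
        actual = dut(A, B)
        assert np.array_equal(actual, expected), f"mismatch for {n}x{n} input"
    return trials
```

Fixing the seed keeps any failure reproducible, which matters more than raw trial count when the failing case has to be replayed in a waveform viewer.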

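Why numpy? Integer arithmetic in numpy involves no floating-point rounding, and it is exact as long as the result fits the chosen dtype (fixed-width integers wrap silently on overflow, which is why the golden model widens to int64 first). An int8 matmul with 4 accumulation terms fits in int32 without overflow. The golden model is simple enough to trust by inspection, so a mismatch almost always points at the RTL or at the test harness around it.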
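
The overflow claim can be checked with a few lines of plain Python. This is a sketch for illustration; the function name accumulator_bounds is hypothetical.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def accumulator_bounds(terms: int) -> tuple[int, int]:
    """Smallest and largest possible sum of `terms` int8 x int8 products."""
    largest_product = INT8_MIN * INT8_MIN    # (-128) * (-128) = 16384
    smallest_product = INT8_MIN * INT8_MAX   # (-128) * 127 = -16256
    return terms * smallest_product, terms * largest_product


lowest, highest = accumulator_bounds(4)
fits_in_int32 = INT32_MIN <= lowest and highest <= INT32_MAX
```

For 4 terms the bounds are -65024 and 65536, several orders of magnitude inside the int32 range, so the 32-bit partial-sum width leaves ample headroom for a 4×4 array.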