TinyTPU is a single monorepo with four distinct layers: the RTL, the WASM bridge, the web frontend, and the simulation harness. This page documents the boundaries between them and the data contract that keeps the visualizer honest.
Each directory has a single responsibility. Nothing crosses these boundaries except through the defined interfaces documented here.
- rtl/: SystemVerilog, the source of truth
  - pe.sv: MAC cell, stationary weight register
  - systolic_array.sv: 4x4 generate-loop of PEs
  - controller.sv: FSM: IDLE → LOAD_WEIGHTS → STREAM → DRAIN
  - tiny_tpu_top.sv: top wrapper + debug output bus
  - README.md: signal dictionary and dataflow spec
- sim/: cocotb verification against golden model
  - golden.py: numpy reference (ground truth)
  - test_pe.py, test_systolic_array.py, test_top.py
  - Makefile: cocotb + Verilator backend
- wasm/: Verilator C++ harness → Emscripten → WASM
  - harness.cpp: TinyTpuSim class, reads debug bus, emits CycleState
  - bindings.cpp: embind surface, JS-callable API
  - build.sh: verilator --cc + em++ invocation
- web/: Astro + React + shadcn/ui + Tailwind
  - src/lib/state-schema.ts: TypeScript types mirroring docs/STATE_SCHEMA.md
  - src/lib/wasm-loader.ts: typed async wrapper for WASM embind API
  - src/hooks/useTinyTpu.ts: CycleState[] + playback state
  - src/components/Visualizer.tsx: React island (client:only)
  - src/components/PEGrid.tsx: pure SVG, presentational
  - src/components/Controls.tsx: transport bar
  - src/components/MatrixInput.tsx: A/B matrix editors
  - public/tiny_tpu.mjs, public/tiny_tpu.wasm: compiled artifacts
- docs/: Extended documentation
  - STATE_SCHEMA.md: canonical state contract (must sync with state-schema.ts)
The C++ harness and the React visualizer communicate through a single shared
contract: the CycleState object emitted once per clock tick.
This contract is defined in two places that must stay in sync:
docs/STATE_SCHEMA.md (prose) and
web/src/lib/state-schema.ts (TypeScript types).
Every field in CycleState maps to either a direct hardware
signal or a documented derivation from hardware signals. There are no
fabricated values.
| Field | TypeScript type | Hardware source | Notes |
|---|---|---|---|
| cycle | number | counter in harness | Starts at 0 when start fires |
| fsmState | "IDLE" \| "LOAD_WEIGHTS" \| "STREAM" \| "DRAIN" \| "DONE" | dbg_fsm_state (3-bit) | "DONE" is harness-derived |
| pes[16] | PEState[] | dbg_weight / dbg_act / dbg_psum | Row-major: index i×4+j |
| westInputs[4] | number[] | dbg_west | Signed int8, current cycle |
| southOutputs[4] | SouthOutput[] | dbg_south + harness validity | valid = harness-computed |
| done | boolean | done port of tiny_tpu_top | Single-cycle pulse |
| Field | Hardware source | Notes |
|---|---|---|
| row, col | PE index (i, j) | 0–3 |
| weight | dbg_weight[i][j] | Signed int8, stationary |
| actIn | Derived (not a direct register read) | j==0 ? dbg_west[i] : dbg_act[i][j-1] |
| psum | dbg_psum[i][j] | Signed int32, registered psum_out |
| active | fsmState==STREAM && actIn!=0 | Harness-computed boolean |
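The two tables above can be read as a set of TypeScript types. The following is only a sketch transcribed from those tables; the real definitions live in web/src/lib/state-schema.ts and may differ. The `value` field on `SouthOutput`, the `ARRAY_DIM` constant, and the `peIndex` helper are assumptions introduced for illustration.

```typescript
// Sketch of the CycleState contract, transcribed from the tables above.
type FsmState = "IDLE" | "LOAD_WEIGHTS" | "STREAM" | "DRAIN" | "DONE";

interface PEState {
  row: number;     // 0-3
  col: number;     // 0-3
  weight: number;  // signed int8, stationary
  actIn: number;   // derived from dbg_west / dbg_act
  psum: number;    // signed int32, registered psum_out
  active: boolean; // harness-computed
}

interface SouthOutput {
  value: number;  // hypothetical field name, not confirmed by the source
  valid: boolean; // harness-computed validity
}

interface CycleState {
  cycle: number;               // starts at 0 when start fires
  fsmState: FsmState;
  pes: PEState[];              // 16 entries, row-major
  westInputs: number[];        // 4 entries, signed int8
  southOutputs: SouthOutput[]; // 4 entries
  done: boolean;               // single-cycle pulse
}

const ARRAY_DIM = 4; // 4x4 systolic array

// Hypothetical helper: row-major position of PE[i][j] inside `pes`.
function peIndex(i: number, j: number): number {
  return i * ARRAY_DIM + j;
}
```

With this layout, `pes[peIndex(2, 3)]` is the PE in row 2, column 3, stored at index 11.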
actIn is not read directly from dbg_act[i][j], which is pe[i][j].act_out, the
registered output of the passthrough, not the input of the current
cycle. The harness computes actIn as
(j==0) ? dbg_west[i] : dbg_act[i][j-1], which is the correct
activation signal entering PE[i][j] on this cycle. This derivation is
documented in both STATE_SCHEMA.md and rtl/README.md.
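The derivation above is small enough to write out. The sketch below mirrors it in TypeScript for readers on the frontend side; the actual computation happens in the C++ harness. The `DebugSnapshot` container and the function names are hypothetical.

```typescript
// One cycle's worth of debug-bus values. Field names mirror the dbg_* signals;
// the container type itself is hypothetical.
interface DebugSnapshot {
  dbgWest: number[];  // dbg_west[i], 4 entries
  dbgAct: number[][]; // dbg_act[i][j], 4x4, each the registered act_out of PE[i][j]
}

// Activation entering PE[i][j] on this cycle.
function deriveActIn(snap: DebugSnapshot, i: number, j: number): number {
  return j === 0 ? snap.dbgWest[i] : snap.dbgAct[i][j - 1];
}

// `active` as documented: FSM is in STREAM and the incoming activation is non-zero.
function deriveActive(fsmState: string, actIn: number): boolean {
  return fsmState === "STREAM" && actIn !== 0;
}

const snap: DebugSnapshot = {
  dbgWest: [5, 0, -3, 7],
  dbgAct: [
    [1, 2, 3, 4],
    [0, 0, 0, 0],
    [9, 8, 7, 6],
    [-1, -2, -3, -4],
  ],
};

console.log(deriveActIn(snap, 0, 0)); // 5: column 0 reads dbg_west[0]
console.log(deriveActIn(snap, 2, 1)); // 9: reads dbg_act[2][0], the west neighbour's output
```

Reading `dbgAct[i][j]` instead would report the value PE[i][j] is passing on to its east neighbour, not the value it is consuming.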
All RTL tooling (Verilator, Emscripten) runs inside WSL2 Ubuntu. The web frontend can run on either WSL2 or Windows; prefer WSL2 for consistency.
verilator --lint-only -Wall rtl/*.sv
Run this before every commit touching RTL. Any warning is a build failure. UNOPTFLAT (combinational loop), BLKANDNBLK
(mixed blocking/non-blocking), and inferred latches are all blocking
conditions.
source ~/.venvs/tinytpu/bin/activate
pytest sim/golden.py -q
cd sim && make MODULE=test_top TOPLEVEL=tiny_tpu_top \
VERILOG_SOURCES="../rtl/pe.sv ../rtl/systolic_array.sv \
../rtl/controller.sv ../rtl/tiny_tpu_top.sv"
The cocotb test suite runs the RTL on the Verilator simulator backend and
asserts bit-exact equality with sim/golden.py for 20+ random
int8 matrix pairs. A failing test means the RTL is wrong. Do not proceed.
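To make "bit-exact" concrete, here is an illustrative TypeScript sketch of the kind of comparison involved. It is not part of the project: the ground truth is sim/golden.py, and the frontend does not use anything like this. The sketch assumes the result is C = A·B with int8 inputs and int32 accumulation; both function names are hypothetical. With int8 inputs each product is at most 16384 in magnitude, so a 4-term sum is at most 65536 and cannot overflow int32.

```typescript
// Illustrative reference only. Square matrices stored as number[][].
function matmulInt32(a: number[][], b: number[][]): number[][] {
  const n = a.length;
  const out: number[][] = [];
  for (let i = 0; i < n; i++) {
    const row: number[] = [];
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let k = 0; k < n; k++) {
        acc = (acc + a[i][k] * b[k][j]) | 0; // keep the accumulator in signed 32-bit range
      }
      row.push(acc);
    }
    out.push(row);
  }
  return out;
}

// Bit-exact: every element must match exactly, with no tolerance.
function bitExactEqual(x: number[][], y: number[][]): boolean {
  return (
    x.length === y.length &&
    x.every(
      (row, i) => row.length === y[i].length && row.every((v, j) => v === y[i][j]),
    )
  );
}
```

A single mismatching element, even off by one, makes `bitExactEqual` return false, which is the standard the cocotb suite applies to the RTL output.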
bash wasm/build.sh # outputs web/public/tiny_tpu.mjs + web/public/tiny_tpu.wasm
The build script runs verilator --cc on the RTL (generates C++)
then invokes em++ to compile the C++ harness + Verilator model
to WebAssembly. Artifacts land in web/public/ so they are served
from the web root in both dev and production.
After changing any rtl/*.sv file, re-run bash wasm/build.sh before testing
the frontend; otherwise the browser is running stale hardware.
cd web
pnpm dev        # dev server at localhost:4321
pnpm lint       # eslint
pnpm typecheck  # astro check && tsc --noEmit
pnpm build      # production build → web/dist/
verilator --lint-only -Wall rtl/*.sv
cd sim && pytest golden.py -q && make MODULE=test_top \
TOPLEVEL=tiny_tpu_top \
VERILOG_SOURCES="../rtl/pe.sv ../rtl/systolic_array.sv \
../rtl/controller.sv ../rtl/tiny_tpu_top.sv"
cd web && pnpm lint && pnpm typecheck && pnpm build

The frontend never reimplements the matmul in JavaScript for the animation. It reads state out of the compiled WASM binary, a binary that is itself a compiled representation of the SystemVerilog. If the RTL is wrong, the visualizer shows the wrong thing. If the RTL is right (golden-verified), the visualizer is right.
Verilator's public_flat attribute can expose internal signals
by name, but the names change across synthesis tools and Verilator versions.
TinyTPU instead exposes a stable, explicitly-typed debug output bus on
tiny_tpu_top. The harness reads these ports after each
eval(), the same way any downstream module would read outputs.
This keeps the viz interface stable and the RTL synthesizable.
Astro builds the site at compile time. Importing WASM during SSR causes
a window is not defined error; WASM requires a browser
environment. Every React island that touches WASM uses
client:only="react" and loads the module inside a
useEffect behind a typeof window !== "undefined"
guard. Both guards must be present; either alone is insufficient.
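The runtime half of that guard can be sketched without any framework code. The example below is a minimal illustration, not the project's wasm-loader.ts: `GlobalLike`, `hasWindow`, and `loadInBrowserOnly` are hypothetical names, and the optional `g` parameter exists only so the check can be exercised outside a browser.

```typescript
// Minimal view of the global object; `window` exists only in a browser.
interface GlobalLike {
  window?: unknown;
}

// True only when a browser `window` is present (false during SSR or the build).
function hasWindow(g: GlobalLike = globalThis as unknown as GlobalLike): boolean {
  return typeof g.window !== "undefined";
}

// Hypothetical wrapper: run the WASM import only in a browser, otherwise return null.
async function loadInBrowserOnly<T>(
  importer: () => Promise<T>,
  g?: GlobalLike,
): Promise<T | null> {
  if (!hasWindow(g)) {
    return null; // SSR or build step: never touch the WASM module
  }
  return importer();
}
```

Inside a client:only="react" island, a useEffect callback could pass a function that performs the dynamic import of the compiled module and treat a null result as "not in a browser".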
sim.run() steps the WASM through the full 14-cycle matmul
and returns the complete CycleState[] array once. The
visualizer animates by indexing into this pre-computed array. The WASM
does not execute on every animation frame. This decouples rendering
performance from simulation throughput.
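Playback over the pre-computed array then reduces to choosing an index. A minimal sketch, assuming time-based playback; `frameIndex`, `stateAt`, and the `cyclesPerSecond` setting are hypothetical and not taken from useTinyTpu.ts:

```typescript
// Map elapsed playback time to an index into the pre-computed states.
// `cyclesPerSecond` is a hypothetical playback-speed setting.
function frameIndex(totalCycles: number, elapsedMs: number, cyclesPerSecond: number): number {
  if (totalCycles <= 0) return -1; // nothing to show
  const raw = Math.floor((elapsedMs / 1000) * cyclesPerSecond);
  return Math.min(Math.max(raw, 0), totalCycles - 1); // clamp to a valid index
}

// Generic over the state type so the sketch does not depend on the real schema.
function stateAt<T>(
  states: readonly T[],
  elapsedMs: number,
  cyclesPerSecond: number,
): T | undefined {
  const idx = frameIndex(states.length, elapsedMs, cyclesPerSecond);
  return idx < 0 ? undefined : states[idx];
}

// Stand-in for the array returned by sim.run(): 14 entries, one per cycle.
const states = Array.from({ length: 14 }, (_, cycle) => ({ cycle }));

console.log(stateAt(states, 0, 2)?.cycle);     // 0
console.log(stateAt(states, 2500, 2)?.cycle);  // 5
console.log(stateAt(states, 60000, 2)?.cycle); // 13 (clamped to the last cycle)
```

Because the lookup is a pure function of elapsed time, scrubbing, pausing, and stepping backwards need no further calls into the WASM module.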