A full Harvard-architecture processor designed under a unique set of embedded constraints: a 29-bit instruction width, a 24-bit data bus, an 8-entry register file, reverse-endian array allocations, and a 3-state multi-cycle FSM. It implements its own ISA, datapath, and memory subsystem, and executes two custom assembly programs end-to-end in iVerilog simulation.
A ground-up CPU and ISA designed against a strict, unique constraint sheet. Memory is strictly Harvard — instructions and data live in completely separate address spaces. The program loads at 0x200, the data bus is 24 bits wide, and every array allocation in the source uses the reverse-endian [0:N] form while the address math underneath still has to behave as little-endian.
The design is multi-cycle rather than pipelined: a 3-state FSM (FETCH → EXECUTE → HALT) walks each instruction through the register file, the ALU, and whichever of the two memories it needs. There is no branch prediction and no forwarding, just a clean, observable datapath with a Program Counter, a register file, and two BRAM-style memory blocks.
Add, subtract, and Set-Less-Than over the full 24-bit data bus. Exposes Zero / Carry / Negative status flags so future versions can fold them into branch conditions.
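A minimal sketch of how such an ALU might look. The module name, port names, the signed interpretation of Set-Less-Than, and the use of Carry as a borrow flag on subtract are all assumptions; the write-up only fixes the three operations, their opcodes, and the three flags.

```verilog
// Hypothetical ALU sketch. Port names, signed SLT, and carry-as-borrow
// are assumptions, not taken from the original design.
module alu (
    input  wire [0:23] a,
    input  wire [0:23] b,
    input  wire [0:3]  alu_op,    // opcode bits from the decoder
    output reg  [0:23] result,
    output wire        zero,
    output reg         carry,
    output wire        negative
);
    localparam [0:3] ALU_ADD = 4'b0000,
                     ALU_SUB = 4'b0001,
                     ALU_SLT = 4'b0101;

    always @(*) begin
        // Defaults keep the block purely combinational (no inferred latches).
        carry  = 1'b0;
        result = 24'd0;
        case (alu_op)
            ALU_ADD: {carry, result} = {1'b0, a} + {1'b0, b};
            ALU_SUB: {carry, result} = {1'b0, a} - {1'b0, b}; // carry = borrow out
            ALU_SLT: result = ($signed(a) < $signed(b)) ? 24'd1 : 24'd0;
            default: result = 24'd0;
        endcase
    end

    assign zero     = (result == 24'd0);
    assign negative = result[0];   // bit 0 is the MSB under [0:N] ordering
endmodule
```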
Eight 24-bit registers addressed by 3 bits. R0 is hard-wired to zero for cheap clears and zero-comparison. Writes commit on the positive clock edge only when write_enable is high.
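The behavior described here can be sketched roughly as follows. Only write_enable, the 3-bit addressing, the posedge write, and the hard-wired R0 come from the description; the two combinational read ports and the remaining port names are assumptions.

```verilog
// Hypothetical register file sketch. Read-port structure and most
// port names are assumptions.
module register_file (
    input  wire        clk,
    input  wire        write_enable,
    input  wire [0:2]  read_addr1,
    input  wire [0:2]  read_addr2,
    input  wire [0:2]  write_addr,
    input  wire [0:23] write_data,
    output wire [0:23] read_data1,
    output wire [0:23] read_data2
);
    reg [0:23] regs [0:7];

    // R0 always reads as zero, whatever the storage array holds.
    assign read_data1 = (read_addr1 == 3'd0) ? 24'd0 : regs[read_addr1];
    assign read_data2 = (read_addr2 == 3'd0) ? 24'd0 : regs[read_addr2];

    // Writes commit on the positive edge only; writes to R0 are dropped.
    always @(posedge clk) begin
        if (write_enable && write_addr != 3'd0)
            regs[write_addr] <= write_data;
    end
endmodule
```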
1024-deep, 29-bit-wide ROM. Pre-loaded with two assembly programs at 0x200 (memory ops) and 0x300 (looping). Combinational read addressed by the PC.
1024-deep, 24-bit-wide RAM with synchronous write and combinational read. Pre-seeded with A=20 at 0x10 and B=22 at 0x20 so the test programs have something to chew on.
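A sketch of a RAM with this shape, using the port names that appear in the data_memory instantiation later in this write-up. The 10-bit address width follows from the 1024-cell depth; zero-filling the unseeded cells is an assumption.

```verilog
// Hypothetical data memory sketch. Zero-filling unseeded cells is an
// assumption; the seed values and addresses come from the write-up.
module data_memory (
    input  wire        clk,
    input  wire        write_enable,
    input  wire [0:9]  addr,
    input  wire [0:23] write_data,
    output wire [0:23] read_data
);
    reg [0:23] mem [0:1023];
    integer i;

    initial begin
        for (i = 0; i < 1024; i = i + 1)
            mem[i] = 24'd0;
        mem[10'h010] = 24'd20;   // A
        mem[10'h020] = 24'd22;   // B
    end

    // Combinational read, synchronous write.
    assign read_data = mem[addr];

    always @(posedge clk) begin
        if (write_enable)
            mem[addr] <= write_data;
    end
endmodule
```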
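Slices the 29-bit instruction into opcode[0:3], rd[4:6], rs1[7:9], rs2[10:12], and the 16-bit immediate imm[13:28]. Under the [0:N] convention, bit 0 is the most significant bit, so the opcode sits at the top of the word. Purely combinational.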
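The field slicing can be written out directly. The module and port names below are assumptions; the bit ranges are the ones listed above.

```verilog
// Hypothetical decoder sketch. Names are assumptions, bit ranges are
// the documented ones. Index 0 is the MSB because of [0:N] ordering.
module decoder (
    input  wire [0:28] instr,
    output wire [0:3]  opcode,
    output wire [0:2]  rd,
    output wire [0:2]  rs1,
    output wire [0:2]  rs2,
    output wire [0:15] imm
);
    assign opcode = instr[0:3];
    assign rd     = instr[4:6];
    assign rs1    = instr[7:9];
    assign rs2    = instr[10:12];
    assign imm    = instr[13:28];
endmodule
```

Because a concatenation places its first element in the most significant bits, a word built as {4'b1100, 3'd1, 3'd0, 3'd0, 16'h0010} decodes here as opcode 1100 (OP_LW), rd = 1, and imm = 0x0010.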
3-state controller — FETCH, EXECUTE, HALT — driving the PC, write-back muxing, and memory enables. Async reset drops the PC at the configured entry point.
The instruction format trades a slim register-field budget for a wide immediate: 4 bits of opcode, three 3-bit register specifiers, and a generous 16-bit immediate. Addressing any cell in the 1024-deep data memory takes only the low 10 bits of that immediate, so direct addressing never needs a second instruction.
| Binary | Mnemonic | Behavior |
|---|---|---|
| 0000 | ALU_ADD | Rd ← Rs1 + Rs2 |
| 0001 | ALU_SUB | Rd ← Rs1 − Rs2 |
| 0101 | ALU_SLT | Rd ← (Rs1 < Rs2) ? 1 : 0 |
| 1001 | OP_LI | Rd ← {8'b0, imm} |
| 1010 | OP_BEQZ | if (Rs1 == 0) PC ← PC + imm |
| 1011 | OP_JUMP | PC ← PC − imm (relative back-jump) |
| 1100 | OP_LW | Rd ← DMEM[imm] |
| 1101 | OP_SW | DMEM[imm] ← Rd |
| 1111 | OP_HALT | Stop execution (FSM → HALT) |
The control unit walks every instruction through a tight three-state loop. FETCH latches the instruction off the ROM bus and arms the decoder; EXECUTE routes data through the ALU, register file, or data memory and advances the PC; HALT traps the machine in place when the program ends.
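One way the three-state loop and the PC update might be sketched is shown below. This is an illustration under several assumptions, not the original controller: the state encoding, the active-high reset polarity, the rs1_is_zero input, and the choice to take branch offsets relative to the branch's own address are all hypothetical. The instruction latch in FETCH and the write-back and memory enables are left out for brevity.

```verilog
// Hypothetical control FSM sketch. State encoding, reset polarity,
// rs1_is_zero, and the branch-offset base are assumptions.
module control_fsm (
    input  wire        clk,
    input  wire        reset,        // asynchronous, assumed active high
    input  wire [0:3]  opcode,
    input  wire [0:15] imm,
    input  wire        rs1_is_zero,  // hypothetical: high when Rs1 reads as 0
    output reg  [0:1]  state,
    output reg  [0:9]  pc
);
    localparam [0:1] FETCH = 2'd0, EXECUTE = 2'd1, HALT = 2'd2;
    localparam [0:3] OP_BEQZ = 4'b1010, OP_JUMP = 4'b1011, OP_HALT = 4'b1111;
    localparam [0:9] ENTRY_POINT = 10'h200;

    // Only the low 10 bits of the immediate fit the 1024-deep address space.
    wire [0:9] offset = imm[6:15];

    always @(posedge clk or posedge reset) begin
        if (reset) begin
            state <= FETCH;
            pc    <= ENTRY_POINT;
        end else begin
            case (state)
                FETCH: state <= EXECUTE;
                EXECUTE: begin
                    if (opcode == OP_HALT) begin
                        state <= HALT;            // PC is left where it is
                    end else begin
                        state <= FETCH;
                        if (opcode == OP_BEQZ && rs1_is_zero)
                            pc <= pc + offset;    // forward branch
                        else if (opcode == OP_JUMP)
                            pc <= pc - offset;    // relative back-jump
                        else
                            pc <= pc + 10'd1;
                    end
                end
                default: state <= HALT;           // HALT traps until reset
            endcase
        end
    end
endmodule
```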
Multi-cycle CPUs look clean on paper, but the moment write-enables, addresses, and the PC all live on the same clock edge, the datapath turns into a race. Two hazards in particular cost real time on the bench before they were pinned down.
During EXECUTE, the CPU signaled the register file to commit an ALU result. The write needed a clock edge to lock in, but on that same edge the FSM also incremented the Program Counter. Because write_addr was wired straight to the decoder, the destination field flipped to the next instruction's rd before the write committed, silently writing the result into the wrong register and poisoning every downstream loop iteration.
Added a dedicated 3-bit pipeline register — write_addr_reg — that latches the destination address inside the EXECUTE cycle. The register file now sees a stable, pre-clocked target regardless of how fast the PC advances behind it.
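A sketch of just the address register, assuming a 2-bit state encoding with EXECUTE = 1 and a wrapper module name that are not from the original design. How the write data and write_enable are aligned with the registered address is not described in the write-up, so that part is deliberately left out.

```verilog
// Hypothetical sketch of the write_addr_reg fix. The module wrapper and
// the state encoding are assumptions; only write_addr_reg and its 3-bit
// width come from the write-up.
module write_addr_stage (
    input  wire       clk,
    input  wire [0:1] state,
    input  wire [0:2] rd,              // destination field from the decoder
    output reg  [0:2] write_addr_reg   // stable copy fed to the register file
);
    localparam [0:1] EXECUTE = 2'd1;

    // Capture rd while the instruction is in EXECUTE, so later changes
    // on the decoder output cannot move the write target.
    always @(posedge clk) begin
        if (state == EXECUTE)
            write_addr_reg <= rd;
    end
endmodule
```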
The Store Word path used a registered write_enable flag. By the time it propagated, the PC had already jumped to OP_HALT, which forced the address bus to zero. The RAM dutifully wrote — but to the wrong cell — destroying the final result the program had just computed.
Re-engineered the DMEM write-enable from a registered signal into a real-time combinational wire that is high only while state == EXECUTE and opcode == OP_SW. The RAM now commits in the same cycle the EXECUTE state is asserted, before the FSM has any chance to advance the PC into the next instruction.

    // Combinational write-enable: locks DMEM on the EXECUTE edge,
    // before the PC has a chance to advance into the next instruction.
    wire dmem_we = (state == EXECUTE && opcode == OP_SW);

    data_memory dmem (
        .clk          (clk),
        .write_enable (dmem_we),
        .addr         (imm[6:15]),
        .write_data   (read_data1),
        .read_data    (dmem_out)
    );
The CPU is signed off by two custom assembly programs running on iVerilog. The first exercises every data-memory path; the second exercises every control-flow path. Both finish with a clean PASS in the testbench monitor.
Loads A=20 from 0x10, loads B=22 from 0x20, sums them through the ALU, stores the result to 0x30, halts. Validates the Store-Word race fix end-to-end — the result has to survive the HALT transition.
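For illustration, this program could be encoded into the instruction ROM along the following lines. The register choices (R1, R2, R3), the module and port names, and the HALT fill for unused cells are assumptions; the opcodes, field order, and addresses follow the ISA table and memory map above.

```verilog
// Hypothetical ROM sketch holding the memory-ops program at 0x200.
// Register numbers and the HALT fill are assumptions.
module instruction_memory (
    input  wire [0:9]  addr,
    output wire [0:28] instr
);
    localparam [0:3] ALU_ADD = 4'b0000,
                     OP_LW   = 4'b1100,
                     OP_SW   = 4'b1101,
                     OP_HALT = 4'b1111;

    reg [0:28] rom [0:1023];
    integer i;

    initial begin
        // Fill unused cells with HALT so a stray PC stops the machine.
        for (i = 0; i < 1024; i = i + 1)
            rom[i] = {OP_HALT, 3'd0, 3'd0, 3'd0, 16'h0000};

        //              opcode   rd    rs1   rs2   imm
        rom[10'h200] = {OP_LW,   3'd1, 3'd0, 3'd0, 16'h0010}; // R1 <- DMEM[0x10]
        rom[10'h201] = {OP_LW,   3'd2, 3'd0, 3'd0, 16'h0020}; // R2 <- DMEM[0x20]
        rom[10'h202] = {ALU_ADD, 3'd3, 3'd1, 3'd2, 16'h0000}; // R3 <- R1 + R2
        rom[10'h203] = {OP_SW,   3'd3, 3'd0, 3'd0, 16'h0030}; // DMEM[0x30] <- R3
        rom[10'h204] = {OP_HALT, 3'd0, 3'd0, 3'd0, 16'h0000};
    end

    // Combinational read addressed by the PC.
    assign instr = rom[addr];
endmodule
```

With A=20 and B=22 seeded in data memory, a testbench for this encoding would expect 42 at DMEM[0x30] once the FSM reaches HALT.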
A Gauss-sum loop: SLT against an upper bound of 11, BEQZ to exit, ADD to accumulate, JUMP back to the top. Validates the register pipeline fix — every loop iteration writes to the correct register, even as the PC advances on the same edge.
The most useful lesson out of this project was just how aggressively a multi-cycle CPU punishes you for assuming a signal is “synchronous enough.” Both hazards looked like they should have worked — the write_enables were asserted, the addresses were correct on paper, the simulation transcript looked clean until I read it carefully. The fix in each case was to pull the timing one layer closer to the present — either by latching the address into a dedicated pipeline register, or by promoting a registered enable into a real-time combinational wire.
The other thing that stuck: a tiny, carefully chosen ISA is genuinely fun to design. Nine opcodes, one immediate field, one data bus, and you can still write a complete for loop with control flow that runs on real silicon-style logic. The constraint sheet felt restrictive on day one and ended up being the most interesting part of the project.