24-bit Multi-Cycle Harvard CPU

System overview

§ Custom spec

// What it is

A ground-up CPU and ISA designed against a strict, unique constraint sheet. Memory is strictly Harvard: instructions and data live in completely separate address spaces. The program loads at 0x200, the data bus is 24 bits wide, and every array allocation in the source uses the reverse-endian [0:N] form (index 0 is the most significant bit) while the address math underneath still has to behave as little-endian.

The design is multi-cycle rather than pipelined: a 3-state FSM (FETCH → EXECUTE → HALT) steps each instruction through the register file, the ALU, and whichever of the two memories it needs, alternating FETCH and EXECUTE until OP_HALT parks the FSM in HALT. There's no branch prediction and no forwarding, just a clean, observable datapath with a Program Counter, a register file, and two BRAM-style memory blocks.

// MOD-01 · ALU

24-bit combinational ALU

Add, subtract, and Set-Less-Than over the full 24-bit data bus. Exposes Zero / Carry / Negative status flags so future versions can fold them into branch conditions.
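A minimal sketch of what such an ALU could look like, not the project's actual module. The module name, port names, and the 2-bit alu_op encoding are assumptions made for illustration, and SLT is assumed to be a signed comparison (the spec above does not say which).

```verilog
// Hypothetical 24-bit ALU sketch. Names and the alu_op encoding are
// assumptions; only the operations and flags come from the description.
module alu_sketch (
    input  wire [0:23] a,
    input  wire [0:23] b,
    input  wire [0:1]  alu_op,    // 0 = ADD, 1 = SUB, 2 = SLT (assumed)
    output reg  [0:23] result,
    output wire        zero,
    output reg         carry,
    output wire        negative
);
    always @(*) begin
        carry  = 1'b0;
        result = 24'd0;
        case (alu_op)
            // Extend to 25 bits so the carry-out / borrow is captured.
            2'd0: {carry, result} = {1'b0, a} + {1'b0, b};
            2'd1: {carry, result} = {1'b0, a} - {1'b0, b};  // carry = borrow
            // Signed compare is an assumption.
            2'd2: result = ($signed(a) < $signed(b)) ? 24'd1 : 24'd0;
            default: result = 24'd0;
        endcase
    end

    assign zero     = (result == 24'd0);
    assign negative = result[0];  // bit 0 is the MSB under [0:23] ordering
endmodule
```

Note that with the [0:N] declaration style the sign bit is result[0], not result[23].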

// MOD-02 · REGFILE

8-register synchronous file

Eight 24-bit registers addressed by 3 bits. R0 is hard-wired to zero for cheap clears and zero-comparison. Writes commit on the positive clock edge only when write_enable is high.
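The behavior described above can be sketched roughly as follows. The module name and the read-port names other than read_data1 are hypothetical, and combinational reads are an assumption, since only the write timing is specified.

```verilog
// Hypothetical register file sketch: 8 x 24-bit, R0 hard-wired to zero.
module regfile_sketch (
    input  wire        clk,
    input  wire        write_enable,
    input  wire [0:2]  write_addr,
    input  wire [0:23] write_data,
    input  wire [0:2]  read_addr1,
    input  wire [0:2]  read_addr2,
    output wire [0:23] read_data1,
    output wire [0:23] read_data2
);
    reg [0:23] regs [0:7];

    // Writes commit on the positive edge only while write_enable is high.
    // Writes aimed at R0 are dropped so it can never hold a non-zero value.
    always @(posedge clk) begin
        if (write_enable && write_addr != 3'd0)
            regs[write_addr] <= write_data;
    end

    // Assumed combinational reads; address 0 always reads as zero.
    assign read_data1 = (read_addr1 == 3'd0) ? 24'd0 : regs[read_addr1];
    assign read_data2 = (read_addr2 == 3'd0) ? 24'd0 : regs[read_addr2];
endmodule
```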

// MOD-03 · IMEM

Instruction ROM

1024-deep, 29-bit-wide ROM. Pre-loaded with two assembly programs at 0x200 (memory ops) and 0x300 (looping). Combinational read addressed by the PC.

// MOD-04 · DMEM

Data RAM

1024-deep, 24-bit-wide RAM with synchronous write and combinational read. Pre-seeded with A=20 at 0x10 and B=22 at 0x20 so the test programs have something to chew on.
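A rough sketch of a RAM with this behavior, using the same port names as the data_memory instantiation shown later in this write-up. The module name is hypothetical, and zero-filling the unseeded cells is an assumption.

```verilog
// Hypothetical data RAM sketch: synchronous write, combinational read.
module data_memory_sketch (
    input  wire        clk,
    input  wire        write_enable,
    input  wire [0:9]  addr,        // 10 bits address 1024 cells
    input  wire [0:23] write_data,
    output wire [0:23] read_data
);
    reg [0:23] mem [0:1023];
    integer i;

    initial begin
        // Zero fill is an assumption; only the two seeds are documented.
        for (i = 0; i < 1024; i = i + 1)
            mem[i] = 24'd0;
        mem[10'h010] = 24'd20;  // A
        mem[10'h020] = 24'd22;  // B
    end

    // Write lands on the rising edge while write_enable is high.
    always @(posedge clk) begin
        if (write_enable)
            mem[addr] <= write_data;
    end

    // Read follows the address with no clock involved.
    assign read_data = mem[addr];
endmodule
```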

// MOD-05 · DECODER

29-bit instruction decoder

Slices the instruction into opcode[0:3], rd[4:6], rs1[7:9], rs2[10:12], and the 16-bit immediate imm[13:28]. Pure combinational.
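The slicing above maps almost one-to-one onto continuous assignments. A minimal sketch, assuming a hypothetical module name and port names taken from the field names:

```verilog
// Hypothetical decoder sketch: pure wiring, no clock, no state.
module decoder_sketch (
    input  wire [0:28] instr,
    output wire [0:3]  opcode,
    output wire [0:2]  rd,
    output wire [0:2]  rs1,
    output wire [0:2]  rs2,
    output wire [0:15] imm
);
    // Under [0:28] ordering, bit 0 is the leftmost (most significant) bit,
    // so the opcode sits at the top of the instruction word.
    assign opcode = instr[0:3];
    assign rd     = instr[4:6];
    assign rs1    = instr[7:9];
    assign rs2    = instr[10:12];
    assign imm    = instr[13:28];
endmodule
```

One consequence of the [0:15] immediate: the low-order bits live at the high indices, so a 10-bit memory address is imm[6:15] rather than imm[9:0].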

// MOD-06 · CONTROL

Multi-cycle FSM

3-state controller — FETCH, EXECUTE, HALT — driving the PC, write-back muxing, and memory enables. Async reset drops the PC at the configured entry point.
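A stripped-down sketch of such a controller, covering only the state transitions and sequential PC flow. The module name, state encoding, 10-bit PC width, and the ENTRY_POINT parameter name are assumptions; branch and jump handling, write-back muxing, and memory enables are left out.

```verilog
// Hypothetical control FSM sketch: FETCH -> EXECUTE -> FETCH ... -> HALT.
module control_sketch #(
    parameter [0:9] ENTRY_POINT = 10'h200   // assumed parameter name
) (
    input  wire       clk,
    input  wire       reset,
    input  wire [0:3] opcode,
    output reg  [0:1] state,
    output reg  [0:9] pc
);
    localparam [0:1] FETCH = 2'd0, EXECUTE = 2'd1, HALT = 2'd2;
    localparam [0:3] OP_HALT = 4'b1111;

    always @(posedge clk or posedge reset) begin
        if (reset) begin
            // Async reset: restart at the configured entry point.
            state <= FETCH;
            pc    <= ENTRY_POINT;
        end else begin
            case (state)
                FETCH: state <= EXECUTE;
                EXECUTE: begin
                    if (opcode == OP_HALT) begin
                        state <= HALT;
                    end else begin
                        state <= FETCH;
                        pc    <= pc + 10'd1;  // sequential flow only
                    end
                end
                default: state <= HALT;       // HALT is terminal
            endcase
        end
    end
endmodule
```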

Instruction Set Architecture

§ Custom ISA

// 29-bit layout

The instruction format trades a slim register-field budget for a wide immediate: 4 bits of opcode, three 3-bit register specifiers, and a generous 16-bit immediate that's wide enough to address any cell in the data memory directly.

Bits	Width	Field	Role
[0:3]	4b	opcode	operation
[4:6]	3b	rd	dest
[7:9]	3b	rs1	src 1
[10:12]	3b	rs2	src 2
[13:28]	16b	immediate	constant / address / offset

// Total · 29 bits · MSB-first reverse-endian allocation

Opcode set

Binary	Mnemonic	Behavior
0000	ALU_ADD	Rd ← Rs1 + Rs2
0001	ALU_SUB	Rd ← Rs1 − Rs2
0101	ALU_SLT	Rd ← (Rs1 < Rs2) ? 1 : 0
1001	OP_LI	Rd ← {8'b0, imm}
1010	OP_BEQZ	if (Rs1 == 0) PC ← PC + imm
1011	OP_JUMP	PC ← PC − imm (relative back-jump)
1100	OP_LW	Rd ← DMEM[imm]
1101	OP_SW	DMEM[imm] ← Rd
1111	OP_HALT	Stop execution (FSM → HALT)
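The table translates directly into named constants, and the two control-flow rows into a next-PC mux. A sketch under stated assumptions: the module and port names are hypothetical, the PC is taken to be 10 bits wide, only the low 10 bits of the immediate are used as the offset, and offsets are measured from the branch's own address (the table does not say whether PC here means the current or the next instruction).

```verilog
// Hypothetical next-PC sketch built from the opcode table.
module next_pc_sketch (
    input  wire [0:9]  pc,
    input  wire [0:3]  opcode,
    input  wire [0:23] rs1_data,
    input  wire [0:15] imm,
    output reg  [0:9]  next_pc
);
    localparam [0:3] ALU_ADD = 4'b0000, ALU_SUB = 4'b0001, ALU_SLT = 4'b0101,
                     OP_LI   = 4'b1001, OP_BEQZ = 4'b1010, OP_JUMP = 4'b1011,
                     OP_LW   = 4'b1100, OP_SW   = 4'b1101, OP_HALT = 4'b1111;

    always @(*) begin
        case (opcode)
            // Forward branch when Rs1 is zero, otherwise fall through.
            OP_BEQZ: next_pc = (rs1_data == 24'd0) ? pc + imm[6:15]
                                                   : pc + 10'd1;
            // Relative back-jump.
            OP_JUMP: next_pc = pc - imm[6:15];
            // HALT freezes the PC.
            OP_HALT: next_pc = pc;
            default: next_pc = pc + 10'd1;
        endcase
    end
endmodule
```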

Engineering challenges

§ Hazards · solved

// Two race conditions

Multi-cycle CPUs look clean on paper, but the moment write-enables, addresses, and the PC all live on the same clock edge, the datapath turns into a race. Two hazards in particular cost real time on the bench before they were pinned down.

⚠ HAZARD-01 · REGISTER PIPELINE RACE

The destination register changed before the clock could latch the result.

// PROBLEM

During EXECUTE, the CPU signaled the register file to commit an ALU result. The write needed a clock edge to lock in, but on that same edge the FSM also incremented the Program Counter. Because write_addr was wired straight to the decoder, the destination field flipped to the next instruction's rd before the register file captured the write, silently steering the result into the wrong register and poisoning every downstream loop iteration.

// FIX

Added a dedicated 3-bit pipeline register — write_addr_reg — that latches the destination address inside the EXECUTE cycle. The register file now sees a stable, pre-clocked target regardless of how fast the PC advances behind it.
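One way the fix could be wired, as a sketch only. The write_addr_reg name comes from the description above; the module name, the state encoding, and the choice to capture rd on the edge that leaves FETCH (so it holds steady for the whole EXECUTE cycle) are assumptions about the exact timing.

```verilog
// Hypothetical sketch of the destination-address pipeline register.
module write_addr_stage_sketch (
    input  wire       clk,
    input  wire       reset,
    input  wire [0:1] state,
    input  wire [0:2] rd,              // decoder output, follows the PC
    output reg  [0:2] write_addr_reg   // stable target for the register file
);
    localparam [0:1] FETCH = 2'd0;     // assumed state encoding

    always @(posedge clk or posedge reset) begin
        if (reset)
            write_addr_reg <= 3'd0;
        else if (state == FETCH)
            // Captured once per instruction; later changes on rd caused by
            // the PC advancing cannot reach the register file's write port.
            write_addr_reg <= rd;
    end
endmodule
```

The register file's write_addr input would then connect to write_addr_reg instead of the raw decoder field.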

⚠ HAZARD-02 · MEMORY STORE TIMING

The data RAM was writing one cycle late — after HALT had wiped the address.

// PROBLEM

The Store Word path used a registered write_enable flag, so the enable reached the RAM one cycle after the store itself executed. By then the PC had already advanced to the OP_HALT instruction, which forced the address bus to zero. The RAM dutifully wrote, but to the wrong cell, losing the final result the program had just computed.

// FIX

Re-engineered the DMEM write-enable from a registered signal into a real-time combinational wire, dmem_we, which is high whenever state == EXECUTE and opcode == OP_SW. The RAM now commits on the clock edge that closes the EXECUTE cycle, while the address bus still carries the store's own immediate and before the FSM has any chance to advance the PC into the next instruction.

cpu.v · dmem write-enable · verilog

// Combinational write-enable: high for the whole EXECUTE cycle of an OP_SW,
// so the synchronous RAM write lands on the edge that ends that cycle,
// before the PC has a chance to advance into the next instruction.
wire dmem_we = (state == EXECUTE && opcode == OP_SW);

data_memory dmem (
    .clk          (clk),
    .write_enable (dmem_we),
    .addr         (imm[6:15]),   // 10 address bits for the 1024-deep RAM
    .write_data   (read_data1),
    .read_data    (dmem_out)
);

Verification

§ iVerilog · two programs

// Both PASSED

The CPU is signed off by two custom assembly programs executed in simulation under Icarus Verilog (iVerilog). The first exercises every data-memory path; the second exercises every control-flow path. Both finish with a clean PASS in the testbench monitor.

PROG-01 · MEMORY OPS

C = A + B

// LW · ADD · SW · HALT — exercises both memories

FINAL VALUE · Mem[0x30] = 42

Loads A=20 from 0x10, loads B=22 from 0x20, sums them through the ALU, stores the result to 0x30, halts. Validates the Store-Word race fix end-to-end — the result has to survive the HALT transition.
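For illustration, here is how a program of this shape could be encoded into the instruction ROM. This is a reconstruction from the ISA tables, not the project's actual program image: the register choices (R1, R2, R3), the module name, and filling unused cells with HALT are all assumptions.

```verilog
// Hypothetical ROM image for a C = A + B program at 0x200.
module prog01_rom_sketch (
    input  wire [0:9]  addr,
    output wire [0:28] instr
);
    localparam [0:3] ALU_ADD = 4'b0000, OP_LW = 4'b1100,
                     OP_SW   = 4'b1101, OP_HALT = 4'b1111;

    reg [0:28] rom [0:1023];
    integer i;

    initial begin
        // Assumed default: every unused cell halts the CPU.
        for (i = 0; i < 1024; i = i + 1)
            rom[i] = {OP_HALT, 3'd0, 3'd0, 3'd0, 16'h0000};

        // Field order: {opcode, rd, rs1, rs2, imm} = 4 + 3 + 3 + 3 + 16 bits
        rom[10'h200] = {OP_LW,   3'd1, 3'd0, 3'd0, 16'h0010}; // R1 <- DMEM[0x10]
        rom[10'h201] = {OP_LW,   3'd2, 3'd0, 3'd0, 16'h0020}; // R2 <- DMEM[0x20]
        rom[10'h202] = {ALU_ADD, 3'd3, 3'd1, 3'd2, 16'h0000}; // R3 <- R1 + R2
        rom[10'h203] = {OP_SW,   3'd3, 3'd0, 3'd0, 16'h0030}; // DMEM[0x30] <- R3
        rom[10'h204] = {OP_HALT, 3'd0, 3'd0, 3'd0, 16'h0000};
    end

    // Combinational read addressed by the PC.
    assign instr = rom[addr];
endmodule
```

Per the opcode table, OP_SW names its source register in the rd field, which is why R3 appears there in the store.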

→ MINIMAL PROGRAM · PASSED

PROG-02 · CONTROL FLOW

sum += i, for i = 0 → 10

// LI · SLT · BEQZ · ADD · JUMP · HALT — full loop

FINAL VALUE · R1 (Sum) = 55

A Gauss-sum loop: SLT against an upper bound of 11, BEQZ to exit, ADD to accumulate, JUMP back to the top. Validates the register pipeline fix — every loop iteration writes to the correct register, even as the PC advances on the same edge.
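A loop of this shape could be encoded as below. This is an illustrative reconstruction, not the project's program image: the register allocation beyond R1 holding the sum, the module name, and the branch offsets are assumptions. In particular, the offsets assume BEQZ and JUMP are relative to the branch instruction's own address. Because the ISA has no add-immediate, the sketch keeps the constant 1 in a register.

```verilog
// Hypothetical ROM image for the Gauss-sum loop at 0x300.
module prog02_rom_sketch (
    input  wire [0:9]  addr,
    output wire [0:28] instr
);
    localparam [0:3] ALU_ADD = 4'b0000, ALU_SLT = 4'b0101, OP_LI   = 4'b1001,
                     OP_BEQZ = 4'b1010, OP_JUMP = 4'b1011, OP_HALT = 4'b1111;

    reg [0:28] rom [0:1023];
    integer i;

    initial begin
        for (i = 0; i < 1024; i = i + 1)
            rom[i] = {OP_HALT, 3'd0, 3'd0, 3'd0, 16'h0000};

        // Field order: {opcode, rd, rs1, rs2, imm}
        rom[10'h300] = {OP_LI,   3'd1, 3'd0, 3'd0, 16'd0};  // R1 = sum = 0
        rom[10'h301] = {OP_LI,   3'd2, 3'd0, 3'd0, 16'd0};  // R2 = i = 0
        rom[10'h302] = {OP_LI,   3'd3, 3'd0, 3'd0, 16'd11}; // R3 = bound
        rom[10'h303] = {OP_LI,   3'd4, 3'd0, 3'd0, 16'd1};  // R4 = step
        rom[10'h304] = {ALU_SLT, 3'd5, 3'd2, 3'd3, 16'd0};  // R5 = (i < 11)
        rom[10'h305] = {OP_BEQZ, 3'd0, 3'd5, 3'd0, 16'd4};  // exit to 0x309
        rom[10'h306] = {ALU_ADD, 3'd1, 3'd1, 3'd2, 16'd0};  // sum += i
        rom[10'h307] = {ALU_ADD, 3'd2, 3'd2, 3'd4, 16'd0};  // i += 1
        rom[10'h308] = {OP_JUMP, 3'd0, 3'd0, 3'd0, 16'd4};  // back to 0x304
        rom[10'h309] = {OP_HALT, 3'd0, 3'd0, 3'd0, 16'd0};
    end

    assign instr = rom[addr];
endmodule
```

Under those assumptions the body runs for i = 0 through 10, and the SLT result drops to zero when i reaches 11, so BEQZ falls out to HALT with the accumulated sum in R1.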

→ LOOPING PROGRAM · PASSED

Functional pass

2 / 2

iVerilog · testbench monitor

Hazards solved

Regfile race · DMEM write race

Modules

ALU · RF · IMEM · DMEM · Decoder · Control

ISA depth

9 opcodes · ALU · branch · jump · LW · SW

Takeaways

§ Reflection

// What stuck

The most useful lesson out of this project was just how aggressively a multi-cycle CPU punishes you for assuming a signal is “synchronous enough.” Both hazards looked like they should have worked — the write_enables were asserted, the addresses were correct on paper, the simulation transcript looked clean until I read it carefully. The fix in each case was to pull the timing one layer closer to the present — either by latching the address into a dedicated pipeline register, or by promoting a registered enable into a real-time combinational wire.

The other thing that stuck: a tiny, carefully chosen ISA is genuinely fun to design. Nine opcodes, one immediate field, one data bus, and you can still write a complete for loop with control flow that runs on real silicon-style logic. The constraint sheet felt restrictive on day one and ended up being the most interesting part of the project.

A custom 24-bit multi-cycle CPU,
built ground-up in Verilog.
