KCP53010 Pipeline Register Bypass Works

A project log for Kestrel Computer Project

The Kestrel project is all about freedom of computing and the freedom of learning using a completely open hardware and software design.

Samuel A. Falvo II, 08/14/2017 at 05:51

I just finished register bypass logic for the KCP53010 core.  This allows execute and memory pipeline stages to feed their destination register contents back to the decode stage before register write-back actually completes, preventing a pipeline stall when, for example, the destination of one instruction is used as the source for the next instruction or two.

I'm sure things are still buggy; however, my tests so far seem to indicate everything is working.
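To make the idea concrete, here is a minimal behavioral sketch in Python of how a decode stage might choose between the register file and the two forwarded results. It is not the KCP53010's actual logic: the names `StageResult` and `read_operand` are hypothetical, and the sketch assumes that the execute stage (holding the younger instruction) takes priority over the memory stage when both target the same register.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageResult:
    """A result sitting in a pipeline stage, not yet written back (hypothetical model)."""
    rd: int     # destination register number
    value: int  # value that will eventually reach the register file


def read_operand(rs: int,
                 regfile: List[int],
                 execute: Optional[StageResult],
                 memory: Optional[StageResult]) -> int:
    """Return the value the decode stage should see for source register rs."""
    if rs == 0:
        return 0  # X0 always reads as zero in RISC-V
    # Assumption: the execute stage holds the more recent instruction, so it wins.
    if execute is not None and execute.rd == rs:
        return execute.value
    if memory is not None and memory.rd == rs:
        return memory.value
    # Nothing in flight targets rs; the register file is up to date.
    return regfile[rs]


# Two back-to-back ADDIs to X1: the first result (256) is in the memory stage,
# the second (512) is in the execute stage, and the register file is still stale.
regfile = [0] * 32
print(read_operand(1, regfile,
                   execute=StageResult(rd=1, value=512),
                   memory=StageResult(rd=1, value=256)))
```

In this model a following instruction that reads X1 sees 512 even though the register file still holds 0, which is the situation the bypass logic is meant to handle.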

Prior to this logic being added, you had to manually 'pad' instructions in the pipeline.  E.g., to add a constant to a register, you'd need to feed the pipeline with five instructions, like so:

ADDI X1, X0, 256
NOP  ; or, ADDI X0, X0, 0
NOP
NOP
NOP
; At this point, the value will be stored in X1.
; We can now use it.
ADDI X1, X1, 256
NOP
...etc...

So, as you can imagine, if you wanted to execute something like:

ADDI X1, X0, 256   ; X1 := 256
ADDI X1, X1, 256   ; X1 := X1 + 256 = 512
SD   X1, 1(X2)

you would consume something on the order of 15 clock cycles.  With the feedback logic, we no longer need to pad instructions out like this, since we can execute read-after-write instructions immediately:

ADDI X1, X0, 256   ; [1]
ADDI X1, X1, 256
SD   X1, 1(X2)
NOP  ; decode SD
NOP  ; effective address calculation [2]
NOP  ; store bits 63..48
NOP  ; store bits 47..32
NOP  ; store bits 31..16
NOP  ; store bits 15..0

The value for X1 in instruction [1] above actually gets written into the register file at point [2] in the instruction stream.  But, thanks to forwarding/feedback logic, we can use that value (and its subsequent replacement!) in intervening cycles.

This reduces the instruction stream's latency to just 9 clock cycles.  The bulk of the time is consumed by the lengthy store operation.
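The two cycle counts can be reproduced with some back-of-the-envelope arithmetic. The Python sketch below is only a rough model: it assumes one instruction slot per clock, four NOPs of padding after every instruction in the old scheme (as in the padded listing), and six trailing slots for the store to drain (as in the bypassed listing). The function names and constants are hypothetical.

```python
NOPS_PER_PADDED_INSTRUCTION = 4  # assumption: taken from the padded listing
STORE_TAIL_SLOTS = 6             # assumption: the six NOP slots that follow SD


def padded_slots(n_instructions: int) -> int:
    """Slots consumed when every instruction is followed by manual NOP padding."""
    return n_instructions * (1 + NOPS_PER_PADDED_INSTRUCTION)


def bypassed_slots(n_instructions: int) -> int:
    """Slots consumed when instructions issue back to back, plus the store tail."""
    return n_instructions + STORE_TAIL_SLOTS


print(padded_slots(3))    # the ADDI, ADDI, SD sequence with manual padding
print(bypassed_slots(3))  # the same sequence with register bypass
```

Under these assumptions the model gives 15 slots for the padded sequence and 9 for the bypassed one. The padded figure is only approximate, since this model charges the store the same padding as an ALU instruction.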

There are opportunities for speeding this up further, but I'm going to leave it as-is for now.  I still need to implement pipeline stall logic, so that the pipeline stalls while a memory fetch or store operation is in progress.

An obvious opportunity for performance enhancement is to perform memory writes in the background (ZipCPU does this, for example); however, this optimization may not always work for memory reads (you'd have to be careful about choosing your registers wisely to avoid blocking).  The logic to detect when to stall in this case is pretty tricky, so for now, I'd like to keep things simple.  9 cycles for a 64-bit, 3-instruction write sequence is not horrible.
