Instruction interleaving and processor options

A project log for IO881

An I/O Processor for 8-bit systems

Julian, 10/20/2018 at 02:00 (1 comment)

The processor is designed to be able to process instructions as efficiently as possible with the resources available to it.  At its core, it is limited in throughput by availability of certain resources:

Because these resources can be configured in a variety of ways, it is easy to produce a few different variants of the processor.  None of the variants I've examined has more than one execution core, because the main point of supporting multiple tasks is to increase the utilisation of that core.  The execution core is also the most complex part of the processor: I haven't mapped it out in detail yet, but I estimate it will need at least 10 ICs.

Here are some configurations that seem useful:

The simplest processor that could possibly work

A single memory block, a single register file, and just one instruction queue and decoder:



And so on.  If an instruction requires multiple cycles of execution, it just repeats the AM/MR phases.

All instructions take at least 3 cycles; instructions that reference memory or two different register locations will need 4.
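As a rough illustration of that cycle rule, here is a minimal Python sketch. The function names and the example sequence are made up for illustration, and the numbers are only the floors stated above: an instruction that needs extra execution cycles would take longer.

```python
def min_cycles(references_memory: bool, register_locations: int) -> int:
    """Minimum cycles for one instruction in the single-bank variant.

    Assumption: this is only the floor stated in the text; instructions
    needing extra execution cycles would take longer.
    """
    if references_memory or register_locations >= 2:
        return 4
    return 3


def min_program_cycles(instructions) -> int:
    """Sum the per-instruction floors for (references_memory, register_locations) pairs."""
    return sum(min_cycles(mem, regs) for mem, regs in instructions)


# Hypothetical three-instruction sequence: register-only, memory access, two registers.
example = [(False, 1), (True, 1), (False, 2)]
print(min_program_cycles(example))  # 3 + 4 + 4 = 11
```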

A big advantage of this approach is simplicity: as well as not needing any duplicated resources, we can simplify the execution unit by removing the need for separate register read/write phases -- these can be controlled by microcode.

Doubling throughput using two memory banks

Two memory banks, two instruction queues, and two instruction decoders, with everything else unchanged, allow this interleaving pattern:



This, I think, is probably the sweet spot between cost and performance, at least for 1980s technology.  The instruction queues and decoders are quite cheap (a handful of FIFO chips and some fairly cheap PALs), yet doubling these components doubles the throughput of the entire processor.

Reaching optimum throughput

Adding an extra register bank along with the second memory bank allows register accesses to overlap, as long as the channels associated with the processes are selected appropriately.  Taking useful advantage of this, however, also requires another pair of instruction queues and another register file.  Extra decoders are probably not needed: a decoder is useful for at most two cycles per byte of instruction data read, so each decoder sits unused while the instructions it has decoded execute, and that idle time can be filled by letting it alternate between channels in different blocks.

Unfortunately, the register file is likely the most expensive component of this system, so this is a much more expensive option.  It also reaches peak throughput only when at least 4 channels are in operation and their allocations to registers and memory are compatible.


In this situation, channels A and B use memory bank 0 while C and D use memory bank 1, whereas A and C use register bank 0 and B and D use register bank 1, thus avoiding any conflicts.
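To make "compatible" concrete, here is a small Python sketch of the allocation just described. It encodes my reading of the rule, which is an assumption: an allocation is treated as compatible when no two channels share both a memory bank and a register bank. The names `ALLOCATION`, `shared_resources` and `is_compatible` are hypothetical and not part of the design.

```python
from itertools import combinations

# Allocation from the text: channel -> (memory bank, register bank).
ALLOCATION = {
    "A": (0, 0),
    "B": (0, 1),
    "C": (1, 0),
    "D": (1, 1),
}


def shared_resources(alloc, x, y):
    """Return which kinds of bank two channels have in common."""
    mem_x, reg_x = alloc[x]
    mem_y, reg_y = alloc[y]
    shared = []
    if mem_x == mem_y:
        shared.append("memory")
    if reg_x == reg_y:
        shared.append("register")
    return shared


def is_compatible(alloc):
    """Assumed rule: no pair of channels shares both a memory and a register bank."""
    return all(
        len(shared_resources(alloc, x, y)) < 2
        for x, y in combinations(alloc, 2)
    )


for x, y in combinations(ALLOCATION, 2):
    print(x, y, shared_resources(ALLOCATION, x, y))
print("compatible:", is_compatible(ALLOCATION))
```

With this allocation, A and D (and likewise B and C) share nothing at all, while every other pair shares exactly one kind of bank.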

For now, I'm continuing to focus primarily on the middle option, but I'm keeping the others in mind as potentially useful, and noting where the design would have to vary to support them.


zpekic wrote 02/27/2021 at 02:25

Hi Julian! This is an interesting project: one can see many CPUs, fewer VDPs, but pretty much no I/O processors at all in the "homebrew" community. I don't know how far along you are in design / implementation, but I would love it if you could take a look at the microcode compiler I wrote, which can generate 50% or more of the (VHDL) code needed to implement a generic processor. It could be adapted to Verilog too, but the main thing is that it generates a microcode ROM and a lookup ROM that can be dropped into the design.
