The processor is designed to process instructions as efficiently as possible with the resources available to it. At its core, its throughput is limited by the availability of certain resources:
- Memory access - a memory access requires two processor cycles, one to initiate the access and one to read the result (which will be available with enough time to spare for some processing to be performed on the result, e.g. instruction decoding)
- Register file access - a 16-bit register file entry can be read or written in each cycle; this could be a pair of 8-bit registers or a single 12-bit register.
- Instruction decoders
- Instruction queues (which can be decoupled from decoders if necessary)
- Execution cores (which would usually contain a 3-stage pipeline including a register read stage, a microcoded execute stage, and a register write stage).
Because these resources can be combined in a variety of configurations, we can easily produce a few different variants of the processor. None of the variants I've examined has more than one execution core, as the main point of supporting multiple tasks is to increase the utilisation of the execution core. The execution core is also the most complex part of the processor: I haven't mapped it out in detail yet, but I estimate it will need at least 10 ICs.
Here are some configurations that seem useful:
The simplest processor that could possibly work
A single memory block, a single register file, and just one instruction queue and decoder:
- A0, A1 etc: letter represents a process, number identifies the instruction
- RR - Register read
- AM - ALU/Memory start
- MR - Memory result
- RW - Register write
And so on. If an instruction requires multiple cycles of execution, it just repeats the AM/MR phases.
All instructions take at least 3 cycles; instructions that reference memory or two different register locations will need 4.
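The cycle counts above can be captured in a small model. This is only a sketch in Python: the function name, its parameters, and the rule that each extra round of execution adds exactly one AM and one MR cycle are assumptions made for illustration, not part of the design.

```python
def instruction_cycles(references_memory: bool,
                       register_locations: int,
                       extra_execute_rounds: int = 0) -> int:
    """Estimate cycles for one instruction in the simplest configuration.

    Assumptions (not from the design itself): the parameter names, and
    that every extra round of execution repeats one AM and one MR phase.
    """
    cycles = 3  # every instruction takes at least 3 cycles
    if references_memory or register_locations >= 2:
        cycles = 4  # memory reference or two different register locations
    # Multi-cycle instructions repeat the AM/MR phases.
    return cycles + 2 * extra_execute_rounds


print(instruction_cycles(False, 1))  # register-only instruction
print(instruction_cycles(True, 1))   # instruction that references memory
print(instruction_cycles(False, 2))  # two different register locations
```

With only one instruction queue there is nothing to overlap these cycles with, so under this model throughput is simply one instruction per 3 or 4 cycles.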
A big advantage of this approach is simplicity: as well as not needing any duplicated resources, we can simplify the execution unit by removing the need for separate register read/write phases -- these can be controlled by microcode.
Doubling throughput using two memory banks
Two memory banks, two instruction queues, two instruction decoders, but otherwise the same, allows this interleaving pattern:
This, I think, is probably the sweet spot between cost and performance, at least for 1980s technology. The instruction queues and decoders are quite cheap (requiring a handful of FIFO chips and some fairly cheap PALs), yet doubling these components doubles the throughput of the entire processor.
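To see why a second memory bank can roughly double throughput when each access takes two cycles, here is a minimal Python sketch. It assumes (this is not stated in the design) that accesses are spread round-robin over the banks, that a bank is occupied for both cycles of an access, and that at most one access can be initiated per cycle. The function name `cycles_for_accesses` is hypothetical.

```python
def cycles_for_accesses(num_accesses: int, num_banks: int,
                        access_cycles: int = 2) -> int:
    """Cycles needed to finish a stream of memory accesses.

    Assumptions: round-robin bank selection, a bank is busy for the
    whole access, and at most one access is initiated per cycle.
    """
    bank_free_at = [0] * num_banks  # cycle at which each bank is next idle
    next_issue = 0                  # earliest cycle for the next initiation
    finish = 0
    for i in range(num_accesses):
        bank = i % num_banks
        start = max(bank_free_at[bank], next_issue)
        bank_free_at[bank] = start + access_cycles
        next_issue = start + 1
        finish = start + access_cycles
    return finish


one_bank = cycles_for_accesses(100, num_banks=1)
two_banks = cycles_for_accesses(100, num_banks=2)
print(one_bank, two_banks)
```

Under these assumptions a single bank completes one access every two cycles, while two banks can have one access starting in every cycle, which is where the doubling comes from.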
Reaching optimum throughput
Adding a second register bank alongside the second memory bank allows register accesses to overlap, as long as the channels associated with the processes are selected appropriately. Taking advantage of this, however, also requires another pair of instruction queues in addition to the second register file. It probably does not require more decoders: a decoder is useful for at most two cycles for each byte of instruction data read, so each decoder sits unused while the instructions it has decoded are executing, and this can be rectified by allowing it to alternate between channels in different blocks. Unfortunately, the register file is likely the most expensive component of this system, so this is a much more expensive option. It also reaches peak throughput only when at least 4 channels are in operation and their allocations to register and memory banks are compatible.
In this situation, channels A and B use memory bank 0 while C and D use memory bank 1, whereas A and C use register bank 0 and B and D use register bank 1, thus avoiding any conflicts.
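The allocation just described can be checked mechanically. The sketch below is in Python and rests on an assumed reading of "compatible": no two channels may share both a memory bank and a register bank. The names `ALLOCATION` and `conflicting_pairs` are hypothetical, introduced only for this example.

```python
from itertools import combinations

# Allocation from the text: channel -> (memory bank, register bank)
ALLOCATION = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}


def conflicting_pairs(allocation):
    """Return pairs of channels that share both banks.

    Assumption: two channels only conflict when they use the same
    memory bank AND the same register bank.
    """
    return [(a, b)
            for a, b in combinations(sorted(allocation), 2)
            if allocation[a] == allocation[b]]


print(conflicting_pairs(ALLOCATION))

# A deliberately bad allocation: A and B collide on both banks.
bad = {"A": (0, 0), "B": (0, 0), "C": (1, 0), "D": (1, 1)}
print(conflicting_pairs(bad))
```

With two memory banks and two register banks there are exactly four distinct combinations, which matches the requirement of at least 4 channels to reach peak throughput.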
For now, I'm continuing to focus primarily on the middle of these options, but I'm keeping in mind that the others might be useful too, and noting where the design would have to vary to support them.