An I/O Processor for 8-bit systems

Similar projects worth following
I/O processors are most commonly associated with large systems. For many, they're the defining feature of the mainframe system: handle IO on separate, small processors and drop the results directly into memory so that the main processor can concentrate on the real work. They have occasionally appeared on smaller systems: the Intel 8089, for example, was intended for use with the 8086 or 8088, although as it could only handle two channels and was tightly coupled to the processor (so you could only really have one of them in a system) it wasn't exactly the most useful IO processor in the world.

Here's an IO processor for smaller systems. It's designed to be usable on an 8 bit system with an Intel MCS-like bus (so an 8080, 8085, Z80, or 8088) and up to 24 address bits. It's also designed to be a bit more flexible: it can handle 8 actual devices, and also can handle processing for logical channels (e.g. for the result of merging data from two physical devices

My current plan is to separate the system into the following logical blocks:

  • A DMA controller, interfacing to the system and controlling access to the bus in order to allow the processor to load data from or store data into memory.  The precise details of this will vary according to the host processor, so I may design multiple versions of this block (although to start with I will be aiming for Z80 systems).
  • An interrupt controller.  This allows the processor to send interrupts to the host computer when tasks have finished or otherwise need attention.  Again, this will be closely coupled to the type of processor, so I may end up producing some variations on this design.
  • A register multiplexer.  This sits on the processor bus and allows the host computer to inspect or modify processor control registers.
  • A daisy-chainable processor block, which can handle 16 data channels, each of which may (optionally) be attached to one input stream and one output stream (where a stream is either a DMA block or a port on a connected device) and has a local IO bus onto which up to 8 individually addressable devices may be attached.  The processor will forward DMA/IRQ requests from processors further down the chain.  Realistically, 2 of these blocks would handle the IO requirements of most plausible small computer systems, and adding a third would make a system that could handle just about any foreseeable requirement (of a single user PC, at least).  For example the following setup would require 10 channels, and would therefore leave 6 available channels of a two-block configuration:
    • keyboard in
    • mouse in
    • 2 UARTs (each requiring one input and one output channel)
    • parallel out
    • disk in/out
    • video out

The processor itself will be design to be as simple as possible.  I'm currently considering using 4-bit opcodes for common operations, with a few escape sequences to allow for up to 61 total operations.  With either 2 or 3 registers, this should be enough to support moderately complex programs.  Microcoded to reduce complexity.  Probably a pair of 74181s for the ALU.

The most complex part will be the fact that each of the 16 channels needs to have the appearance of having its own processor.  My original design had a lot of registers -- it would have had somewhere around 12 74AS870s  (although those are very hard to get hold of at a reasonable price, and I'd need 8 times as many of the next logical contender, the 74*670, which is clearly infeasible).  I'm currently revising my design to use some fast static RAM chips for registers; these will run at double the clock rate of the processor in order to allow both a read and write cycle per processor cycle.  As I'm aiming to use authentic components that were available circa 1982 - 1983, I'm thinking of the TMS2149 (which was first available at the end of 1981 for ~$9 per chip, 1K x 4, 35ns), probably with a total of 4 chips (allowing access to 2 8-bit registers or 1 12-bit register per cycle).  Allowing for control logic and latching in addition to the 35ns access time of the RAM chips, this will likely limit maximum clock rate to ~12MHz.

An initial version of the design will be implemented for performance testing and tweaking in an FPGA.  As well as being easier to tweak and smaller than the final period-correct design, this will also run faster; I intend to use it in a system alongside a 20MHz Z80, and am planning on having it run at the same clock rate as the host processor.  But that's another project that I haven't yet uploaded any information about....

For the final design, I intend to only use components that would have been available at a reasonable cost circa 1982-1983.  But I'm also planning on minimizing board size as far as possible.  I want this system to have been a realistic component to install in a high-end home computer.  This means a lot of logic will be in PALs or GALs, probably...

Read more »

  • Instruction interleaving and processor options

    Julian10/20/2018 at 02:00 1 comment

    The processor is designed to be able to process instructions as efficiently as possible with the resources available to it.  At its core, it is limited in throughput by availability of certain resources:

    • Memory access - a memory access requires two processor cycles, one to initiate the access and one to read the result (which will be available with enough time to spare for some processing to be performed on the result, e.g. instruction decoding)
    • Register file access - a 16-bit register file entry can be read or written in each cycle; this could be a pair of 8-bit registers or a single 12-bit register.
    • Instruction decoders
    • Instruction queues (which can be decoupled from decoders if necessary)
    • Execution cores (which would usually contain a 3-stage pipeline including a register read stage, a microcoded execute stage, and a register write stage).

    Because we can have a variety of configurations of these resources, we can easily produce a few different variants of the processor.  None of the variants I've examined have more than one execution core (which is the most complex part of the processor -- I haven't mapped it out in detail yet, but I estimate it will need at least 10 ICs) as the main point of supporting multiple tasks is to increase the utilisation of the execution core.

    Here are some configurations that seem useful:

    The simplest processor that could possibly work

    A single memory block, a single register file, and just one instruction queue and decoder:



    • A0, A1 etc: letter represents a process, number identifies the instruction
    • RR - Register read
    • AM - ALU/Memory start
    • MR - Memory result
    • RW - Register write

    And so on.  If an instruction requires multiple cycles of execution, it just repeats the AM/MR phases.

    All instructions take at least 3 cycles; instructions that reference memory or two different register locations will need 4.

    A big advantage of this approach is simplicity: as well as not needing any duplicated resources, we can simplify the execution unit by removing the need for separate register read/write phases -- these can be controlled by microcode.

    Doubling throughput using two memory banks

    Two memory banks, two instruction queues, two instruction decoders, but otherwise the same, allows this interleaving pattern:



    This, I think, is probably the sweet spot between cost and power, at least for 1980s technology.  The instruction queues and decoders are quite cheap (requiring a handful of FIFO chips and some fairly cheap PALs), yet doubling these components doubles the power of the entire processor.

    Reaching optimum throughput

    Adding an extra register bank along with the memory bank allows overlapping register access, as long as the channels associated with the processes are selected appropriately.  To take advantage of this usefully, however, also requires adding another pair of instruction queues (although probably not decoders: a decoder is only useful for at most two cycles for each byte of instruction data read, which means that each decoder is unused during execution of instructions it has decoded -- this can be rectified by allowing it to alternate between channels in different blocks) and another register file.  Unfortunately, the register file is likely the most expensive component of this system, so this is a much more expensive option.  It also only reaches peak throughput when at least 4 channels are in operation, and their allocations to registers and memory are compatible.


    In this situation, channels A and B use memory bank 0 while C and D use memory bank 1, whereas A and C use register bank 0 and B and D use register bank 1, thus avoiding any conflicts.

    As of right now, I'm continuing to primarily focus on the middle of these options, but I'm keeping in mind that the others might be useful too, so noting where the design would have to vary to support them.

  • Decoded instruction format and special instructions

    Julian10/07/2018 at 21:48 0 comments

    Instructions are, as previously discussed, stored in a packed format.  They are unpacked by the instruction fetcher unit and stored in a FIFO for the execution unit to use.  A fetch operation may produce up to 2 instructions per fetch cycle (a fetch cycle takes two master clock cycles, as the memory they are fetching from is slower than the master clock; there are 2 instruction fetch units that operate on different instruction memory units and fill separate FIFOs, thus allowing for a total of 2 instructions per master clock cycle to be fetched), but it may also take multiple fetch cycles to produce a single instruction.  Hopefully, this will even out to 1 instruction per cycle over time.

    To keep things simple, the instruction fetcher is not directly connected to the register file (otherwise, we'd need some kind of arbitration over which unit gets to use the register file in which circumstances, which could easily get messy).  We therefore need to arrange for methods to get the program counter into and out of the register memory indirectly via the execution unit (which is the only direct connection).  The program counter realistically only needs to be stored in the register memory when the thread is not actually executing; while it is executing it can be cached in the decoder unit.  The decoder needs to be able to inform the execution unit of the address of any instruction where it may stop, and the execution unit must be able to pass a new address when necessary.  We therefore arrange for a bus between the two units to allow for this.

    We add two instructions that are not visible in the ISA but which are interpreted by the execution unit; the instruction decoder can use these to cause program counter transfers:

    • RESTOREPC causes the execution unit to load the current PC from register memory and execute a jump to it; note that this is the same behaviour as a typical JMP instruction, except that a different operand mode is required
    • SUSPEND causes the execution unit to capture the current PC from the fetch unit and store it in register memory; it also signals to the instruction fetcher that it is ready to start handling instructions for a new thread

    There are also no provisions for removing instructions from the queue if they turn out to be unnecessary.  We therefore arrange for the fetcher to stop fetching if it finds an instruction that could cause a jump or suspension of a thread.  It only resumes after the execution unit tells it what happened.

    The PULL instruction requires special handling:

    • The first time it is executed in any specific task invocation, it will immediately return the operand that was placed into the task FIFO.
    • On subsequent executions, it will either return an additional operand (if the next entry in the task FIFO is also for the same task) or suspend the task.

    The instruction fetcher therefore supplies a flag to the decoder unit that allows it to substitute a SUSPEND instruction for a PULL instruction in the latter case.  YIELD and PUT instructions may also suspend the thread, but the decision to do so is deferred until execution, so are not replaced with SUSPEND instructions.  YIELD, on the other hand, is a special case of PUT, so doesn't need a specific instruction.

    This means that the instructions that are required to be supported by the execution unit are as follows:

    HexMnemonicBrief explanation
    00SUSPENDStore current PC in register memory
    01PULL Retrieve task operand and store in destination register
    02JMPPass new PC to instruction fetcher
    03PUTSend a value to a given destination
    10XCHGFetch 16 bits of register memory, byte swap, and save back to original source
    11MOV88 bit move from immediate operand (includes SCSB instruction)
    12MOV128-to-12-bit move
    13ALUALU operation
    14LDBLoad byte from memory (includes XLAT)
    15LDBILoad byte from memory and postincrement address
    16STBStore byte in memory
    17STBIStore byte in memory and postincrement address...
    Read more »

  • Revising the register file, FPGA implementation

    Julian10/01/2018 at 20:43 0 comments

    The complexity of the register file is becoming an issue, so I'm rethinking.

    My target date for implementation is 1982. In 1982, while they were expensive, small 35ns static RAM chips in reasonable sizes (1Kx4) were available, e.g. the TMS2149.  This puts a limit on my performance -- I'll need to be able to access the register file at least twice per processor cycle, in order to perform both a read and a write -- but using these for registers will significantly decrease complexity.  My latest designs would have used something like 20+ individual register file chips to implement the whole register file, which would have used a substantial amount of board space.  That's not something that could really have been tolerated in the home computer market that I want this system to be a plausible candidate for, so it had to go.

    A 35ns access time, plus control signal generation and latching time, twice per processor cycle, means realistically my clock period can't even be as short as 80ns (12.5Mhz).  I may be able to achieve 12Mhz (83.3ns), which would be nice, as I wanted to be able to use this processor alongside a 6MHz Z80, and having them at integral multiples of clock speed would simplify the design.  If necessary, I'll put my RAM chips under a cooler, or maybe even overvolt them slightly.  These things tend to help quite a bit, according to the datasheets I've looked at for other devices.  And most RAM devices of the period could tolerate 5.5V supply happily. Unfortunately, I've been unable to find a datasheet for the TMS2149, so can't confirm anything for this device. 

    (UPDATE: having checked the memory requirements, I think the older 256x4 "93422A" type chips would be ideal -- unfortunately, I haven't been able to find any price information about those. They do however have 30ns typical access time; they also have non-multiplexed input and output pins, which may allow me to remove a set of tristate buffer chips from the data path).

    With a ~12Mhz limit, I'm going to have to profile very carefully.  Every cycle is going to count in order to use this for display generation, as I was hoping to.  I was planning on performing various operations (e.g. palette lookup) in code in the IO processor rather than implementing hardware for them, but that's looking less likely now.

    I've decided to go ahead and implement an FPGA version.  I need to have something I can tinker with so I can get some cycle-exact timings of tasks I have in mind.  My original idea was to implement a simulator, but I've been working on a design for a simple yet fast Z80 system, inspired by projects like Z80-MBC2 -- like that project I originally started with a microcontroller to provide IO interfaces, but I also wanted to implement cycle-stealing DMA and a microcontroller just isn't up to that job without significant support circuitry, so I'm ditching it and using an FPGA instead ... and if I'm using an FPGA anyway, why not put IO881 on there?

    I've got the start of some verilog code up on github.  It still has the old register file design as of right now, so I need to go back and update that to use RAM instead before continuing.  Next job after that is the instruction fetch system, which is where things start to get interesting.

  • CPLDs & registers

    Julian09/05/2018 at 07:58 0 comments

    Are there any CPLD architectures that can store more than 1 bit per logic element?  The size of my register file is getting quite large, and I'd really rather not have to use an FPGA for it, but none of the CPLDs I've looked at come anywhere close to having enough registers.

    That said, FPGA prices are much lower than I remember them being last time I used one.  I can get Spartan 6 XC6 (or, I suspect, cheap Chinese clones thereof) for only about double the cost of a cheap CPLD.  Implementing the entire processor in one would definitely feel like cheating, but I suspect I'd at least be able to come close.  It has 64KiB of Block RAM, which means the largest sensible configuration of RAM for the processor would use all of it. It also has around 5,000 LUTs... I'd need to use some of those to implement the register set. Something like 1,500 of them would disappear into that, leaving 3,500 of them to implement whatever I wanted.  That sure sounds like it should be enough for the entire processor.

    But that violates the spirit of a vintage project.  I'm only using modern parts to replace older parts that aren't available.

    Maybe just a prototype?  I've been planning a verilog implementation simply to check that my timing ideas are plausible, which makes it so tempting just to put it onto an FPGA on a board and hook it up to a host system to see what it can be made to do. 

  • Channel processes

    Julian09/03/2018 at 14:30 0 comments

    Each channel can have a total of up to 3 processes associated with it.  (There's a limit of 32 active processes imposed due to the register file organization, so not every channel can have that many, but that should be fine -- most applications I've looked at only need 1 or 2). Each of these processes has a different role, and this is reflected in slightly different behaviour of the PULL and YIELD instructions.

    In all cases the processes are started by a byte of data being added to the processor task FIFO and reaching the front of the queue. This byte is them returned from a PULL instruction. If a PULL is executed while the process is not the target of the front queue entry the process is suspended and the relevant process is started. The byte is stored in the target register operand of the instruction (which may be either A or B). 

    All processes in a channel share the same 4KiB segment of memory.  Each process has separate registers, except for the P2 and P3 registers which are shared between all processes in a channel and can be used as pointers to shared data structures or counters (e.g. to implement a circular buffer or a stack).

    The input handling process receives input from an external source (eg being passed from an associated device, sent from another channel, the result of DMA requests performed by any of the channel processes, and so on). There are several ways it can pass on data: it can store the data in the channel's memory, send it directly to a recipient port, or use the YIELD instruction to send it on to either the output process (if one exists) or to the channel output (otherwise).

    The output process is invoked at channel startup, every time the channel output value is received by the destination, and every time the input process is suspended while there is no output value available at present. It can also be invoked explicitly by the input process.  If invoked explicitly, PULL stores the value passed to YIELD; otherwise the value stored is not currently defined (but will likely either be 00 or FF).

    The request process is invoked by external control (e.g. a service request line from a device or a specific output port from the host computer to request it) or explicitly using the PUT instruction (which will be described in detail in a later log).  Its purpose is to perform operations that do not relate to handling incoming or outgoing data, e.g. setting up DMA load requests to generate data, configuring device settings, etc.  It may call YIELD, in which case the bytes passed to it are sent to the input process (if one exists), or the output process (if there is no input process but there is an output process), or are stored directly in the channel output (otherwise).

    If a process sends data to the channel output when there is already a value stored there, it overwrites the existing value.  A process may reasonably expect that there isn't a value present if it is connected to a hardware device and that device is always able to accept incoming data quickly; otherwise it would usually be best to store data locally and use the automatically invoked output process to refresh the output value when necessary.

  • Instruction set ideas

    Julian08/31/2018 at 07:49 0 comments

    Have sketched out a few ideas of useful programs to run, and used them to decide on a basic instruction set architecture.  I'm still experimenting and optimising, but my basic plan at the moment is:

    Register set

    Current plan is for each independent thread to have the following set of registers:

    • A and B - 8-bit mostly-general-purpose registers (a handful of instructions use these for specific purposes, but most operations can work on either)
    • P0 - P3 - 4 12-bit registers used for memory address pointers and counters.  Current planned uses only require 3 such registers, so the eventual implementation my only supply 3, but the ISA allows for a 2-bit field to select them so 4 are included in the design of the instruction set.
    • Base address registers for memory addresses and direct memory access.  These are not ISA-visible, but are set in channel configuration.
    • PC - program counter
    • CSB - Channel Status Byte - bitfield of:
      • 0x01 - channel input program enabled
      • 0x02 - channel processing program enabled
      • 0x04 - channel output program enabled
      • 0x08 - signal interrupt on input fetch while no enqueued value
      • 0x10 - signal interrupt immediately (bit gets cleared automatically when interrupt is acknowledged)
      • 0x20 - prefetch hint (may cause DMA load requests to be extended beyond requested addresses)

    Instruction packing

    Instructions are loaded from an 8-bit wide SRAM.  There are three instruction formats:

    • 4-bit with implicit operands
    • 8-bit with internal operand fields
    • 8-bit with an additional operand byte

    Two 4-bit instructions may be packed into a single byte.  An 8-bit instruction may either be aligned to the start of a byte, or it can be packed into the spare location after a 4-bit instruction in which case all bits of its second nibble are assumed to be zero.  Jump destinations must be byte aligned.  Packing is big-endian (i.e. the first instruction executed is in the most significant nibble).

    This means that the valid instructions for the second slot in a byte are the usual 4 bit instructions, NOP (1000), MOV P0L, A (1001), EXT R,1 (1010), JMP PC-1 (1011), ADD P0, A (1100), DLDB [P0] (1101), PUT #n, A (1110) and ADD rr, i6.

    Some instructions (with mnemonics IFxxx) are effectively prefixes that control conditional execution of the following instruction.  An advanced implementation could fuse these to operate in a single cycle, but I'm not going to do that for now (I may support it in a CPLD/FPGA version later, but it would be too complex for the low integration logic I'm planning to use here).

    Allowing 4-bit instructions makes the decoder design slightly harder, but the hope is that by packing more instructions per byte it should be possible to approach 1 instruction per clock cycle (as decode and execution will have to contend for access to the SRAM in many cases, and 2-byte instructions will of course always need 2 cycles to fetch).  A small queue (probably 4 instructions) will be used to prefetch instructions to even out delays. 

    Instruction format

    Bitfield identifiers:

    • rr,qq - selects one of the P0-P3 registers
    • s - 0 = A, 1 = B
    • b - for 8-12bit move operations, selects either the low or high 8 bits of the 12 bit register
    • i or j - bit is used as part of an immediate operand
    • d - shift direction, 0 = R, 1 = L
    • oo - shortened ALU operation code: 00 = ADD, 01 = SUB, 10 = AND, 11 = OR
    • pppp - 74181 opcode (see mnemonics below)
    • ww - an operation width: B=00, W=01, D=10, Q=11
    • n - 1 to negate (adds N to mnemonic)
    • c - channel condition (0 = CI - input thread running and waiting for data; 1 = CR - output ready)

    Mnemonic notes:

    • [x] is an indirect reference to the contents of memory pointed to...
    Read more »

  • Inspiration from the design of the CDC6600 Peripheral Processor

    Julian07/29/2018 at 10:49 2 comments

    The CDC6600 Peripheral Processor is the original example of a processor specialised for running IO processes, and it has a few interesting ideas that may be relevant.

    The CDC6600 IO system was described at the time as consisting of 10 processors, although today we'd probably describe it in different terms: a single processor core supporting 10 simultaneous threads.  Each thread is allocated time to send instructions to the ALU on a round-robin basis, getting a 100ns time slot every 1us.  By the time it gets another slot, the operation it started is guaranteed to have finished and be written back into its registers, so there's no possibility of what we'd call pipeline hazards today (I don't think the ALU was actually pipelined, but from the documentation I've read it seems as though it operated asynchronously, so similar considerations would have been required if threads were permitted to use it more frequently.

    This scheme provides maximum possible throughput when at least 10 channels are in operation (the CDC6600 supported 12 channels, each of which could be driven by any of the processor threads), but isn't particularly well adapted for my design: I want to be able to efficiently process data when only a handful of channels are in use.

    But it does provide an interesting way of thinking.  The bottleneck in the design was latency between issuing instructions to the ALU and their results being written back to registers.  That isn't the same bottleneck I have: my main bottleneck is memory bandwidth.  I have a 70ns RAM (because that was the fastest that was affordable at the time I want this design to be implementable) which, in order to simplify operations, I need to be able to access within a single cycle.  The RAM is used, among other things, to supply program instructions and operate as temporary storage for buffered data in the channels.  Every other component of this system is faster: the PALs I plan to use for instruction decoding and sequencing can operate in 25ns; the register files have 11ns access time for a read / 15ns setup for a write; I'll likely use a pair of 74181s as an ALU, which will finish operations within 40ns.   I've been planning on working with an 80ns cycle time (which should *just* squeeze a memory access and a register write into a single cycle, assuming I keep everything cool), but what happens if I have two RAMs, say one for odd-numbered channels and one for even numbers, and then always alternate cycles between odd and even.  Could I then run with 40ns cycle time? Maybe not, but 50ns might be achievable.  That would be a small reduction in throughput when working with single channels, but a massive improvement for two.

    Another source of inspiration is the instruction set.  So far, the instructions I've been thinking of have been reasonably generic, but the CDC6600 Peripheral Processor has some useful application-specific instructions, for instance branches based on the current state of a channel, instructions that perform DMA either from main memory or local memory directly to a device channel, and instructions that invoke a defined function for a specific channel (a function in this case is a command code sent across the link to the peripheral in a virtual stream parallel to the data transfer; it's assumed that the other side of the channel is a hardware device, but this gets more interesting with my design where a channel can be purely virtual, and this operation can be used to invoke a program in another context of the IOP).  There are also operations to execute programs on the main processor, but that's not something I'll be doing -- the CDC6600 is designed such that the peripheral processors are in control and the main processor runs programs at their request, which is probably good for a high performance scientific computer as it was intended, but not ideal for an 8-bit general purpose computer.  ...

    Read more »

  • Managing complexity

    Julian07/26/2018 at 01:37 0 comments

    With a system that handles large volumes of data and tries to keep that data into distinct channels, it becomes very easy for the complexity to get out of hand.  I hard started my design with the idea that every channel has certain kinds of data and each of those needs a separate register ... very quickly I ended up with a monstrosity that needed so many registers the board would have been huge.

    Taking a step back, I've decided to simplify things a little, by making the design a bit more general, and using that generality to implement as much as possible.  So, the current thinking is:

    • Each processor node has 16 channels.  The channel corresponds to a set of registers that are used to store permanent data needed by the channel.
    • The registers are general purpose, and can be used to store different data for different kinds of channel: it could be a DMA address, or pointers to start and end of a buffer in scratch memory, or simply scratch registers used for a data generation process (e.g. to produce a stream of psuedo-random numbers).  The processor doesn't need to know.  The number of registers is restricted in order to minimize size.  This means that a channel won't be able to both perform DMA and store the results in a buffer -- but it can perform DMA and pass results to another channel, so you can achieve this result if you commit two channels to it.  I think this is a reasonable compromise.
    • Each channel has multiple service routines associated that can be used in different circumstances: a source routine (that provides the data in the channel), a storage routine (that can store the data into memory) and a sink routine (that stores the channel data in its destination location).
    • Some macro-level routines are encoded as single instructions, e.g. DMA fetch & increment address, store in buffer, read buffer to output, DMA store & increment address, etc.  This lets them be microcoded to execute as quickly as possible, and hopefully in a single cycle.
    • The processor will have a FIFO into which requests to activate service routines are placed, along with data for them (e.g. when a channel receives data from an external source, this is pushed into the FIFO).
    • Whenever no service routine is executing, an entry is pulled out of the FIFO and used to determine what to do next.

    My aim is to be able to pull a byte from DMA, extra 2 4-bit fields from it, and push the two results to output ports, all in 4 cycles.  That'll require some efficient implementation, but I hope it will be possible.

View all 8 project logs

Enjoy this project?



Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates