With a system that handles large volumes of data and tries to keep that data in distinct channels, it becomes very easy for the complexity to get out of hand. I had started my design with the idea that every channel has certain kinds of data and each of those needs a separate register ... very quickly I ended up with a monstrosity that needed so many registers the board would have been huge.
Taking a step back, I've decided to simplify things a little, by making the design a bit more general, and using that generality to implement as much as possible. So, the current thinking is:
- Each processor node has 16 channels. Each channel corresponds to a set of registers that hold the persistent data the channel needs.
- The registers are general purpose, and can be used to store different data for different kinds of channel: it could be a DMA address, or pointers to the start and end of a buffer in scratch memory, or simply scratch registers used for a data generation process (e.g. to produce a stream of pseudo-random numbers). The processor doesn't need to know. The number of registers per channel is restricted in order to minimize size. This means that a single channel won't be able to both perform DMA and store the results in a buffer -- but it can perform DMA and pass the results to another channel, so you can achieve the same result if you commit two channels to it. I think this is a reasonable compromise.
- Each channel has multiple service routines associated with it, for use in different circumstances: a source routine (that provides the data in the channel), a storage routine (that can store the data into memory) and a sink routine (that stores the channel data in its destination location).
- Some macro-level routines are encoded as single instructions, e.g. DMA fetch & increment address, store in buffer, read buffer to output, DMA store & increment address, etc. This lets them be microcoded to execute as quickly as possible, and hopefully in a single cycle.
- The processor will have a FIFO into which requests to activate service routines are placed, along with data for them (e.g. when a channel receives data from an external source, this is pushed into the FIFO).
- Whenever no service routine is executing, an entry is pulled out of the FIFO and used to determine what to do next.
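To make the scheme above a bit more concrete, here is a rough software model of it in Python. It is only a sketch under assumptions: the register count per channel (4), the `Node` and `Channel` classes, the routine signatures and the `memory`/`buffer` stand-ins are all hypothetical names chosen for illustration, not part of the actual hardware design. It shows the two-channel arrangement described above, where one channel performs DMA and passes each result to a second channel that stores it in a buffer.

```python
from collections import deque

NUM_CHANNELS = 16       # from the design: 16 channels per node
REGS_PER_CHANNEL = 4    # assumption: the real register count is not decided here


class Channel:
    def __init__(self):
        # General-purpose registers; their meaning depends on the kind of channel.
        self.regs = [0] * REGS_PER_CHANNEL
        # Service routines keyed by kind: "source", "storage" or "sink".
        self.routines = {}


class Node:
    def __init__(self):
        self.channels = [Channel() for _ in range(NUM_CHANNELS)]
        # Pending requests: (channel number, routine kind, data).
        self.fifo = deque()

    def request(self, chan, kind, data=None):
        """Queue a request to activate a service routine."""
        self.fifo.append((chan, kind, data))

    def step(self):
        """Pull one entry from the FIFO and run it; return False if idle."""
        if not self.fifo:
            return False
        chan, kind, data = self.fifo.popleft()
        channel = self.channels[chan]
        channel.routines[kind](self, channel, data)
        return True

    def run(self):
        while self.step():
            pass


# Hypothetical stand-ins for external memory and a scratch buffer.
memory = bytearray(b"\x12\x34\x56")
buffer = []


def dma_source(node, ch, _data):
    # "DMA fetch & increment address": regs[0] holds the DMA address.
    byte = memory[ch.regs[0]]
    ch.regs[0] += 1
    node.request(1, "storage", byte)  # pass the result on to channel 1


def buffer_store(node, ch, data):
    # "Store in buffer": regs[0] counts the bytes written so far.
    buffer.append(data)
    ch.regs[0] += 1


node = Node()
node.channels[0].routines["source"] = dma_source
node.channels[1].routines["storage"] = buffer_store
for _ in range(3):
    node.request(0, "source")
node.run()
```

In this model each routine runs to completion before the next FIFO entry is pulled, which mirrors the "whenever no service routine is executing" rule. In hardware the macro-level routines would be single microcoded instructions rather than function calls.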
My aim is to be able to pull a byte from DMA, extract two 4-bit fields from it, and push the two results to output ports, all in 4 cycles. That will require an efficient implementation, but I hope it will be possible.
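For reference, the data manipulation in that target case is small. A minimal sketch in Python follows. The function name `split_nibbles` and the choice to return the high field first are assumptions for illustration, and the sketch says nothing about how the four cycles would be spent.

```python
def split_nibbles(byte):
    """Split an 8-bit value into its (high, low) 4-bit fields."""
    high = (byte >> 4) & 0xF
    low = byte & 0xF
    return high, low


# Example: a byte fetched by DMA is split into two output values.
high, low = split_nibbles(0xA7)
```

Here `high` is `0xA` and `low` is `0x7`; each would then be pushed to its own output port.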