The CDC6600 Peripheral Processor is the original example of a processor specialised for running IO processes, and it has a few interesting ideas that may be relevant.
The CDC6600 IO system was described at the time as consisting of 10 processors, although today we'd probably describe it in different terms: a single processor core supporting 10 simultaneous threads. Each thread is allocated time to send instructions to the ALU on a round-robin basis, getting a 100ns time slot every 1us. By the time it gets another slot, the operation it started is guaranteed to have finished and been written back into its registers, so there's no possibility of what we'd call pipeline hazards today. (I don't think the ALU was actually pipelined, but from the documentation I've read it seems to have operated asynchronously, so similar considerations would have applied if threads had been permitted to use it more frequently.)
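To make the round-robin slot allocation concrete, here is a minimal sketch in Python. The 100ns slot and the 10 threads come from the description above; the function names are my own, and this models only the order in which slots are handed out, not the real hardware.

```python
SLOT_NS = 100    # length of one time slot, from the description above
N_THREADS = 10   # number of peripheral processor threads, from the description above

def barrel_schedule(n_slots, n_threads=N_THREADS, slot_ns=SLOT_NS):
    """Return a (start_time_ns, thread_id) pair for each slot, in round-robin order."""
    return [(i * slot_ns, i % n_threads) for i in range(n_slots)]

def issue_gap_ns(schedule, thread_id):
    """Smallest interval between consecutive slots given to one thread."""
    times = [t for t, tid in schedule if tid == thread_id]
    return min(b - a for a, b in zip(times, times[1:]))

schedule = barrel_schedule(30)
# Every thread sees the same gap between its own slots.
print(issue_gap_ns(schedule, 3))
```

Any operation that completes within that gap has finished before the same thread can issue again, which is why no hazard checking between a thread's instructions is needed.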
This scheme provides maximum possible throughput when at least 10 channels are in operation (the CDC6600 supported 12 channels, each of which could be driven by any of the processor threads), but isn't particularly well adapted for my design: I want to be able to efficiently process data when only a handful of channels are in use.
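The cost of that fixed rotation with few active channels can be put in numbers. This is a rough sketch that assumes a slot belonging to an idle thread is simply wasted, and that each busy thread is driving one channel; both assumptions are mine, and the function name is illustrative.

```python
N_THREADS = 10   # thread count from the description above

def barrel_utilisation(busy_threads, n_threads=N_THREADS):
    """Fraction of time slots doing useful work under a fixed rotation.

    Assumes the slot of an idle thread cannot be given to another thread.
    """
    if not 0 <= busy_threads <= n_threads:
        raise ValueError("busy_threads must be between 0 and n_threads")
    return busy_threads / n_threads

for busy in (1, 2, 5, 10):
    print(busy, barrel_utilisation(busy))
```

With only one or two busy threads, most slots go unused, which is the case I care about most.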
But it does provide an interesting way of thinking. The bottleneck in the design was latency between issuing instructions to the ALU and their results being written back to registers. That isn't the same bottleneck I have: my main bottleneck is memory bandwidth. I have a 70ns RAM (because that was the fastest that was affordable in the period when I want this design to have been implementable) which, in order to simplify operations, I need to be able to access within a single cycle. The RAM is used, among other things, to supply program instructions and to operate as temporary storage for buffered data in the channels. Every other component of this system is faster: the PALs I plan to use for instruction decoding and sequencing can operate in 25ns; the register files have 11ns access time for a read and 15ns setup for a write; I'll likely use a pair of 74181s as an ALU, which will finish operations within 40ns. I've been planning on working with an 80ns cycle time (which should *just* squeeze a memory access and a register write into a single cycle, assuming I keep everything cool). But what happens if I have two RAMs, say one for odd-numbered channels and one for even-numbered channels, and then always alternate cycles between odd and even? Could I then run with a 40ns cycle time? Maybe not, but 50ns might be achievable. That would be a small reduction in throughput when working with a single channel, but a massive improvement for two.
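The trade-off can be checked with some quick arithmetic. The sketch below compares the single-RAM design at an 80ns cycle with the two-bank idea at 50ns and 40ns. It assumes one operation per cycle and that a cycle belonging to a bank with no active channel is wasted; the function names and the channel numbering are made up for illustration.

```python
def single_ram_ops_per_us(cycle_ns):
    """One RAM shared by every channel: one operation per cycle in total."""
    return 1000 / cycle_ns

def interleaved_ops_per_us(cycle_ns, active_channels):
    """Two RAM banks (even and odd channels) served on alternating cycles.

    Assumption: a bank with no active channel wastes its cycles, so total
    throughput depends on how many of the two banks have work to do.
    """
    banks_in_use = len({ch % 2 for ch in active_channels})
    return banks_in_use * 1000 / (2 * cycle_ns)

print("single RAM, 80ns:      ", single_ram_ops_per_us(80))
for cycle_ns in (50, 40):
    print(f"two banks, {cycle_ns}ns, ch 0:  ", interleaved_ops_per_us(cycle_ns, [0]))
    print(f"two banks, {cycle_ns}ns, ch 0+1:", interleaved_ops_per_us(cycle_ns, [0, 1]))
    print(f"two banks, {cycle_ns}ns, ch 0+2:", interleaved_ops_per_us(cycle_ns, [0, 2]))
```

Under these assumptions, a lone channel at a 50ns cycle drops from 12.5 to 10 operations per microsecond, while one even and one odd channel together reach 20. Two channels of the same parity gain nothing over a single channel, so how channels are assigned to banks would matter.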
Another source of inspiration is the instruction set. So far, the instructions I've been thinking of have been reasonably generic, but the CDC6600 Peripheral Processor has some useful application-specific instructions: branches based on the current state of a channel, instructions that perform DMA from either main memory or local memory directly to a device channel, and instructions that invoke a defined function for a specific channel. A function in this case is a command code sent across the link to the peripheral in a virtual stream parallel to the data transfer. The CDC6600 assumes that the other side of the channel is a hardware device, but this gets more interesting in my design, where a channel can be purely virtual and this operation can be used to invoke a program in another context of the IOP. There are also operations to execute programs on the main processor, but that's not something I'll be doing -- the CDC6600 is designed such that the peripheral processors are in control and the main processor runs programs at their request, which is probably good for the high performance scientific computer it was intended to be, but not ideal for an 8-bit general purpose computer.