(last updated 20180116)
The last opcodes that are not yet well defined are IN and OUT.
This step is necessary now because I'm building all the sub-modules, and this one must also be done before I put everything together.
The annoying bit is that I have only 2 opcodes and 12 bits with already predefined fields. I made a few compromises but I hope that they are flexible enough.
One interesting aspect is that the values and address don't use the same paths, because they go to different busses and are decoded differently. There is no real "bus" with a common address path and bidirectional data lane. This brings some flexibility.
Another concern is the latency: it's possible to decode maybe 16 registers but more will increase the CDP length. However, limiting the address range to 16 I/O ports would be a big mistake... I have found a way to get 256 addresses, in direct and indexed mode. Well, only 128 write registers in direct mode, though, but that must be enough, right?
For IN:
- The SRC operand (REG/Imm3/Imm8) gives the I/O register address, for a range of 256 read-only registers, in immediate or register mode. However the latency might not allow more than 8 registers in A3P. In Imm8 mode, there might be enough slack for 16 or 32 registers. Imm8 is congruent to Imm3 so it's out of range of Imm3.
- The DST field is used normally, with the number of the core register to write.
- The instruction is conditional in Reg/Imm3 mode.
For OUT:
- The SRC field (REG/Imm3/Imm8) gives the value to write to the I/O register. Imm3 mode gives the -4 to +3 range (good for small fields, clearing or filling), and Reg/Imm8 mode gives the full 0-255 range.
- The DST field gives the address of the I/O register to write to. On its own, this 3-bit field is limited to the 0-7 range. It is extended to 0-127 (3+4 = 7 bits) by confiscating the 4 condition bits in the Reg/Imm3 mode.
- Yes, OUT breaks the cherished orthogonality dogma and it's a bit of a kludge but the loss of conditionality should not be severe, compared to the other aspects. And 128 output registers should be enough for an 8-bitter.
The first implementation would probably implement only 8 registers, which is enough for a dumb application and even overkill for a LED blinker. However, add timers and other peripherals and that number will be really too small very fast.
4 I/O registers are reserved :
- Register n°0 is an index register, to extend addressing to 255 registers, read AND write.
- Register n°1 is a data register, for read and write.
- Register n°2 is an "offset" register used when using the 0-7 range
- Register n°3 is reserved and unallocated. Maybe a data register that auto-increments Reg0 upon every access ?
Not only does this extend the range for writes but it also allows scratchpad areas and I/O configuration with a code loop, instead of wasting precious code space.
Register IO2 is a more practical trick for selecting a group of 4 registers at once. It's useful for small/medium peripherals: a setup sequence uses just one instruction to select the group, and at most 4 more instructions to change the settings. It might be a way to extend to a ridiculous 1024 registers but only the required bits are implemented so don't play with the MSB.
Well, this is getting too complex...
The above system creates too many problems so I have to modify it.
First, the same address bus should be used for IN and OUT: this reduces the complexity, fanout, slowness, etc. It also means that the IN and OUT instructions have the same format.
There is also the need to get the address as early as possible during the instruction cycle. Going through the register set is not practical so only immediate addresses are allowed. Imm8 becomes the only source of port addresses (great for keeping the gate count low). This leaves one unused bit: the R/I8 flag is now obsolete, so it could be used for address expansion, or indirect addressing?
So we have a full 256-byte range for input and output, and the DST field names the core register that receives the data (IN) or supplies it (OUT).
I can't do much for the write bus, which has a high fanout. The read bus however has an even larger fan-in and can't use MUX2 because the tree would grow too slowly. A3P can have OR3s that make trees with faster expansion, though there is the other cost of the decoder's latency.
I'm thinking about how to make a modular, "plug and compile" system in VHDL.
Mon Nov 13 06:07:36 CET 2017
The IO system works with a single address bus, a write bus for the OUT instruction (broadcast to the peripherals) and a read bus for the IN instruction.
IN is a tree of OR gates. I count on the synthesiser to optimise and balance it. The A3P FPGA family has 3-input OR gates, providing these fan-ins :
1 layer : 3 inputs
2 layers : 9 inputs
3 layers : 27 inputs
4 layers : 81 inputs
5 layers : 243 inputs.
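The table above is just the powers of 3. As a minimal sketch (the function name `or3_layers` is made up for illustration and is not part of the project), the number of OR3 layers needed for a given number of readback sources can be computed like this, assuming a perfectly balanced tree:

```python
def or3_layers(inputs: int, fan_in: int = 3) -> int:
    """Smallest number of gate layers whose balanced tree can merge `inputs` signals."""
    layers, capacity = 0, 1
    while capacity < inputs:
        capacity *= fan_in  # each extra layer multiplies the reachable fan-in
        layers += 1
    return layers


if __name__ == "__main__":
    for n in (8, 16, 48, 256):
        print(n, "inputs ->", or3_layers(n), "layers of OR3")
```

A real synthesiser may balance the tree differently, so treat the result as a lower bound on the depth of the read path.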
The key is to decode the address bits as fast as possible, which also means with the least fanout possible.
One method is to use as many address bits as possible, to work in parallel and reduce the load on individual bits. There is potential for 9 bits of immediate address, using the Imm8 and R/I8 fields.
With more bits, not only is it possible to address more peripherals but also to use fewer logic gates (with shorter latency and less load) when the number of actual registers is low.
With 9 address bits, it is possible to address 9 separate registers individually : each address bit goes to a given register, in a "one-hot" encoding.
Add to this AND2 gates : the 9 bits are partitioned into 5 fields, mostly 2-bit fields that each decode to 4 addresses. This provides 4×4 addresses, plus one (which could be extended to 2 by sharing one bit with a neighbour), or a total of 18 addresses, at the short latency of one logic gate delay.
Bring in the AND3 gates, that make 3->8 decoders, and partition the 9 bits into 3 fields : that gives 3×8 = 24 addresses. Or a smaller range with 2:2:2:3 => 4+4+4+8 = 20 addresses. This is great if you have 3 peripherals with 4 addresses each, and one with 8 addresses.
Of course, some more decoding is required (validate the IN/OUT opcode, another 3 bits to AND together) but the latency is kept very low. This comes at the price of flexibility because only one instruction form is allowed and addresses must be fixed. No register index or indirect access. But it's fast and provides enough addressing room for the first implementations.
Then it's possible to group as 4+5, providing 16+32=48 registers. That's already 4 layers of OR3 for the readback...
Notice also that decoding one subfield is independent from the other fields : if you give a wrong address, you could hit more than one register at once, giving mangled results. Be warned !
This system is "future proof" in the sense that further development will be able to use the full 512-byte range. Meanwhile I need to come up with a configuration tool.
Yes I know it's a dirty kludge but it works :-D
Damnit ! I got my numbers wrong !
The code 0 for each decoder should not be used, so the corrected values are:
- 2:2:2:3 : 3+3+3+7 = 16
- 3:3:3 : 7+7+7 = 21
- 4:5 : 15+31 = 46
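The corrected figures follow from each field of width w contributing 2^w - 1 select lines instead of 2^w. A small sketch to double-check the arithmetic (the helper name `usable_addresses` is hypothetical, chosen only for this example):

```python
def usable_addresses(widths, skip_zero=True):
    """Total select lines over independent decoders of the given bit widths.

    With skip_zero=True, code 0 of every field is left unused.
    """
    return sum((1 << w) - (1 if skip_zero else 0) for w in widths)


if __name__ == "__main__":
    for widths in ((2, 2, 2, 3), (3, 3, 3), (4, 5)):
        label = ":".join(str(w) for w in widths)
        before = usable_addresses(widths, skip_zero=False)
        after = usable_addresses(widths)
        print(label, ":", before, "->", after)
```

The first number on each line is the earlier (wrong) count and the second one is the count with code 0 excluded.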
20180116 : I have been forced to recover one bit from the immediate addressing space. The R/I8 flag is now used for the IN / OUT flag and the Imm8 field leaves only 256 bytes of addressing space. With 8 address bits, the partitions become:
- 2:2:2:2 : 3+3+3+3 = 12
- 3:3:2 : 7+7+3 = 17
- 4:4 : 15+15 = 30
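To see both the counts and the "mangled results" hazard at once, here is a rough behavioural model of the field-partitioned decoder over the 8-bit address. It is only a sketch under assumptions: fields are taken most-significant first, a register is identified by a (field index, code) pair, and code 0 in a field selects nothing. All names are illustrative and do not come from the actual VHDL.

```python
def split_fields(address, widths):
    """Split an address into per-field codes, most significant field first (assumed order)."""
    codes = []
    shift = sum(widths)
    for w in widths:
        shift -= w
        codes.append((address >> shift) & ((1 << w) - 1))
    return codes


def selected(address, widths):
    """Registers hit by an address, as (field index, code) pairs; code 0 selects nothing."""
    return [(i, c) for i, c in enumerate(split_fields(address, widths)) if c != 0]


def clean_addresses(widths):
    """All addresses that select exactly one register."""
    return [a for a in range(1 << sum(widths)) if len(selected(a, widths)) == 1]


if __name__ == "__main__":
    for widths in ((2, 2, 2, 2), (3, 3, 2), (4, 4)):
        print(widths, len(clean_addresses(widths)), "usable addresses")
    # A "wrong" address with two non-zero fields hits two registers at once.
    print(selected(0x21, (4, 4)))
```

Enumerating the addresses by brute force gives the same totals as the list above, and any address with more than one non-zero field selects several registers simultaneously, which is exactly the case a configuration tool would have to forbid.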