So far, the 2R1W opcodes have 1 full destination field (6 bits) that determines the globule number, one partial source register address within the same globule, and one full register address that can read from any other globule. It's quite efficient on paper but the corresponding circuit is not as straight-forward as I'd like.
Decoding the full source field creates significant delays because all the source globules must be checked for conflict, the 4 addresses must be routed/crossbarred, oh and the globules need 3-read register sets.
Inter-globule communication will be pretty intense and will determine the overall performance/efficiency of the architecture... it is critical to get it right. And then, I remember one of the lessons learned with the #YASEP Yet Another Small Embedded Processor : You can play with the destination address because you have time to process it while the value is being decoded, fetched and computed.
OTOH the source registers must be directly read. Any time spent tweaking it will delay the appearance of the result and increase the cost of branches, among others. So I evaluate a new organisation :
- one full (6 bits) source register address, that determines the globule
- one partial (4 bits) source register address, in the same globule as above
- one full destination address register that might be in a different globule.
This way, during the fetch and execute cycles, there is ample time to gather the 4 destination globule numbers and prepare a crossbar, eventually detect write after write conflicts, route the results to the appropriate globules...
The partial address field is a significant win when compared to FC0, there are 16 address bits compared to 18 in the 2000-era design. This means more space for opcodes or options in the instruction word. And moving the complexity to the write stage also reduces the size of the register sets, that now have only 2 read ports !
But communication will not be as flexible as before...
I have considered a few compromises. For example, having shared registers that are visible across globules. It would create more problems than it solves : it effectively reduces the number of real registers and makes allocation harder, among other things that could go wrong. Forget it.
Or maybe we can use more bits in the destination register. The basic idea is to use 1 bit per destination globule and we get a bitfield. The destination address has 4 bits to select the destination globule, and 4 bits for the destination register in the globule. The result could be written to 4 globules at once !
Decoding and detecting the hazards would be very easy, by working with 4 decoded bits directly. The control logic is almost straight-forward.
However this wastes all the savings in precious opcode bits and many codes would not be used or make sense.
A globule field with 3 bits would be a good compromise, and the most usual codes would be expanded before going to the hazard detection logic :
000 G0 001 G1 010 G2 011 G3 100 Gx, Gx+1 101 Gx, Gx+2 110 Gx, Gx+3 111 G0, G1, G2, G3 (broadcast)
The MSB selects between unicast and multicast. One instruction can write to one, two or four globules at once. The last option would be very efficient for 4-issue wide pipelines, and take more cycles for smaller versions.
x is the local globule, determined by the source's full address. Some wire swappings and we get the proper codes for every decode slot simultaneously...
Moving data to a different globule would still have one cycle of penalty (because transmission is slow across the chip) but this is overall much better than delaying the whole code when one source must be fetched in another globule. Furthermore, broadcast uses fewer instructions.
It is less convenient than reading from a different globule, because moves must be anticipated. This is just a different way to read/scan/analyse/partition the dataflow graph of the program... More effort for the compiler but less complexity for the whole processor. And a simpler core is faster, cheaper and more surface&power-efficient :-)