As the FC1 moved from a single-issue superpipeline to a superscalar architecture, instruction decoding faces new challenge. There is the potential for up to 4 instructions to be issued for every clock cycle and this is a lot !
Depending on the actual implementation and object code, 1, 2, 3 or 4 instructions might be issued to any of the 4 parallel execution units ("globules") and this requires significant efforts. Some processors like the Alpha 21264 managed this pretty well, two decades ago, so there is a hope it is doable. Later, the 21264 added OOX but there is no point to go this far for now.
Anyway there is pressure on both the implementations (1, 2, 3 or 4-issue wide versions) and the code generators to create efficient code that runs optimally everywhere.
One option I had considered for a while was to create a pseudo-instruction (or meta-instruction) that indicates the properties of the following instruction bundle. This is pretty under-efficient though and would waste code size for single-issue implementations.
I think I have found a much better option, explained below.
The instructions operate on one globule at a time, over the local 16-registers set. That's 4×2 source address bits for the source operands. The destination address requires 6 bits, for the 4-bits local address and two more bits to indicate the destination globule.
It is necessary to extend this system because globules would not be able to communicate so one source register gets more bits to access the global register set. Of course, an implementation might not be happy with it because it adds a 3rd read port to the register set for the sake of intercommunication and that would still not allow several instructions to read from the same globule at once, but this can be solved with hazard detection logic.
I consider adding another bit to the destination address field, which is a "duplicate instruction" flag. I borrow the idea from #RISC Relay CPU but instead of a serial operation (used in @roelh's architecture as a "repeat for another cycle" order), FC1 does it in parallel, sending the same opcode and register address to a globule pair. This should help with pointer increments or flushing/shuffling registers with less code. Not sure it will be used in the end but there is an opportunity here.
Now, this is where the fun part starts.
A group of parallelly-executable instruction requires no additional information than what is already encoded in the instruction flow. Since I have reduced the addressing range, it's easier to check for data read and write hazards. Only 4 bits per instruction are required to evaluate the data dependencies, 2 of them indicate which globule is addressed.
Therefore, a group is formed by a sequence of instructions where the destination globule (2MSB of destination address) is stricly in increasing sequence.
This is illustrated by the following random sequence where the numbers are grouped on one line according to the above rule.
$ strings /dev/urandom |grep '[1-4]'|sed 's/[^1234]//g' 4 4 13 13 4 1 14 13 2 2 14 2 14 24 4 13 4 14 4 2 2 1 1 14 2 13 13 4
When two consecutive instructions operate on the same destination globule, there is an obvious hazard and the second instruction must be delayed.
The logic that evaluates this condition requires only a few gates and can run very fast.
This first example with random data shows that not all instructions can be grouped (one half can) and the groups don't reach 3 members (in this example at least) so a first implementation can safely use only 2 instruction decoders, not 4. A wider decode unit will require more instruction set bandwidth and a better compiler, but more importantly : source code with enough available ILP.
It's less obvious if some instructions can be paired differently, for example across increasing sequences. If 4 is followed by 1, thy could be both exectuted at the same time if no write hazard is detected, but this requires a better analysis (and take the inter-globule communication bottleneck into account)
This also does not take into account side-effect caused by register-mapped memory.
Anyway, parallel execution is required for reaching a significant performance, since address generation will use the common computation resources. A value from RAM can be read at the same time as its pointer is updated by the following instruction. This is a key to preserving the memory performance advantage enabled by register-mapped memory.