
Instruction packing

A project log for F-CPU

The Freedom CPU project has a log here too now :-)

Yann Guidon / YGDES, 12/20/2016 at 06:09

(updated 20210102)

As the FC1 moves from FC0's single-issue superpipeline to a superscalar architecture, instruction decoding faces new challenges. Up to 4 instructions can be issued every clock cycle, and this is a lot!

Depending on the actual implementation and the object code, 1, 2, 3 or 4 instructions might be issued to any of the 4 parallel execution units ("globules"), and this requires significant effort. Some processors, like the Alpha 21164, managed this pretty well two decades ago, so there is hope that it is doable. Later, the 21264 added out-of-order execution, but there is no point in going that far for now.

Anyway, there is pressure on both the implementations (1-, 2-, 3- or 4-issue-wide versions) and the code generators to create efficient code that runs optimally everywhere.

One option I had considered for a while was to create a pseudo-instruction (or meta-instruction) that indicates the properties of the following instruction bundle. This is pretty inefficient though, and would waste code size on single-issue implementations.

I think I have found a much better option, explained below.


The instructions operate on one globule at a time, over its local set of 16 registers. That's 2×4 address bits for the two source operands. The destination address requires 6 bits: a 4-bit local address plus two more bits to select the destination globule.

This system must be extended, because otherwise the globules could not communicate: one source register address gets two more bits to reach the global register set. An implementation might not be happy with this, because it adds a 3rd read port to each register set for the sake of intercommunication, and it still does not allow several instructions to read from the same globule at once, but this can be solved with hazard detection logic.

Instruction format :
bits  function
 16   Opcode & flags
  2   SG Source globule
  4   Src1 reg#
  4   Src2 reg#
  2   Destination globule
  4   Dest reg#
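
As a sanity check, the fields above sum to 32 bits. Here is a hypothetical packing of them into an instruction word, as an illustration only: the field widths come from the table, but the bit ordering is my own assumption, not something specified by the project.

```python
# Hypothetical field layout for the 32-bit FC1 instruction word.
# Only the field WIDTHS come from the format table above;
# the MSB-first ordering is an assumption for illustration.
OPCODE_BITS, SG_BITS, SRC_BITS, DG_BITS, DST_BITS = 16, 2, 4, 2, 4

def encode(opcode, sg, src1, src2, dg, dst):
    """Pack the six fields, MSB-first: opcode | SG | src1 | src2 | DG | dst."""
    word = opcode
    word = (word << SG_BITS)  | sg
    word = (word << SRC_BITS) | src1
    word = (word << SRC_BITS) | src2
    word = (word << DG_BITS)  | dg
    word = (word << DST_BITS) | dst
    return word

def dest_globule(word):
    """Extract the 2-bit destination globule, the field the packing rule reads."""
    return (word >> DST_BITS) & ((1 << DG_BITS) - 1)
```

With this layout, the decoder only needs the 2 bits just above the destination register number to drive the grouping logic described below.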

Most RISC CPUs have 32 registers and use 3×5 = 15 bits for the register addresses; FC1 has twice as many registers but uses only one more bit (16 in total).

I am considering adding another bit to the destination address field: a "duplicate instruction" flag. I borrow the idea from #RISC Relay CPU, but instead of a serial operation (used in @roelh's architecture as a "repeat for another cycle" order), FC1 does it in parallel, sending the same opcode and register addresses to a pair of globules. This should help with pointer increments or with flushing/shuffling registers using less code. I am not sure it will be used in the end, but there is an opportunity here.

Now, this is where the fun part starts.

A group of instructions that can execute in parallel requires no more information than what is already encoded in the instruction flow. Since the addressing range is reduced, it is easier to check for data read and write hazards: only 4 bits per instruction are needed to evaluate the data dependencies, 2 of which indicate which globule is addressed.

Therefore, a group is formed by a sequence of instructions where the destination globule (the 2 MSB of the destination address) is strictly increasing.
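
A minimal sketch of that rule, in Python purely as an illustration (it is not part of the project): a new packet starts whenever the destination globule is not strictly greater than the previous one.

```python
def pack(globules):
    """Split a stream of destination-globule numbers (1..4) into packets.

    A packet is a maximal run where the globule number strictly increases,
    which is exactly the grouping rule described above.
    """
    packets = []
    for g in globules:
        if packets and g > packets[-1][-1]:
            packets[-1].append(g)   # strictly increasing: same packet
        else:
            packets.append([g])     # boundary: start a new packet
    return packets

print(pack([1, 3, 4, 2, 2, 4]))     # -> [[1, 3, 4], [2], [2, 4]]
```

Note that the boundary test only compares two 2-bit fields, which hints at how cheap the decoding logic can be.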

Here are all the possible combinations for a valid "instruction packet":
   4
  3 
  34
 2
 2 4
 23
 234
1
1  4
1 3
1 34
12
12 4
123
1234

Yes, it's a binary tree with all 15 combinations. All of them are valid and there is no funky restriction. The difficult part is the multiplexing/selection of the instruction words, which sends them to their respective globule.

This is further illustrated by the following random sequence where the numbers are grouped on one line according to the above rule.

$ strings /dev/urandom |grep '[1-4]'|sed 's/[^1234]//g'

4
4
13
13
4
1
14
13
2
2
14
2
14
24
4
13
4
14
4
2
2
1
1
14
2
13
13
4

20170409: why the increasing order? Because 1) the packet boundaries are easier to spot (and easy to decode in binary), and 2) it simplifies and reduces the number of MUXes inside the decoder: by definition, the 2nd instruction of a packet doesn't need to go to Globule 1.

When two consecutive instructions target the same destination globule, there is an obvious hazard and the second instruction must be delayed.

The logic that evaluates this condition requires only a few gates and can run very fast, which helps issue instructions at the highest speed.

This first example with random data shows that not all instructions can be grouped (on average, with a purely random/non-optimised stream, only about one half can) and the groups don't reach 3 members (in this example at least), so a first implementation can safely use only 2 instruction decoders instead of 4. A wider decode unit will require more instruction fetch bandwidth and a better compiler, but more importantly: source code with enough available ILP.
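
That "about one half" figure can be checked with a quick simulation; this is my own estimate, not a number from the project. For independent, uniformly random globule numbers, an interior instruction starts a new packet with probability 5/8, so the average packet size is 8/5 = 1.6.

```python
import random

def packet_sizes(globules):
    """Lengths of the maximal strictly-increasing runs (the packet rule)."""
    sizes, prev = [], None
    for g in globules:
        if prev is not None and g > prev:
            sizes[-1] += 1          # still strictly increasing
        else:
            sizes.append(1)         # packet boundary
        prev = g
    return sizes

random.seed(0)
stream = [random.randint(1, 4) for _ in range(100_000)]
sizes = packet_sizes(stream)
avg = len(stream) / len(sizes)
print(f"average packet size: {avg:.3f}")   # should be close to 8/5 = 1.6
print(f"packets wider than 2: {sum(s > 2 for s in sizes) / len(sizes):.1%}")
```

The strictly-increasing rule also caps packets at 4 members, since a globule number from 1 to 4 can only increase three times.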

It is less obvious whether some instructions could be paired differently, for example across increasing sequences. If 4 is followed by 1, both could be executed at the same time if no write hazard is detected, but this requires a deeper analysis (and must take the inter-globule communication bottleneck into account).

Following this argument, the packet rule becomes cyclical, relative to the first element found :

Base1  Base2  Base3  Base4
1      2      3      4
1  4   2  1   3  2   4  3
1 3    2 4    3 1    4 2
1 34   2 41   3 12   4 23
12     23     34     41
12 4   23 1   34 2   41 3
123    234    341    412
1234   2341   3412   4123
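
The cyclical variant can be expressed the same way; again, this is an illustrative Python sketch rather than project code. Instead of comparing raw globule numbers, it compares each number's offset from the packet's first element, modulo 4.

```python
def pack_cyclic(globules):
    """Packets where (g - first) mod 4 is strictly increasing within a packet."""
    packets = []
    for g in globules:
        if packets:
            base = packets[-1][0]   # first globule of the current packet
            prev = packets[-1][-1]  # last globule of the current packet
            if (g - base) % 4 > (prev - base) % 4:
                packets[-1].append(g)
                continue
        packets.append([g])
    return packets

print(pack_cyclic([4, 1]))          # -> [[4, 1]]  (grouped across the wrap)
print(pack_cyclic([2, 3, 4, 1]))    # -> [[2, 3, 4, 1]]  (the "2341" row above)
```

The plain increasing rule is the special case where the base is always 1, which is what makes the original decoder MUXes simpler.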

The 4 globule numbers can be decoded into 4×4 wires then basic logic can detect hazards.

This also does not take into account the side effects caused by register-mapped memory.

Anyway, parallel execution is required to reach significant performance, since address generation uses the common computation resources: a value from RAM can be read at the same time as its pointer is updated by the following instruction. This is key to preserving the memory performance advantage enabled by register-mapped memory.
