Here is a summary of the design so far.
FC1, or F-CPU core #1, is the successor of FC0, which was designed more than 20 years ago. You can have a look at the original F-CPU manual for an overview of the original concept and history. FC1 is a more mature version that drops the ideas that failed and introduces new ones. The FC1 instruction set is inspired by FC0's but is incompatible with it.
So many features have changed/evolved but the founding spirit remains: make a decent application processor with a fresh RISC architecture, avoid complex out-of-order circuits and instead redesign the instruction set around the problems that OOO tries to solve in HW.
FC1 is a 4-way superscalar processor from the ground up. FC0 would require re-engineering to go superscalar and instead counted on its superpipeline (very short pipeline stages, or "the carpaccio approach") to reach high speed and throughput. The cost was more complexity, a longer pipeline and maybe lower single-thread performance (reminiscent of the "plague of the P4"). A low FO4 target can quickly hit a logic wall and the intended granularity might have been overly optimistic.
Instead, FC1 is designed as a superscalar with far fewer pipeline stages, which is easier to convert to 2-way or 1-way issue than the reverse. Code that is correctly compiled and scheduled will run equally well on the 3 possible implementations, though 4-way is the most natural choice. The 2-way and 1-way versions would be interesting for gate-limited targets, such as FPGAs.
Just like FC0, FC1 is an in-order processor that uses a scoreboard to stall the instructions at the decode stage if hazards are detected. To some, this is ugly and unthinkable in 2020 but the "lean philosophy" attempts to avoid feature creep that will add a considerable burden later.
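To make the scoreboard idea concrete, here is a minimal sketch of in-order issue with decode stalls. It is only an illustration: the single-issue model, the latencies and the `Scoreboard` and `run` names are assumptions made for this example, not FC1's actual logic.

```python
# Minimal in-order scoreboard sketch (illustrative only, not FC1's real logic).
# For each register we remember the cycle at which its pending result arrives;
# an instruction stalls at decode until every register it touches is available.

class Scoreboard:
    def __init__(self):
        self.ready_at = {}  # register name -> cycle at which its value is ready

    def issue(self, cycle, dest, srcs, latency):
        """Return the cycle at which the instruction actually issues."""
        regs = list(srcs) + [dest]  # sources (RAW) and destination (WAW)
        start = max([cycle] + [self.ready_at.get(r, 0) for r in regs])
        self.ready_at[dest] = start + latency
        return start


def run(program):
    """Issue instructions strictly in order, at most one per cycle."""
    sb = Scoreboard()
    cycle = 0
    issue_cycles = []
    for dest, srcs, latency in program:
        cycle = sb.issue(cycle, dest, srcs, latency)
        issue_cycles.append(cycle)
        cycle += 1  # the next instruction decodes one cycle later at the earliest
    return issue_cycles


# Hypothetical sequence: the second instruction needs R1 from the first one.
program = [
    ("R1", ("R2", "R3"), 3),  # long operation, result ready 3 cycles later
    ("R4", ("R1", "R5"), 1),  # depends on R1: stalls at decode
    ("R6", ("R7", "R5"), 1),  # independent, but stuck behind the stalled one
]
issue_cycles = run(program)
print(issue_cycles)
```

The third instruction has no hazard of its own yet still waits, which is the kind of stall the rest of the design tries to avoid by other means than OOO.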
Instead, the instruction set and architecture are designed to reduce the causes and effects of decode stalls. Precise exceptions, mostly from memory reference faults, are possible by splitting the classic "LOAD" or "STORE" instructions in two:
- compute the address, TLB lookup and tag the corresponding address register
- access the data and use it as operand for another operation
The access instructions are the ones that trap, once the address is known, but only if the address is actually referenced. This is possible by flagging the corresponding register as "invalid access", for example. This also enables prefetch, to hide some of the memory latency.
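The two steps above can be modelled in a few lines. This is a behavioural sketch under loud assumptions: the `AddressPort` class, the 4 KiB page size and the set of mapped pages standing in for the TLB are all invented for the illustration.

```python
class AccessFault(Exception):
    """Raised by the access step only, never by the address computation."""


class AddressPort:
    """Hypothetical model of a few A/D register pairs."""

    def __init__(self, memory, valid_pages, page_size=4096):
        self.memory = memory            # dict: address -> value (stand-in for RAM)
        self.valid_pages = valid_pages  # set of mapped page numbers (stand-in for the TLB)
        self.page_size = page_size      # assumed page size
        self.a = [0] * 4                # address registers
        self.invalid = [False] * 4      # "invalid access" flag, one per A register

    def set_address(self, n, address):
        """Step 1: compute the address, look it up, tag the register. Never traps."""
        self.a[n] = address
        self.invalid[n] = (address // self.page_size) not in self.valid_pages

    def read_d(self, n):
        """Step 2: use D[n] as an operand. Traps only if A[n] was flagged."""
        if self.invalid[n]:
            raise AccessFault(f"A{n} = {self.a[n]:#x} is not mapped")
        return self.memory.get(self.a[n], 0)


port = AddressPort(memory={0x1000: 42}, valid_pages={1})
port.set_address(0, 0x1000)
port.set_address(1, 0x9000)   # bad pointer, but nothing traps yet
value = port.read_d(0)        # valid access

try:
    port.read_d(1)            # the trap happens here, at the access
    trapped = False
except AccessFault:
    trapped = True
```

A bad pointer that is computed but never dereferenced costs nothing, which is also what makes speculative address computation and prefetch harmless.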
By the way, FC1 uses explicitly dedicated Address registers and Data registers. This reduces the complexity and overhead caused by FC0's more flexible and general approach, since now only 16 register addresses have to be flagged "invalid/ready" instead of 63.
Just like FC0, FC1 uses 64 registers, though as explained above, the register set is not homogeneous but split into 3 main functions. Just like the #YASEP and the #YGREC8, FC1 uses register-mapped memory:
- 32 "normal" registers (R0 to R31, and R0 is not hardwired to 0)
- A0 to A15 hold data addresses
- D0 to D15 are "windows" to the memory pointed to by the respective address register (they can be thought of as ports to the L0 memory buffers)
This is a LOT of ports to memory and the question of the relevance is legitimate (particularly since it creates a LOT of aliasing problems) but we'll see later that it also creates interesting opportunities.
If Data/Address registers can be paired, that makes 8 blocks of dual-port L1 cache memory. A particularly high bandwidth is expected, and it should be matched by the eventual L2 cache and main memory bandwidth, but this is not directly within the scope of the design. Let's just say that it's less constrained than most existing designs.
An even more radical aspect of FC1 is that the pipelines are "loosely coupled", and in fact quite decoupled. Each of the 4 pipelines has its own 2R1W register set with 16 addressable registers (8×R, 4×D, 4×A) to keep speed as high as possible. Gone is the humongous register set with 64 registers, 2 write ports and 3 read ports that was a major timing problem. Selecting only 2 operands among 16 is faster and smaller.
Each pipeline has dedicated registers and the only way to communicate between pipelines is to write the result of an operation to another target pipeline. This of course creates a hazard (and one stall cycle) but it keeps the decoder & Xbar complexity low.
The diagram above also shows that each "Globule", or pipeline+cache block, has 1 output port (wrongly tied to the source register selector) and selects input data from one of four sources: either the shared execution unit(s) or one of the 3 other globules. Two shared TLBs manage the aliases and check the data that go to the Address registers, while keeping cache eviction reasonable (if you manage your pointers well).
Look ma', no OOO !
The "loosely coupled" approach helps with code that would benefit from OOO but whose parallelism can be detected at compile time: "sub-threads" can be allocated to given pipelines to complete a sequence while the other pipeline(s) start a new sequence. A virtual FIFO (through the L0 instruction cache's multiple instruction pointers) lets a pipeline, or two, or three, stall during L1 cache misses, while the remaining pipeline(s) still proceed in the program's logic. While loads are big headaches (and can be managed through the A registers by early address computations), stores don't slow the program logic as long as no aliasing occurs.
Another breakthrough that is possible with a split register set is that all the pesky instructions that need more than 3 register addresses and don't fit in the clean 2R1W scenario are now handled by "paired instructions".
A pair of pipelines can now handle addition with carry, full-precision multiplies, or long shifts with almost no effort. The same instruction is duplicated BUT the 2nd instruction specifies a destination in another pipeline (which will stall to accept the new result). This is both trap-safe (the pair of instructions can be split into 2 and be functionally equivalent despite the break of the pair at decode time) and a good use of the available resources.
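For the full-precision multiply, the split into two halves can be sketched as follows. The 64-bit register width and the unsigned interpretation are assumptions (this summary states neither), and the function names simply reuse the IMULL/IMULH mnemonics that appear in this log.

```python
WIDTH = 64                  # assumed register width
MASK = (1 << WIDTH) - 1


def imull(a, b):
    """Low half of the product: stays in the issuing pipeline."""
    return (a * b) & MASK


def imulh(a, b):
    """High half of the product: sent to a register of another pipeline."""
    return ((a * b) >> WIDTH) & MASK


a = 0xFFFF_FFFF_FFFF_FFFF
b = 3

# The two halves only depend on the source operands, so they can be computed
# in the same cycle or one after the other (after a trap, for example).
low = imull(a, b)
high = imulh(a, b)
full = (high << WIDTH) | low
```

Neither half needs the other half's result, which is why breaking the pair at decode time gives the same final state as executing it in parallel.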
The example below shows such a case of paired instructions:
    ; here is a pair of instructions that will be decoded and executed in parallel
    IMULL R01, R2, R03 ; feed back the result in the pipeline
    IMULH R01, R2, R13 ; send extra result in another pipeline
Note that the register names are ... in octal. This is not to emulate Cray's philosophy but to ease coding: the first digit will encode the pipeline's number.
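As a quick illustration of the octal naming, the sketch below splits a register name into its pipeline number and its index inside that pipeline. It assumes the convention applies to the 32 R registers (8 per pipeline, hence one octal digit each); `decode_register` is a made-up helper, and how A and D registers are named is not covered here.

```python
def decode_register(name):
    """Split an octal register name such as 'R13' into (pipeline, local index)."""
    if not name.startswith("R"):
        raise ValueError(f"only R registers are handled in this sketch: {name!r}")
    number = int(name[1:], 8)   # the digits are octal
    if not 0 <= number < 32:
        raise ValueError(f"register number out of range: {name!r}")
    # High octal digit: pipeline (0 to 3). Low octal digit: register (0 to 7).
    return number >> 3, number & 7


for reg in ("R01", "R2", "R03", "R13"):
    pipeline, local = decode_register(reg)
    print(f"{reg}: pipeline {pipeline}, local register {local}")
```

With this reading, R01, R2 and R03 all live in pipeline 0 while R13 is register 3 of pipeline 1, which matches the comments of the paired multiply.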
Another note on instruction encoding: both operands of an instruction are located in the same pipeline. The result can be sent to another pipeline though. As a result, one needs to encode 6+6+4=16 bits, 2 bits fewer than FC0, due to the explicit partitioning of the register sets. Decoding is also greatly simplified/smaller.
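The bit count can be checked with a small packing sketch. The field layout is purely an assumption made for illustration: one possible reading of 6+6+4 is a full 6-bit address for the first operand (which also selects the pipeline), a 4-bit index for the second operand inside the same pipeline, and a full 6-bit destination. The actual FC1 field order is not given in this summary.

```python
FC0_BITS = 3 * 6        # three full 6-bit addresses into 64 registers
FC1_BITS = 6 + 6 + 4    # destination + first operand + local second operand


def pack(src1, src2_local, dest):
    """Pack the register fields into 16 bits (hypothetical field layout)."""
    if not (0 <= src1 < 64 and 0 <= src2_local < 16 and 0 <= dest < 64):
        raise ValueError("field out of range")
    return (src1 << 10) | (src2_local << 6) | dest


def unpack(word):
    """Inverse of pack(): returns (src1, src2_local, dest)."""
    return (word >> 10) & 0x3F, (word >> 6) & 0xF, word & 0x3F


word = pack(src1=0o01, src2_local=2, dest=0o13)
fields = unpack(word)
saved = FC0_BITS - FC1_BITS
```

Whatever the real layout is, the saving comes from the second operand: since it must sit in the same pipeline as the first, 4 bits are enough to select it among 16 registers.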
For paired instructions, the encoding is even smaller because some bits are implicit. The result can't be sent to another globule. A single instruction could also be used, that will be expanded by the decoder, sending the result to the opposite globule at an implicit address. The extra 3 bits can encode more options for the opcode.