More than 15 years have passed since FC0 was drafted. It's a very nice and venerable architecture but time has shown some of its practical limitations. The proposed FC1 addresses most of them, thanks to the experience gained since. The #AMBAP: A Modest Bitslice Architecture Proposal has been the most influential inspiration lately but it's just the melon on the already huge cake of my design explorations. Some things remain the same as in FC0, some things have changed and some have been radically altered.
What remains the same
I keep everything that makes sense and is characteristic of the project.
- F-CPU is a 64-bit design (well, mostly) that can scale up to arbitrary widths (ARM recently jumped on the bandwagon)
- Instructions are (almost typical) RISC and 32 bits wide, with 2 register operands and 1 destination register.
- There are 64 registers
- It's aimed at performance for general computation tasks and applications.
Some details have changed
- No need for a cleared register. Register #0 will not be hardwired to 0.
- Instead, the Instruction Pointer and Next Instruction Pointer will be more useful.
- A 32-bit subset will exist, to bootstrap the design and allow smaller implementations for embedded purposes.
- The register set is split into 4 "globules": one pair for scalar operations and memory, the other pair for wide SIMD operations (the SIMD pair can be implemented as scalar but will then trap on SIMD instructions)
What departs from the FC0
I simply dropped the load/store instructions altogether. Since I have solved some compiler issues with the YASEP, I can now confidently use the same techniques with F-CPU.
Each scalar globule has half of its registers dedicated to memory access, with the same register-mapped memory principles. With 4 A/D register pairs per globule, there can be two dual-ported SRAM blocks (or caches) per globule.
This changes everything. With 8 pairs of data/address registers, the instantaneous bandwidth is well beyond that of standard CPU cores of this class (scalar in-order). The split "globule" architecture means that 4 data words can be read and 2 written. Simultaneously. With only two instructions in one clock cycle. This greatly compensates for the lower number of registers.
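As a rough picture of what "register-mapped memory" means here, the toy model below assumes the YASEP-style behaviour where a D register acts as a window onto the memory word addressed by its A register. That behaviour, the class name `ADPair` and the plain Python list standing in for an SRAM block are all assumptions made for illustration, not a description of the actual FC1 hardware.

```python
class ADPair:
    """Toy model of one address/data register pair (hypothetical, not FC1 RTL)."""

    def __init__(self, memory):
        self.memory = memory      # a list standing in for an SRAM block
        self._address = 0

    @property
    def A(self):
        return self._address

    @A.setter
    def A(self, value):
        # Assumption: addresses simply wrap around the block size.
        self._address = value % len(self.memory)

    @property
    def D(self):
        # Reading D returns the word that A currently points to.
        return self.memory[self._address]

    @D.setter
    def D(self, value):
        # Writing D stores the word at the address held in A.
        self.memory[self._address] = value


# Copy 4 words without any explicit load or store instruction.
sram = [10, 11, 12, 13, 0, 0, 0, 0]
src, dst = ADPair(sram), ADPair(sram)
src.A, dst.A = 0, 4
for _ in range(4):
    dst.D = src.D     # one "move" reads memory and writes memory
    src.A += 1        # pointer updates are separate operations
    dst.A += 1
```

In this model the copy loop only ever touches registers, and the memory traffic is a side effect of using the D registers.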
This new structure solves many issues I had with the FC0.
First, there was this huge, slow crossbar... Now it's gone and the most usual instructions save one cycle (ADD/SUB/ROP2 have no latency, though the SHL and MUL units are shared and might add some latency).
Then there was the memory system. FC0's was complex and very architecture-dependent.
Also that large register set with 3R2W and out-of-order-completion scheduling: gone.
The FC1 can scale up (as FC0) but also down (32 bits), not just in data width but also in IPC: it's easy to design a decoder for 1 or 2 instructions per cycle, as well as more when the SIMD globules are implemented.
A smaller register set (per computation unit) means that it's possible to implement a larger physical file, either for improved cross-globule latency, or for implementing "multithreading" (barrel CPU).
I believe the FC1 will be faster and more efficient, as well as easier to design.
There is a major difference with YASEP though. It is not practical to perform address register post-updates in the same instruction because:
- There is not enough room for update bits in the instruction (18 bits are already used for the register addresses, 2 more for the size bits, leaving only 12 bits for opcode and fancy stuff)
- The globule can only compute one operation at a time, but pointer update should be computed in parallel for lower latency.
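The bit budget in the first point can be checked with a few lines of arithmetic. The sketch below uses only the widths given in the text (64 registers, 3 register fields, 2 size bits, 32-bit instructions). The field order used by `encode` and `decode`, and both function names, are hypothetical and chosen only to make the example runnable.

```python
INSTRUCTION_BITS = 32
REGISTER_COUNT = 64
REGISTER_FIELDS = 3      # 2 register operands + 1 destination
SIZE_BITS = 2

REG_FIELD_BITS = (REGISTER_COUNT - 1).bit_length()          # bits per register field
REGISTER_BITS = REGISTER_FIELDS * REG_FIELD_BITS            # bits for all register fields
OPCODE_BITS = INSTRUCTION_BITS - REGISTER_BITS - SIZE_BITS  # what is left over


def encode(opcode, size, src1, src2, dest):
    """Pack the fields into one word (the field order is an assumption)."""
    fields = (
        (opcode, OPCODE_BITS),
        (size, SIZE_BITS),
        (src1, REG_FIELD_BITS),
        (src2, REG_FIELD_BITS),
        (dest, REG_FIELD_BITS),
    )
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Inverse of encode(): returns (opcode, size, src1, src2, dest)."""
    values = []
    # Fields are extracted from the least significant end: dest first, opcode last.
    for width in (REG_FIELD_BITS,) * 3 + (SIZE_BITS, OPCODE_BITS):
        values.append(word & ((1 << width) - 1))
        word >>= width
    dest, src2, src1, size, opcode = values
    return opcode, size, src1, src2, dest
```

With these widths, `OPCODE_BITS` comes out at 12. Any per-instruction pointer-update flags would have to be carved out of those 12 bits, at the expense of the opcode space.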
The YASEP has special hardware to update the pointers but the F-CPU is too general for this. The solution is in the organisation and allocation of the registers, with cross-linked addresses and data:
- Globule A holds registers A0-3 and D4-7
- Globule B holds registers A4-7 and D0-3
This allows pairs of instructions to execute in a single cycle: the first instruction reads the D register(s) and the following one updates the matching A register. Since these registers belong to different globules, the two instructions are "pairable" and can be executed in parallel in a single cycle.
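The cross-linked allocation can be made concrete with a small lookup table. The register-to-globule mapping below comes straight from the two bullets above. The `pairable` rule is a simplified assumption: it says two instructions can issue together when they touch disjoint globules, and it ignores the non-memory registers and any other issue constraints.

```python
# Mapping taken from the text: globule A holds A0-3 and D4-7,
# globule B holds A4-7 and D0-3.
GLOBULE_OF = {}
for i in range(4):
    GLOBULE_OF[f"A{i}"] = "A"
    GLOBULE_OF[f"D{i + 4}"] = "A"
    GLOBULE_OF[f"A{i + 4}"] = "B"
    GLOBULE_OF[f"D{i}"] = "B"


def pairable(first_regs, second_regs):
    """True if the two instructions use disjoint globules (assumed rule)."""
    first = {GLOBULE_OF[r] for r in first_regs}
    second = {GLOBULE_OF[r] for r in second_regs}
    return first.isdisjoint(second)


# Reading through D0 (globule B) while bumping its pointer A0 (globule A):
read_then_update = pairable(["D0"], ["A0"])
# D0 and A4 both live in globule B, so this pair would have to serialize:
same_globule = pairable(["D0"], ["A4"])
```

Under this reading, every A/D pair with the same index straddles the two globules, which is what lets a data access and its pointer update share a cycle.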