Close

Overview of the FC1 (F-cpu Core #1)

A project log for F-CPU

The Freedom CPU project has a log here too now :-)

yann-guidon-ygdesYann Guidon / YGDES 10/17/2016 at 12:050 Comments

More than 15 years have passed since FC0 was drafted. It's a very nice and venerable architecture but time has shown some of its practical limitations. The proposed FC1 addresses most of them, thanks to the experience gained since. The #AMBAP: A Modest Bitslice Architecture Proposal has been the most influential inspiration lately but it's just the melon on the already huge cake of my design explorations. Some things remain the same as in FC0, some things have changed and some have been radically altered.

What remains the same

I keep everything that makes sense and is characteristic of the project.

Some details have changed

What departs from the FC0

I simply dropped the load/store instructions altogether. Since I have solved some compiler issues with the YASEP, I can now confidently use the same techniques with F-CPU.

Each scalar globule have half of their registers dedicated to memory access, with the same register-mapped memory principles. With 4 A/D register pairs per globule, there can be two dual-ported SRAM blocks (or cache) per globule.

This changes everything. With 8 pairs of data/address registers, the instantaneous bandwidth is not comparable with standard CPU cores of this class (scalar in-order). The split "globule" architecture means that 4 data can be read and 2 data written. Simultaneously. With only two instructions in one clock cycle. This greatly compensates for the lower number of registers.

Benefits

This new structure solves many issues I had with the FC0.

First, there was this huge, slow crossbar... Now it's gone and the most usual instructions save one cycle (ADD/SUB/ROP2 have no latency, though the SHL and MUL units are shared and might add some latency).

Then there was the memory system. FC0's was complex and very architecture-dependant.

Also that large register set with 3R2W and out-of-order-completion scheduling : gone.

The FC1 can scale up (as FC0) but also down (32 bits), not just in data width but also in IPC : it's easy to design a decoder for 1 or 2 instructions per cycle, as well as more when the SIMD globules are implemented.

A smaller register set (per computation unit) means that it's possible to implement a larger physical file, either for improved cross-globule latency, or for implementing "multithreading" (barrel CPU).

I believe the FC1 will be faster and more efficient, as well as easier to design.


There is a major difference with YASEP though. It is not practical to perform address register post-updates in the same instruction because

  1. There is not enough room for update bits in the instruction (18 bits are already used for the register addresses, 2 more for the size bits, leaving only 12 bits for opcode and fancy stuff)
  2. The globule can only compute one operation at a time, but pointer update should be computed in parallel for lower latency.

The YASEP has special hardware to update the pointers but the F-CPU is too general for this. The solution is in the organisation and allocation of the registers, with cross-linked addresses and data:

This allows pairs of instructions to execute in a single cycle, the instruction reads the D register(s) and the following one updates the A register. Since the registers belong to different globules, they are "pairable" and can be executed in parallel in a single cycle.

Discussions