The Freedom CPU project has a log here too now :-)

Similar projects worth following
Today, the Freedom CPU Project's goal is to create and distribute the source code of a microprocessor core under a copyleft license: all the VHDL sources, resources and most tools are Free as in Free Speech. When it was created in 1999, the F-CPU Core #0 (FC0) was the first purely SIMD superpipelined RISC CPU core that could handle 64-bit data and wider.

After the F-CPU project's freeze in 2004 and some explorations of embedded designs, the tools developed for are considered for a reboot of this project, with an updated framework, a better roadmap and new direction.

The instruction set architecture and the microarchitecture will be refined and should provide a nice Application Processor (32 and 64 bits versions) and an Offload Processor (wide SIMD, 64 bits and much more, for sound, graphics, physics etc.).

For a good overview from "back then" (in French) :

Look at all the (still working) links and local copies !

F-CPU has now evolved and the new "cluster of globules" version is described here.

1. Overview of the FC1 (F-cpu Core #1)
2. Operating System and Security
3. Instruction packing
4. Pairing and memory read/write
5. Memory-mapped registers

  • Check-in, Check-out

    Yann Guidon / YGDES03/07/2018 at 08:22 0 comments

    I think I'm closer to the solution of the old old problem of copy-less data transmission between threads or processes.

    Already, switching from one process to another is almost as easy as a Call/Return within a given process. The procedure uses 3 dedicated opcodes : IPC, IPE and IPR (InterProcess Call/Enter/Return). The real problem is how to transfer more data than what the register set can hold.

    My new solution is to "mark" certain address registers to yield the ownership of the pointer to the called process.

    • The caller saves all the required registers on its stack and "yields" one or more pointers with a sort of "check-out" instruction that sets the Read and Write bits, as well as changes the PID (ownership of the block) OR puts the block in a sort of "Limbo" state (still owned but some transfer of ownership is happening, in case the block is not used by the recipient and needs to be reused).
    • The called process "accepts" the pointer(s) with a "check-in" instruction so the blocks can be accessed within the new context and mapped to the new process' addressing range.

    This creates quite a lot of new sub-problems but I believe it is promising.

    • How do you define a block ? It's basically 1)  a base pointer 2) a length 3) access rights 4) owner's PID. All of these must have some sort of HW support, merged with the paging system.
      The sizes would probably be powers-of-two, so a small number easily describes it, and binary masks checks the pointer's validity.
    • Transferable blocks must come from a global "pool" located in an addressing region that is identical for everybody because we don't want to bother with translating the addresses.
    • IF there is a global pool then one process must be responsible for handling the allocation and freeing of those blocks.

    This affects the paging system in many ways so this will be the next thing to study...

  • Memory-mapped registers

    Yann Guidon / YGDES02/22/2018 at 04:41 0 comments

    I just had this little "hah!" moment...

    I keep thinking about how to make FC1 much more badass. While writing one of the latest logs of #PDP - Processor Design Principles  I realised I could/should have more than one data register per address register.

    For example, 4 address registers are linked to one data register, and 4 other address registers are linked to 4 data registers each. Advantages include :

    • much faster and lighter call/return (it would emulate register windows sans their drawbacks)
    • faster register save and backup
    • better memory bandwidth
    • solves loop unrolling
    • fewer aliasing problems ?
    • easier unaligned access in one cycle (by combining 2 pipelines for a 2R2W instruction)
    • reduces the number of address registers and pointer update instructions

    It looks like a weird mix between Itanium, SPARC and TMS9900... I have to ponder more about it but the overwhelming benefits are enticing.

    I also have to find a proper behaviour for pointer aliases, should they trap ?

  • Pairing and memory read/write

    Yann Guidon / YGDES04/09/2017 at 08:38 0 comments

      The register-mapped memory system, pioneered by Seymour Cray in the CDC6600, has many benefits and I developed it in my own way in the #YASEP

      I used a different approach however, where the data register both reads and writes. The CDC had data 5 registers for reading and 2 for writing, with different code sequences.

      For the YASEP, my approach makes perfect sense because the core uses onchip memory with single-cycle latency. You set the address register and by the next cycle, the data is read, whether you need it or not. Same for the #YGREC where the access time is "relatively immediate".

      The F-CPU is a different beast though. We expect multi-level memory and going off-chip is a definitive performance killer. There must be a way to determine if a memory access is for reading or writing.

      The CDC6600 version, with registers that are dedicated to functions, is not possible : there are already 16 registers, 1/4 of the total, and there is no room left, either in the register set or the opcodes (each bit is precious !)

      The semantics of the address/data registers is to access the memory contents directly, through a buffer like a cache line, for example. Loading the cache line is the costly part.

      How do I tell the CPU to load the cache line ? Simply by using/reading the data register. But doing this will stall the core if the cache line is not ready... This totally defeats the purpose of the system, where the memory's latency is hidden by explicit scheduling of the instructions: you calculate the address, schedule some instructions while waiting for the data to arrive, then read the register.

      The costly part is the reading and it's too unpredictible : the data could be in the cache, in the local DRAM, on another CPU, or swapped out... So let's think in reverse and consider the write operation : the sequence goes like

      1. write the destination address to the address register
      2. write the data to the data register

      The point of this log is to make a sort of atomic pair that the decoder can easily detect : the destination register must be the same (modulo one inverted bit) in two consecutive instructions (or closely enough, if the core can afford the comparators) to indicate that it's a write and the memory system shouldn't fetch the line right now.

      There are a couple of things to notice :

      1. the address register of one port must be located in the "mirror" globule of the data register so the two instructions (set address, write data) can be "emitted" at the same time (and only one comparator is required, no comparison across pairs and simpler decoder)
      2. the SIMD globules need their own data ports but the address must remain in the scalar globules. I should change the register map... but now each of the 4 globules has their own dedicated memory block, with a matching data size !

      And now, due to the pairing rules, it seems obvious that all the address registers must reside in the first globule, or else the pairing rules must be more complex, but this creates an imbalance in the register allocation...

      Honestly I dislike the new asymmetrical design because it makes the architecture less orthogonal and potentially less efficient to code for. Architecturally, you only had to design one globule (or two if you want SIMD) and there you go, copy-paste-mirror-connect and you're done.

      But let's look at the new partition :

      1. First globule is specialised for addresses, has A0-A7, and its dedicated memory block accessed with D0 & D1. Since you write all the addresses there, the TLB is directly connected to its ALU's output. Aliasing is directly managed there. The 6 remaining registers can be used to store some pointers, frame base, indices and increments, as well as supplement the standard computations.
      2. Second globule has only 2 data registers, to access the dedicated cache. 14 working registers can do some meaningful work. No TLB here, it can be "replaced" by a barrel shifter/multiplier array/division unit
      3. The 3rd globule has 2 data registers, accessing the wider (but shallower) cache...
    Read more »

  • Instruction packing

    Yann Guidon / YGDES12/20/2016 at 06:09 0 comments

    As the FC1 moved from a single-issue superpipeline to a superscalar architecture, instruction decoding faces new challenge. There is the potential for up to 4 instructions to be issued for every clock cycle and this is a lot !

    Depending on the actual implementation and object code, 1, 2, 3 or 4 instructions might be issued to any of the 4 parallel execution units ("globules") and this requires significant efforts. Some processors like the Alpha 21264 managed this pretty well, two decades ago, so there is a hope it is doable. Later, the 21264 added OOX but there is no point to go this far for now.

    Anyway there is pressure on both the implementations (1, 2, 3 or 4-issue wide versions) and the code generators to create efficient code that runs optimally everywhere.

    One option I had considered for a while was to create a pseudo-instruction (or meta-instruction) that indicates the properties of the following instruction bundle. This is pretty under-efficient though and would waste code size for single-issue implementations.

    I think I have found a much better option, explained below.

    The instructions operate on one globule at a time, over the local 16-registers set. That's 4×2 source address bits for the source operands. The destination address requires 6 bits, for the 4-bits local address and two more bits to indicate the destination globule.

    It is necessary to extend this system because globules would not be able to communicate so one source register gets more bits to access the global register set. Of course, an implementation might not be happy with it because it adds a 3rd read port to the register set for the sake of intercommunication and that would still not allow several instructions to read from the same globule at once, but this can be solved with hazard detection logic.

    I consider adding another bit to the destination address field, which is a "duplicate instruction" flag. I borrow the idea from #RISC Relay CPU but instead of a serial operation (used in @roelh's architecture as a "repeat for another cycle" order), FC1 does it in parallel, sending the same opcode and register address to a globule pair. This should help with pointer increments or flushing/shuffling registers with less code. Not sure it will be used in the end but there is an opportunity here.

    Now, this is where the fun part starts.

    A group of parallelly-executable instruction requires no additional information than what is already encoded in the instruction flow. Since I have reduced the addressing range, it's easier to check for data read and write hazards. Only 4 bits per instruction are required to evaluate the data dependencies, 2 of them indicate which globule is addressed.

    Therefore, a group is formed by a sequence of instructions where the destination globule (2MSB of destination address) is stricly in increasing sequence.

    This is illustrated by the following random sequence where the numbers are grouped on one line according to the above rule.

    $ strings /dev/urandom |grep '[1-4]'|sed 's/[^1234]//g'

    20170409: why increasing order ? Because 1) the boundaries are easier to spot (and are easy to decode in binary) 2) it simplifies and reduces the number of MUXes inside the decoder, the 2nd instruction by definition doesn't need to go to Globule 1

    When two consecutive instructions operate on the same destination globule, there is an obvious hazard and the second instruction must be delayed.

    The logic that evaluates this condition requires only a few gates and can run very fast.

    This first example with random data shows that not all instructions can be grouped (one half can) and the groups don't reach 3 members (in this example at least) so a first implementation can safely use only 2 instruction decoders, not 4. A wider decode unit will require more instruction set bandwidth and a better compiler, but more importantly : source code with enough available ILP.

    It's less obvious if some instructions...

    Read more »

  • Operating System and Security

    Yann Guidon / YGDES11/16/2016 at 23:26 0 comments

    Lately, I've discussed with @★ STMAN ★who is very knowldgeable about computer security and I'm starting to see old problems with a fresh, new and promising angle. Cue opening theme.

    The trivial.

    Code is not data and vice versa. FC1 is a Harvard system that does not allow self-modifying code. You can't execute data, so there is no risk from the stack, for example.

    Each globule might contain enough space to store registers for 2 or 4 different threads so context switching is pretty immediate. Upon context switching, you can choose to completely switch the context or to keep the current register set (good if you need to provide informations). An instruction provides a mask to clear/wipe any register containing unneeded information for the callee.

    The bad news.

    Forget about Linux, forget about POSIX. FC1 is aiming at speed and security, which Linux can't garantee. In turn, this frees us from the tyrany of "having to port GCC".

    My favorite approach borrows ideas from microkernels. Thank you Hurd people :-) But STMAN convinced me to go even further. Actually, the most critical code (which handles vital parts) should be as short as possible (for speed AND security audit AND non-modifiability in a small ROM). To this end, hardware must support fast and safe support for all the necessary primitives. This is implemented, for example, with specific instructions with inherent checks against user-unaccessible task properties.

    Even worse news

    You'll hate me, I know, but I swear it's for your own good ;-)

    I'll enforce an idea I have tinkered with while thinking about the YASEP: the executable code size of a thread is limited to 64KB.

    This does not apply to other things, like data sizes. But honestly, when your executable code reaches 64KB (or 16K instructions), you're definitely not doing a trivial program. The chances of having a bug is not nil. An exploit can hide.

    The F-CPU forces you to compartiment your functions and make your application modular. The modules can communicate faster than with a classic architecture so there is no excuse to not enforce safe coding practices.

    For example, I borrow other ideas from the YASEP: the InterProcess Call/Entry/Return trinity of instructions lets you call any publicly accessible code entry point safely, for the caller and the callee. With almost no penalty compared to a classicl call/return, and much less risks. Use it, though you won't be able to abuse it ;-)

    Transfering data

    An old riddle might get finally solved...

    How do you transmit data from one thread to another ? Safely yet fast ? I've been researching about this for 14 years now. STMAN offered a clue : avoid aliasing by design. It sounds weird at first but this actually solves many things, since aliasing creates more problems than it solves. Usually, aliasing lets us upload executable code or share a data segment between two or more threads. But if these two cases are solved, there is no need to maintain a risky "feature".

    Loading executable code in the instruction space can be achieved with special instructions that can only run in a restricted more. Data sharing is a more complex case but might be solved at last.

  • Overview of the FC1 (F-cpu Core #1)

    Yann Guidon / YGDES10/17/2016 at 12:05 0 comments

    More than 15 years have passed since FC0 was drafted. It's a very nice and venerable architecture but time has shown some of its practical limitations. The proposed FC1 addresses most of them, thanks to the experience gained since. The #AMBAP: A Modest Bitslice Architecture Proposal has been the most influential inspiration lately but it's just the melon on the already huge cake of my design explorations. Some things remain the same as in FC0, some things have changed and some have been radically altered.

    What remains the same

    I keep everything that makes sense and is characteristic of the project.

    • F-CPU is a 64-bits design (well, mostly) that can scale up to arbitrary widths (ARM recently jumped in the bandwagon)
    • Instructions are (almost typical) RISC and 32-bits wide, with 2 register operands and 1 destination register.
    • There are 64 registers
    • It's aimed at performance for general computation tasks and applications.

    Some details have changed

    • No need of a cleared register. Register #0 will not be hardwired to 0.
    • Instead, the Instruction Pointer and Next Instruction Pointer will be more useful.
    • A 32-bits subset will exist, to bootstrap the design and allow smaller implementations for embedded purposes.
    • The register set is split into 4 "globules", one pair for scalar operations and memory, the other set for wide SIMD operations (the SIMD set can be implemented as scalar but will trap on SIMD instructions)

    What departs from the FC0

    I simply dropped the load/store instructions altogether. Since I have solved some compiler issues with the YASEP, I can now confidently use the same techniques with F-CPU.

    Each scalar globule have half of their registers dedicated to memory access, with the same register-mapped memory principles. With 4 A/D register pairs per globule, there can be two dual-ported SRAM blocks (or cache) per globule.

    This changes everything. With 8 pairs of data/address registers, the instantaneous bandwidth is not comparable with standard CPU cores of this class (scalar in-order). The split "globule" architecture means that 4 data can be read and 2 data written. Simultaneously. With only two instructions in one clock cycle. This greatly compensates for the lower number of registers.


    This new structure solves many issues I had with the FC0.

    First, there was this huge, slow crossbar... Now it's gone and the most usual instructions save one cycle (ADD/SUB/ROP2 have no latency, though the SHL and MUL units are shared and might add some latency).

    Then there was the memory system. FC0's was complex and very architecture-dependant.

    Also that large register set with 3R2W and out-of-order-completion scheduling : gone.

    The FC1 can scale up (as FC0) but also down (32 bits), not just in data width but also in IPC : it's easy to design a decoder for 1 or 2 instructions per cycle, as well as more when the SIMD globules are implemented.

    A smaller register set (per computation unit) means that it's possible to implement a larger physical file, either for improved cross-globule latency, or for implementing "multithreading" (barrel CPU).

    I believe the FC1 will be faster and more efficient, as well as easier to design.

    There is a major difference with YASEP though. It is not practical to perform address register post-updates in the same instruction because

    1. There is not enough room for update bits in the instruction (18 bits are already used for the register addresses, 2 more for the size bits, leaving only 12 bits for opcode and fancy stuff)
    2. The globule can only compute one operation at a time, but pointer update should be computed in parallel for lower latency.

    The YASEP has special hardware to update the pointers but the F-CPU is too general for this. The solution is in the organisation and allocation of the registers, with cross-linked addresses and data:

    • Globule A holds registers A0-3 and D4-7
    • Globule B holds registers A4-7 and D0-3

    This allows pairs of instructions to execute in a single cycle, the instruction reads the D register(s) and the following one...

    Read more »

View all 6 project logs

Enjoy this project?



Julian wrote 08/25/2018 at 22:34 point

Wow.  I remember this project from back in the day. I always thought you guys were crazy with your 6 gates per pipeline stage rule. I presume you've relaxed that a little for this version? :)

  Are you sure? yes | no

Yann Guidon / YGDES wrote 08/25/2018 at 22:50 point

hi Julian :-)

crazy, but not that much, when you consider today's hyperpipelines, some CPUs have 20 stages easily...

Anyway it's a strong design guide, because P&R will always find ways to break the rules.

I have also evolved and enhanced many aspects of my design rules. I'm more experienced and have come up with more and better rules.

Still, FC0 was not /that/ crazy. We were just a bit too young and poor :-P

Right now I polish all my skills on the #YGREC8 which inspires many things for FC1.

  Are you sure? yes | no

Yann Guidon / YGDES wrote 06/02/2018 at 10:21 point

Bienvenue @hlide !

  Are you sure? yes | no

hlide wrote 06/01/2018 at 08:14 point

Ca tombe bien, j'ai un MiSTer (basé sur Terrassic DE10-nano). Je me disais où je pouvais trouver les sources de F-CPU64 pour faire joujou avec (plus dans une version 16 et 32). Un coup de pouce ?

  Are you sure? yes | no

f4hdk wrote 06/01/2018 at 17:29 point

Oh, the F-CPU is not dead!!!

Does someone know where is the source code of the F-CPU?

Has someone made a full implementation of it on a FPGA board?

@hlide : have you begun your CPU project?

  Are you sure? yes | no

hlide wrote 06/02/2018 at 09:05 point

Hi f4hdk!
My CPU project? which one? I'm mainly collecting some "ordinosaures" (SHARP MZ series) coupled with Arduinos and acquired a MiSTer (Terassic DE10-nano) to start some FPGA coding. I may try to integrate a F-CPU core on MiSTer if I could have the most recent and working source of it.

  Are you sure? yes | no

f4hdk wrote 06/02/2018 at 11:15 point

@hlide  : You had a CPU project in january 2017, when we exchanged by e-mail together. A strange CPU with 2 instruction set, and FIFOs instead of registers. You had bought 2 FPGA cards especially for your project.

@Yann Guidon / YGDES : I'm sure you still have the source code (Verilog or VHDL) of the F-CPU. Could you please share it with us? I'm interested too.

  Are you sure? yes | no

hlide wrote 06/02/2018 at 11:57 point

Oh I see. Well, I have now 4 FPGA: the first 3 FPGA are based on Xillinx and are so badly documented with no interesting examples so they are mostly parked. The last one is MiSTer (Terrassic DE10-nano) with full examples (Arcarde and Retrocomputer cores) that are more helpful. So I guess I could retry that old project but it is not the one I would prioritized now.  

  Are you sure? yes | no

Yann Guidon / YGDES wrote 06/02/2018 at 11:59 point

@f4hdk see my comment below -_-

  Are you sure? yes | no

f4hdk wrote 06/02/2018 at 14:45 point

@Yann Guidon / YGDES : This was not my question. I know that no version of F-CPU has been finished. But my question remains : could you please provide the sources of F-CPU, as is, even if it is not finished?

The F-CPU project was supposed to be open source.

  Are you sure? yes | no

Yann Guidon / YGDES wrote 06/02/2018 at 16:36 point

@f4hdk I'll see if I can find old files... Though I have to remind you that the latest developments were 15 years ago. I forget if I have put anything relevant on during the reboot.

  Are you sure? yes | no

hlide wrote 06/02/2018 at 10:07 point

Et par F-CPU64, je pensais bien sûr au core #0, pas le core #1 qui doit être qu'en état d'étude. Le truc, c'est que j'ai perdu les sources depuis des années et que maintenant, c'est chaud pour retrouver ça - vu qu'il n'y avait pas GitHub à l'époque.

  Are you sure? yes | no

Yann Guidon / YGDES wrote 06/02/2018 at 10:21 point

FC0 n'a jamais pu être avancé suffisamment, on avait des morceaux de code et rien qui tienne ensemble, puisque ça partait dans tous les sens...

J'espère que ça ira mieux un jour, c'est pour ça que je reprends tout depuis la base de la base :-P

  Are you sure? yes | no

llo wrote 12/13/2015 at 21:15 point

a very exciting news :)

  Are you sure? yes | no

Yann Guidon / YGDES wrote 12/13/2015 at 21:17 point

it was only a matter of time

  Are you sure? yes | no

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates