
YGREC32

Because F-CPU, YASEP and YGREC8 are not enough, and I want to get started with the FC1 but it is still too ambitious.

The YGREC32 is an experimental superscalar shallow-pipeline RISC microprocessor core, meant to issue up to 3 instructions per cycle thanks to the split register set. Instructions have a fixed width (32 bits) and each one can only be decoded by its one corresponding unit, though this is not LIW.

This is a successor of the YASEP toward superscalar execution, or a scaled-down version of the F-CPU FC1, so it's smaller but borrows many of the design features without their creep.

Both data and instruction memories are decoupled from the core: each request is managed through a 3-bit "handle" (Target ID or Address register number), that can be tied to a register or equivalent. This vastly reduces complexity and latency by allowing several memory accesses to be interleaved, without requiring OOO or speculative execution.

PRELIMINARY

May 8th, 2024: It only got started.

version 2024-07-03

Class for comparison : i586 (no MMX), i960/i860, MIPS R3000 or R5000, SPARC V7/8 or LEON, RISC-V, ARM Cortex-R...

  • Type : embedded safe/secure, 32-bit application processor for medium performance despite slow memory
  • 32 bits per instruction
  • 32-bit wide data words
  • 2 "globules" with 1 ALU, 16 registers and a dual-ported cache block each.
  • register mapped memory : 4 address registers and 4 data registers per glob.
  • CDI model: separate addressing spaces for Control (stack), Data and Instructions.
  • 24 bits for instruction pointers, that's 16M instructions or 64MB of code per module. 2^24 modules are possible simultaneously.
  • 2 very short parallel pipelines for data processing and a 3rd decoder for control/branch/stack instructions: ILP can go up to 3.
  • 8-entry "explicit Branch Target Buffer" with 8 more hidden/cached entries as well as 8 pairs of entries dedicated/linked to the control stack.
  • Multitasking suitable for Real Time/Industrial OS, light desktop or game console workloads.
  • Heavy computations are offloaded to suitable coprocessors.
  • Powerful tagged control stack
  • Some high-level single-cycle opcodes (and combinations thereof) provide basic control structures.
  • Resilient, safe and secure by design
  • Need 64 bits, more registers or SIMD ? Use its big brother the #F-CPU FC1 (tba)
  • Overkill ? Use a microcontroller like the #YASEP Yet Another Small Embedded Processor (16/32 bits) or even the #YGREC8 (8 bits).
  • Spoiler alert : it is not designed with Linux-friendliness in mind.
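
As a quick check of the code addressing figures in the list above, here is the arithmetic in Python. It only assumes what the list states (24-bit instruction pointers, 4-byte instructions); the variable names are mine.

```python
# Back-of-the-envelope check of the code addressing limits.
PC_BITS = 24            # width of the instruction pointer
INSTRUCTION_BYTES = 4   # 32-bit fixed-width instructions

instructions_per_module = 1 << PC_BITS
code_bytes_per_module = instructions_per_module * INSTRUCTION_BYTES

print(instructions_per_module)           # 16777216 instructions (16M)
print(code_bytes_per_module // 2**20)    # 64 (MiB of code per module)
```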

Rationale:

For now I'm only collecting and polishing the ideas. Several years ago I considered a streamlined YASEP with only 32-bit instructions but it would have broken too many things. The YASEP (either 16 or 32 bits) resides at a particular sweet spot but can't move significantly outside of it. OTOH a 32-bit mode for F-CPU would have been interesting but still too ambitious : F-CPU is a huge system, so even implementing a simpler subset implies having the whole thing already well figured out.

So YGREC32 is not really a cleaned-up YASEP. The use of a dedicated control stack does not fit well in the YASEP which will remain a "microcontroller". The YGREC32 is an application processor for multitasking environments that will run user-supplied code, even potentially faulty. It is still suitable for real time but not heavy lifting. It could be simultaneous-multithreaded for even better efficiency. Yet YGREC32 binaries would be easily executed by FC1 with little adaptation since it's mostly a subset, with half the globules and smaller words. Upwards compatibility/translation of YASEP32 is also possible.

A redesigned, pure 32-bit processor is a clean slate where I can develop and experiment with several methods such as #A stack. It becomes the first architecture to explicitly implement and develop the #POSEVEN model. It will be a shaky ride but hopefully it will help further our goals.


Logs:
1. First sketch (and discussion)
2. Second sketch
3. eBTB v2.0
4. a first basic module.
.
.
.

......

  • a first basic module.

    Yann Guidon / YGDES, 07/03/2024 at 10:32, 0 comments

    Here is a first attempt at writing a module that just prints "Hello world!".

    It shows the basic structure of a module as well as the IPC/IPE/IPR instructions that allow modules to call each other.

    ; HelloWorld.asm
    
    Section Trampoline
    
    ; init:
    Entry 0
      IPE
        ; nothing to initialise
      IPR
    
    ; run:
    Entry 1
      IPE
      ; no check of who's calling.
      jmp main
    
    Section Module
    
    main:
    ; get the number/ID of the module that writes
    ; on the console, by calling service "ModuleLookup"
    ; from module #1 
      set 1 R1
      set mod_str R2
      IPC ModuleLookup R1
    ; result:  module ID in R1
    
    ; write a string of characters on the console
    ; by calling the service "WriteString"
    ; in the module designated by R1
      set mesg_str R2
      IPC WriteString R1
    
      IPR ; end of program.
    
    Section PublicConstants
      mod_str:  as8 "console"
      mesg_str: as8 "Hello world!"

    I'm a bit concerned that the IPC instruction has a rather long latency but... Prefetching would make it even more complicated, I suppose.
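
    The call flow of the listing above can be mimicked with a toy model in Python. This is only a sketch of my reading of the listing: the registry dictionary, the name "system" for module #1, the ID 2 for the console module and the Python-level calling convention are all assumptions, and nothing here reflects the real IPC/IPE/IPR encoding or latency.

```python
# Toy model of inter-module calls. The registry, the "system" name, the
# console module's ID (2) and the calling convention are all assumptions.

class Module:
    def __init__(self, name, services):
        self.name = name
        self.services = services      # service name -> callable

modules = {}                          # module ID -> Module
console_output = []                   # stands in for the console

def ipc(service, module_id, arg):
    """Model of IPC: enter `service` of module `module_id`, return its result.
    Entering stands for IPE and returning stands for IPR."""
    return modules[module_id].services[service](arg)

def module_lookup(name):
    # Service of module #1: find the ID of the module called `name`.
    for module_id, module in modules.items():
        if module.name == name:
            return module_id
    raise LookupError(name)

modules[1] = Module("system", {"ModuleLookup": module_lookup})
modules[2] = Module("console", {"WriteString": console_output.append})

def main():
    r1 = ipc("ModuleLookup", 1, "console")   # result: module ID in R1
    ipc("WriteString", r1, "Hello world!")

main()
print(console_output)    # ['Hello world!']
```

    Each `ipc()` call is a full round trip through another module, which is where the latency concern comes from: the second call cannot start before the first one has returned the module ID.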

  • eBTB v2.0

    Yann Guidon / YGDES, 07/03/2024 at 01:13, 15 comments

    The last log introduced a modified, explicit Branch Target Buffer. It has evolved again since then and is taking a new life of its own...

    As already explained, it totally changes the ISA, the compiler strategy, the coding practices, almost everything, but it is able to scale up in operating frequency, or with CPU/Memory mismatch, provided the external memory works with pipelined transactions. The new ISA exposes some of this and lets the compiler handle parts of the overlapping of prefetches.

    So the jump system works with two phases :

    1. The program emits a prefetch operation: this associates a jump address to a fixed ID. The visible ID is 3 bits, so there can be up to 8 simultaneously ongoing prefetch operations. For each ID, there is one 8-instruction cache line that is stored close to the instruction decoder, for immediate access. It usually takes several cycles to fetch the desired line and store it in the BTB.
    2. The actual "jump" instruction selects one of these 8 lines (with the 3-bit ID in the instruction) and checks the (possibly complex) condition. If the condition is verified, the selected line is chosen as source of instructions for the next cycle, and the PC starts to prefetch the next line(s) in its double buffer.

    You can see this system as exposing a register set dedicated to instruction addresses. Each register is a target, so let's call them T0 to T7. You can only perform an overwrite of these registers with a prefetch instruction, or a "loopentry" instruction for example : there is no fancy way to derail the CPU's operation.

    Each line has one of 4 states:

    1. absent/invalid
    2. fetching/checking
    3. failure/error/fault
    4. valid
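
    The two phases and the four line states above can be sketched as a small Python model. It is a simplification under stated assumptions: the fixed 3-cycle latency, the `bad_addresses` set used to provoke a fault, and the "trap" outcome for a jump to an absent or faulty line are mine, not part of the design.

```python
from enum import Enum

class LineState(Enum):
    ABSENT = "absent/invalid"
    FETCHING = "fetching/checking"
    FAULT = "failure/error/fault"
    VALID = "valid"

class EBTB:
    """Toy model of the 8 visible target lines T0..T7."""
    LINES = 8            # 3-bit ID
    FETCH_LATENCY = 3    # assumed latency in cycles, for illustration only

    def __init__(self, bad_addresses=()):
        self.bad_addresses = set(bad_addresses)   # addresses that will fault
        self.state = [LineState.ABSENT] * self.LINES
        self.address = [None] * self.LINES
        self.countdown = [0] * self.LINES

    def prefetch(self, target, address):
        # Phase 1: bind an address to a target ID and start fetching its line.
        self.address[target] = address
        self.state[target] = LineState.FETCHING
        self.countdown[target] = self.FETCH_LATENCY

    def tick(self):
        # One clock cycle: ongoing fetches make progress.
        for t in range(self.LINES):
            if self.state[t] is LineState.FETCHING:
                self.countdown[t] -= 1
                if self.countdown[t] == 0:
                    faulty = self.address[t] in self.bad_addresses
                    self.state[t] = LineState.FAULT if faulty else LineState.VALID

    def jump(self, target, condition=True):
        # Phase 2: select a line; it becomes the instruction source if ready.
        if not condition:
            return "fall through"
        if self.state[target] is LineState.VALID:
            return "taken"
        if self.state[target] is LineState.FETCHING:
            return "stall"
        return "trap"    # assumption: absent or faulty lines raise a trap

btb = EBTB()
btb.prefetch(3, 0x001234)
print(btb.jump(3))       # stall: the line is still being fetched
for _ in range(EBTB.FETCH_LATENCY):
    btb.tick()
print(btb.jump(3))       # taken
```

    The point of the split is visible in the usage: the earlier the prefetch is issued before the jump, the fewer stall cycles the jump costs.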

    This system mirrors the already studied mechanism for accessing data memory, but the BTB is much more constrained : Each data fetch address is explicitly computed and stored into a dedicated "Address register" (in Y32: A0 to A7), which triggers the TLB and cache checks, such that "a number of cycles later" the corresponding data is available in a coupled/associated "data" register (in Y32: D0 to D7). The #YASEP has used this split scheme for 2 decades now.

    And now, a similar decoupling scheme is also provided for instruction addresses in the Y32.

    So we get 8 visible registers : they must be saved and restored through all kinds of situations such as function calls, IPC or traps... yet there is no way to read them back to the general register file. The only way to get them out and back in is with the control stack, which can save 2 entries in one cycle.

    So the pair T0-T1 can be saved and restored in one cycle, for example : it seems easier to pair them as neighbours. A whole set of targets can be saved in 4 instructions, 4 cycles. It still consumes a half-line of cache...

    All the pairs of targets can be freely allocated by the compiler, with some hints and heuristics: the short-lived and immediate targets would be in the high numbers (T4-T7) while targets for later would be in the lower addresses (T0-T3). This makes it easier to save the context across the functions, as the buffer would be defragmented and there are fewer chances to save a register that doesn't need backup.
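
    To illustrate why grouping long-lived targets in neighbouring pairs matters, here is a minimal Python sketch of pairwise spilling. The data layout (a pair index stored with each stack entry) and the helper names are hypothetical; only the "two targets per stack entry, one entry per cycle" rule comes from the text.

```python
# Toy model: 8 target registers, saved to the control stack in pairs.
targets = [None] * 8            # T0..T7, hold jump addresses
control_stack = []              # each entry holds one pair of targets

def spill(pair):
    # Save T(2*pair) and T(2*pair+1) in one step (one cycle in the text).
    control_stack.append((pair, targets[2 * pair], targets[2 * pair + 1]))

def unspill():
    # Restore the most recently saved pair.
    pair, even, odd = control_stack.pop()
    targets[2 * pair], targets[2 * pair + 1] = even, odd

def spill_live_pairs(live_targets):
    """Spill only the pairs that contain at least one live target.
    Returns the number of spill instructions (cycles) used."""
    pairs = sorted({t // 2 for t in live_targets})
    for pair in pairs:
        spill(pair)
    return len(pairs)

print(spill_live_pairs({0, 1}))       # 1: two live targets packed in one pair
print(spill_live_pairs({0, 2, 5}))    # 3: scattered targets cost more spills
print(spill_live_pairs(range(8)))     # 4: the whole set
```

    Packing the targets that must survive a call into T0-T1 costs a single spill, while the same number of live targets scattered over several pairs costs one spill per pair touched.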

    ...

    Another thing I didn't mention enough is that the control stack (its neighbour) also has hidden entries in the eBTB. Let's say 8 more lines (or even pairs for the double-buffering) which are allocated on a rolling basis. The stack manages the allocation and prefetch itself. Usually, these hidden lines are a copy of the current instruction buffer and its double/prefetch buffer, so there is only need of prefetch when the stack is emptying.

    ...

    But the v2.0 goes beyond that: it manages the call/return and others, without the need for explicit saving/restoring by discrete instructions. These spill/unspill instructions are still possible but they take precious space (hence time).

    What call and IPC now do is accept a 2-bit immediate (on top of the 3-bit target, the condition flags, the base address...) to say how many pairs of targets to push/pop...


  • Second sketch

    Yann Guidon / YGDES, 06/26/2024 at 14:49, 0 comments

    I have the drawing somewhere but it's not very different from the first one. There are two significant changes though.

    The first change is the "instruction buffer", ex-Instruction L0, now called "explicit branch target buffer" (eBTB). It's a big change that also affects the instructions so I'll cover it later in detail.

    The second change is the instruction decoder that is now meant to decode 3 instructions per cycle, instead of two. It uses an additional third slot in parallel with the 2 other pipelines (each inside one glob) to process the control stack, the jumps, the calls, the returns, and other instructions of the sort.

    I couldn't... I just couldn't bring myself to choose which pipeline had to be sacrificed or modified to handle the jumps. And the nature of the operation is so different that it makes no sense to mix the control flow logic with the computations.

    Ideally, the instructions are fed in this order : Glob1-Glob2-Stack. Any "slot" can be absent, so the realignment is going to be a bit tricky, and it increases the necessary bandwidth for the memories (FSB and caches). But the ILP can reach 3 in some cases, and is better than if I mixed the stack decoder with the Glob2 decoder, as this would create slot allocation contentions.

    So the overall effect is a reasonable increase in ILP, the glob decoders remain simple, a special dedicated datapath can be carved out for the stack's operation and the "control" slot only communicates with the globs through status bits.
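
    The slot ordering described above can be illustrated with a small Python sketch that groups an instruction stream into issue cycles. It is deliberately naive: it assumes each instruction is already tagged with its slot (the tags "G1", "G2" and "S" are mine) and it ignores data dependencies, stalls and fetch alignment.

```python
SLOT_ORDER = {"G1": 0, "G2": 1, "S": 2}   # Glob1, Glob2, Stack/control

def issue_groups(stream):
    """Split a list of slot tags into groups issued in the same cycle.
    A group keeps the Glob1-Glob2-Stack order; any slot may be missing."""
    groups = []
    current = []
    for slot in stream:
        # A slot that does not come later in the order starts a new group.
        if current and SLOT_ORDER[slot] <= SLOT_ORDER[current[-1]]:
            groups.append(current)
            current = []
        current.append(slot)
    if current:
        groups.append(current)
    return groups

stream = ["G1", "G2", "S", "G1", "S", "G2", "G2", "G1", "G2"]
groups = issue_groups(stream)
print(groups)
# [['G1', 'G2', 'S'], ['G1', 'S'], ['G2'], ['G2'], ['G1', 'G2']]
print(len(stream) / len(groups))          # 1.8 instructions per cycle
```

    Only the first group reaches the peak of 3; out-of-order slots such as the two consecutive "G2" instructions each cost a cycle, which is the kind of case the compiler would try to avoid.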

    ...

    The other change, as alluded to before, is with the instruction buffers that cache the L1, which are now explicitly addressed (at least at a first level). Most jumps work with a prefetch instruction that targets one of the lines, then a "switch" instruction that selects one of the lines for execution (and may stall if the line is not available yet).

    I have allocated 3 bits for the line ID. That makes 8 lines for prefetching upcoming instructions for a branch target. More lines (4?) are also dedicated-linked to the control stack.

    Another line is the PC, which is actually implemented as a pair of lines for double-buffering.

    A direct Jump instruction acts as a direct prefetch to the double buffer, but it will stall the pipelines.

    The targets for function calls, loops, switch/case, IF, and others are easy to statically schedule a few cycles in advance most of the time, leading to an efficient parallelism of all the units. Then a variety of branch instructions will conditionally select a different eBTB line ID, which gets virtually copied into Line #0.

    So the eBTB has 2+8+4 lines working as L0 with specific ties to other units. More are possible using indexed/relative addressing of the eBTB.

    This scheme mimics what is already happening with the data accesses : the core is decoupled from the memory system and communicates through registers. Instead, here, there is no register even though the lines virtually count as 8 address registers. But you can't read them back or alter them.

    • A given target line can only be set with a prefetch instruction.
    • A target line's address can be saved on the control stack (SPILL) and restored (UNSPILL), under program's control.
    • Target lines are invalidated/hidden/forbidden across IPC/IPE/IPR because the change of the module makes the address irrelevant. However the addresses remain in cache and the stack keeps some extra metadata to keep the return smooth.

    So the Y32 behaves as a 3-pipeline core, with one register set per pipeline, where the 2 short data pipelines (read-decode / ALU / writeback) can communicate with each other and memory, but the contents of the control/jump pipeline can't be read back or moved outside of its unit, to preserve safety/security/speed. The only way to steer execution of the control unit is:

    • with explicit instructions containing immediate data,
    • or by reading flags and status bits from the other pipelines.

    The recent reduction of the PC width (now 24 bits) changes a lot of things compared to a classical/canonical...


  • First sketch (and discussion)

    Yann Guidon / YGDES, 05/10/2024 at 02:34, 0 comments

    That's about it, you have the rough layout of the YGREC32:

    Sorry for the quick&dirty look, I'll take time later to redraw it better.

    It's certainly more complex than the YASEP32 and it looks like a simpler but fatter F-CPU FC0.

    There are a number of differentiating features, some inherited from FC0 (never change a winning team) and some quite recent, in particular the "globule" approach : a globule is a partial register set associated with a tightly coupled ALU and a data cache block. Copy&paste and you get 32 registers that work fast in parallel, which FC0 couldn't do.

    YGREC32 has 2 globules, FC1 will have 4, but a more classic and superpipelined version can be chosen instead of the 2-way superscalar.

    Scheduling and hazards are not hard to detect and manage, much simpler than FC0, with a simpler scoreboard and no crossbar. Yay! Most instructions are single-cycle, the more complex ones are either decoupled from the globules (in a separate, multi-cycle unit), or they pair the globules.

    A superpipelined version with a unified register set would simplify some aspects and make others harder. Why choose a superscalar configuration, and not a single-issue superpipeline ? In 2-way configuration, a sort of per-pipeline FIFO can decouple execution (for a few cycles) so one globule could continue working while another is held idle by a cache fault (as long as data dependencies are respected). This is much less intrusive to implement than SMT though the advantages are moderate as well. But hey, it's a low hanging fruit so why waste it? And doing this with a single-issue pipeline would almost turn into OOO.

    Each globule has a 3R1W register set that is small, fast and coupled to the ALU. More complex operations with more than 1 write operation per cycle (such as barrel shifting, multiplies and divisions) are implemented by coupling both globules.

    2 read ports of each register set go directly to the ALU and the memory buffers (L0) that cache the respective data cache blocks. The 3rd read port is used for exchange between the twin globules, and/or accessing the stack.

    The pipeline is meant to be short so no branch predictor is needed. Meanwhile, the hardware stack also works as a branch target buffer that locks the instruction cache, so the return addresses or loopback addresses (for example) are immediately available, no guessing or training.
    I didn't draw it correctly but the tag fields of the instruction cache and data caches are directly linked to the stack unit, to lock certain lines that are expected to be accessed again soon.

    The native data format is int32. FP32 could be added later. Byte and half-word access is handled by a dedicated/duplicated I/E unit (insert-extract, not shown).

    Aliasing of addresses between the 2 globules should be resolved at the cache line level, by checking every write to an address register against the address registers of the other globule. I am considering adding a 3rd data cache block with one port for each globule to reduce aliasing costs, for example to handle the data stack.
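
    The check described above could look like the following Python sketch. The 32-byte line size is an assumption (the text gives none for the data caches), and the function names are hypothetical; the hardware would do the comparison in parallel on the upper address bits rather than in a loop.

```python
LINE_BYTES = 32   # assumed cache line size; the text does not give one

def cache_line(address):
    # Two addresses alias at the line level when their line numbers match.
    return address // LINE_BYTES

def aliases(new_address, other_globule_addresses):
    """Return the indices of the other globule's address registers that
    point into the same cache line as `new_address`."""
    line = cache_line(new_address)
    return [i for i, a in enumerate(other_globule_addresses)
            if a is not None and cache_line(a) == line]

# The other globule's 4 address registers (None = not in use).
other = [0x1000, 0x2010, None, 0x3000]
print(aliases(0x2004, other))   # [1]: same line as 0x2010
print(aliases(0x4000, other))   # []: no conflict
```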

    -o-O-0-O-o-

    I posted the announcement on Facebook:

    Yes I haven't finished all the detailed design of Y8 but what's left is only technical implementation, not logical/conceptual design: I can already write ASM programs for Y8!

    OTOH the real goal is to define and design a 64-bit 64-register superscalar processor in the spirit of F-CPU. But that's so ambitious...
    F-CPU development switched to a different generation called FC1 which uses some of the techniques developed in the context of Y8 (which is a totally scaled down processor barely suitable for running Snake). Between the two, a huuuge architectural gap. But I need to define many high-level aspects of FC1.
    So there, behold a bastardised FC1 with only 32 bits per register, only 2 globules and 32 registers, but without all the bells and whistles that F-CPU has promised for more than a quarter of a century!
     

    Duane Sand shared interesting comments:

    It looks...

