Details

PRELIMINARY

May 8th, 2024: It has only just started.

Class for comparison : i486, i586 (no MMX), i960/i860, MIPS R3000 or R5000, SPARC V7/8 or LEON, RISC-V, ARM Cortex-R...

Type : embedded safe/secure, 32-bit application processor for medium performance
32 bits per instruction
32-bit wide words
2 "globules", each with 1 ALU, 16 registers and a cache block.
register mapped memory
CDI model: separate addressing spaces for Control (stack), Data and Instructions.
28 bits for instruction pointers, 2^27 instructions/512MiB of code per module.
Flexibility: Implemented as 1 long/fast pipeline or 2 shorter/slower parallel pipelines.
Multitasking suitable for Real Time/Industrial OS, light desktop or game console workloads.
Heavy computations are offloaded to suitable coprocessors.
Tagged control stack
Some high-level single-cycle opcodes provide basic control structures.
Resilient, safe and secure by design
Need 64 bits or SIMD? Use its big brother, the #F-CPU FC1 (TBA).
Overkill? Use a microcontroller like the #YASEP Yet Another Small Embedded Processor (16/32 bits) or even the #YGREC8 (8 bits).
Spoiler alert : it is not designed with Linux-friendliness in mind.
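
As a quick sanity check of the code-space figure in the list above, here is the arithmetic as a minimal Python sketch. It only uses numbers stated in the list (32 bits per instruction, 2^27 instructions per module). How the 28-bit instruction pointer relates to the 2^27 count is not explained here, so the sketch makes no guess about it.

```python
# Figures taken from the feature list; nothing else is assumed.
INSTRUCTION_BITS = 32             # every instruction is 32 bits wide
INSTRUCTIONS_PER_MODULE = 2**27   # instructions addressable per module

bytes_per_instruction = INSTRUCTION_BITS // 8
code_bytes = INSTRUCTIONS_PER_MODULE * bytes_per_instruction
code_mib = code_bytes // 2**20    # convert bytes to MiB

print(code_mib, "MiB of code per module")  # prints: 512 MiB of code per module
```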

Rationale:

For now I'm only collecting ideas. Several years ago I considered a streamlined YASEP with only 32-bit instructions but it would have broken too many things. The YASEP (either 16 or 32 bits) sits at a particular sweet spot but can't move significantly outside of it. OTOH a 32-bit mode for F-CPU would have been interesting but still too ambitious: F-CPU is a huge system, so even implementing a simpler subset implies having the whole design already well figured out.

So YGREC32 is not really a cleaned-up YASEP. The use of a dedicated control stack does not fit well in the YASEP which will remain a "microcontroller". The YGREC32 is an application processor for multitasking environments that will run user-supplied code. It is still suitable for real time but not heavy lifting. It could be simultaneous-multithreaded for even better efficiency. Yet YGREC32 binaries would be easily executed by FC1 with little adaptation since it's mostly a subset, with half the globules and smaller words.

A redesigned, pure 32-bit processor is a clean slate where I can develop and experiment with several methods such as #A stack. It becomes the first architecture to explicitly implement and develop the #POSEVEN model. It will be a shaky ride but hopefully it will help further our goals.

Logs:
1. First sketch (and discussion)

Project Logs


First sketch (and discussion)
Yann Guidon / YGDES • 05/10/2024 at 02:34 • 0 comments

That's about it, you have the rough layout of the YGREC32:

Sorry for the quick&dirty look, I'll take time later to redraw it better.

It's certainly more complex than the YASEP32 and it looks like a simpler but fatter F-CPU FC0.

There are a number of differentiating features, some inherited from FC0 (never change a winning team) and some quite recent, in particular the "globule" approach: a globule is a partial register set associated with a tightly coupled ALU and a data cache block. Copy and paste it and you get 32 registers that work fast in parallel, which FC0 couldn't do.
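
To make the "copy and paste" idea concrete, here is a minimal Python sketch of two globules forming a 32-register set. The register counts come from the text; the `Globule` class, the `locate` helper and the mapping of a flat register number to a globule (upper part selects the globule) are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field

REGS_PER_GLOBULE = 16  # from the feature list: 16 registers per globule
NUM_GLOBULES = 2       # YGREC32 has 2 globules


@dataclass
class Globule:
    """One partial register set. The coupled ALU and cache block are not modelled."""
    regs: list = field(default_factory=lambda: [0] * REGS_PER_GLOBULE)


def locate(reg):
    """Map a flat register number to (globule, local index).

    ASSUMPTION: the upper part of the number selects the globule.
    The real encoding is not specified in the text.
    """
    if not 0 <= reg < REGS_PER_GLOBULE * NUM_GLOBULES:
        raise ValueError("register number out of range")
    return divmod(reg, REGS_PER_GLOBULE)


globules = [Globule() for _ in range(NUM_GLOBULES)]
total_registers = sum(len(glob.regs) for glob in globules)

g, i = locate(21)               # flat register 21
globules[g].regs[i] = 0xCAFE    # lands in globule 1, local register 5
```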

YGREC32 has 2 globules, FC1 will have 4, but a more classic and superpipelined version can be chosen instead of the 2-way superscalar.

Scheduling and hazards are not hard to detect and manage, much simpler than FC0, with a simpler scoreboard and no crossbar. Yay! Most instructions are single-cycle, the more complex ones are either decoupled from the globules (in a separate, multi-cycle unit), or they pair the globules.

A superpipelined version with a unified register set would simplify some aspects and make others harder. Why choose a superscalar configuration and not a single-issue superpipeline? In the 2-way configuration, a sort of per-pipeline FIFO can decouple execution (for a few cycles) so one globule can continue working while the other is held idle by a cache fault (as long as data dependencies are respected). This is much less intrusive to implement than SMT, though the advantages are moderate as well. But hey, it's low-hanging fruit, so why waste it? And doing this with a single-issue pipeline would almost turn into out-of-order (OOO) execution.
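
The benefit of decoupling can be illustrated with a rough Python model. This is an idealised bound, not a model of the actual pipeline: it assumes an unbounded FIFO per globule and no data dependencies between the two streams, whereas the text only promises decoupling "for a few cycles". The function names and the latency figures are made up for the example.

```python
def lockstep_cycles(lat0, lat1):
    """Both globules advance together: each pair costs its slower half."""
    return sum(max(a, b) for a, b in zip(lat0, lat1))


def decoupled_cycles(lat0, lat1):
    """Idealised decoupling: each globule drains its own queue independently.

    ASSUMPTION: unbounded FIFOs and no cross-globule dependencies,
    so this is an upper bound on what a real FIFO could gain.
    """
    return max(sum(lat0), sum(lat1))


# Latency in cycles of each instruction; 5 stands for a cache fault.
globule0 = [5, 1, 1, 1]
globule1 = [1, 1, 5, 1]

print(lockstep_cycles(globule0, globule1))   # 12
print(decoupled_cycles(globule0, globule1))  # 8
```

With misses alternating between the globules, the lock-step pair pays for both faults one after the other, while the decoupled pair overlaps part of them.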

Each globule has a 3R1W register set that is small, fast and coupled to the ALU. More complex operations with more than 1 write operation per cycle (such as barrel shifting, multiplies and divisions) are implemented by coupling both globules.
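
Here is a minimal Python sketch of why a two-result operation needs both globules: each register set accepts one write per cycle, so a 32x32 multiply with a 64-bit product uses the write port of each globule. The `RegisterSet` class, the `paired_multiply` helper and the placement of the low and high halves (same register index in each globule) are hypothetical; the text does not say where the results go.

```python
MASK32 = 0xFFFFFFFF


class RegisterSet:
    """One globule's register set, limited to a single write per cycle."""

    def __init__(self, size=16):
        self.regs = [0] * size
        self.wrote_this_cycle = False

    def write(self, index, value):
        if self.wrote_this_cycle:
            raise RuntimeError("only one write port per globule per cycle")
        self.regs[index] = value & MASK32
        self.wrote_this_cycle = True

    def next_cycle(self):
        self.wrote_this_cycle = False


def paired_multiply(g0, g1, a, b, dst):
    """Unsigned 32x32 multiply producing a 64-bit result in one cycle.

    ASSUMPTION: low half goes to globule 0, high half to globule 1,
    both at register index dst.
    """
    product = (a & MASK32) * (b & MASK32)
    g0.write(dst, product)         # low 32 bits (masked by write)
    g1.write(dst, product >> 32)   # high 32 bits


g0, g1 = RegisterSet(), RegisterSet()
paired_multiply(g0, g1, 0xFFFFFFFF, 0xFFFFFFFF, dst=3)
print(hex(g1.regs[3]), hex(g0.regs[3]))  # 0xfffffffe 0x1
```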

2 read ports of each register set go directly to the ALU and the memory buffers (L0) that cache the respective data cache blocks. The 3rd read port is used for exchange between the twin globules, and/or accessing the stack.

The pipeline is meant to be short so no branch predictor is needed. Meanwhile, the hardware stack also works as a branch target buffer that locks the instruction cache, so the return addresses or loopback addresses (for example) are immediately available, no guessing or training.
I didn't draw it correctly but the tag fields of the instruction cache and data caches are directly linked to the stack unit, to lock certain lines that are expected to be accessed again soon.
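
The idea of a tagged control stack that also pins instruction cache lines can be sketched in Python as follows. This is a toy model under stated assumptions: the `ControlStack` class, the tag names ("call", "loop"), the line size of 8 instructions and the lock counters are all hypothetical and only illustrate the principle.

```python
class ControlStack:
    """Toy tagged control stack that locks the cache lines of its targets."""

    def __init__(self, line_size=8):   # ASSUMPTION: 8 instructions per line
        self.line_size = line_size
        self.entries = []              # (tag, target address) pairs
        self.lock_count = {}           # cache line -> number of entries using it

    def _line(self, target):
        return target // self.line_size

    def push(self, tag, target):
        self.entries.append((tag, target))
        line = self._line(target)
        self.lock_count[line] = self.lock_count.get(line, 0) + 1

    def pop(self, expected_tag):
        if not self.entries:
            raise IndexError("control stack underflow")
        tag, target = self.entries[-1]
        if tag != expected_tag:
            raise ValueError("tag mismatch: expected %s, found %s" % (expected_tag, tag))
        self.entries.pop()
        line = self._line(target)
        self.lock_count[line] -= 1
        if self.lock_count[line] == 0:
            del self.lock_count[line]  # line may be evicted again
        return target

    def is_locked(self, target):
        return self._line(target) in self.lock_count


stack = ControlStack()
stack.push("call", 0x104)              # return address
stack.push("loop", 0x100)              # loopback address, same cache line
loop_target = stack.pop("loop")        # known immediately, no prediction
```

Because the target is read from the stack rather than guessed, there is nothing to train, and the line stays locked for as long as any entry still refers to it.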

The native data format is int32. FP32 could be added later. Byte and half-word access is handled by a dedicated/duplicated I/E unit (insert-extract, not shown).
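
What an insert-extract unit does can be sketched in Python as follows. This is only a functional illustration: the function names, the little-endian byte numbering and the alignment rule are assumptions, since the text gives no details of the I/E unit.

```python
MASK32 = 0xFFFFFFFF


def _check(offset, size):
    # ASSUMPTION: naturally aligned bytes and half-words only.
    if size not in (1, 2) or offset % size or offset + size > 4:
        raise ValueError("unsupported size or misaligned offset")


def extract(word, offset, size, signed=False):
    """Read a byte (size=1) or half-word (size=2) from a 32-bit word.

    ASSUMPTION: little-endian numbering, offset 0 is the least significant byte.
    """
    _check(offset, size)
    bits = size * 8
    value = (word >> (offset * 8)) & ((1 << bits) - 1)
    if signed and value >> (bits - 1):
        value -= 1 << bits             # sign-extend
    return value


def insert(word, offset, size, value):
    """Return word with the selected byte or half-word replaced by value."""
    _check(offset, size)
    bits = size * 8
    mask = ((1 << bits) - 1) << (offset * 8)
    return (word & ~mask & MASK32) | ((value << (offset * 8)) & mask)


w = 0x11223380
print(hex(extract(w, 0, 1)))           # 0x80
print(extract(w, 0, 1, signed=True))   # -128
print(hex(insert(w, 1, 1, 0xAB)))      # 0x1122ab80
```
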
Aliasing of addresses between the 2 globules should be resolved at the cache line level, by checking every write to an address register against the address registers of the other globule. I am considering adding a 3rd data cache block with one port for each globule to reduce aliasing costs, for example to handle the data stack.
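
The line-level aliasing check described above can be sketched in Python like this. The cache line size of 32 bytes, the function names and the use of `None` for an unused address register are assumptions made for the example.

```python
LINE_BYTES = 32  # ASSUMPTION: the cache line size is not given in the text


def line_of(addr):
    return addr // LINE_BYTES


def aliases(new_addr, other_addr_regs):
    """Indices of the other globule's address registers that point
    into the same cache line as the address being written."""
    return [
        i
        for i, addr in enumerate(other_addr_regs)
        if addr is not None and line_of(addr) == line_of(new_addr)
    ]


# Address registers of the other globule; None marks an unused one.
other = [0x2000, 0x101C, None, 0x1020]
print(aliases(0x1010, other))  # [1]
```

Only register 1 conflicts: 0x1010 and 0x101C fall in the same 32-byte line, while 0x1020 starts the next one.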
-o-O-0-O-o-
I posted the announcement on Facebook:

Yes I haven't finished all the detailed design of Y8 but what's left is only technical implementation, not logical/conceptual design: I can already write ASM programs for Y8!

OTOH the real goal is to define and design a 64-bit 64-register superscalar processor in the spirit of F-CPU. But that's so ambitious...
F-CPU development switched to a different generation called FC1 which uses some of the techniques developed in the context of Y8 (which is a totally scaled down processor barely suitable for running Snake). Between the two, a huuuge architectural gap. But I need to define many high-level aspects of FC1.
So there, behold a bastardised FC1 with only 32 bits per register, only 2 globules and 32 registers, but without all the bells and whistles that F-CPU has promised for more than a quarter of a century!

Duane Sand shared interesting comments:

It looks...
