One design goal of the F-CPU is to increase efficiency with smart architecture features, and with the FC1 I would like to get closer to OOO performance with in-order cores.
You can only go so far with all the techniques already deployed and explored before : superscalar, superpipeline, register-mapped memory, ultra-short branch latency... Modern OOO processors go beyond that and can execute tens or hundreds of instructions while still completing old ones.
FC1 can't get that far but can at least go a bit in this direction. Typically, the big reordering window is required for 1) compensate for false branches 2) wait for external memory.
1) is already "taken care of" with by the design rule of having the shortest pipeline possible. A few cycles here and there won't kill the CPU and the repeated branches have almost no cost because branch targets are flagged.
2) is a different beast : external memory can be SLOW. Like, very slow. Prefetch (like branch target buffering) helps a bit, but only so much...
So here is the fun part : why wait for the slow memory and block all the core ?
Prefetch is good and it is possible to know when data are ready, but let's go further: don't be afraid anymore to stall a pipeline... because we have 3 others that work in parallel :-)
I was wondering earlier about "microthreads", how one could execute several threads of code simultaneously, without SMT, to emulate OOO. I had seen related experimental works in the last decade(s) but they seemed too exotic. And I want to avoid the complexity of OOO.
The method I explore now is to "decouple" the globules. Imagine that each globule has a FIFO of instructions, so they could proceed while one globule is stalled. Synchronisation is kept simple with
- access to SR
- jumps (?)
- writes to other globules
The last item is the interesting one : the last log moved the inter-globule communication from the read to the write part of the pipeline. The decoder can know in advance when data must cross a globule boundary and block the recipient(s). This works more or less as implied semaphores, with a simple scoreboard (or just four 4-bits fields to be compared, one blocking register per globule).
I should sketch all that...
I vaguely remember that in the late 80s, one of the many RISC experimental contenders was a superscalar 2-issues pipeline where one pipeline would process integer operations and the other would just do memory read/writes. They could work independently under some constraints. I found it odd and I have never seen it mentioned since, so the name now escapes me...
Decoupling the globules creates a new problem and goes against one of the golden rules of F-CPU scheduling : don't "emit" an instruction that could trap in the middle of the pipeline. It creates so many problems and requires even more bookkeeping...
Invalid memory accesses could simply silently fail and raise a flag (or something). A memory barrier instruction does the trick as well (à la Alpha).
Anyway, decoupling is a whole can of worm and would appear in FC1.5 (maybe).