F-CPU as a decent vector processor

Re-inventing-the-wheel-warning ! But in the middle of the decades-old tricks, some new ones could prove fruitful.

To celebrate the 22nd anniversary of the project, I bring new life and a new perspective to vector processing, one that fully exploits the superscalar architecture that has evolved over the last few years.

To be fair, parts of these considerations are inspired by another similar project : https://libre-soc.org/ is currently trying to tape-out a RISC processor capable of GPU functions, with a CDC-inspired OOO core that executes POWER instructions. Not only that, the project is also trying to add vector computation to the POWER ISA, and this is now completely weird. See https://libre-soc.org/openpower/sv/vector_ops/

My personal opinion about POWER may bias my judgement but here it is : despite the insane amount of engineering that has been invested in it, it's overly complex and I still can't wrap my head around it, even 25 years after getting a book about it.

However some of the discussions have tickled me.

There is one architectural parameter that defines the capacity and performance of vector computers : the number and length of the vector registers. Some years ago, I evaluated an F-CPU coprocessor containing a large number of scalar registers (probably 256 or 1024) that could then be processed, possibly in parallel if "suitable hardware" is designed. For now, Libre-SoC considers 128, possibly 256 scalar registers that can be processed in a vector-like way.

But this number is a hard limit: it defines and cripples the architecture. As we have seen in the scientific HPC industry, practical vector sizes have grown far beyond the 8×64 numbers (4 KiB) of the original Cray-1. For example the NEC SX-6 (used for the Earth Simulator) uses a collection of single-chip supercomputers, each with 72 registers of 256 numbers (147456 bytes), and that was 20 years ago. That is way beyond the 1K bytes considered by Libre-SoC, which will barely be enough to mask main memory's latency. Furthermore, because of the increased number of ports, the vector register set will be less dense and will draw more power than standard cache SRAM, for example.
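As a quick sanity check of the capacities quoted above, here is a small Python sketch. It assumes 64-bit (8-byte) elements everywhere and models the Libre-SoC case as 128 scalar registers of one element each; the function and variable names are only illustrative.

```python
# Vector register file capacity, assuming 64-bit (8-byte) elements throughout.
BYTES_PER_ELEMENT = 8

def vector_file_bytes(registers, elements_per_register):
    """Total capacity of a vector register file, in bytes."""
    return registers * elements_per_register * BYTES_PER_ELEMENT

cray1 = vector_file_bytes(8, 64)        # 8 registers of 64 numbers
sx6 = vector_file_bytes(72, 256)        # 72 registers of 256 numbers
libre_soc = vector_file_bytes(128, 1)   # 128 scalar registers (assumed 64-bit)

print(cray1, sx6, libre_soc)
```

Under these assumptions, the SX-6 register set is 36 times larger than the Cray-1's and 144 times larger than the 128-register case.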

Clearly, setting a hard limit for the vector size and capacity is a sure way to create problems later. Scalability is critical: some implementations will favour a smaller and cheaper design that trades away performance, while others will want to push all the buttons in the quest for performance.

And you know what is scalable and configured at will by designers ? Cache SRAM. It totally makes sense to use the Data L1 cache (and other levels) to emulate vectors. User code that relies on cache is totally portable (and if adaptive code is not used, the worst that can happen is thrashing the cache) and memory lines are already pretty wide (256 or 512 bits at once) which opens the floodgates for wide SIMD processing as well (you can consider 4 or 8 parallel computational pipelines already). This would consume less power, be denser and more scalable than using dedicated registers. In fact, one of the Patterson & Hennessy books writes:

In 1986, IBM introduced the System/370 vector architecture and its first
implementation in the 3090 Vector Facility. The architecture extends the
System/370 architecture with 171 vector instructions. The 3090/VF is
integrated into the 3090 CPU. Unlike most other vector machines, the
3090/VF routes its vectors through the cache.

That is not exactly what I have in mind but it shows that the idea has been floating around for such a long time that the first patents have long expired.
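To make the cache-line argument more concrete, here is a rough Python sketch of a vector addition processed one cache line at a time. It assumes 64-bit elements, so a 256-bit line gives 4 lanes and a 512-bit line gives 8; the function names and the list-based "cache" are purely hypothetical stand-ins, not a description of the F-CPU hardware.

```python
ELEMENT_BITS = 64  # assumption: 64-bit elements

def lanes_per_line(line_bits):
    """Number of elements that fit in one cache line."""
    return line_bits // ELEMENT_BITS

def vector_add_by_lines(a, b, line_bits=256):
    """Add two equal-length vectors, one cache line's worth of lanes at a time.

    Returns the result and the number of line-wide steps that were needed.
    """
    lanes = lanes_per_line(line_bits)
    out = []
    steps = 0
    for start in range(0, len(a), lanes):
        line_a = a[start:start + lanes]   # one line read from the first vector
        line_b = b[start:start + lanes]   # one line read from the second vector
        out.extend(x + y for x, y in zip(line_a, line_b))  # all lanes "in parallel"
        steps += 1
    return out, steps
```

The vector length is not bounded by any register count here: a longer vector simply takes more line-wide steps, and a wider line takes fewer.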

Cache SRAM has enough burst bandwidth to emulate a huge vector register but this is far from being enough to make a half-decent vector processor. The type of CPU/GPU hybrid I consider is rather used for accelerating graphics primitives, not for large matrix maths (which is the gold standard for HPC, and I don't care about LINPACK ratings) so I'm aiming at "massive SIMD" performance, knowing that scatter/gather access is a PITA. But graphics primitives are not just single primitive operations on a long string of numbers: there can be several streams of data that get combined by multiple execution units. DSP algorithms such as FFT and DCT require tens of operations to create tens of results from tens of operands. There is a significant potential for heterogeneous parallelism, and a single cache block is obviously underwhelming. This is where FC1's structure changes the rules.

For those who have not followed the development of FC1, here's a summary :

FC1 is best implemented as a 4-way superscalar processor (though a narrower one with 1 or 2 ways is easy to devise). Each instruction is 32 bits wide and is targeted at one of the 4 independent pipelines. The program stream should ensure that instructions for the 4 pipelines are packed following a few simple rules for optimal pipeline use. Each pipeline has its own private computation units, register file and cache memory, to keep latency as low as possible. For communication, one pipeline can write a result to another pipeline, with a time penalty.

Each pipeline has 16 registers only. Just like the #YASEP and the #YGREC8, it uses register-mapped memory:

R0 to R7 are "normal" registers
A0 to A3 hold addresses
D0 to D3 are "windows" to the memory pointed to by the respective address register (they can be thought of as ports to the L0 memory buffers)
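The register-mapped memory idea in the list above can be modelled in a few lines of Python. This is only a toy sketch: the class name, the method names and the flat list standing in for the pipeline's private cache are all assumptions made for illustration.

```python
class Pipeline:
    """Toy model of one pipeline's 16 registers (hypothetical, for illustration)."""

    def __init__(self, memory):
        self.memory = memory   # stands in for the pipeline's private cache block
        self.R = [0] * 8       # R0..R7: "normal" registers
        self.A = [0] * 4       # A0..A3: address registers

    def read_D(self, n):
        """Reading Dn returns the memory word pointed to by An."""
        return self.memory[self.A[n]]

    def write_D(self, n, value):
        """Writing Dn stores to the memory word pointed to by An."""
        self.memory[self.A[n]] = value


mem = [10, 20, 30, 40]
p = Pipeline(mem)
p.A[0] = 2                    # point A0 at word 2
p.R[1] = p.read_D(0)          # "load" by simply reading D0
p.A[1] = 0                    # point A1 at word 0
p.write_D(1, p.R[1] + 1)      # "store" by simply writing D1
```

There are no explicit load or store operations in this model: setting an address register and then touching the matching D register does the job.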

Each pipeline can be backed by its own TLB mirror and cache memory block (4 or 8-way). This allows FC1 to read 8 operands and write four 64-bit results in a single cycle. For a while I wondered if this was balanced or overkill but the vector extension makes it "just right", I think.

The proposed vector extension reuses some of these mechanisms but doesn't use the whole scalar register file, which can still work in parallel for housekeeping duties. The instruction format is the same (the SIMD bit is now a vector bit) and the decoding and packing rules are mostly the same but the vector-flagged instructions operate on another subset of the system.

D0-D3 operands refer to vectors in cache, either as an operand or as a destination. You could "vadd D0 D1 D2" and that's all.
A0 to A3 don't make sense as such but that could be used for scatter/gather.
R0 to R7 could be scalar operands, or a "port" to another pipeline.
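Here is one possible reading of "vadd D0 D1 D2", sketched in Python. Several things are assumptions on my part and not settled by the text: that the first operand is the destination, that each address register auto-increments by one element per step, and that the vector length is supplied separately (the `length` parameter is hypothetical).

```python
def vadd(memory, A, dst, src1, src2, length):
    """Stream memory[A[src1]..] + memory[A[src2]..] into memory[A[dst]..].

    memory: flat list standing in for the cache
    A:      list of 4 address registers, updated in place
    """
    for _ in range(length):
        memory[A[dst]] = memory[A[src1]] + memory[A[src2]]
        # Auto-increment each address register involved, once per element.
        for n in {dst, src1, src2}:
            A[n] += 1


# Destination at words 0..3, sources at words 4..7 and 8..11.
memory = [0, 0, 0, 0, 1, 2, 3, 4, 10, 20, 30, 40]
A = [0, 4, 8, 0]                   # A0, A1, A2 point at the three vectors; A3 is unused
vadd(memory, A, dst=0, src1=1, src2=2, length=4)
```

After the call, A3 is untouched, which is the property that would let scalar code keep using the address registers the vector operation does not claim.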

The point of the proposed architecture is its ability to write sequences of instructions that describe a dataflow/route for the multiple parallel streams of vector data, which is more powerful than simple chaining. Each pipeline can contain more than one processing unit (integer ALUs, integer Multiply-Accumulate, FPadd, FPmul, FPrecip/trans...) and the sequence of instructions can virtually wire them to form a more complex and useful operation. Multiple operations could even be "in flight" in the same "pipeline" (cluster), and there are 4 of them that can send their result to the others.

A simple scoreboard can help the decoder stall when a register is still processing a vector, for example. But non-vector operations can still be executed because R0 to R7 are not used or affected by vector operations. A and D registers are updated though, and special circuits are required for auto-incrementing the addresses, but this was intended since the very beginning, no biggie here. So if the vector operations use A1 and A2, the scalar core can still work with A0 and A3.
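A minimal Python sketch of such a scoreboard follows. The class, its methods and the cycle-countdown model are assumptions chosen for illustration; a real implementation would more likely be a handful of busy bits cleared by the vector unit.

```python
class Scoreboard:
    """Tracks which registers are still tied up by an in-flight vector operation."""

    def __init__(self):
        self.busy = {}  # register name -> remaining cycles

    def issue_vector(self, regs, cycles):
        """Mark the registers used by a vector operation as busy."""
        for r in regs:
            self.busy[r] = cycles

    def can_issue(self, regs):
        """An instruction may issue only if none of its registers are busy."""
        return not any(r in self.busy for r in regs)

    def tick(self):
        """Advance one cycle and release registers whose operation has finished."""
        self.busy = {r: c - 1 for r, c in self.busy.items() if c > 1}


sb = Scoreboard()
sb.issue_vector(["A1", "D1", "A2", "D2"], cycles=2)
scalar_ok = sb.can_issue(["R0", "A0", "D0"])   # scalar work on free registers
vector_blocked = not sb.can_issue(["D1"])      # D1 is still streaming a vector
```

In this sketch the decoder would stall only the instruction that touches a busy A or D register, while scalar instructions using R0 to R7, A0 and A3 keep flowing.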

With the system as I imagine it, it is possible to execute simultaneous operations on identical operands, such as

R1 <= D0 + D1, R2 <= D0 - D1

This will take 2 consecutive instructions, yet D0 and D1 will be read only once, and both operations are executed during the same cycle (if the execution units are available). You can even chain another instruction before sending the result to another pipeline or to memory through D2 or D3. In this example, R1 and R2 are not real registers but symbolic names for bus/port numbers, for example.
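The sum/difference pair above can be sketched in Python as a single pass over the two input streams. The function name and the read counter are hypothetical; the counter is only there to show that each operand element is fetched once even though two results are produced.

```python
def sum_and_difference(d0_stream, d1_stream):
    """Produce D0 + D1 and D0 - D1 while reading each operand element once."""
    reads = 0
    sums, diffs = [], []
    for a, b in zip(d0_stream, d1_stream):
        reads += 2             # one read from D0, one from D1
        sums.append(a + b)     # goes to the port named "R1" in the text
        diffs.append(a - b)    # goes to the port named "R2" in the text
    return sums, diffs, reads


sums, diffs, reads = sum_and_difference([5, 7], [1, 2])
```

This is the same pattern as an FFT butterfly: two results per pair of operands, with half the memory reads that two independent vector instructions would need.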

I'm only starting to explore this route but I'm already happy that it promises a lot of operations per cycle (and potentially high execution unit utilisation) with little control logic. No need for OOO. Of course, a lot of refinements will be required, in particular for scatter/gather and inter-lane communication. But we have to start somewhere, and since the basic "scalar" FC1 is barely changed, the vector extension is easy to add or remove at will.

Please show your interest if illustrations are necessary ;-)

Discussions

Thomas wrote 12/29/2020 at 08:05

Thanks for sharing your thoughts and ideas - it's inspiring to get insight into a creative process!


alice crush wrote 12/24/2020 at 07:28

Sounds good to me, a lay person / coder, I would like to see some samples, examples, sure diagrams.


Yann Guidon / YGDES wrote 12/24/2020 at 07:35

Then I'll try to formalise all this in the coming days :-)


Yann Guidon / YGDES wrote 12/24/2020 at 07:41

Meanwhile, you might have a look at the original F-CPU manual.

http://archives.f-cpu.org/manual-20021116/

Many features have changed/evolved but the spirit remains : make a decent application processor with a fresh RISC architecture, avoid complex out-of-order circuits but instead redesign the instruction set around the problems that OOO tries to solve in HW.

In comparison, RISC-V is a very polished architecture born 35 years ago, but nothing really new has appeared...


Yann Guidon / YGDES wrote 12/26/2020 at 06:00

The latest log should bring a new perspective.

You can consider this core as 4 sub-cores sharing an instruction stream :-)

