The register-mapped memory system, pioneered by Seymour Cray in the CDC6600, has many benefits and I developed it in my own way in the #YASEP
I used a different approach however, where the data register both reads and writes. The CDC had data 5 registers for reading and 2 for writing, with different code sequences.
For the YASEP, my approach makes perfect sense because the core uses onchip memory with single-cycle latency. You set the address register and by the next cycle, the data is read, whether you need it or not. Same for the #YGREC where the access time is "relatively immediate".
The F-CPU is a different beast though. We expect multi-level memory and going off-chip is a definitive performance killer. There must be a way to determine if a memory access is for reading or writing.
The CDC6600 version, with registers that are dedicated to functions, is not possible : there are already 16 registers, 1/4 of the total, and there is no room left, either in the register set or the opcodes (each bit is precious !)
The semantics of the address/data registers is to access the memory contents directly, through a buffer like a cache line, for example. Loading the cache line is the costly part.
How do I tell the CPU to load the cache line ? Simply by using/reading the data register. But doing this will stall the core if the cache line is not ready... This totally defeats the purpose of the system, where the memory's latency is hidden by explicit scheduling of the instructions: you calculate the address, schedule some instructions while waiting for the data to arrive, then read the register.
The costly part is the reading and it's too unpredictible : the data could be in the cache, in the local DRAM, on another CPU, or swapped out... So let's think in reverse and consider the write operation : the sequence goes like
- write the destination address to the address register
- write the data to the data register
The point of this log is to make a sort of atomic pair that the decoder can easily detect : the destination register must be the same (modulo one inverted bit) in two consecutive instructions (or closely enough, if the core can afford the comparators) to indicate that it's a write and the memory system shouldn't fetch the line right now.
There are a couple of things to notice :
- the address register of one port must be located in the "mirror" globule of the data register so the two instructions (set address, write data) can be "emitted" at the same time (and only one comparator is required, no comparison across pairs and simpler decoder)
- the SIMD globules need their own data ports but the address must remain in the scalar globules. I should change the register map... but now each of the 4 globules has their own dedicated memory block, with a matching data size !
And now, due to the pairing rules, it seems obvious that all the address registers must reside in the first globule, or else the pairing rules must be more complex, but this creates an imbalance in the register allocation...
Honestly I dislike the new asymmetrical design because it makes the architecture less orthogonal and potentially less efficient to code for. Architecturally, you only had to design one globule (or two if you want SIMD) and there you go, copy-paste-mirror-connect and you're done.
But let's look at the new partition :
- First globule is specialised for addresses, has A0-A7, and its dedicated memory block accessed with D0 & D1. Since you write all the addresses there, the TLB is directly connected to its ALU's output. Aliasing is directly managed there. The 6 remaining registers can be used to store some pointers, frame base, indices and increments, as well as supplement the standard computations.
- Second globule has only 2 data registers, to access the dedicated cache. 14 working registers can do some meaningful work. No TLB here, it can be "replaced" by a barrel shifter/multiplier array/division unit
- The 3rd globule has 2 data registers, accessing the wider (but shallower) cache block, and does the heavy lifting (along with the mirored 4th globule)
One big inconvenience is the single TLB that limits the throughput to only one load or store per cycle when we can emit 2 instructions and have 8 ports...