12/26/2020 at 00:46 •
Here is a summary of the design so far.
FC1, or F-CPU core #1, is the successor of FC0, which was designed more than 20 years ago. You can have a look at the original F-CPU manual for an overview of the original concept and history. FC1 is a more mature version that drops the ideas that failed and introduces new ones. The FC1 instruction set is inspired by FC0's but is incompatible with it.
So many features have changed/evolved but the founding spirit remains: make a decent application processor with a fresh RISC architecture, avoid complex out-of-order circuits and instead redesign the instruction set around the problems that OOO tries to solve in HW.
FC1 is a 4-way superscalar processor from the ground up. FC0 would require re-engineering to go superscalar and instead counted on its superpipeline (very short pipeline stages, or "the carpaccio approach") to reach high speed and throughput. The cost was more complexity, a longer pipeline and maybe lower single-thread performance (reminiscent of the "plague of the P4"). The low FO4 can quickly hit a logic wall and the intended granularity might have been overly optimistic.
Instead, FC1 is designed as a superscalar with far fewer pipeline stages, which is easier to convert to 2-way or 1-way issue than the reverse. Code that is correctly compiled and scheduled will run equally well on the 3 possible implementations, though 4-way is the most natural choice. The 2-way and 1-way versions would be interesting for gate-limited targets, such as FPGAs.
Just like FC0, FC1 is an in-order processor that uses a scoreboard to stall the instructions at the decode stage if hazards are detected. To some, this is ugly and unthinkable in 2020 but the "lean philosophy" attempts to avoid feature creep that will add a considerable burden later.
Instead, the instruction set and architecture are designed to reduce the effects and causes of decode stalls. Precise exceptions, mostly from memory reference faults, are possible by splitting the classic "LOAD" or "STORE" instructions in 2:
- compute the address, TLB lookup and tag the corresponding address register
- access the data and use it as operand for another operation
The access instructions are the ones that trap, once the address is known, but only if the address is actually referenced. This is possible by flagging the corresponding register as "invalid access", for example. This also enables prefetch, to hide some of the latency of memory.
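To make the split concrete, here is a minimal behavioural sketch in Python. It is only a model under assumptions, not the FC1 design: the class `AddressPort`, the exception `AccessFault` and the `valid_range` stand-in for the TLB lookup are all hypothetical names. It shows that setting a bad address only flags the register, and that the trap happens later, when the data is actually accessed.

```python
class AccessFault(Exception):
    """Raised when a data register is used while its address is flagged invalid."""


class AddressPort:
    """One hypothetical A/D register pair, backed by a dict used as memory."""

    def __init__(self, memory, valid_range):
        self.memory = memory
        self.valid_range = valid_range  # assumption: stands in for a TLB lookup
        self.address = 0
        self.invalid = False

    def set_address(self, address):
        # Step 1: compute the address, look it up, tag the address register.
        # Nothing traps here, even if the lookup fails.
        self.address = address
        self.invalid = address not in self.valid_range

    def read_data(self):
        # Step 2: the access instruction is the one that may trap.
        if self.invalid:
            raise AccessFault(hex(self.address))
        return self.memory.get(self.address, 0)


memory = {0x100: 42}
port = AddressPort(memory, valid_range=range(0x100, 0x200))

port.set_address(0x100)
value = port.read_data()

port.set_address(0x900)  # bad pointer: flagged, but nothing traps yet
flagged = port.invalid
try:
    port.read_data()
    trapped = False
except AccessFault:
    trapped = True
```

A pointer that is computed but never dereferenced therefore costs nothing, which is also what makes speculative prefetch harmless in this model.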
By the way, FC1 uses explicitly dedicated Address registers and Data registers. This reduces the complexity and overhead caused by FC0's more flexible and general approach, since now only 16 register addresses have to be flagged "invalid/ready" instead of 63.
Just like FC0, FC1 uses 64 registers though as explained above, the register set is not homogeneous, but split into 3 main functions. Just like the #YASEP and the #YGREC8, FC1 uses register-mapped memory:
- 32 "normal" registers (R0 to R31, and R0 is not hardwired to 0)
- A0 to A15 hold data addresses
- D0 to D15 are "windows" to the memory pointed by the respective address register (they can be thought as a port to the L0 memory buffers)
This is a LOT of ports to memory and the question of the relevance is legitimate (particularly since it creates a LOT of aliasing problems) but we'll see later that it also creates interesting opportunities.
If Data/Address pairs can be paired, that makes 8 blocks of dual-port L1 cache memory. A particularly high bandwidth is expected and it should be matched by the eventual L2 cache and main memory bandwidth, but this is not directly inside the scope of the design. Let's just say that it's less constrained than most existing designs.
An even more radical aspect of FC1 is that the pipelines are "loosely coupled", and in fact quite decoupled. Each of the 4 pipelines has its own 2R1W register set with 16 addressable registers (8×R, 4×D, 4×A) to keep speed as high as possible. Gone is the humongous register set with 64 registers, 2 write ports and 3 read ports that was a major timing problem. Selecting only 2 operands among 16 is faster and smaller.
Each pipeline has dedicated registers and the only way to communicate between pipelines is to write the result of an operation to another target pipeline. This of course creates a hazard (and one stall cycle) but it keeps the decoder & Xbar complexity low.
The diagram above also shows that each "Globule", or pipeline+cache block, has 1 output port (wrongly tied to the source register selector) and selects input data from one of four sources : either the shared execution unit(s) or one of the 3 other globules. Two shared TLBs manage the aliases and check the data that go to the Address registers, while keeping cache eviction reasonable (if you manage your pointers well).
Look ma', no OOO !
The "loosely coupled" approach helps when dealing with code that would benefit from OOO but can be detected at compile time, with "sub-threads" that can be allocated to given pipelines to complete a sequence while the other pipeline(s) start a new sequence. A virtual FIFO (through the L0 instruction cache's multiple instruction pointers) lets a pipeline, or two, or three, stall during L1 cache misses, while the remaining pipeline(s) still proceed in the program's logic. While loads are big headaches (and can be managed through the A registers by early address computations), stores don't slow the program logic as long as no aliasing occurs.
Another breakthrough that is possible with a split register set is that all the pesky instructions that need more than 3 register addresses and don't fit in the clean 2R1W scenario are now handled by "paired instructions".
A pair of pipelines can now handle addition with carry, full-precision multiplies, or long-shifting with almost no effort. The same instruction is duplicated BUT the 2nd instruction specifies a destination in another pipeline (which will stall to accept the new result). This is both trap-safe (the pair of instruction can be split into 2 and be functionally equivalent despite the break of the pair at decode time) and a good use of the available resources.
The example below shows such a case of paired instructions:
; here is a pair of instructions that will be decoded and executed in parallel
IMULL R01, R2, R03 ; feed back the result in the pipeline
IMULH R01, R2, R13 ; send extra result in another pipeline
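What the pair computes can be modelled in a few lines of Python. This is a sketch under assumptions: the operands are treated as unsigned 64-bit values (the signedness of IMULL/IMULH is not stated here) and the function names simply mirror the mnemonics. Each half is a complete operation on its own, which is why the pair can be split without changing the result.

```python
MASK64 = (1 << 64) - 1


def imull(a, b):
    """Low 64 bits of the 128-bit product (unsigned operands assumed)."""
    return (a * b) & MASK64


def imulh(a, b):
    """High 64 bits of the 128-bit product (unsigned operands assumed)."""
    return ((a * b) >> 64) & MASK64


a = 0xFFFF_FFFF_FFFF_FFFF
b = 2
low = imull(a, b)    # stays in the issuing pipeline
high = imulh(a, b)   # would be sent to the other pipeline
```

Recombining `(high << 64) | low` gives back the full-precision product, whether the two halves were issued together or separately.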
Note that the register names are ... in octal. This is not to emulate Cray's philosophy but to ease coding: the first digit will encode the pipeline's number.
Another note on instruction encoding: both operands of an instruction are located in the same pipeline. The result can be sent to another pipeline though. As a result, one needs to encode 6+6+4=16 bits, 2 bits fewer than FC0, thanks to the explicit partitioning of the register sets. Decoding is also greatly simplified and smaller.
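The octal naming and the bit count can be checked with a short Python sketch. The helper `parse_register` is hypothetical, and the split of a full 6-bit field into 2 bits of pipeline number plus 4 bits inside the pipeline is my reading of the text (4 pipelines of 16 registers each), not a confirmed encoding.

```python
def parse_register(name):
    """Split an octal register name such as 'R13' into (kind, pipeline, index)."""
    kind, digits = name[0], name[1:]
    number = int(digits, 8)           # the digits are octal
    return kind, number >> 3, number & 7


FULL_FIELD = 6      # assumed: 2 bits of pipeline number + 4 bits inside it
PARTIAL_FIELD = 4   # second operand, forced into the same pipeline

fc1_bits = FULL_FIELD + FULL_FIELD + PARTIAL_FIELD
fc0_bits = 3 * 6    # three full fields for 64 homogeneous registers

same_pipeline = parse_register("R03")
other_pipeline = parse_register("R13")
```

With this reading, `R03` and `R13` name the same local register in pipelines 0 and 1, which matches the IMULL/IMULH example.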
For paired instructions, the encoding is even smaller because some bits are implicit. The result can't be sent to another globule. A single instruction could also be used, that will be expanded by the decoder, sending the result to the opposite globule at an implicit address. The extra 3 bits can encode more options for the opcode.
12/22/2020 at 03:15 •
Re-inventing-the-wheel-warning ! But in the middle of the decades-old tricks, some new ones could prove fruitful.
To celebrate the 22nd anniversary of the project, I bring a new life and perspective to vector processing, which fully exploits the superscalar architecture that has evolved these last years.
To be fair, parts of these considerations are inspired by another similar project : https://libre-soc.org/ is currently trying to tape-out a RISC processor capable of GPU functions, with a CDC-inspired OOO core that executes POWER instructions. Not only that, the project is also trying to add vector computation to the POWER ISA, and this is now completely weird. See https://libre-soc.org/openpower/sv/vector_ops/
My personal opinion about POWER may bias my judgement but here it is : despite the insane amount of engineering that has been invested in it, it's overly complex and I still can't wrap my head around it, even 25 years after getting a book about it.
However some of the discussions have tickled me.
There is one architectural parameter that defines the capacity and performance of vector computers : the number and length of the vector registers. Some years ago, I evaluated a F-CPU coprocessor that contains a large number of scalar registers (probably 256 or 1024) that could then be processed, eventually in parallel if "suitable hardware" is designed, and for now, Libre-SoC considers 128, eventually 256 scalar registers that can be processed in a vector-like way.
But this number is a hard limit: it defines and cripples the architecture. As we have seen in the scientific HPC industry, practical vector sizes have grown and completely exceeded the 8×64 numbers (4Ki bytes) of the original Cray-1. For example the NEC SX-6 (used for the Earth Simulator) uses a collection of single-chip supercomputers with 72 registers of 256 numbers (147456 bytes) and that was 20 years ago. That is way beyond the 1K bytes considered by Libre-SoC, which will barely be enough to mask main memory's latency. Furthermore, because of the increased number of ports, the vector register set will be less dense and will draw more power than standard cache SRAM, for example.
Clearly, setting a hard limit for the vector size and capacity is a sure way to create problems later. Scalability is critical: some implementations will favour a smaller and cheaper design that compromises on performance, while others will want to push all the buttons in the quest for performance.
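The byte counts above follow from simple arithmetic, assuming every "number" is a 64-bit (8-byte) element and taking the lower Libre-SoC figure of 128 registers. A quick check in Python:

```python
BYTES_PER_NUMBER = 8  # assumption: 64-bit elements everywhere

cray1_bytes = 8 * 64 * BYTES_PER_NUMBER       # 8 registers of 64 numbers
sx6_bytes = 72 * 256 * BYTES_PER_NUMBER       # 72 registers of 256 numbers
libre_soc_bytes = 128 * BYTES_PER_NUMBER      # 128 scalar registers

ratio = sx6_bytes // libre_soc_bytes
```

Under these assumptions the SX-6 vector storage is 144 times larger than the 128-register file.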
And you know what is scalable and configured at will by designers ? Cache SRAM. It totally makes sense to use the Data L1 cache (and other levels) to emulate vectors. User code that relies on cache is totally portable (and if adaptive code is not used, the worst that can happen is thrashing the cache) and memory lines are already pretty wide (256 or 512 bits at once) which opens the floodgates for wide SIMD processing as well (you can consider 4 or 8 parallel computational pipelines already). This would consume less power, be denser and more scalable than using dedicated registers. In fact, one of the Patterson & Hennessy books writes:
In 1986, IBM introduced the System/370 vector architecture and its first implementation in the 3090 Vector Facility. The architecture extends the System/370 architecture with 171 vector instructions. The 3090/VF is integrated into the 3090 CPU. Unlike most other vector machines, the 3090/VF routes its vectors through the cache.
That is not exactly what I have in mind but it shows that the idea has been floating around for such a long time that the first patents have long expired.
Cache SRAM has enough burst bandwidth to emulate a huge vector register but this is far from being enough to make a half-decent vector processor. The type of CPU/GPU hybrid I consider is rather used for accelerating graphics primitives, not for large matrix maths (which is the gold standard for HPC, and I don't care about LINPACK ratings) so I'm aiming at "massive SIMD" performance, knowing that scatter/gather access is a PITA. But graphics primitives are not just single primitive operations on a long string of numbers: there can be several streams of data that get combined by multiple execution units. DSP algorithms such as FFT and DCT require tens of operations to create tens of results from tens of operands. There is a significant potential for heterogeneous parallelism, and a single cache block is obviously underwhelming. This is where FC1's structure changes the rules.
For those who have not followed the development of FC1, here's a summary :
FC1 is best implemented as a 4-way superscalar processor (though a narrower one with 1 or 2 ways is easy to devise). Each instruction is 32 bits wide and is targeted at one of the 4 independent pipelines. The program stream should ensure that instructions for the 4 pipelines are packed following a few simple rules for optimal pipeline use. Each pipeline has its own private computation units, register file and cache memory, to keep latency as low as possible. For communication, one pipeline can write a result to another pipeline, with a time penalty.
- R0 to R7 are "normal" registers
- A0 to A3 hold addresses
- D0 to D3 are "windows" to the memory pointed by the respective address register (they can be thought as a port to the L0 memory buffers)
Each pipeline can be backed by its own TLB mirror and cache memory block (4 or 8-way). This allows FC1 to read 8 operands and write 4 64-bits results in a single cycle. For a while I wondered if this was balanced or overkill but the vector extension makes it "just right", I think.
The proposed vector extension reuses some of these mechanisms but doesn't use the whole scalar register file, which can still work in parallel for housekeeping duties. The instruction format is the same (the SIMD bit is now a vector bit) and the decoding and packing rules are mostly the same but the vector-flagged instructions operate on another subset of the system.
- D0-D3 operands refer to vectors in cache, either as an operand or as a destination. You could "vadd D0 D1 D2" and that's all.
- A0 to A3 don't make sense as such but that could be used for scatter/gather.
- R0 to R7 could be scalar operands, or a "port" to another pipeline.
The point of the proposed architecture is its ability to write sequences of instructions that describe a dataflow/route of the multiple parallel streams of vector data, that is more powerful than simple chaining. Each pipeline can contain more than one processing unit (integer ALUs, integer Multiply-Accumulate, FPadd, FPmul, FPrecip/trans...) and the sequence of instructions can virtually wire them to form a more complex and useful operation. Multiple operations could even be "in flight" in the same "pipeline" (cluster), and there are 4 of them that can send their result to the others.
A simple scoreboard can help the decoder stall when a register is still processing a vector, for example. But non-vector operations can still be executed because R0 to R7 are not used or affected by vector operations. A and D registers are updated though, and special circuits are required for auto-incrementing the addresses, but this was intended since the very beginning, so no biggie here. So if the vector operations use A1 and A2, the scalar core can still work with A0 and A3.
With the system as I imagine it, it is possible to execute simultaneous operations on identical operands, such as
R1 <= D0 + D1, R2 <= D0 - D1
This will take 2 consecutive instructions, yet D0 and D1 will be only read once, and both operations are executed during the same cycle (if the execution units are available). You can even chain another instruction before sending the result to another pipeline or to memory through D3 or D4. In this example, R1 and R2 are not real registers but symbolic names for bus/port numbers, for example.
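The effect can be imitated in software with a single pass over the two streams. This Python sketch is only an analogy: the function name `add_sub_stream` is hypothetical and plain lists stand in for the vectors behind D0 and D1. The point is that each element is fetched once and feeds both operations.

```python
def add_sub_stream(d0, d1):
    """Read each element of both streams once, produce the sum and the difference."""
    sums, diffs = [], []
    for x, y in zip(d0, d1):   # one read of D0 and one read of D1 per element
        sums.append(x + y)     # R1 <= D0 + D1
        diffs.append(x - y)    # R2 <= D0 - D1
    return sums, diffs


r1, r2 = add_sub_stream([5, 7, 9], [1, 2, 3])
```

This sum/difference pair is the basic butterfly pattern found in FFT and DCT kernels, which is why sharing the operand fetch matters for that kind of code.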
I'm only starting to explore this route but I'm already happy that it promises a lot of operations per cycle (and potentially high operation unit utilisation) with little control logic. No need for OOO. Of course, a lot of refinements will be required, in particular for scatter/gather and inter-lane communication. But we have to start somewhere and the basic "scalar" FC1 is barely changed: the vector extension is easy to add or remove at will.
Please show your interest if illustrations are necessary ;-)
12/22/2018 at 20:23 •
20 years ago, I shyly joined a crazy pipedream vaporware project, which I found on Slashdot. Since then, it's been a constant adventure.
I have learned a LOT. I have met a LOT of people. I have gotten a LOT of knowledge and understanding and this project has shaped a lot of my day-to-day life, more than you could imagine. And it's not going to end. My presence on HAD is fueled by the need to innovate, create, build, design, and break ALL the ground that was needed for FC0. We didn't have the right people, the right tools, the right structure. I do my best to at least create the environment that will make F-CPU a reality, even if I have to scale things down to the extreme (see #YGREC8 ) before going back to the full-scale SIMD.
There is no way I could build everything alone and no way I could tapeout FC1 either alone or with the original structure. I have grown a bit wiser and I try to surround myself with the BEST people. I have found many here on HAD (if you read this, you might be one of them).
The motto will always remain :
« There can be no Free Software without Free HardWare Designs »
and such a chip is long overdue.
There is so much to do and there is certainly something you can do.
12/01/2018 at 12:27 •
As already stated, F-CPU enforces a strong separation between the data and instruction memory spaces. This is a central requirement because this processor will run user-provided (or worse: 3rd party) code and security must not be an afterthought.
Another security measure is to ensure that each data block can be accessed by only one thread. Zero-copy information sharing happens when one thread "yields" a block to another thread, as a payload of a message for example. At that point, the original owner loses ownership and rights to that block, that must be previously allocated from a dynamic pool. This ensures (ideally) that no information is leaked (as long as the allocator clears the blocks after use).
There are also "hard" semaphores in the Special Registers space, with a scratch area where appropriate read and write rights allow deadlock-free operation.
But there are still cases where this is not enough and zero-copy operation is impossible or too cumbersome. In a few cases, the overhead of yielding a block is too high and we would like one thread to be able to write to a buffer while other threads can read it. Without this feature, the system is forced to clump data into a message packet, which adds latency. This is multiplied by the number of interested recipients but the added cost would not be justified if this specific update is not required at the moment. It would be faster and lighter to read some word here and there from time to time.
This desired "multicast" feature is reminiscent of a system that must be implemented somehow : ONE thread is allowed to write to the instruction space, to load programs, and this space may be shared by several threads. However, all the other threads can't access the program space, and shared constants (read-only values) must exist in the data addressing space, so a multicast system is required anyway.
12/01/2018 at 10:47 •
When looking at (almost?) all the 64-bit CPUs out there, you see that the MSBs of addresses are not really used. 48 bits give you 256 terabytes, which is about the RAM installed in an HPC rack these days. But so much space can't really be used conveniently because access times prevent efficient access across a rack or a datacenter. Message passing often takes over when communication is required across compute nodes or blades.
So we're left with these dangling bits... The addresses are usually "virtual" and require translation to physical space. But you know I hate wasted resources, right ? So I want to use the MSB as an "address space ID" and store some meta-information such as process number. It's easy to check if this address is valid so it adds some sort of protection while allowing the whole addressable space to be linearised.
Of course there is the question of "how many bits should be allocated" because there is always a case that will bump into fixed limitations. The solution is to make a flexible system with not just an ID to compare, but also a bitmask (this is used nowadays for comparing addresses in chipsets, for example). The OS will allocate as many bits as required by the workload, either 56 address bits for 256 tasks, or 40 bits for 16M tasks, or any desired mix (think : unbalanced binary tree)...
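The ID-plus-bitmask check can be sketched in Python as follows. This is a sketch under assumptions: a 64-bit address, the ID stored in the upper bits, and the hypothetical helpers `make_space` and `is_valid`. The mask marks which upper bits are compared, so the split between ID bits and address bits can differ per task.

```python
ADDRESS_BITS = 64


def make_space(task_id, address_bits):
    """Return (mask, tag) for a task that owns `address_bits` low address bits."""
    mask = ((1 << ADDRESS_BITS) - 1) & ~((1 << address_bits) - 1)
    tag = task_id << address_bits
    return mask, tag


def is_valid(address, mask, tag):
    """An address belongs to the task when its masked upper bits equal the tag."""
    return (address & mask) == tag


mask, tag = make_space(task_id=0x2A, address_bits=56)
tasks_with_56 = 1 << (ADDRESS_BITS - 56)
tasks_with_40 = 1 << (ADDRESS_BITS - 40)

ok = is_valid((0x2A << 56) | 0x1234, mask, tag)
bad = is_valid((0x2B << 56) | 0x1234, mask, tag)
```

The two task counts match the figures quoted above: 8 ID bits give 256 tasks and 24 ID bits give 16M tasks.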
11/30/2018 at 14:37 •
One design goal of the F-CPU is to increase efficiency with smart architecture features, and with the FC1 I would like to get closer to OOO performance with in-order cores.
You can only go so far with all the techniques already deployed and explored before : superscalar, superpipeline, register-mapped memory, ultra-short branch latency... Modern OOO processors go beyond that and can execute tens or hundreds of instructions while still completing old ones.
FC1 can't get that far but can at least go a bit in this direction. Typically, the big reordering window is required to 1) compensate for mispredicted branches and 2) wait for external memory.
1) is already "taken care of" by the design rule of having the shortest pipeline possible. A few cycles here and there won't kill the CPU and repeated branches have almost no cost because branch targets are flagged.
2) is a different beast : external memory can be SLOW. Like, very slow. Prefetch (like branch target buffering) helps a bit, but only so much...
So here is the fun part : why wait for the slow memory and block all the core ?
Prefetch is good and it is possible to know when data are ready, but let's go further: don't be afraid anymore to stall a pipeline... because we have 3 others that work in parallel :-)
I was wondering earlier about "microthreads", how one could execute several threads of code simultaneously, without SMT, to emulate OOO. I had seen related experimental works in the last decade(s) but they seemed too exotic. And I want to avoid the complexity of OOO.
The method I explore now is to "decouple" the globules. Imagine that each globule has a FIFO of instructions, so they could proceed while one globule is stalled. Synchronisation is kept simple with
- access to SR
- jumps (?)
- writes to other globules
The last item is the interesting one : the last log moved the inter-globule communication from the read to the write part of the pipeline. The decoder can know in advance when data must cross a globule boundary and block the recipient(s). This works more or less as implied semaphores, with a simple scoreboard (or just four 4-bits fields to be compared, one blocking register per globule).
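One possible reading of the "four 4-bit fields" idea is sketched below in Python. It is an interpretation, not the actual circuit: the class `GlobuleScoreboard` and its method names are hypothetical, and I assume that each globule holds at most one pending incoming register number, and that a globule stalls only when it tries to read that register.

```python
class GlobuleScoreboard:
    """One pending-write slot per globule: the 4-bit register number in flight."""

    def __init__(self, globules=4):
        self.blocked = [None] * globules  # None means no write in flight

    def announce_write(self, target_globule, register):
        # The decoder knows in advance that a result will cross a boundary.
        self.blocked[target_globule] = register & 0xF

    def must_stall(self, globule, source_registers):
        # The recipient stalls only if it reads the register still in flight.
        pending = self.blocked[globule]
        return pending is not None and pending in source_registers

    def complete_write(self, target_globule):
        self.blocked[target_globule] = None


sb = GlobuleScoreboard()
sb.announce_write(target_globule=2, register=5)
stall_reader = sb.must_stall(2, (5, 1))        # reads the pending register
free_other_reg = sb.must_stall(2, (3, 1))      # same globule, other registers
free_other_globule = sb.must_stall(1, (5, 1))  # another globule is unaffected
sb.complete_write(2)
after = sb.must_stall(2, (5, 1))
```

The comparison is a 4-bit equality test per globule, which is about as cheap as a hazard check can get.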
I should sketch all that...
I vaguely remember that in the late 80s, one of the many RISC experimental contenders was a superscalar 2-issues pipeline where one pipeline would process integer operations and the other would just do memory read/writes. They could work independently under some constraints. I found it odd and I have never seen it mentioned since, so the name now escapes me...
Decoupling the globules creates a new problem and goes against one of the golden rules of F-CPU scheduling : don't "emit" an instruction that could trap in the middle of the pipeline. It creates so many problems and requires even more bookkeeping...
Invalid memory accesses could simply silently fail and raise a flag (or something). A memory barrier instruction does the trick as well (à la Alpha).
Anyway, decoupling is a whole can of worms and would appear in FC1.5 (maybe).
11/22/2018 at 01:42 •
So far, the 2R1W opcodes have 1 full destination field (6 bits) that determines the globule number, one partial source register address within the same globule, and one full register address that can read from any other globule. It's quite efficient on paper but the corresponding circuit is not as straight-forward as I'd like.
Decoding the full source field creates significant delays because all the source globules must be checked for conflict, the 4 addresses must be routed/crossbarred, oh and the globules need 3-read register sets.
Inter-globule communication will be pretty intense and will determine the overall performance/efficiency of the architecture... it is critical to get it right. And then, I remember one of the lessons learned with the #YASEP Yet Another Small Embedded Processor : You can play with the destination address because you have time to process it while the value is being decoded, fetched and computed.
OTOH the source registers must be directly read. Any time spent tweaking it will delay the appearance of the result and increase the cost of branches, among others. So I evaluate a new organisation :
- one full (6 bits) source register address, that determines the globule
- one partial (4 bits) source register address, in the same globule as above
- one full destination address register that might be in a different globule.
This way, during the fetch and execute cycles, there is ample time to gather the 4 destination globule numbers and prepare a crossbar, eventually detect write after write conflicts, route the results to the appropriate globules...
The partial address field is a significant win when compared to FC0: there are 16 address bits compared to 18 in the 2000-era design. This means more space for opcodes or options in the instruction word. And moving the complexity to the write stage also reduces the size of the register sets, which now have only 2 read ports !
But communication will not be as flexible as before...
I have considered a few compromises. For example, having shared registers that are visible across globules. It would create more problems than it solves : it effectively reduces the number of real registers and makes allocation harder, among other things that could go wrong. Forget it.
Or maybe we can use more bits in the destination register. The basic idea is to use 1 bit per destination globule and we get a bitfield. The destination address has 4 bits to select the destination globule, and 4 bits for the destination register in the globule. The result could be written to 4 globules at once !
Decoding and detecting the hazards would be very easy, by working with 4 decoded bits directly. The control logic is almost straight-forward.
However this wastes all the savings in precious opcode bits and many codes would not be used or make sense.
A globule field with 3 bits would be a good compromise, and the most usual codes would be expanded before going to the hazard detection logic :
000 G0
001 G1
010 G2
011 G3
100 Gx, Gx+1
101 Gx, Gx+2
110 Gx, Gx+3
111 G0, G1, G2, G3 (broadcast)
The MSB selects between unicast and multicast. One instruction can write to one, two or four globules at once. The last option would be very efficient for 4-issue wide pipelines, and take more cycles for smaller versions.
x is the local globule, determined by the source's full address. Some wire swappings and we get the proper codes for every decode slot simultaneously...
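The expansion of the 3-bit field into 4 decoded bits can be sketched in Python. The function name `expand_destination` is hypothetical, and the wrap-around of Gx+n modulo 4 is my assumption, since the text does not say what happens past G3.

```python
def expand_destination(code, local):
    """Expand the 3-bit globule field into a 4-bit mask (bit n set for Gn)."""
    if code & 0b100 == 0:          # MSB clear: unicast to G0..G3
        return 1 << (code & 0b11)
    if code == 0b111:              # broadcast to all four globules
        return 0b1111
    offset = (code & 0b11) + 1     # 100 -> +1, 101 -> +2, 110 -> +3
    other = (local + offset) % 4   # assumption: wraps around past G3
    return (1 << local) | (1 << other)


unicast = expand_destination(0b010, local=0)
pair_next = expand_destination(0b100, local=1)
pair_wrap = expand_destination(0b110, local=2)
broadcast = expand_destination(0b111, local=3)
```

The resulting masks can be compared bit by bit across the decode slots, which is what keeps the hazard detection simple.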
Moving data to a different globule would still have one cycle of penalty (because transmission is slow across the chip) but this is overall much better than delaying the whole code when one source must be fetched in another globule. Furthermore, broadcast uses fewer instructions.
It is less convenient than reading from a different globule, because moves must be anticipated. This is just a different way to read/scan/analyse/partition the dataflow graph of the program... More effort for the compiler but less complexity for the whole processor. And a simpler core is faster, cheaper and more surface&power-efficient :-)
03/07/2018 at 08:22 •
I think I'm closer to the solution of the old old problem of copy-less data transmission between threads or processes.
Already, switching from one process to another is almost as easy as a Call/Return within a given process. The procedure uses 3 dedicated opcodes : IPC, IPE and IPR (InterProcess Call/Enter/Return). The real problem is how to transfer more data than what the register set can hold.
My new solution is to "mark" certain address registers to yield the ownership of the pointer to the called process.
- The caller saves all the required registers on its stack and "yields" one or more pointers with a sort of "check-out" instruction that sets the Read and Write bits, as well as changes the PID (ownership of the block) OR puts the block in a sort of "Limbo" state (still owned but some transfer of ownership is happening, in case the block is not used by the recipient and needs to be reused).
- The called process "accepts" the pointer(s) with a "check-in" instruction so the blocks can be accessed within the new context and mapped to the new process' addressing range.
This creates quite a lot of new sub-problems but I believe it is promising.
- How do you define a block ? It's basically 1) a base pointer 2) a length 3) access rights 4) owner's PID. All of these must have some sort of HW support, merged with the paging system.
The sizes would probably be powers of two, so a small number easily describes them, and a binary mask checks the pointer's validity.
- Transferable blocks must come from a global "pool" located in an addressing region that is identical for everybody because we don't want to bother with translating the addresses.
- IF there is a global pool then one process must be responsible for handling the allocation and freeing of those blocks.
This affects the paging system in many ways so this will be the next thing to study...
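To fix ideas before studying the paging side, here is a small Python model of the block descriptor and of the check-out/check-in sequence described above. It is a sketch under assumptions: the names `Block`, `check_out`, `check_in` and the `LIMBO` marker are hypothetical, the base is assumed to be aligned to the block size, and the model simplifies the "Limbo" state by dropping the owner while the block is in transit.

```python
from dataclasses import dataclass

LIMBO = -1  # hypothetical marker for a block whose ownership is in transit


@dataclass
class Block:
    base: int        # base pointer, assumed aligned to the block size
    log2_size: int   # power-of-two size, described by a small number
    rights: str      # access rights, e.g. "rw"
    owner: int       # owner's PID

    def contains(self, pointer):
        # Binary mask check: the upper bits of the pointer must match the base.
        mask = ~((1 << self.log2_size) - 1)
        return (pointer & mask) == self.base

    def check_out(self, pid):
        # The caller yields the block and loses access to it.
        if self.owner != pid:
            raise PermissionError("only the owner can yield a block")
        self.owner = LIMBO

    def check_in(self, pid):
        # The called process accepts the block and becomes its owner.
        if self.owner != LIMBO:
            raise PermissionError("block was not yielded")
        self.owner = pid

    def can_access(self, pid, pointer):
        return self.owner == pid and self.contains(pointer)


block = Block(base=0x4000, log2_size=12, rights="rw", owner=7)
before = block.can_access(7, 0x4ABC)
block.check_out(7)
block.check_in(9)
caller_after = block.can_access(7, 0x4ABC)
callee_after = block.can_access(9, 0x4ABC)
outside = block.can_access(9, 0x5000)
```

At any point the block has at most one owner, which is the property the zero-copy transfer relies on.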
02/22/2018 at 04:41 •
I just had this little "hah!" moment...
I keep thinking about how to make FC1 much more badass. While writing one of the latest logs of #PDP - Processor Design Principles I realised I could/should have more than one data register per address register.
For example, 4 address registers are linked to one data register, and 4 other address registers are linked to 4 data registers each. Advantages include :
- much faster and lighter call/return (it would emulate register windows sans their drawbacks)
- faster register save and backup
- better memory bandwidth
- solves loop unrolling
- fewer aliasing problems ?
- easier unaligned access in one cycle (by combining 2 pipelines for a 2R2W instruction)
- reduces the number of address registers and pointer update instructions
It looks like a weird mix between Itanium, SPARC and TMS9900... I have to ponder more about it but the overwhelming benefits are enticing.
I also have to find a proper behaviour for pointer aliases, should they trap ?
04/09/2017 at 08:38 •
The register-mapped memory system, pioneered by Seymour Cray in the CDC6600, has many benefits and I developed it in my own way in the #YASEP.
I used a different approach however, where the data register both reads and writes. The CDC had 5 data registers for reading and 2 for writing, with different code sequences.
For the YASEP, my approach makes perfect sense because the core uses onchip memory with single-cycle latency. You set the address register and by the next cycle, the data is read, whether you need it or not. Same for the #YGREC where the access time is "relatively immediate".
The F-CPU is a different beast though. We expect multi-level memory and going off-chip is a definitive performance killer. There must be a way to determine if a memory access is for reading or writing.
The CDC6600 version, with registers that are dedicated to functions, is not possible : there are already 16 registers, 1/4 of the total, and there is no room left, either in the register set or the opcodes (each bit is precious !)
The semantics of the address/data registers is to access the memory contents directly, through a buffer like a cache line, for example. Loading the cache line is the costly part.
How do I tell the CPU to load the cache line ? Simply by using/reading the data register. But doing this will stall the core if the cache line is not ready... This totally defeats the purpose of the system, where the memory's latency is hidden by explicit scheduling of the instructions: you calculate the address, schedule some instructions while waiting for the data to arrive, then read the register.
The costly part is the reading and it's too unpredictable : the data could be in the cache, in the local DRAM, on another CPU, or swapped out... So let's think in reverse and consider the write operation : the sequence goes like
- write the destination address to the address register
- write the data to the data register
The point of this log is to make a sort of atomic pair that the decoder can easily detect : the destination register must be the same (modulo one inverted bit) in two consecutive instructions (or closely enough, if the core can afford the comparators) to indicate that it's a write and the memory system shouldn't fetch the line right now.
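The comparison the decoder would perform can be sketched in Python. Which bit distinguishes an address register from its data register is not specified here, so `PAIR_BIT` is a purely hypothetical choice, and so are the function names. Only strictly consecutive instructions are compared, which corresponds to the single-comparator case.

```python
PAIR_BIT = 0b010000  # hypothetical: the one bit that differs between A and D


def is_write_pair(first_dest, second_dest):
    """True when two consecutive destinations differ only by the pairing bit."""
    return (first_dest ^ second_dest) == PAIR_BIT


def classify(destinations):
    """Flag each position where a destination is directly followed by its mirror."""
    return [is_write_pair(a, b) for a, b in zip(destinations, destinations[1:])]


flags = classify([0b000011, 0b010011, 0b000101, 0b000110])
```

Only the first two destinations form a pair here, so only that access would be treated as a write that does not need to fetch the line first.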
There are a couple of things to notice :
- the address register of one port must be located in the "mirror" globule of the data register so the two instructions (set address, write data) can be "emitted" at the same time (and only one comparator is required, no comparison across pairs and simpler decoder)
- the SIMD globules need their own data ports but the address must remain in the scalar globules. I should change the register map... but now each of the 4 globules has its own dedicated memory block, with a matching data size !
And now, due to the pairing rules, it seems obvious that all the address registers must reside in the first globule, or else the pairing rules must be more complex, but this creates an imbalance in the register allocation...
Honestly I dislike the new asymmetrical design because it makes the architecture less orthogonal and potentially less efficient to code for. Architecturally, you only had to design one globule (or two if you want SIMD) and there you go, copy-paste-mirror-connect and you're done.
But let's look at the new partition :
- First globule is specialised for addresses, has A0-A7, and its dedicated memory block accessed with D0 & D1. Since you write all the addresses there, the TLB is directly connected to its ALU's output. Aliasing is directly managed there. The 6 remaining registers can be used to store some pointers, frame base, indices and increments, as well as supplement the standard computations.
- Second globule has only 2 data registers, to access the dedicated cache. 14 working registers can do some meaningful work. No TLB here, it can be "replaced" by a barrel shifter/multiplier array/division unit
- The 3rd globule has 2 data registers, accessing the wider (but shallower) cache block, and does the heavy lifting (along with the mirrored 4th globule)
One big inconvenience is the single TLB that limits the throughput to only one load or store per cycle when we can emit 2 instructions and have 8 ports...