Tagged registers

BEWARE !

That log might make you roll your eyeballs so much that you"ll see yourself cringing but it might also "solve" an old problem, in the sense that "solving" in engineerspeak means "moving the problem somewhere else". Hopefully it doesn't come back later to haunt you even more from its new residence.

Tagged registers were not present as such in FC0 but it contained "shadow" values, such as the zero flag. The LSB and MSB values are directly available in the register set (or a shadow copy somewhere else), so they could be used for conditional instructions. The Zero flag can be recomputed any time the register is written, either after a computation or context switch. This shadow mechanism is simple, cheap, non intrusive... Another register attribute bit would have contained the carry flag, but it was not considered "good" and FC0 preferred 2R2W instructions instead. It was still "fair game" though.

Now I'm thinking of something potentially much weirder than that, and I can already hear some of your sarcasm, of the kind "Why don't you go full iAPX32 while you're at it ?". I'll simply answer : "Because that's not the purpose", despite some temptations that are quickly swatted by the teachings of Computer Development History.

So let's consider our "Tagged Registers". To be honest, FC1 already has some "shadow flags" for some registers to help reduce checking bounds all the time, for example, when a pointer is stored, like when copying a valid address register to a common register : this would also propagate its TLB status as well, to prevent excessive TLB lookups and save some energy and time. This helps speed up jumps or passing parameters through function calls, for example. This again, this is "non intrusive" and hidden from the ISA, a mere convenience. If the shadow flag is invalid, you lost a few cycles and that's all.

But the ISA faces a growth crisis. The 2R1W instruction format already saves a little room by dropping 2 bits for one source register address, but that's still 6+4+6=16 bits and the opcode can only have 16 bits, of which 3 are eaten by the "size field". The basic decoding table for FC0 was:

000  scalar byte
001  scalar half-word
010  scalar word
011  scalar double word
100  SIMD byte
101  SIMD half-word
110  SIMD word
111  SIMD double word

Soon it appeared that this approach was still too limiting because you always need to handle something that won't fit these orthogonal yet arbitrary types. For example, some people will desire 128-bit integers, or 256 bits. DSP people love fixed point formats. Float formats abound and some "IA" (neural networks) algos want small numbers, while some crypto nerds wish for 512 bits. Whenever we define a size, it will be too large for some and too small for others.

Furthermore, most of the time, you will use only 2 or 3 types, which would reduce the use(fulness) of one bit. Wasting one opcode bit is baaaaaad.... One solution that was considered was to use a lookup table, which meant that you needed opcodes to read and set it, one by one... And since you can't count on people to agree, many of your function calls and returns would have to save that damned "Size LUT". What sounded simple and cheap seemed to implant its madness all over the code "just to be sure". And you would have to count on the LUT format to evolve !

Now, the "solution" I propose here might not be better, but it looks like a more serious alternative. You will still need dedicated opcodes, you will still need some hidden state that must be saved and restored upon all traps, but you will save a few opcode bits and the save-restore madness is gone ! So what is this so-called solution ? Add tags to each register to store the individual data type. This type could even be NULL and trigger a trap when this register is read as an operand.

The tag can be in machine-dependent format and is used by computational opcodes, that will obey the common-sense rules, such as: if you add two operands with a different type tag, trap, else store the result as the same type. Some opcodes could even change the type or inquire it.

The type will contain:

the scalar/SIMD/vector attribute,
integer/fixed point/floating point/whatever attribute,
the granularity (how many bits per element)
the overall size of the data (?)
if the register contains a valid value

They are set by :

load constant opcodes (explicit)
load from memory (explicit)
Type conversion opcodes (explicit, modify the register's content, for example float to integer)
Type casting opcodes (explicit, doesn't touch the register's contents)
computations/operations (implicit)

They are read by :

computations/operations
type read opcodes (introspection)
trap register backup mechanism

The good :

The code will be more flexible, more scalable
vectoring code would be easy, and the scalar version would still work/execute.
You don't need a separate FP or vector register set (though this could be done in practice, and you might limit the allowed types for certain registers, such as addresses)
One computation opcode for MANY cases ! Change the register's size and the code remains the same.
Not too intrusive when dealing with casts and conversions, if you have proper coding hygiene, unlike the LUT idea.
Better type safety, maybe better security (even though each new feature adds complexity, bugs, misunderstandings and potential flaws).

The bad :

Now the opcode table must specify which types it accepts, when/why it traps
Should the opcodes trap on any type mismatch or autopromote/autocast types à la JavaScript ?
Yes, that's more trap sources/reasons. Who wants more traps ? not me.
That's also more opcodes to explicitly handle.
You have to handle a type NULL ! and never mind the "whatever" needed for storing in the tagged format.
This adds LATENCY ! This is the part that bugs me the most. You don't know at instruction fetch time if the operation is integer or FP, you need to read the register set and check the respective operand types before you can reserve or allocate an execution unit (though the instruction words could be "shadow tagged" in the cache to help loops). This wouldn't be a huge drawback with OOO though.

The overall :

It shouldn't be such a crazy undertaking for mundane code, because in C for example, the source code explicitly declares all types in advance and it would be redundant (hence space-wasting) to repeat the size all the time in the opcodes. The compiler already tracks the register widths and handles explicit type conversions. With hardware size tags, things will be a bit more funky but this will also help save other opcodes for SIMD or vectors operations, because you don't need to create a separate opcode that adds a scalar to each element of a vector (for example).

Of course, overall, the architecture moves further away from its squarely RISC roots but FC1 is also meant to be a good general-purpose number cruncher that could handle a wide variety of workload types (that are not well handled already by the #YASEP and the #YGREC8). Keeping the type separate from the opcode leaves many doors open without the risk of saturating the opcode table. You could add LNS ("Logarithmic Number System"), complex or quaternions, matrices or whatever, and the same old ADD opcode would still work. The work would be directed to the appropriate execution unit.

Another interesting effect, by saving opcodes (both from the 3 bits saved and the deduplication for each type) is that not only it has more opcodes to enjoy but also more bits for the register addresses ! If 3 bits are saved then I can bump the register set's size to 4×32 ! This would use "only" 7+5+7=19 bits :-) And there would still be 13 bits available for 2R1W operations.

PS: This thought process was rekindled by http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-June/003142.html ...

BEWARE !

They are set by :

They are read by :

The good :

The bad :

The overall :

FC1 : the memo

Another cautionary tale... from Apple

Discussions

Become a Hackaday.io Member