
Register set again

A project log for YGREC8

A byte-wide stripped-down version of the YGREC16 architecture

Yann Guidon / YGDES, 03/07/2018 at 08:52

The pursuit of the Ultimate Register Set Structure progresses. I'm trying to make it more hierarchical and practical for a wider range of technologies (ASIC, FPGA, TTL, transistors).

I decided to use a parity bit for the register set and the memory. This increases reliability, and the 9th bit is already provided by the A3P FPGA anyway. I'm also settling on a 512-byte address space, whenever I can, to prevent aliasing issues (but the mapping can be controlled by some bits in the IO space at address 0).

The redesign of the register set uses bit slices again. 3 slices are grouped and 3 groups make the 9-bit-wide register set. This is near perfect from the fanout point of view and the structure is very easy to place and route.

Parity is in bit #4 to reduce wire lengths in FPGA and ASIC.
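To make the placement concrete, here is a minimal Python sketch of one possible 9-bit layout. The even-parity convention, the mapping of data bits 0..3 to word bits 0..3 and data bits 4..7 to word bits 5..8, and the function names are all assumptions made for illustration; the log only states that parity sits in bit #4.

```python
PARITY_POS = 4  # middle of the 9-bit word, as stated in the log


def parity8(byte):
    """XOR of the 8 data bits (even parity, an assumed convention)."""
    byte &= 0xFF
    byte ^= byte >> 4
    byte ^= byte >> 2
    byte ^= byte >> 1
    return byte & 1


def pack9(byte):
    """Build the 9-bit word with parity in the middle (assumed layout)."""
    byte &= 0xFF
    low = byte & 0x0F   # data bits 0..3 -> word bits 0..3
    high = byte >> 4    # data bits 4..7 -> word bits 5..8
    return low | (parity8(byte) << PARITY_POS) | (high << 5)


def unpack9(word):
    """Return (byte, ok); ok is False when the stored parity mismatches."""
    word &= 0x1FF
    byte = (word & 0x0F) | ((word >> 5) << 4)
    stored = (word >> PARITY_POS) & 1
    return byte, stored == parity8(byte)
```

With this layout, `unpack9(pack9(b))` gives back `(b, True)` for any byte `b`, and flipping any single bit of the 9-bit word makes the check fail.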


Each slice has 8 bits of addressable storage and two MUX8.

The two MUX8 can be either balanced (fan-in={1,3,3}) or not (the classical {1,2,4}); it makes no difference, since both add up to 7. When using circular permutation, each of the 6 address wires (2 MUX8 × 3 select lines) sees a fan-in of 7 in each group of 3 slices.
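The load balancing can be checked with a few lines of Python. This is only a sketch of the arithmetic: the per-select-line loads {1,2,4} and {1,3,3} are taken from the text as given, while the function name and the rotation scheme (slice number s shifts the wire-to-level assignment by s positions) are assumptions about how the circular permutation is wired.

```python
def group_load(per_slice_load, slices=3):
    """Total load seen by each select wire across one group of slices.

    per_slice_load[i] is the load that select level i presents inside
    one MUX8. Slice s rotates the wire-to-level assignment by s
    positions (assumed form of the circular permutation).
    """
    n = len(per_slice_load)
    totals = [0] * n
    for s in range(slices):
        for wire in range(n):
            totals[wire] += per_slice_load[(wire + s) % n]
    return totals


classic = group_load([1, 2, 4])    # unbalanced MUX8
balanced = group_load([1, 3, 3])   # balanced MUX8
# Without permutation, every slice loads the same wire the same way.
unpermuted = [3 * load for load in [1, 2, 4]]
```

Both `classic` and `balanced` come out as 7 on every wire, whereas the unpermuted classical arrangement would load the three wires with 3, 6 and 12.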

The storage part has more variations and options, depending on the technology.

For FPGA, the bits are made of DFF with enable. The clock must feed all 72 bits (8 registers × 9 bits) and the enable signal is split into 8 lanes, one for each register. No reset signal is required (despite complaints from the synthesiser). It's possible to go further by removing the Enable signal: the clock signal is split into 8 lanes instead, so yes, that's "clock gating"...

Even further: a DFF is made from a couple of latches clocked on opposite signals. The first latch of each bit in a lane can be "factored" to reduce parts count in a discrete system. Instead of 16 latches to store 8 bits, only 9 remain (we saved almost one half of the parts!), which is good for TTL, transistors, ASIC... but clock sequencing is more complex. This approach is a bit slower but also saves power, because the clock gating reduces the activity on the clock network by an 8:1 ratio.
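The parts count and the shared first latch can be illustrated with a small behavioural sketch in Python. The class name, the method names and the choice of which clock phase drives which latch are assumptions for illustration. The totals for the whole register set are extrapolated from the per-slice figures, assuming the same factoring is applied to all 9 slices.

```python
from fractions import Fraction

REGISTERS = 8  # one lane per register
WIDTH = 9      # 8 data bits + parity


def latch_counts(registers=REGISTERS, width=WIDTH):
    """Latches needed per bit slice and for the whole set (extrapolated)."""
    dff = 2 * registers      # a first and a second latch for every bit
    shared = 1 + registers   # one shared first latch, one second latch per bit
    return {
        "per_slice_dff": dff,
        "per_slice_shared": shared,
        "total_dff": dff * width,
        "total_shared": shared * width,
        "saved": Fraction(dff - shared, dff),
    }


class SharedMasterSlice:
    """One bit slice: 8 stored bits behind a single shared first latch."""

    def __init__(self):
        self.master = 0
        self.slaves = [0] * REGISTERS

    def clock_low(self, data_in):
        # Assumed phase: the shared first latch is transparent here.
        self.master = data_in & 1

    def clock_high(self, lane):
        # Only the selected lane receives a clock pulse (clock gating).
        self.slaves[lane] = self.master


counts = latch_counts()
bit_slice = SharedMasterSlice()
bit_slice.clock_low(1)
bit_slice.clock_high(lane=5)
```

Here `counts` gives 16 versus 9 latches per slice, a saving of 7/16, and after the two calls only lane 5 of `bit_slice.slaves` holds the new value.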


3 slices make a group, where the control lines get a circular permutation to balance their load. However, the 8 "enable" lanes would become all shuffled (and prove hard to route) if all the MUX8 were shuffled, so each of the slices must be routed correctly from the MUX8s to keep the right order of the latches.

The groups have a fan-in of 1 for each signal (except data input if there is a direct connection to the DFF). The 2×3 MUX8 driving lines get amplified by one buffer each.

On A3P, each group has an XOR3 at the data input to generate parity.


Then at the higher level, 3 groups are assembled to create a 9-bit register set. The fan-in of the MUX8 is only 3. For other technologies, the parity of the 8 data input bits is computed with a tree of XOR2 and the result is placed in the middle slice. The 8 latch enable lanes should be "straight" and easy to route.
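For the XOR2 variant, the tree can be sketched in Python to count gates and levels. The function names are hypothetical and even parity is assumed; the sketch only shows the generic pairwise reduction, not the actual netlist.

```python
def xor2_tree(bits):
    """Reduce a list of bits pairwise with 2-input XORs.

    Returns (parity, gate_count, depth).
    """
    level = list(bits)
    gates = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])  # one XOR2 gate
            gates += 1
        if len(level) % 2:
            nxt.append(level[-1])  # odd bit passes through to the next level
        level = nxt
        depth += 1
    return level[0], gates, depth


def byte_bits(byte):
    """Split a byte into its 8 bits, least significant first."""
    return [(byte >> i) & 1 for i in range(8)]


parity, gates, depth = xor2_tree(byte_bits(0xB2))
```

For 8 input bits the tree uses 7 XOR2 gates over 3 levels. The same function can serve as a model of the checks at the output ports, by feeding it all 9 bits and expecting a result of 0 under the even-parity assumption.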

Two other parity checks should be implemented at the output ports.
