Tree balancing for IC

A project log for AMBAP: A Modest Bitslice Architecture Proposal

Trying to unify and simplify a minimal architecture for various implementation technologies...

Yann Guidon / YGDESYann Guidon / YGDES 03/07/2017 at 14:380 Comments

The previously described method for balancing a tree of MUX relays has reminded me the difficulty of creating a similar dual-MUX tree in FPGA where there is no memory blocks. I encountered this problem while designing the YASEP in VHDL for Actel's A3P family and the buffering was pretty complicated.

In ASIC/FPGA though the partition constraints are relaxed and slight unbalances can help the place/router put gates further or closer. Furthermore, the synchronicity of the arrival of all the signals ensures a more consistent timing. The elimitation of huge final fanout also brings a better overall operating speed.


The previous research dealt with the relay version of the register set and memory decoder, which I now realize are specific cases of the Knapsack problem. I'm now diving in the realm of Combinatorial optimization!

With ASIC or FPGA, we can relax some of the constraints and a little imbalance (10% or less) is not critical. We don't even have a constraint of having a power-of-two number of sinks because the gate inputs are not in series and tied in the middle. As a consequence, for an ASIC or FPGA, the original approach of rotation is easy and makes sense !

Let's examine the example of a 16-addresses register set, such as used by the #YASEP Yet Another Small Embedded Processor or the #F-CPU (v2). There are 4 address lines, and let's consider a 16-bits wide register (wider registers are just duplicates). The total fan-in is 16+32+64+128=240 for the 16×MUX16. The problematic fan-in of 128 (the last stage of the MUX) is reduced to four fan-ins of 240/4=60. This is slightlty faster (because there are less buffers to traverse, I estimate the gain to be equivalent to 1 gate propagation time due to the propagation in forward and reverse directions) and makes a more regular structure.

Actually the structure can be decomposed in groups of bits that are as wide as the number of address lines. For our 4-addresses register set, we can create groups of 4 bits, each with a fan-in of (1+2+4+8)=15 (per bit). For a full-custom design, these nibbles can be hand-optimised then duplicated as required, but it's also possible to just let the synthesizer&place&route decide how to allocate the 4 buffer lines. I should write some VHDL for this...



There is more than a way to interleave the control lines. For example, simple rotations have at least 3 versions:

    step=4    step=2    step=1
0   a b c d   a b c d   a b c d
1   a b c d   a b c d   b c d a
2   a b c d   b c d a   c d a b
3   a b c d   b c d a   d a b c
4   b c d a   c d a b   a b c d
5   b c d a   c d a b   b c d a
6   b c d a   d a b c   c d a b
7   b c d a   d a b c   d a b c
8   c d a b   a b c d   a b c d
9   c d a b   a b c d   b c d a
A   c d a b   b c d a   c d a b
B   c d a b   b c d a   d a b c
C   d a b c   c d a b   a b c d
D   d a b c   c d a b   b c d a
E   d a b c   d a b c   c d a b
F   d a b c   d a b c   d a b c
I don't even mention the direction (negative steps) or miroring (dcba instead of abcd).

Then there is the matter of "reversal" (using a different symmetry)...

0    a b c d
1    b c d a
2    c d a b
3    d a b c
4    d a b c
5    c d a b
6    b c d a
7    a b c d
8    a b c d
9    b c d a
A    c d a b
B    d a b c
C    d a b c
D    c d a b
E    b c d a
F    a b c d
All these combinations should be tried and tested to see which yields the best speed, considering that each FPGA or gate array has their own characteristics, fanout trees, etc.