The previously described method for balancing a tree of MUX relays has reminded me the difficulty of creating a similar dual-MUX tree in FPGA where there is no memory blocks. I encountered this problem while designing the YASEP in VHDL for Actel's A3P family and the buffering was pretty complicated.

In ASIC/FPGA though the partition constraints are relaxed and slight unbalances can help the place/router put gates further or closer. Furthermore, the synchronicity of the arrival of all the signals ensures a more consistent timing. The elimitation of huge final fanout also brings a better overall operating speed.

*20170317*

The previous research dealt with the relay version of the register set and memory decoder, which I now realize are specific cases of the Knapsack problem. I'm now diving in the realm of Combinatorial optimization!

With ASIC or FPGA, we can relax some of the constraints and a little imbalance (10% or less) is not critical. We don't even have a constraint of having a power-of-two number of sinks because the gate inputs are not in series and tied in the middle. As a consequence, for an ASIC or FPGA, the original approach of rotation is easy and makes sense !

Let's examine the example of a 16-addresses register set, such as used by the #YASEP Yet Another Small Embedded Processor or the #F-CPU (v2). There are 4 address lines, and let's consider a 16-bits wide register (wider registers are just duplicates). The total fan-in is 16+32+64+128=240 for the 16×MUX16. The problematic fan-in of 128 (the last stage of the MUX) is reduced to four fan-ins of 240/4=60. This is slightlty faster (because there are less buffers to traverse, I estimate the gain to be equivalent to 1 gate propagation time due to the propagation in forward and reverse directions) and makes a more regular structure.

Actually the structure can be decomposed in groups of bits that are as wide as the number of address lines. For our 4-addresses register set, we can create groups of 4 bits, each with a fan-in of (1+2+4+8)=15 (per bit). For a full-custom design, these nibbles can be hand-optimised then duplicated as required, but it's also possible to just let the synthesizer&place&route decide how to allocate the 4 buffer lines. I should write some VHDL for this...

*20170320*

Interleaving.

There is more than a way to interleave the control lines. For example, simple rotations have at least 3 versions:

step=4 step=2 step=1 0 a b c d a b c d a b c d 1 a b c d a b c d b c d a 2 a b c d b c d a c d a b 3 a b c d b c d a d a b c 4 b c d a c d a b a b c d 5 b c d a c d a b b c d a 6 b c d a d a b c c d a b 7 b c d a d a b c d a b c 8 c d a b a b c d a b c d 9 c d a b a b c d b c d a A c d a b b c d a c d a b B c d a b b c d a d a b c C d a b c c d a b a b c d D d a b c c d a b b c d a E d a b c d a b c c d a b F d a b c d a b c d a b cI don't even mention the direction (negative steps) or miroring (

**dcba**instead of

**abcd**).

Then there is the matter of "reversal" (using a different symmetry)...

```
0 a b c d
1 b c d a
2 c d a b
3 d a b c
4 d a b c
5 c d a b
6 b c d a
7 a b c d
8 a b c d
9 b c d a
A c d a b
B d a b c
C d a b c
D c d a b
E b c d a
F a b c d
```

All these combinations should be tried and tested to see which yields the best speed, considering that each FPGA or gate array has their own characteristics, fanout trees, etc..

## Discussions

## Become a Hackaday.io Member

Create an account to leave a comment. Already have an account? Log In.