The previously described method for balancing a tree of MUX relays has reminded me the difficulty of creating a similar dual-MUX tree in FPGA where there is no memory blocks. I encountered this problem while designing the YASEP in VHDL for Actel's A3P family and the buffering was pretty complicated.
In ASIC/FPGA though the partition constraints are relaxed and slight unbalances can help the place/router put gates further or closer. Furthermore, the synchronicity of the arrival of all the signals ensures a more consistent timing. The elimitation of huge final fanout also brings a better overall operating speed.
The previous research dealt with the relay version of the register set and memory decoder, which I now realize are specific cases of the Knapsack problem. I'm now diving in the realm of Combinatorial optimization!
With ASIC or FPGA, we can relax some of the constraints and a little imbalance (10% or less) is not critical. We don't even have a constraint of having a power-of-two number of sinks because the gate inputs are not in series and tied in the middle. As a consequence, for an ASIC or FPGA, the original approach of rotation is easy and makes sense !
Let's examine the example of a 16-addresses register set, such as used by the #YASEP Yet Another Small Embedded Processor or the #F-CPU (v2). There are 4 address lines, and let's consider a 16-bits wide register (wider registers are just duplicates). The total fan-in is 16+32+64+128=240 for the 16×MUX16. The problematic fan-in of 128 (the last stage of the MUX) is reduced to four fan-ins of 240/4=60. This is slightlty faster (because there are less buffers to traverse, I estimate the gain to be equivalent to 1 gate propagation time due to the propagation in forward and reverse directions) and makes a more regular structure.
Actually the structure can be decomposed in groups of bits that are as wide as the number of address lines. For our 4-addresses register set, we can create groups of 4 bits, each with a fan-in of (1+2+4+8)=15 (per bit). For a full-custom design, these nibbles can be hand-optimised then duplicated as required, but it's also possible to just let the synthesizer&place&route decide how to allocate the 4 buffer lines. I should write some VHDL for this...
There is more than a way to interleave the control lines. For example, simple rotations have at least 3 versions:
step=4 step=2 step=1 0 a b c d a b c d a b c d 1 a b c d a b c d b c d a 2 a b c d b c d a c d a b 3 a b c d b c d a d a b c 4 b c d a c d a b a b c d 5 b c d a c d a b b c d a 6 b c d a d a b c c d a b 7 b c d a d a b c d a b c 8 c d a b a b c d a b c d 9 c d a b a b c d b c d a A c d a b b c d a c d a b B c d a b b c d a d a b c C d a b c c d a b a b c d D d a b c c d a b b c d a E d a b c d a b c c d a b F d a b c d a b c d a b cI don't even mention the direction (negative steps) or miroring (dcba instead of abcd).
Then there is the matter of "reversal" (using a different symmetry)...
0 a b c d 1 b c d a 2 c d a b 3 d a b c 4 d a b c 5 c d a b 6 b c d a 7 a b c d 8 a b c d 9 b c d a A c d a b B d a b c C d a b c D c d a b E b c d a F a b c dAll these combinations should be tried and tested to see which yields the best speed, considering that each FPGA or gate array has their own characteristics, fanout trees, etc.