Some of the gain comes from sharing one input signal with two AND gates. This and a few other things make me consider the legitimacy of increasing the size of the LUT that LibreGates uses in the advanced mode. It's a seducing idea but... Where is the limit ? How many inputs will be enough ?
As noted in the #analoglib project, gates with a maximal number of traversed transistors of 3 can have up to 3*3=9 inputs, though this extreme case is not very practical due to routing constraint in an ASIC cell. But this is way above the 4 inputs that the current system uses now. And CLA3 is a useful gate, there may be others.
So let's say, the maximum size is 8 inputs. That's a LUT with 256 inputs, which also means 256 test vectors to generate for just one gate, which is not reasonable. Even the CLA3 gate, with the 5 inputs, requires 32 vectors.
Fortunately this is easy to break down, as shown in the following circuit. The ANDs can be grouped in a LUT4 and OR-combined with a OR2 with the 5th input. That's a total of 16+4=20 LUT entries, hence 20 test vectors.
Now something gets in the way of this pretty picture : this composite gate is considered as 2 gates of delay/depth 1 each so if I create a macrogate, it will be seen as depth 2, while the equivalent cell has a delay of (mostly) 1. So the macrogate will give relevant results for the ATPG but the circuit complexity or depth will be misleading.
The solution I have found so far is to add a generic to each gate that gives a slightly more accurate idea of the relative complexity and delay. In the macrogate, the gates can be given smaller values than the individual gates alone, thus giving an overal sum that is closer to the real thing.
Of course the backwards metagate has a delay of 0 but all the others must start with an arbitrary constant cost of say, 1. An inverter adds 1 pass transistor so the total delay is 2. NAND2 has a total of 3 and NAND3 has a delay of 4.
CLA3 has a maximum of 3 transistors in series, so it's 1+3=4, but there is an added inverter. Since it does not go out of the cell, the total cost is only 5. All of this is arbitrary of course but depending on the way the delays are estimated, it could give a better estimate of the performance of a circuit. And we can add the fanout to the estimates as well.