Digging through various .lib gathered from around the 'net, it appears that sequential cells (flip-flops, transparent latches and set-reset latches) are often described with the Q and /Q outputs simultaneously. There is some advantage to this :
- You may want to decrease the output fanout so you split it over two related outputs
- You may want to ease bubble-pushing, and the dual/complementary output offers a convenient "conversion" point with some sort of "free inverter".
So I'll have to find a way to include that in the code...
Furthermore, I often get annoyed by the redundancy that is often found inside cells, which makes it harder for the synthesiser to factor some signals, in particular inverters : you can find them in many gates such as XOR, MUX and even DFF. Often, these signals are driven by a common source, such as a clock net, a control signal, and spread to tens of gates (let's say N). So the common signal is sent to N gates with a fan-in of 2 because it must drive the gate and the inverter. Such a fanout creates delays.
Sending a pair of complementary signals lowers the fanout/fanin constraint, so it should be 2× faster (up to...). It also makes place&routing more complex, particularly if it's a dumb algo that doesn't care about the geometry...
But this makes the case for "grouped cells" with one clock or control input signal which is locally amplified to 2 or 4 neighbouring cells. I have seen this recently in a paper or something...
Again, this is an optimisation step, where energy, space and time are progressively enhanced by making some custom cells to solve some local inefficiencies.
Meanwhile I also see some cells with more than 4 inputs.
Currently the system uses up to 4-input LUTS, and quite a bit of obscure code depends on this. But I have seen some larger combinatorial gates such as MUX4, and the adder uses a macrogate containing AND2, AND3 and OR3 with 5 inputs. A LUT5 has 32 entries, while the individual gates total 4+8+8=20 entries, so testing them requires fewer test vectors than a custom macrogate. OTOH the macrogate saves interconnect/routing resources, since it knows how to keep some traces directly at the cell level, without going through the tortuous metal layers.
I guess a cell would need to be crazy complicated or incredibly more efficient to justify going to LUT6. Even LUT5 would require some rewrites here and there. I also see some cells FA and HA : Full Adder and Half Adder, with 2 or 3 inputs and 2 outputs. This last case is a bit easier to handle since each "macrogate" can be "splitl" into two 1-output gates :
- half adder is just a AND and a XOR so it can be described as these two gates internally
- full adder is a XOR3 and a MAJ3 which can also be built from individual gates.
- Same idea with MUX4 : it can be internally built from three MUX2 at least for the purpose of test vector generation.
Anyway, with @alcim.dev the exploration and design of libraries of gates is progressing.