When the Y8 is integrated in a chip as a building block of a SoC or a microcontroller, the IO space implements some pin-controlling registers, usually named GPIO. So you have configuration registers, a read register and an output register... This last one is often the tricky one.
In the simplest cases, only a direct output from a DFF or latch is implemented. This increases the code size and execution time because when you want to modify specific bits, you first have to read the previous port state (or read it from a cached copy somewhere), then mask out the unwanted bits and/or OR in the others, then finally write the result. And code space is often at a premium, particularly with only 256 addressable instructions!
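As a rough sketch of that read-modify-write sequence, here is a small Python model. The `PlainPort` class, its 8-bit width and the `modify_bits` helper are hypothetical names made up for illustration, they are not part of the Y8 IO map. The access counter just shows that changing a few bits costs one read plus one write, on top of the masking instructions.

```python
class PlainPort:
    """Hypothetical plain output port: one latch, no SET/CLEAR aliases."""

    def __init__(self):
        self.latch = 0      # assumed 8-bit output latch
        self.accesses = 0   # counts every port read and write

    def read(self):
        self.accesses += 1
        return self.latch

    def write(self, value):
        self.accesses += 1
        self.latch = value & 0xFF


def modify_bits(port, set_mask=0, clear_mask=0):
    """Read-modify-write: read, mask out, OR in, write back."""
    value = port.read()
    value &= ~clear_mask & 0xFF   # drop the bits to clear
    value |= set_mask & 0xFF      # raise the bits to set
    port.write(value)


port = PlainPort()
port.write(0b10100000)
modify_bits(port, set_mask=0b00000011, clear_mask=0b10000000)
print(f"{port.latch:08b}", port.accesses)
```

With dedicated SET and CLEAR addresses, the same change would need no read and no masking at all, only writes.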
Some more modern chips provide alternate addresses for the output registers, providing additional features such as SET, CLEAR and TOGGLE functions. I start with SET and CLEAR because that's what I was discussing lately. They are implemented straightforwardly by a Set/Reset flip-flop using only 2 NAND2 gates, or 8 transistors in CMOS (2 in RTL/DTL/CTL).
So I take a basic S0R0 flip flop and add two NAND gates to selectively enable the clear and the set. This way, you just write a 1 to the bits you want to clear or set on the given port. Try by yourself, it's easy :-)
Total : 4 NAND2 gates (16 transistors per pin), and they can even be paired to use smaller footprints with standard cells.
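To check the idea without wiring anything, here is a minimal gate-level sketch of one pin in Python. It assumes the two extra NAND2 gates drive the active-low inputs of a cross-coupled NAND latch, which is my reading of the description above. The function names (`nand`, `pin_cell`) are made up for the example, and the latch is settled with a simple unit-delay iteration rather than a real timing model.

```python
def nand(a, b):
    """2-input NAND on 0/1 values."""
    return 0 if (a and b) else 1


def pin_cell(q, nq, data, set_sel, clr_sel):
    """One output pin: 2 gating NAND2 + 2 cross-coupled NAND2 (4 gates)."""
    s_n = nand(data, set_sel)   # active-low set, only when DATA=1 and SET_SEL=1
    r_n = nand(data, clr_sel)   # active-low clear, only when DATA=1 and CLR_SEL=1
    for _ in range(4):          # let the cross-coupled pair settle
        new_q = nand(s_n, nq)
        new_nq = nand(r_n, new_q)
        if (new_q, new_nq) == (q, nq):
            break
        q, nq = new_q, new_nq
    return q, nq


state = (0, 1)                                          # pin starts low
state = pin_cell(*state, data=1, set_sel=1, clr_sel=0)  # write 1 to SET
after_set = state
state = pin_cell(*state, data=0, set_sel=0, clr_sel=1)  # write 0 to CLEAR: no effect
after_noop = state
state = pin_cell(*state, data=1, set_sel=0, clr_sel=1)  # write 1 to CLEAR
after_clear = state
print(after_set, after_noop, after_clear)
```

A bit written as 0 leaves both latch inputs inactive, so only the bits written as 1 are affected.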
Note that the DATA signal is latched from the read port of the Y8's register set, so it is pretty stable for a while (until a new instruction is fetched). The _SEL signals get decoded and take a bit longer to come alive, and they are only short strobes after the data has settled (otherwise it's hell, you have to distribute the main clock signal everywhere...).
The more ordinary "copy" function, though, is a bit more complex, but having the 3 functions "copy", "set" and "clear" amounts to implementing a transparent latch with clear and set. So I'm mostly reinventing the wheel... with the small detail that the clear and set must be controlled by the input data (which acts as an "enable" pin), so it does not conform to a classic standard cell. Thus, let's dive in and have a look at the circuit made with CircuitJS:
The "copy" function is quite easy for the "set" half: it is identical to the "set" function, so it's managed at the decoder level.
But the system must also copy a zero value, and that's the tricky part. It requires two more AND gates and an inverter.
The structure with the 2 NAND2 converging to the AND is reminiscent of an XOR structure, except there are 3 inputs instead of 2. But the 3 gates could be merged into a single standard macrocell.
Note also that the CPY_SEL signal can be bubble-pushed so the inverter on the data disappears. But this creates an OR which needs its own inverter anyway...
There is this alternative version with some bubble-pushing, still using a non-inverting gate (AND) which has its own inverter. But the total number of inverters has been cut in half. Sigh.
For each bit of the port, there are 2 pairs of gates that can each be merged into a single, more complex standard cell, and 2 gates that remain lonely... That's a total of 26 CMOS transistors, and a bit more for a general reset. To implement the general reset, one needs an R1S1 cell instead and has to rebuild everything, but the general reset signal will then do its work cleanly.
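Here is a hedged Python sketch of the first (not bubble-pushed) version with "copy" added, to check the truth table. The exact gate arrangement is my interpretation of the text: the set half reuses the set NAND (the decoder raises the set strobe on a copy access), while the clear half combines two NAND2 through an AND, with an inverter on DATA for the copy-zero path. All names (`output_cell`, `write_bit`, `cpy0_n`...) are invented for the example.

```python
def nand(a, b):
    return 0 if (a and b) else 1


def and2(a, b):
    return 1 if (a and b) else 0


def inv(a):
    return 0 if a else 1


def output_cell(q, nq, data, set_sel, clr_sel, cpy_sel):
    """One port bit with set, clear and copy (assumed 7-gate arrangement)."""
    # Decoder level, outside the cell: a copy access also raises the set strobe.
    set_strobe = 1 if (set_sel or cpy_sel) else 0
    s_n = nand(data, set_strobe)         # set, or copy of a 1
    clr_n = nand(data, clr_sel)          # clear when DATA=1
    cpy0_n = nand(inv(data), cpy_sel)    # copy of a 0 (inverter + NAND2)
    r_n = and2(clr_n, cpy0_n)            # either path resets the latch
    for _ in range(4):                   # cross-coupled NAND2 latch settling
        new_q = nand(s_n, nq)
        new_nq = nand(r_n, new_q)
        if (new_q, new_nq) == (q, nq):
            break
        q, nq = new_q, new_nq
    return q, nq


def write_bit(state, data, set_sel=0, clr_sel=0, cpy_sel=0):
    """Apply a short strobe with stable DATA, then release it."""
    state = output_cell(*state, data, set_sel, clr_sel, cpy_sel)
    return output_cell(*state, data, 0, 0, 0)


low, high = (0, 1), (1, 0)
print(write_bit(low, 1, cpy_sel=1), write_bit(high, 0, cpy_sel=1))
```

In this model a copy forces the bit to the DATA value in both directions, while set and clear still ignore the bits written as 0.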
There is a function that is one order of magnitude larger to implement : the "toggle" function requires storing the last value in the port, which needs a full-blown DFF. It's sad and it's something I wish the Raspberry Pi implemented to make #SPI4C practical to code.
Anyway, the current features are already nice and compact, decoding is rather simple, and the timing should be good as long as DATA remains stable before and after the x_SEL pulse. I could add a state in the FSM to ensure it, at worst.
Otherwise, you need this beast for each pin :
This means many more transistors, more high-fanout signals, more clock and activity everywhere... more power just to add a small "toggle" feature. So I think that the previous asynchronous circuit is a decent compromise.
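To illustrate why toggle wants an edge-triggered element, here is a small behavioral sketch in Python. Both pieces are simplified assumptions, not the actual circuit: `ToggleBit` models a DFF whose D input is Q XOR (DATA AND TGL_SEL), and `transparent_toggle` models a level-sensitive loop in a crude unit-delay way, where the inverted output keeps feeding back for as long as the strobe is high. `TGL_SEL` is a hypothetical strobe name in the style of the other _SEL signals.

```python
class ToggleBit:
    """Clocked port bit: flips on a clock edge when DATA and TGL_SEL are 1."""

    def __init__(self, q=0):
        self.q = q

    def clock_edge(self, data, tgl_sel):
        # The DFF samples Q xor (DATA and TGL_SEL) exactly once per edge.
        self.q ^= data & tgl_sel
        return self.q


def transparent_toggle(q, gate_delays):
    """Level-sensitive loop: the bit flips once per gate delay while strobed."""
    trace = []
    for _ in range(gate_delays):
        q ^= 1
        trace.append(q)
    return trace


bit = ToggleBit()
edges = [bit.clock_edge(1, 1), bit.clock_edge(0, 1), bit.clock_edge(1, 1)]
print(edges, transparent_toggle(0, 4))
```

In the clocked model the bit flips once per write, whereas in the transparent model the final value depends on how long the strobe lasts, which is not usable.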
I hope it's less ambiguous now :-)
And each output block with 7 gates can be reduced to 3 or 4 standard logic cells by grouping some functions: