Let's take a closer look at the ALU. The overall structure is actually fairly straight forward:
There are registers at the inputs, ALUA, ALUB and ALUC. From there, there are independent paths for the adder and other functions in order to keep capacitance as low as possible for the adder. The shift buffers (SHR and SHL) are placed after the OR function so either the ALUA or ALUB can be shifted by feeding a zero to the other input. Logical operations and shifts are both very fast so there is no issue having them in series. There is a dedicated left-shift buffer rather than using the adder to add a value to itself, as is commonly done. This is so we don't have to connect the A and B inputs of the adder together, which would once again add capacitance.
The R and C registers at the outputs of the ALU capture the ALU result and carry at the end of the cycle. There are paths that bypass these registers to recirculate R and C back into the ALU inputs. Thse are required when two inter-dependent ALU operations follow one after the other immediately. This is the case, for example, when adjusting the high-byte during address calculation.
Control signals going to the ALU are applied only at the outputs in order to select the desired ALU operation output. The control signals can therefore be generated without penalty *during* the cycle while the ALU itself is working. The Flags To Modify (FTM) register is used to capture Write-Enable control signals for each flag that must be updated. The flags are actually updated in the cycle following the ALU operation based on the R and C values. The A7, B6 and B7 hold the indicated bits from the A and B inputs and are used to evaluate the V flag.
The theory of operation for the ALU is that all inputs must be prepared and loaded into registers in the prior cycle. At the clock-edge, the ALU begins working immediately, and the results are captured into output registers at the very end of the cycle. The ALU is thus bracketed by registers on both sides, and can be neatly inserted as a pipeline stage into the datapath.
One thing to note is that the ALU does not invert the B input of the adder for subtract operations. Instead, the B input is inverted in the prior cycle. This manouver reduces the propagation delay through the adder and conveniently shifts the burden to the prior cycle -- which is typically a operand read of the SBC instruction. There is plenty of time to invert the operand on the way in from memory.
And that's a nice segue to the setup for memory:
In this design, memory too has dedicated registers, namely ADL, ADH, WE and DOR (Data Output Register). Just as with the ALU, these registers are also loaded in the cycle prior to the memory operation. The result of a memory read is clocked into a register also. but rather than using a dedicated register, the data read is placed directly into an appropriate internal register in the CPU (ALUB, ADL, ADH or IR).
This arrangement is very well suited to synch RAMs, which have registered inputs internally. When using synch RAM, ADL, ADH, WE and DOR merely act as shadow registers to the synch RAM's own internal registers. An asynchronous data bus can run at the outputs of ADL and ADH, where traditional RAM, ROM and other peripherals can operate as usual. Of course, very little time will be available for such peripherals in the normal cycle, so it is likely that all aynchronous I/O will be wait-stated (or buffered). More on that later.
Equipped with these registers, both memory and the ALU can be treated as pipeline stages. In both cases, we set up the inputs in one cycle, the operation is completed in the next, and the result is captured in registers at the end of the cycle. The critical path for the pipeline stage includes the CLK-to-Q delay of the input registers and Data-to-CLK setup time of the output registers. If the output is going directly into synch RAM internal registers (when using the ALU to calculate an address, for example), the setup time of the synch RAM must be met.
At 100MHz, only 6.5ns remain available to the ALU after the initial register tpd and before the synch RAM setup time. We’ll turn next to how to design the ALU internals to meet this time threshold.