11/12/2020 at 03:25 •
I wanted to touch on a final aspect of this pipeline design and the specific problems it helps to overcome. I struggled a bit with this explanation, so apologies in advance for any confusion. I’ll be happy to try to clarify so please just ask. Here it goes ...
Unlike the atomic instructions we find in a traditional RISC pipeline, even basic operations in this design are spread over two microinstructions. For example, a typical RISC style add operation, like this:
add r1, r2, r3
takes two microinstructions to specify in this pipeline, corresponding to the 6502 FetchOperand and FetchOpcode bus cycles, like this:
ALUin(A, DB, C); PC += 1 # FetchOperand A := ALUop(ADD); IR := DB; PC += 1 # FetchOpcode
The first microinstruction loads the inputs of the ALU and the second performs the ALU operation itself. To be clear, the RISC form of the instruction would execute the same sequence of steps with respect to the ALU as it works down the pipeline. But it remains an atomic unit that describes only a single operation.
By contrast, a single microinstruction in this design can specify the ALUop for one operation and the ALU inputs for the next. For example, during indexing, we might see the following microinstruction:
ADL := ALUop(ADD); ALUin(ADH, 0, Cout)
This microinstruction completes the sum for the low-byte of the address (ADL) and also sets up the ALU inputs to adjust the high-byte (ADH). Because of their dual-function, only one of these microinstructions is required to manage the activity across both the DECODE and EXECUTE stages.
As a bonus, that one microinstruction can also specify whether any ALU outputs need to be recirculated (as is the case with Cout in the microinstruction above).
Most importantly, though, the arrangement allows the pipeline to avoid control stalls. To see how, consider the effect of a FetchOpcode operation on the pipeline. A FetchOpcode causes microinstructions for a new opcode to begin to be fetched from a new location. In that sense, we can think of the opcode as an address and of FetchOpcode microinstructions as unconditional branches to opcode "subroutines". From the point of view of the microinstruction stream, a FetchOpcode is in fact an unconditional branch.
And just like all pipelined branches, FetchOpcode invalidates any instructions that have already been pre-fetched into the pipeline at the time it executes. This is effectively a "branch delay slot". With a traditional pipeline, there are two such invalid instructions, one in the FETCH stage and another in the DECODE stage. In this case we can use the opcode itself in place of the microinstruction in the FETCH stage. But the microinstruction in the DECODE stage has to be discarded. Left as is, the pipeline would stall.
This is where the "split" microinstructions come in handy. Since there is only one microinstruction active for both DECODE and EXECUTE, we can take the opcode and just keep going. No control stall is triggered and no extra cycles added to the processing as a result.
Alright, with that final gremlin banished, here now is a high-level block diagram for this CPU.
11/04/2020 at 23:03 •
Let's now take a closer look at the pipeline in this design. The objective is to reduce the cycle-time while keeping the cycle-count fixed. The critical path in the CPU falls squarely on the ALU, and associated pre- and post-processing. Rather than cramming all this into one cycle, the basic strategy is to push pre-processing to the prior cycle, and post-processing to the next. This allows the ALU to have the whole cycle to itself, giving us the headroom we need to boost the clock-rate.
Pre-processing here refers to the work required to set up the inputs to the ALU with appropriate values. That seemingly innocuous task takes a surprising amount of time -- we have to fetch microcode, decode control signals, select source values and output-enable the approriate registers. Post-processing, on the other hand, refers to updating the status flags and writing to the destination register. Rebalancing this workload around the ALU, we end up quite naturally with a four-stage pipeline, as follows:
We have FETCH, DECODE, EXECUTE and WRITEBACK -- the idea is to perform a roughly equal amount of work at each stage and then to pass the baton to the next. Along the way, we capture intermediate results in pipeline registers. Specifically, we have the Microinstruction Register (MIR) after the FETCH stage, we have ALUA, ALUB and ALUC registers at the ALU inputs and we have the R register at its output. The FTM (Flags To Modify) and RTM (Registers To Modify) registers direct the WRITEBACK stage regarding which flags and destination register to update. (More on the WRITEBACK stage below.)
Memory operations using "flow-through" synch RAMs are a good fit for this arrangement. A key feature of these RAMs is that we can clock an address into the RAM's internal registers then read the data value from its outputs before the next clock edge occurs. The ADL and ADH registers allow the pipeline to work in this same way with asynchronous peripherals. For writes, there is also the WE register and a Data Output Register (DOR).
As we've discussed before, the ALU features a "recirculate" path to allow the result to be fed back into its inputs. This is done during address calculation, for example, when the ALU result is immediately required in the next cycle. Memory reads are also recirculated, as either ALU operands or addresses to be used in the next cycle.
The WRITEBACK stage calculates the flags based on the ALU result, updates the P register according to the FTM, and writes the result to a destination register according to the RTM.
One important thing to highlight is that the WRITEBACK stage writes to registers using a mid-cycle rising clock-edge (PHI2 rising edge). Meanwhile, registers are always sampled at the end of the cycle (PHI1 rising edge). This discipline ensures that we always get an up to date value when a given register is being read and written to in the same cycle. For example, the P register may be updated in the same cycle that a branch test is being executed. Delaying the branch test until the second half of the cycle ensures that the branch test evaluates correctly.
Beyond allowing enough time to calculate the flags, a separate WRITEBACK stage allows the R register to neatly buffer the ALU from the rest of the CPU's internal registers (and the added bus capacitance they would impose). There are over ten destinations for the ALU output, all of which would add unnecessary delay to the ALU's critical path were they connected directly (10 loads x 3pF per load x 50Ω + 6" trace delay = 2.5ns).
Finally, we should note that the DECODE stage must receive a fresh instruction every cycle in order for the pipeline to function smoothly. To begin with, FETCH retrieves a new opcode from main memory (or simply generates a BRK on a CPU reset) and feeds it to DECODE stage via the Instruction Register (IR).
Thereafter, FETCH will retrieve microinstructions associated with that opcode from the microcode store, one per cycle, and feed them to the DECODE stage via the the Mircoinstruction Register (MIR). Attachment:
Once we reach the end of the current opcode, a new opcode is fetched and the sequence repeats again. The DECODE stage, meanwhile, always delivers appropriate control signals for downstream pipeline stages, whether by decoding the opcode in one cycle or a microinstruction in another.
And that's it. We'll take a look at how this pipeline executes cycle-accurate 6502 instructions in a future post. For now, the main thing to note is that this is a relatively simple pipeline that still packs a punch in terms of performance. By way of comparison, the critical path on this pipeline is about 20ns long (50MHz) as compared to 50ns (20MHz) on the C74-6502 -- that's assuming similar components in both cases; ie, AC logic for the ALU and CBT logic for tri-state buffers. The hope of course is that faster components and further optimizations (like the FET Switch Adder) will enable us to double the clock-rate yet again and reach the 100MHz milestone. Only time will tell whether we'll manage to get there.
P.S. Many thanks to Dr Jefyll for helping to clarify and edit this description. It is much better for it. Thanks Jeff!
10/26/2020 at 16:27 •
The basic method for Decimal Mode is to perform an ADD or SUB operation, and then convert the result to BCD. The process is to work on each nibble in turn, as follows:
Adder LO --> Detect LO --> Generate LO --> Adjust LO --> BCD result LO
Adder HI --> Detect HI --> Generate LO --> Adjust HI --> BCD Result HI
Detect_LO tests to see if the lower nibble needs to be adjusted. This would be the case if the the binary result is greater than 9, or if the low-nibble carry (C4) is high. To adjust an ADD result, Generate_LO will generate a 6 (or 0 if no adjustment is needed) which is then applied to the binary result by Adjust_LO. Generate_LO will also generate a BCD low-nibble carry (BCDLC) in that case. The process is the same for the upper nibble, except that BCDLC must be added to the upper nibble result. The same logic holds for subtraction, except that Generate_LO and HI will produce a $A rather than a 6 to perform the adjustment.
Now the binary adder alone consumes the entire cycle at 100MHz, so Decimal Mode at high speed will need to take two cycles to complete (like it does on the 65C02). A happy consequence of this is that we can use the ALU adder for both the original binary operation and the subsequent adjust operation. To do so we feed the result of the initial binary addition back into the ALUA input, and feed an appropriate Adjust Value into the ALUB input for each nibble.
Because the binary result for the lower nibble emerges from the adder early in the initial cycle, we are able to generate the lower nibble Adjust Value in the same cycle, like this:
Cycle 1: Adder LO --> Detect LO --> Genereate LO --> ALUB
Cycle 2: ALUB LO --> Adder LO (B input) --> BCD Result
The high nibble, on the other hand, is not ready until the very end of the initial cycle. We must therefore generate the Adjust Value for the high nibble in the second cycle, like this:
Cycle 1: Adder --> ALUA
Cycle 2: ALUA HI --> Detect HI --> Generate HI --> Adder HI (B input) --> BCD Result
This will work, as long as the high nibble Adjust Value can be generated quickly. Adding an alternate path to the B input of the adder will add capacitance, but only minimally so and only to the high order bits of the carry-chain where we can tolerate some delay.
Thanks to Dr Jefyll and ttlworks, the BCD adjust circuit in the C74-6502 is very fast already, and we can adapt it for our purposes here. This circuit produces results that are compatible with the NMOS 6502 for both decimal and non-decimal inputs. It uses FET Switches for time critical logic. With a little rejigging, we can adapt it to work in this new design, as is shown in this rough schematic:
The high-nibble Adjust Value is generated by four FET Muxes in series (BCD.DETECT.HI, BCD.DETHI.AUX, BCD.SEL.HI and ALUB.SEL). This value is then fed into the high-nibble of the FET Adder. Earlier tests showed that CBTLV switches took about 1ns longer than AUC parts in the carry chain. The Adjust Value path is therefore likely to delay the adder result by that margin as well. Thankfully, because the results of Decimal Mode operations are never used as addresses, the Adjust Value path does not have to meet the 1.5ns setup time of the synch RAM. We therefore should have just enough extra time for this path to work.
In order to remove from the adder the delay associated with the BCD carry, it’s easiest to break the carry chain at C4 and perform to separate adds for the low and high nibbles. The BCD carry can then be added in at the end as bit 0 of the high-nibble Adjust Value. In order to make this work, Detect_HI must adjust the threshold to test for > 8 for addition and < $F for subtraction. The ADJ1 and ADJ7 values that are input to BCD.DETECT.HI achieve that in the schematic above.
We can separate the FET carry chain at C4 without adding capacitance by using the INH pin on the 74AUC2G53 C4 IC. An alternate C4' tied to GND can push a zero into the carry chain as needed. Both C4 and C4' can be switched before the ripple carry arrives if the control signal is generated early in the cycle. A 75AUC1G74 that is pre-loaded in the cycle ahead of the ALU operation can generate active-low and active-high control signals to make the switch. (The BRK.CARRY signal going to the FET Adder in the schematic illustrates that function).
One final note regarding flag evaluation: we can use the final BCD adjusted result to obtain the correect results for the N, Z and V flags. This behaviour is compatible with 65C02 and the 65816 CPUs. The NMOS 6502, on the other hand, calculates the flags based on the original binary sum, but with the BCD low-nibble carry (BCDLC) added in. Since this value is no longer calculated by the ALU adder, we can include a simple 4-bit incrementer to add BCDLC to the upper nibble of the binary sum. This would be done during the second cycle of the BCD operation.
As will likely be the case with everything in this design, we meet the required timing for this circuit only by the skin of our teeth. It will be impossible to know whether we will reach the target clock-rate until the whole CPU is built. For now, I am doing my best to account for even the smallest delays, and have taken to including clock-skew and trace propagation delay in my estimates of the critical path. That will give me some idea of which components will need to be near each other in the final layout. At these speeds, just getting signals from one side of the board to the other is going to be a challenge!
P.S. Wait, hold the phone!
The carry chain is in fact split at C4 for the initial binary add as well. That means that the low and nigh nibbles emerge from the adder at exactly the same time, early in the cycle! So we can in fact make a start on the high-nibble Adjust Value in the initial cycle as well, and ease the time crunch on the second cycle. We only really have to capture the BCDHC and /BCDHC and we can generate the Adjust Value in the second cycle from that. Nice.
Phew! Feels good to come across a couple of unclaimed nanoseconds — it’s like finding an open parking spot downtown. :)
10/18/2020 at 03:21 •
I first tried a 16-bit FET carry chain in series just to see what kind of delay we might see. (For reference the test board is configured as follows: R2, R4, R5, R9, R11, R12, R14 and R8 are open and R1, R3, R6, R7, R10, R13 and R15 are closed .. schematic here). Here are the results:
- @ 2.5V, 2.2MHz ==> 14.2ns
- @ 2.7V, 2.4MHz ==> 12.9ns
- @ 2.8V, 2.5MHz ==> 12.5ns
- @ 3.3V, 2.9MHz ==> 10.7ns
As expected, a serial 16-bit FET carry chain is much too slow. The incrementer result in the CPU will be fed directly to the synch RAM (when incrementing PC for example), so the setup time of 1.5ns applies here as well. Add to that the tpd for the source register, some transit time, clock skew, etc. and we're pretty much left with about 6.5ns for the incrementer (just like with the adder).
So, the next step was to try carry lookahead. Four levels of AND gates on this board simulate carry lookahead for the first 12 bits of the incrementer. In the test circuit, the lookahead carry is then fed to four FET switches to simulate incrementing the final four bits. In this case, we don't have to include the switch time in the circuit since that happens concurrently with the carry lookahead.
So, I configured the board accordingly (as above, except that R13 is moved to R12 and R15 is moved to R14) and ran the test. Here are the results:
- @ 2.5V, 4.9MHz ==> 6.4ns
- @ 2.7V, 5.2MHz ==> 6.0ns
- @ 2.8V, 5.4MHz ==> 5.8ns
- @ 3.3V, 5.9MHz ==> 5.3ns
All good results! -- so we now know we can make a 16-bit incrementer that will be fast enough.
P.S. The four carry lookahead AND gates are on a single VQFN 74AUC08 IC. So, yes, soldering the VQFN package worked out just fine! That’s going to come in handy when it’s time to do layout.
10/18/2020 at 03:13 •
Here is a different take on the FET Switch Adder. This one relies on a 2:1 74AUC2G53 FET Switch. (Thanks to Dr Jefyll for suggesting this part). This configuration requires an additional gate, but capacitance on the carry-chain is lower — AUC parts have lower intrinsic capacitance to begin with, and the carry chain now connects to one pin on the switch rather than two, as follows:
I took the opportunity to extend the carry chain to better simulate a 16-bit incrementer. This circuit also includes four AND gates in series to simulate carry lookahead feeding the final four bits of the adder. Here is they layout of the test board:
74AUC08 ICs are only available in a VQFN package, so I thought I would experiment with that in passing. Honestly, the footprint (bottom center on the board) looks about the same size as the other VSSOP packages, and the big center pad makes routing harder.
Incidentally, the good folks at PCBWay have very kindly offered to support this project with PCB manufacturing. Many thanks to them for that! I used them for all my prior boards, so I’m happy to continue to do so. For now, these little test boards are quite straight forward. I’m sure I will welcome having a contact to talk to when we get to the more demanding impedance controlled boards.
To configure the board for the test, jumpers R2, R4, R5, R7, R10, R13 and R15 were fitted. In this setup, the oscillation of the carry chain includes the switch-time of the first FET Switch, so it accurately reflects the transit time as it would be used in the Adder.
I ran the test at various operating voltages to see what would happen. The normal operating voltage for AUC logic is 2.5V, the Recommended Maximum is 2.7V and Absolute Maximum is 3.6V. Once again we measure pin 11 of the 74LVC163 counter which is a divide by 16 function. We are looking for a 6.5ns tpd to the output carry in order to meet the target. Here are the results:
- @2.5V, 4.25MHz * 16 = 68MHz. 1000/68 = 14.7 / 2 = 7.35ns tpd
- @2.7V, 4.65MHz * 16 = 74.4MHz. 1000/74.4 = 13.4 / 2 = 6.72 ns tpd
- @2.8V, 4.87MHz * 16 = 76.9MHz. 1000/76.9 = 13 / 2 = 6.5ns tpd
- @3.3V, 5.45MHz * 16 = 87.2MHz. 1000/87.2 = 11.47 /2 = 5.73ns tpd
I then had a chance to do some surgery ...
This is to double up the driver at the input of the carry-chain, as Dr. Jeffyl suggested. To do so I stacked another SOT23 gate on top of the existing driver on the board. (I didn’t have another AND gate, so I used an XOR gate and tied one of the inputs to GND with a little patch cable. It’s a mess but it did the job).
The rationale here is that AUC logic has relatively weak drive: 9mA as compared to 24mA for LVC. Doubling up the drivers will add a tiny bit of capacitance on the input, but the reduced tpd though the FET switches should more than compensate for that and tpd overall should drop. At least that’s the theory.
Now, recall that we are looking 6.5ns or less here. We measure the frequency of oscillation divided by 16 and calculate the tpd through the 8-bit adder at various voltage levels. Here are the results:
- @2.5V, 5MHz x 16 = 80MHz —> 6.25ns
- @2.7V, 5.54MHz x 16 = 88.64MHz —> 5.64ns
- @2.8V, 5.65MHz x 16 = 90.4MHz —> 5.5ns
- @3.3V, 6.14MHz x 16 = 98.24MHz —> 5.08ns
The additional drive has done it, and we even have a reasonable safety margin. ttlworks’ FET Switch Adder as enhanced by Dr. Jefyll is a winner! I then fired up the test at 2.5V with a NC7SV08 in place of the 74AUC1G08 in the 8-bit adder carry-chain., Here is what I got:
- @ 2.5V, 4.94MHz ==> 6.33ns
Bingo! it's confirmed. NC7SV logic is a nice choice to drive the carry chain. It can be used conveniently for all the AND gates along the carry chain to provide the additional drive when needed. There is also an NC7SV74 flip-flop available which will do nicely for the ALU input Carry.
10/18/2020 at 02:53 •
This is an important element of the design and right at the center of the critical path. Within the ALU, the inputs to the adder will be registered, and its outputs will go to the address lines of synch RAM (among other destinations). So the critical path will include the CLK-to-Q delay of the input registers, the Address-to-CLK setup time for the RAM, and a couple of buffers in between. Allowing sufficient time for clock-skew and intrinsic trace delay, we get just about 6.5ns available for the adder at 100MHz!
This design is based on ttlworks' concept for a FET Switch Adder. The FET Switch Adder uses the fast data-to-Y tpd through the switches for the all-important ripple-carry chain. The data inputs are subject to the much slower Sel-to-Y tpd of the switches, but that delay is incurred only once for the whole chain.
For the test, I used a variation as suggested by Dr Jefyll, with 74CBTLV3253 muxes, as follows:
The central challenge in the circuit is the build-up of capacitance along the carry chain. To explore the issue, the test sets up the carry chain to oscillate and trigger a 74LVC163 counter. We can configure the chain as 8-bits or 12-bits, and measure the frequency of oscillation as divided by the counter. The carry chain can also be split with an optional buffer (AND gate) after the 4th element to reduce the capacitance. The whole thing sits on about 1.5 square inches of board space:
At these distances, we don't have to worry about transmission line effects, so all connections are unterminated. Here's a trace of the counter output:
We're probing pin 11 on the '163 counter (divide by 16 output), and the carry-chain is configured as two 4-bit segments linked with the AND gate. We can calculate the tpd of the carry-chain based on the 4.29MHz measured frequency as follows:
- 8-bit carry-chain w/ AND gate: 4.29 MHz x 16 = 68.64 MHz = 14.5ns period / 2 = 7.25ns tpd
Removing the AND gate from the circuit is pretty much a wash -- the delay from the added capatiance is just about equivalent to gate delay we take out:
- 8-bit carry-chain, no AND gate: 7.2ns
So, we have about 0.9ns per bit. The 12-bit carry chain showed a pretty linear growth in the delay, with 0.9ns per bit as well:
- 12-bit carry-chain: 10.8ns
The tpd of the adder includes the carry chain plus the switch-time of the 74CBTLV3253, which is 2.9ns (typical). That will remove one bit from the carry chain, so a net addition of about 2ns. The final inverter in the chain should be counted since the carry chain will need to be buffered from the rest of the CPU. So that gives us about 9.2ns for the “A to C” tpd of an 8-bit 74CBTLV3253 FET Switch Adder (roughly 1.2ns per bit).
Not bad at all, and certainly MUCH faster than an equivalent circuit using conventional gates (a conventional ripple-carry adder would be roughly 3ns per bit with NC7SV logic). So a great result, all told, but unfortunately not quite fast enough for 100MHz operation. We’ll have to keep working to squeeze out just a little more performance out of this circuit.
10/17/2020 at 18:14 •
Let's take a closer look at the ALU. The overall structure is actually fairly straight forward:
There are registers at the inputs, ALUA, ALUB and ALUC. From there, there are independent paths for the adder and other functions in order to keep capacitance as low as possible for the adder. The shift buffers (SHR and SHL) are placed after the OR function so either the ALUA or ALUB can be shifted by feeding a zero to the other input. Logical operations and shifts are both very fast so there is no issue having them in series. There is a dedicated left-shift buffer rather than using the adder to add a value to itself, as is commonly done. This is so we don't have to connect the A and B inputs of the adder together, which would once again add capacitance.
The R and C registers at the outputs of the ALU capture the ALU result and carry at the end of the cycle. There are paths that bypass these registers to recirculate R and C back into the ALU inputs. Thse are required when two inter-dependent ALU operations follow one after the other immediately. This is the case, for example, when adjusting the high-byte during address calculation.
Control signals going to the ALU are applied only at the outputs in order to select the desired ALU operation output. The control signals can therefore be generated without penalty *during* the cycle while the ALU itself is working. The Flags To Modify (FTM) register is used to capture Write-Enable control signals for each flag that must be updated. The flags are actually updated in the cycle following the ALU operation based on the R and C values. The A7, B6 and B7 hold the indicated bits from the A and B inputs and are used to evaluate the V flag.
The theory of operation for the ALU is that all inputs must be prepared and loaded into registers in the prior cycle. At the clock-edge, the ALU begins working immediately, and the results are captured into output registers at the very end of the cycle. The ALU is thus bracketed by registers on both sides, and can be neatly inserted as a pipeline stage into the datapath.
One thing to note is that the ALU does not invert the B input of the adder for subtract operations. Instead, the B input is inverted in the prior cycle. This manouver reduces the propagation delay through the adder and conveniently shifts the burden to the prior cycle -- which is typically a operand read of the SBC instruction. There is plenty of time to invert the operand on the way in from memory.
And that's a nice segue to the setup for memory:
In this design, memory too has dedicated registers, namely ADL, ADH, WE and DOR (Data Output Register). Just as with the ALU, these registers are also loaded in the cycle prior to the memory operation. The result of a memory read is clocked into a register also. but rather than using a dedicated register, the data read is placed directly into an appropriate internal register in the CPU (ALUB, ADL, ADH or IR).
This arrangement is very well suited to synch RAMs, which have registered inputs internally. When using synch RAM, ADL, ADH, WE and DOR merely act as shadow registers to the synch RAM's own internal registers. An asynchronous data bus can run at the outputs of ADL and ADH, where traditional RAM, ROM and other peripherals can operate as usual. Of course, very little time will be available for such peripherals in the normal cycle, so it is likely that all aynchronous I/O will be wait-stated (or buffered). More on that later.
Equipped with these registers, both memory and the ALU can be treated as pipeline stages. In both cases, we set up the inputs in one cycle, the operation is completed in the next, and the result is captured in registers at the end of the cycle. The critical path for the pipeline stage includes the CLK-to-Q delay of the input registers and Data-to-CLK setup time of the output registers. If the output is going directly into synch RAM internal registers (when using the ALU to calculate an address, for example), the setup time of the synch RAM must be met.
At 100MHz, only 6.5ns remain available to the ALU after the initial register tpd and before the synch RAM setup time. We’ll turn next to how to design the ALU internals to meet this time threshold.
10/17/2020 at 16:01 •
At 100MHz, the cycle is only 10ns long. At that time scale, issues that can go ignored at slower clock-rates suddenly become very material. Clock-skew is one such issue. The cycle is so short that even small delays on clock signals will be material.
What kinds of delays could we be dealing with? Suppose we have a clock signal internal to the CPU with a 1.2ns rise-time (Tr) driving a 5" trace with ten flip-flops on it. A 50Ω trace on FR4 will present 3.3pF of parasitic capacitance per inch, and each flip-flop will add 3pF of capacitance in addition (assuming AUC logic). The cumulative delay on that trace is something in the order of 3.5ns relative to the input clock signal (i.e., prop delay = Tr + RC, so 1200ps + (5 * 3.3pF * 50Ω) + (10 * 3pF * 50Ω) = 3.5ns). 3.5ns may not seem like much, but it represents more than a third of the cycle at 100MHz!
The moral of the story is to manage capacitance on clock lines carefully. To that end, I'm contemplating using a CDCVF310 1:10 Clock Driver to distribute the clock around the board. A two level clock tree can provide a dedicated trace for up to 100 destinations with minimum capacitance. We can then adjust for the tpd of the clock drivers themselves by using a CY2302 Zero-Delay-Buffer (ZDB) to synchronize these internal signals to the input clock.
Beyond capacitance, there are four key specs in the CDCVF310 clock-driver datasheet that we should examine to better understand skew:
- Tpd = 2.8ns max -- Propagation Delay: CLK input to Yn output propagation delay
- Tsk(o) = 150ps max -- Output Skew: the variation in the tpd between outputs, i.e., from Ym to Yn
- Tsk(p) = 250ps max -- Pulse skew: the variation in tpd from PLH to PHL
- Tsk(pp) = 350ps max -- Part to Part skew: the variations in tpd from various ICs on the board
With a multi-level tree, all four specs may come into play, and the total skew can add up to be a problem if we're not careful. Consider two Flip-Flops in series, like this:Attachment:
If the clock-delay from FF1 to FF2 is longer than the tpd of FF1 plus the data delay to FF2, then FF2 will not latch the intended value correctly and the circuit will fail. One way to ameliorate the problem is to use trace delays in our favour. We can wire CLK signals so traces go from downstream flip-flops to upstream ones, hence clocking them in reverse order. Another option is to introduce delay in the data signals until the travel time between the flip-flops exceeds the longest clock-skew by some safety margin.
And that brings us neatly into the issue of Write-Enable signals (WE) and how they might impact skew. We have a few implementation options to consider:
1) On the C74-6502, write signals are all routed to a 74AC273 register and released together on the clock-edge -- like this:Attachment:
The 74AC273 is cleared mid-cycle by a low-going pulse. Active-high WE signals arrive at the 74AC273 at various times throughout the cycle, but then travel to their destinations more or less together. A challenge with using this approach in this design is the potential skew between the outputs of the’273 register. There is no spec for skew mentioned on the 74AC273 datasheet, but it can be as much as 1ns on a 74LVC273. (From the datasheet, Tsk(o) = 1ns max, “Skew between any two outputs of the same package switching in the same direction."). In addition, it’s also more difficut to generate the mid-cycle pulse to clear the ‘273 reliably at these clock-rates.
2) To minimize skew, we ideally want nothing in the path between the clock and a flip-flop’s CLK input, as in this alternative based on a 2:1 FET Mux at the data inputs of a register:Attachment:
This method accomodates both active-high and active-low WE signals equally well. The FET switch will add 5Ω of series resistance at the data inputs of the flip-flop, and with it some minimal additional delay that we can safely ignore here. The switch-time of the mux becomes the setup time for the WE signal. One consideration is that clocking all flip-flops every cycle will consume some additional power, but the main drawback is the number of additional ICs required implement the approach. We effectively need one ‘3257 2:1 mux for every 4-bits of registered data in the CPU. That's a lot of hardware overhead.
3) Another approach is to use active-high WE signals with AND gates at the CLK inputs of flip-flops. This gating mechanism is sensitive to glitches if the WE signal is allowed to fluctuate while the clock is high. This is very likely to happen with WE signals that are generated by combinatorial logic. In that case, we can use a transparent latch to hold WE steady during the first half of the cycle, like this:Attachment:
In this case, the datasheet for 74AUC08 lacks any skew information so it’s hard to know what kinds of delay we might be dealing with. We can use the ZDB to synch up the signals with an external clock, but that of course won’t help skew.
4) We can also use OR gates with active-low WE signals, as follows:Attachment:
In this case, fluctuations in /WE during the clock-high phase are not harmful. The catch is that once /WE is brought low, the signal must rise again before the following mid-cycle, or a phantom clock-edge will be generated at the flip-flop input when it does rise. The approach is therefore not feasible in situations where WE signals are only available late in the cycle (which is very much the case in this design).
5) Finally, we can use FET switches to gate clock signals, like this:Attachment:
With this approach, brief fluctuations on the WE signal should not cause a problem. However, longer pulses, or indeed a full switching of the WE signal before the mid-cycle is problematic. Like with AND-gated clock above, a transparent latch could be used to hold WE steady during the first half of the cycle. The advantage here is that the tpd of FET switches is much shorter, and so any deviation from the mean is likely to also be minimal.
I don't doubt there are other approaches to these issues, so please feel free to suggest any alternatives. As always, all input is very much welcome. We'll have to see how all this works out in the final analysis. For now it's safe to say that clock-skew, trace-delay and write-enable signals will all need to be carefully managed if we are to get anywhere near the 100MHz goal.