100MHz TTL 6502

Experimental project to break the 100MHz “sound barrier” on a TTL CPU

An often repeated refrain is that homebuilt CPUs are constrained to single-digit clock-rates by limitations inherent in discrete-component design. But we know that's not true. The C74-6502 achieved a 20MHz clock-rate while still being a full-fledged cycle-accurate 6502. It's worth asking, then, could a humble TTL 6502 reach that rarefied air above 100MHz? It’s not clear such a thing is possible, but the challenge is on!

Team C74 is once again on hand and the objective is to build a next generation TTL 6502 with the highest clock-rate we can muster. The focus will be on reducing the cycle-time while keeping CPI fixed. The over-arching goal as always is to learn and to have fun. This project promises ample opportunity for both, so we'll buckle up and get ready for a bumpy ride!

The effort breaks down into a few key strategies:

1) Use faster hardware
2) Optimize critical circuits
3) Increase parallel processing
4) Manage signal integrity

Let's look briefly at each in turn.

Memory is a key area where faster hardware is essential. Both external memory and the microcode store will need to keep up with a faster clock-rate. Fortunately, access-time can be reduced almost at will using RAM. Hobby-friendly 10ns RAMs are readily available, and synch RAMs are even faster. The latter expect an address in advance of the cycle, and deliver in return access-times that are vanishingly small. It's safe to say memory is not likely to be a bottleneck in this design.

By the same token, there are also faster logic families available. The 3.3V LVC family, for example, has a good selection of parts at almost twice the speed of AC logic. The CBTLV family offers 3.3V variants of FET switches which can be very fast when deployed correctly. And then there are the AVC and AUC families. With near-nanosecond propagation delays, these families also feature variable impedance outputs which "provide great signal integrity without the need for external termination when driving traces of moderate length (less than 15 cm)". All-in-all, it's an embarrassment of riches when it comes to fast components.

But there are limitations also. For example, there is no equivalent to the 74AC283 Adder in these faster families, and FET switches are no faster with Select signals than their AC family cousins. Some careful design will be needed in critical circuits to capture the potential gains. ttlworks’ FET Switch Adder is a good example of this, but there are others. The Decode, Flag Evaluation, and Branch Testing circuits are a few examples that are likely to land on the critical path.

Beyond specific optimizations, we'll need to look to increased concurrency. The C74-6502 divided its processing into two stages: the FETCH stage, and the everything-else-stage (aka EXECUTE). An obvious improvement is to split EXECUTE into shorter phases. As we discovered, pipelining can get very complicated very quickly, with multiple caches, hazard checks and branch prediction schemes. So we'll need to be careful lest the whole thing get out of hand. Thankfully, there are significant gains to be had with more TTL-friendly techniques. More on that later.

The final leg of the race is all about signal integrity. Trace geometry, stackup and clock management will all need careful attention. We are likely to need six-layer boards, impedance-controlled traces and a mixed-voltage supply. It's gonna be fun.

It was not until 1992 that DEC Alpha and HPPA RISC took the computer industry as a whole beyond the 100MHz mark. Is it possible for a discrete-component 6502 to reach that same 100MHz milestone today? Well, we're gonna try to find out!

  • Decimal Mode

    Drass 6 hours ago 0 comments

    The basic method for Decimal Mode is to perform an ADD or SUB operation, and then convert the result to BCD. The process is to work on each nibble in turn, as follows:

    Adder LO --> Detect LO --> Generate LO --> Adjust LO --> BCD Result LO
    Adder HI --> Detect HI --> Generate HI --> Adjust HI --> BCD Result HI

    Detect_LO tests to see if the lower nibble needs to be adjusted. This would be the case if the binary result is greater than 9, or if the low-nibble carry (C4) is high. To adjust an ADD result, Generate_LO will generate a 6 (or 0 if no adjustment is needed) which is then applied to the binary result by Adjust_LO. Generate_LO will also generate a BCD low-nibble carry (BCDLC) in that case. The process is the same for the upper nibble, except that BCDLC must be added to the upper nibble result. The same logic holds for subtraction, except that Generate_LO and HI will produce a $A rather than a 6 to perform the adjustment.
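    As a rough illustration of the method above, here is a minimal Python sketch of a decimal-mode ADD. It assumes both operands are valid BCD; the function name `bcd_add` and its signature are made up for this example. It models only the arithmetic of the Detect, Generate and Adjust steps, not the circuit itself or its behaviour on non-decimal inputs.

```python
def bcd_add(a, b, carry_in=0):
    """Decimal-mode ADD for two valid BCD bytes (illustrative sketch only).

    Returns (bcd_result, decimal_carry_out).
    """
    # Binary add of the low nibbles.
    lo = (a & 0x0F) + (b & 0x0F) + carry_in
    c4 = lo > 0x0F                          # low-nibble binary carry (C4)

    # Detect_LO: adjust if the nibble is greater than 9 or C4 is high.
    adjust_lo = c4 or (lo & 0x0F) > 9
    # Generate_LO produces 6 (or 0); Adjust_LO applies it to the nibble.
    lo = ((lo & 0x0F) + (6 if adjust_lo else 0)) & 0x0F
    bcdlc = 1 if adjust_lo else 0           # BCD low-nibble carry

    # Same steps for the high nibble, with BCDLC added in.
    hi = (a >> 4) + (b >> 4) + bcdlc
    c8 = hi > 0x0F
    adjust_hi = c8 or (hi & 0x0F) > 9
    hi = ((hi & 0x0F) + (6 if adjust_hi else 0)) & 0x0F

    return (hi << 4) | lo, 1 if adjust_hi else 0


result, carry = bcd_add(0x58, 0x46)
print(f"58 + 46 = {carry}{result:02X}")
```

    Subtraction would follow the same shape with $A in place of 6, as described above; it is left out here to keep the sketch short.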

    Now the binary adder alone consumes the entire cycle at 100MHz, so Decimal Mode at high speed will need to take two cycles to complete (like it does on the 65C02). A happy consequence of this is that we can use the ALU adder for both the original binary operation and the subsequent adjust operation. To do so we feed the result of the initial binary addition back into the ALUA input, and feed an appropriate Adjust Value into the ALUB input for each nibble.

    Because the binary result for the lower nibble emerges from the adder early in the initial cycle, we are able to generate the lower nibble Adjust Value in the same cycle, like this:

    Cycle 1: Adder LO --> Detect LO --> Generate LO --> ALUB
    Cycle 2: ALUB LO --> Adder LO (B input) --> BCD Result

    The high nibble, on the other hand, is not ready until the very end of the initial cycle. We must therefore generate the Adjust Value for the high nibble in the second cycle, like this:

    Cycle 1: Adder --> ALUA
    Cycle 2: ALUA HI --> Detect HI --> Generate HI --> Adder HI (B input) --> BCD Result

    This will work, as long as the high nibble Adjust Value can be generated quickly. Adding an alternate path to the B input of the adder will add capacitance, but only minimally so and only to the high order bits of the carry-chain where we can tolerate some delay.

    Thanks to Dr Jefyll and ttlworks, the BCD adjust circuit in the C74-6502 is very fast already, and we can adapt it for our purposes here. This circuit produces results that are compatible with the NMOS 6502 for both decimal and non-decimal inputs. It uses FET Switches for time critical logic. With a little rejigging, we can adapt it to work in this new design, as is shown in this rough schematic:
    The high-nibble Adjust Value is generated by four FET Muxes in series (BCD.DETECT.HI, BCD.DETHI.AUX, BCD.SEL.HI and ALUB.SEL). This value is then fed into the high-nibble of the FET Adder. Earlier tests showed that CBTLV switches took about 1ns longer than AUC parts in the carry chain. The Adjust Value path is therefore likely to delay the adder result by that margin as well. Thankfully, because the results of Decimal Mode operations are never used as addresses, the Adjust Value path does not have to meet the 1.5ns setup time of the synch RAM. We therefore should have just enough extra time for this path to work.

    In order to remove from the adder the delay associated with the BCD carry, it’s easiest to break the carry chain at C4 and perform two separate adds for the low and high nibbles. The BCD carry can then be added in at the end as bit 0 of the high-nibble Adjust Value. In order to make this work, Detect_HI must adjust the threshold to test for > 8 for addition and < $F for subtraction. The ADJ1 and ADJ7 values that are input to BCD.DETECT.HI achieve that in the schematic above.
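    The threshold shift can be checked with a small Python sketch for the addition case. Both function names are hypothetical, and `hi_sum` stands for the binary sum of the two high nibbles with no carry from the low nibble. The first function adds the BCD carry before detecting; the second detects on the carry-less sum with the shifted threshold and folds the carry into bit 0 of the Adjust Value. This is only an arithmetic model under those assumptions, not a description of the actual circuit.

```python
def high_nibble_reference(hi_sum, bcd_carry):
    """Add the BCD carry first, then detect and adjust."""
    total = hi_sum + bcd_carry
    adjust = total > 9
    if adjust:
        total += 6
    return total & 0x0F, int(adjust)


def high_nibble_split(hi_sum, bcd_carry):
    """Detect on the carry-less sum, then fold the BCD carry into bit 0
    of the Adjust Value."""
    c8 = hi_sum > 0x0F                    # binary carry out of the high nibble
    threshold = 8 if bcd_carry else 9     # test for > 8 when a BCD carry is pending
    adjust = c8 or (hi_sum & 0x0F) > threshold
    adjust_value = (6 if adjust else 0) | bcd_carry
    return (hi_sum + adjust_value) & 0x0F, int(adjust)


# Two BCD digits sum to at most 18, so checking 0..18 covers every case.
mismatches = [
    (h, c)
    for h in range(19)
    for c in (0, 1)
    if high_nibble_reference(h, c) != high_nibble_split(h, c)
]
print("mismatches:", mismatches)
```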

    We can separate the FET carry chain at C4 without adding capacitance by using the INH pin on the 74AUC2G53 C4 IC. An alternate C4' tied to GND can push a zero into the carry chain as needed. Both C4 and C4' can be switched before the ripple carry...

  • The Incrementer

    Drass 10/18/2020 at 03:21 0 comments

    I first tried a 16-bit FET carry chain in series just to see what kind of delay we might see. (For reference, the test board is configured as follows: R2, R4, R5, R9, R11, R12, R14 and R8 are open, and R1, R3, R6, R7, R10, R13 and R15 are closed.) Here are the results:

    • 2.5V, 2.2MHz ==> 14.2ns
    • 2.7V, 2.4MHz ==> 12.9ns
    • 2.8V, 2.5MHz ==> 12.5ns
    • 3.3V, 2.9MHz ==> 10.7ns

    As expected, a serial 16-bit FET carry chain is much too slow. The incrementer result in the CPU will be fed directly to the synch RAM (when incrementing PC for example), so the setup time of 1.5ns applies here as well. Add to that the tpd for the source register, some transit time, clock skew, etc. and we're pretty much left with about 6.5ns for the incrementer (just like with the adder).

    So, the next step was to try carry lookahead. Four levels of AND gates on this board simulate carry lookahead for the first 12 bits of the incrementer. In the test circuit, the lookahead carry is then fed to four FET switches to simulate incrementing the final four bits. In this case, we don't have to include the switch time in the circuit since that happens concurrently with the carry lookahead.
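    For readers who want the logic spelled out, here is a minimal Python sketch of what an incrementer computes: each bit toggles only when every lower bit is 1. The loop below evaluates that condition one bit at a time, whereas lookahead hardware forms the same AND terms in parallel. The function name `increment16` is made up for this example, and nothing here models gate delays.

```python
def increment16(x):
    """16-bit increment built from per-bit toggle conditions."""
    result = 0
    all_lower_ones = 1  # the implicit carry-in of 1 performs the increment
    for i in range(16):
        bit = (x >> i) & 1
        # A bit toggles only if all lower bits are 1 (carry into this bit).
        result |= (bit ^ all_lower_ones) << i
        # AND term for the next bit: still 1 only while every bit so far is 1.
        all_lower_ones &= bit
    return result


print(hex(increment16(0x0FFF)), hex(increment16(0xFFFF)))
```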

    So, I configured the board accordingly (as above, except that R13 is moved to R12 and R15 is moved to R14) and ran the test. Here are the results:

    • 2.5V, 4.9MHz ==> 6.4ns
    • 2.7V, 5.2MHz ==> 6.0ns
    • 2.8V, 5.4MHz ==> 5.8ns
    • 3.3V, 5.9MHz ==> 5.3ns

    All good results! -- so we now know we can make a 16-bit incrementer that will be fast enough.


    P.S. The four carry lookahead AND gates are on a single VQFN 74AUC08 IC. So, yes, soldering the VQFN package worked out just fine! That’s going to come in handy when it’s time to do layout.


  • The Adder (V2)

    Drass 10/18/2020 at 03:13 0 comments

    Here is a different take on the FET Switch Adder. This one relies on a 2:1 74AUC2G53 FET Switch. (Thanks to Dr Jefyll for suggesting this part). This configuration requires an additional gate, but capacitance on the carry-chain is lower — AUC parts have lower intrinsic capacitance to begin with, and the carry chain now connects to one pin on the switch rather than two, as follows:

    Here is the test circuit:

    I took the opportunity to extend the carry chain to better simulate a 16-bit incrementer. This circuit also includes four AND gates in series to simulate carry lookahead feeding the final four bits of the adder. Here is the layout of the test board:
    74AUC08 ICs are only available in a VQFN package, so I thought I would experiment with that in passing. Honestly, the footprint (bottom center on the board) looks about the same size as the other VSSOP packages, and the big center pad makes routing harder. 

    Incidentally, the good folks at PCBWay have very kindly offered to support this project with PCB manufacturing. Many thanks to them for that! I used them for all my prior boards, so I’m happy to continue to do so. For now, these little test boards are quite straightforward. I’m sure I will welcome having a contact to talk to when we get to the more demanding impedance-controlled boards.


    To configure the board for the test, jumpers R2, R4, R5, R7, R10, R13 and R15 were fitted. In this setup, the oscillation of the carry chain includes the switch-time of the first FET Switch, so it accurately reflects the transit time as it would be used in the Adder. 

    I ran the test at various operating voltages to see what would happen. The normal operating voltage for AUC logic is 2.5V, the Recommended Maximum is 2.7V and Absolute Maximum is 3.6V. Once again we measure pin 11 of the 74LVC163 counter which is a divide by 16 function. We are looking for a 6.5ns tpd to the output carry in order to meet the target. Here are the results:

    • @2.5V, 4.25MHz * 16 = 68MHz. 1000/68 = 14.7 / 2 = 7.35ns tpd 
    • @2.7V, 4.65MHz * 16 = 74.4MHz. 1000/74.4 = 13.4 / 2 = 6.72 ns tpd
    • @2.8V, 4.87MHz * 16 = 77.9MHz. 1000/77.9 = 12.8 / 2 = 6.4ns tpd
    • @3.3V, 5.45MHz * 16 = 87.2MHz. 1000/87.2 = 11.47 /2 = 5.73ns tpd

    I then had a chance to do some surgery ... 

    This is to double up the driver at the input of the carry-chain, as Dr Jefyll suggested. To do so I stacked another SOT23 gate on top of the existing driver on the board. (I didn’t have another AND gate, so I used an XOR gate and tied one of the inputs to GND with a little patch cable. It’s a mess but it did the job).

    The rationale here is that AUC logic has relatively weak drive: 9mA as compared to 24mA for LVC. Doubling up the drivers will add a tiny bit of capacitance on the input, but the reduced tpd through the FET switches should more than compensate for that and tpd overall should drop. At least that’s the theory.

    Now, recall that we are looking for 6.5ns or less here. We measure the frequency of oscillation divided by 16 and calculate the tpd through the 8-bit adder at various voltage levels. Here are the results:

    • @2.5V, 5MHz x 16 = 80MHz —> 6.25ns
    • @2.7V, 5.54MHz x 16 = 88.64MHz —> 5.64ns
    • @2.8V, 5.65MHz x 16 = 90.4MHz —> 5.5ns
    • @3.3V, 6.14MHz x 16 = 98.24MHz —> 5.08ns
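    The conversion from a measured counter frequency to tpd can be sketched in a few lines of Python. The helper name `carry_chain_tpd_ns` is made up for this example. It follows the same arithmetic as the figures above: multiply by the divide-by-16 ratio, take the period, and halve it, on the reading that each half-period of the oscillation corresponds to one pass through the chain.

```python
TARGET_NS = 6.5  # tpd target from the text


def carry_chain_tpd_ns(measured_mhz, divider=16):
    """Convert the frequency seen at the counter output into chain tpd (ns)."""
    ring_mhz = measured_mhz * divider   # actual oscillation frequency
    period_ns = 1000.0 / ring_mhz       # period of one full oscillation
    return period_ns / 2.0              # one pass through the chain


# Supply voltage -> measured MHz at the counter output (from the list above).
readings = {2.5: 5.0, 2.7: 5.54, 2.8: 5.65, 3.3: 6.14}

for volts, mhz in readings.items():
    tpd = carry_chain_tpd_ns(mhz)
    verdict = "meets target" if tpd <= TARGET_NS else "too slow"
    print(f"{volts}V: {tpd:.2f}ns ({verdict})")
```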

    The additional drive has done it, and we even have a reasonable safety margin. ttlworks’ FET Switch Adder as enhanced by Dr Jefyll is a winner! I then fired up the test at 2.5V with a NC7SV08 in place of the 74AUC1G08 in the 8-bit adder carry-chain. Here is what I got:

    • 2.5V, 4.94MHz ==> 6.33ns

    Bingo! It's confirmed. NC7SV logic is a nice choice to drive the carry chain. It can be used conveniently for all the AND gates along the carry chain to provide the additional drive when needed. There is also an NC7SV74 flip-flop available which will do nicely for the ALU input Carry.

  • The Adder (V1)

    Drass 10/18/2020 at 02:53 0 comments

    This is an important element of the design and right at the center of the critical path. Within the ALU, the inputs to the adder will be registered, and its outputs will go to the address lines of synch RAM (among other destinations). So the critical path will include the CLK-to-Q delay of the input registers, the Address-to-CLK setup time for the RAM, and a couple of buffers in between. Allowing sufficient time for clock-skew and intrinsic trace delay, we get just about 6.5ns available for the adder at 100MHz!
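    To make the budget concrete, here is a small Python sketch of the subtraction. Only the 10ns cycle, the 1.5ns synch RAM setup time (quoted in the other logs) and the 6.5ns result come from this write-up. The CLK-to-Q, buffer and skew figures are placeholder assumptions chosen so the total matches the 6.5ns in the text; they are not measured values.

```python
CYCLE_NS = 10.0  # one cycle at 100MHz

# Overheads on the critical path, in ns.
overheads_ns = {
    "input register CLK-to-Q": 1.0,   # ASSUMED placeholder
    "buffers and trace delay": 0.5,   # ASSUMED placeholder
    "clock-skew allowance": 0.5,      # ASSUMED placeholder
    "synch RAM address setup": 1.5,   # setup time quoted in the logs
}


def adder_window_ns(cycle_ns, overheads):
    """Time left for the adder once all fixed overheads are subtracted."""
    return cycle_ns - sum(overheads.values())


print(f"Adder window: {adder_window_ns(CYCLE_NS, overheads_ns):.1f}ns")
```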

    This design is based on ttlworks' concept for a FET Switch Adder. The FET Switch Adder uses the fast data-to-Y tpd through the switches for the all-important ripple-carry chain. The data inputs are subject to the much slower Sel-to-Y tpd of the switches, but that delay is incurred only once for the whole chain. 
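    One way to picture the idea is a mux per bit whose select lines are the A and B operand bits and whose data inputs are 0, carry-in, carry-in and 1. The Python sketch below is a functional model under that assumption. It is a common way to build a switched carry chain, but it is my reading of the concept rather than a transcription of ttlworks' schematic, and the name `fet_switch_add` is made up.

```python
def fet_switch_add(a, b, carry_in=0, width=8):
    """Ripple adder where each carry stage is a 4:1 mux selected by (A, B)."""
    carry = carry_in
    result = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        # Mux data inputs: 00 -> 0 (kill), 01/10 -> carry-in (propagate),
        # 11 -> 1 (generate). Only the propagate case passes the carry
        # through the switch, which is the fast data-to-Y path.
        carry = (0, carry, carry, 1)[(ai << 1) | bi]
    return result, carry


print(fet_switch_add(0xFF, 0x01))
```

    The select lines settle once, at the start of the cycle, so the slow Sel-to-Y delay is paid a single time while the carry ripples through the already-switched chain.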

    For the test, I used a variation as suggested by Dr Jefyll, with 74CBTLV3253 muxes, as follows:


    The central challenge in the circuit is the build-up of capacitance along the carry chain. To explore the issue, the test sets up the carry chain to oscillate and trigger a 74LVC163 counter. We can configure the chain as 8-bits or 12-bits, and measure the frequency of oscillation as divided by the counter. The carry chain can also be split with an optional buffer (AND gate) after the 4th element to reduce the capacitance. The whole thing sits on about 1.5 square inches of board space:


    At these distances, we don't have to worry about transmission line effects, so all connections are unterminated. Here's a trace of the counter output:


    We're probing pin 11 on the '163 counter (divide by 16 output), and the carry-chain is configured as two 4-bit segments linked with the AND gate. We can calculate the tpd of the carry-chain based on the 4.29MHz measured frequency as follows:

    • 8-bit carry-chain w/ AND gate: 4.29 MHz x 16 = 68.64 MHz = 14.5ns period / 2 = 7.25ns tpd

    Removing the AND gate from the circuit is pretty much a wash -- the delay from the added capacitance is just about equivalent to the gate delay we take out:

    • 8-bit carry-chain, no AND gate: 7.2ns

    So, we have about 0.9ns per bit. The 12-bit carry chain showed a pretty linear growth in the delay, with 0.9ns per bit as well:

    • 12-bit carry-chain: 10.8ns

    The tpd of the adder includes the carry chain plus the switch-time of the 74CBTLV3253, which is 2.9ns (typical). That will remove one bit from the carry chain, so a net addition of about 2ns. The final inverter in the chain should be counted since the carry chain will need to be buffered from the rest of the CPU. So that gives us about 9.2ns for the “A to C” tpd of an 8-bit 74CBTLV3253 FET Switch Adder (roughly 1.2ns per bit). 
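    The estimate above reduces to a couple of lines of arithmetic, sketched here in Python. The function name `adder_tpd_ns` is made up; the 0.9ns per bit and 2.9ns switch time are the figures quoted in this log.

```python
PER_BIT_NS = 0.9       # measured carry-chain delay per bit
SWITCH_TIME_NS = 2.9   # 74CBTLV3253 switch-time, typical


def adder_tpd_ns(bits, per_bit_ns=PER_BIT_NS, switch_ns=SWITCH_TIME_NS):
    """Carry-chain delay, with the first stage replaced by the switch-time."""
    chain_ns = bits * per_bit_ns
    return chain_ns + switch_ns - per_bit_ns


total = adder_tpd_ns(8)
print(f"8-bit adder: {total:.1f}ns, about {total / 8:.2f}ns per bit")
```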

    Not bad at all, and certainly MUCH faster than an equivalent circuit using conventional gates (a conventional ripple-carry adder would be roughly 3ns per bit with NC7SV logic). So a great result, all told, but unfortunately not quite fast enough for 100MHz operation. We’ll have to keep working to squeeze just a little more performance out of this circuit.

  • The ALU

    Drass 10/17/2020 at 18:14 0 comments

    Let's take a closer look at the ALU. The overall structure is actually fairly straightforward:
    (Figure: ALU Block Diagram)
    There are registers at the inputs, ALUA, ALUB and ALUC. From there, there are independent paths for the adder and other functions in order to keep capacitance as low as possible for the adder. The shift buffers (SHR and SHL) are placed after the OR function so either the ALUA or ALUB can be shifted by feeding a zero to the other input. Logical operations and shifts are both very fast so there is no issue having them in series. There is a dedicated left-shift buffer rather than using the adder to add a value to itself, as is commonly done. This is so we don't have to connect the A and B inputs of the adder together, which would once again add capacitance.

    The R and C registers at the outputs of the ALU capture the ALU result and carry at the end of the cycle. There are paths that bypass these registers to recirculate R and C back into the ALU inputs. These are required when two inter-dependent ALU operations follow one after the other immediately. This is the case, for example, when adjusting the high-byte during address calculation.

    Control signals going to the ALU are applied only at the outputs in order to select the desired ALU operation output. The control signals can therefore be generated without penalty *during* the cycle while the ALU itself is working. The Flags To Modify (FTM) register is used to capture Write-Enable control signals for each flag that must be updated. The flags are actually updated in the cycle following the ALU operation based on the R and C values. A7, B6 and B7 hold the indicated bits from the A and B inputs and are used to evaluate the V flag.
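    As an illustration of why the saved operand bits are enough, here is a minimal Python sketch of the standard two's-complement overflow rule for a binary add: V is set when both operands share a sign that differs from the sign of the result. The function name `adc_overflow` is made up, B6 is not modeled, and this is the textbook rule rather than a description of the actual flag circuit.

```python
def adc_overflow(a, b, carry_in=0):
    """Return (result, V) for an 8-bit binary add."""
    r = (a + b + carry_in) & 0xFF
    a7 = (a >> 7) & 1   # sign bit of the A input
    b7 = (b >> 7) & 1   # sign bit of the B input
    r7 = (r >> 7) & 1   # sign bit of the result (from R)
    # Overflow: the result sign differs from both operand signs.
    v = (a7 ^ r7) & (b7 ^ r7)
    return r, v


print(adc_overflow(0x50, 0x50))
```

    Because the rule needs only A7, B7 and bit 7 of R, the flag can be evaluated in the cycle after the ALU operation, once R has been captured.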

    The theory of operation for the ALU is that all inputs must be prepared and loaded into registers in the prior cycle. At the clock-edge, the ALU begins working immediately, and the results are captured into output registers at the very end of the cycle. The ALU is thus bracketed by registers on both sides, and can be neatly inserted as a pipeline stage into the datapath. 

    One thing to note is that the ALU does not invert the B input of the adder for subtract operations. Instead, the B input is inverted in the prior cycle. This maneuver reduces the propagation delay through the adder and conveniently shifts the burden to the prior cycle -- which is typically an operand read of the SBC instruction. There is plenty of time to invert the operand on the way in from memory.
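    The trick rests on the identity A - B = A + (NOT B) + 1 in two's complement, with the 6502 carry flag supplying the +1. A minimal Python sketch for binary mode follows; the function names are made up, and the split into two steps only mirrors the idea that the inversion happens before the adder sees the operand.

```python
def invert_operand(operand):
    """Done in the prior cycle, on the way in from memory."""
    return operand ^ 0xFF


def adder(a, b, carry_in):
    """Plain 8-bit adder; it has no notion of subtraction."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def sbc(a, operand, carry_in=1):
    """Binary-mode SBC: carry set on entry means no borrow."""
    return adder(a, invert_operand(operand), carry_in)


print(sbc(0x05, 0x03), sbc(0x03, 0x05))
```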

    And that's a nice segue to the setup for memory: 
    In this design, memory too has dedicated registers, namely ADL, ADH, WE and DOR (Data Output Register). Just as with the ALU, these registers are also loaded in the cycle prior to the memory operation. The result of a memory read is clocked into a register also, but rather than using a dedicated register, the data read is placed directly into an appropriate internal register in the CPU (ALUB, ADL, ADH or IR).

    This arrangement is very well suited to synch RAMs, which have registered inputs internally. When using synch RAM, ADL, ADH, WE and DOR merely act as shadow registers to the synch RAM's own internal registers. An asynchronous data bus can run at the outputs of ADL and ADH, where traditional RAM, ROM and other peripherals can operate as usual. Of course, very little time will be available for such peripherals in the normal cycle, so it is likely that all asynchronous I/O will be wait-stated (or buffered). More on that later.

    Equipped with these registers, both memory and the ALU can be treated as pipeline stages. In both cases, we set up the inputs in one cycle, the operation is completed in the next, and the result is captured in registers at the end of the cycle. The critical path for the pipeline stage includes the CLK-to-Q delay of the input registers and Data-to-CLK setup time of the output registers. If the output is going directly into synch RAM internal registers (when using the ALU to calculate an address, for example),...

  • Clocking Registers

    Drass 10/17/2020 at 16:01 0 comments

    At 100MHz, the cycle is only 10ns long. At that time scale, issues that can be ignored at slower clock-rates suddenly become very material. Clock-skew is one such issue: the cycle is so short that even small delays on clock signals matter.

    What kinds of delays could we be dealing with? Suppose we have a clock signal internal to the CPU with a 1.2ns rise-time (Tr) driving a 5" trace with ten flip-flops on it. A 50Ω trace on FR4 will present 3.3pF of parasitic capacitance per inch, and each flip-flop will add 3pF of capacitance in addition (assuming AUC logic). The cumulative delay on that trace is something in the order of 3.5ns relative to the input clock signal (i.e., prop delay = Tr + RC, so 1200ps + (5 * 3.3pF * 50Ω) + (10 * 3pF * 50Ω) = 3.5ns). 3.5ns may not seem like much, but it represents more than a third of the cycle at 100MHz!
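    The same estimate can be written as a small Python helper, which makes it easy to try other trace lengths and load counts. The function name and parameter names are made up for this example; the formula (prop delay = Tr + RC) and the default values are the ones used in the paragraph above.

```python
def clock_trace_delay_ns(rise_time_ns, trace_inches, n_loads,
                         z0_ohms=50.0, trace_pf_per_inch=3.3, load_pf=3.0):
    """Estimate clock delay as rise time plus a lumped RC term."""
    total_pf = trace_inches * trace_pf_per_inch + n_loads * load_pf
    rc_ns = z0_ohms * total_pf / 1000.0   # ohms x pF gives ps; convert to ns
    return rise_time_ns + rc_ns


# The example from the text: 1.2ns rise time, 5" trace, ten flip-flops.
shared = clock_trace_delay_ns(1.2, 5, 10)
print(f"Shared trace: {shared:.2f}ns ({shared / 10.0:.0%} of a 10ns cycle)")
```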

    The moral of the story is to manage capacitance on clock lines carefully. To that end, I'm contemplating using a CDCVF310 1:10 Clock Driver to distribute the clock around the board. A two-level clock tree can provide a dedicated trace for up to 100 destinations with minimum capacitance. We can then adjust for the tpd of the clock drivers themselves by using a CY2302 Zero-Delay-Buffer (ZDB) to synchronize these internal signals to the input clock.

    Beyond capacitance, there are four key specs in the CDCVF310 clock-driver datasheet that we should examine to better understand skew:

    • Tpd = 2.8ns max -- Propagation Delay: CLK input to Yn output propagation delay
    • Tsk(o) = 150ps max -- Output Skew: the variation in the tpd between outputs, i.e., from Ym to Yn
    • Tsk(p) = 250ps max -- Pulse skew: the variation in tpd from PLH to PHL
    • Tsk(pp) = 350ps max -- Part to Part skew: the variations in tpd from various ICs on the board

    With a multi-level tree, all four specs may come into play, and the total skew can add up to be a problem if we're not careful. Consider two Flip-Flops in series, like this:


    If the clock-delay from FF1 to FF2 is longer than the tpd of FF1 plus the data delay to FF2, then FF2 will not latch the intended value correctly and the circuit will fail. One way to ameliorate the problem is to use trace delays in our favour. We can wire CLK signals so traces go from downstream flip-flops to upstream ones, hence clocking them in reverse order. Another option is to introduce delay in the data signals until the travel time between the flip-flops exceeds the longest clock-skew by some safety margin.
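    The failure condition above can be expressed as a simple check, sketched here in Python. The function names are made up, and the numbers in the usage lines are placeholders for illustration only, not datasheet or measured values.

```python
def ff2_latches_correctly(clock_delay_ns, ff1_tpd_ns, data_delay_ns):
    """FF2 is safe if its clock arrives before FF1's new output does."""
    return clock_delay_ns < ff1_tpd_ns + data_delay_ns


def extra_data_delay_ns(clock_delay_ns, ff1_tpd_ns, data_delay_ns,
                        margin_ns=0.0):
    """Data delay to add so the path exceeds the skew by a safety margin."""
    shortfall = clock_delay_ns + margin_ns - (ff1_tpd_ns + data_delay_ns)
    return max(0.0, shortfall)


# Placeholder numbers: FF1 tpd 1.0ns, 0.2ns of data trace delay.
print(ff2_latches_correctly(0.5, 1.0, 0.2))      # small skew
print(ff2_latches_correctly(1.5, 1.0, 0.2))      # skew exceeds the data path
print(extra_data_delay_ns(1.5, 1.0, 0.2, margin_ns=0.25))
```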

    And that brings us neatly into the issue of Write-Enable signals (WE) and how they might impact skew. We have a few implementation options to consider:

    1) On the C74-6502, write signals are all routed to a 74AC273 register and released together on the clock-edge -- like this:


    The 74AC273 is cleared mid-cycle by a low-going pulse. Active-high WE signals arrive at the 74AC273 at various times throughout the cycle, but then travel to their destinations more or less together. A challenge with using this approach in this design is the potential skew between the outputs of the ’273 register. There is no spec for skew mentioned on the 74AC273 datasheet, but it can be as much as 1ns on a 74LVC273. (From the datasheet, Tsk(o) = 1ns max, “Skew between any two outputs of the same package switching in the same direction."). In addition, it’s also more difficult to generate the mid-cycle pulse to clear the ’273 reliably at these clock-rates.

    2) To minimize skew, we ideally want nothing in the path between the clock and a flip-flop’s CLK input, as in this alternative based on a 2:1 FET Mux at the data inputs of a register:
    This method accommodates both active-high and active-low WE signals equally well. The FET switch will add 5Ω of series resistance at the data inputs of the flip-flop, and with it some minimal additional delay that we can safely ignore here. The switch-time of the mux becomes the...

eightycc wrote 3 hours ago

Very nice project. One small nit to pick with the writeup: Amdahl's 5990-700 delivered in June 1988 had a 100 MHz clock. In most cases, mainframes got there first. Of course the 5990-700 consumed prodigious amounts of power, floor space (raised floor please), and cost $6.1 million.


Drass wrote 2 hours ago

Thanks for the comment. There was also the Fluorinert-cooled Cray 2 supercomputer, clocking in with a 4.1ns cycle (244 MHz) in 1985. I have no idea what it sold for, but I am sure it was not in the “commercially reasonable” category. There may have been others as well, but I’d venture none that were as dramatic looking: :)


Yann Guidon / YGDES wrote 2 days ago

At 100MHz you end up with sub-ns edges, that means that you have not only transmission line effects because you shift to the GHz spectrum range, but also you need very careful grounding ! Return currents are very significant at these frequencies and could create ground-bounce effects worse than transmission-line effects. I guess your PCB has a ground plane and a + plane, and you need proper ground vias close to the data/signal vias to help... Yes I've been watching this kind of videos lately :-D

The lesson is that single-ended signals have invisible return currents, even differential lines NEED proper grounding.

Looking forward to seeing the rest of your system !


Drass wrote 2 days ago

It's a three headed monster: logic errors, propagation delay and signal integrity all can get you!  VCC and GND planes are a must, and yes one has to be careful when switching reference planes. A lot to think about!


danjovic wrote 3 days ago

Sound? Sure? A 100MHz 6502 is travelling at warp speed!! 


Drass wrote 2 days ago

Yeah, trying to warp time! :) Thanks for the like @danjovic.


Drass wrote 4 days ago

Thanks for dropping by Yann. Always good to hear from you.


Yann Guidon / YGDES wrote 3 days ago

and I'm happy to read you again !

That endeavour sounds exciting... but why don't you use existing monolithic adders ?


Drass wrote 2 days ago

Yes, particularly exciting since the outcome is far from guaranteed! :) It looks feasible, but time will tell.

Regarding the adder, there is only 6.5ns available for the 8-bit adder and 16-bit incrementer. I haven’t found anything in discrete logic that can do the job.


Yann Guidon / YGDES wrote 5 days ago

Oh my.........


