Many years ago I completed the challenge of fitting a CPU into the smallest CPLD available on the market: the MCPU. Ever since, I have been pondering a new challenge in minimalist CPU design. I had completed a TTL-based CPU even before the MCPU. Clearly, the only direction left is to go fully discrete and build a minimal CPU out of discrete transistors.
To make things interesting and add a modern twist, I decided to investigate a logic family that uses light-emitting diodes (LEDs) as active elements. LTL is a logic family from a past that never happened: it combines 1950s transistor logic with low-current green InGaN LEDs, which were invented in the 1990s.
I have already completed the design up to a prototype of a subsystem (see header image) and will use this project to document the steps I took to get there.
Thanks to express shipping, the populated PCBs arrived less than a week after I ordered them. A top-down photo of one ALU board with 4 bit slices is shown below. The dual-input diodes for the boolean unit multiplexer line up neatly in one row.
I tested the board using an ATmega168 to generate stimulus signals. Unfortunately, I found a small mistake I had made earlier in the design: the generate term in the carry chain should not have an additional carry input (see below). Luckily, this was easy to fix by removing one diode per slice from the board. You can see the unpopulated pads in the center of the top-down image.
Removing this signal from the design would have allowed me to densify the layout significantly and also to add the carry-chain inverter to the output, so that the carry signal is non-inverting. But well, next time...
The list below shows the control signal configurations that were tested. Only the first three cases correspond to operations used by the MCPU ISA. In addition, Y=B is needed for address loading. The other operations would allow the instruction set to be extended later.
Clock speed was tested up to 2 MHz. Even in the worst case (ADD with carry in=1), the signal integrity looked excellent, suggesting that 4 MHz and higher may also work. One limitation seems to be the pull-up capability of the clock driver, which has a fan-out of 8. It may be a good idea to reduce the collector resistor on high fan-out drivers or to design a push-pull driver.
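To see why a fan-out of 8 strains a resistive pull-up, a first-order RC estimate helps: the rising edge is set by the collector resistor charging the combined input capacitance of all driven gates, so the 10-90 % rise time grows roughly linearly with fan-out. The sketch below assumes a purely resistive pull-up and lumped input capacitance; R_C and C_IN are hypothetical placeholder values, not measurements from this board.

```python
def rise_time_10_90(r_pullup_ohm, fanout, c_in_farad, c_wire_farad=0.0):
    """First-order 10-90 % rise time of a resistive pull-up: t_r ~ ln(9) * R * C ~ 2.2 * R * C."""
    c_total = fanout * c_in_farad + c_wire_farad
    return 2.2 * r_pullup_ohm * c_total


# Hypothetical placeholder values (not taken from the LTL board).
R_C = 1_000.0   # collector (pull-up) resistor in ohms
C_IN = 10e-12   # input capacitance per driven gate in farads

for fanout in (1, 8):
    t_r = rise_time_10_90(R_C, fanout, C_IN)
    print(f"fan-out {fanout}: t_r ~ {t_r * 1e9:.0f} ns")

# Halving the collector resistor halves the rise time, at the cost of more static current.
print(f"fan-out 8, R_C/2: t_r ~ {rise_time_10_90(R_C / 2, 8, C_IN) * 1e9:.0f} ns")
```

A push-pull driver avoids this trade-off by actively sourcing current on the rising edge instead of relying on the resistor alone.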
The ALU board is considerably more complex than the datapath board. To still meet a footprint of less than 10x10 cm², the layout had to be made denser than on the address-datapath board.
The circuit of one slice of the ALU board is shown above. The number of dual diodes is quite high in this design because wide-input AOI gates are used as multiplexers for the boolean unit. The XNOR gate on the right side is based on a cross-coupled transistor pair. No LED could be used here because several PN junctions are stacked in the current path.
Eagle was used for the layout, making extensive use of its block functionality. Since only a two-layer PCB is used, a good strategy is needed to make the layout as tight as possible.
As a general rule, I used the front side for local interconnects and the power rails. Each slice was laid out with a fixed height of 0.65", which allows two gates to be placed between the power rails. Copper fill on both the front and rear sides is used as a ground plane. The rear side is used only for global routing (I/O and control signals).
After doing some estimates of copper trace resistance, I noticed that I could go to much smaller trace widths.
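The kind of estimate meant here is the DC resistance of a rectangular trace, R = ρ·L/(w·t). The sketch below assumes standard 1 oz copper (about 35 µm thick) and a hypothetical 10 cm long, 0.2 mm wide trace carrying a hypothetical 1 mA; these are illustrative numbers, not values from the board.

```python
RHO_CU = 1.68e-8  # resistivity of copper at about 20 degrees C, ohm*m
T_1OZ = 35e-6     # assumed copper thickness for 1 oz/ft^2 foil, m


def trace_resistance(length_m, width_m, thickness_m=T_1OZ):
    """DC resistance of a rectangular copper trace: R = rho * L / (w * t)."""
    return RHO_CU * length_m / (width_m * thickness_m)


# Hypothetical example: a 10 cm long, 0.2 mm wide signal trace carrying 1 mA.
r = trace_resistance(0.10, 0.2e-3)
i_a = 1e-3
print(f"R ~ {r:.2f} ohm, drop ~ {r * i_a * 1e3:.2f} mV")
```

A drop in the sub-millivolt range is negligible next to the diode and LED forward voltages that set the LTL switching thresholds, which is why narrow signal traces are acceptable.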
The image above compares the old and new polarity hold latch layouts. The new layout is much denser and takes 30% less PCB space. One interesting observation was that going from 0603 to 0402 resistors actually increased the layout area, since it was no longer possible to route connections below the components. Therefore I stayed with 0603 passives. Using smaller transistor packages was not an option either, since I could not source the transistor in a smaller package.
The full layout, excluding the ground plane, is shown above. Again, it was possible to fit four bit slices on one board. In contrast to the previous design, I decided to include buffers for the control signals, located at the top. The clock signal on this board has a fan-out of 8 gates, so combining several boards will require a clock tree to provide sufficient drive strength.
The total PCB size is 10x7 cm², only marginally larger than the address board. Thanks to the denser layout, it contains 379 devices (75 transistors, 95 diodes, 67 LEDs, 122 resistors, 20 capacitors), 41% more than the datapath board.
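A quick arithmetic check of the device count. The datapath-board figure is only inferred from the "41% more" statement; it is not a number stated in the text.

```python
# Device counts of the ALU board as listed above.
parts = {"transistors": 75, "diodes": 95, "LEDs": 67, "resistors": 122, "capacitors": 20}
total = sum(parts.values())
print(total)  # 379

# If "41% more" is relative to the datapath board, that board holds roughly:
datapath_estimate = round(total / 1.41)
print(datapath_estimate)  # 269
```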
Time for another update. Note that at this point I am no longer documenting work that was done in the past, but following my actual development in real time. Therefore, updates will be spaced much further apart, depending on how much time I can devote to this project.
After finishing the address-datapath design, it is time for the data-datapath.
The data-datapath encompasses the left side of the system design. The original MCPU ALU was rather simple and only supported the two operations needed directly by the ISA: (Y,Cout)=A+B and Y=A NOR B.
However, I changed the datapath a little to allow the use of latches in the design. As a consequence, all reads are directed through the ALU. Storing the accumulator to memory also requires data to pass through the ALU. Therefore, Y=A and Y=B have to be supported in addition to the operations above. At the gate level, this adds considerable overhead.
A lot of deliberation followed, including the option of merging the accumulator and data latch into one edge-triggered register, as done in the original MCPU. The problem is that in this case an additional data input latch has to be added to the address-datapath. In the end, it turned out to be more part-count efficient to stay with the latch-based architecture.
Two different designs were evaluated in detail.
A first implementation option for a single bit of the data-datapath is shown above. This design implements a full adder consisting of two XNOR2 gates, an inverter and an AOI2 gate. Y=A, Y=B and Y=A OR B are realized with an AOI3-based multiplexer in the first stage. The carry line can be fixed to zero or one using the Sinv and Sadd control inputs. This allows the second XNOR gate to be configured to invert the result, yielding Y=A NOR B. The table above shows the individual component usage. The total part count for this option is 80 per bit.
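The two-XNOR trick can be checked exhaustively. XNOR(XNOR(A, B), Cin) equals A XOR B XOR Cin, so two XNOR2 gates produce the sum bit, and an AOI2 fed with the generate term A·B and the propagate term (A XOR B)·Cin delivers the inverted carry. The wiring below, including where the inverter sits, is one plausible reading of the description, not a transcription of the schematic. It also shows why fixing the carry line configures the second XNOR: with one input tied to 1 it passes its other input through, tied to 0 it inverts it.

```python
from itertools import product


def xnor2(x, y):
    return 1 - (x ^ y)


def aoi2(a, b, c, d):
    """AND-OR-INVERT: NOT((a AND b) OR (c AND d))."""
    return 1 - ((a & b) | (c & d))


def full_adder(a, b, cin):
    """Return (sum, inverted carry out) using the assumed gate wiring."""
    p_n = xnor2(a, b)            # first XNOR: NOT(a XOR b)
    s = xnor2(p_n, cin)          # second XNOR: a XOR b XOR cin
    p = 1 - p_n                  # inverter recovers the propagate term
    cout_n = aoi2(a, b, p, cin)  # NOT(generate OR (propagate AND cin))
    return s, cout_n


# Exhaustive check against integer addition.
for a, b, cin in product((0, 1), repeat=3):
    s, cout_n = full_adder(a, b, cin)
    assert 2 * (1 - cout_n) + s == a + b + cin

# With the carry line forced to a constant, the second XNOR is a buffer (1) or an inverter (0).
for x in (0, 1):
    assert xnor2(x, 1) == x and xnor2(x, 0) == 1 - x
```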
The second option I looked into is a bit more universal and uses an AND-OR-INVERT based multiplexer for the first stage of the ALU. (EDIT: 20/03/28, updated to fixed version.) Using the four control signals, it is possible to generate any boolean function of the two inputs, which offers the possibility to extend the instruction set of the CPU at a later time. This design is, of course, heavily inspired by the MT-15 ALU. Since the first stage is now also able to implement NOR, it was possible to remove the Sinv control signal. Instead, only Sadd is present, which inhibits carry generation if set to 0.
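An AND-OR-INVERT multiplexer with four control lines acts as a small lookup table: each control line enables one of the four minterms of A and B, and the final inversion means the gate outputs 0 wherever an enabled minterm is true. The sketch below assumes a minterm ordering (S0 gates A·B, S1 gates A·~B, S2 gates ~A·B, S3 gates ~A·~B) purely for illustration; the actual assignment of control signals on the board may differ. It derives the control word for some of the operations the data-datapath needs and confirms that all 16 control words give distinct functions.

```python
from itertools import product

MINTERMS = [(1, 1), (1, 0), (0, 1), (0, 0)]  # assumed order for S0..S3


def aoi_mux(s, a, b):
    """AOI-based boolean unit: NOT(OR of the minterms enabled by s)."""
    s0, s1, s2, s3 = s
    na, nb = 1 - a, 1 - b
    return 1 - ((s0 & a & b) | (s1 & a & nb) | (s2 & na & b) | (s3 & na & nb))


def controls_for(f):
    """Control word that makes aoi_mux compute f(a, b).
    Because the gate inverts, a minterm is enabled wherever f is 0."""
    return tuple(1 - f(a, b) for a, b in MINTERMS)


ops = {
    "A NOR B": lambda a, b: 1 - (a | b),
    "A": lambda a, b: a,
    "B": lambda a, b: b,
    "A XOR B": lambda a, b: a ^ b,
}
for name, f in ops.items():
    s = controls_for(f)
    assert all(aoi_mux(s, a, b) == f(a, b) for a, b in product((0, 1), repeat=2))
    print(f"{name:8s} -> S = {s}")

# Every one of the 16 control words yields a distinct two-input function.
tables = {tuple(aoi_mux(s, a, b) for a, b in product((0, 1), repeat=2))
          for s in product((0, 1), repeat=4)}
print(len(tables))  # 16
```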
The design functionality was tested in the test bench above.
Designing PCBs with regular structures using free tools turned out to be surprisingly tedious. In the end I used the free version of Eagle because its block feature allows replicating circuit and layout units. The other options I looked at were EasyEDA and KiCad.
Unfortunately, the free version of Eagle is limited to a single sheet, so I ended up transferring my nice hierarchical LTspice implementation into a single-page mess of a schematic. The image above shows a one-bit slice.
Curiously enough, none of the many LEDs in the LTL gates represented the actual state of the address output pins. Therefore I introduced two additional indicator LEDs.
The final PCB layout is shown above. I managed to fit a four-bit-wide slice of the address path onto one 8.5x7 cm² PCB.
Implementation of the address path is straightforward based on the previously designed gates.
The first design of the address path is shown above. A half adder is needed to increment the PC. Since the gates are relatively fast, I used a ripple carry adder. There are two latches, one for the address and one for the PC. An AOI2 gate is used as a multiplexer between PC and external address. The address latch can be loaded with zeros by pulling both control inputs of the MUX low. This can be used to reset the PC.
Six of the bit slices were combined in a testbench to test the design of the full-width datapath. A two-phase clock is needed to clock the two latches alternately. In this setting, the circuit is configured to reset the PC and then increment it with each clock cycle.
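The reset-then-count behaviour of the six-slice testbench can be mimicked with a bit-level model. It assumes one particular dataflow that matches the description: in the first clock phase the address latch loads the MUX output (PC or external address), and in the second phase the PC latch stores the address incremented by a ripple chain of half adders. The MUX is modelled in its AND-OR form, ignoring the inversion of the real AOI2 gate, so that both select lines low yields all zeros.

```python
WIDTH = 6  # six bit slices, as in the testbench


def half_adder(a, b):
    """Return (sum, carry) of two bits."""
    return a ^ b, a & b


def increment(bits):
    """Ripple-carry increment of an LSB-first bit list; the final carry is dropped."""
    carry = 1
    out = []
    for bit in bits:
        s, carry = half_adder(bit, carry)
        out.append(s)
    return out


def mux(sel_pc, pc, sel_ext, ext):
    """AND-OR view of the AOI2 multiplexer; both selects low gives all zeros."""
    return [(sel_pc & p) | (sel_ext & e) for p, e in zip(pc, ext)]


def to_int(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def run(cycles):
    zeros = [0] * WIDTH
    pc = zeros[:]
    trace = []
    for cycle in range(cycles):
        sel_pc = 0 if cycle == 0 else 1   # first cycle: both selects low -> reset
        addr = mux(sel_pc, pc, 0, zeros)  # phase 1: address latch loads MUX output
        pc = increment(addr)              # phase 2: PC latch stores address + 1
        trace.append(to_int(addr))
    return trace


print(run(5))  # [0, 1, 2, 3, 4]
```

Because the final carry is dropped, the six-bit address wraps from 63 back to 0.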
Output traces of clock, reset and the first two output bits are shown above. Everything works nicely.
The original MCPU architecture is shown above. There are separate flows for the address path and the datapath. The state machine is not shown. Since this design is tailored to a CPLD, it assumes edge-triggered flip-flops for all registers.
The MCPU programmer's model is based on two registers: accumulator and PC. The ISA consists of four instructions in a fixed encoding based on a two-bit opcode and a six-bit memory address. This ISA can be mapped directly onto the datapath shown above with minimal control overhead. See the MCPU link for more information, including assembler, emulator and example code.
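A few lines of Python show how directly the fixed encoding maps onto the two registers. The opcode assignment and semantics used here (00 NOR, 01 ADD with carry, 10 STA, 11 JCC as "jump if carry clear, then clear carry") are my recollection of the MCPU documentation and should be checked against the MCPU link; the 8-bit accumulator with a separate carry flag is likewise an assumption of this sketch.

```python
# Opcode assignment as recalled from the MCPU documentation (verify there).
NOR, ADD, STA, JCC = 0b00, 0b01, 0b10, 0b11


def step(state, mem):
    """Execute one instruction. state = (accu, carry, pc); mem holds 64 bytes."""
    accu, carry, pc = state
    instr = mem[pc]
    opcode, addr = instr >> 6, instr & 0x3F  # 2-bit opcode, 6-bit address
    pc = (pc + 1) & 0x3F
    if opcode == NOR:
        accu = ~(accu | mem[addr]) & 0xFF
    elif opcode == ADD:
        total = accu + mem[addr]
        accu, carry = total & 0xFF, total >> 8
    elif opcode == STA:
        mem[addr] = accu
    else:  # JCC: jump if carry is clear, then clear carry
        if carry == 0:
            pc = addr
        carry = 0
    return accu, carry, pc


def run(program, data, steps):
    mem = [0] * 64
    mem[:len(program)] = program
    for addr, value in data.items():
        mem[addr] = value
    state = (0, 0, 0)
    for _ in range(steps):
        state = step(state, mem)
    return state, mem


# NOR 12 (clears accu), ADD 10, ADD 11, STA 13, then JCC 4 loops in place.
program = [0b00_001100, 0b01_001010, 0b01_001011, 0b10_001101, 0b11_000100]
state, mem = run(program, {10: 5, 11: 7, 12: 0xFF}, steps=6)
print(mem[13], state)  # 12 (12, 0, 4)
```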
Now to the last category of building blocks: flip-flops.
If you grew up learning about digital electronics in the advanced CMOS era, like me, you will most likely be accustomed to using edge-triggered flip-flops for everything. Unfortunately, it turns out that a proper edge-triggered flip-flop requires at least six NAND gate equivalents, unless you have dynamic CMOS logic at your disposal.
For those of us who were suddenly beamed into the discrete LTL age, latches are a much more part-count efficient solution, as they consume only about half as many components as a static edge-triggered flip-flop.
A commonly known minimal implementation of a gated D-latch using NAND2 gates is shown above (see also the Wikipedia article). A nice property of this design is that it requires only a single clock input and has both inverted and non-inverted data outputs. Data from Din is forwarded to the output while Clk is high. The state of the latch is frozen on the high-to-low transition of Clk and held while Clk is low.
When using this design in high-speed circuits, it becomes apparent that it has a nasty habit of generating glitches. The origin of this effect is the NAND2 gate in the lower left: the clock signal arrives at one of its inputs directly and at the other input delayed through the NAND2 gate in the top left.
This effect can be somewhat reduced by tweaking the propagation delay of the logic gates. In LTL this is easily possible by changing the LED color, which changes the threshold voltage. In this case, I used a faster "red" gate (hence the R) with a lower threshold as the top-left gate.
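The glitch mechanism can be reproduced with a crude discrete-time model of the four-NAND2 latch, in which every gate output at time t is computed from its inputs one gate delay earlier. Gate naming follows the description (the top-left gate drives the set line, the bottom-left gate the reset line); delays are arbitrary integer steps and analog effects are ignored, so this is a sketch, not a substitute for LTspice. With the latch holding 1 and Din=1, the rising clock briefly drives both set and reset lines low, which shows up as a pulse on the inverted output. Giving the top-left gate a shorter delay, as with the red gate, narrows that pulse but does not remove it.

```python
def nand(a, b):
    return 1 - (a & b)


def simulate_latch(clk, din, t_fast=2, t_gate=2, q_init=0):
    """Discrete-time model of the NAND2 gated D-latch.

    t_fast is the delay of the top-left gate, t_gate the delay of the other three.
    The stimulus must keep clk low for the first max(t_fast, t_gate) steps so that
    the initial hold state below is consistent.
    """
    n = len(clk)
    s_n = [1] * n  # set line (top-left gate output)
    r_n = [1] * n  # reset line (bottom-left gate output)
    q = [q_init] * n
    q_n = [1 - q_init] * n
    for t in range(max(t_fast, t_gate), n):
        s_n[t] = nand(din[t - t_fast], clk[t - t_fast])
        r_n[t] = nand(s_n[t - t_gate], clk[t - t_gate])  # clock direct and via s_n
        q[t] = nand(s_n[t - t_gate], q_n[t - t_gate])
        q_n[t] = nand(r_n[t - t_gate], q[t - t_gate])
    return q, q_n


# Functional check: transparent while clk is high, holds after the falling edge.
clk = [1 if 10 <= t < 30 else 0 for t in range(60)]
din = [1 if 15 <= t < 40 else 0 for t in range(60)]
q, q_n = simulate_latch(clk, din)
print(q[15], q[-1])  # 0 1

# Glitch check: latch holds 1, Din stays 1, clock rises once.
clk = [1 if 10 <= t < 30 else 0 for t in range(40)]
din = [1] * 40
for t_fast in (2, 1):
    _, q_n = simulate_latch(clk, din, t_fast=t_fast, q_init=1)
    print(f"top-left delay {t_fast}: glitch on /Q lasts {sum(q_n)} step(s)")
```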
There is also a way to reduce the gate count to three by replacing one of the NAND2 gates with a wired AND. This is described in a now-expired patent. (I don't know where I first found this; I believe it was somewhere in the Hackaday TTLers community, but I cannot find the source again. Please let me know if I forgot to give credit to someone.)
The patent also describes how to work around the aforementioned glitch by introducing one faster gate. A disadvantage of this design is that it only has an inverted output. The clock input is also inverted compared to the previous design: data is forwarded for clk='0' and held for clk='1'.
As a next step, we will look into options for designing XOR2 gates in LTL.
Another option is to use an AOI2 gate and two inverters. The number of components is almost the same as for the NAND2 implementation, but the propagation delay is now only two gates. Furthermore, the inverted signals are often already available as outputs of a previous stage. In that case, the inverters can be omitted.
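Both XOR2 variants can be compared at the Boolean level. The sketch assumes the "NAND2 implementation" is the classic four-gate NAND XOR (three gate delays on its longest path); the AOI2 version computes NOT(A·B + ~A·~B) and accepts externally supplied inverted inputs, which is the case where the two inverters can be dropped.

```python
from itertools import product


def nand2(a, b):
    return 1 - (a & b)


def inv(a):
    return 1 - a


def aoi2(a, b, c, d):
    """AND-OR-INVERT: NOT((a AND b) OR (c AND d))."""
    return 1 - ((a & b) | (c & d))


def xor_nand(a, b):
    """Classic four-NAND2 XOR: three gate delays on the longest path."""
    n1 = nand2(a, b)
    return nand2(nand2(a, n1), nand2(b, n1))


def xor_aoi(a, b, a_n=None, b_n=None):
    """AOI2 XOR: NOT(A*B + ~A*~B). Inverted inputs may come from a previous stage."""
    a_n = inv(a) if a_n is None else a_n
    b_n = inv(b) if b_n is None else b_n
    return aoi2(a, b, a_n, b_n)


for a, b in product((0, 1), repeat=2):
    assert xor_nand(a, b) == xor_aoi(a, b) == a ^ b
```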
An LTL version of an XOR2 gate based on a cross-coupled transistor pair is shown above. First, this device has an output inverter to restore the low level. An input diode and resistor are added to avoid sinking current when the input is high. To fix the threshold levels, two additional diodes were added (D1, D3). LEDs cannot be used in this place, because there are other elements in the current path (D1, D2, the output transistor of the preceding gate) that add to the threshold voltage. The threshold level is still defined by the differential voltage between the two inputs. This is still a concern, but a little less relevant now since the output levels have been restored. Assuming the input low level is 1x V_CE,sat, the threshold level is $V_{th} = V_{D1} - V_{D3} + V_{BE} + V_{D2} + V_{CE,sat} \approx 0.7 + 0.7 + 0.2 \approx 1.6\,\mathrm{V}$, since $V_{D1}$ and $V_{D3}$ cancel. This is much higher than the 0.7 V of the bare transistor, but still not the same as the LTL threshold levels.
After successful validation of the LTL concept in real hardware I started to build up a library of common gates in LTspice as a foundation for the design of a CPU.
The basic gate of LTL is the NAND2 gate; its symbol and circuit are shown above. In my final design I used gates with different threshold levels, so I added a "G" to the name of the high-threshold device with a green LED.
Every gate was tested in a simple testbench in LTspice. I arbitrarily chose a fan out (FO) of 7 for the test case, although this does not occur in the real design.
Test waveforms for the NAND2 gate are shown above; nothing peculiar. There is a little crosstalk between the gate inputs when one input is high and the other is pulled low. This is due to the extremely high slew rate of the falling edge at the output. In a physical implementation, this will hopefully be reduced somewhat by additional parasitic capacitance to the power plane.
Next is the NOR2 gate. This can be easily realized by a wired AND of two LTL inverters. A minor but very important detail: if a gate uses a wired AND at the output, neither of the LEDs will represent the output signal. In practice, this means that additional indicator LEDs may have to be added to monitor certain nodes.
Last is the AOI2 gate (AND-OR-INVERT). You may not be familiar with this kind of gate, but it is a very useful building block due to its simple implementation. For example, it can be used as a multiplexer or as part of an ALU.
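At the logic level, these basic gates can be summarised in a few functions. The wired AND models outputs tied together (any output that pulls low wins), which is how the NOR2 is formed from two inverters. The AOI2 is shown as a 2:1 multiplexer with two separate enable lines, as in the address path; its output is the complement of the selected input. These are idealised Boolean models with hypothetical helper names, not electrical models of the LTL circuits.

```python
from itertools import product


def nand2(a, b):
    return 1 - (a & b)


def inv(a):
    return 1 - a


def wired_and(*outputs):
    """Outputs tied together: any output pulling low forces the node low."""
    return min(outputs)


def nor2(a, b):
    """NOR2 as a wired AND of two inverters."""
    return wired_and(inv(a), inv(b))


def aoi2(a, b, c, d):
    """AND-OR-INVERT: NOT((a AND b) OR (c AND d))."""
    return 1 - ((a & b) | (c & d))


def mux2_n(en0, d0, en1, d1):
    """AOI2 used as a multiplexer; returns the complement of the enabled input."""
    return aoi2(en0, d0, en1, d1)


for a, b in product((0, 1), repeat=2):
    assert nand2(a, b) == 1 - (a & b)
    assert nor2(a, b) == 1 - (a | b)
    assert mux2_n(1, a, 0, b) == 1 - a
    assert mux2_n(0, a, 1, b) == 1 - b
```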
Finally, a list of part counts for each gate. Since my intention is to build a CPU with a minimal number of discrete components, it is important to keep track of this.
Not too exciting, so let's get to the more special building blocks next...