05/31/2020 at 22:43 •
The assembled control unit arrived a while ago, but I only got around to testing it today.
You can see a photograph of the PCB above. Unfortunately I already noticed a stupid mistake at this level: I forgot to route the reset signal to the I/O header. Since this was a circuit level mistake, no DRC caught this. Well, luckily that's easily fixed with an additional wire directly to the reset-driver.
04/27/2020 at 20:30 •
Designing the control unit is one of the more complex tasks in MCU design. What used to be a few lines of VHDL turns out to be a tedious task when boiling it down to discrete gates. Or maybe I need a better design flow...
Let's look at the problem at hand. The final LCPU will consist of an 8 bit data path and an 8 bit address path. The upper two bits of the address path will be used for the opcode. Since both the data and the ALU boards are 4 bits wide, two of each have to be used.
To complete the CPU, we need a control unit that outputs the right control signals depending on the opcode and the state machine cycle.
The original MCPU has a rather simple control unit that consists of a state machine that is directly mapped to opcodes. Decoding of the states is localized at the datapath. Since the LCPU uses a bitslice approach, it is more efficient to also move the decoding of states into control signals to a centralized control unit. Generally, this design choice makes the design easier to extend and to debug. In terms of robustness, localized state decoding would be better, since it reduces the number of control signals.
There are 8 ALU board control inputs (Carry In, CLK_ACCU, CLK_DAT, Aluctrl[4:0]) and one output (Carry Out). The address board has 4 control inputs (SEL_PC, SEL_DATA, nCLK_PC, nCLK_ADR) and two outputs (OPC1, OPC0 -> A[7:6]).
The control board itself has a few internal registers for carry and states, and inputs for a two phase clock (phi1, phi2) and reset.
As a first step, I mapped all control signals to states and clock sub-cycles. The LCPU is still able to execute every instruction in a maximum of two clock cycles, despite switching from edge-triggered FFs to latches.
ALU control signal encoding in dependence of Opcode and state.
03/28/2020 at 17:53 •
Thanks to express shipping, the populated PCBs arrived less than a week after ordering them. A top down photo of one ALU board with 4 bitslices is shown below.
The dual-input diodes for the boolean unit multiplexer line up neatly in one row.
I tested the board using an ATMega168 to generate stimulus signals. Unfortunately, I found a small mistake I made earlier in the design: The generate term in the carry chain should not have an additional carry input, see below. Luckily it was easily possible to fix this by removing one diode per slice from the board. You can see the unpopulated pads in the center of the top down image.
Leaving out this signal from the start would have allowed a significantly denser layout, with room to add the carry-chain inverter to the output so that the carry signal is non-inverting. But well, next time...
The list below shows the control signal configurations that were tested. Only the first three cases apply to operations that are used by the MCPU ISA. In addition, Y=B is needed for address loading.
The other operations would allow for a later extension of the instruction set.
Clock speed was tested up to 2 MHz. Even for the worst case (ADD with carry in=1), the signal integrity looked excellent, suggesting that 4 MHz and higher may also work. One limitation seems to be the pull-up capability of the clock driver with a fan-out of 8. It may be a good idea to reduce the collector resistor on high fan-out drivers or to design a push-pull driver.
03/18/2020 at 19:22 •
The ALU board is quite a bit more complex than the datapath board. To still be able to meet a <10x10cm² footprint, the layout had to be densified compared to the address-datapath board.
The circuit of one slice of the ALU board is shown above. The number of dual-diodes is quite high in this design due to the wide-input AOI gates used as multiplexers for the boolean units. The XNOR gate on the right side is based on a cross coupled transistor pair. No LED could be used here due to the stacking of several PN junctions in the current path.
Eagle was used for the layout, making extensive use of the block functionality. Since only a two layer PCB is used, a good strategy is needed to make the layout as tight as possible.
As a general rule, I used the front side for local interconnects and the power rails. Each slice was laid out with a fixed height of 0.65", which allows placing two gates between the power rails. Copper fill on both front and rear side is used for the ground plane. The rear side is only used for global routing (I/O and control signals).
After estimating the resistance of the copper traces, I noticed that it was possible to go to much smaller line widths.
The image above shows a comparison of the old and new polarity hold latch layout. The new layout is much denser and takes 30% less PCB space. One interesting observation was that going from 0603 to 0402 resistors actually increased the layout area, since it was no longer possible to pass connections below the components. Therefore I stayed with 0603 passives. Using smaller transistor packages was not an option either, since I was not able to source the transistor in a smaller package.
The full layout, excluding ground plane, is shown above. Again, it was possible to fit four bit-slices on one board. In contrast to the previous design, I decided to include buffers for the control signals, located at the top. The fan out of the clock signal on this board is 8 gates. Combining several boards requires a clock tree for proper drive-ability.
The total PCB size is 10x7 cm², only marginally larger than the address board. Due to the denser layout it contains 379 devices (75xT, 95xD, 67xLED, 122xR, 20xC), 41% more than the datapath board.
03/04/2020 at 06:25 •
Time for another update. Mind you, at this point I am not documenting work that was already done in the past, but am following my actual development in real time. Therefore updates will be spaced much further apart, depending on how much time I can devote to this project.
After finishing the address-datapath design, it is time for the data-datapath.
The data-datapath encompasses the left side of the system design. The original MCPU ALU was rather simple and only supported the two operations that were needed directly by the ISA: (Y,Cout)=A + B, Y=A NOR B.
However, I changed the datapath a little to allow use of latches in the design. As a consequence, all reads are directed through the ALU. Storing the accumulator to memory also requires data to pass through the ALU. Therefore, in addition to the operations above, Y=A and Y=B also have to be supported. When looking at the detailed logic, this adds quite some overhead.
A lot of deliberation followed, including whether to merge accu+data latch into one edge triggered register as done in the original MCPU. The problem is that in that case an additional data input latch has to be added to the address-datapath. In the end it turned out to be more part-count efficient to stay with the latch based architecture.
Two different designs were evaluated in detail.
A first implementation option for a single bit of the data-datapath is shown above. This design implements a full adder consisting of two XNOR2 gates, an inverter and an AOI2 gate. Y=A/Y=B/Y=A OR B is realized with an AOI3 based multiplexer in the first stage. The carry line can be fixed to zero or one by using the Sinv and Sadd control inputs. This allows configuring the second XNOR gate to invert the result, resulting in Y=A NOR B.
The table above shows the individual component usage. The total part count for this option is 80 per bit.
The second option I looked into is a bit more universal and uses an AND-OR-INVERTER based multiplexer for the first stage of the ALU. (EDIT: 20/03/28, updated to fixed version.) By using the four control signals, it is possible to generate any boolean function of the two inputs, offering the possibility to extend the instruction set of the CPU at a later time. This design is, of course, heavily inspired by the MT-15 ALU. Since the first stage is now also able to implement NOR, it was possible to remove the Sinv control signal. Instead only Sadd is present, which will inhibit carry generation if set to 0.
The design functionality was tested in the test bench above.
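To make the second option more concrete, here is a minimal behavioural sketch in Python. It is not derived from the schematic: the function names, the truth-table encoding of the four select signals and the non-inverted carry polarity are assumptions chosen for illustration. The boolean unit is modelled as a 4-entry lookup selected by (A, B), the result is XORed with the incoming carry, and Sadd=0 blocks the carry output. The generate term is A AND B without a carry input.

```python
# Hypothetical truth tables for the boolean unit, indexed by (a << 1) | b.
F_NOR = 0b0001  # 1 only for a=0, b=0
F_XOR = 0b0110  # propagate term used for ADD
F_A = 0b1100    # Y = A
F_B = 0b1010    # Y = B


def alu_slice(a, b, carry_in, func, sadd):
    """One bit slice: returns (y, carry_out). Polarities are idealized."""
    p = (func >> ((a << 1) | b)) & 1      # boolean unit (AOI multiplexer)
    y = p ^ carry_in                      # second stage
    carry_out = sadd & ((a & b) | (p & carry_in))
    return y, carry_out


def alu(a, b, func, sadd, carry_in=0, width=8):
    """Ripple `width` slices together, LSB first. Returns (y, carry_out)."""
    y = 0
    carry = carry_in
    for i in range(width):
        bit, carry = alu_slice((a >> i) & 1, (b >> i) & 1, carry, func, sadd)
        y |= bit << i
    return y, carry
```

In this model ADD is `alu(a, b, F_XOR, 1)`, while NOR, Y=A and Y=B use Sadd=0 and a carry input of 0, so the second stage passes the boolean result through unchanged.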
02/22/2020 at 17:17 •
02/22/2020 at 11:25 •
Designing PCBs with regular structures using free tools turned out to be surprisingly tedious. In the end I used the free version of Eagle due to its block feature, which allows replicating circuit and layout units. The other options I looked at were EasyEDA and KiCad.
Unfortunately, the free version of Eagle is limited to a single sheet, so I ended up transferring my nice hierarchical LTspice implementation into a single-page mess of a schematic. The image above shows a one bit slice.
Curiously enough, none of the many LEDs in the LTL gates represented the actual state of the address output pins. Therefore I introduced two additional indicator LEDs.
The final PCB layout is shown above. I managed to fit a four bit wide slice of the address path onto one 8.5x7 cm² PCB.
02/22/2020 at 10:29 •
Implementation of the address path is straightforward based on the previously designed gates.
The first design of the address path is shown above. A half adder is needed to increment the PC. Since the gates are relatively fast, I used a ripple carry adder. There are two latches, one for the address and one for the PC. An AOI2 gate is used as a multiplexer between PC and external address. The address latch can be loaded with zeros by pulling both control inputs of the MUX low. This can be used to reset the PC.
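The combinational part of the slice can be sketched in a few lines of Python. This is a logical model only: the function names are made up for illustration, and the inversion of the real AOI2 gate is ignored, so the multiplexer is shown in its non-inverted form.

```python
def half_adder(a, carry_in):
    """Returns (sum, carry_out) for one bit of the incrementer."""
    return a ^ carry_in, a & carry_in


def increment(bits):
    """Ripple carry increment of a list of bits, LSB first."""
    carry = 1  # adding one is the same as a carry into the lowest bit
    result = []
    for bit in bits:
        s, carry = half_adder(bit, carry)
        result.append(s)
    return result, carry


def address_mux(pc_bit, ext_bit, sel_pc, sel_data):
    """AND-OR multiplexer; both selects low yields 0 (used for reset)."""
    return (pc_bit & sel_pc) | (ext_bit & sel_data)
```

With both select inputs at 0 the multiplexer output is 0 regardless of the data inputs, which is the reset behaviour described above.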
Six of the bit slices were combined in a testbench to test the design of the full width datapath. A two phase clock is needed to clock the two latches alternately. In this setting the circuit is configured to reset the PC and then increment it with each clock cycle.
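The sequencing in this testbench can be mimicked with a small Python model. Which latch is transparent in which phase is an assumption here, as are all names; the sketch only shows why two non-overlapping phases are enough to step the PC without edge-triggered registers.

```python
class Latch:
    """Transparent latch: follows d while enabled, holds otherwise."""

    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q


def run_testbench(cycles, width=6):
    """Reset in the first cycle, then increment once per clock cycle."""
    mask = (1 << width) - 1
    address, pc = Latch(), Latch()
    trace = []
    for cycle in range(cycles):
        reset = cycle == 0
        # Phase 1 (assumed): address latch is transparent, PC latch holds.
        address.update(0 if reset else pc.q, True)
        # Phase 2 (assumed): PC latch takes address + 1, address latch holds.
        pc.update((address.q + 1) & mask, True)
        trace.append(address.q)
    return trace
```

Because only one latch is transparent at a time, the incremented value never races back through the adder within the same phase.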
Output traces of clock, reset and the first two output bits are shown above. Everything works nicely.
02/22/2020 at 00:17 •
The original MCPU architecture is shown above. There are separate flows for the address and data-path. The state machine is not shown. Since this design targets a CPLD, it assumes edge triggered flip-flops for all registers.
The MCPU programmer's model is based on two registers: Accumulator and PC. The ISA consists of four instructions in a fixed encoding that is based on a two bit opcode and a six bit memory address. This ISA can be directly mapped to the datapath shown above with minimal control overhead. See MCPU link for more information including assembler, emulator and example code.
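As a small illustration of this encoding, the following Python sketch splits an instruction byte into its two fields. It assumes the opcode sits in the upper two bits, and the function names are invented for this example.

```python
def decode(instruction):
    """Split an 8 bit instruction into (opcode, address)."""
    if not 0 <= instruction <= 0xFF:
        raise ValueError("instruction must fit into 8 bits")
    opcode = instruction >> 6      # upper two bits
    address = instruction & 0x3F   # lower six bits
    return opcode, address


def encode(opcode, address):
    """Inverse of decode()."""
    if not 0 <= opcode <= 3 or not 0 <= address <= 0x3F:
        raise ValueError("opcode is 2 bits, address is 6 bits")
    return (opcode << 6) | address
```

The six bit address field limits the directly addressable memory to 64 locations.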
02/19/2020 at 20:23 •
Now to the last category of building blocks: Flip Flops.
If you grew up learning about digital electronics in the advanced CMOS era, like me, you will most likely be accustomed to using edge triggered flip flops for everything. Unfortunately, it turns out that proper edge triggered flip flops require at least 6 NAND gate equivalents, unless you have dynamic CMOS logic at your disposal.
For those of us who were suddenly beamed into the discrete LTL age, latches are a much more part count efficient solution, as they only consume about half as many components as a static edge triggered flip flop.
A commonly known minimal representation of a gated D-latch in NAND2 is shown above. See also Wikipedia article. A nice property of this design is that it only requires a single clock input and has both inverted and non-inverted data outputs. Data from Din is forwarded to the output while Clk is high. The state of the latch is frozen on the high->low transition of Clk and held while Clk is low.
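The behaviour of this latch can be reproduced with a short Python model of the four NAND2 gates. The gate wiring follows the common textbook arrangement and is assumed to match the figure; the class and variable names are illustrative, and timing is ignored entirely.

```python
def nand(a, b):
    return 1 - (a & b)


class GatedDLatch:
    """Zero-delay model of a gated D-latch built from four NAND2 gates."""

    def __init__(self):
        self.q, self.qn = 0, 1

    def apply(self, d, clk):
        s = nand(d, clk)  # top-left gate
        r = nand(s, clk)  # lower-left gate, sees clk directly and via s
        # Iterate the cross-coupled output pair until it settles.
        for _ in range(4):
            q = nand(s, self.qn)
            qn = nand(r, q)
            if (q, qn) == (self.q, self.qn):
                break
            self.q, self.qn = q, qn
        return self.q, self.qn
```

Calling `apply(d, 1)` makes the outputs follow `d`, and any call with `clk=0` returns the previously stored state.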
When using this design in high speed circuits it becomes apparent that it has a nasty habit of generating glitches. The origin of this effect is the NAND2 gate in the lower left. The clock signal arrives on one input directly and on the other it is delayed through the NAND2 gate in the top left.
This effect can be somewhat reduced by tweaking the propagation delay of the logic gates. In LTL this is easily possible by changing the LED color to change threshold voltage. In this case I introduced a faster "red" (hence the R) gate with lower threshold as the top left gate.
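The race can be made visible with an idealized unit-delay simulation in Python. Every gate is assumed to have exactly one time step of delay, and the faster gate is modelled as having no delay at all, which exaggerates the real effect. The function name and the chosen starting state (D=1, Q=1 held, clock rising) are assumptions for illustration.

```python
def nand(a, b):
    return 1 - (a & b)


def rising_edge_trace(fast_top_left, steps=4):
    """Trace (s, r, q, qn) after clk goes high with d=1 and q=1 stored."""
    d, clk = 1, 1
    s, r, q, qn = 1, 1, 1, 0  # settled state while clk was low
    trace = [(s, r, q, qn)]
    for _ in range(steps):
        s_next = nand(d, clk)  # top-left gate
        if fast_top_left:
            s = s_next  # idealized: its output is visible immediately
        r_next = nand(s, clk)  # lower-left gate
        q_next = nand(s, qn)
        qn_next = nand(r, q)
        s, r, q, qn = s_next, r_next, q_next, qn_next
        trace.append((s, r, q, qn))
    return trace
```

In this model the lower-left gate briefly sees the new clock together with the old output of the top-left gate, which produces a short pulse on the inverted output even though the stored value does not change. With the idealized fast gate the pulse disappears.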
There is also a way to reduce gate count to three by replacing one of the NAND2 gates with a wired AND. This is described in a now expired patent. (A similar design is described in this patent)
The patent also describes working around the aforementioned glitch by introducing one faster gate. A disadvantage of this design is that it only has an inverted output. The clock input is also inverted compared to the previous design: Data will be forwarded for clk='0' and held for clk='1'.