There are plenty of great implementations of different and historically important CPUs available on various FPGA-based platforms, but to my knowledge very few attempt to implement calculator CPUs. Old calculators were among the first examples of "microcontrollers" because they contained CPU, RAM, ROM and I/O (interface with keyboard and displays) on the same chip. Given that they are a bit different, I thought it would be useful and interesting to describe the design path I took.
The original patent (https://patents.google.com/patent/US3934233) explains the guts of the calculator CPU in a very detailed and accurate manner (albeit in the specific “patentesque” language intellectual property lawyers may be familiar with). However, there are a few challenges in using this patent for direct implementation in VHDL:
- 1970s MOS technology is heavy on latches, which doesn’t align too well with FPGAs, which are all about clocked registers
- The design had to be expanded to include both TI and Sinclair (the originals are distinct and separate chips, each with its own mask and ROM contents)
- The original is not microcode driven
Even so, the end result still somewhat resembles the main components of the original CPU. This project log describes the internals of the calculator as they come together in the main CPU entity implementation file - https://github.com/zpekic/Sys0800/blob/master/TMS0800/tms0800.vhd
The main source of info driving the display unit is the A register. It is however not just a bunch of simple BCD digits that can be directly multiplexed out to the 7 segment + decimal point display because:
- TI and Sinclair have different numeric formats
- Different digit values are used for negative sign
- Error processing is different (Sinclair has essentially none, while TI uses bit 5 in the BFLAG register)
- TI displays the decimal point at the position indicated by the value in the LSD of register A, while Sinclair always displays the decimal point in a fixed place
- TI has blanking of leading zeroes, Sinclair doesn’t
All of the details above are hidden from the main entity. What comes out is the multiplexed segment (anode) / digits (cathode) output which can drive the display but also the columns of the keyboard. Note that the digits are driven from “digit10” (MSD) to “digit0” (LSD) because that is the only reasonable way to implement leading zero blanking. This is a problem because most of the calculations happen from LSD towards MSD (such as in add/subtract start with 1s then 10s, then 100s etc.).
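The reason for scanning from MSD to LSD can be illustrated with a small behavioral model. This is a minimal Python sketch, not the VHDL from the project: the function name and the `BLANK` marker are made up for illustration, and it ignores the decimal point and sign handling that the real TI display logic also has to respect.

```python
BLANK = " "  # hypothetical marker for a digit position that stays dark


def blank_leading_zeros(digits):
    """digits: list of BCD values, index 0 = MSD. Returns the displayed characters."""
    shown = []
    seen_nonzero = False
    for position, value in enumerate(digits):
        is_last = position == len(digits) - 1
        # Once a non-zero digit has been seen, every later digit is shown.
        # The last digit is always shown so that zero displays as "0".
        if value != 0 or is_last:
            seen_nonzero = True
        shown.append(str(value) if seen_nonzero else BLANK)
    return "".join(shown)
```

Because the decision for each digit depends only on the digits to its left, a single "seen non-zero" bit is enough when scanning from the MSD; scanning from the LSD would require looking ahead at all remaining digits.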
There are two program ROMs in the CPU, one containing the TI code, the other the Sinclair code. At build time, the appropriate “.asm” file is parsed and loaded as binary content into these ROMs. Note that:
- TMS0800 instructions are originally 11-bit wide words. In this design the extra MSB is used to indicate a hardware breakpoint (not present in original chips)
- 320 words add up to one 256-word ROM plus one 64-word ROM. In 1975 chip real estate was tight, but on modern FPGAs it is much simpler to “round up” to 512 words
- Both ROMs are driven from the same 9-bit program counter register (“signal pc:…”) and both outputs go into a 2-to-1 12 bit multiplexer as selected by the “Sinclair” input signal
- The “.asm” files are checked in, but they can be re-generated by running the GenAsm C# helper tool which downloads them from Ken Shirriff’s site
    @echo off
    AsmGenerator\bin\Debug\AsmGenerator.exe http://files.righto.com/calculator/sourceCode.js TMS0800\sourceCode_ti.asm
    AsmGenerator\bin\Debug\AsmGenerator.exe http://files.righto.com/calculator/sourceCode_sinclair.js TMS0800\sourceCode_sinclair.asm
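The fetch path described in the list above (two ROMs, one shared program counter, a 2-to-1 multiplexer and a breakpoint bit in the MSB) can be modeled in a few lines. This is a Python sketch under stated assumptions: the function and constant names are hypothetical, and the ROMs are plain lists of 12-bit integers rather than inferred block RAM.

```python
ROM_DEPTH = 512           # 320 words rounded up to a power of two
INSTRUCTION_MASK = 0x7FF  # low 11 bits: the original instruction word
BREAKPOINT_BIT = 11       # extra MSB added by this design


def fetch(ti_rom, sinclair_rom, pc, sinclair):
    """Return (instruction, breakpoint_hit) for the 9-bit program counter pc."""
    address = pc & (ROM_DEPTH - 1)
    # Both ROMs see the same address; "sinclair" selects which output is used.
    word = sinclair_rom[address] if sinclair else ti_rom[address]
    instruction = word & INSTRUCTION_MASK
    breakpoint_hit = ((word >> BREAKPOINT_BIT) & 1) == 1
    return instruction, breakpoint_hit
```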
This CPU is also interesting in the sense that it is a rudimentary SIMD (single instruction, multiple data) processor. Many instructions select 1 out of 16 “masks” in the lower 4 bits of the instruction. These masks are encoded in the “km: masktable” constant (from line 169 onward):
- 32 entries are used, lower 16 are for TI (as address bit 4 is simply tied to “sinclair” which is 0 or 1, and address 3..0 are tied directly to lower 4 bits of the current instruction)
- Width is 44 (11 * 4) bits with some encoding: valid BCD digits 0..9 means mask will be applied, F means it will not. Two separate values will be created from the mask: 11 BCD digits and 11 enable/disable bits (lines 422 and 423)
What this means is that with TI mask M11 (Mant1), only digits 2 to 11 will be affected (think of this as for (digit := 2; digit <= 11; digit++)… loop) and for those digits that are “affected” the value will be available to the ALU.
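The split of one 44-bit table entry into a constant and an enable vector can be sketched as follows. This is an illustrative Python model, not the VHDL on lines 422 and 423: the function name is hypothetical, the nibble order simply follows the hex literal from left to right, and the entry value used in the usage note is made up rather than copied from the real mask table.

```python
def decode_mask(entry):
    """Split a 44-bit mask entry into (constant_digits, enable_bits).

    Each of the 11 nibbles is either a BCD digit 0..9 (position is masked in,
    and the nibble doubles as the constant) or 0xF (position is masked out).
    Both lists are ordered like the hex literal, leftmost nibble first.
    """
    digits, enables = [], []
    for i in range(10, -1, -1):
        nibble = (entry >> (4 * i)) & 0xF
        enabled = nibble <= 9
        enables.append(1 if enabled else 0)
        # Masked-out positions contribute a zero constant.
        digits.append(nibble if enabled else 0)
    return digits, enables
```

For example, a hypothetical entry `0xFFFFFFFFF01` would enable only the two rightmost positions and supply the constant 1 in the last one.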
This component encapsulates the microcoded core that drives the design. The organization and working of the microcode is described in another log but looking at it as a black box, it needs:
- Current instruction (for the microcode, this is the program that is being invoked)
- Conditions (so that it can sense the internal state of the CPU in order to drive conditional microcode execution, waits, loops etc)
- Clock and reset signal to drive it (at reset, the microinstruction register is set to 0 and starts there to initialize the CPU state)
What comes out is a set of control signals (32 in this case) that drive various enable/disable and multiplexer select signals in the CPU. Note that the control unit consumes some control signals internally for its own operation (if/else, repeat, wait etc) that the CPU doesn’t need to know about so they remain inside the control unit.
There are 3 “working registers” in the CPU used for calculations: A, B, C, each 11 4-bit BCD digits long. The "samdigit" encapsulates the basic data unit in the calculator, and it contains a 4-input multiplexer and a 4-bit register, updated on the rising edge of the common clock signal if all the enable signals are on. The functionality is controlled by the 2-bit "reg_verb" coming from the microinstruction word:
bcd_fromleft - used to do /10, meaning digits are shifted down. Only digits enabled by "m = '1'" are affected, and depending on the mask of digit to the left, value 0 can be picked up. So for example, "99999999912" becomes "09999999912" if mantissa digits are masked in. In case of /10 the loop has to start from LSD and go toward MSD (enable register is rotated left - uc_e(e_rol))
    DIV10 => -- SRLA, SRLB, SRLC
        uc_ss(ss_off) or uc_if(cond_e11, upc_next, uc_label(CONTINUE)),
    77 => uc_ss(ss_off) or uc_sam(sam_update) or uc_reg(bcd_fromleft),
    78 => uc_ss(ss_off) or uc_e(e_rol) or uc_goto(uc_label(76)),
bcd_fromright - used to do *10, meaning digits are shifted up. The same logic as for shift down applies, but on the right side (from LSD). The shift up loop starts from MSD and goes toward LSD (enable register is rotated right - uc_e(e_ror))
    MUL10 => -- SLLA, SLLB, SLLC
        uc_ss(ss_off) or uc_if(cond_e11, upc_next, uc_label(CONTINUE)),
    74 => uc_ss(ss_off) or uc_sam(sam_update) or uc_reg(bcd_fromright),
    75 => uc_ss(ss_off) or uc_e(e_ror) or uc_goto(uc_label(73)),
bcd_fromalu - new BCD value is loaded from the output of ALU
Fourth multiplexer input is simply used to "recirculate" the value in the digit, this is the "nop" operation, and as explained elsewhere these are always defined as zero values in microinstruction to facilitate easy "or addition" of microinstruction fields. This pattern is used quite a bit for other registers too, in order to avoid "gated clocks" most synthesis tools dislike.
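To see why the loop direction matters for the two shift verbs, here is a digit-serial Python model that updates the register in place, one digit per step, the way the microcode loop does. It is a sketch under assumptions: the function names are hypothetical, list index 0 is the MSD (so a register reads like the string in the example above), and the mask is a list of 0/1 values per position.

```python
def to_digits(text):
    """Convert a string such as "99999999912" to a list of BCD digits (MSD first)."""
    return [int(ch) for ch in text]


def shift_down(digits, mask):
    """/10: walk from LSD to MSD so each digit reads a not-yet-updated left neighbor."""
    d = list(digits)
    for i in range(len(d) - 1, -1, -1):
        if mask[i]:
            # A missing or masked-out left neighbor contributes 0.
            d[i] = d[i - 1] if i > 0 and mask[i - 1] else 0
    return d


def shift_up(digits, mask):
    """*10: walk from MSD to LSD so each digit reads a not-yet-updated right neighbor."""
    d = list(digits)
    last = len(d) - 1
    for i in range(len(d)):
        if mask[i]:
            d[i] = d[i + 1] if i < last and mask[i + 1] else 0
    return d
```

With the nine mantissa digits masked in, `shift_down(to_digits("99999999912"), [1] * 9 + [0, 0])` reproduces the "09999999912" example. Running either loop in the opposite direction would smear one value across all the masked digits.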
Digit only changes value if:
m is enabled ("1") - all digits in the same position for A, B, C registers are tied to the same mask enable bit
nEnable is enabled ("0") - all digits in the same register are tied to same nEnable
Given the above, the digit part of the "SAM" - (Sequentially Addressed Memory as called in the original patent) can be seen as a matrix of 3 rows (A, B, C) and 11 columns (digits), that can be individually targeted for update using mask and enable signals driven by the microcode.
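The row/column targeting can be modeled like this. This is an illustrative Python sketch with hypothetical names: rows are dictionary entries, `n_enable` is active low per row as described above, `mask` is active high per column, and the clock edge is represented by the function call.

```python
def update_sam(sam, n_enable, mask, new_digits):
    """Apply one clocked update to the digit part of the SAM.

    sam:        dict mapping row name ("A", "B", "C") to a list of 11 digits
    n_enable:   dict mapping row name to 0 (row selected) or 1 (row held)
    mask:       list of 11 values, 1 = column selected
    new_digits: list of 11 candidate values (e.g. from the ALU or a neighbor)
    """
    for row, digits in sam.items():
        if n_enable[row] != 0:
            continue  # row not enabled: every digit recirculates its own value
        for col in range(len(digits)):
            if mask[col] == 1:
                digits[col] = new_digits[col]
    return sam
```

A cell that is not targeted keeps its value by selecting the recirculating multiplexer input, which is how the design avoids gating the clock.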
Very similar to SamDigit, except containing a single bit. Given that there is no shifting left and right, these bits do not need to be connected to their neighbors left and right. The SAM matrix is 2 rows (AF and BF registers) and 11 columns (flags). The function of the AF and BF registers is to keep state during calculation, and for that purpose they support the following verbs:
bit_zero - clear the bit
bit_load - load from external input, which is tied to af(i) xor bf(i)
bit_invert - as expected, but also used to set by first zeroing then inverting
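The three verbs are small enough to model directly. This is a Python sketch with a hypothetical function name; it leaves out the mask and enable gating and only shows what each verb does to an already-selected bit.

```python
def sambit_next(current, verb, external=0):
    """Next value of one flag bit for a given verb (any other verb holds the value)."""
    if verb == "bit_zero":
        return 0
    if verb == "bit_load":
        return external & 1  # in the design this input is af(i) xor bf(i)
    if verb == "bit_invert":
        return current ^ 1
    return current
```

Setting a flag then takes two steps: `sambit_next(sambit_next(x, "bit_zero"), "bit_invert")` yields 1 regardless of the starting value `x`.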
This loop creates the "SAM" - 11 digit registers A, B, C and 11 bit flag registers AF and BF, which (along with carry flag CF) make up the "programming model" of the calculator.
11 - bit mask word is generated based on the value in key/mask ROM
11 - digit constant is also extracted from there (if mask is 0, the constant will be zero)
AF, BF - simple hook-up because these don't care about left and right neighbors
A, B, C - this is split up in three sections:
- leftmost - they have no left neighbors, so that side is "plugged" with zero
- middle - they are same, uniformly hooked up to left and right neighbors
- rightmost - they have no right neighbors so that side is "plugged" with zero
These multiplexers are used in many places in the design. They act as regular multiplexers, but the selection is driven by the enable register, which is assumed to have at most 1 bit active (in other words, it is usually some "one hot" ring counter). To save some FPGA real estate, each one has only 44 input signals instead of the 64 that a full 16-to-1 by 4 multiplexer would have.
amux, bmux, cmux, kmux
The ALU width is 4 bits, meaning it can process two 4-bit digits and produce one 4-bit output. To bring the right digit value to the inputs of the ALU, these multiplexers are hooked up to the registers A, B, C, K (constant values) and their outputs are brought into the ALU. They are all tied to the same enable. So if the enable register is "11111101111", the ALU can process any two out of the A(4), B(4), C(4), K(4) values ("0" is enable in this case)
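A one-hot multiplexer of this kind boils down to an AND-OR structure. The Python model below is a sketch with assumptions spelled out: the function name mirrors the mux11x4 component but the body is my own, bit 0 of the enable word is taken to be the LSD position, and the enable is active low as in the example above.

```python
def mux11x4(digits, enable):
    """Select one 4-bit digit using an 11-bit active-low one-hot enable word.

    digits: list of 11 values, index = digit position (0 = LSD)
    enable: 11-bit integer; a 0 bit selects the digit at that position
    """
    out = 0
    for position in range(11):
        selected = ((enable >> position) & 1) == 0
        if selected:
            # OR is enough because at most one position is expected to be selected.
            out |= digits[position] & 0xF
    return out
```

With `enable = 0b11111101111` the model returns the digit at position 4, and with no position selected it returns 0.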
The mux11x4 is reused here to bring values of flags AF, BF and M (mask) to the CF ("carry flag") logic, and also to the debug unit for display.
This is a purely combinational blob of logic in which most of the calculations happen. It contains the following main components:
input multiplexer ("rs") - Only a few input combinations are needed for calculation: AB, AK, CK, CB
output multiplexer for value ("y") - Selects the output based on the function. All calculations happen in parallel, and this mux brings the right one to the output
output multiplexer for carry ("cout") - similar to "y" but applicable only for add and subtract, in other cases, carry in is passed through
The only arithmetic operations are add and subtract. These can be done in simple binary, or in BCD. The BCD variant corrects the binary result using 2 lookup tables ("adcbcd" and "sbcbcd"). So the maximum value of 9 + 9 + 1, which is 0x13 in binary, has to become BCD 19 (digit 9 with a carry out of 1).
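The table-based correction can be sketched in Python as follows. The table names echo "adcbcd" and "sbcbcd", but the contents here are computed for illustration and are not copied from the VHDL source; the function names are hypothetical as well.

```python
# Index: binary sum r + s + carry_in, range 0..19. Value: (bcd_digit, carry_out).
ADC_BCD = [(total % 10, total // 10) for total in range(20)]

# Index: (r - s - borrow_in) + 10, range 0..19. Value: (bcd_digit, borrow_out).
SBC_BCD = [(diff % 10, 1 if diff < 0 else 0) for diff in range(-10, 10)]


def bcd_add(r, s, carry_in):
    """Add two BCD digits plus carry; returns (digit, carry_out)."""
    return ADC_BCD[r + s + carry_in]


def bcd_sub(r, s, borrow_in):
    """Subtract s and borrow from r; returns (digit, borrow_out)."""
    return SBC_BCD[r - s - borrow_in + 10]
```

The worst case from the text, `bcd_add(9, 9, 1)`, looks up entry 19 (0x13) and returns digit 9 with carry out 1.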
One can view the CF flag as a single bit "ALU" which can be updated with various values based on the verb defined in the microinstruction. This allows CF to be cleared, set, set to the carry out of the ALU etc. In most cases, CF is updated only if the mask is set. For example, if this routine is run with a mask covering only the 2 exponent digits, the CF will be updated only twice, although the loop will iterate over all 11 digits (this is also a future optimization possibility: short circuit the remaining iterations if no remaining digits in the loop are masked in):
    SFX => -- SFB, SFA
        uc_ss(ss_off) or uc_if(cond_e11, upc_next, uc_label(CONTINUE)),
    106 => uc_ss(ss_off) or uc_sam(sam_update) or uc_flag(bit_zero),
    107 => uc_ss(ss_off) or uc_sam(sam_update) or uc_flag(bit_invert),
    108 => uc_ss(ss_off) or uc_e(e_rol) or uc_goto(uc_label(105)),
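The "loop over all 11 digits, update only the masked ones" behavior can be sketched with a digit-serial BCD add as a stand-in operation (an assumption made for illustration; the routine shown above sets flags instead). In this Python model the names are hypothetical and index 0 is the LSD, so the loop order matches the LSD-to-MSD direction used for add and subtract.

```python
def masked_add(a, b, mask):
    """Digit-serial BCD add of b into a, restricted to masked positions.

    a, b, mask: lists of 11 entries, index 0 = LSD.
    Returns (result_digits, cf, cf_updates).
    """
    result = list(a)
    cf = 0
    cf_updates = 0
    for position in range(len(a)):  # the loop always visits every digit
        if mask[position]:
            total = a[position] + b[position] + cf
            result[position] = total % 10
            cf = total // 10        # CF changes only on masked digits
            cf_updates += 1
    return result, cf, cf_updates
```

With a mask covering only two positions, `cf_updates` comes out as 2 even though the loop ran 11 times, which is the short-circuit opportunity mentioned above.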
These processes set the "src" register that controls the ALU inputs, and the "dst" register which enables one of the 5 "rows" (A, B, C, AF, BF) in the "SAM". Driving these (e.g. A, K and C, K etc.) directly from microinstructions would mean that different instructions would need different microcode routines. By driving source and destination through these helper registers, that difference can be abstracted into the same routine. For example, when implementing 2 instructions such as AKA (A <= K) and AKB (B <= K), only the entry points are different, and then they both branch to the same implementation routine:
    240 => -- AKA
        uc_e(e_rol) or uc_src(src_ck) or uc_dst(dst_a) or uc_goto(uc_label(COPYS)),
    241 => -- AKB
        uc_e(e_rol) or uc_src(src_ck) or uc_dst(dst_b) or uc_goto(uc_label(COPYS)),
"enable" register is 11 bit ring counter. It supports the following microinstruction verbs:
e_init - highest bit is active ("0"), the most useful case as the next rotate left will select the LSD for calculations that go from LSD to MSD (e.g. add and subtract)
e_rol - rotate towards MSD, for most calculations
e_ror - rotate towards LSD for display and shift up (*10)
This register drives most multiplexers in the calculator CPU, effectively converting 11 digit, 44-bit parallel design into a simple 1 digit, 4 bit serial CPU.
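The three verbs of the ring counter can be modeled as plain bit operations. This is a Python sketch with assumptions: the function names reuse the verb names, bit 10 is taken to be the MSD position and bit 0 the LSD position, and the active position is the single 0 bit.

```python
WIDTH = 11
ALL_ONES = (1 << WIDTH) - 1


def e_init():
    """Only the highest bit is active (0): 0b01111111111."""
    return ALL_ONES & ~(1 << (WIDTH - 1))


def e_rol(enable):
    """Rotate left by one: the active position moves toward the MSD and wraps to the LSD."""
    return ((enable << 1) | (enable >> (WIDTH - 1))) & ALL_ONES


def e_ror(enable):
    """Rotate right by one: the active position moves toward the LSD and wraps to the MSD."""
    return ((enable >> 1) | ((enable & 1) << (WIDTH - 1))) & ALL_ONES
```

Starting from `e_init()`, the first `e_rol` wraps around and selects the LSD, and eleven rotations in either direction return to the starting value.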
Basic debugging in the form of detailed tracing of internal state is built right into the CPU. Each microinstruction is extended with an 8-bit word that can be interpreted either as:
'0' & 7 - bit ASCII code
-- or --
'1' & multiplexer selection
This simple trick allows a trace subroutine in the microcode to be written, and executed after each instruction if debug bit fed into the CPU is flipped. In case of the multiplexer selection, the 4-bit output is converted into ASCII representation of the hex character and is presented at the trace_ascii output.
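The two interpretations of the 8-bit debug word can be sketched like this. It is a Python model with hypothetical names: `mux_inputs` stands in for whatever internal 4-bit values the debug multiplexer can reach, keyed by the 7-bit selection, and the use of upper-case hex characters is an assumption.

```python
HEX_CHARS = "0123456789ABCDEF"


def trace_char(debug_byte, mux_inputs):
    """Translate one 8-bit debug word into the character put on trace_ascii."""
    if (debug_byte & 0x80) == 0:
        # '0' & 7-bit ASCII code: output the literal character.
        return chr(debug_byte & 0x7F)
    # '1' & multiplexer selection: pick a 4-bit value and show it as a hex character.
    nibble = mux_inputs[debug_byte & 0x7F] & 0xF
    return HEX_CHARS[nibble]
```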
The external debug output circuit "listens" to this trace_ascii code, and acts according to a simple protocol:
if trace_ascii is zero, sets the ready output and does nothing
if trace_ascii is not zero, clears ready and drives the output unit (for example UART or VGA) - once character is displayed, sets the ready output
The tracer routine just needs to wait for ready to be set before proceeding with the next character output. Here is the end of the tracer routine. The last two characters "printed" are CR and LF, and each one branches into a single-microinstruction subroutine that clears the output to 0 once the external unit is ready, before continuing:
    60 => uc_ss(ss_off) or uc_setchar(char_CR) or uc_if(cond_charsent, uc_label(CLEARTXD), upc_repeat),
    61 => uc_ss(ss_off) or uc_setchar(char_LF) or uc_if(cond_charsent, uc_label(CLEARTXD), upc_repeat),
    62 => uc_ss(ss_off) or uc_goto(uc_label(NEXTI)),
    CLEARTXD => -- subroutine to reset txd to make it ready for next character
        uc_ss(ss_off) or
        uc_setchar(char_NULL) or -- reset output character
        uc_goto(upc_return), -- return to caller
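The handshake between the tracer microcode and the external output unit can be simulated cycle by cycle. This Python sketch rests on several assumptions that are not in the source: the class and function names are hypothetical, the output unit takes a fixed number of clocks (`delay`) per character, and the `sent` attribute stands in for the condition the microcode tests with cond_charsent.

```python
class OutputUnit:
    """Toy model of the external unit listening to trace_ascii."""

    def __init__(self, delay=3):
        self.delay = delay    # clocks needed to display one character (assumed)
        self.ready = True
        self.sent = False     # stand-in for cond_charsent
        self.printed = []
        self._count = 0

    def clock(self, trace_ascii):
        if trace_ascii == 0:
            # Idle: ready is set and nothing else happens.
            self.ready, self.sent, self._count = True, False, 0
        elif not self.sent:
            self.ready = False
            self._count += 1
            if self._count >= self.delay:
                self.printed.append(trace_ascii)
                self.ready = self.sent = True


def trace(unit, text):
    """Emit text one character at a time; returns the number of clocks used."""
    cycles = 0
    for ch in text:
        while not unit.sent:      # like upc_repeat: hold the character until sent
            unit.clock(ord(ch))
            cycles += 1
        unit.clock(0)             # like CLEARTXD: reset the output to NULL
        cycles += 1
    return cycles
```

Tracing CR and LF through a unit with `delay=3` takes 8 clocks in this model: three to display each character plus one to clear the output.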