Opcode Encoding for a Stack Machine

A project log for TTL Operation Module (TOM-1)

A 16-bit TTL CPU and stack machine built out of 74xx chips.

Tim Ryan, 06/23/2020 at 00:43

When coming up with requirements for TOM-1, I knew that the opcode space would be limited enough that the system would probably not require microcode, since I had seen Forth implementations which did not require many instructions. This meant I didn't tackle the question of how opcodes would actually be decoded until later. Instead, the control lines organically built up into a collection of signals I could embed ad hoc in a binary generated by Python:

def push_literal(arg):

This function would write out the TOM-1 "push literal" opcode by writing the CPU flags for two clock cycles (a full opcode) and a 16-bit operand. Take a look at the TOM-1 system diagram, or read this explanation: on the first clock cycle, we ask for the ROM to be on the TOS (top of stack) bus, for the D register to increment, and for the value of TOS to be written to the data stack. On the second clock cycle, we just leave ROM on the TOS bus. At the end of the second cycle, TOS latches the new 16-bit value from ROM.
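As a rough illustration of the two-cycle layout described above, here is a minimal sketch. The flag names (`ROM_TO_TOS`, `D_INC`, `TOS_TO_STACK`), their bit positions, and the little-endian operand order are all assumptions made for this example; they are not the actual TOM-1 signal names or the original function body.

```python
# Hypothetical flag bits; the real TOM-1 signals and positions may differ.
ROM_TO_TOS = 0x01    # ROM drives the TOS bus
D_INC = 0x02         # increment the D (data stack) register
TOS_TO_STACK = 0x04  # write the current TOS value to the data stack


def push_literal_sketch(arg):
    """Return bytes for a "push literal": two cycles of flags, then a 16-bit operand."""
    if not 0 <= arg <= 0xFFFF:
        raise ValueError("operand must fit in 16 bits")
    # Cycle 1: save the old TOS to the data stack, bump D, put ROM on the TOS bus.
    cycle1 = ROM_TO_TOS | D_INC | TOS_TO_STACK
    # Cycle 2: keep ROM on the TOS bus so TOS can latch it at the end of the cycle.
    cycle2 = ROM_TO_TOS
    operand = arg.to_bytes(2, "little")  # byte order is an assumption
    return bytes([cycle1, cycle2]) + operand
```

Calling `push_literal_sketch(0x1234)` yields four bytes: one flag byte per clock cycle followed by the two operand bytes.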

Having written this out, it's clear that describing these flags isn't helpful in forming an "intuition" of what is happening in the CPU with each opcode. After struggling with different 2:4 and 3:8 decoder strategies, I came across another microcode-less design which seemed to share some design goals and also had an 8-bit opcode width. Here is the opcode format used by the #Microcode-less TTL CPU:

Opcode format:
    7        6        5        4        3        2        1        0
|Carry fb| On Zero|  Src_0 |  Src_1 |  Src_2 |  Dst_0 |  Dst_1 |  Dst_2 |
The instruction decoder is composed of two demultiplexer ICs (2 x 74HC138) and is driven by the instruction register. A CPU instruction word is 8 bits wide, 3 bits select the data source and 3 bits select the destination. Each demultiplexer ICs apply the control signals to the selected destination/source at the execute phase (3. phase). The instruction decoders (2x74HC138) use up 6 bits of an instruction word, I used the remaining two bits for instruction modifications [carry and zero].

Just from looking at this encoding, we can infer a few things about the CPU:

This is confirmed in the project's description, which reads "Eventually each CPU instruction is a hardwired MOVE instruction, the instruction code itself determines the component that will be the data source (e.g. accumulator, input port, RAM, program memory, etc...) and the data destination (accumulator, adder, inverter, output port, program counter, etc...)." The circuit is an example of a transport-triggered architecture (TTA). From taking a look at the schematics, "src" registers appear to include the accumulator and program memory, and destinations include the program counter, accumulator, adder, and inverter. Other chip enable lines appear to be cycle-dependent.
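To make the MOVE-style format above concrete, here is a small sketch that splits an 8-bit instruction word into its fields. The names `MoveOpcode` and `decode_move` are made up for this example, and treating the higher bit of each 3-bit field as the most significant is an assumption; the labels in the format diagram could imply the opposite order, which would only renumber the decoder outputs.

```python
from typing import NamedTuple


class MoveOpcode(NamedTuple):
    carry: bool    # bit 7, "Carry fb" modifier
    on_zero: bool  # bit 6, "On Zero" modifier
    src: int       # bits 5..3, selects one of 8 source decoder outputs
    dst: int       # bits 2..0, selects one of 8 destination decoder outputs


def decode_move(opcode: int) -> MoveOpcode:
    """Split an 8-bit instruction word into modifier bits and src/dst selects."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    return MoveOpcode(
        carry=bool(opcode & 0x80),
        on_zero=bool(opcode & 0x40),
        src=(opcode >> 3) & 0b111,  # assumed bit order within the field
        dst=opcode & 0b111,
    )
```

For instance, `decode_move(0b10101011)` gives a carry modifier with source select 5 and destination select 3. Each select value would feed one 3:8 decoder.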

TOM-1 Opcode Design

We can look back at the TOM-1 system diagram and draw some parallels. While I wouldn't describe the architecture as TTA, it does have three "concerns" each clock cycle:

  1. Selecting a stack register (D or R) and incrementing or decrementing it
  2. Computing a new value to be loaded into TOS (the accumulator)
  3. Performing a stack or RAM read or write

With this in mind, here are the first 4 bits of the TOM-1 opcode which will probably not change:

  1. DR_SEL — LO selects the D register, HI selects the R register.
  2. DR_UP — LO decrements, HI increments. This flag does nothing if DR_CCK is LO.
  3. DR_CCK - HI enables clocking (incrementing/decrementing the register) and LO disables it.
  4. TOS_DISABLE — When LO, TOS output is enabled. This means that TOS will be available on the BUS as an ADD and NAND operand, as the RAM address input, and an input to a buffer connected to the stack bus. When HI, TOS is disabled.

These signals cover concern #1 and part of concern #2.

After reworking the rest of the signals, I came up with the following design, built around dual 2:4 decoders. The next two bits select one of four behaviors using the stack bus:

  1. Stack outputs value to stack bus (as an operand)
  2. TOS writes to stack
  3. Stack writes value to RAM
  4. RAM writes to stack

The last two bits select one of four behaviors on the TOS bus:

  1. Enable buffer from ROM to TOS bus
  2. NAND result is on bus
  3. ADD result is on bus
  4. JumpIf0 is performed

This covers the rest of the CPU's concerns. Though it doesn't impact the TOS bus, including our conditional jump flag here makes sense: we perform a jump if the value in TOS is 0!
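The two selections above can be modeled as a pair of 2:4 decoders. This is only a sketch: the mapping of list items 1 to 4 onto select values 0 to 3 is an assumption, as are the function names. Active-low outputs are the default because many 74xx decoders (the 74x139, for example) work that way, not because TOM-1 is confirmed to use that part.

```python
# Behaviors in the order listed above; select value = list position - 1 (assumed).
STACK_BUS = (
    "Stack outputs value to stack bus",
    "TOS writes to stack",
    "Stack writes value to RAM",
    "RAM writes to stack",
)
TOS_BUS = (
    "Enable buffer from ROM to TOS bus",
    "NAND result is on bus",
    "ADD result is on bus",
    "JumpIf0 is performed",
)


def decode_2to4(select, active_low=True):
    """Model a 2:4 decoder: exactly one of the four outputs is asserted."""
    if select not in range(4):
        raise ValueError("select must be 0..3")
    asserted, idle = (0, 1) if active_low else (1, 0)
    return tuple(asserted if i == select else idle for i in range(4))


def selected_behaviors(stack_sel, tos_sel):
    """Return the (stack bus, TOS bus) behaviors chosen by the two 2-bit fields."""
    if stack_sel not in range(4) or tos_sel not in range(4):
        raise ValueError("selects must be 0..3")
    return STACK_BUS[stack_sel], TOS_BUS[tos_sel]
```

Because each decoder asserts exactly one line, two behaviors on the same bus can never be requested in the same cycle, which is the property the ad-hoc flags did not guarantee.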

Putting this all together as a graph:

Opcode format for the TOM-1:
    7        6        5        4        3        2        1        0
| DR_SEL | DR_UP  | DR_CCK | TOS_EN |STACKb_1|STACKb_2| TOSb_1 | TOSb_2 |


A new opcode is read each clock cycle, and it cleanly describes the behavior for that cycle of a) both of the stack address counters, b) whether TOS is an operand in ALU calculations, c) the behavior of the stack bus, and d) the behavior of the TOS bus, which is written to TOS on even cycles. I have to go back and clean up the old signals in my Python code, but this should make it easier to know for each opcode a) which part of the circuit is affected by each flag and b) for what reason.
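A minimal sketch of packing and unpacking this 8-bit format in Python follows. The function names and the dictionary keys are invented for the example. Bit 4 is passed through as a raw bit named `tos_bit` without interpreting its polarity, since the flag list calls it TOS_DISABLE while the format diagram labels it TOS_EN. The bit order inside each 2-bit field is also an assumption.

```python
def pack_opcode(dr_sel, dr_up, dr_cck, tos_bit, stack_bus, tos_bus):
    """Pack the six fields into one byte: bits 7..4 are flags, 3..2 and 1..0 are selects."""
    for name, value, limit in (
        ("dr_sel", dr_sel, 1), ("dr_up", dr_up, 1), ("dr_cck", dr_cck, 1),
        ("tos_bit", tos_bit, 1), ("stack_bus", stack_bus, 3), ("tos_bus", tos_bus, 3),
    ):
        if not 0 <= value <= limit:
            raise ValueError(f"{name} out of range")
    return (
        (dr_sel << 7) | (dr_up << 6) | (dr_cck << 5) | (tos_bit << 4)
        | (stack_bus << 2) | tos_bus
    )


def unpack_opcode(opcode):
    """Inverse of pack_opcode: return the fields of an 8-bit opcode as a dict."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    return {
        "dr_sel": (opcode >> 7) & 1,     # LO = D register, HI = R register
        "dr_up": (opcode >> 6) & 1,      # LO = decrement, HI = increment
        "dr_cck": (opcode >> 5) & 1,     # HI = clock the selected register
        "tos_bit": (opcode >> 4) & 1,    # polarity left uninterpreted
        "stack_bus": (opcode >> 2) & 3,  # feeds the stack bus 2:4 decoder
        "tos_bus": opcode & 3,           # feeds the TOS bus 2:4 decoder
    }
```

As a usage example, `pack_opcode(0, 1, 1, 0, 1, 0)` selects the D register, increments it, and picks stack bus behavior 1 and TOS bus behavior 0, giving `0x64`.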

Comparing this to the more compact #Microcode-less TTL CPU, here are my takeaways:

A note about expandability

On this last point: once I felt satisfied with the basic TOM-1 design, I went back to investigate how to add I/O to the system. It's fairly trivial to hook up a latch or a bidirectional latch to the TOS bus, and at first I thought about implementing a single 3:8 decoder to select all these different operations: ADD, NAND, IN, OUT, etc. But! Because this is a stack machine, there are some harsh tradeoffs to this design. If you are trying to stream data fast, it must come through TOS by popping off the stack. But the stack is limited to 256 bytes, which is partly consumed by program code already. RAM, by contrast, is plentiful if the CY62256 is used (32K values).

After rethinking the decoder design, I considered how a 2:4 decoder would work, and realized that the arbitrary TTL gates and ad-hoc wires I'd been using (e.g. "STACK_W_nR") might be more intuitive if they were simply decoder outputs. However, 2:4 is a small number of options to decode to, and it still limits what can be done with the RAM device connected to the stack bus. Taking the above encoding and adding an "input" or "output" operation using the available bits seems unlikely. But! Every RAM write and read shares a common input value for its operation: its address. And an address on the CY62256 is only 15 bits wide. So A16 might become the ninth decoder bit I need. 😂 Similarly, we might be able to implement HALT by jumping to an odd address.
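Here is a speculative sketch of that idea. It assumes the "A16" above means the sixteenth bit of a 16-bit address value (bit 15 when counting from zero), which the 32K CY62256 leaves unused, and it assumes valid jump targets are always even so that an odd target can signal HALT. Both function names are hypothetical.

```python
RAM_ADDR_MASK = 0x7FFF  # 32K locations need 15 address bits
EXTRA_BIT = 15          # zero-based index of the spare bit (assumed to be "A16")


def split_address(value):
    """Split a 16-bit value into (15-bit RAM address, spare decoder bit)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    return value & RAM_ADDR_MASK, (value >> EXTRA_BIT) & 1


def is_halt_target(addr):
    """Treat a jump to an odd address as HALT (assumes real targets are even)."""
    return bool(addr & 1)
```

With this split, an address such as `0x8001` would leave RAM location `0x0001` on the address lines while the spare bit is HI, which could steer the access to an I/O latch instead of RAM.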