The SIMPL Machine

A project log for SIMPL

Serial Interpreted Minimal Programming Language for exploring microcontrollers & new cpu architectures

monsonite, 01/14/2021 at 12:28

In the previous log it was identified that SIMPL would be hosted as a virtual machine running from ROM on the chosen microcontroller.

In this log I explore the practicalities of creating a simulated stack machine running on a Teensy 4.0 and programmed using the Arduino IDE.

The aim is to run SIMPL on a virtual machine with a minimal instruction set of fewer than 32 primitive instructions, in order to keep the complexity down.

As SIMPL is based around a 16-bit word size and a 16-bit address space, it will be a better fit to a machine that has a native 16-bit architecture. For this reason, much of the early exploration of SIMPL has been done on the MSP430 range of 16-bit microcontrollers, rather than the 8-bit AVR devices.

I stated in the last log that the SIMPL machine would be based on a stack architecture.

Unfortunately there are very few stack machines available, as almost every modern microcontroller has a register-based design. A large set of registers is almost essential for the efficient implementation of a high-level language such as C.

With no stack machines readily available, we need to create our own stack structures: in software on an existing microcontroller, as a simulation, or as a soft-core design on an FPGA.

This provides 3 options which I wish to explore in turn.

  1. Use an MSP430 16-bit microcontroller to implement the SIMPL machine.
  2. Simulate the SIMPL machine on a high-performance 600 MHz ARM Cortex-M7 using a $20 Teensy 4.0.
  3. Implement the SIMPL machine as a soft core on an FPGA using Verilog.

Option 1 will be done using a low-cost Launchpad development board. Fortunately Dr. Chen-Hanson Ting has written extensively about implementing his eForth system on an MSP430, so the mechanics of a stack machine have already been defined.

Option 2 makes use of the low-cost Teensy 4.0 board as a target machine. The Teensy 4 with its 600 MHz clock can readily simulate many of the early microprocessors at many times their original operating speed.

Option 3: the Teensy 4 may also be used to simulate experimental CPUs with custom instruction sets before they are committed to an FPGA. One of these is James Bowman's J1 Forth CPU, which might make a suitable candidate for the SIMPL machine, as it has already been implemented and proven in Verilog on a Lattice iCE40 FPGA.

As the MSP430 implementation has been covered elsewhere, and the MSP430 performance is somewhat limited by modern ARM standards, I intend to explore Options 2 and 3, with a software simulation of the J1 Forth cpu providing experience that will be directly relevant to the FPGA implementation.

The J1 Forth CPU

In 2010 James Bowman created a simple 16-bit stack machine that was targeted at executing the Forth language. Since then it has been implemented in Verilog and also as custom silicon that has found its way into a number of graphics controller ICs by FTDI and their Singaporean silicon fabricators Bridgetek Pte.

The J1 is described in the J1 2010 EuroForth paper.

The J1 has a very compact instruction set and a minimal stack machine architecture, and can be described in fewer than 100 lines of C code or Verilog. This makes it easy to understand and easy to implement on an FPGA using open-source tools.

The J1 instruction word is 16 bits wide, and the individual bit fields operate directly on the hardware without an intermediate layer of instruction decoding.

However, memorising 16-bit instructions is not easy, and an assembler or high level language is essential for program development. This is where I believe that SIMPL can be employed to advantage, as a human readable pseudo-code that allows interactive programming of the J1 - as an alternative to assembly language or Forth.

The J1 Architecture.

Having read James Bowman’s J1 Paper several times over, I managed to build up a simplified model of his instruction set and architecture.

The J1 uses 16-bit instruction words, each divided into fields of different lengths.

The following 3 images – taken from James’s 2010 EuroForth paper – explain the architecture:

[Images: J1 ISA, J1 encoding, J1 ALU codes]

Looking at Table II above we see that it describes the J1 ALU operations. T is the Top element of the stack, and N is the second or Next element of the stack.

The ALU operations are applied to T and N.

Bits 15, 14 and 13 define the instruction class, and there are 5 classes of instruction:

  1xx   Literal
  000   Jump
  001   Conditional jump
  010   Call
  011   ALU instruction
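
As a minimal sketch of the class decoding described above, the function below sorts a 16-bit instruction word into one of the five classes using only bits 15 to 13. The enum and function names are my own for illustration and are not taken from the J1 sources.

```c
#include <stdint.h>

/* Instruction classes; the names are illustrative only. */
typedef enum { J1_LITERAL, J1_JUMP, J1_CJUMP, J1_CALL, J1_ALU } j1_class;

/* Classify a 16-bit J1 instruction word from bits 15..13. */
static j1_class j1_classify(uint16_t insn)
{
    if (insn & 0x8000)            /* 1xx: bit 15 set means literal */
        return J1_LITERAL;
    switch (insn >> 13) {         /* bit 15 is clear, so this is 0..3 */
    case 0:  return J1_JUMP;
    case 1:  return J1_CJUMP;
    case 2:  return J1_CALL;
    default: return J1_ALU;
    }
}

/* Payload: the 15-bit value of a literal, otherwise the low 13 bits
   (the target address for jumps and calls). */
static uint16_t j1_payload(uint16_t insn)
{
    return (insn & 0x8000) ? (insn & 0x7fff) : (insn & 0x1fff);
}
```

For example, 0x8005 classifies as a literal with payload 5, and 0x4010 as a call to cell 0x10.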

Bit 12, if set in an ALU instruction, provides the return-from-subroutine mechanism by loading the top of the return stack R into the PC.

Bits 11, 10, 9 and 8 define a 4-bit ALU opcode, allowing up to 16 arithmetical, logical and memory transfer instructions.

Bits 7, 6, 5 and 4 are used to control the data multiplexers, so that data can be routed around the CPU according to which of these bits are set. Here lies a little anomaly with the J1: Bit 4 is not used, and it would seem more logical to use it for the return function bit, currently in Bit 12.

Bits 3 and 2 define how the return stack pointer is manipulated in an instruction. It can be incremented or decremented by setting these bits.

Bits 1 and 0 define the manipulation of the data stack pointer, which can change by +1, 0, -1 or -2 depending on the setting of these bits. To pop off the stack you subtract 1 from the stack pointer; to push onto the stack you add 1.
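
The two stack pointer fields are simply 2-bit signed numbers. A small sketch of how they might be decoded and applied is shown here; the function names are hypothetical, and the 32-entry wrap-around is an assumption that matches the 0x20-entry stack arrays in the simulator listing in this log.

```c
#include <stdint.h>

/* Interpret a 2-bit field as a signed delta:
   00 -> 0, 01 -> +1, 10 -> -2, 11 -> -1. */
static int j1_delta(unsigned field)
{
    static const int sx[4] = { 0, 1, -2, -1 };
    return sx[field & 3];
}

/* Apply the delta to a stack pointer, wrapping within an assumed
   32-entry stack so the pointer always stays in the range 0..31. */
static uint8_t j1_adjust_sp(uint8_t sp, unsigned field)
{
    return (uint8_t)((sp + j1_delta(field)) & 0x1f);
}
```

Wrapping rather than range checking keeps the model close to hardware, where the stack pointer is just a 5-bit counter.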

As the various control fields of the J1 instruction exercise different parts of the hardware, they can operate in parallel, so a return or exit from a subroutine, for example, can be folded into the preceding ALU instruction for free.
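
To illustrate the "free" return, here is a rough sketch that assembles ALU instruction words from their fields, using the bit positions described above. The macro and function names are my own. ORing the two words together only works in this case because the exit instruction uses ALU opcode 0 and leaves the data stack alone, so none of its fields clash with those of the add.

```c
#include <stdint.h>

/* Field positions follow the J1 layout described in this log;
   the macro names are illustrative. */
#define J1_ALU_CLASS 0x6000u                          /* bits 15..13 = 011 */
#define J1_R_TO_PC   0x1000u                          /* bit 12: return    */
#define J1_OP(n)     ((uint16_t)(((n) & 0xf) << 8))   /* bits 11..8        */
#define J1_RSP(d)    ((uint16_t)(((d) & 3) << 2))     /* bits 3..2         */
#define J1_DSP(d)    ((uint16_t)((d) & 3))            /* bits 1..0         */

/* "+": T+N (ALU opcode 2), data stack pointer -1. */
static uint16_t j1_add(void)
{
    return (uint16_t)(J1_ALU_CLASS | J1_OP(2) | J1_DSP(-1));
}

/* exit: T unchanged (ALU opcode 0), R -> PC, return stack pointer -1. */
static uint16_t j1_exit(void)
{
    return (uint16_t)(J1_ALU_CLASS | J1_R_TO_PC | J1_OP(0) | J1_RSP(-1));
}

/* "+" with the return folded in: one instruction instead of two. */
static uint16_t j1_add_exit(void)
{
    return (uint16_t)(j1_add() | j1_exit());
}
```

With these definitions, j1_add() gives 0x6203, j1_exit() gives 0x700C, and the merged word is 0x720F.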

Modifying the J1 Instruction Set.

Whilst the J1 instruction set is neat and compact – the unused bit anomaly in Bit 4 is a bit of a sticking point with me. 

If we put the “return bit” from Bit 12 into the Bit 4 field, this would free up the Bit 12 field.  

The instruction word will then break down into four 4-bit fields, which makes it much easier to express as a hexadecimal number.

  Bits 15, 14, 13, 12   Instruction class
  Bits 11, 10, 9, 8     ALU operation
  Bits 7, 6, 5, 4       Register transfers: T->N, T->R, N->[T], R->PC
  Bits 3, 2, 1, 0       Stack pointer modifications

Bit 12 then becomes available for an augmented version of the Jump instruction. It provides a mechanism to create look-up tables based on the ASCII value of the SIMPL command character.
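
A sketch of how an existing J1 ALU instruction might be translated into the proposed layout follows. It rests on two assumptions of mine: Bit 12 is left clear in translated ALU instructions, and all other instruction classes pass through unchanged, since their final encoding in the modified scheme is not yet fixed.

```c
#include <stdint.h>

#define OLD_RETURN_BIT 0x1000u  /* bit 12 in the original J1 encoding   */
#define NEW_RETURN_BIT 0x0010u  /* bit 4 in the layout proposed here    */

/* Move the return bit of an ALU-class instruction from bit 12 to bit 4.
   Non-ALU instructions are returned untouched (an assumption). */
static uint16_t j1_to_proposed(uint16_t insn)
{
    if ((insn & 0xe000) != 0x6000)      /* not ALU class */
        return insn;
    if (insn & OLD_RETURN_BIT)
        insn = (uint16_t)((insn & ~OLD_RETURN_BIT) | NEW_RETURN_BIT);
    return insn;
}
```

An add with return, 0x720F in the original encoding, becomes 0x621F, which reads off nibble by nibble: class 6, ALU operation 2, return bit set, stack deltas F.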

The J1 Simulator

The J1 cpu may be simulated in about 90 lines of C code.

// J1 CPU Simulator by Samawati 27-3-2015
// getch(), putch(), eth_poll() and eth_transmit() are host I/O hooks defined elsewhere.

static unsigned short t;       /* top of data stack */
static unsigned short s;       /* second (next) item on data stack */
static unsigned short d[0x20]; /* data stack */
static unsigned short r[0x20]; /* return stack */
static unsigned short pc;    /* program counter, counts cells */
static unsigned char dsp, rsp; /* point to top entry */
static unsigned short* memory; /* ram */
static int sx[4] = { 0, 1, -2, -1 }; /* 2-bit sign extension */

static void push(int v) // push v on the data stack
{
  dsp = 0x1f & (dsp + 1);
  d[dsp] = t;
  t = v;
}

static int pop(void) // pop value from the data stack and return it
{
  int v = t;
  t = d[dsp];
  dsp = 0x1f & (dsp - 1);
  return v;
}

static void execute(int entrypoint)
{
  int _pc, _t;
  int insn = 0x4000 | entrypoint; // first insn: "call entrypoint"
  do {
    _pc = pc + 1;
    if (insn & 0x8000) { // literal
      push(insn & 0x7fff);
    } else {
      int target = insn & 0x1fff;
      switch (insn >> 13) {
      case 0: // jump
        _pc = target;
        break;
      case 1: // conditional jump
        if (pop() == 0)
          _pc = target;
        break;
      case 2: // call
        rsp = 31 & (rsp + 1);
        r[rsp] = _pc << 1;
        _pc = target;
        break;
      case 3: // alu
        if (insn & 0x1000) { /* r->pc */
          _pc = r[rsp] >> 1;
        }
        s = d[dsp];
        switch ((insn >> 8) & 0xf) {
        case 0:   _t = t; break; /* noop */
        case 1:   _t = s; break; /* copy */
        case 2:   _t = t+s; break; /* + */
        case 3:   _t = t&s; break; /* and */
        case 4:   _t = t|s; break; /* or */
        case 5:   _t = t^s; break; /* xor */
        case 6:   _t = ~t; break; /* invert */
        case 7:   _t = -(t==s); break; /* = */
        case 8:   _t = -((signed short)s < (signed short)t); break; /* < */
        case 9:   _t = s>>t; break; /* rshift */
        case 0xa:  _t = t-1; break; /* 1- */
        case 0xb:  _t = r[rsp];  break; /* r@ */
        case 0xc:  _t = (t==0xf008)?eth_poll():(t==0xf001)?1:(t==0xf000)?getch():memory[t>>1]; break; /* @ */
        case 0xd:  _t = s<<t; break; /* lshift */
        case 0xe:  _t = (rsp<<8) + dsp; break; /* depth: rsp in high byte, dsp in low byte */
        case 0xf:  _t = -(s<t); break; /* u< */
        }
        dsp = 31 & (dsp + sx[insn & 3]); /* dstack+- */
        rsp = 31 & (rsp + sx[(insn >> 2) & 3]); /* rstack+- */
        if (insn & 0x80) /* t->s */
          d[dsp] = t;
        if (insn & 0x40) /* t->r */
          r[rsp] = t;
        if (insn & 0x20) /* s->[t] */
          (t==0xf008)?eth_transmit(): (t==0xf002)?(rsp=0):(t==0xf000)?putch(s):(memory[t>>1]=s); /* ! */
        t = _t;
        break;
      }
    }
    pc = _pc;
    insn = memory[pc];
  } while (1);
}
/* end of cpu */

We can now combine this simulated J1 CPU with the SIMPL interpreter. Each of the 16-bit wide primitive J1 instructions will be allocated to one of the SIMPL ASCII character commands.
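
The allocation could be as simple as a look-up from command character to instruction word. The sketch below is only an illustration: the character assignments are hypothetical and are not the final SIMPL command set, and mapping unassigned characters to an ALU no-op is my own choice. The instruction words use the original J1 encoding, with ALU opcodes as in the simulator listing above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical character-to-instruction assignments, for illustration only. */
static uint16_t simpl_to_j1(char c)
{
    switch (c) {
    case '+': return 0x6203;  /* T+N, data stack -1 */
    case '&': return 0x6303;  /* T&N, data stack -1 */
    case '|': return 0x6403;  /* T|N, data stack -1 */
    case '^': return 0x6503;  /* T^N, data stack -1 */
    case '~': return 0x6600;  /* ~T, stack depth unchanged */
    default:  return 0x6000;  /* unassigned: ALU no-op (T -> T) */
    }
}

/* Translate a string of command characters into J1 instruction words.
   Writes at most max words to out and returns the number written. */
static size_t simpl_translate(const char *src, uint16_t *out, size_t max)
{
    size_t n = 0;
    while (*src && n < max)
        out[n++] = simpl_to_j1(*src++);
    return n;
}
```

A switch keeps the sketch readable; a 128-entry table indexed by the ASCII code would do the same job with a single memory access, which is closer to the look-up table mechanism mentioned earlier.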

A crude version of this was hacked together in April of 2017 for the Arduino DUE or any of the larger memory Arduinos.

The intention now is to get this running on a 600 MHz Teensy 4.0, offering more memory and much greater speed.

There will be the overheads of the J1 simulation and also the SIMPL virtual machine, but with the performance of the Teensy 4.0 it will be a good platform to explore the possibilities of a dedicated SIMPL machine.