Simple Compiler

A very simple compiler for minimalist home brew CPUs

Similar projects worth following
Simple Compiler for Home Brew CPUs
The idea behind this project is to compile a minimal sub-set of a high level language such as C or Pascal for home brew CPUs. Specifically for my Subleq CPU project. I have already built an assembler and an interpreter (for the assembler) for this project.--- There are serious C compilers that document their "backend" (vbcc is one, see the uploaded pdf) but even those are not that simple.---After much trolling of the Internet I selected "SC" as a base for my experiments."SC" (Simplest Compiler) by Stas Sidorenko ( is written in about 700 lines of Pascal. It is a very very minimal language but the code is quite readable and has good structure.

This project is rather large and the work is not that linear. To make the logs readable I will from time to time rewrite them.

Project logs:

  1. Convert SC.PAS to SimpleC.c(done).
  2. Adjust the code for the CPU word width and memory model (done).
  3. Build tool train(done):
    • Subleq code generator (using the Simple Compiler OpCodes as input),
    • Subleq Assembler, and
    • Subleq Interpreter.
  4. Modify the Simple Compiler to export OpCodes to suit OpCode2Subleq.
  5. Link up and debug the tool chain (done):
    • SimpleC -> OpCode2Subleq -> Subleq_Asm -> Subleq_Int

Mission Completed

I now have a working Subleq Compiler. Yes it is very primitive and very very slow but that can be made better.

Here is the test code ("1.SC"):


  if A<B then
    write A*10<b
  else begin
    write 1<B*10
    write B*10, (a+b)*(-5)
  write C
Here is the tool chain in action:

C:\AlanX\SimpleC\OpCode2Subleq>SimpleC  -o 1.opc 0<
Tokenised Code:
Line      1: BEGIN
Line      2: A = 4
Line      3: B = 1
Line      4: IF A < B THEN
Line      5: WRITE A * 10 < B
Line      6: ELSE BEGIN
Line      7: WRITE 1 < B * 10
Line      8: WRITE B * 10 , ( A + B ) * ( - 5 )
Line      9: END
Line     10: C = - 1
Line     11: WRITE C
Line     12: END

C:\AlanX\SimpleC\OpCode2Subleq>OpCode2Subleq -i 1.opc -o 1.sasm

C:\AlanX\SimpleC\OpCode2Subleq>Subleq_Asm -i 1.sasm -o 1.code -l 1.list

C:\AlanX\SimpleC\OpCode2Subleq>Subleq_Int -i 1.code -l 1.list -t 1.trace

SUBLEQ Interpreter

 10 -25

Interpreter finished

Press any key to continue . . .


The interpreter takes a full 4 seconds to execute! Hard to believe anything could be this slow.

Code Size

  • - 12 lines
  • 1.opc - 74 lines including the header (95 words)
  • 1.sasm - 3126 lines including comments (5530 words)


Rather than fix/improve Simple Compiler it may be better to port a more complete compiler to work with OpCode2Subleq.

The tool train would suit a Transport Triggered Architecture (TTA) like my "Weird CPU" if the word size is increased from 8 bits to 12 bits or 16 bits.


Subleq is horribly inefficient.


It is pretty clear that the root source of most of these simple compilers is Wirth's PL/0 as the language syntax (i.e. the expression part) is identical with the same names, omissions and unary minus fault. Unfortunately they did not carry through the rest of PL/0.

I have sourced Wirth's original PLZero Pascal source code which I will follow for upgrades to Simple Compiler.


I have split the tokeniser out of Simple Compiler as it can be done and it reduces the code size of Simple Compiler.

String Tokens

I prefer working with string tokens (e.g. "IF" than working with integer tokens (e.g. _if) as I don't need a list of token definitions. So I will rework the code to suit.

PL/0 in C

Ported Wirth's PL/0 Pascal source code into C. Now I am a native Pascal programmer and Wirth was the Inventor of Pascal. So it is a bit sad that I ported his code to C.

PL/0 Extensions

The basic PL/0 is a bit primitive (no strings!). So I have added arrays and (Pascal) strings. It also recognise the majority of dialects I have seen. But no built in functions, or user functions or procedure parameter lists. I am not trying to clone a full version of Pascal, just a very basic compiler for home brew CPUs.

The compiler syntax error reporting needs a bit of work to be more helpful, and finally there are no run-time checks!

But overall I am very happy with this little compiler.

P-Code to Subleq

The PL/0 compiler exports "P-Code". This needs to be translated into Subleq. There are 26 Instructions to recognise. Few of the instructions are simple.


A C port of Niklaus Wirth's PLZero.pas code. It is a complete rewrite and restructure. Support for arrays and strings. Support for most dialects. Includes pCode2OpCode.

application/x-zip-compressed - 717.85 kB - 07/23/2017 at 02:17



Niklaus Wirth PL/0 source code

pas - 15.50 kB - 07/10/2017 at 02:25


The full tool train and test data

x-zip-compressed - 76.57 kB - 07/06/2017 at 09:35


The Simple Compiler and test data

x-zip-compressed - 14.33 kB - 07/06/2017 at 09:34



The SC language discription

plain - 1.24 kB - 06/24/2017 at 15:31


View all 11 files

  • Code Generation

    agp.cooper07/20/2017 at 11:06 0 comments

    Code Generation

    I have been testing some fairly big programs. Another grammar fault I found is that you cannot assign negative constants. Fixed.

    Here is the grammar at the moment:

      PL/0 Grammar:
      program = block "." .
      block = [ "const" ident "=" ["-"] number {"," ident "=" number} ";"]        // Add negative constants
              [ "var" ident {"," ident} ";"]
              [ "var" ident "[" number "]" {"," ident "[" number "]"} ";"]        // Add arrays
              { "procedure" ident ";" block ";" } statement.
      statement = [ ident ":=" expression
                    ident "[" expression "]" ":=" expression                      // Add arrays
                    ident ":=" string                                             // Add string assignmnet to ident
                    ident "[" expression "]" ":=" string                          // Add string assignmnet to arrays
                  | "call" ident
                  | "?" ident | "write" ident | "putc" ident                      // Add "write" (int) and "putc" (char)
                  | "!" expression | "read" expression | "getc" expression        // Add "read" (int) and "getc" (char)
                  | "begin" statement {";" statement } [";"] "end"                // Add optional ";" before "end"
                  | "if" condition "then" statement [ "else" statment ]           // Add "else"
                  | "while" condition "do" statement ].
      condition = "odd" expression |
                  expression ("="|"#"||"!="||"<>"|"<"|"<="|">"|">=") expression . // Add "!+" and "<>"
      expression = [ "+"|"-"] term { ("+"|"-") term}.
      term = factor {("*"|"/"|"%"|"mod") factor}.                                 // Add "%" and "mod"
      factor = ident | ident "[" expression "]" | number | "(" expression ")".    // Add arrays
    Add comments (to tokeniser):
      // ... EOL
      { ... }
      /* ... */
      (* ... *)

    Still have not fixed the unary minus fault but its not really an issue, just put parentheses around it.

    The above grammar may make you eyes roll but you need it to code a compiler.

    The Plan

    The plan for building the code generator is to rework the interpreter:

    1. Change the stack to downward growth from upward growth (done).
    2. Swap out the P-Code instructions for pseudo CPU OpCode (done).
    3. Convert the interpreter into a OpCode lister (done).
    4. Covert the opcodes to Subleq.

    The reason for doing this to the interpreter is that changes can be incrementally checked.

    The CPU I am modelling is as minimal as practical. The current pseudo OpCodes are stack focused. Here are the registers:

    Registers Comment
    Ax Primary Register
    Bc Secondary Register
    Tx Temporary Register (used for swap)
    PC Program Counter (?)
    SP Stack Pointer (downward growing)
    BP Base Pointer

    The program counter (PC) is not actually a register but a subroutine return address. It is set by the compiler at compile time, that is the real PC register is not actually accessed during execution.

    The stack and register pseudo opcodes:

    Stack operations: Register moves:
    FromStack (Ax=stack[Ax]mov Ax,Bx
    ToStack (stack[Ax]=Bx)mov Ax,SP
    push Ax mov SP,Ax
    pop Ax mov Ax,BP
    pop Bx mov BP,Ax
      mov PC,Ax
    Move Immediate (literal): mov Ax,PC
    mov Bx,Imm mov Tx,Ax
      mov Bx,Tx

    I have favoured register moves over stack operations because stage operations are expensive. Here are the remaining pseudo opcodes:

    Increment/Decrement: Mathematical Operations: Boolean Operations:
    dec Ax sub Ax,Bx odd Ax
    inc Ax add Ax,Bx eq Ax,Bx
      mul Ax,Bx lt Ax,Bx
    Jump: div Ax,Bx le Ax,Bx
    jmp mod Ax,Bx ne Ax,Bx
    jz   gt Ax,Bx
        ge Ax,Bx

    Add four input/outputs instructions:

    read (integer)read
    get (char)getc
    write (integer)write
    put (char)putc

    Recoded the pseudo opcode interpreter into a code lister (done). Called it PCode2OpCode.

    I have been reviewing the OpCode output and I have used indirect addressing with the Ax register which of course is not a valid instruction for the x86 instruction set. It is pseudo opcodes after all. I also changed the format to make it easier to import, for example I use "mov_Ax,Bx" rather than mov Ax, Bx" and I dropped the "L" prefix for address labels.


  • PL/0 Completed?

    agp.cooper07/19/2017 at 01:39 0 comments

    PL/0 Completed?

    Brave words! Can it really be completed? There are still things I would like to add but I need to move on. The current state of play is:

    1. Recognises most PL/0 dialects.
    2. Tolerant to redundant ";"
    3. One dimensional arrays
    4. Pascal strings

    To do:

    Would like to add an ASM instruction (pass through to the target CPU).

    Rework the error messages to provide better hints.

    Here is an example program:

    { Program Test String }
    var Str[16],Prompt[16];
      procedure ReadStr;
      var i,test;
      { Read a Pascal string, uses Str }
        while test=1 do begin
          getc Str[i];
          if Str[i]<32 then begin
            Str:=i-1; { Str == Str[0] }
          end else begin
            if i=16 then begin
            end else begin
      procedure WriteStr;
      var i;
      { Write a string, uses Str }
        while i<=Str do begin
          putc Str[i];
      procedure WritePrompt;
      var i;
      { Write Prompt }
        while i<=Prompt do begin
          putc Prompt[i];
      Prompt:="String Test";
      call WritePrompt;
      putc 10; { new line }
      Prompt:="Enter a string: ";
      call WritePrompt;
      call ReadStr;
      Prompt:="You wrote: ";
      call WritePrompt;
      call WriteStr;
      putc 10; { new line }
    You may notice I have used redundant ":" in this example.

    And here is the run:

    C:\AlanX\pl0>.\PLZero_Tokeniser\PLZero_Tokeniser     -i test18.pl0   -o test18.token -nl
    C:\AlanX\pl0>.\PLZero_Compiler\PLZero_Compiler       -i test18.token -o test18.code  -t  test18.table
    C:\AlanX\pl0>.\PLZero_Interpreter\PLZero_Interpreter -i test18.code  -o   -s  test18.stack
    String Test
    Enter a string: Hello World!
    You wrote: Hello World!
    Press any key to continue . . .

    I think that is pretty cool!


    The next task is to write a P-Code to Subleq translator.


  • PLZero Compiler

    agp.cooper07/13/2017 at 16:12 0 comments

    PLZero (PL/0) Compiler

    The compiler is almost done. The Nikluas Wirth Pascal has been recoded in C. The structure completely rebuilt. I don't even think Mr Wirth would recognise his code! I removed the tokeniser and interpreter from the compiler and are now stand alone programs (part of the tool train).

    Here are a couple of working programs.


    const m=7, n = 85; 
    var print, x, y, z; { Assignment to print variable is printed }
    procedure multiply; 
    var a, b; 
    	a := x; b := y; z := 0; 
    	while b > 0  do 
    		if odd b then z := z + a; 
    		a := 2 * a; b := b / 2; 
    	x := m;
    	y := n;
    	call multiply;

    Prime number search (

    { calculate prime numbers }
      pcmax = 101;  { max on 8 is 2045 }
      print, { memory mapped print routine memory location }
      pc,    { prime candidate }
      d,     { divisor }
      t1,    { temp #1 }
      t2,    { temp #2 }
      isnot; { isnot a prime flag }
      print := 1;
      print := 2;
      pc := 3;
      t1 := pc/2;
      while pc <= pcmax do
          { trial division to test for prime }
          d := 3;
          isnot := 0;
          { This keeps t1 >= to the sqrt(pc) }
          if (t1*t1) < pc then t1 := t1 + 1;
          while d <= t1 do
              t2 := pc / d;
              if (t2*d) = pc then { not prime }
                  isnot := 1; { remember not prime }
                  d := pc;    { force loop exit because we don't have break }
              d := d + 2;
          if isnot = 0 then print := pc;  { store at print will print it }
          pc := pc + 2;
    Wirth's test program:
      m =  7,
      n = 85;
      print, putc, x, y, z, q, r;
    PROCEDURE multiply;
    VAR a, b;
      a := x;
      b := y;
      z := 0;
      WHILE b > 0 DO BEGIN
        IF ODD b THEN z := z + a;
        a := 2 * a;
        b := b / 2
    PROCEDURE divide;
    VAR w;
      r := x;
      q := 0;
      w := y;
      WHILE w <= r DO w := 2 * w;
      WHILE w > y DO BEGIN
        q := 2 * q;
        w := w / 2;
        IF w <= r THEN BEGIN
          r := r - w;
          q := q + 1
    PROCEDURE gcd;
    VAR f, g;
      f := x;
      g := y;
      WHILE f # g DO BEGIN
        IF f < g THEN g := g - f;
        IF g < f THEN f := f - g
      z := f
      x := m;
      y := n;
      CALL multiply;
      x := 25;
      y :=  3;
      CALL divide;
      x := 84;
      y := 36;
      CALL gcd;

    I have edited the tokeniser to recognise:

    • any case for token words (i.e. BEGIN, begin or Begin etc.) but variables are still case sensitive
    • "var" and "int" as an alternative
    • "<>", "!=", "#" as alternatives
    • "print" (lower case) is recognised at any level as the integer output.
    • "putc" (lower case) is recognised at any level as the char output.

    I rather like the upper case for keywords.

    Here is the result from Wirth's program:

    Begin PL/0:
    End PL/0.

    I need a final check of the code, add a execution list option before moving on to the Subleq translator.


  • Fixing Unary Minus

    agp.cooper07/08/2017 at 00:48 1 comment

    Language Structure

    Programming languages can the presented/described in format called Backus-Naur Form (BNF). Here is a proper example from "Let's Build a Compiler!" Jack W. Crenshaw:

    <b-expression> ::= <b-term> [<orop> <b-term>]*
    <b-term> ::= <not-factor> [AND <not-factor>]*
    <not-factor> ::= [NOT] <b-factor>
    <b-factor> ::= <b-literal> | <b-variable> | <relation>
    <relation> ::= | <expression> [<relop> <expression]
    <expression> ::= <term> [<addop> <term>]*
    <term> ::= <signed factor> [<mulop> factor]*
    <signed factor>::= [<addop>] <factor>
    <factor> ::= <integer> | <variable> | (<b-expression>)
    Here is what I decoded from Simple Compiler in a "bnf" like form:
    Factor         -> Identifier
                   -> Integer
                   -> ( BoolExpression )
    Term           -> Factor
                   -> Factor * Factor+
                   -> Factor / Factor+
    Expression     -> Term
                   -> + Term
                   -> - Term
                   -> Term + Term+
                   -> Term - Term+
    BoolExpression -> Expressiom
                   -> Expression = Expression+
                   -> Expression <> Expression+
                   -> Expression < Expression+
                   -> Expression <= Expression+
                   -> Expression > Expression+
                   -> Expression >= Expression+

    (Note, my decoded format is as "coded" and is upside down compared to the bnf format.)

    Okay what does all this mean? Basically the minimum number of precedence levels (from "Let's Build a Compiler!" Jack W. Crenshaw) is 8:

    LevelSyntax ElementOperator
    0factorliteral, variable
    1signed factorunary minus
    2term*, /
    3expression+, -
    4b-factorliteral, variable, relop
    7b-expressionOR, XOR

    The Simple compiler only has 4, so it was doomed from the start. Even without the bit operations it still needed 5 levels. What failed was the unary minus. This is why I had to put brackets around the "-5" the the code:

      if A<B then
        write A*10<b;
      else begin
        write 1<B*10;
        write B*10, (a+b)*(-5);
      write C;

    The original coder did try with "Expression -> - Term" as the C=-1 does work, but a true unary minus needs a higher precedence than "*" for "(a+b)*-5" to work.

    So I need at added "SignedFactor" between "Factor" and "Term" based on Crenshaw's bnf, and remove the "Expression -> + Term" and "Expression -> - Term":

    Factor         -> Identifier
                   -> Integer
                   -> ( BoolExpression )
    SignedFactor   -> Factor
                   -> + Factor
                   -> - Factor           
    Term           -> SignedFactor
                   -> SignedFactor * SignedFactor+
                   -> SignedFactor / SignedFactor+
    Expression     -> Term
                   -> Term + Term+
                   -> Term - Term+
    BoolExpression -> Expressiom
                   -> Expression = Expression+
                   -> Expression <> Expression+
                   -> Expression < Expression+
                   -> Expression <= Expression+
                   -> Expression > Expression+
                   -> Expression >= Expression+

    It took a little while to get my head around the internals of the recursive procedure calls but eventually I got it to work:

    void SignedFactor(void)
      char op;
      if ((op==_plus)||(op==_minus)) {
      if (op==_plus) {
        // Nothing to do
      } else if (op==_minus) {
    In the above code:
    • Th global "tTok" is saved to local "op" in case it is required for later use (other procedures may "eat it"). Basically parser procedures "eat" the token if it belongs to them.
    • Next, if the token is for this procedure (i.e. it is a "+" or a "-") then "eat" the token by getting the next token (GetNextToken()).
    • Okay, the procedure has to wait for the "Factor" that will be operated on to come back (i.e. pass through to Factor()).
    • Factor() has come back and the result will be in "Ax". If "op" is a "+" or a "-" then do the operation (emit the code).
    • Okay all done, return to the calling procedure (Term()).

    Here is the current language structure for Simple Compiler:

    Factor         -> Identifier
                   -> Integer
                   -> ( BoolExpression )
    SignedFactor   -> Factor
                   -> + Factor
                   -> - Factor
    Term           -> SignedFactor
                   -> SignedFactor * SignedFactor+
                   -> SignedFactor / SignedFactor+
                   -> SignedFactor % SignedFactor+
    Expression     -> Term
                   -> Term + Term+
                   -> Term - Term+
    BoolExpression -> Expression
                   -> Expression = Expression+
                   -> Expression <> Expression+
                   -> Expression < Expression+
     -> Expression <=...
    Read more »

  • Subleq Efficiency

    agp.cooper07/06/2017 at 04:29 0 comments

    A Not Very Happy Interpreter

    The interpreter is not very happy. I have used the OpCode jump address without considering the Subleq actual address (i.e. the OpCode instruction size does not match the Subleq macro size). This is a bit tricky as I do not calculate Subleq instructions lengths and therefore cannot calculate the Subleq address directly (at least in OpCode2Subleq). A good fix is to precede each Subleq instruction with a label identifying the OpCode Address, eg. "_OPCxxxxx". This has no cost to the final integer code size.

    The use of the OpCode instruction pointer label "_OPCxxxxx" works very well. Very easy to find the OPC code in Subleq now.

    Fixed a coding error, where I coded "jz" as if it was "jeqz"? Fixed and now the interpreter executes and presents the correct answer.

    Subleq Efficiency

    Here is the final test of my tool train:

    C:\AlanX\SimpleC\OpCode2Subleq>SimpleC  -o 1.opc 0<
    Tokenised Code:
    Line      1: BEGIN
    Line      2: A = 4
    Line      3: B = 1
    Line      4: IF A < B THEN
    Line      5: WRITE A * 10 < B
    Line      6: ELSE BEGIN
    Line      7: WRITE 1 < B * 10
    Line      8: WRITE B * 10 , ( A + B ) * ( - 5 )
    Line      9: END
    Line     10: C = - 1
    Line     11: WRITE C
    Line     12: END
    C:\AlanX\SimpleC\OpCode2Subleq>OpCode2Subleq -i 1.opc -o 1.sasm
    C:\AlanX\SimpleC\OpCode2Subleq>Subleq_Asm -i 1.sasm -o 1.code -l 1.list
    C:\AlanX\SimpleC\OpCode2Subleq>Subleq_Int -i 1.code -l 1.list -t 1.trace
    SUBLEQ Interpreter
     10 -25
    Interpreter finished
    Press any key to continue . . .

    Code Efficiency

    12 lines of high level code became 95 words (73 OpCodes) which turned into

    5530 words of Subleq code (1903 lines of uncommented code). That is about 26 lines of Subleq per OpCode. And the code takes a full 4 seconds to execute!


  • Converting the Simple Compiler to a Subleq Compiler

    agp.cooper06/24/2017 at 17:03 0 comments

    Compiler Subleq Backend

    In order to convert Simple Compiler to a Subleq compiler, I have to replace the Simple Compiler Assembler (OpCode) Lister and Assemble (OpCode) Interpreter with a Subleq OpCode exporter. I will use an external Subleq assembler and interpreter. The Subleq OpCode exporter is called the Subleq backend. The Subleq backend contains all the Subleq macros, but only composite macros (which takes an operand) are accessible. Here is the list of Subleq macros:

    • jump
    • clear
    • not
    • shl
    • inc
    • dec
    • chs
    • label
    • return
    • putc
    • getc
    • wrtInt
    • rdInt
    • sub
    • add
    • copy
    • ncopy (negative copy)
    • jlez
    • jgez
    • jeqz
    • jgtz
    • jltz
    • jnez
    • jmin
    • store
    • absolute
    • xor
    • and
    • or
    • nand
    • nor
    • new sub
    • new add
    And here is th list of compiler OpCode that are supported:
    • system
    • movAxVar
    • movVarAx
    • movAxImm
    • movAxBx
    • movBxAx
    • pushAx
    • pushBx
    • pushFx
    • popAx
    • popBx
    • popFx
    • addBx
    • subBx
    • mulBx
    • divBx
    • jmp
    • jz
    • wrtAx
    • wrtLn
    • rdAx
    • halt
    • cmpAx
    • orAxAx
    • xorAxAx
    • setEq
    • setNE
    • setLT
    • setLE
    • setGT
    • setGE
    The "system" command sets up Subleq with:
    • HALT (0)
    • OUTPUT (-1)
    • INPUT (-2)
    • Z (zero)
    • T (temp)
    • t (tmp)
    • P (+1)
    • N (-1)
    • the CPU model:
      • Ax
      • Bx
      • Fx
      • SP
      • DP
    • and few useful constants:
      • _SP (-32)
      • _DP (-16384)
      • _MIN (-32768)
      • _CR (13)
      • _LF (10)
      • _SPACE (32)
      • _MINUS (45)
      • _ZERO (48)
      • _NINE (57)

    The Subleq system also has a point to pointer copy (PPC) routine (as it is assumed the monitor code will be stored in ROM and pointer to pointer copy requires self modifying code).

    I will likely add many more constants to support string output later.

    I have added the binary operation macros:

    • XOR Ax,Bx
    • AND Ax,Bx
    • OR Ax,Bx
    • NAND Ax,Bx
    • NOR Ax,Bx
    • NOT Ax
    • SHL Ax
    • SHR Ax (to be added)

    These bit operations don't update the flag register.

    Integer Minimum Value

    The minimum integer value (MIN) for a 16 bit integer is -32768. Subleq has a major problem with MIN. It treats MIN as equal to 0 unless special precautions are taken. This is not usually a problem unless the code uses the sign bit (such as for bit operations etc.). In the end I rewrote all the test macros (i.e. jeqz, jnez, jlez, jltz, jgez and jgtz) to be MIN aware.

    I also used MIN as the return value for multiplication overflow and attempt to divide by zero. There is a "jmin" macro if required.

    Working Subleq Backend

    I have got a working Subleq backend test-bed (TestMacro2Subleq), for example the following Subleq macros:

      // iMul
      // iDiv

    Produces after assembly and interpretation:

    C:\AlanX\TestSubleq>TestMacro2Subleq  1>Test.sasm
    C:\AlanX\TestSubleq>subleq_asm -i Test.sasm -o Test.code -l Test.list
    C:\AlanX\TestSubleq>Subleq_Int -i Test.code -l Test.list -t Test.trace
    SUBLEQ Interpreter
    6   <- Entered integer
    2   <- Entered integer
    12  -> Integer multiplication (=6*2)
    3   -> Integer division (=6/2)
    0   -> Integer remainder (=6%2)
    Interpreter finished
    Press any key to continue . . .

    The Subleq code produced is too long to list here.


  • Translated the Pascal Code to C Code

    agp.cooper06/22/2017 at 16:18 0 comments

    C Code

    The code has been translated. Quite a bit of clean up:

    • General regrouping of code.
    • Fixed memory reference error (mixed up low/high byte order).
    • Simplified the symbol table.
    • Added code to export the code being tokenise to "stderr"
    • Reworked the error reporting routine.
    • Fixed up the EOF problem.
    • Reordered the constants and pushed them out to a header file.
    • Pushed out the assembler and interpreter code to include files.
    • Changed the CPU model.
    • Added some options to the command line.

    The current version of the tool tranin has been uploaded.

    Understanding the Code

    The compiler code in most of the minimalist program codes I reviewed are very similar. Even to the point where missing instructions (i.e.">") and constant names are the same (i.e. SETLSS rather than SETLT and SETLEA instead of SETLE).

    A good site for compiler construction (but not complete) is I will likely refer back to this site for when I add functions to my code.

    A good pdf on compiler construction (by Jack W. Crenshaw) is (I have uploaded is book compiler.pdf).

    But the place to go is "Recursive-Descent Parsing" Chapter 6.6 of the AWK book pp 147-152. (I have uploaded a pdf of the book and the programs from Chapter 6).

    The NewCPU Model

    The CPU model for nearly all the compiler I have reviewed is:

    • Ax: Accumulator register
    • Bx: Secondary register
    • Fx: Flag register
    • SP: Stack Pointer
    • DP: Data Pointer (Simple Compiler)
    • 16 bit data and address word width

    Usually all data is on the stack. The Simple Compiler uses a separate Data Pointer (DP) (i.e. modelling a Pascal style "heap memory"). I will keep this method.

    The the assembler code therefore revolves around Ax and Bx:

    • MOV Ax, Bx
    • MOV Bx, Ax
    • PUSH Ax
    • PUSH Bx (added)
    • PUSH Fx (added)
    • POP Ax
    • POP Bx
    • POP Fx (added)
    • MOV AX, IMM
    • MOV AX, [ADDR] (uses DP)
    • MOV [ADDR], AX (uses DP)

    Two basic jumps:

    • JMP [ADDR]
    • JZ [ADDR]

    The remainder maps the language symbols (i.e. '+', '-', '*', '/', '=',' <>','<','<=','>','>=') :

    • ADD Ax, Bx
    • SUB Ax, Bx
    • IMUL Ax, Bx (software)
    • IDIV Ax, Bx (software)
    • SETEQ
    • SETNE
    • SETLT (added)
    • SETLE (added)
    • SETGT
    • SETGE

    The set commands set (for true) or reset (for false) the AX depending on the Flag register. True is defined as Ax=1 and false is defined as Ax=0.

    The following commands set the flag register (Fx):

    • OR Ax, Ax (Sets the Flags based on the Ax)
    • XOR Ax, Ax (Clears the Ax and sets flags)
    • CMP Ax, Bx (Sets flags based on Ax-Bx)

    Finally there is a HALT and a couple of I/O commands:

    • HALT (same as reset)
    • wrtAx (converts an integer to characters before printing)
    • wrtLn (write a new line)
    • rdAx (added, reeads characters and converts to integer)

    Language Construct

    Languages can me modelled in Backus–Naur Form (BNF). Before extending the language I need to map the existing language. I use the syntax from the AWK handbook for the language construct:

    // Factor         -> Identifier
    //                -> Integer
    //                -> ( BoolExpression )
    // Term           -> Factor
    //                -> Factor * Factor+
    //                -> Factor / Factor+
    // Expression     -> Term
    //                -> + Term
    //                -> - Term
    //                -> Term + Term+
    //                -> Term - Term+
    // BoolExpression -> Expressiom
    //                -> Expression = Expression
    //                -> Expression <> Expression
    //                -> Expression < Expression
    //                -> Expression <= Expression
    //                -> Expression > Expression
    //                -> Expression >= Expression
    // Statement      -> Begin
    //                -> While
    //                -> If
    //                -> Write
    //                -> Read
    //                -> Assignment
    // Begin          -> Statement    \\ This is where a statement separator (";") should added!
    //                -> End
    // While          -> BoolExpression Statement
    // If             -> BoolExpression then Statement
    //                -> BoolExpression then Statement "else" Statement
    // Write          -> BoolExpression , BoolExpression+
    // Read           -> Input
    // Assignment     -> Identifier = BoolExpression

    Note: the "+" at the end of a line means it can be repeated.

    Its okay but I think the program should begin with a "begin" and end with an "end". The Statement procedure needs to be modified for this. There probably needs to be a semi-colon (";") as a statement separator as well. I think it might get messy without a statement delimiter later.


    Read more »

View all 7 project logs

Enjoy this project?



prosto wrote 10/11/2018 at 07:12 point

add llvm r code  generate

  Are you sure? yes | no

agp.cooper wrote 10/12/2018 at 04:13 point

I have looked at adding a SubLEq CPU model (i.e. code generator) to a compiler (such as Small-C, vbcc  and LCC). Small-C has 107 p-Codes while LCC is too complex for me to decode. vbcc looks to be easier than lcc but still pretty complex. But in any case its looks like a very big job.

PL/0 has about 20 to 32 p-Codes (depending on extensions), so you can see why I started there.

The main problem with SubLEq is that it is just too inefficient. The top down approach, i.e. Tokenizer -> Compiler (p-Code) -> Assembler, approach is just truly terrible with regard to efficiency.

I played around with an interpreter (think of a FORTH like interpreter) for my TTA CPU which is bottom up and that seems to be a better way to go. Building up (i.e. interpreting) Op-Code like instructions seems to be more efficient. Basically emulate an existing CPU that a compiler supports. There is a Small-C compiler for the i8080 for example. Very likely only a sub-set of the i8080 needs to be emulated.


  Are you sure? yes | no

prosto wrote 10/15/2018 at 10:10 point

AST RUBY yarv have small code number

  Are you sure? yes | no

legacy wrote 07/26/2017 at 20:49 point

Small C compiler also had a few articles published in 90s.

  Are you sure? yes | no

agp.cooper wrote 07/24/2017 at 17:09 point

Hi BigEd

The LPG-30 is an interesting machine but I am working on a micro-code CPU at the moment. Still about 30 chips but programmable for TTA/Subleq and/or OpCodes. Initial calcs suggest about 5 times slower than a direct TTA/Subleq. It is a bit like, which in my book is pretty cool.

I will finish the p-code to subleq code tomorrow and that will be the end of this project (until I have a new CPU).


  Are you sure? yes | no

Ed S wrote 07/25/2017 at 11:30 point

sounds interesting!

  Are you sure? yes | no

agp.cooper wrote 07/26/2017 at 10:28 point

Hi BigEd,
Add a new project "CPU4" with the design concept.
Currently testing the circuit in Tina.
I have modelled Subleq and TTA with it.

Regards AlanX

  Are you sure? yes | no

Ed S wrote 07/26/2017 at 10:31 point

excellent - I'll keep an eye on that! #CPU4 

  Are you sure? yes | no

agp.cooper wrote 07/22/2017 at 06:00 point

Hi Dylan,
I have looked at other peoples basic C compilers (C4 is a good example) and the basic compiler structure for PL/0 and C are almost the same. The main downside for C is the work to support for all extras that the language offers. PL/0 is minimal but with the addition of arrays and strings I find it (so far) fine for a home brew CPU.
I have finished the P-Code to OpCode translation program and I am ready for the OpCode to Subleq translation. To get this far in less than two weeks indicates that PL/0 was a good choice.
Not hard to make PL/0 look a bit like C if that is acceptable but if you really want C compiler than have a look at Small C. The Small C compiler has a separate code generator function so you can port to your own CPU.

  Are you sure? yes | no

Dylan Brophy wrote 07/22/2017 at 06:36 point

Thank you, I will look into it.

  Are you sure? yes | no

Dylan Brophy wrote 07/21/2017 at 23:41 point

Would you ever be interested in writing an actual C/C++ compiler?

  Are you sure? yes | no

agp.cooper wrote 07/07/2017 at 11:10 point

Hi BigEd,
I downed, compiled and ran the test code for "C--". Yes it compiles and works but missing lots. I think I would start with something more advanced.
I have been looking at Chris Lewis' Small C at the moment. The backend (i.e. the code generator) looks pretty easy to modify.


  Are you sure? yes | no

Ed S wrote 07/24/2017 at 11:36 point

Hmm, indeed, on further inspection C-- is a bit too weak. I do like the look of your PL/0 findings. The other language we are now considering is PLASMA. It's VM-based and targeted at 6502 so the challenge would be a port of the VM.

  Are you sure? yes | no

agp.cooper wrote 07/24/2017 at 12:29 point

PLASMA looks pretty good (more advanced than PL/0). I would go with PLASMA.

I would have been happy to know about it a month ago!

But I have learnt a lot from PL/0 about compilers and what is expected from the CPU!


You can try PL/0 by downloading and running one of the batch files (test or test2). It will prompt you fro a string and then print it back. Try "Hello World!" for both. Test.bat will also do some multiplication and division the hard way. Test2.bat uses a short string so it will drop the "!". Both codes export the tokens, symbol table, P-Code, OpCode, the run output and the Stack. All interesting if you want to build your own compiler otherwise not very interesting at all!


The byte code for PLASMA has been defined so pretty easy to port to any target.

Its stack/forth like so pretty simple to port as well.


Subleq is a dog. It takes 1200 lines of code to port P-Code. Hopefully your CPU is not so bad. 

When I started with Subleq I expected to build an 8 bit machine. Half way through I realised it had to be at least 12 bit to be usable. Even 16 bit can only handle small programs! The problems become harder as a simple front panel will not cut it for IO for Subleq (I am not going to enter 1,000s words of code via a front panel).

I have moved away from Subleq.

Regards AlanX

  Are you sure? yes | no

Ed S wrote 07/24/2017 at 13:37 point

Thanks, some good thoughts there. I was a bit tempted by PL/0 but I think you've talked me out of it!

You mustn't be surprised if a one-instruction computer has low code density, that must be part of the deal. How about a low-complexity early machine like the LGP-30? You could call yours the AGP-301

  Are you sure? yes | no

agp.cooper wrote 07/07/2017 at 10:29 point

Hi BigEd,
C-- looks like a good choice. I wish I started there. To late now for me.
But it has no operator precedence! But it is left to right. So you will need to rebuild the parser. The code generator hides some of the complexity. Exporting "XOR" for the token "EQUALS" takes a moment to work out what it is doing (Ax xor Bx sets Ax to zero if Ax == Bx). I quite like the direct link between brackets and push/pop.
This code and my code are about the same length but my code has a builtin interpreter (but less comments).

  Are you sure? yes | no

Ed S wrote 07/06/2017 at 12:10 point

Interesting project - will have a good look! At present we have a vague idea of porting C-- compiler from PCW 1988

  Are you sure? yes | no

agp.cooper wrote 06/28/2017 at 01:07 point

Hi Lucas,

It is style thing. I prefer if {} else if ... because:

1 the indentation is strictly controlled by the {...} set (important for me)

2 the code avoids the break and the issue of not using the break (deliberately or by accident)

3 not using the break is harder to understand (and introduces bugs)

3 the code is shorter!

4 execution wise it is no worse (maybe better)

5 easier to translate into assembler

6 I am not restricted to a simple "==" test (I can use strcmp() and other boolean constructions)


In my test program I use strcmp as it is much easier to work with strings in this case.


The main down side is that the code is a bit harder to read but if you add a line before the next "if {}else if" it is a lot easier to read (the line can be deleted later).


All comments are take it or leave it, just an alternate point of view from a hobbyist programmer.


Regards AlanX

  Are you sure? yes | no

Lucas O. Wagner wrote 06/27/2017 at 18:12 point

hi there!  in your Execute command in SimpleCv2.C, i'd recommending using a switch statement instead of the else if stuff.  e.g.:

switch (IR) {
  case _movAxVar:
  case __movVarAx:

maybe you can move your code to github (or similar) for version control?

  Are you sure? yes | no

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates