I've begun work on the virtual machine interpreter that will run on the Cat-644. I am writing it as a separate Atmel Studio project, since I could see it being useful outside of the Cat-644.
The interpreter assumes that once it is invoked, it will never exit. Any interrupts already 'hooked up' to C functions will still operate normally. The interpreter can call out to a single user-defined function called 'syscall'. Syscall is free to call out to other C functions, as the C stack will be available and left intact for this purpose.
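As a rough illustration of that hook, here is a minimal sketch of what the native side could look like. The two 16-bit arguments and the 16-bit return value match the Syscall instruction described further down (A and B in, result back in A). Everything else is an assumption made for the example: the call numbers, the idea of using A as a call selector, the error value, and the stand-in serial helpers. The function is named `vm_syscall` here only so the sketch does not clash with the host C library's own `syscall` when compiled on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical call numbers; the interpreter itself does not define any. */
enum {
    SYS_PUTC = 1,   /* assumed: send the low byte of B out the serial port */
    SYS_GETC = 2    /* assumed: read one byte from the serial port */
};

/* Stand-ins for a real serial driver, so the sketch compiles off-target. */
static uint8_t last_byte_sent;
static void serial_write(uint8_t b) { last_byte_sent = b; }
static uint8_t serial_read(void) { return 0x41; }

/* a = contents of VM register A, b = contents of VM register B.
   The return value is what the interpreter would load back into A. */
uint16_t vm_syscall(uint16_t a, uint16_t b)
{
    switch (a) {
    case SYS_PUTC:
        serial_write((uint8_t)b);
        return 0;
    case SYS_GETC:
        return serial_read();
    default:
        return 0xFFFF;  /* assumed error code for an unknown call */
    }
}
```

Because this runs on the ordinary C stack, each case is free to call into larger native routines (disk, graphics, allocation) without the interpreter needing to know about them.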
simavr Windows Port (small side project)
I am mostly a Linux user. An exception to that rule is AVR development. I really like Atmel Studio's IDE, especially the AVR simulator in single-step mode. Every hardware peripheral is right there, clock by clock, in a nice graphical way. You can use gdb for this, but it is simply more convenient to press the single-step key and watch port registers turn on and off. It was extremely useful for debugging the PS/2 keyboard and VGA signal code. What is not good about it is the lack of serial port support. You can watch a byte appear on the serial register, and you can poke 1 byte at a time into the register, but it is a pain to do so. This is where the simavr open source AVR simulator really shines. It has full serial port emulation, and on Linux even gives you a pty you can attach a terminal to. When developing a bytecode interpreter, I want both: I want to single-step through code in Atmel Studio on Windows, especially when watching things like the stack frame or studying the timings of different routines. And then, I want to run a program at full speed, and interact with it like it is on a serial port. What I needed was simavr on Windows. simavr had mingw support, but I didn't want to set up mingw just for this, and I was curious what it would take to get it to run on Visual Studio. I got it working well enough for my current project: https://github.com/carangil/simavr-visualstudio
The goal is for the interpreter handlers of all of these instructions to fit within 256 instruction words on the AVR. This is because of the way I am fetching these instructions. All the instruction handlers sit on a single 256-word page of flash, so they all have the same high address byte. This is so the ZH register can be set up once. The instruction bytes themselves are loaded directly into the ZL register, and an IJMP is performed. At the very least, the entry points for all the instructions must fit in this page: if certain instructions are long, the handler can jump out to another routine.
The virtual machine has four 16-bit registers, labeled A, B, C and D.
Register 'A' is the special Accumulator register, and most instructions require its use. This is to keep the number of possible instruction encodings as small as possible. All instructions that don't have immediate data are 1 byte long, and instructions that need data are followed by 1 or 2 bytes.
There is a stack, managed by the 'Y' 16-bit index register of the AVR. This is separate from the C stack. It doesn't have to be separate, but that is the case at the moment.
- LI (Load immediate) Can load an immediate 16-bit value into any register A,B,C,D
- Swap: Swap the contents of A with either B,C or D
- Arithmetic Instructions: Performs an operation between registers B,C,D and A, and stores result in the accumulator (register A)
- subr (not yet implemented): performs reg-=A instead of A-=reg
- cmp: Does a trial subtraction, but doesn't modify registers. Sets internal flags.
- adc: Add-with-carry, allowing 32-bit and higher math.
- Syscall: The C function 'syscall' is called, with 2 16-bit arguments. The first argument is the contents of A, and the second argument is the contents of B. The return value of the C function is returned to the interpreted program in register A. The rest of the registers are unmodified. Complex, operating-system-like operations will be done here, in native AVR code, rather than in interpreted code. This will include serial i/o, disk i/o, graphics routines, memory allocation, etc.
- Jump instructions (some of these are implemented): All jumps are to relative addresses
- jmpr: jump to a 16-bit relative address
- Accumulator value jumps: looks at value in register 'A', not the result of the last operation
- janz: jump if A is not zero
- jaz: jump if A is zero
- jan: jump if A is negative (if MSB bit set)
- Arithmetic comparison jumps: looks at the result of 'cmp', 'add', etc.
- je: jump if values were equal
- jne: jump if values were not equal
- ja: jump if unsigned value in accumulator was larger than register (A > reg)
- jb: jump if unsigned value in accumulator was smaller than register (A < reg)
- Stack Instructions: These can go under 'memory instructions', but are a little bit of a special case
- push reg : Push any register to the stack
- pop reg: pop the top of the stack into any register
- pop <immediate> pop n items off the stack (items are discarded, not loaded anywhere)
- pick reg, <immediate> load the nth item on stack into any register
- put reg, <immediate> store any register into the nth position on the stack
- Memory Instructions: I wanted to keep memory access simple: the other instructions don't access memory at all. I wasn't sure if it was going to be more common to store a register to a computed address (where the Accumulator is probably where the computed address ends up), or to store the result of a computation (in the accumulator) to an address already contained in another register. I decided to support both cases. This results in 4 instructions:
- ld A, *reg (Load value pointed to by register A,B,C or D into A) (4 encodings)
- st *reg, A (Store value in A to memory pointed to by B, C or D) (3 encodings)
- ld reg, *A (Load value pointed to by A into B, C or D) (3 encodings)
- st *A, reg (Store value in B, C or D to memory pointed to by A) (3 encodings)
- Subroutines: I am undecided if the VM data stack should be used for this, or a separate stack instead. Forth has a separate stack, and it comes in handy to not have the return address in the way.
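To make a few of the instructions in the list concrete, here is a small C model of the register file, the flag-setting arithmetic, the unsigned comparison jumps, and the stack pick/put operations. It is a sketch of the intended semantics only, under several assumptions that are not stated above: the flags are modelled as a carry bit and a zero bit, the stack depth of 32 is arbitrary, and pick/put count from 0 at the top of the stack. All type and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { REG_A, REG_B, REG_C, REG_D };

typedef struct {
    uint16_t reg[4];     /* A, B, C, D */
    uint8_t  carry;      /* carry out of add/adc, or borrow from cmp */
    uint8_t  zero;       /* last result was zero (cmp: operands equal) */
    uint16_t stack[32];  /* VM data stack; depth is an assumption */
    uint8_t  sp;         /* number of items currently on the stack */
} vm_t;

/* Shared by add and adc: A += reg + carry_in, flags updated. */
static void add_with(vm_t *vm, int r, uint8_t carry_in)
{
    uint32_t sum = (uint32_t)vm->reg[REG_A] + vm->reg[r] + carry_in;
    vm->reg[REG_A] = (uint16_t)sum;
    vm->carry = (uint8_t)(sum >> 16);
    vm->zero = (vm->reg[REG_A] == 0);
}

void vm_add(vm_t *vm, int r) { add_with(vm, r, 0); }
void vm_adc(vm_t *vm, int r) { add_with(vm, r, vm->carry); }

/* cmp: trial subtraction A - reg; registers are left untouched. */
void vm_cmp(vm_t *vm, int r)
{
    vm->zero  = (vm->reg[REG_A] == vm->reg[r]);
    vm->carry = (vm->reg[REG_A] <  vm->reg[r]);
}

/* Conditions tested by je, ja and jb after a cmp (unsigned). */
int vm_je(const vm_t *vm) { return vm->zero; }
int vm_ja(const vm_t *vm) { return !vm->carry && !vm->zero; }  /* A > reg */
int vm_jb(const vm_t *vm) { return vm->carry; }                /* A < reg */

void vm_push(vm_t *vm, int r)
{
    assert(vm->sp < 32);
    vm->stack[vm->sp++] = vm->reg[r];
}

void vm_pop(vm_t *vm, int r)
{
    assert(vm->sp > 0);
    vm->reg[r] = vm->stack[--vm->sp];
}

/* pick/put address the nth item counting down from the top (n = 0). */
void vm_pick(vm_t *vm, int r, uint8_t n)
{
    assert(n < vm->sp);
    vm->reg[r] = vm->stack[vm->sp - 1 - n];
}

void vm_put(vm_t *vm, int r, uint8_t n)
{
    assert(n < vm->sp);
    vm->stack[vm->sp - 1 - n] = vm->reg[r];
}
```

A 32-bit addition then becomes two steps: `vm_add` on the low words, followed by `vm_adc` on the high words so the carry from the low half is folded in.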
I feel like the above is a pretty comprehensive instruction set for the kind of machine I am making. I chose 16-bit instead of 8-bit because, for many of the operations, it only requires 1 additional clock cycle per instruction. Often on an 8-bit machine, two operations are cascaded to make 16-bit operations anyway. Basic arithmetic instructions complete one every 6 AVR clock cycles (20 MHz AVR = 3.3 million arithmetic instructions per second), and instructions involving data memory (push, pop, ld, st) take around 12 clock cycles (1.6 million per second). The Cat-644 in fast mode (every other line on the display is dropped) runs at an effective rate of 10 MHz, so divide the numbers above in half. When compared to a 1 MHz 6502, which is often cited as 300 to 400 thousand instructions per second, I think this interpreter pulls ahead.
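The throughput figures above are just clock rate divided by cycles per instruction. A tiny helper makes the arithmetic explicit; the function name is made up, and the cycle counts are the approximate ones quoted in the paragraph.

```c
#include <assert.h>
#include <stdint.h>

/* VM instructions per second for a given effective clock rate and
   per-instruction cycle cost (integer division, so it rounds down). */
uint32_t vm_ips(uint32_t clock_hz, uint32_t cycles_per_instr)
{
    assert(cycles_per_instr != 0);
    return clock_hz / cycles_per_instr;
}
```

With these inputs, `vm_ips(20000000, 6)` is about 3.3 million and `vm_ips(20000000, 12)` about 1.67 million; at the 10 MHz effective rate the same calls give roughly 1.67 million and 0.83 million.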
I need to finish some instruction handlers. I also need to write an assembler, since hand assembling bytecode is a little annoying. Also, every time I change a handler, the bytecodes change, because the bytecode is the offset into the interpreter code. I need an automated way to dump out the addresses of all the handlers, and use that to generate the bytecodes.