It is time for me to write the virtual machine interpreter for this project. In a previous project log I once went over a fast instruction dispatch mechanism. I'll go over that briefly here: One goal was to have single-byte instructions. Reading 1 byte internal SRAM on the AVR takes 2 clocks. So, at a minimum we have this:
ld Reg, X //2 clocks, fetches 1 vm instruction
This loads a 1 byte instruction from where X is pointed. Remember, X is a 16-bit register made up of 2 8-bit registers. Using X as an instruction pointer for the VM has the advantage the there is special hardware to increment (even with carry) the 16-bit pointer at the same time we do the fetch. So, we have:
ld Reg, X+ //2 clocks, fetches 1 vm instruction, increments instruction pointer for next instruction
Now, we need to decode the instruction to a handler. There are many ways to do this, one is a lookup table. I decided to have the instruction simply be the value of the lookup table by making all instruction handlers aligned to 256 words. This means if a handler is located at 0x1300, it is the handler for instruction 0x13. The handlers are small, so there is a lot of empty memory between handlers. This was going to be accepted as a trade off, and there was going to be a small number of instructions. If we use the Z register to hold the address of instruction handler, and we preload the low byte of Z (ZL) with zero (and never write to it again), we can now fetch/decode the instruction with:
ld ZH, X+ //2 clocks. fetches 1 vm instruction, increments instruction pointer for next instruction, Z is a pointer to the instruction handler ijmp (jump to address in Z) //2 clocks to jump to the handler
The AVR has the instruction IJMP which jumps to whatever address is in 'Z'.
So what does a handler look like? Well, the handlers are short. Typical operations planned were things like 16-bit register add, with 2 operands. (Similar to x86):
handler for add A, B:
//AL and BL are #defines to some of AVR's registers add AL, BL //1 clock add low byte adc AH, BH //1 clock add high byte jmp fetch //jump back to the instruction fetcher (2 clocks)
Of course 'jmp fetch' can be replaced with a copy of the instruction fetcher, so we don't waste 2 clocks jumping to the fetcher:
add AL, BL //1 clock add low byte adc AH, BH //1 clock add high byte ld ZH, X+ //2 clocks. fetches 1 vm instruction, increments instruction pointer for next instruction, Z is a pointer to the instruction handler ijmp (jump to address in Z) //2 clocks to jump to the handler
So in 6 clocks we can do one basic 16-bit mathematical operation, and set-up for the next instruction handler. The only problem is this depends on the instruction handlers being laid out in memory in a very wasteful way.
After leaving this project alone for a while, I realized if the number if instructions is kept low enough, all the instruction handlers might fit into 1 'page' of 256 instructions. This means instead of locking ZL to zero, and changing ZH, I can lock ZH to some properly aligned area, and change ZL. How many handlers can I fit here? The above 16-bit math handler comes out to 4 instruction words. 256/4 = 64. This means I can fit 64 simple handlers here. In practice I expect to be able to handle more than 64 instructions, because I don't need all instructions to be quite so fast. Simple register-to-register math ops should be made fast. More complex instructions that already need more time to execute, such as division, i/o, etc, can have a simple 1-word 'stub' handler in the aligned section that jumps-out to a larger space. For these complex instructions, the aligned handler section of code will act more like a jump table. This jump takes 2 clocks to execute, but on a slow division operation, it would probably not be noticable. If I can't get all the needed instructions to fit, I could also make one of the instruction handlers a 'prefix code', switching ZH to a second page of instruction handlers. The slower or lesser-used instructions would be located here.
I decided the VM running should be a 16-bit CPU instead of 8. The reason is if I apply the above handler strategy to 8-bit values, a simple 'add' handler takes 5 clocks. The 16-bit version takes 6 clocks. Often, the user will chain two 8-bit operations together into a 16 bit operation, so doing a 16-bit add in 6 clocks is better than doing two 8-bit adds at 5 clocks each. Most of the interpreter time is spend fetching the next instruction and jumping to the handler, so while I'm in the handler I may as well make the most of it and do 16-bit operations.
I also decided on a register machine. The reason is VM registers can be mapped to the many (32!) AVR registers. A stack machine means slow SRAM access: 4 clocks for each 16-bit value read or written. A simple 'add' operation of popping 2 values, adding, and pushing 1 value, would take 14 clocks. If you optimize by keeping the 'top of stack' in a register, you do better, but the stack is still icky to access.
I also don't need or want 'decodable' instructions. There isn't an opcode field, register field, etc like you would see on a lot of common instruction sets. The possible combinations of instructions are simply enumerated. There is a handler to add register B to A. There is another handler to add register C to A. The handlers are simply numbered, and they are numbered accordingly to where to handler is located in program memory. Obviously the actual binary values of all the instructions in the instruction set won't be known until all the handlers are written and in their final form.
First Proposal: Accumulator Architecture
I was originally going to go with a simple accumulator architecture. There would be 4 registers, A, B, C, and D. Any math operation could require that 'A' is the destination. 'A' is referred to as the accumulator. This is attractive because it dractically cuts down on the number of combinations. Instead of handlers for all of these cases:
There are simply only 3 cases: add one of the registers to 'A'
Where did add A,A go? Don't need it: adding A to itself is the same as shifting left A by 1.
Other useless instructions:
xor A,A : clears A
and A,A : does nothing
or A,A: does nothing
sub A,A : clears A
So 3 handlers per basic math op. Add in a few more operations:
add, sub, and, or, xor, compare, swap, mov, movr (A is src instead of destination) (9)
9*3 = 27. 27 instructions and basic math is taken care of. Add in mul, div, mod, some load/store, push/pop, call, return, and jumps, and we have a pretty complete instruction set in a small number of actual opcodes.
Then I sat down and tried to figure out how to generate code for this thing. I tend to think in stack machines. Even if you take a langauge like 'C', you can break down expressions into a forth-like RPN notation that is very easy to compile for a stack machine. So, my problem is, given a list of operations in stack-machine order how do you apply this to an accumulator machine? I tried to some up with different scemes, and they all involved a lot of swapping variables around.
I came up with this as a compromise between an accumulator machine, and a general register machine. In this scheme, I allow any register to be a destination, but the second operand must be the register next to it. And it has to be in particular winding order. If there are 4 register, I have this:
A <- B -< C <- D <- (wraparound to A)
This makes it easy to generate stack-like code for it. This simple stack sequence like this:
push 5 push 3 add push 2 push 1 add xor
would turn into this:
mov a,5 mov b,3 add a,b mov b,2 mod c,1 add b,c xor a,b
I end up with a stack machine that has a hardware depth of '4'. The stack pointer is implicit: There is no pointer, the person or compiler creating the code keeps track of the stack depth. If more than '4' is needed, the registers can be spilled onto the real stack.
This also means, because there is no dedicated accumulator, we need 4 possible instructions per math operation:
This does reduce the number of operations that can be supported, beacuse, remember, I can only have 64 simple handlers in the table.
Right now I am writing an emulator in C on my PC for this type of instruction set, to see what I really need. One idea, is getting rid of a register and only having 3. This aligns well to the FORTH keywords dup, over, swap, rot, which operate on the top 3 stack items. PICK is used for items under that. The burden is on the compiler to create reg/reg operations tracking the top 3 stack items, but all items under that are placed in the real stack and will be accessible by a pick-like instruction. The advantage over the accumulator machine is a stack paradigm language, such as forth or RPN can easily generate code for this. I call it the rolling accumulator, because as you evaluate expressions, the register that represents the current 'accumulated' value rolls around the ring of adjacent registers.
That's all for now. If anyone is reading I'm interested in hearing your thoughts on if the rolling accumulator has any advantages.
Another thing I am pondering: what would an accumulator oriented language look like? My only complaint for an accumulator machine is generating code from a stack-language is awkward. Suppose during evaluation of an expression, you need to evaluate a sub-expression? You need to put the current result somewhere, evaluate the sub-expression in the accumulator, and then go back to what you were doing. What does that look like as a language?