
Virtual Instruction Set Candidates

A project log for Cat-644

An 8-bit computer based around an AVR ATMega-644 and junk-box SRAM.

Mark Sherman • 02/01/2017 at 07:32 • 2 Comments

It is time for me to write the virtual machine interpreter for this project. In a previous project log I went over a fast instruction dispatch mechanism; I'll recap it briefly here. One goal was to have single-byte instructions. Reading 1 byte from internal SRAM on the AVR takes 2 clocks. So, at a minimum we have this:

ld Reg, X   //2 clocks, fetches 1 vm instruction

This loads a 1-byte instruction from where X points. Remember, X is a 16-bit register made up of 2 8-bit registers. Using X as an instruction pointer for the VM has the advantage that there is special hardware to increment the 16-bit pointer (even with carry) at the same time we do the fetch. So, we have:

ld Reg, X+ //2 clocks, fetches 1 vm instruction, increments instruction pointer for next instruction

Now, we need to decode the instruction to a handler. There are many ways to do this; one is a lookup table. I decided to let the instruction byte itself do the lookup, by aligning all instruction handlers to 256-word boundaries. This means if a handler is located at 0x1300, it is the handler for instruction 0x13. The handlers are small, so there is a lot of empty memory between handlers. This was accepted as a trade-off, since there was going to be a small number of instructions. If we use the Z register to hold the address of the instruction handler, and we preload the low byte of Z (ZL) with zero (and never write to it again), we can now fetch and decode the instruction with:

ld ZH, X+   //2 clocks.  fetches 1 vm instruction, increments instruction pointer for next instruction, Z is a pointer to the instruction handler
ijmp   //2 clocks, jump to the handler at the address in Z

The AVR has the instruction IJMP which jumps to whatever address is in 'Z'.

So what does a handler look like? Well, the handlers are short. Typical operations planned were things like a 16-bit register add with 2 operands (similar to x86):

handler for add A, B:

//AL and BL are #defines to some of AVR's registers

add AL, BL   //1 clock add low byte

adc AH, BH   //1 clock add high byte

jmp fetch   //jump back to the instruction fetcher (2 clocks)

Of course 'jmp fetch' can be replaced with a copy of the instruction fetcher, so we don't waste 2 clocks jumping to the fetcher:

add AL, BL   //1 clock add low byte

adc AH, BH   //1 clock add high byte

ld ZH, X+   //2 clocks.  fetches 1 vm instruction, increments instruction pointer for next instruction, Z is a pointer to the instruction handler

ijmp   //2 clocks, jump to the handler at the address in Z

So in 6 clocks we can do one basic 16-bit mathematical operation, and set-up for the next instruction handler. The only problem is this depends on the instruction handlers being laid out in memory in a very wasteful way.

Less waste.

After leaving this project alone for a while, I realized that if the number of instructions is kept low enough, all the instruction handlers might fit into 1 'page' of 256 instruction words. This means instead of locking ZL to zero and changing ZH, I can lock ZH to some properly aligned area, and change ZL.

How many handlers can I fit here? The above 16-bit math handler comes out to 4 instruction words. 256/4 = 64. This means I can fit 64 simple handlers here. In practice I expect to be able to handle more than 64 instructions, because I don't need all instructions to be quite so fast. Simple register-to-register math ops should be made fast. More complex instructions that already need more time to execute, such as division, I/O, etc., can have a simple 1-word 'stub' handler in the aligned section that jumps out to a larger space. For these complex instructions, the aligned handler section of code acts more like a jump table. This jump takes 2 clocks to execute, but on a slow division operation it would probably not be noticeable.

If I can't get all the needed instructions to fit, I could also make one of the instruction handlers a 'prefix code', switching ZH to a second page of instruction handlers. The slower or lesser-used instructions would be located there.

Instruction Set

I decided the VM should be a 16-bit CPU instead of 8-bit. The reason is that if I apply the above handler strategy to 8-bit values, a simple 'add' handler takes 5 clocks; the 16-bit version takes 6 clocks. Often, the user will chain two 8-bit operations together into a 16-bit operation, so doing a 16-bit add in 6 clocks is better than doing two 8-bit adds at 5 clocks each. Most of the interpreter time is spent fetching the next instruction and jumping to the handler, so while I'm in the handler I may as well make the most of it and do 16-bit operations.

I also decided on a register machine. The reason is VM registers can be mapped to the many (32!) AVR registers. A stack machine means slow SRAM access: 4 clocks for each 16-bit value read or written. A simple 'add' operation of popping 2 values, adding, and pushing 1 value, would take 14 clocks. If you optimize by keeping the 'top of stack' in a register, you do better, but the stack is still icky to access.

I also don't need or want 'decodable' instructions. There isn't an opcode field, register field, etc. like you would see in a lot of common instruction sets. The possible combinations of instructions are simply enumerated. There is a handler to add register B to A. There is another handler to add register C to A. The handlers are simply numbered, and they are numbered according to where the handler is located in program memory. Obviously the actual binary values of all the instructions in the instruction set won't be known until all the handlers are written and in their final form.

First Proposal: Accumulator Architecture

I was originally going to go with a simple accumulator architecture. There would be 4 registers: A, B, C, and D. Any math operation would require that 'A' is the destination; 'A' is referred to as the accumulator. This is attractive because it drastically cuts down on the number of combinations. Instead of handlers for all of these cases:

add A,A

add A,B

add A,C

add A,D

add B,A

add B,B

add B,C

add B,D

add C,A

add C,B

add C,C

add C,D

add D,A

add D,B

add D,C

add D,D

there are only 3 cases: adding one of the other registers to 'A':

add A,B

add A,C

add A,D

Where did add A,A go? We don't need it: adding A to itself is the same as shifting A left by 1.

Other useless instructions:

xor A,A : clears A

and A,A : does nothing

or A,A: does nothing

sub A,A : clears A

So 3 handlers per basic math op. Add in a few more operations:

add, sub, and, or, xor, compare, swap, mov, movr (A is src instead of destination) (9)

9*3 = 27. 27 instructions and basic math is taken care of. Add in mul, div, mod, some load/store, push/pop, call, return, and jumps, and we have a pretty complete instruction set in a small number of actual opcodes.

Then I sat down and tried to figure out how to generate code for this thing. I tend to think in stack machines. Even if you take a language like C, you can break down expressions into a Forth-like RPN notation that is very easy to compile for a stack machine. So, my problem is: given a list of operations in stack-machine order, how do you apply this to an accumulator machine? I tried to come up with different schemes, and they all involved a lot of swapping variables around.

Rolling Accumulator

I came up with this as a compromise between an accumulator machine and a general register machine. In this scheme, I allow any register to be a destination, but the second operand must be the register next to it, and it has to be in a particular winding order. If there are 4 registers, I have this:

A <- B <- C <- D <- (wraparound to A)

This makes it easy to generate stack-like code for it. A simple stack sequence like this:

push 5
push 3
add
push 2
push 1
add
xor

would turn into this:

mov a,5
mov b,3
add a,b
mov b,2
mov c,1
add b,c
xor a,b

I end up with a stack machine that has a hardware depth of '4'. The stack pointer is implicit: There is no pointer, the person or compiler creating the code keeps track of the stack depth. If more than '4' is needed, the registers can be spilled onto the real stack.

This also means, because there is no dedicated accumulator, we need 4 possible instructions per math operation:

add a,b

add b,c

add c,d

add d,a

This does reduce the number of operations that can be supported, because, remember, I can only fit 64 simple handlers in the table.

Right now I am writing an emulator in C on my PC for this type of instruction set, to see what I really need. One idea is getting rid of a register and only having 3. This aligns well with the Forth words dup, over, swap, and rot, which operate on the top 3 stack items; PICK is used for items under that. The burden is on the compiler to create reg/reg operations tracking the top 3 stack items, while all items under that are placed on the real stack and are accessible by a pick-like instruction. The advantage over the accumulator machine is that a stack-paradigm language, such as Forth or RPN, can easily generate code for this. I call it the rolling accumulator because, as you evaluate expressions, the register that represents the current 'accumulated' value rolls around the ring of adjacent registers.

That's all for now. If anyone is reading, I'm interested in hearing your thoughts on whether the rolling accumulator has any advantages.

Another thing I am pondering: what would an accumulator oriented language look like? My only complaint for an accumulator machine is generating code from a stack-language is awkward. Suppose during evaluation of an expression, you need to evaluate a sub-expression? You need to put the current result somewhere, evaluate the sub-expression in the accumulator, and then go back to what you were doing. What does that look like as a language?

Discussions

esot.eric wrote 02/06/2017 at 12:39 point

Rereading, and having recently picked up more on forth, etc. I get the rolling-accumulator, now. I just had my head in another space that first round. This is quite well-explained.

It's interesting that forth is often touted as so microcontroller-friendly, I think a recent HaD blog article even claimed it to be amongst the closest to the hardware-level of any language (even closer than C), yet no microcontrollers I've run into have e.g. math-operations which can directly act on items from the stack; they're more like you described, performing the majority of operations on registers, alone. And accesses to the stack require pushes/pops to registers.

This is an interesting endeavor. On the one hand it's almost like you're developing a new machine-level instruction-set *for* forth. On the other hand, your new instruction-set is designed around the features and limitations of (and optimizations for) the architecture you're implementing it on (AVR). A virtual-machine for a virtual-machine, maybe? Definitely a great learning-experience for us AVR-ites.

I think the answer to what an accumulator-based language would look like is "assembly" ;)


esot.eric wrote 02/02/2017 at 00:05 point

I was with yah up to the rolling-accumulator. At first I thought you meant that you'd have an e.g. "rotate registers left" instruction, which would shift B->A, C->B, D->C, and A->D, so that you could use the same math instructions (where A is always the destination). And I thought that up until "since there's no dedicated accumulator" wherein I got completely lost. I'll try to reread that with a fresh mind later-on.

The explanation for the earlier stuff makes sense. Nice in keeping it smaller than 64KB. Also, having a "next instruction is extended" instruction seems smart. And the rationale for 16-bits seems wise, as well.

Gave me some ideas regarding my now slightly-less-improbable x86 emulation while reading this. Thanks!
