Close

Few assembly language tricks I used here

A project log for Brainf*cktor

Standalone brainf*ck computer in less than 1kB of code

jaromirsukubajaromir.sukuba 01/04/2017 at 08:472 Comments

Here I want to document some of assembly language tricks I used in this project. People doing it every day for years will scoff, but the perhaps somebody may be interested. I'm going to target it for PIC18, but it should be applicable for other MCU platforms too.

Just quick vocabulary:

; denotes comment, ignored line

call is jump to subroutine with program counter (PC) saved on stack

jump is jump to subroutine with no PC saving

return is restitution of PC from stack

Prepare your variables

Sometimes you can gain some bytes of code memory if you properly arrange your variables. Some MCU platforms do have faster access for something like first 256B of RAM, or some access bank


Exploit reset values

Sometimes you don't need to explicitly define the register value, if it is switched into defined state by hardware after reset and you know it applies to your conditions.

My code needs to wipe out random values from buffers from RAM and enter them into defined state (0x00). Because my working variables and display/input buffers are located at the beginning of RAM, I can wipe it out by setting data pointer at zero and write zero/increment pointer 256 times in single loop.

Setting the data pointer takes two instruction words when using instruction LFSR, which properly loads all 12 bits of data pointer. But datasheet states the upper nibble of the data pointer is zeroed after reset

Unfortunately, reset value of lower byte is undefined, so I have to force it zero. Still, CLRF instruction (clear single register) takes one byte versus two bytes of LFSR. In subsequent code I employ LFSR to properly initialize data pointers, though.

Fallthrough

Let's have subroutine that prints ASCII character on your display

print_ascii:
    ;assume it's called with value in W register
    ;do this and that
    return

Now you may want to have subroutine that prints unpacked BCD (binary 0..9) on the display. You can do it like this

print_binary:
    addlw    '0'    ;adjust the W value into ASCII 0..9
    call    print_ascii
    return

print_ascii:
    ;assume it's called with value in W register
    ;do this and that
    return

Which would definitely work, but you can omit all the cal/return thing by having subroutine with two entry points, or having fall-through to function.

print_binary:
    addlw    '0'    ;adjust the W value into ASCII 0..9
                    ;and now fall through the ASCII print
print_ascii:
    ;assume it's called with value in W register
    ;do this and that
    return

You call the print_binary subroutine, it does its job and returns via return instruction, which "belongs" to print_ascii subroutine.

Return/return merging

Let's have subroutine which prints 3 consecutive characters from buffer, utilizing print_ascii subroutine as before. The simplest approach goes like this

print_three:
    ;prepare number 1
    call    print_ascii
    ;prepare number 2
    call    print_ascii
    ;prepare number 3
    call    print_ascii
    return

This would certainly work, but the last subroutine call is somehow pointless. First you call subroutine print_three (consumed one stack level), then you do this and that up to last print_ascii call, where you call it (consume one stack level), the function does what is expected, then it returns (decrease stack usage) just to return again (return stack to previous value). We can save one return by doing this:

print_three:
    ;prepare number 1
    call    print_ascii
    ;prepare number 2
    call    print_ascii
    ;prepare number 3
    jump    print_ascii

It works as before, but the last call (with stack increase) is replaced by plain jump. Now the function print_three will not return via "its own" return, but via return "belonging to" function print_ascii. One return spared. Also stack level spared (only at the single subroutine "call", not in the previous calls), what could be sometimes beneficial.

Multiple passes

I used this one for hexadecimal number printing - in fact it is a bit convoluted use of fall-through. I could do my print_hex function like this

print_hex:
    ;extract low nibble of number
    call    print_ascii
    ;extract high nibble of number
    call    print_ascii
    return
print_ascii:
    ;assume it's called with value in W register
    ;do this and that
    return

Instead, I did it like this

print_hex:
    ;extract low nibble of number
    call    print_ascii
    ;extract high nibble of number
    ;fall-through
print_ascii:
    ;assume it's called with value in W register
    ;do this and that
    return

It takes low nibble, calls subroutine print_ascii and returns back to extract high nibble of the number, with the number in working register it falls-through to print_ascii again and now its return instruction will set program counter back from where print_hex was called. One call and return saved.

Use relative branching if possible

Many MCU platforms do have option of relative branching, as opposed to absolute jumps. Relative branches usually take less program space, but have limited range. One can take advantage of this for local, related functions, like the display functions I mentioned before. For example, PIC18 has relative branch range +-1023, which is perfect fit for 1kB contest.

Discussions

Lucas Rangit MAGASWERAN wrote 01/17/2017 at 19:35 point

Great explanation of the tricks. Did you perform these optimizations after having everything coded? If so, how much code space and stack did it save for this project? Thanks for sharing.

  Are you sure? yes | no

jaromir.sukuba wrote 01/17/2017 at 19:39 point

I was doing the optimizations "on the run" while writing code, so I can't tell you how much space I gained. 

  Are you sure? yes | no