1024: a space-saving odyssey

A project log for kiloboot

1kB TFTP Ethernet bootloader for ATmega328P and ENC28J60

mitxelamitxela 01/02/2017 at 16:230 Comments

The writeup for this project is very long and detailed, and doesn't really fit the log format. I started out just wanting to learn about networking (it's fun!) and diverted into the bootloader project after a bit. So if you want to follow the journey starting from the very basics of internetworking, see the page on

The relevant part for this, as an entry to the 1kB contest, is the optimization. It's all written in AVR assembly, but even so I went to great lengths to squeeze everything in to such a small space. I'm used to optimizing for speed, but optimizing for prog mem requires a different mindset.

Just as a comparison, the smallest TFTP bootloader for this platform I could find was 8kB. There are other, more expensive ethernet boards that implement a lot more of the IP stack and can make things a lot easier, but the ENC28J60 is only the PHY and some help with the data-link layer (it has a hardware CRC calculator). I deliberately chose this chip because it's cheap, popular, and my original goal was to learn about networking, not have everything done for me. There was an interesting bootloader by kehribar that uses UDP broadcast messages, which greatly simplifies the process, you don't even need to give the device an IP. But that will only work on the local network, using your proprietary protocol. Using TFTP, the file can be anywhere on the internet, maybe the other side of the planet, which may or may not be important, but it has a certain charm to it.

Many of the optimizations are quite general, even obvious. Others, not so. But I made sure that I wasn't optimizing to the point of obfuscation, and I hope the result is easy enough to understand (if you're familiar with assembly, anyway).

Some highlights:

The ethernet chip is controlled by writing to its registers over SPI. Many of these registers are 16-bit, so have a "low" and a "high" byte that need accessing in separate operations. All of these 16-bit registers are contiguous in memory (the high byte's address is exactly one more than the low byte) so we can write an automated subroutine for them.

In assembly, the concept of a subroutine is far more relaxed than in something like C. The Program Counter is the address in memory the processor is reading from – our "current location" in the code. All that the call function does, is push the current Program Counter onto the stack, and jump to a new address. All that the ret (return) function does, is pop two bytes from the stack into the Program Counter. When you understand that, you can be a lot more relaxed about what's going on. For instance, to do a read from SPI, we need to output a zero on the MOSI line. Rather than load zero right before every function call, I made a single command directly before the doSPI routine, which loaded zero, and labelled it doSPIzero.

  clr r16

  out SPDR,r16
  in r16, SPSR
  sbrs r16, SPIF
  rjmp SPIwait
  in r16, SPDR

Most of the time we'll call doSPI, having set the data to be output, but if we want it to output zero, we call doSPIzero and the program counter falls through into the doSPI function, since there isn't a ret. This type of overlapping routine optimization is quite easy and saves a lot of program space. So, for the enc28j60writeWord routine, we do something a little more complex:

; r16 = op | address (L register)
; r17 = dataL
; r18 = dataH
  push r16
  rcall enc28j60write
  pop r16
  inc r16
  mov r17, r18

; r16 = op | address
; r17 = data
  cbi PORTB,PB2
  rcall doSPI
  mov r16,r17
  rcall doSPI
  sbi PORTB,PB2

enc28j60write can be called to set one byte to a register. For enc28j60writeWord, we want to call enc28j60write twice, with different arguments. So the first time, we call it within the enc28j60writeWord routine, which does the deed then the ret sends it back to enc28j60writeWord. We switch over the arguments, and then this time, fall through into the enc28j60write routine, so that the ret pops the other address from the stack, which is the address which originally called enc28j60writeWord. You'll notice that while we're doing this, there's no problem with using the stack for other data, too – we can happily push the arguments onto the stack in-between the function calls, just as long as we pop them in the right order, and the number of calls that take place is the same as the number of rets. A similar optimization that I often like to do, is whenever a subroutine ends with a call to another subroutine, followed by a ret, we can change it into a single jump to the other subroutine.

Wherever possible, instead of constructing packets from scratch, it's much more compact to edit the received packet to transform it into the packet we want to send. This works very well for ARP and ICMP echo (ping). It also works well for acknowledging the TFTP data packets.

I pulled a number of dirty tricks here, things like re-using the IP identifier. This is supposed to be unique and generated for every IP message. On condition that the server is behaving properly (not pulling the same trick), there shouldn't be any problem with this. The packet is identified by the ID and the addresses combined.

By making such small changes, we don't need to recalculate the checksum. Instead, we can just add or subtract an offset to it, which is usually small.

But in some situations, we have to generate a packet from scratch. This means after all that optimization, we have to write a checksum routine after all – or does it? Although this changed after I made use of the EEPROM for the IP addresses, originally all of the addresses were hard coded, meaning that 90% of the packet is known in advance, and so, we can calculate the checksum using the C-style preprocessor.

#define IPmsg1ChecksumA (($45+$e6+$40+$40)<<8) + $97 + $11 + low(37+strlen(FILENAME)) + (myIP0<<8)+myIP1 + (myIP2<<8)+ myIP3 + (svIP0<<8)+svIP1 + (svIP2<<8)+ svIP3
#define IPmsg1ChecksumB (lwrd(IPmsg1ChecksumA) + byte3(IPmsg1ChecksumA))
#define IPmsg1Checksum ~(lwrd(IPmsg1ChecksumB) + byte3(IPmsg1ChecksumB))
.db high(IPmsg1Checksum), low(IPmsg1Checksum)

The first part does the shifting and the byte-grabbing to work out the word sum (since I'm using the length of the filename in an adiw instruction in the code, we already know it'll be limited to <63 bytes), then the second adds the third byte, equal to the carry, onto the main value to give us a wrapped 16-bit sum. We do that twice, just in case the sum was 0x1FFFF. The tilde does a logical not, the one's complement. And the .db writes the two bytes we care about into prog mem. An IPv4 header checksum calculation in two bytes!

By the end of the project, I relaxed a little since it was obvious I was on target. But at one point I was quite worried, and really squeezing every last byte. A good example is when it came to writing to the flash memory, which must be done a page at a time. The ATmega328p's pages are 128 bytes. Inconvenient! The data comes in in blocks of 512, so we have to call the write routine four times, which is the worst possible amount.

If it were two, we could just call it twice. If it were eight, we could set up a loop to iterate. But the usual method of iterating is ldi, call, dec, brne. So iterating over four is the same amount of prog mem used as calling four times. However, I realized that we can do something clever. If we precede the definition of our subroutine with a call to that subroutine, and call that call instead, it will run the routine twice – first as it's called within the routine, and second as it falls through. So... for every call instruction we precede it with, it will double the number of times the routine is run. Like this:

    rcall SPMtwopages
    rcall SPMwritePage
  ; ... code goes here ...

There we go, calling it four times, in just three instructions – saving us an entire two bytes. Yipee!

For most of the development I had a UART output enabled to check what was going on. When it came to removing this, I found that some parts stopped working. It turns out that some aspects of the ENC28J60 require a delay after being configured, and the debug code happened to be providing that delay.

To write a dedicated delay function is pretty much the extreme of the type of optimization we're doing here – we want to burn as many cycles as possible in the fewest instructions.

Luckily I'd already assigned an unused register as a "zero" register, which should always contain zero. This makes some of the non-immediate instructions a bit easier to use, like add-with-carry. Now we can make use of it in another way:

  dec zeroReg
  brne wait

This wait function burns 768 cycles in just two instructions. The zeroReg is used as a counter, the first dec sets it to 255, and after the last brne, it will be back at zero. As long as we leave it zero at the end of every routine, we can always depend on it being zero at the start. If we wanted to lengthen this, a clever way of padding it is to add an rjmp to its own address plus one. This forms a single instruction, two-cycle nop.

If you like this sort of thing, I've written a lot more on the main project page, or you could jump right into the source code, which has too many optimizations to list. If there are any bits people would like me to elaborate on, let me know.