
Optimizing Display Code

A project log for AND!XOR DC25 Badge

We're going bigger, better, more Bender.

zapp • 12/30/2016 at 18:00

Over the past few months I have been working on custom drivers for various peripherals on our badge. Most of this is to learn how to build the drivers and understand the underlying hardware; it would have been an impossible task for me just a year ago, before working on our DC24 badge. At this point I've come up with a custom display driver that I believe is as close to optimized as I can get it. Along the way, these are the factors that need to be considered:

  1. Maximize bulk transfer speed - select the highest possible SPI clock rate the MCU and display will safely operate at
  2. Minimize latency between bulk transfers - utilize DMA to asynchronously transfer chunks while preparing the next chunk for transfer
  3. Minimize data to be transferred - reduce color depth, pixels, and commands to be sent

During prototyping we've been limited to 8 MHz on the SPI bus. At 320x240 with 16-bit color, a single frame requires 1.2 Mbits! Over SPI, the render can take seconds. Nowhere close to what we need.
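To put rough, best-case numbers on that: 320 x 240 pixels x 16 bits is 1,228,800 bits per frame, so even a perfectly gap-free 8 MHz SPI bus needs about 154 msec per frame, or roughly 6.5 FPS. Add per-pixel addressing overhead and blocking transfers and the real figure gets far worse, which is exactly what the traces below show.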

One of my favorite tools as of late is my Saleae 8-channel logic analyzer. It's nice and compact enough to travel with me for work, too.

For displays like the ILI9341 (pictured), the typical command sequence is to set a window (the 0x2A and 0x2B commands) with the X and Y dimensions, followed by raw pixel data. Below is a trace of the window command: it takes 69 usec to transfer 10 bytes (0x2A, start x, end x, 0x2B, start y, end y), since the coordinates are 16-bit integers. Also notice the delay between transfers while the MCU is executing other commands.

To draw a single pixel, the address command above plus a 16-bit color must be sent to the display - 12 bytes total! That's a lot of overhead for one pixel.
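For illustration, here is roughly what that sequence looks like in code. This is a minimal sketch, not our actual driver; spi_write_cmd() and spi_write_data() are hypothetical blocking helpers that toggle the D/C line and clock bytes out over SPI.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical low-level helpers (assumed, not our real driver API). */
extern void spi_write_cmd(uint8_t cmd);
extern void spi_write_data(const uint8_t *data, size_t len);

/* Set the drawing window (0x2A = column address, 0x2B = page address). */
void lcd_set_window(uint16_t x0, uint16_t x1, uint16_t y0, uint16_t y1)
{
    uint8_t xbuf[4] = { (uint8_t)(x0 >> 8), (uint8_t)x0, (uint8_t)(x1 >> 8), (uint8_t)x1 };
    uint8_t ybuf[4] = { (uint8_t)(y0 >> 8), (uint8_t)y0, (uint8_t)(y1 >> 8), (uint8_t)y1 };

    spi_write_cmd(0x2A);        /* column address set */
    spi_write_data(xbuf, 4);
    spi_write_cmd(0x2B);        /* page (row) address set */
    spi_write_data(ybuf, 4);
}

/* Draw a single pixel: 10 bytes of addressing for only 2 bytes of color. */
void lcd_draw_pixel(uint16_t x, uint16_t y, uint16_t color565)
{
    uint8_t c[2] = { (uint8_t)(color565 >> 8), (uint8_t)color565 };

    lcd_set_window(x, x, y, y);
    spi_write_cmd(0x2C);        /* memory write */
    spi_write_data(c, 2);
}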

Here's what drawing Bender on a 128x128 display at 16 bits looks like. In this case the window is set to 0, 128, 0, 128 and then 16-bit colors are streamed to the display.

Looking at the clock line, notice there are gaps that we should be able to optimize out. Rendering appears to take 682 msec. Zooming way in, we see there is a lot of dead space between transfers: toggling the CS line seems to be wasting time, and data is being transferred only two bytes at a time because it's being converted from a 24-bit BMP pixel by pixel.

So How Do We Improve This?

An easy way is to remove the BMP overhead by pre-processing the images. The display is in 16-bit 565 mode, so by converting the image to 565 big-endian raw data with ffmpeg we can stream the pixels in the display's native format. The following command converts an image to 565BE raw format:

ffmpeg -i RED.BMP -f rawvideo -s 128x128 -pix_fmt rgb565be RED.raw
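For a 128x128 image the resulting RED.raw is exactly 128 x 128 x 2 = 32,768 bytes of pixel data, which can be copied straight from flash to the display with no per-pixel conversion.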

In addition, we can remove some of the SPI blocking to reduce delays. The image below shows the flash memory and LCD transfers occurring asynchronously.

The transfer in green is the LCD, and the one in red is the flash memory being set up to read the next chunk of the image. Notice the pixel data has fewer gaps: we are simply reading the 565BE data from flash and streaming it to the LCD. At this point we can draw 10.5 FPS.

The flash setup time is still wasting quite a bit, roughly 800 usec, and we're paying this penalty many times per frame. To reduce the number of times it occurs we can make a classic tradeoff between memory and performance: increase the chunk size from 2 rows to 32 rows per transfer. This uses about 8 KB of memory and gives us a nice speed boost to 11.7 FPS.
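As a sketch (again with hypothetical flash_read() and lcd_write_pixels() helpers, and assuming a 128-pixel-wide image), the chunked streaming loop at this stage looks something like this:

#include <stdint.h>
#include <stddef.h>

#define LCD_WIDTH       128
#define ROWS_PER_CHUNK  32
#define CHUNK_BYTES     (LCD_WIDTH * ROWS_PER_CHUNK * 2)   /* 8192 bytes of 565BE pixels */

/* Hypothetical helpers assumed to exist in the driver layer. */
extern void flash_read(uint32_t addr, uint8_t *buf, size_t len);   /* blocking read, ~800 usec of setup */
extern void lcd_write_pixels(const uint8_t *buf, size_t len);      /* stream pixels after a 0x2C command */

static uint8_t chunk[CHUNK_BYTES];

/* Stream one 16-bit frame from flash to the display, 32 rows at a time. */
void lcd_draw_frame_from_flash(uint32_t frame_addr, uint16_t height)
{
    for (uint16_t row = 0; row < height; row += ROWS_PER_CHUNK) {
        uint32_t offset = (uint32_t)row * LCD_WIDTH * 2;
        flash_read(frame_addr + offset, chunk, CHUNK_BYTES);
        lcd_write_pixels(chunk, CHUNK_BYTES);
    }
}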

The transfers are basically blocking each other at this point. Again, in the image above the LCD is on top and the flash is on the bottom.

Last Step

Now that large chunks of pixel data can be read and streamed to the LCD very quickly, the last step is to send pixel data to the LCD while the flash is reading the next chunk. I can only transfer 254 bytes at a time, but I can do so asynchronously. The trick is to set up the next 254 bytes of the chunk soon after the previous one completes, without interrupting the flash transfer. This lets the flash be read with very little disruption, setting up the next 32 rows of image data while the last 32 are being sent.
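Conceptually, the final pipeline ping-pongs between two chunk buffers: while the LCD DMA drains one 32-row buffer (in 254-byte pieces), the flash fills the other. A rough sketch follows, using the same defines as the previous sketch and hypothetical async helpers; the real interrupt-driven driver is messier than this.

#include <stdint.h>
#include <stddef.h>

#define LCD_WIDTH       128
#define ROWS_PER_CHUNK  32
#define CHUNK_BYTES     (LCD_WIDTH * ROWS_PER_CHUNK * 2)

/* Hypothetical non-blocking helpers (assumptions, not our real API):
 * flash_read_async() kicks off a flash read, lcd_write_pixels_async()
 * feeds the LCD via DMA in <=254-byte pieces, and the _wait_done()
 * calls block until the corresponding transfer completes. */
extern void flash_read_async(uint32_t addr, uint8_t *buf, size_t len);
extern void flash_wait_done(void);
extern void lcd_write_pixels_async(const uint8_t *buf, size_t len);
extern void lcd_wait_done(void);

static uint8_t chunk_buf[2][CHUNK_BYTES];

void lcd_draw_frame_pipelined(uint32_t frame_addr, uint16_t height)
{
    uint8_t active = 0;

    /* The LCD can't start until the first chunk is in RAM. */
    flash_read_async(frame_addr, chunk_buf[active], CHUNK_BYTES);
    flash_wait_done();

    for (uint16_t row = 0; row < height; row += ROWS_PER_CHUNK) {
        uint16_t next_row = row + ROWS_PER_CHUNK;

        /* Start reading the *next* chunk from flash... */
        if (next_row < height) {
            uint32_t offset = (uint32_t)next_row * LCD_WIDTH * 2;
            flash_read_async(frame_addr + offset, chunk_buf[!active], CHUNK_BYTES);
        }

        /* ...while the current chunk is streamed to the LCD. */
        lcd_write_pixels_async(chunk_buf[active], CHUNK_BYTES);
        lcd_wait_done();

        if (next_row < height) {
            flash_wait_done();
        }
        active = !active;
    }
}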

In the image above, the 32-row chunks are clearly defined. The image is 128 pixels tall, so four transfers of this size are required to draw a full frame. The flash is a bit slower than the LCD and ends up holding the LCD back slightly. The green box highlights a full frame. Note that the LCD does not start until the first chunk of data has been read from flash, but the flash rarely stops.

At this point we're able to draw 17.5 FPS, excelsior!

Putting it all together

What I learned through this whole process is to let the hardware do what it does best: give it very large chunks of data to transfer and let it do so in the background. Even with a 64 MHz MCU, my code will slow down the transfer greatly. Once data can be streamed in large chunks, do so in parallel with other operations to maximize efficiency. At the beginning of this process, even with DMA enabled, it took me 685 msec to draw a single frame from flash; I am now down to 57 msec. I can go further, with -O3 gcc optimization it's down to 52 msec, but that introduces other bugs I need to fix. So for now I'll stick with -O0.
