#13 - The Hunt for (a bug in something that is) Red (in) October

A project log for Game-o-Tron Mini

A roughly credit card sized gaming handheld

David Boucher, 10/08/2017 at 10:31, 2 comments

And with the contrived title out of the way, I'll explain.

After connecting up the screen (as above) I noticed an annoying problem: roughly a couple of times a minute I would get display corruption. It would only last for one frame, so roughly 1/40th of a second, but it was noticeable. Arrgh!

As I'd not seen this on my breadboard prototype, my first thought was that it was either a bad connection or interference from having all the wires wedged in close together. Checking for a bad connection was the simplest place to start, so with it switched on I tried shaking it and prodding the various wires (with something non-conductive, of course). Neither of these things triggered the problem, so maybe it was something else?

To check whether interference was the problem, I removed the electronics from the case so that the wiring could be spread out a little. My thinking was that if interference was the cause, this would reduce or eliminate it. However, when I powered everything up there was no change, and I still saw the occasional corrupted frame.

After checking the supply voltage and trying a few different power supplies to no effect, I decided to look at the software. The first thing that I did was lower the SPI frequency. This eliminated the problem but at a cost of reducing the frame rate. Maybe there was a better way?

Next I looked at the code that sends the display data over SPI, as this is where the bulk of the time is spent. For each line of the display, SPI.beginTransaction() is called, a line of data is sent using SPI.transfer(), and then SPI.endTransaction() is called. Would moving the SPI.beginTransaction() and SPI.endTransaction() calls outside the loop work, and would it improve performance?
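The post doesn't include its display code, so here is a rough sketch of the two loop structures it describes. To let this compile and run off-device, `MockSPI` and `MockSettings` are hypothetical stand-ins for the Teensy's `SPI` object and `SPISettings`; on real hardware the calls would be `SPI.beginTransaction(SPISettings(...))`, `SPI.transfer()` and `SPI.endTransaction()`. The frame layout (row-major bytes, fixed bytes per pixel) is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for the Arduino/Teensy SPI object, used only
// so the loop structure can be compiled and checked without hardware.
struct MockSettings {
    std::uint32_t clockHz;  // SCK frequency requested for the transaction
};

class MockSPI {
public:
    void beginTransaction(MockSettings s) {
        assert(!inTransaction);  // transactions must not nest
        inTransaction = true;
        clockHz = s.clockHz;
        ++transactions;
    }
    std::uint8_t transfer(std::uint8_t b) {
        assert(inTransaction);   // data should only go out inside a transaction
        sent.push_back(b);
        return 0;                // a real bus would return the byte clocked in
    }
    void endTransaction() {
        assert(inTransaction);
        inTransaction = false;
    }

    bool inTransaction = false;
    std::uint32_t clockHz = 0;
    int transactions = 0;
    std::vector<std::uint8_t> sent;
};

// Original structure: one transaction per display line.
void sendFramePerLine(MockSPI& spi, const std::uint8_t* frame,
                      std::size_t width, std::size_t height,
                      std::size_t bytesPerPixel, MockSettings s) {
    const std::size_t lineBytes = width * bytesPerPixel;
    for (std::size_t y = 0; y < height; ++y) {
        spi.beginTransaction(s);
        for (std::size_t i = 0; i < lineBytes; ++i) {
            spi.transfer(frame[y * lineBytes + i]);
        }
        spi.endTransaction();
    }
}

// Revised structure: a single transaction wraps the whole frame.
void sendFramePerFrame(MockSPI& spi, const std::uint8_t* frame,
                       std::size_t width, std::size_t height,
                       std::size_t bytesPerPixel, MockSettings s) {
    const std::size_t lineBytes = width * bytesPerPixel;
    spi.beginTransaction(s);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t i = 0; i < lineBytes; ++i) {
            spi.transfer(frame[y * lineBytes + i]);
        }
    }
    spi.endTransaction();
}
```

Both versions push exactly the same bytes. The only difference is how many times the bus is acquired and released per frame: `height` times in the first and once in the second.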

I tried it, and it worked: the bug did not return. However, it was not significantly faster. For some reason, I then decided to try it with the SPI frequency increased back to its original setting.

The bug did not return. I got a rock-solid display at 40 FPS. I tried increasing the frequency further, and the bug still did not return. I was able to increase the SPI frequency to its maximum value and get a whopping 80 FPS, more than enough for my purposes.
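The frame rate tracks the SPI clock because every pixel has to be clocked out serially. Here is a back-of-the-envelope upper bound. The post gives neither the screen resolution nor the exact clock, so the 160x128, 16-bit figures in the usage below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on frame rate when every pixel of every frame is streamed over SPI.
// Ignores command bytes, chip-select gaps and CPU time, so real figures are lower.
double maxFramesPerSecond(double sckHz, std::uint32_t width, std::uint32_t height,
                          std::uint32_t bitsPerPixel) {
    const double bitsPerFrame = static_cast<double>(width) * height * bitsPerPixel;
    return sckHz / bitsPerFrame;
}
```

For a hypothetical 160x128 panel at 16 bits per pixel, one frame is 327,680 bits. A 32.768 MHz clock therefore caps out at 100 FPS, and halving the clock halves that ceiling. This is why lowering the SPI frequency costs frame rate once the bus is the bottleneck.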

I can't say exactly what the problem was here, or even whether it was with the Teensy or the screen, but it seems that performing an SPI transaction for each line was causing some sort of stability problem, and doing one transaction for the entire screen fixed it. I would be interested to hear your theories in the comments.

In any case, the end result is one squashed bug and a performance improvement!

AN UPDATE AND CORRECTION (11th December 2017):

I've discovered today that there is an odd problem with this, in that it is only stable at this speed with certain colours. Bright primary and secondary colours are fine, but other colours occasionally get corrupted part way down the screen.

I first noticed this on a test screen with orange text on a black background. Sometimes the bottom part of the screen would render as red text instead of orange. As far as I can tell, a single bit is being lost somewhere, changing the colour of each subsequent pixel. Depending on the original colour, this may or may not be noticeable.
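Here is a small simulation of why one lost bit would turn orange into red but leave bright primaries looking almost unchanged. It assumes the panel takes 16-bit RGB565 pixels most significant bit first, which the post doesn't confirm. It also uses 0xFD20 (R=31, G=41, B=0) as an illustrative orange. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: pixels are 16-bit RGB565, sent most significant bit first.
// If the display misses one clock edge at the start of word `fromIndex`, every
// later word it latches is the low 15 bits of the intended word followed by
// the top bit of the next word, i.e. the rest of the stream is shifted left by one.
std::vector<std::uint16_t> dropOneBit(const std::vector<std::uint16_t>& words,
                                      std::size_t fromIndex) {
    std::vector<std::uint16_t> out(words);
    for (std::size_t i = fromIndex; i < words.size(); ++i) {
        const std::uint16_t nextMsb =
            (i + 1 < words.size()) ? static_cast<std::uint16_t>(words[i + 1] >> 15) : 0;
        out[i] = static_cast<std::uint16_t>((words[i] << 1) | nextMsb);
    }
    return out;
}

struct Rgb565 {
    unsigned r, g, b;  // r and b are 0-31, g is 0-63
};

Rgb565 unpack(std::uint16_t c) {
    return {(c >> 11) & 0x1Fu, (c >> 5) & 0x3Fu, c & 0x1Fu};
}
```

With the one-bit slip, orange 0xFD20 becomes 0xFA40. Red stays at 31 but green falls from 41 to 18, so the colour reads as red. Pure red 0xF800 becomes 0xF000 (red 30, still bright red), which fits the observation that bright primaries survive the fault.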

The only way I've found to fix this is to lower the SPI frequency. This has brought the frame rate of my current test program (which is doing a bit more than the one I was using in the original post) down from 63 FPS to 36 FPS. :(


deʃhipu wrote 10/09/2017 at 10:55

Hmm, since the bug only appeared sometimes, and went away when you changed time-related parameters (speed, duration of the transaction), I would look for possible race conditions. Are you using DMA with that SPI? Perhaps when you were sending a single line, the code would finish before the DMA transfer had finished, and you had already started filling a buffer that was still being sent?


David Boucher wrote 10/09/2017 at 17:04

I'm not using DMA and the calls are synchronous, but a race condition is a possibility.

I've checked the docs again, and it looks like I should de-assert chip select along with each endTransaction() call, whereas the original version of the code only did it at the end of a frame. That's another possible cause, but I've not had time to investigate further yet.
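Here is a sketch of the per-line variant described in the reply, where chip select is asserted and released inside every transaction. It follows the usual ordering in Arduino/Teensy SPI transaction examples: begin the transaction, pull CS low, transfer, pull CS high, end the transaction. All function names here are hypothetical stand-ins that just log events. On the Teensy they would correspond to `SPI.beginTransaction()`, `digitalWrite(csPin, LOW)`, `SPI.transfer()`, `digitalWrite(csPin, HIGH)` and `SPI.endTransaction()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical event log standing in for the SPI bus and chip-select pin.
struct BusLog {
    std::string events;  // B=begin, L=CS low, x=byte, H=CS high, E=end
};

void beginTx(BusLog& log) { log.events += 'B'; }
void csLow(BusLog& log) { log.events += 'L'; }   // select the display
void transferByte(BusLog& log, std::uint8_t) { log.events += 'x'; }
void csHigh(BusLog& log) { log.events += 'H'; }  // deselect the display
void endTx(BusLog& log) { log.events += 'E'; }

// One transaction per line, with chip select asserted only while it is open.
void sendFrameCsPerLine(BusLog& log, const std::uint8_t* frame,
                        std::size_t lineBytes, std::size_t lines) {
    for (std::size_t y = 0; y < lines; ++y) {
        beginTx(log);  // take the bus first...
        csLow(log);    // ...then select the device
        for (std::size_t i = 0; i < lineBytes; ++i) {
            transferByte(log, frame[y * lineBytes + i]);
        }
        csHigh(log);   // deselect before releasing the bus
        endTx(log);
    }
}
```

One plausible benefit, assuming the display controller resets its serial bit counter whenever chip select goes high (many SPI display controllers do), is that a dropped bit would then corrupt at most one line rather than the rest of the frame. The post does not test this.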
