Entry 25: ONCE MORE INTO THE ABYSS (of displays)

A project log for Aiie! - an embedded Apple //e emulator

A Teensy 4.1 running as an Apple //e

Jorj Bauer 01/22/2022 at 15:04

Alternate title: "BRING ON THE HACKS"... strap in, it's gonna be a ride.

When we started brainstorming about Aiie! v10, one of the first questions asked was, "why can't we just use an 800x600 display and show the whole display instead of hacking around it on a 320x240 display?"

What a lovely, simple, innocuous question. And oh boy what a road it's been.

Let's start with the physical... what 800x600-like displays exist for embedded systems? There aren't many, and they tend to be fairly pricey. There are NTSC, VGA, and HDMI panels (obviously requiring those kinds of output from your project, which I don't have; I've toyed with NTSC and VGA so either of those would be feasible). If I'm looking for SPI, though - there is basically just the RA8875 chip which supports 800x600 and 800x480, which are both good resolutions for Aiie v10 as discussed in my last log entry.

I'd like this to be as cheap as possible, though. Which means understanding the parts really well and ultimately deciding if I'm using someone else's carrier board or making my own. The 800x480 40-pin displays are cheaper than the 800x600 displays. And has them for under $18.

So some digging later, I'd bought a 4.3" 800x480 display from Yes, it's cheap... but doesn't include the RA8875 driver. Pair it with the $40 Adafruit RA8875 driver board and we should be good to go. Can I make this display work in any reasonable way?

Well, maybe. Let's look at the software side. There is an RA8875 driver for the Teensy, that's good! But it doesn't support DMA transfers to the SPI bus. That's bad.

What exactly does that mean? It means excruciatingly slow screen draws. Like, 1 frame every 6 seconds if we're drawing one pixel at a time. This is the actual RA8875 drawing that way...

That may be terrible, but it gets worse.

The first version of Aiie used a similar direct-draw model and was always fighting for enough (Teensy) CPU time to emulate the (Apple) CPU in real-time, because it was spending so much time sending data to the display. Not to mention the real-time requirements of Apple 1-bit sound. It was a bad enough set of conflicts that I added a configuration option at some point to either prioritize the audio or the display; you could have good audio or good video but not both. And if you picked video, then the CPU was running in bursts significantly faster than the normal CPU followed by a total pause while the display updated.

All of that was solved when I converted to a DMA-driven SPI display. The background DMA transfers don't interfere with the CPU or sound infrastructure at all. So I definitely, absolutely, completely want to use eDMA-to-SPI transfers to avoid this bucket of grossness.

So let's start somewhere real... let's change teensy-display.cpp so it will be able to drive this thing, and see how it goes. Here's one line that's a great starting place in teensy-display.cpp:


That's the memory buffer where the display data is stored. TEENSYDISPLAY_WIDTH and TEENSYDISPLAY_HEIGHT are 320 and 240, respectively - matching the 16-bit display v9 uses. The DMA driver for the ILI9341 automagically picks up changes there and squirts them over SPI to the display at 40-ish frames per second.

What happens when we change TEENSYDISPLAY_WIDTH and TEENSYDISPLAY_HEIGHT to be 800x480 instead of 320x240? We get our first disappointment, that's what!

arm-none-eabi/bin/ld: region `RAM' overflowed by 450688 bytes

Simply put, there just isn't enough memory on the Teensy to be able to hold the display data. Let's dive in to that a bit.

The Apple //e has 128k of RAM. The Teensy 4.1 has 1MB of RAM. It would seem, on the face of it, that there should be enough RAM for an 800x480x16-bit display buffer - that's 750k. Yes, there's more overhead in the rest of Aiie... but a rough calculation says that for us to be over by 450-ish kilobytes per that error message, we'd have to be using the (1 meg minus 750k for the driver equals) 250k remaining RAM, plus the 450k that we're overflowing. Subtract the 128k of the Apple's actual RAM, and ... 572k seems like a lot of overhead to run this emulator. If that's true maybe there are optimizations that could be made elsewhere. But it turns out that's not actually true, for an architectural reason on the Teensy hardware itself.

The problem is that there are three banks of RAM on the Teensy 4.1. Borrowing some of the documentation from PJRC's Teensy 4.1 page:

Ignoring FLASH for the moment, we've got three banks of RAM - RAM1, RAM2, and PSRAM.

RAM1 is "tightly coupled" to the processor and is the fastest. RAM2 has a 32k cache around it, and PSRAM is the slowest. (FLASH isn't RAM, but it is possible to put static content and code in it.)

The DMAMEM adornment on my code means that the array is in RAM2. Since it's only 512K, there isn't enough space for the 750k array at all.
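
The arithmetic behind that can be written down as a small sketch. The function and constant names here are made up for illustration, and the 512K figure is just the RAM2 size quoted above; none of this comes from the Aiie source.

```cpp
#include <cassert>
#include <cstddef>

// RAM2 size on the Teensy 4.1, as quoted in the text (512K).
const std::size_t kRam2Bytes = 512 * 1024;

// Bytes needed for a width x height framebuffer at the given bytes per pixel.
inline std::size_t framebufferBytes(std::size_t width, std::size_t height,
                                    std::size_t bytesPerPixel) {
  assert(bytesPerPixel > 0);
  return width * height * bytesPerPixel;
}

// How far a buffer overshoots a memory bank, or 0 if it fits.
inline std::size_t overflowBytes(std::size_t bufferBytes, std::size_t bankBytes) {
  return bufferBytes > bankBytes ? bufferBytes - bankBytes : 0;
}
```

With these, framebufferBytes(320, 240, 2) is 153600 bytes (150k) for the v9 buffer, while framebufferBytes(800, 480, 2) is 768000 bytes (750k), which overshoots RAM2 by 243712 bytes before anything else is placed there.
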
Both RAM1 and RAM2 are part of the stock Teensy 4.1 board. PSRAM is either one or two add-on 8 megabyte chips; Aiie v9 requires one. I've been adding both just for future flexibility, which means not adding a FLASH chip (you can either have one of each, or two PSRAM chips).

We could, therefore, move this display RAM up to PSRAM and it would fit. But there's a reason it's called DMAMEM; it's the preferred place for DMA transfers to happen. They can come out of RAM1, which is faster; but they get no benefit from doing so, which means RAM1 is better suited to local variables that are being used and disposed of all the time. And while it will fit in PSRAM, that memory is much slower than RAM2; enough that it will affect eDMA's ability to drive the display. In practice I find that it's roughly 10% the speed of using RAM2. So RAM2 is highly preferred, but not required. If we could get the display up to, say, 300 frames per second in RAM2 then we might get 30 fps when using PSRAM and that would be sufficient.
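
As a rough sketch of how those placement choices look in code: on the Teensy 4.1 core, an unadorned global lands in RAM1, DMAMEM puts it in RAM2, and EXTMEM puts it in PSRAM. The buffer names and the accessor below are hypothetical, and the empty macro fallbacks exist only so the snippet also compiles off-target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// On a Teensy build these macros come from the core. The empty fallbacks
// are an assumption so this sketch also compiles on a desktop compiler.
#ifndef DMAMEM
#define DMAMEM
#endif
#ifndef EXTMEM
#define EXTMEM
#endif

const std::size_t kWidth = 320;   // v9 panel width
const std::size_t kHeight = 240;  // v9 panel height

// Hypothetical buffers, one per bank, for illustration only.
uint16_t scratchLine[kWidth];                  // no adornment: RAM1
DMAMEM uint16_t frameBuffer[kHeight][kWidth];  // RAM2: the usual home for DMA buffers
EXTMEM uint8_t bulkStorage[64 * 1024];         // PSRAM: roomy, but much slower

// Bounds-checked access to one pixel of the RAM2 buffer.
inline uint16_t &pixelAt(std::size_t x, std::size_t y) {
  assert(x < kWidth && y < kHeight);
  return frameBuffer[y][x];
}
```

Moving the display buffer between banks is then a one-word change on the declaration, which is what makes it cheap to try PSRAM and measure how much slower it is.
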

But the fastest I've driven the ILI9341 is somewhere around 60fps. And 6fps would just not cut it. So this really needs to fit in RAM1 or RAM2 to be viable.

How do we jam 800 x 480 x 2 bytes -- 750k of data -- into a single 512k bank?

We don't. That's not possible. But maybe we can alter our expectations of what's being drawn, or how it's being drawn, to keep a smaller buffer footprint.

Generally display driver chips have some idea of windowing. They don't mean having a window you can drag around the screen; they mean that there's an area of the screen to which all drawing is confined. It's a way of restricting what part of the screen is being updated at any given time, which reduces the number of bytes you've got to transfer. What if we restrict it to just the Apple display area of 560x384? Well, 560 x 384 x 2 = 430080, which is 420k. It fits! But that buffer is almost the whole bank by itself, so to take up that much space for DMAMEM we'd have to get all of the object constructors and variables out of RAM2 to make room. What's in there now is about 182k total, most of which is the Apple's 128k of memory, so there's about 54k of overhead that might be optimizable. Okay, it's possible. But not elegant.
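
Here is a minimal sketch of that windowing arithmetic. The Window struct and both helpers are hypothetical, and centering the Apple area on the panel is an assumption made for the example, not something taken from Aiie.

```cpp
#include <cassert>
#include <cstddef>

// A drawing window: the rectangle the controller confines updates to.
struct Window {
  std::size_t x, y, width, height;
};

// Center a content area on the panel. Centering is an assumption here.
inline Window centeredWindow(std::size_t panelW, std::size_t panelH,
                             std::size_t contentW, std::size_t contentH) {
  assert(contentW <= panelW && contentH <= panelH);
  return Window{(panelW - contentW) / 2, (panelH - contentH) / 2,
                contentW, contentH};
}

// Bytes that must be buffered (and transferred) to fill the window once.
inline std::size_t windowBytes(const Window &w, std::size_t bytesPerPixel) {
  return w.width * w.height * bytesPerPixel;
}
```

For a 560x384 window on an 800x480 panel this gives an origin of (120, 48) and 430080 bytes at 2 bytes per pixel, leaving 94208 bytes (92k) of the 512k bank for everything else.
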

If we've got a DMA transfer sending just the Apple display window data to the display, it means we can't update anything outside that rectangle without pausing and restarting the Apple's display. So any time we want to update the disk access lights, or the battery indicator, or dump a debugging string to the screen (like that "37 FPS" in the previous log entry's screen shots), we'd have to shut down the Apple DMA display; change the window; redraw the drive indicator or whatever; reset the window; and restart the DMA display. I suspect all of that will badly mess with the CPU timing, since those events would have to happen inline with the CPU process rather than asynchronously from the DMA process, but it's possible to do something clever and complex around caching and multithreading. Doesn't seem like a showstopper, but the work involved in getting all the other stuff out of RAM2 feels like a problem. I have no idea how to get C++ objects out of that space, for example. So I'd prefer another solution if one can be found. But let's keep thinking this through for a minute, and we'll see that there's a confluence coming up.

560 x 384 x 2 bytes of data being transferred via SPI will be 560 x 384 x 2 x 8 bits of SPI data. The SPI clock rate in Aiie v9, with an ILI9341, is 50MHz. A good approximation of our frame rate with this increased resolution would be

50 MHz / (560 pixels wide * 384 pixels tall * 2 bytes per pixel * 8 bits per byte) = 14.5 frames per second

That's not a complete show stopper but it's definitely troubling. But reading Adafruit support's reply in this thread about bitmap drawing speed on the RA8875 board, they say the maximum SPI bus speed is the same as the clock speed on the RA8875, which is 20MHz. Which means  

20 MHz / (560 pixels wide * 384 pixels tall * 2 bytes per pixel * 8 bits per byte) = 5.8 frames per second
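
The same estimate as a small helper, for plugging in other clocks and sizes. This is only the theoretical ceiling: it assumes every SPI clock cycle carries one bit of pixel data and ignores command overhead and gaps between transfers. The function name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on frames per second: SPI clock divided by bits per frame.
inline double framesPerSecond(double spiClockHz, uint32_t width,
                              uint32_t height, uint32_t bitsPerPixel) {
  assert(width > 0 && height > 0 && bitsPerPixel > 0);
  const double bitsPerFrame =
      static_cast<double>(width) * height * bitsPerPixel;
  return spiClockHz / bitsPerFrame;
}
```

For the 560 x 384 window at 16 bits per pixel, framesPerSecond(50e6, 560, 384, 16) comes out a little over 14.5, and framesPerSecond(20e6, 560, 384, 16) a little over 5.8.
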

Uh oh. 5.8 fps is really bad. The only two ways to deal with that are to up the SPI clock - and in practice, I find that it's possible to go past stated maximums to some extent, so maybe there's room here to go faster - or to reduce the amount of data being transferred.

Can we reduce the data we need to send?

One way to do that would be to define dirty areas of the screen that need updating. I'd written code to do this back in the original Aiie because of its terrible display update speed, and it would be possible to get that all working again. But it really complicates the DMA code. Instead of one free-running DMA transfer that's constantly sending the whole screen, giving us a constant update rate, we'd need to trigger DMA transfers of the dirty area from the main code loop. That would typically give us better performance with a fluctuating FPS rate. While there's nothing wrong with that, it doesn't really help us with games that are updating the whole screen; we'll be reduced back to the lowest rate. And potentially worse, because there may be an impact on the real-time CPU scheduling while we set up each DMA transfer and talk to the SPI bus directly. This doesn't actually solve the problem, it just makes it less obvious that it's still broken under some circumstances.
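
For reference, the bookkeeping side of dirty-area tracking is small. The sketch below is a generic bounding-box tracker, not the code from the original Aiie; the class and method names are hypothetical. The hard part described above, triggering a DMA transfer per dirty area from the main loop, is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Tracks the smallest rectangle covering every pixel touched since clear().
class DirtyRect {
 public:
  // Grow the rectangle to include pixel (x, y).
  void mark(uint16_t x, uint16_t y) {
    if (empty_) {
      left_ = right_ = x;
      top_ = bottom_ = y;
      empty_ = false;
      return;
    }
    left_ = std::min(left_, x);
    right_ = std::max(right_, x);
    top_ = std::min(top_, y);
    bottom_ = std::max(bottom_, y);
  }

  void clear() { empty_ = true; }
  bool empty() const { return empty_; }

  uint32_t width() const {
    return empty_ ? 0u : static_cast<uint32_t>(right_ - left_) + 1u;
  }
  uint32_t height() const {
    return empty_ ? 0u : static_cast<uint32_t>(bottom_ - top_) + 1u;
  }

  // Bytes that would need to be sent to refresh just the dirty area.
  uint32_t bytes(uint32_t bytesPerPixel) const {
    assert(bytesPerPixel > 0);
    return width() * height() * bytesPerPixel;
  }

 private:
  bool empty_ = true;
  uint16_t left_ = 0, right_ = 0, top_ = 0, bottom_ = 0;
};
```

A single changed text cell keeps the transfer tiny, but a game that redraws everything marks opposite corners of the screen and the rectangle grows back to the full frame, which is exactly the worst case described above.
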

Okay, I've been saying "560 x 384 x 2" and have been deliberately ignoring the "x2" piece of that. Yes, the display is 16 bits per pixel, which is where that comes from. But it doesn't have to be. The RA8875 (and most other 16-bit display driver chips) lets us drop it to 8 bits per pixel. This literally cuts in half the memory requirement, which will double the frame throughput. The tradeoff, of course, is that it's harder to accurately represent any individual color; instead of 5 bits of data to represent the red component, you're dropped to 3 bits. 6 bits of green drop to 3 bits also. And 5 bits of blue drops to only 2 bits. But let's ignore that for the moment, too... we're still experimenting.
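
To make the color tradeoff concrete, here is a sketch of squeezing a 16-bit 5-6-5 pixel into 8 bits by keeping only the top bits of each channel. It assumes the packed byte is laid out as RRRGGGBB, and the function names are made up for illustration; this is not necessarily how Aiie does its conversion.

```cpp
#include <cassert>
#include <cstdint>

// Convert RGB565 to 8-bit RGB332 by truncating each channel.
// Assumes an RRRGGGBB byte layout for the 8bpp mode.
inline uint8_t rgb565To332(uint16_t c) {
  const uint8_t r = (c >> 11) & 0x1F;  // 5 bits of red
  const uint8_t g = (c >> 5) & 0x3F;   // 6 bits of green
  const uint8_t b = c & 0x1F;          // 5 bits of blue
  return static_cast<uint8_t>(((r >> 2) << 5) |  // keep top 3 bits of red
                              ((g >> 3) << 2) |  // keep top 3 bits of green
                              (b >> 3));         // keep top 2 bits of blue
}

// Quick sanity checks on the pure primaries.
inline void checkPrimaries() {
  assert(rgb565To332(0xF800) == 0xE0);  // red
  assert(rgb565To332(0x07E0) == 0x1C);  // green
  assert(rgb565To332(0x001F) == 0x03);  // blue
}
```

Blue ends up with only four levels, so subtle shades collapse together; how much that matters depends on how close the Apple's palette sits to the colors that survive.
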

If we drop to 8bpp, then the whole thing fits in to RAM2 without hacking any of the C++ constructor allocation stuff. Two problems at least partially addressed with the same solution. Now the theoretical 20MHz SPI clock can transfer 11.6 windowed frames per second, which is still pretty awful but is closer to usable. Personally I think 12 fps is the cutoff for usability here; 30 is ideal, but I haven't seen anything that's a real show stopper at 12 fps other than real-time video. And that's a rarity, at best, on the //e. So I'm not happy with 11.6 at all and want better.

While I'm backing off of the layered hacks to keep it as simple as possible... let's back off of one more. The windowing improvement is an optimization. Optimizations should come after the core code works well. So let's go back to the full 800 x 600 display and see what happens - the theoretical frame rate drops to 5.2 fps, but the buffer does fit in RAM2. If we can get this working well enough then maybe we can apply windowing as a later optimization to make things better in the majority of cases - where there aren't drive indicators flashing or debug messages being displayed.
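
Putting the buffer size and the frame rate side by side makes it easier to compare the candidate configurations. In this sketch the DisplayMode struct and fitsInRam2 are hypothetical names, the frame rate is the same theoretical ceiling as before, and the fit check looks at the bare buffer only, ignoring whatever else is living in RAM2.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t kRam2Bytes = 512 * 1024;  // RAM2 size quoted in the text

// One candidate display configuration.
struct DisplayMode {
  uint32_t width, height, bitsPerPixel;

  // Size of a full buffer for this mode.
  uint32_t bufferBytes() const {
    assert(bitsPerPixel % 8 == 0);
    return width * height * (bitsPerPixel / 8);
  }

  // Theoretical frames per second if every SPI clock carries a pixel bit.
  double framesPerSecond(double spiClockHz) const {
    return spiClockHz / (static_cast<double>(width) * height * bitsPerPixel);
  }
};

// True if the bare buffer fits in RAM2 (other RAM2 users are ignored).
inline bool fitsInRam2(const DisplayMode &m) {
  return m.bufferBytes() <= kRam2Bytes;
}
```

At a 20MHz clock, DisplayMode{800, 600, 8} needs 480000 bytes and tops out near 5.2 fps, while DisplayMode{560, 384, 8} needs 215040 bytes and tops out near 11.6 fps.
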

This is the point where I draw the rest of the fucking owl. 

The RA8875 driver doesn't have eDMA support at all, so I had to write it. Which I did, and which was complicated, and which deserves its own log entry. So I'm going to hand-wave past that as "yep, it's solved, nothing to see here" for now. We've got DMA, and it does get about 6 fps with a 20MHz SPI clock.

So now we're back to hacks. What other hacks might get that frame rate up?

Well, let's see how far we can push that SPI clock. Setting the _clock rate in teensy-display.cpp to about 25000000 works, but setting it to 30000000 gives a completely black screen. A little playing around and we find that at 29MHz we get this psychedelic show:

The empirical maximum, for me in this configuration under these circumstances, seems to be about 26MHz which yields roughly 7 fps. But LOOK AT HOW NICE THAT 80-COLUMN TEXT LOOKS.

So, first goal accomplished - YES, we can use this display. But can we make it perform well enough? Stay tuned for more...