-
Entry 29: A Little Bit of This, and a Little Bit of That...
02/10/2022 at 21:45
As I've been writing in the last few updates, I've been working on support for the RA8875 display, so that the next generation of the Aiie will have a display that can accommodate the 560-pixel width that the Apple //e has in "double hi-res" modes (80-column text, double-low-res and double-hi-res graphics). Without it, Aiie has been doing some janky hacks to display those graphics modes on a panel that's only 320 pixels wide.
Most of what I'd think someone uses a handheld //e emulator for doesn't involve 80-column text, so until now I've sort of ignored the problem. It wasn't until I heard from Alexander Jacocks about how he was building an Aiie that the topic came back to the fore. I opened up a discord for us to chat, and we've been talking about what's lacking in the current build... and of course the display was the number one hot item.
We've spent a couple months talking about the hardware and software with a few other people that have also joined our discord, and (as I've written) we've got the 800x480 panel up and running at about 14 frames per second.
It's looking like we won't get past that point. We're pushing the display's (single) SPI bus as fast as it will go. It's possible that some hack can get a little more out of it (if we abandon the display outside of the "apple" screen area then it might be possible to only update the 560x192x2 pixels of the "Apple Screen"); and I've got some framework for only updating the parts of a screen that have been modified by the Apple emulator... but when all is said and done, if a full-screen game is updating the whole screen, there's not much you can do about the lack of bandwidth.
I think that's kinda okay. If I rebuild the PCB so it can accommodate either the ILI9341 320x240 display *or* the RA8875 driver from Adafruit with a 4.3" 800x480 display, then the user can choose -- do I want 30 frames per second with the smaller display and some graphics issues at higher resolution, or do I want all the pixels at half the speed? Putting the choice back to the builder feels like a reasonable trade-off to me.
Which brought me to the next crossroad. I don't want to abandon the folks that have already built an Aiie. The original Mk 1 is a dead end, unfortunately, because of lack of CPU to do what I wanted. But the Mk 2 has plenty of capacity and I really don't want the addition of a new display to strand folks; I still haven't taken advantage of everything those have to offer! How can I continue to support the Mk2 platform without having to fork the software?
Well, that's not too hard actually. Since there is plenty of space in the Teensy, it doesn't mind having two copies of the graphics and two display drivers built in. Take one of the unused pins from the Mk2, turn it into a jumper or switch, and voilà, you've got selectable displays. Buy both if you want, and swap them as necessary. (This may not be ideal when we get to having an actual case, but for now at least it's plausible.)
From there I dropped back to the *nix variants of Aiie. I do most of my development and debugging on a Mac, using SDL libraries to abstract the windowing. It had been doubling the resolution of the ILI panel... but I've undone that. Now that the Teensy code supports two different displays with different resolutions, the SDL wrapper does the same... and when you're running it with the ILI ratio, it's natively 320x240 and not 640x480. Which means that, among other things, it became very ugly very quickly... and now this problem that has existed on Aiie since the start suddenly became a priority for me.
My first take at this was to logically "or" every two pixels together. If either of them is on, then the result is "on".
The text is sort of legible... but that white rectangle with the three dots in it is the letter 'a', inverted. As long as we're talking about black-on-white text it's... meh, probably okay.
Next up we have straight linear average: average the R, G, and B components to figure out what pixel we're going to draw. It winds up looking very similar.
It's hard to see, but there is a subtle improvement there - this is the inverted 'a' when you get up close and personal with it:
Not ideal, certainly. But it's something. And it got me thinking that the right thing to do here has to do with the way we actually see colors.
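For reference, here's a minimal sketch of those two first attempts. It assumes pixels are held as 24-bit RGB triples, and it reads "or" as a bitwise OR of each channel (which, for black and white pixels, is the same as "on if either is on"). The `Rgb` type and the `blendOr`, `blendAverage`, and `halveRow` names are made up for illustration and are not Aiie's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 24-bit pixel type for illustration; Aiie's real types may differ.
struct Rgb {
    uint8_t r, g, b;
};

// "OR" blend: each channel is the bitwise OR of the two source channels,
// so a pixel that is lit in either source stays lit.
inline Rgb blendOr(Rgb a, Rgb b) {
    return Rgb{static_cast<uint8_t>(a.r | b.r),
               static_cast<uint8_t>(a.g | b.g),
               static_cast<uint8_t>(a.b | b.b)};
}

// Linear blend: each channel is the (truncated) mean of the two sources.
inline Rgb blendAverage(Rgb a, Rgb b) {
    return Rgb{static_cast<uint8_t>((a.r + b.r) / 2),
               static_cast<uint8_t>((a.g + b.g) / 2),
               static_cast<uint8_t>((a.b + b.b) / 2)};
}

// Halve the width of one scanline by blending each adjacent pair of pixels.
// srcWidth must be even; dst must have room for srcWidth / 2 pixels.
template <typename Blend>
void halveRow(const Rgb *src, std::size_t srcWidth, Rgb *dst, Blend blend) {
    assert(srcWidth % 2 == 0);
    for (std::size_t x = 0; x < srcWidth; x += 2) {
        dst[x / 2] = blend(src[x], src[x + 1]);
    }
}
```

With a black pixel next to a white one, the OR version produces full white while the average produces a mid grey, which is roughly the subtle difference visible in the inverted 'a'.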
The RGB color space is what's called an additive color space. You add the RGB lights together in order to get white. It's not intuitive though: if you add red and green, what color do you get? Yellow. If you add Yellow and Blue what do you get? White. (Yellow is Red + Green; so Yellow + Blue is the same as Red + Green + Blue; and that is white.)
The HSV color space (Hue, Saturation, and color Value) was made as an attempt to model how we perceive color. It separates the color (which is basically the Hue) from the brightness of that color (the Value) in a way that lets you manipulate the color more naturally without accidentally changing the brightness.
If you want more of the background, take a look at the Wikipedia article on HSL and HSV color spaces. For our purposes, let's jump to the end here: what I want is to represent, in one pixel, some combination of what two pixels are actually trying to show. In order to do that I need some blended pixel data, and that sounds a whole lot like I should be using HSV.
The problem I had here is that I'll have to do a bunch of math on every pair of pixels, every time we need to draw them. My quick attempts bogged down the Teensy badly, and so I left the ILI panel using RGB averaging as a "close enough for now" solution.
But now I'm looking at the SDL port, where I've got the full CPU of a Mac to play with! How does it look, I wonder? Well a few algorithms later... convert RGB to HSV for both pixels; average the two H and S and Vs; convert back to RGB, and throw that pixel back to the display...
As George Takei might say... "Oh, My."
Here's the same at full resolution, for comparison:
You can see that the full resolution is nicer... but if you had never seen it at full resolution, the half-resolution version is really not all that bad.
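The convert / average / convert-back step described above could look something like the sketch below. This is a plain textbook RGB/HSV conversion with a naive component-wise average, written for illustration; the type and function names (`Rgb8`, `Hsv`, `rgbToHsv`, `hsvToRgb`, `blendHsv`) are assumptions and not the code that's actually in Aiie.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct Rgb8 { uint8_t r, g, b; };
struct Hsv  { float h, s, v; };  // h in degrees, s and v in [0, 1]

inline Hsv rgbToHsv(Rgb8 c) {
    const float r = c.r / 255.0f, g = c.g / 255.0f, b = c.b / 255.0f;
    const float maxc = std::max({r, g, b});
    const float minc = std::min({r, g, b});
    const float delta = maxc - minc;
    Hsv out{0.0f, 0.0f, maxc};           // greys keep hue 0, saturation 0
    if (delta > 0.0f) {
        out.s = delta / maxc;
        if (maxc == r)      out.h = 60.0f * std::fmod((g - b) / delta, 6.0f);
        else if (maxc == g) out.h = 60.0f * ((b - r) / delta + 2.0f);
        else                out.h = 60.0f * ((r - g) / delta + 4.0f);
        if (out.h < 0.0f) out.h += 360.0f;
    }
    return out;
}

inline Rgb8 hsvToRgb(Hsv c) {
    assert(c.h >= 0.0f && c.h <= 360.0f);
    const float chroma = c.v * c.s;
    const float hp = c.h / 60.0f;        // which 60-degree sector we are in
    const float x = chroma * (1.0f - std::fabs(std::fmod(hp, 2.0f) - 1.0f));
    float r = 0.0f, g = 0.0f, b = 0.0f;
    if (hp < 1)      { r = chroma; g = x; }
    else if (hp < 2) { r = x; g = chroma; }
    else if (hp < 3) { g = chroma; b = x; }
    else if (hp < 4) { g = x; b = chroma; }
    else if (hp < 5) { r = x; b = chroma; }
    else             { r = chroma; b = x; }
    const float m = c.v - chroma;        // lift all channels to match the value
    auto to8 = [m](float f) {
        return static_cast<uint8_t>(std::lround((f + m) * 255.0f));
    };
    return Rgb8{to8(r), to8(g), to8(b)};
}

// Naive blend: average H, S and V separately, then convert back.
inline Rgb8 blendHsv(Rgb8 a, Rgb8 b) {
    const Hsv ha = rgbToHsv(a), hb = rgbToHsv(b);
    return hsvToRgb(Hsv{(ha.h + hb.h) / 2.0f,
                        (ha.s + hb.s) / 2.0f,
                        (ha.v + hb.v) / 2.0f});
}
```

One caveat with this naive version: hue is an angle, so a straight average can land on an unrelated hue when the two inputs are far apart (pure red and pure blue average to green here). That's also a good illustration of why this is a lot of floating-point work to be doing per pixel pair.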
So... how do I get that sexiness on to the Teensy?? How do I not overload the CPU?
Let's follow the logic. The reason it's slow is because it's doing a lot of calculation. The reason it's doing a lot of calculation is because the algorithms for RGB/HSV conversion are kind of messy (they're not well optimized for computers to do them). The reason I have to do so many calculations is because the driver is trying to mix two arbitrary colors.
But there are only 16 possible colors. The SDL port might be using 24-bit colors to show them on the screen, but the Apple //e only knew about 16 actual colors. Which means that we only ever have to blend 16^2 possibilities -- with 256 possible outcomes. That's small enough to make a look-up table! Assuming that we start with these 8-bit colors:
    static const uint8_t palette8[16] = {
        0x00, // 0 black
        0xC0, // 1 magenta
        0x02, // 2 dark blue
        0xA6, // 3 purple
        0x10, // 4 dark green
        0x6D, // 5 dark grey
        0x0F, // 6 med blue
        0x17, // 7 light blue
        0x88, // 8 brown
        0xE0, // 9 orange
        0x96, // 10 light gray
        0xF2, // 11 pink
        0x1C, // 12 green
        0xFC, // 13 yellow
        0x9E, // 14 aqua
        0xFF  // 15 white
    };
the mixture of each color with one of the other colors precomputes to
    static const uint8_t mix8[16][16] = {
        0x00, 0x29, 0x28, 0x2D, 0x28, 0x49, 0x28, 0x6D, 0x24, 0x69, 0x24, 0x4D, 0x6D, 0x6D, 0x6D, 0x6D,
        0x29, 0xA1, 0x62, 0xA2, 0x02, 0x52, 0x42, 0xAB, 0x0E, 0x3B, 0x8A, 0xC6, 0x0B, 0x37, 0x47, 0x7B,
        0x28, 0x62, 0x02, 0x42, 0x0E, 0x31, 0x0A, 0x2B, 0x0C, 0x18, 0x46, 0x87, 0x16, 0x39, 0x32, 0x79,
        0x2D, 0xA2, 0x42, 0xA3, 0x0A, 0x56, 0x03, 0x8B, 0x16, 0x3E, 0x8A, 0xEB, 0x37, 0x5F, 0x4F, 0x9E,
        0x28, 0x02, 0x0E, 0x0A, 0x10, 0x70, 0x11, 0x37, 0x2C, 0x98, 0x2A, 0x2B, 0x14, 0x78, 0x35, 0xB9,
        0x49, 0x52, 0x31, 0x56, 0x70, 0x91, 0x55, 0x9A, 0x8C, 0xD1, 0x72, 0x9B, 0xB9, 0xDA, 0x99, 0xDA,
        0x28, 0x42, 0x0A, 0x03, 0x11, 0x55, 0x12, 0x4F, 0x10, 0x58, 0x2A, 0x67, 0x19, 0x39, 0x3A, 0x99,
        0x6D, 0xAB, 0x2B, 0x8B, 0x37, 0x9A, 0x4F, 0x93, 0x35, 0x7E, 0x92, 0xD3, 0x7F, 0x9F, 0x9B, 0xDF,
        0x24, 0x0E, 0x0C, 0x16, 0x2C, 0x8C, 0x10, 0x35, 0x68, 0xAC, 0x2D, 0x36, 0x94, 0xB4, 0x54, 0xB1,
        0x69, 0x3B, 0x18, 0x3E, 0x98, 0xD1, 0x58, 0x7E, 0xAC, 0xED, 0x76, 0x7F, 0xFC, 0xF5, 0xBD, 0xF6,
        0x24, 0x8A, 0x46, 0x8A, 0x2A, 0x72, 0x2A, 0x92, 0x2D, 0x76, 0x6D, 0xAE, 0x56, 0x76, 0x72, 0xB6,
        0x4D, 0xC6, 0x87, 0xEB, 0x2B, 0x9B, 0x67, 0xD3, 0x36, 0x7F, 0xAE, 0xEF, 0x57, 0x7F, 0x6F, 0xBF,
        0x6D, 0x0B, 0x16, 0x37, 0x14, 0xB9, 0x19, 0x7F, 0x94, 0xFC, 0x56, 0x57, 0x7C, 0xDD, 0x5D, 0xFE,
        0x6D, 0x37, 0x39, 0x5F, 0x78, 0xDA, 0x39, 0x9F, 0xB4, 0xF5, 0x76, 0x7F, 0xDD, 0xFD, 0x9D, 0xFA,
        0x6D, 0x47, 0x32, 0x4F, 0x35, 0x99, 0x3A, 0x9B, 0x54, 0xBD, 0x72, 0x6F, 0x5D, 0x9D, 0x7E, 0xFE,
        0x6D, 0x7B, 0x79, 0x9E, 0xB9, 0xDA, 0x99, 0xDF, 0xB1, 0xF6, 0xB6, 0xBF, 0xFE, 0xFA, 0xFE, 0xFF,
    };
the ILI panel is actually 16 bits per pixel and the table is twice as long, but the same set of calculations apply... and all that calculation time disappears in a puff of lookup table glory. Lookups are fast and our problem of "not enough CPU" is solved.
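To make the lookup idea concrete, here's a small sketch of both halves: precomputing a 16x16 table from any blend function, and then using it to halve a scanline of Apple color indices. The `MixTable` wrapper and the `buildMixTable` / `halveRowWithTable` names are invented for this example, and it assumes each source pixel is stored as a color index from 0 to 15; the real driver's data layout may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper around a 16x16 table of pre-blended output colors.
struct MixTable {
    uint8_t entry[16][16];
};

// Run the (possibly expensive) blend once for every pair of palette colors.
// This only has to happen once, or even offline to generate a const table.
template <typename Blend>
MixTable buildMixTable(const uint8_t (&palette)[16], Blend blend) {
    MixTable table{};
    for (int a = 0; a < 16; a++) {
        for (int b = 0; b < 16; b++) {
            table.entry[a][b] = blend(palette[a], palette[b]);
        }
    }
    return table;
}

// Halve a scanline of color indices (0..15). Each output pixel costs one
// table read instead of two RGB->HSV conversions and one HSV->RGB conversion.
inline void halveRowWithTable(const uint8_t *colorIndices, std::size_t srcWidth,
                              const MixTable &table, uint8_t *dst) {
    assert(srcWidth % 2 == 0);
    for (std::size_t x = 0; x < srcWidth; x += 2) {
        const uint8_t left = colorIndices[x] & 0x0F;
        const uint8_t right = colorIndices[x + 1] & 0x0F;
        dst[x / 2] = table.entry[left][right];
    }
}
```

The blend passed to `buildMixTable` can be as slow as it likes, since it only ever runs 256 times.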
How does it hold up in color environments, though? Well, the best example I can think of is the game AirHeart (which is my go-to for the game that pushes the limits of what the Apple //e was capable of doing). Here it is at full (double) resolution:
And here it is on a 320x240 display, at half-double-resolution:
Wow.
Now I'm wondering about the effort spent getting that RA8875 working. If I'd made the lookup table years ago, I'm not sure upgrading the panel would have even come up...
-
Entry 28: Collaboration via Discord
01/23/2022 at 00:32
I've been collaborating with a few individuals on Discord the last month or so. I've got bits of development and a lot of discussion going on over there, and if you're interested, drop on by. I don't know if this will work out long term but it's at least an interesting experiment...
Discord server invite: https://discord.gg/NRhMS6fRgZ
-
Entry 27: on DMA with the Teensy 4.1
01/22/2022 at 20:10
As part of the RA8875 display work, I had to build a new framebuffer and use the Teensy's eDMA system to automatically shuffle bytes out the SPI interface. In doing that I learned quite a lot about how the eDMA interface works, and what the magic code in the ILI9341 and ST7735 libraries does. I left a lot of those comments in my RA8875_t4.cpp module but I thought an out-of-band write-up would be really useful (for me for later, if not for others trying to do the same thing).
To begin with: download the IMXRT 1060 Manual from PJRC. The inner workings of all of this are documented in there, but they're not all in one place and can take a while to find (in the 1100-whatever pages of documentation). Eventually you'll be looking for data from it.
For me, the general path of getting this all working was:
- Get the display initialization working by reading other sources and duplicating what they did.
- Implement synchronous transfers on-demand to draw pixels, clear the screen, and whatnot.
- Build a framebuffer and update that instead of calling the synchronous methods.
- Build a synchronous update from the framebuffer.
- Implement a one-time DMA-to-SPI transfer that reads from the framebuffer and then stops.
- Turn on continuous asynchronous updates from the framebuffer.
- Remove all of the (now-unused) synchronous code.
The display initialization and LCD workings I'm not going to talk about too much - mostly it was a combination of looking at the RA8875 module distributed with Teensyduino 1.56 and looking up constants in the official display manual. The same is true of synchronous transfers -- a lot of copy/paste/delete/rewrite as I began to understand how the display itself works.
The framebuffer code itself is pretty straightforward. I needed an array for 8bpp 800x480, and declaring it is simple enough:
DMAMEM uint8_t dmaBuffer[RA8875_HEIGHT][RA8875_WIDTH] __attribute__((aligned(32)));
A refresher from my last log entry: DMAMEM tells the Teensy to put it in RAM2, which is perfect for DMA to use; the height and width constants are 480 and 800 respectively. That just leaves the attribute, which is important for DMA -- it can apparently be picky about the alignment of the buffer it's copying out of. (I didn't have any problems with this, but then again I was forewarned and added the attribute. YMMV.)
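Drawing into a buffer like that is just index math. Here's a rough, desktop-compilable sketch of the idea: the `kWidth` / `kHeight` constants stand in for RA8875_WIDTH and RA8875_HEIGHT, the array drops the Teensy-specific DMAMEM adornment, and `drawPixel` and `flatIndex` are hypothetical helper names rather than the ones in the real driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kWidth = 800;   // stand-in for RA8875_WIDTH
constexpr int kHeight = 480;  // stand-in for RA8875_HEIGHT

// Plain, 32-byte-aligned array for the sketch; on the Teensy this would
// also carry DMAMEM so that it lands in RAM2.
alignas(32) static uint8_t frameBuffer[kHeight][kWidth];

// Store one 8-bit pixel. Returns false (and draws nothing) if off-screen.
inline bool drawPixel(int x, int y, uint8_t color) {
    if (x < 0 || x >= kWidth || y < 0 || y >= kHeight) {
        return false;
    }
    frameBuffer[y][x] = color;  // row-major: row y, column x
    return true;
}

// The same pixel's position when the buffer is streamed out as one flat
// run of bytes, which is how the display will receive it.
inline std::size_t flatIndex(int x, int y) {
    assert(x >= 0 && x < kWidth && y >= 0 && y < kHeight);
    return static_cast<std::size_t>(y) * kWidth + static_cast<std::size_t>(x);
}
```

Because the array is [height][width], walking it from the first byte to the last visits pixels left to right, top to bottom, which matches the order the display expects when streaming from the upper-left corner.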
From there it's just a matter of doing some pixel math -- whenever something needs to be drawn to the screen, calculate the proper index in to the array and store the pixel instead of pushing a command to the LCD to do the same work. Taking that buffer of data and feeding it to a synchronous update proves that I know how to interact with the display itself - initializing its display window, starting a new SPI transaction, telling it I'm sending memory data, and ending the transaction when done. Nothing difficult so far, just very very slow to perform its work. This is how I'd transfer all of the data to the display in a synchronous function:
    _writeRegister(RA8875_CURV0, 0);
    _writeRegister(RA8875_CURV0+1, 0);
    _writeRegister(RA8875_CURH0, 0);
    _writeRegister(RA8875_CURH0+1, 0);
    // Start it sending data
    writeCommand(RA8875_MRWC);
    _startSend();
    _pspi->transfer(RA8875_DATAWRITE);
    // dmaBuffer is a 2-D array, so walk it as one flat run of bytes
    const uint8_t *pixels = &dmaBuffer[0][0];
    for (int idx=0; idx<800*480; idx++) {
      _pspi->transfer(pixels[idx]);
    }
    _endSend();
Those first four writes tell the display we're starting in the upper-left corner (Vertical and Horizontal cursor position at 0). Then send a memory write command; begin an SPI transaction with _startSend(); tell the display we're going to stream the pixel data; actually stream the pixel data; then end the transaction and we're done.
All we have to do is repeat that from a DMA handler!
Arr, but here there be dragons. Or maybe that's the wrong metaphor. Here there be performers of the dark arts? Certainly poorly documented capabilities that are hard to figure out from scratch. Which is why I leaned a lot on the ILI and ST code.
A lot of the code in the ILI and ST modules is abstraction -- "how do we talk to this piece of hardware to perform the same action" -- and that makes it difficult to read any of it. Paring it down to the minimum for the platform I was interested in helped a lot in seeing the important pieces, but it was time consuming and mind numbing. The code I'm left with is specific to the Teensy 4.1, which makes it a lot more legible.
There are a bunch of state variables. The pin assignments themselves for SPI communication are obviously in _cs, _miso, _mosi, _sck, and _rst. Pretty much everything else is more difficult to understand, so let me walk through it here.
There are the output hooks. There are multiple SPI busses on the Teensy, and in order to abstract which one is being used, there's _pspi that points to the SPI bus in question. It's also referenced in _spi_num (0, 1, or 2) and there's a crazy hack (that will certainly break at some point) to dig the SPI hardware configuration out of the SPI object and store it in _spi_hardware. Lastly we have _pimxrt_spi, which points at the low-power SPI data structure used to control how the target LPSPI bus operates.
There are the DMA glue hooks. There are a bunch of DMASetting structures, each defining a piece of data to be transferred via DMA. There's _dmatx which defines (and interacts with) the current DMA channel being used. There's _pfbtft that points at the frame buffer. _dma_state just holds a number that tells us what pieces of our DMA engine we've dealt with so far (have we initialized yet? Are we in continuous update mode? Is there a transaction active?) We have _spi_tcr_current, which refers to the current Transmit Control Register status; _spi_fcr_save which saves the Fifo Control Register setting before we start a DMA transfer so it can put it back the way it found it; and _dma_frame_count which is just a count of the number of full frames we've sent to the display (so Aiie can calculate frames per second).
The way it all works is this...
There are a number of DMASetting objects in that array sitting between the Framebuffer and the SPI bus. Each DMA transfer can only transfer up to 32767 words of data (where a word might be 8 bits, 16 bits, or 32 bits wide). We've got 800 x 480 (8-bit) words of data to write, so we will need (800*480)/32767 = 11.719 settings objects to describe all of the data. To that end we've got 12 of them, each of which will transfer (800*480)/12 = 32000 bytes of data and then hand off to the next DMASetting object.
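That arithmetic is easy to sanity-check at compile time. The sketch below only models the numbers and the ring ordering; it doesn't touch the DMASetting API, and the constant and function names (`kFrameBytes`, `chunkOffset`, `nextChunk`, and so on) are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kFrameBytes = 800u * 480u;      // 8 bits per pixel
constexpr std::size_t kMaxWordsPerTransfer = 32767u;  // per-transfer limit quoted above

constexpr std::size_t ceilDiv(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Smallest number of transfers that can cover the whole frame: 12.
constexpr std::size_t kChunks = ceilDiv(kFrameBytes, kMaxWordsPerTransfer);

// Split the frame evenly across them: 32000 bytes each.
constexpr std::size_t kBytesPerChunk = kFrameBytes / kChunks;

static_assert(kFrameBytes % kChunks == 0, "frame must split into equal chunks");
static_assert(kBytesPerChunk <= kMaxWordsPerTransfer, "chunk exceeds the DMA limit");

// Byte offset into the framebuffer where chunk i starts.
inline std::size_t chunkOffset(std::size_t i) {
    assert(i < kChunks);
    return i * kBytesPerChunk;
}

// Which chunk follows chunk i; the last one wraps back to the first.
inline std::size_t nextChunk(std::size_t i) {
    assert(i < kChunks);
    return (i + 1) % kChunks;
}
```

Splitting evenly (rather than eleven full-size transfers plus a short one) keeps every descriptor identical apart from its source pointer.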
Each object looks something like this...
    What                            Notes
    sourceBuffer                    The source pointer to the data we want to transmit
    destination                     The peripheral target that will get the data
    TCD->ATTR_DST                   Transmit destination attributes (word size)
    replaceSettingsOnCompletion()   Another dmaSetting structure that will automatically be used when this one completes
For us, the source points to some piece of the framebuffer; the destination points to the SPI transmit data register (TDR); the destination attributes specify that each transmit will be 8 bits; and each one refers to the next in the chain for completion, so that structure [0] points to [1]; [1] points to [2]; and so on, with the last one (number 11) pointing back to [0].
The last one in the chain is also a little special. It triggers an interrupt when it's done so we can hook the end of a transfer if we need to.
With all of that prepared, we set up _dmatx to point to the first set of the settings; tell it to trigger based on transmit status of the output SPI channel; attach the correct DMA interrupt handler; tell it to start running; and record that it has been initialized (for our own bookkeeping in _dma_state).
    _dmatx = _dmasettings[0];
    _dmatx.triggerAtHardwareEvent(dmaTXevent);
    if (_spi_num == 0)
      _dmatx.attachInterrupt(dmaInterrupt);
    else if (_spi_num == 1)
      _dmatx.attachInterrupt(dmaInterrupt1);
    else
      _dmatx.attachInterrupt(dmaInterrupt2);
    _dmatx.begin(true);
    _dma_state = RA8875_DMA_INIT | RA8875_DMA_EVER_INIT;
Now that the DMA channel end is set up, we'll just need to set up the display so it knows what to do with the data it's about to receive. You'll recognize this from the synchronous send I'd written above, we just stop before sending any of the actual data...
    _writeRegister(RA8875_CURV0, 0);
    _writeRegister(RA8875_CURV0+1, 0);
    _writeRegister(RA8875_CURH0, 0);
    _writeRegister(RA8875_CURH0+1, 0);
    // Start it sending data
    writeCommand(RA8875_MRWC);
    _startSend();
    _pspi->transfer(RA8875_DATAWRITE);
And then set up some registers for DMA:
    // Set transmit command register: disable RX ("mask out RX"), enable
    // TX from FIFO (b/c it's not masked out), and 8-bit data transfers
    // (7+1).
    maybeUpdateTCR(LPSPI_TCR_FRAMESZ(7) | LPSPI_TCR_RXMSK);
    // Set up the DMA Enable Register to enable transmit DMA (and not receive DMA)
    _pimxrt_spi->DER = LPSPI_DER_TDDE;
    _pimxrt_spi->SR = 0x3f00; // clear error flags
    _dmatx.triggerAtHardwareEvent( _spi_hardware->tx_dma_channel );
    _dmatx = _dmasettings[0];
    _dma_frame_count = 0;
    _dmaActiveDisplay[_spi_num] = this;
    _dmatx.begin(false);
And finally start everything flowing.
_dmatx.enable();
Now we've started an SPI transaction; sent the commands to tell the RA8875 to write to RAM; and wired up memory to send the dmabuffer contents out SPI. When it's done it will call the interrupt, and then we need to clean up and end the transaction, or the SPI bus will hang waiting for something to end the current transaction.
That cleanup happens in process_dma_interrupt. If we're running continuously, we don't have to do much. The window rolls over from the bottom-right pixel to the top-left and the stream of data continues to flow to the screen. We additionally call a cache-busting function to tell the processor to flush any cached data to RAM before we try to use it.
But if we called updateScreenAsync(false), then it wants to update once and stop. This is the clean-up code that makes sure we're done sending, cleans up some registers and turns off the transmission circuitry, finally ending the SPI transaction and updating _dma_state so we know it's no longer running:
    while (_pimxrt_spi->FSR & 0x1f) ; // wait until transfer is done
    while (_pimxrt_spi->SR & LPSPI_SR_MBF) ; // ... and the module is not busy
    _dmatx.clearComplete();
    _pimxrt_spi->FCR = _spi_fcr_save;
    _pimxrt_spi->DER = 0; // turn off tx and rx DMA
    _pimxrt_spi->CR = LPSPI_CR_MEN | LPSPI_CR_RRF | LPSPI_CR_RTF; //RRF: reset receive FIFO; RTF: reset transmit FIFO; MEN: enable module
    _pimxrt_spi->SR = 0x3f00; // clear error flags
    // DMF: data match flag set
    // REF: receive error flag set
    // TEF: transmit error flag set
    // TCF: transfer complete flag set
    // FCF: frame complete flag set
    // WCF: word complete flag set
    maybeUpdateTCR(LPSPI_TCR_FRAMESZ(7));
    _endSend();
    _dma_state &= ~RA8875_DMA_ACTIVE;
    _dmaActiveDisplay[_spi_num] = 0;
And that's it. It's not a lot of code - the RA8875_t4.cpp file is 538 lines, of which about 100 are comments. It's just poorly documented and difficult to understand.
Having done all of that - could I do it for other Teensys? Yes, the Teensy 4.0 could do this with some code changes. None of the previous Teensys have enough RAM for the frame buffer, though; the 3.6 had 256k of RAM and it's the beefiest of the predecessors.
I'll also say it's probably not worth building DMA transfers in to the RA8875 library for a few different reasons. The fact that the performance is so poor with the system clock at 60MHz (the documented maximum) limits the utility to begin with. Then you have to also decide if you want to use it at 10% of that speed by using PSRAM, or at 8 bits per pixel in RAM1 or RAM2.
The transfers themselves may be an elegant solution to avoid wasting a lot of CPU time doing busywork, but the sacrifices and general set-up are definitely ... not.
-
Entry 26: Harder, Better, Faster, Stronger
01/22/2022 at 18:23
When I was a freshman at university studying electrical engineering, one of my professors laid this out pretty plainly for us: engineering tolerance is important. If you're designing a system that needs 1 amp of current, your power supply better support at least 2 amps. You want room for failure - particularly when you're first designing something and have no idea how all the pieces will interact.
That often means that, if you know what you're doing, you can push past the stated limits of systems as long as you're willing to accept some risks. In the last log entry, you saw me push a 20MHz SPI bus over 26 MHz before it broke. Will every copy of that display get 26 MHz? I don't know, but it's possible. Will something be damaged by pushing it that far? Possibly, but it's not likely (in this case).
In the quest to get beyond 7 frames per second, this is the realm I'm visiting. What kinds of limits can I bend or break, without causing any significant damage? Is there a way to get an 800x480 SPI display over 12 frames per second? Or maybe as far as 30 frames per second?
The first step is to consult ye olde manuals. What are the variables here and how do they interplay?
We've got the SPI bus speed. On the Teensy side of things, I see reports of people getting that up to 80 MHz. I've certainly driven it up to 50 MHz. We're not near those potential maximums yet - and the current failure is on the RA8875 side anyway. So what are the limits there?
According to the RA8875 specification, page 62, the SPI clock is governed by the System Clock - where the SPI clock frequency maximum is the system clock divided by 3 (for writes) or divided by 6 (for reads). The system clock, in turn, is governed by the PLL configuration that's set via PLL control registers 1 and 2 (p. 39). The PLL's input frequency is the external crystal (on the Adafruit board that's 20MHz), and twiddling PLLDIVM, PLLDIVN, and PLLDIVK configures the multipliers and dividers for the PLL to generate its final frequency.
The system clock frequency is
SYS_CLK = FIN * ( PLLDIVN [4:0] +1 ) / (( PLLDIVM+1 ) * ( 2^PLLDIVK [2:0] ))
and looking at the DC Characteristic Table on page 174, we see that it's "typically" 20-30 MHz with a max of 60 MHz.
Now, the RA8875 driver (as distributed with the Teensy) sets all of this as PLLC1 = 0x0B and PLLC2 = 0x02, which means sys_clk is 60MHz. Right at the maximum limit specified in the datasheet.
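Plugging those register values into the formula is a quick way to check the claim. This sketch assumes the usual datasheet layout for the two PLL control registers (PLLDIVM in bit 7 of PLLC1, PLLDIVN in bits 4:0 of PLLC1, PLLDIVK in bits 2:0 of PLLC2); treat that layout, and the function names, as assumptions to verify against the manual rather than as the driver's own code.

```cpp
#include <cassert>
#include <cstdint>

// SYS_CLK = FIN * (PLLDIVN + 1) / ((PLLDIVM + 1) * 2^PLLDIVK)
inline uint32_t ra8875SysClkHz(uint32_t finHz, uint8_t pllc1, uint8_t pllc2) {
    assert(finHz > 0);
    const uint32_t divN = pllc1 & 0x1Fu;         // PLLDIVN[4:0] (assumed layout)
    const uint32_t divM = (pllc1 >> 7) & 0x01u;  // PLLDIVM, bit 7 (assumed layout)
    const uint32_t divK = pllc2 & 0x07u;         // PLLDIVK[2:0] (assumed layout)
    const uint64_t numerator = static_cast<uint64_t>(finHz) * (divN + 1u);
    const uint64_t denominator = static_cast<uint64_t>(divM + 1u) << divK;
    return static_cast<uint32_t>(numerator / denominator);
}

// Documented SPI clock ceilings: system clock / 3 for writes, / 6 for reads.
inline uint32_t maxSpiWriteHz(uint32_t sysClkHz) { return sysClkHz / 3u; }
inline uint32_t maxSpiReadHz(uint32_t sysClkHz) { return sysClkHz / 6u; }
```

With the 20MHz crystal, PLLC1 = 0x0B and PLLC2 = 0x02, this works out to 20MHz * 12 / 4 = 60MHz, and the divide-by-3 rule then puts the documented SPI write ceiling at 20MHz.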
Will it go faster? If we make it go faster, what else will be affected?
Looking for all of the references to the system clock, I see that the two PWMs use it. PWM1 is being used to drive the backlight, so that might be important at some point, but probably isn't critical. More importantly: the pixel clock is derived from the system clock.
The pixel clock is how the data is being driven out to the display. While I see various generalities about the pixel clock required for different sized displays that suggest 30-33MHz for an 800x480 display, I don't have the numbers for the actual display I'm using. And looking at the RA8875 manual and doing the math, it looks like the pixel clock is actually 15MHz here, so those "normal" values are either unimportant or wrong. Either way there's not much to be done about it until we understand more of what's going on.
So, let's jump in the deep end! What happens if we maximize PLLDIVN (set it to 31), minimize PLLDIVM (set it to 0) and minimize PLLDIVK (set it to 0 also)? Short answer: nothing. A black screen. So we can't just set it all the way to the maximum. But a little bit of binary searching shows that we can actually set it to other values in the middle and it works with our 26MHz SPI bus just fine, and a sys_clk that's over 60MHz. How far? As far as 150MHz. Along the way I found that the display would break down in very interesting ways... like this, when pushing the SPI bus faster than the clock wanted:
Or this, when the pixel clock started drifting too far out from what the panel wanted:
but with a lot of experimentation, it looks like I can get a good solid display with the PLL frequency at 300MHz; the system clock at 150 MHz; and the pixel clock at 18.75 MHz. That lets me push the SPI bus up quite a lot - at 57.5 MHz I'm getting 14-15 frames per second with a good solid display. At 60MHz there are occasional color shifts like the psychedelic display above, though not as severe; and the display goes dark at 80MHz (although it works at 79.999999 MHz, so I suspect there's some magic constant in the Teensy software or the Teensy 4.1 hardware someplace that's cutting it off at 80 MHz).
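As a back-of-the-envelope check on those frame rates: if every SPI clock cycle carried one bit of pixel data, the ceiling would simply be the bus speed divided by the bits in a frame. The helper below (a hypothetical name, and deliberately ignoring command bytes, gaps between transfers, and any other overhead) computes that upper bound.

```cpp
#include <cassert>

// Theoretical best-case frame rate for a display fed over a single SPI bus,
// assuming one bit of pixel data per SPI clock and zero protocol overhead.
inline double maxFramesPerSecond(double spiHz, int width, int height,
                                 int bitsPerPixel) {
    assert(spiHz > 0.0 && width > 0 && height > 0 && bitsPerPixel > 0);
    const double bitsPerFrame =
        static_cast<double>(width) * height * bitsPerPixel;
    return spiHz / bitsPerFrame;
}
```

For an 800x480 frame at 8 bits per pixel that's 3,072,000 bits, so 57.5MHz tops out at a little under 19 frames per second on paper. Seeing 14-15 in practice means the bus is already most of the way to its ceiling, which is why there isn't much more to squeeze out of it.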
What are the tradeoffs, then?
1. I have no idea if this will work on other panels, or if it's just the one I've got.
2. There are two integrated CPUs on the RA8875, and it's possible that neither of them will run reliably.
3. There are character ROMs that might not work properly.
4. The RA8875 has its own DMA access to an SD card that probably won't work.
5. The PWM rates are definitely affected and could be an issue depending on how the backlight is being fed.
6. The increased SPI bus rate on the Teensy means it's using more processing power "invisibly" transferring all that data. The audio channel also does the same, so there's potential conflict there.
Most of those are ignorable for now - I don't care about the features related to 2, 3, and 4. 5 seems to be fine empirically. We'll find out about 1 as more people test with this hardware and configuration. Which leaves 6.
There's definitely a problem with the increased SPI bus rate. As I pumped it up I started getting warnings about the audio buffer overflowing -- so as compensation, I've bumped up the audio buffer size tenfold (from 4k to 40k).
All this goes to show a few things.
The Adafruit support forum was wrong about the maximum SPI clock frequency, at least in the general case (or maybe I totally misunderstood what they were trying to say). It is not related to or capped at the 20MHz oscillator. Maybe what they said was true for their own driver under some other circumstances, but it's not a globally correct statement.
The stated maximums in the RA8875 manual for the CPU are not true for my use case. Again, it makes sense for the general-purpose driver to "set it and forget it" at the documented maximum, but for my purposes I can do more.
Understanding the manual is really helpful. Without knowing the way that these registers worked or were related to each other, I wouldn't have been able to easily identify which four values were related - how they are related, what their minimum and maximum settings are, and more importantly when I started seeing problems I wouldn't have had any idea which setting was likely the culprit in that circumstance.
Work it harder, make it better
Do it faster, makes us stronger
More than ever, hour after hour
Work is never over
-
Entry 25: ONCE MORE INTO THE ABYSS (of displays)
01/22/2022 at 15:04
Alternate title: "BRING ON THE HACKS"... strap in, it's gonna be a ride.
When we started brainstorming about Aiie! v10, one of the first questions asked was, "why can't we just use an 800x600 display and show the whole display instead of hacking around it on a 320x240 display?"
What a lovely, simple, innocuous question. And oh boy what a road it's been.
Let's start with the physical... what 800x600-like displays exist for embedded systems? There aren't many, and they tend to be fairly pricey. There are NTSC, VGA, and HDMI panels (obviously requiring those kinds of output from your project, which I don't have; I've toyed with NTSC and VGA so either of those would be feasible). If I'm looking for SPI, though - there is basically just the RA8875 chip which supports 800x600 and 800x480, which are both good resolutions for Aiie v10 as discussed in my last log entry.
I'd like this to be as cheap as possible, though. Which means understanding the parts really well and ultimately deciding if I'm using someone else's carrier board or making my own. The 800x480 40-pin displays are cheaper than the 800x600 displays. And buydisplay.com has them for under $18.
So some digging later, I'd bought a 4.3" 800x480 display from buydisplay.com. Yes, it's cheap... but doesn't include the RA8875 driver. Pair it with the $40 Adafruit RA8875 driver board and we should be good to go. Can I make this display work in any reasonable way?
Well, maybe. Let's look at the software side. There is an RA8875 driver for the Teensy, that's good! But it doesn't support DMA transfers to the SPI bus. That's bad.
What exactly does that mean? It means excruciatingly slow screen draws. Like, 1 frame every 6 seconds if we're drawing one pixel at a time. This is the actual RA8875 drawing that way...
That may be terrible, but it gets worse.
The first version of Aiie used a similar direct-draw model and was always fighting for enough (Teensy) CPU time to emulate the (Apple) CPU in real-time, because it was spending so much time sending data to the display. Not to mention the real-time requirements of Apple 1-bit sound. It was a bad enough set of conflicts that I added a configuration option at some point to either prioritize the audio or the display; you could have good audio or good video but not both. And if you picked video, then the CPU was running in bursts significantly faster than the normal CPU followed by a total pause while the display updated.
All of that was solved when I converted to a DMA-driven SPI display. The background DMA transfers don't interfere with the CPU or sound infrastructure at all. So I definitely, absolutely, completely want to use eDMA-to-SPI transfers to avoid this bucket of grossness.
So let's start somewhere real... let's change teensy-display.cpp so it will be able to drive this thing, and see how it goes. Here's one line that's a great starting place in teensy-display.cpp:
DMAMEM uint16_t dmaBuffer[TEENSYDISPLAY_HEIGHT][TEENSYDISPLAY_WIDTH];
That's the memory buffer where the display data is stored. TEENSYDISPLAY_HEIGHT and WIDTH are 240 and 320, respectively - matching the 16-bit display v9 uses. The DMA driver for the ILI9341 automagically picks up changes there and squirts them over SPI to the display at 40-ish frames per second.
What happens when we change TEENSYDISPLAY_HEIGHT and TEENSYDISPLAY_WIDTH to be 800x480 instead of 320x240? We get our first disappointment, that's what!
arm-none-eabi/bin/ld: region `RAM' overflowed by 450688 bytes
Simply put, there just isn't enough memory on the Teensy to be able to hold the display data. Let's dive in to that a bit.
The Apple //e has 128k of RAM. The Teensy 4.1 has 1MB of RAM. It would seem, on the face of it, that there should be enough RAM for an 800x480x16-bit display buffer - that's 750k. Yes, there's more overhead in the rest of Aiie... but a rough calculation says that for us to be over by 450-ish kilobytes per that error message, we'd have to be using the (1 meg minus 750k for the driver equals) 250k remaining RAM, plus the 450k that we're overflowing. Subtract the 128k of the Apple's actual RAM, and ... 572k seems like a lot of overhead to run this emulator. If that's true maybe there are optimizations that could be made elsewhere. But it turns out that's not actually true, for an architectural reason on the Teensy hardware itself.
The problem is that there are three banks of RAM on the Teensy 4.1. Borrowing some of the documentation from PJRC's Teensy 4.1 page:
Ignoring FLASH for the moment, we've got three banks of RAM - RAM1, RAM2, and PSRAM.
RAM1 is "tightly coupled" to the processor and is the fastest. RAM2 has a 32k cache around it, and PSRAM is the slowest. (FLASH isn't RAM, but it is possible to put static content and code in it.)
The DMAMEM adornment on my code means that the array is in RAM2. Since it's only 512K, there isn't enough space for the 750k array at all.
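A quick sketch of that size check, in case the arithmetic helps. The names here are made up for illustration and are not from the Aiie source; only the sizes come from the text above.

```cpp
#include <cstddef>

// Hypothetical names for illustration; only the sizes come from the text.
constexpr std::size_t kRam2Bytes = 512 * 1024;  // RAM2, where DMAMEM lives

// Bytes needed for a frame buffer of the given size and pixel depth.
constexpr std::size_t frameBufferBytes(std::size_t width, std::size_t height,
                                       std::size_t bytesPerPixel) {
  return width * height * bytesPerPixel;
}

// True if the buffer alone would fit in RAM2 (ignores everything else
// that also has to live there).
constexpr bool fitsInRam2(std::size_t bytes) { return bytes <= kRam2Bytes; }

// 320x240 at 16 bits per pixel: 153,600 bytes.
static_assert(fitsInRam2(frameBufferBytes(320, 240, 2)), "ILI9341 buffer fits");
// 800x480 at 16 bits per pixel: 768,000 bytes, or 750k.
static_assert(!fitsInRam2(frameBufferBytes(800, 480, 2)), "too big for RAM2");
```

Because everything is `constexpr`, the compiler does the check; nothing here runs on the Teensy.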
Both RAM1 and RAM2 are part of the stock Teensy 4.1 board. PSRAM is either one or two add-on 8 megabyte chips; Aiie v9 requires one. I've been adding both just for future flexibility, which means not adding a FLASH chip (you can either have one of each, or two PSRAM chips).
We could, therefore, move this display RAM up to PSRAM and it would fit. But there's a reason it's called DMAMEM; it's the preferred place for DMA transfers to happen. They can come out of RAM1, which is faster; but they get no benefit of doing so, which means RAM1 is better suited to local variables that are being used and disposed of all the time. And while it will fit in PSRAM, that memory is much slower than RAM2; enough that it will affect eDMA's ability to drive the display. In practice I find that it's roughly 10% the speed of using RAM2. So RAM2 is highly preferred, but not required. If we could get the display up to, say, 300 frames per second in RAM2 then we might get 30 fps when using PSRAM and that would be sufficient.
But the fastest I've driven the ILI9341 is somewhere around 60fps. And 6fps would just not cut it. So this really needs to fit in RAM1 or RAM2 to be viable.
How do we jam 800 x 480 x 2 bytes -- 750k of data -- in to a single 512k chip?
We don't. That's not possible. But maybe we can alter our expectations of what's being drawn, or how it's being drawn, to keep a smaller buffer footprint.
Generally display driver chips have some idea of windowing. They don't mean having a window you can drag around the screen; they mean that there's an area of the screen to which all drawing is confined. It's a way of restricting what part of the screen is being updated at any given time, which reduces the number of bytes you've got to transfer. What if we restrict it to just the Apple display area of 560x384? Well, 560 x 384 x 2 = 430080, which is 420k. It fits! But to take up that much space for DMAMEM, we'd have to get all of the object constructors and variables out of RAM2 to make room; that's almost the whole chip by itself, most of which is the Apple's 128k of memory (it's about 182k total, so there's about 54k of overhead that might be optimizable). Okay, it's possible. But not elegant.
If we've got a DMA transfer sending just the Apple display window data to the display, it means we can't update anything outside that rectangle without pausing and restarting the Apple's display. So any time we want to update the disk access lights, or the battery indicator, or dump a debugging string to the screen (like that "37 FPS" in the previous log entry's screen shots), we'd have to shut down the Apple DMA display; change the window; redraw the drive indicator or whatever; reset the window; and restart the DMA display. I suspect all of that will badly mess with the CPU timing, since those events would have to happen inline with the CPU process rather than asynchronously from the DMA process, but it's possible to do something clever and complex around caching and multithreading. Doesn't seem like a showstopper, but the work involved in getting all the other stuff out of RAM2 feels like a problem. I have no idea how to get C++ objects out of that space, for example. So I'd prefer another solution if one can be found. But let's keep thinking this through for a minute, and we'll see that there's a confluence coming up.
560 x 384 x 2 bytes of data being transferred via SPI will be 560 x 384 x 2 x 8 bits of SPI data. The SPI clock rate in Aiie v9, with an ILI9341, is 50MHz. A good approximation of our frame rate with this increased resolution would be
50 MHz / (560 pixels wide * 384 pixels tall * 2 bytes per pixel * 8 bits per byte) = 14.5 frames per second
That's not a complete show stopper but it's definitely troubling. But reading Adafruit support's reply in this thread about bitmap drawing speed on the RA8875 board, they say the maximum SPI bus speed is the same as the clock speed on the RA8875, which is 20MHz. Which means
20 MHz / (560 pixels wide * 384 pixels tall * 2 bytes per pixel * 8 bits per byte) = 5.8 frames per second
Uh oh. 5.8 fps is really bad. The only two ways to deal with that are to up the SPI clock - and in practice, I find that it's possible to go past stated maximums to some extent, so maybe there's room here to go faster - or to reduce the amount of data being transferred.
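The same estimate as a small helper, for anyone who wants to plug in other clock rates or window sizes. This is only a back-of-the-envelope upper bound, and the function name is mine; it ignores command overhead and any gaps between transfers.

```cpp
#include <cstdint>

// Upper bound on frame rate: SPI bits per second divided by bits per frame.
// Ignores command overhead and any gaps between transfers.
constexpr double theoreticalFps(double spiClockHz, std::uint32_t width,
                                std::uint32_t height,
                                std::uint32_t bitsPerPixel) {
  return spiClockHz / (static_cast<double>(width) * height * bitsPerPixel);
}

// For the 560x384 Apple window at 16 bits per pixel:
//   theoreticalFps(50e6, 560, 384, 16) is about 14.5
//   theoreticalFps(20e6, 560, 384, 16) is about 5.8
```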
Can we reduce the data we need to send?
One way to do that would be to define dirty areas of the screen that need updating. I'd written code to do this back in the original Aiie because of its terrible display update speed, and it would be possible to get that all working again. But it really complicates the DMA code. Instead of one free-running DMA transfer that's constantly sending the whole screen, giving us a constant update rate, we'd need to trigger DMA transfers of the dirty area from the main code loop. That would typically give us better performance with a fluctuating FPS rate. While there's nothing wrong with that, it doesn't really help us with games that are updating the whole screen; we'll be reduced back to the lowest rate. And potentially worse, because there may be an impact on the real-time CPU scheduling while we set up each DMA transfer and talk to the SPI bus directly. This doesn't actually solve the problem, it just makes it less obvious that it's still broken under some circumstances.
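For reference, the bookkeeping side of a dirty-area scheme can be quite small. The sketch below is hypothetical and is not the code from the original Aiie; it only tracks one bounding rectangle and says nothing about how the DMA transfer itself would be triggered.

```cpp
#include <cstdint>

// Hypothetical dirty-rectangle tracker; not the code from the original Aiie.
struct DirtyRect {
  std::uint16_t x0 = 0, y0 = 0, x1 = 0, y1 = 0;  // inclusive bounds
  bool dirty = false;
};

// Return a copy of r grown to include the pixel at (x, y).
constexpr DirtyRect grow(DirtyRect r, std::uint16_t x, std::uint16_t y) {
  if (!r.dirty) return DirtyRect{x, y, x, y, true};
  if (x < r.x0) r.x0 = x;
  if (x > r.x1) r.x1 = x;
  if (y < r.y0) r.y0 = y;
  if (y > r.y1) r.y1 = y;
  return r;
}

// Pixels that would have to be sent to refresh the rectangle.
constexpr std::uint32_t pixelCount(const DirtyRect& r) {
  return r.dirty ? static_cast<std::uint32_t>(r.x1 - r.x0 + 1) *
                       static_cast<std::uint32_t>(r.y1 - r.y0 + 1)
                 : 0u;
}
```

Touching (10,20) and (30,5) yields a 21x16 rectangle, 336 pixels. A game that touches opposite corners of the screen grows the rectangle to the full frame, which is exactly the worst case described above.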
Okay, I've been saying "560 x 384 x 2" and have been deliberately ignoring the "x2" piece of that. Yes, the display is 16 bits per pixel, which is where that comes from. But it doesn't have to be. The RA8875 (and most other 16-bit display driver chips) lets us drop it to 8 bits per pixel. This literally cuts in half the memory requirement, which will double the frame throughput. The tradeoff, of course, is that it's harder to accurately represent any individual color; instead of 5 bits of data to represent the red component, you're dropped to 3 bits. 6 bits of green drop to 3 bits also. And 5 bits of blue drops to only 2 bits. But let's ignore that for the moment, too... we're still experimenting.
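To make the 3/3/2 split concrete, here is one way to squash a 16-bit color down to 8 bits by keeping only the top bits of each channel. It assumes the usual RGB565 layout (red in the top 5 bits, green in the middle 6, blue in the bottom 5) and an RGB332 result; the function name is mine.

```cpp
#include <cstdint>

// Reduce a 16-bit RGB565 color to 8-bit RGB332 by keeping the top bits of
// each channel: 5 -> 3 bits of red, 6 -> 3 bits of green, 5 -> 2 bits of blue.
constexpr std::uint8_t rgb565to332(std::uint16_t c) {
  return static_cast<std::uint8_t>(
      ((c >> 8) & 0xE0) |   // red:   bits 15..13 -> bits 7..5
      ((c >> 6) & 0x1C) |   // green: bits 10..8  -> bits 4..2
      ((c >> 3) & 0x03));   // blue:  bits 4..3   -> bits 1..0
}

static_assert(rgb565to332(0xFFFF) == 0xFF, "white stays white");
static_assert(rgb565to332(0xF800) == 0xE0, "pure red");
static_assert(rgb565to332(0x07E0) == 0x1C, "pure green");
static_assert(rgb565to332(0x001F) == 0x03, "pure blue");
```

Simple truncation like this is the cheapest option; anything smarter (rounding, dithering, a hand-picked palette) trades CPU time for better color.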
If we drop to 8bpp, then the whole thing fits in to RAM2 without hacking any of the C++ constructor allocation stuff. Two problems at least partially addressed with the same solution. Now the theoretical 20MHz SPI clock can transfer 11.6 windowed frames per second, which is still pretty awful but is closer to usable. Personally I think 12 fps is the cutoff for usability here; 30 is ideal, but I haven't seen anything that's a real show stopper at 12 fps other than real-time video. And that's a rarity, at best, on the //e. So I'm not happy with 11.6 at all and want better.
While I'm backing off of the layered hacks to keep it as simple as possible... let's back off of one more. The windowing improvement is an optimization. Optimizations should come after the core code works well. So let's go back to the full 800 x 480 display and see what happens - the theoretical frame rate drops to 6.5 fps, but the buffer does fit in RAM2. If we can get this working well enough then maybe we can apply windowing as a later optimization to make things better in the majority of cases - where there aren't drive indicators flashing or debug messages being displayed.
This is the point where I draw the rest of the fucking owl.
The RA8875 driver doesn't have eDMA support at all, so I had to write it. Which I did, and which was complicated, and which deserves its own log entry. So I'm going to hand-wave past that as "yep, it's solved, nothing to see here" for now. We've got DMA, and it does get about 6 fps with a 20MHz SPI clock.
So now we're back to hacks. What other hacks might get that frame rate up?
Well, let's see how far we can push that SPI clock. Setting the _clock rate in teensy-display.cpp to about 25000000 works, but setting it to 30000000 is completely black. A little playing around and we find that at 29MHz we get this psychedelic show:
The empirical maximum, for me in this configuration under these circumstances, seems to be about 26MHz which yields roughly 7 fps. But LOOK AT HOW NICE THAT 80-COLUMN TEXT LOOKS.
So, first goal accomplished - YES, we can use this display. But can we make it perform well enough? Stay tuned for more...
-
Entry 24: Look at these 53760 pixels that the world doesn't want you to see!
01/22/2022 at 06:50 • 0 comments
In 2021, a friend of mine gave me an Apple //e that he'd had sitting in a garage for years. He'd rescued it from a place where another one of my friends had been working, probably around 1996. I've spent a few months cleaning and fixing it up; part of that effort led me to build the Apple ProFile 10MB Hard Drive reader. It's also gotten me playing Nox Archaist - which my wife bought me for Christmas 2020 - on the actual //e.
But that's not what I'm here to write about. I'm here to write about how all of this pushed me back to working on Aiie!
Just as I was thinking about how the //e was going to come back together, someone reached out to me on Twitter with questions about their own Aiie v9 build. Some of the components are no longer available. I never listed why I picked the voltage regulator circuit I'd used (because it's a 1A boost). The PCB pads for the battery aren't labeled (J6 is +B and J5 is GND, but I'd left it flexible for the boost circuit). The version of HDDRVR.BIN (from AppleWin) has changed. The parallel card ROM is one of a pair, so if you have the actual card, it's not clear which one to dump (it's the Apple Parallel ROM, not the Centronics ROM).
This all got us talking about what else I'd forgotten to finish (sigh, woz disk support; Mockingboard emulation; WiFi). And then we started talking about what else we could do with a new revision of the hardware. Design a case, maybe. Update the parts list. Integrate a charger.
And update the display.
Now, most of that was on my roadmap... but I'd convinced myself to forget completely about the display. It's a hack that I'd optimized and considered "done."
The display on the Aiie! v9 is a 320x240 SPI display (the ILI9341). It's the second display I've used for this project. The original was a parallel interface that required a lot of CPU time to drive; I think I managed to get it up around 12 frames per second. The SPI interface for the ILI9341 not only uses fewer pins, but it can also be run directly from the Teensy's eDMA (enhanced direct memory access) hardware, directly sending the data out the SPI bus without the program manually doing the work. eDMA does a block transfer; when it ends, it automatically starts another one. The frame rate went through the roof. I think I saw it up around 40fps... where anything over 30 is overkill. (Hmmm... the black magic of the ARM IMXRT 1062 eDMA system would be a good topic for another log entry...)
Fine, that explains why I chose an SPI bus driven display. But why is the display resolution 320x240?
The Apple II video is really low resolution. The plain text screens are 40x24 characters, where each character is 7 pixels wide and 8 pixels tall - resulting in a 280x192 display. Lo-res graphics chop each character vertically in half, giving you 40x48 blobs that use 280x192 pixels. Hi-res graphics are, not surprisingly, 280x192 pixels. Fits fine in a 320x240 display, no problem! They're cheap, one's for sale right at PJRC alongside the Teensy, and it's got a well supported driver that's been optimized by the Teensy community. It's a slam dunk.
But with the //e's 80-column card, Apple made it weird. (I know, that's not a big stretch. The whole machine is built around engineering miracles, which is part of what I find so endearing.)
In 80-column mode, the horizontal resolution doubles but the vertical does not. Basically data gets shoveled out the NTSC (or PAL) generator twice as fast so it's twice as dense - but the number of scan lines isn't affected. So you wind up with the really awkward 560x192 pixel size for 80-column text, double-low-resolution, and double-high-resolution graphics.
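The geometry works out as below. The constant names are mine; the numbers are the ones quoted above. The last check is the count of pixels lost when a 560-wide mode is squeezed into 280 columns, which happens to match the number in this entry's title.

```cpp
// Apple II text geometry, as described above.
constexpr int kCharWidth = 7;   // pixels per character cell, horizontally
constexpr int kCharHeight = 8;  // pixels per character cell, vertically
constexpr int kTextRows = 24;

constexpr int screenWidth(int columns) { return columns * kCharWidth; }
constexpr int screenHeight() { return kTextRows * kCharHeight; }

static_assert(screenWidth(40) == 280 && screenHeight() == 192,
              "40-column text, lo-res and hi-res");
static_assert(screenWidth(80) == 560, "80-column and double-res modes");
// Pixels that a 280-wide rendering of a 560-wide mode cannot show:
static_assert((screenWidth(80) - screenWidth(40)) * screenHeight() == 53760,
              "half of each 560-pixel row is dropped");
```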
So I wrote the core of Aiie! to support 560x192. There are three builds of the core code - one for SDL (which is what I primarily use for development under MacOS); one for a Linux framebuffer (which I've used in passing on a RaspPi Zero as a toy); and the Teensy build for my custom hardware. Under SDL and the framebuffer, I'm using 800x600 (the same aspect ratio as 320x240) and it can directly draw the 560x192 Apple pixels. But on the Teensy I had to be ... clever.
    // This was called with the expectation that it can draw every one of
    // the 560x192 pixels that could be addressed. If TEENSYDISPLAY_SCALE
    // is 1, then we have half of that horizontal resolution - so we need
    // to be creative and blend neighboring pixels together.
    void TeensyDisplay::cachePixel(uint16_t x, uint16_t y, uint8_t color)
    {
    #if TEENSYDISPLAY_SCALE == 1
      // This is the case where we need to blend together neighboring
      // pixels, because we don't have enough physical screen resolution.
      if (x&1) {
        uint16_t origColor = dmaBuffer[y+SCREENINSET_Y][(x>>1)*TEENSYDISPLAY_SCALE+SCREENINSET_X];
        uint16_t newColor = (uint16_t) loresPixelColors[color];
        if (g_displayType == m_blackAndWhite) {
          // There are four reasonable decisions here: if either pixel
          // *was* on, then it's on; if both pixels *were* on, then it's
          // on; and if the blended value of the two pixels were on, then
          // it's on; or if the blended value of the two is above some
          // certain overall brightness, then it's on. This is the last of
          // those - where the brightness cutoff is defined in the bios as
          // g_luminanceCutoff.
          uint16_t blendedColor = blendColors(origColor, newColor);
          uint16_t luminance = luminanceFromRGB(_565toR(blendedColor),
                                                _565toG(blendedColor),
                                                _565toB(blendedColor));
          cacheDoubleWidePixel(x>>1,y,(uint16_t)((luminance >= g_luminanceCutoff) ? 0xFFFF : 0x0000));
        } else {
          cacheDoubleWidePixel(x>>1,y,color);
          // Else if it's black, we leave whatever was in the other pixel.
        }
      } else {
        // The even pixels always draw.
        cacheDoubleWidePixel(x>>1,y,color);
      }
    #else
      // we have enough resolution to show all the pixels, so just do it
      x = (x * TEENSYDISPLAY_SCALE)/2;
      for (int yoff=0; yoff<TEENSYDISPLAY_SCALE; yoff++) {
        for (int xoff=0; xoff<TEENSYDISPLAY_SCALE; xoff++) {
          dmaBuffer[y*TEENSYDISPLAY_SCALE+yoff+SCREENINSET_Y][x+xoff+SCREENINSET_X] = color;
        }
      }
    #endif
    }
If you look at the #if/#else/#endif, you'll see the simple case at the bottom (which is basically what the SDL and Framebuffer code do) versus what the Teensy code has to deal with for a smaller display. There are 3 different behaviors (and one that I've since removed). First: every even pixel gets drawn to the display, always. With just this piece, you can tell that there are letters in 80-column mode but it looks pretty awful.
Then the second is the luminance shader, if the user has picked the black-and-white display mode. This inspects both even and odd pixels, and decides if the combination is over or under a threshold to show a single dot. For black text on white, this looks really good; but for white text on black, it's really illegible... look at the quality of the "a" in the screen versus the rest of the text.
There was a third shader I'd built, where if either the even or odd pixels had any non-black value, then it would draw the pixel. In white-on-black text mode that looks substantially better than either of the above, but it does really badly in black-on-white text (look at that inverted 'a', it's just 3 dots now) and DHGR, so I compromised by using the luminance shader (which is equally bad in all cases).
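For comparison, here are the two decisions side by side in a stripped-down form. These are stand-ins written for this note, not the `blendColors` and `luminanceFromRGB` from the Aiie source; the per-channel average, the integer luma weights (299/587/114), and any cutoff value passed in are all assumptions.

```cpp
#include <cstdint>

// Expand the RGB565 channels to a 0..255 range.
constexpr int r8(std::uint16_t c) { return ((c >> 11) & 0x1F) * 255 / 31; }
constexpr int g8(std::uint16_t c) { return ((c >> 5) & 0x3F) * 255 / 63; }
constexpr int b8(std::uint16_t c) { return (c & 0x1F) * 255 / 31; }

// Integer approximation of perceived brightness, 0..255.
constexpr int luminance(std::uint16_t c) {
  return (299 * r8(c) + 587 * g8(c) + 114 * b8(c)) / 1000;
}

// Average two RGB565 colors channel by channel.
constexpr std::uint16_t blend565(std::uint16_t a, std::uint16_t b) {
  return static_cast<std::uint16_t>(
      (((((a >> 11) & 0x1F) + ((b >> 11) & 0x1F)) / 2) << 11) |
      (((((a >> 5) & 0x3F) + ((b >> 5) & 0x3F)) / 2) << 5) |
      (((a & 0x1F) + (b & 0x1F)) / 2));
}

// Luminance shader: blend the pair, light the pixel if it is bright enough.
constexpr bool luminanceShader(std::uint16_t even, std::uint16_t odd,
                               int cutoff) {
  return luminance(blend565(even, odd)) >= cutoff;
}

// "Either" shader: light the pixel if either input is non-black.
constexpr bool eitherShader(std::uint16_t even, std::uint16_t odd) {
  return even != 0 || odd != 0;
}
```

With these stand-ins, a white pixel next to a black one blends to 0x7BEF, which has a luminance of 124. A cutoff just above or below that value flips every such pair on or off at once, which is why a single threshold can't suit both white-on-black and black-on-white text. The "either" rule always lights the pair, which thickens white strokes and erodes black ones.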
So... yeah, I've built hacks to make it kinda work, but it's ugly and wrong; and while I can build a better antialiaser (to shade the result pixel better based on the left and right partners), I can't really make it right without switching to a higher resolution display.
It turns out that there are very few 800x600 displays available for microcontrollers. There are also a few 800x480 displays, which also work - the vertical axis will need doubled pixels for either 800x600 or 800x480, needing a minimum display resolution of 560x384. But they're all physically larger; and they are generally separate from the driver boards rather than nice integrated packages like the ILI9341. So we'll have hardware and software work to do - starting with a new proof of concept!
-
Entry 23: Here mousie mousie mousie
01/11/2021 at 22:56 • 0 comments
On my list for a good while has been mouse support - mostly because I'd like to add networking support, but most of the programs that have Uthernet card support (which is what I want to emulate) also require a mouse card. So, quite some time ago I started working on the mouse for Aiie, and was stymied with the lack of CPU power on the Teensy 3 (mostly architectural, based on how I'd designed it).
With the Teensy 4.1 running the show, things are looking pretty good - I'm running the CPU at 396 MHz, downclocked from the stock 600 MHz and well lower than the 816 MHz it can run at without cooling... so I shouldn't have any problems there. Which means it's just a matter of time and understanding.
The time has presented itself, and here's the understanding. Let's start with how the mouse works on the //e.
The AppleMouse II was the same mouse used on the Mac 128/512/Plus. The card it used on the //e interfaced with the system bus via a 6521 PIA ("Peripheral Interface Adapter") chip. It was glued together with a fairly substantial ROM, which not only used the standard 256 bytes of peripheral card space but also page-swapped in another 2k of extended ROM on demand.
My first thought was to implement a soft 6521 and glue it in to the bus; then use the real ROM images to provide a driver. It seemed a tedious, but likely robust, way to build it out.
The 6521 code wasn't hard to write, but testing it is a different story; the only way I have of testing it is by booting something that has mouse support, and seeing what happens. Which means blindly interfacing the mouse card on top of the untested 6521, and figuring out which problems are bugs in the 6521 vs. which are problems in my interface to the program running on the Aiie.
I figured that GEOS would be a good way to test the mouse itself. I'd used GEOS back in the 80s, so I knew what to expect - particularly that it used quite a lot of what the Apple //e was capable of back in the day. Double-high-res graphics, all 128k of RAM, all on top of ProDOS. But when I booted it, the mouse didn't work, and I couldn't quite figure out how to debug it.
Which is approximately where I put it down in 2019, waiting for some stroke of inspiration. Or fortitude.
So when I picked up the mouse driver again, I wanted to do it another way. I've spent some time bringing the SDL build up to snuff, working from the same code base as the Teensy 4.1 so I can directly debug on my Mac. So to find out how the mouse was supposed to work, I started reading all the AppleMouse documentation I could find.
Which isn't much. There's the AppleMouse II User's Manual; the related Addendum; a smattering of old usenet posts (as far as I can tell) that exist in various forms around the Internet. A few other dribs and drabs but nothing substantial.
At that point, I figured I'd start disassembling the original ROM and building my own. But I had some substantial questions.
How is the mouse card identified?
I'd already decided I'd put the mouse in slot 4, so booting up the machine it's straightforward to look at the basic 256 bytes of ROM directly in the system monitor as disassembly or raw data. I went for disassembly.
    ] CALL -151
    * C400L
But this was all built in the days before anyone had hardware you could interrogate - there's no handshake to ask the board what it is, so it's not the code that's important right now. The OS detected the hardware by reading bytes out of its ROM and guessing at what made it a mouse. Over years of hardware appearing, patterns emerged and it became standard practice to look for certain fingerprints of data. Eventually Apple released the 1988 specification "Pascal 1.1 Firmware Protocol ID Bytes". It says that the bytes at offsets $05, $07, and $0B must be $38, $18, and $01 respectively. And all of that is true in this firmware. It also says that byte at offset $0C is the hardware identifier - in this case, its value is $20. And the mouse user's guide says that it also should have the value $D6 at offset $FB.
So at application load time, the app scans all 7 slots looking for those bytes. If I wanted to write my own ROM I'd need to start with those specific bytes. Easy enough.
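In C++ terms, the fingerprint check amounts to something like the sketch below. The struct and function names are hypothetical, and whether any particular application checks all five bytes or only some of them is an assumption; the offsets and values are the ones quoted above.

```cpp
#include <cstdint>

// One slot's 256-byte firmware page ($Cn00-$CnFF).
struct SlotRom {
  std::uint8_t bytes[256];
};

// Base address of the firmware page for a slot: slot 4 -> $C400.
constexpr std::uint16_t slotRomBase(int slot) {
  return static_cast<std::uint16_t>(0xC000 + slot * 0x100);
}

// Stamp the identification bytes quoted above into a ROM image.
constexpr SlotRom withMouseIdBytes(SlotRom rom) {
  rom.bytes[0x05] = 0x38;
  rom.bytes[0x07] = 0x18;
  rom.bytes[0x0B] = 0x01;
  rom.bytes[0x0C] = 0x20;  // hardware identifier
  rom.bytes[0xFB] = 0xD6;  // from the mouse user's guide
  return rom;
}

// What a scanning application might check for (hypothetical helper).
constexpr bool looksLikeMouseCard(const SlotRom& rom) {
  return rom.bytes[0x05] == 0x38 && rom.bytes[0x07] == 0x18 &&
         rom.bytes[0x0B] == 0x01 && rom.bytes[0x0C] == 0x20 &&
         rom.bytes[0xFB] == 0xD6;
}
```

A scan would run `looksLikeMouseCard` against the page at `slotRomBase(1)` through `slotRomBase(7)` and stop at the first match.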
How is the ROM driver structured?
The mouse manual says there are 8 routines in the ROM; and to find each one's entry point, you read a single byte from that routine's dedicated place in ROM to find the offset. So, in slot 4 (which begins at memory $C400), to find the InitMouse function, we look at offset $19 (ergo, memory location $C419) and get back the byte $97, telling us that the function we want is at memory $C497.
This accounts for 8 more bytes of the ROM that are fixed (or, rather, that have fixed meanings) - bytes $12 through $19 are this lookup table for the functions we need. And one more is added in one of the pieces of trivia I found around the Internet, saying that if you look at offset $1C, you can call that function before InitMouse to change its frequency from 60Hz to 50Hz - which, looking at my ROM image, isn't actually doing anything and betrays something else about the structure of the ROM.
Because at $1C, in my ROM, is the byte $AE. And at $C4AE, I have
    C4AE-   A2 03       LDX   #$03
    C4B0-   38          SEC
    C4B1-   60          RTS
That's not doing anything useful. It's setting X to the value 3 (why? I don't know); setting the status register carry, which is how the driver signals an error to the caller; and then returning. I guess my ROM isn't capable of doing that thing Apple was talking about. Maybe it was too early, or a different variant. It's somewhat unimportant to me, but is very interesting - because of the ROM around the lookup table. Which reads like this, with my annotations:
    C40D-   AE AE AE AE
    C411-   00
    C412-   6D    ;(setmouse @ c46d)
    C413-   75    ;(servemouse @ C475)
    C414-   8E    ;(readmouse @ C48E)
    C415-   9F    ;(clearmouse @ C49F)
    C416-   A4    ;(posmouse @ C4A4)
    C417-   86    ;(clampmouse @ C486)
    C418-   A9    ;(homemouse @ C4A9)
    C419-   97    ;(initmouse @ C497)
    C41A-   AE
    C41B-   AE
    C41C-   AE    ;(semi-documented: sets mouse frequency handler to 60 or 50 hz)
    C41D-   AE
    C41E-   AE
    C41F-   AE
I think it's safe to say that all of those are lookup table addresses. $11 might be the entry point for PR# ("booting" from a peripheral or setting its output), and it looks like all the others with $AE are probably an "unimplemented function" placeholder. So now we know code shouldn't be placed between $0D and $1F.
So there's some initialization code at $C400; a few identification bytes; then a lookup table from $C40D through $C41F; and the "main" code block begins at $C420.
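The lookup itself is a single addition. A sketch, using the table bytes from the dump above; the enum and function names are mine, and the byte values only describe this particular ROM image.

```cpp
#include <cstdint>

// Offsets of the eight documented entries in the slot ROM's lookup table.
enum MouseRoutine : std::uint8_t {
  SetMouse = 0x12, ServeMouse, ReadMouse, ClearMouse,
  PosMouse, ClampMouse, HomeMouse, InitMouse  // InitMouse == 0x19
};

// Table bytes $12..$19 as they appear in the ROM dump above.
constexpr std::uint8_t kEntryOffsets[8] = {0x6D, 0x75, 0x8E, 0x9F,
                                           0xA4, 0x86, 0xA9, 0x97};

// Entry point = start of the slot's ROM page + the byte stored in the table.
constexpr std::uint16_t routineAddress(int slot, MouseRoutine r) {
  return static_cast<std::uint16_t>(0xC000 + slot * 0x100 +
                                    kEntryOffsets[r - SetMouse]);
}

static_assert(InitMouse == 0x19, "table runs from $12 through $19");
static_assert(routineAddress(4, InitMouse) == 0xC497, "InitMouse in slot 4");
static_assert(routineAddress(4, SetMouse) == 0xC46D, "SetMouse in slot 4");
```

On real hardware the caller reads the table byte out of the ROM at run time rather than from a hard-coded array, so the same calling code works with any ROM revision.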
How does it work?
All of that is pretty straightforward if you're using an assembly or Pascal program to call those entry points. You can call InitMouse, then SetMouse to turn on the mouse peripheral, then ReadMouse in a loop to keep reading position data. The mouse returns X and Y positions in specific memory locations that the caller can retrieve, and sets the Carry flag if any errors occur. But none of that actually interfaces with a mouse for me, at least not yet. I'll need some physical interface that the user interacts with, which these functions can then read from. With that abstraction in mind, I'll need a way to get the position and button state... something like this, which is in physicalmouse.h:
    virtual void getPosition(uint16_t *x, uint16_t *y) = 0;
    virtual bool getButton() = 0;
Then in the SDL build, there's an sdl-mouse.cpp; and in the Teensy build, there's a teensy-mouse.cpp. The SDL one reads the Mac's mouse movements and tracks the position within the window, while the Teensy one uses the joystick and uses the left shift key as a mouse button.
That just leaves tying together the two halves - how do we bridge between the mouse interface card ROM and this C++ code?
Enter the soft switch
Part of the allure of early computers - for me, at least - is the way the hardware and software so fluidly cross over each other. In the Apple II series, there are 16 memory addresses for each peripheral that look like RAM to the processor and, in hardware, could do any number of interesting things - because they're just electrical signals. These are the "soft switches" from $C090 through $C0FF. A card in slot 1 would get $C090 through $C09F. Since we're in slot 4, we get $C0C0 through $C0CF. When something writes to those addresses, some of them are tied to signal lines for the mouse card's 6521 PIA chip - and that's how the driver winds up talking to the mouse. Since I've removed the PIA and I'm writing the ROM driver from scratch, we can use any of those 16 to either read or write, giving us 32 ways to transfer data across this border.
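The slot-to-address mapping is regular enough to write down. A small sketch with helper names of my own choosing: each slot gets 16 addresses starting at $C080 plus $10 per slot, which gives $C090 for slot 1 and $C0C0 for slot 4 as described above.

```cpp
#include <cstdint>

// First of the 16 soft-switch addresses for a slot: $C080 + slot * $10.
constexpr std::uint16_t softSwitchBase(int slot) {
  return static_cast<std::uint16_t>(0xC080 + slot * 0x10);
}

// Address of switch number sw (0..15) for a slot.
constexpr std::uint16_t softSwitchAddress(int slot, int sw) {
  return static_cast<std::uint16_t>(softSwitchBase(slot) + (sw & 0x0F));
}

// Going the other way: which slot and which switch does an address hit?
constexpr int slotOf(std::uint16_t addr) { return (addr - 0xC080) >> 4; }
constexpr int switchOf(std::uint16_t addr) { return addr & 0x0F; }

static_assert(softSwitchBase(1) == 0xC090, "slot 1 starts at $C090");
static_assert(softSwitchBase(4) == 0xC0C0, "slot 4 starts at $C0C0");
static_assert(softSwitchAddress(4, 0xC) == 0xC0CC, "slot 4, switch $C");
```

An emulator's memory handler can use the reverse mapping to route any access in $C090-$C0FF to the right card object.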
For example, take the new InitMouse function in my ROM.
    $C494   8D CC C0    STA   $C0CC
    $C497   18          CLC
    $C498   60          RTS
Dead simple: write whatever's in the accumulator to $C0CC, which activates soft switch $C for the peripheral in Slot 4; then clear the carry (to tell the caller no error occurred) and return.
Then we can write out the interface in C++, in the mouse.cpp object:
    void Mouse::writeSwitches(uint8_t s, uint8_t v)
    {
      switch (s) {
      case SW_R_INITMOUSE:
        // Set clamp to (0,0) - (1023,1023)
        g_vm->getMMU()->write(0x578, 0);    // high of lowclamp
        g_vm->getMMU()->write(0x478, 0);    // low of lowclamp
        g_vm->getMMU()->write(0x5F8, 0x03); // high of highclamp
        g_vm->getMMU()->write(0x4F8, 0xFF); // low of highclamp
        g_mouse->setClamp(XCLAMP, 0, 1023);
        g_mouse->setClamp(YCLAMP, 0, 1023);
        break;
      ...
That is to say: when InitMouse is called, soft switch 0xC is activated for write, which calls writeSwitches(...) in my emulation. That sets the clamping window (the bounds in which the mouse operates) to [0,1023]; and we'll write those values back to the reserved memory locations for the clamping window bounds (per the docs in the user manual) before returning from the "write to memory" operation.
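The 0x03 and 0xFF in that excerpt are just 1023 split into two bytes, since the 65C02 only moves 8 bits at a time. A tiny sketch with helper names of my own; the four addresses are the ones from the excerpt.

```cpp
#include <cstdint>

// Addresses from the writeSwitches() excerpt above.
constexpr std::uint16_t kLowClampLo = 0x478, kLowClampHi = 0x578;
constexpr std::uint16_t kHighClampLo = 0x4F8, kHighClampHi = 0x5F8;

// Split a 16-bit value into the two bytes the 8-bit side can see.
constexpr std::uint8_t lowByte(std::uint16_t v) {
  return static_cast<std::uint8_t>(v & 0xFF);
}
constexpr std::uint8_t highByte(std::uint16_t v) {
  return static_cast<std::uint8_t>(v >> 8);
}

// And put them back together again.
constexpr std::uint16_t joinBytes(std::uint8_t hi, std::uint8_t lo) {
  return static_cast<std::uint16_t>((hi << 8) | lo);
}

static_assert(highByte(1023) == 0x03 && lowByte(1023) == 0xFF,
              "upper clamp of 1023 is stored as $03, $FF");
static_assert(joinBytes(0x03, 0xFF) == 1023, "round trip");
```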
The rest of the code is similarly trivial. The only real complicated piece is ServeMouse, which is half of an interrupt handler pattern.
Which is something I've not done on Aiie at all before, so this ought to be interesting.
Programmer Interruptus
I let this one convince me it was going to be a problem for waaaaay longer than I should have. Eventually my years of experience told me to *just do something* and figure it out better after I had some context.
The problem is twofold.
First, there are the interrupts themselves. There's an IRQ ("interrupt request") vector on the 65C02 that, when an interrupt is asserted, stops code flow and jumps to the address stored in that memory location. But the Apple //e didn't have any sources of interrupts natively. These interrupts don't generally happen. You had to have hardware that was generating interrupts... and nothing I've written so far has done that. So I figured there was a good chance that I'd wind up debugging basic interrupt functionality, instead of the code I was trying to write. (Spoiler: it did not.)
Second, there's the nature of the interrupts. There are 3 supported interrupts on the AppleMouse card - an interrupt when the button is pressed; an interrupt when the mouse is moved; and an interrupt in the video vertical blanking period.
If you don't know what that is - the "vertical blanking period" is (simplifying a bit) the period of time during which the electron beam from a Cathode Ray Tube display is moving from the bottom-right to the top-left corner. Since it's a single beam being deflected by magnetic fields, it takes some time for the field to shift to get the beam back where it needs to be. And on most //e models, that's at a rate of 60 Hz.
So I need something new that runs at a rate of 60 cycles per second.
Previously, I'd had two timing-sensitive parts of Aiie - the CPU (which runs at 1.023 MHz) and the audio (which runs at 44.1 kHz at the moment). There's a third section that does maintenance - for the physical keyboard as well as the USB keyboard, polling at an arbitrarily-set 10 Hz. So I stepped that up to 60 Hz, and added a poll in to the mouse:
    g_mouse->maintainMouse();
    g_keyboard->maintainKeyboard();
    usb.maintain();
Then in the mouse.cpp object, it was a matter of adding code to trigger a CPU interrupt whenever the time arrived:
    if ( (status & ST_MOUSEENABLE) &&
         (status & ST_INTVBL) &&
         (cycleCount >= nextInterruptTime) ) {
      g_cpu->irq();
      interruptsTriggered |= ST_INTVBL;
    }
That says "If the mouse is enabled; and it was configured to send vertical blanking interrupts; and it's time for an interrupt, then trigger the IRQ in the CPU, and keep track of the reason why an interrupt has gone off".
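The excerpt doesn't show where `nextInterruptTime` comes from, so here is one plausible way to schedule it, using the round figures quoted above (1.023 MHz and 60 Hz). Everything in this sketch, including the helper names, is an assumption rather than the actual Aiie code.

```cpp
#include <cstdint>

constexpr std::uint64_t kCpuHz = 1023000;  // 1.023 MHz, as quoted above
constexpr std::uint64_t kVblHz = 60;
constexpr std::uint64_t kCyclesPerVbl = kCpuHz / kVblHz;  // 17,050 cycles

// True once the emulated CPU has reached the scheduled interrupt time.
constexpr bool vblDue(std::uint64_t cycleCount,
                      std::uint64_t nextInterruptTime) {
  return cycleCount >= nextInterruptTime;
}

// The first VBL boundary strictly after cycleCount. Snapping to a multiple
// of kCyclesPerVbl keeps the schedule from drifting if a poll runs late.
constexpr std::uint64_t nextVblAfter(std::uint64_t cycleCount) {
  return (cycleCount / kCyclesPerVbl + 1) * kCyclesPerVbl;
}

static_assert(kCyclesPerVbl == 17050, "CPU cycles between interrupts");
static_assert(nextVblAfter(17049) == 17050, "just before a boundary");
static_assert(nextVblAfter(17050) == 34100, "exactly on a boundary");
```

After firing the IRQ, the handler would set `nextInterruptTime = nextVblAfter(cycleCount)` so the next poll waits for the following boundary.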
Finally, in the ServeMouse call, we can tell the caller that we did indeed cause an interrupt, and it was caused because of the vertical blanking interrupt.
And with that, we've got mouse support.
So far I've tested it in GEOS, MultiScribe, Fantavision, Blazing Paddles, and Copy II+ 9.1.
-
Entry 22: Introducing the Mk 2!
08/19/2020 at 12:58 • 2 comments
The OSH Park boards arrived, and I spent some time Monday assembling! Here's a time lapse of the build, which took me shy of 3 hours (mostly because I hadn't organized any of the parts and had to hunt for several).
The build didn't actually work right away - I'd installed the power boost module upside-down (if you use the same board, don't install it with the components facing up - they should face down). After re-soldering it, everything just booted up fine!
A couple quick thoughts about this new build.
- The speaker/headphone jack works, but it drives the headphones very heavily. Full volume is *really* loud, and since I haven't implemented the volume control yet, it's not very useful.
- The backup battery holder I had isn't exactly the one from Mouser, and I'm not sure the one I listed from Mouser is actually correct. I need to check that out.
- The Mk 1 ran directly from the battery, which seemed fine - but really isn't. As the battery runs down the backlight flickers, and if you're using a lower capacity battery, it doesn't really hold up well. The Mk 2 uses a boost circuit to pop it up to 5v, and then a dedicated 3.3v linear regulator for the display. Much nicer.
- The battery voltage tester for the Mk 1 was just a simple resistive divider, and I never quite got it working the way I wanted. I've embellished a little on the Mk 2; it's using a MOSFET and a 2N3904 transistor to switch the check on and off, and I'm hoping I'll get better results (when that piece of code is eventually written).
- I decided to stick with the 18650 removable battery concept, rather than embedding a charger.
- The ESP-01 still doesn't do anything, so there's no reason for anyone to add one of those immediately. With the Mk 1 I spent a little time building an ethernet driver, but the CPU bottlenecks I ran into caused me to abandon that path. I'm hopeful that there's enough CPU headroom in the Mk 2 - between moving to the Teensy 4.1 and moving the display to DMA - that I'll be able to pull it off this time.
- The Teensy 4.1 is supposed to draw about 100mA at 600MHz, and the display is also rated at 100mA. The ESP-01 needs 400mA in bursts when using Wi-Fi, and probably more like 80mA when quiescent. So I've added a power switch specifically to turn off the ESP-01 -- it's not likely something anyone will use constantly, so there's no reason for it to take up a third of the power of the device!
- Speaking of power draw: I'm explicitly using the Teensy 4.1 at 528 MHz, rather than the default 600MHz. By stepping down one notch, it drops the core voltage which should significantly reduce its power draw. And in the testing I've done so far, I could probably drop it down to 396MHz (down two more steps) without affecting performance at all.
- The USB port should give me a way to add an external keyboard, which makes this more usable for non-games. I was dithering about how to deal with possible video out solutions (building a FlexIO VGA or composite output driver) but I decided I'd rather have a working handheld unit sooner than wait for a potential future that might not come to fruition. My ARM-specific FlexIO knowledge is nascent, so video output would take a lot of time, and I don't have a lot of time to spend. And then there's the pin requirement - I'm bumping up against the number of I/O pins available! Maybe a future Mk 3 will have a video output option as a replacement for the display (like the Teensy64 does - and if the underlying uVGA project builds in Teensy 4 support, then all the more likely I'll be able to hack this together).
- The ESP-01 doesn't have a way to be reprogrammed once installed, so I installed mine on a header for easy removal. There's just enough space to do so, but I'm concerned about the WiFi antenna being right up against the back of the LCD panel. I bet this will become a problem later - but time will tell!
- I still don't have an enclosure, and if someone's interested in designing one, I'd be game for a redesign of the board to accommodate. Right now I'm using one of the three PCBs from OSH Park as a backer, with standoffs holding it together. I like it but it could be better :)
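As a rough check on the ESP-01 power numbers in the list above, here is a small sketch of the current budget. It uses only the nominal figures quoted there (100mA Teensy, 100mA display, 80mA or 400mA for the ESP-01); the `PowerBudget` struct and constant names are made up for illustration, and the sums ignore boost converter losses and the savings from running at 528 MHz.

```cpp
// Nominal current draws in mA, taken from the figures quoted above.
struct PowerBudget {
    double teensy_mA;
    double display_mA;
    double esp_mA;

    constexpr double total() const { return teensy_mA + display_mA + esp_mA; }

    // Fraction of the total draw that goes to the ESP-01.
    constexpr double espShare() const { return esp_mA / total(); }
};

// ESP-01 idling at roughly 80mA.
constexpr PowerBudget kEspQuiescent{100.0, 100.0, 80.0};

// ESP-01 in a Wi-Fi burst at roughly 400mA.
constexpr PowerBudget kEspBurst{100.0, 100.0, 400.0};
```

With these numbers, `kEspQuiescent.espShare()` is 80/280 (a bit under a third of the total), and `kEspBurst.espShare()` is 400/600 (two thirds), which is why a dedicated switch for the module seemed worthwhile.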
So there you have it! There are a lot of software details that need attending to now, but the basic hardware is functional. If you want to build one, OSH Park has the project shared (for a hefty $126 for 3 boards) or you can use the schematic files in the project with another manufacturer (just be sure to use the aiie_r9_gerber.zip files, which are this Mk 2 board dated 20200731).
-
Entry 21: Here We Go Round Again!
07/31/2020 at 04:01 • 0 comments
Well, it's official - the r7 (or "v7" as I apparently named it, since I'm mostly doing software stuff these days) Aiie board is off to OSH Park for prototype manufacturing.
There are some substantial changes in there. Amongst them:
- Serial display, of course, is a huge change. Performance continues to look great during initial development (huzzah for DMA transfers).
- I got rid of the Tall Dog adapter. As I mentioned before, I'd been using it because of the pins I needed from the Teensy 3.6. Now that it's a serial display, there's less need for the pins (which is good, seeing how the Teensy 4.1 has fewer pins than the 3.6).
- I added an audio jack that cuts off the speaker (mechanically).
- The battery voltage monitor is redesigned. The old version was just a voltage divider going to an analog pin; it's now fed by a P-channel MOSFET so I'm not draining the battery constantly while monitoring its voltage. Maybe I'll actually get it coded up and working the way I want this time.
- A USB jack for plugging in a keyboard. I've done basic testing and it works, although modifier keys behave a little unexpectedly. I opted to not add a lot of protection circuitry here, not sure if that's going to come back to bite me later...
- The ESP-01 now has its own power switch. And the correct pull-ups to be able to boot. I never got to the ESP-01 development I wanted on the old board because of CPU issues; I'm hoping that won't be a problem here. Regardless I figured I'd rather not have it draining the battery when I'm not using it - especially since the ESP-01 has a higher current draw than the Teensy 4.1 and display put together.
- I'm using a boost power module to run off of +5V, and regulate it back down to 3.3V where necessary. The old board ran straight off of the battery, which led to some issues with fluttering display brightness as the battery ran down.
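For the battery voltage monitor mentioned in the list above, the arithmetic that the eventual code needs is just the voltage divider equation run backwards. The sketch below is an illustration only: the resistor values, reference voltage, and ADC resolution are placeholders (the real ones depend on the board), and `DividerConfig` and `batteryVolts` are hypothetical names, not part of Aiie.

```cpp
#include <cstdint>

// All values here are placeholders; use the ones from the actual schematic.
struct DividerConfig {
    double   rTopOhms;     // resistor between the switched battery rail and the ADC pin
    double   rBottomOhms;  // resistor between the ADC pin and ground
    double   vRef;         // ADC reference voltage
    uint32_t adcMax;       // full-scale ADC count, e.g. 1023 for a 10-bit reading
};

// Convert a raw ADC count back into the battery voltage.
// On hardware you would first switch the MOSFET on, let the pin settle,
// take the reading, and switch it off again so the divider stops
// draining the battery between checks.
inline double batteryVolts(const DividerConfig &cfg, uint32_t adcCount) {
    // Voltage actually present on the ADC pin.
    double vPin = cfg.vRef * adcCount / cfg.adcMax;
    // Undo the divider: Vbat = Vpin * (Rtop + Rbottom) / Rbottom.
    return vPin * (cfg.rTopOhms + cfg.rBottomOhms) / cfg.rBottomOhms;
}
```

With two equal resistors the divider halves the battery voltage, so a full-scale reading against a 3.3V reference corresponds to about 6.6V at the battery; that leaves plenty of margin for a single 18650 cell, which tops out around 4.2V.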
I've also been playing with VGA output on the Teensy 4.1, trying to build a FlexIO output with the right timing. However, there aren't enough free pins for me to do that with this layout, and I've been delaying having prototype boards made while I'm fiddling with the VGA stuff - so I've put that on the back burner for now. Maybe v8 will have a serial display-or-VGA hardware option, or maybe the VGA version will wind up being something completely different. Or maybe I'll never get back to it! Who knows. :)
By the end of August, I expect I'll have three prototype boards in my hands with a pile of new components to populate. Then it's back to the software!
-
Entry 20: Redesigns 'R Us
07/09/2020 at 14:27 • 0 comments
Welcome to Redesigns 'R Us, where we come up with new ways to do old things!
Over the last few years, my Aiie has mostly been sitting collecting dust. Sure, I spent some time working on WOZ disk format (which I love), and I've got a half dozen private branches of the code repo where I've been working on various features - but there are two major obstacles that I've talked about before that have kept me from really pursuing any of them:
- There's only so much RAM. When I'm working on code that's timing-critical, and then have to read a new track from the virtual floppy, the interaction with the MicroSD card is really a killer. I'd rather cache floppy images in RAM and forget about it - but there just isn't enough RAM to do that.
- There's only so much CPU. In some of my branches, I've been working on peripherals that have complicated timing interactions with the 65C02 - which led to increasingly complex hoops I've been jumping through to try to keep them working. In the end this became too complex and I had to put it down for a while.
But now there's the Teensy 4.1! It's a nice bump, from 180 MHz to 600 MHz; and it has pads for an additional 16MB of PSRAM. Those sound enticing - but come at a cost; there are fewer pins available.
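To put the RAM side of that in perspective, here is a back-of-the-envelope sketch. It assumes plain 140K .dsk images (35 tracks of 16 sectors of 256 bytes); WOZ images are larger, so treat the counts as an upper bound. The 256K figure is the Teensy 3.6's on-chip RAM, and in practice the emulator needs most of that for itself. All names are hypothetical.

```cpp
#include <cstddef>

// A standard 5.25" DOS 3.3 / ProDOS disk image: 35 tracks * 16 sectors * 256 bytes.
constexpr std::size_t kTracks          = 35;
constexpr std::size_t kSectorsPerTrack = 16;
constexpr std::size_t kBytesPerSector  = 256;
constexpr std::size_t kDskImageBytes   = kTracks * kSectorsPerTrack * kBytesPerSector;

// How many whole images fit in a given amount of RAM (ignoring everything else).
constexpr std::size_t imagesThatFit(std::size_t ramBytes) {
    return ramBytes / kDskImageBytes;
}

constexpr std::size_t kTeensy36RamBytes = 256 * 1024;        // on-chip RAM
constexpr std::size_t kPsramBytes       = 16 * 1024 * 1024;  // optional PSRAM on the 4.1
```

Each image is 143,360 bytes, so even an otherwise empty Teensy 3.6 could hold only one, while 16MB of PSRAM has room for over a hundred.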
The Teensy 3.6 had a boatload of pins available via pads on the underside of the board. Aiie v1 used many of those (lazily, via the Tall Dog breakout board). But with something like 17 fewer pins (if I've counted rightly) on the Teensy 4.1, I've got a problem.
So, redesign decision 1: how do I squeeze the same hardware into a smaller footprint?
Well, back in Entry 17, I faced the same general question when I experimented with adding external SRAM. My choice then is the same as it is now - swap out the display. I originally picked a 16-bit parallel display because I wanted to throw data at it very quickly. And it did take data quickly, at first, until I wound up complicating the codebase, which dropped the framerate to a sad 6 FPS. (This is on my list of "things that bug me about Aiie v1" - as it became more Apple //e-correct, it became much less responsive.)
So the 16-bit nature of the display isn't the problem. It's the code (primarily) and the available CPU (to a lesser extent). The display I picked then is the same one I'm picking now - the ILI9341, an SPI-driven display. In theory I can run it via DMA which will reduce the CPU overhead too. My only beef here is that the version of the ILI9341 that's in the Teensy store is a 2.8" display, where I picked a 3.2" display for Aiie v1 - but there are 3.2" versions of the ILI9341 available, and I have one of them, so I'm pretty satisfied on that front.
And with that much information, it's time to try it out! I've still got the original Aiie prototype board sitting around, and it doesn't seem too daunting to rewire it for this. First step, remove all the stuff I don't need, like that nRF24L01 serial interface that I wound up replacing with an ESP-01 in the final v1 circuit and all of those extra pins broken out from the bottom of the Teensy 3.6...
... not too hard.
There's also this rat's nest on the backside that has to go.
And then I need to figure out how I'm powering it. Lately I've been liking these MakerFocus battery charger / boost modules - they're obviously intended as the core of a battery booster pack, and fairly elegantly handle the charging of the battery, boost to 5V, and display of the battery's state. Single presses of the button turn it on, and a double-press turns it off. So adding one of those and a 3.3V linear regulator to safely drive the display...
Re-add the rat's nest of wiring underneath...
use some velcro to tape the battery in place...
and what do you know, if we hand-wave through the little bit of code that needed adjusting, we wind up with 36 frames per second on mostly-unoptimized code. Oh yeah, I like this.
The new code is in my GitHub repo, in the 'teensy41' branch. If you look at the timestamps you'll find it's got me very excited - I've been working on this as much as possible between other tasks, and it has me distracted enough that I took a day off of work to try to get some of this out of my head today.
Post on Hackaday: check. Next up? Board layout, so I can get a fabrication house started on the board manufacturing for Aiie v2...