There were very few 'high-resolution' modifications available for the TRS-80 Model I. Some were downright weird, such as the 80-GRAPHIX board, which worked through the character generator. One of the more sane ones was the HRG1B, which was produced in The Netherlands. It worked by OR'ing in video dots from a parallel memory buffer, and this buffer was sized at the TRS-80's native resolution of 384x192.
This was on my 'to-do' list of things to implement, but [harald-w-fischer] actually had one of these on hand from times long past. He also had manuals, some of the stock software, and some extensions he wrote himself! This material was needed to faithfully reproduce the behaviour, and it proved some of my initial assumptions false.
Since my emulator already has framebuffer(s) at the native resolution, my thinking was that this would be easy to implement. However, I was mistaken. It was mostly easy to implement, but there were several challenges.
At first I planned to directly set bits into the 'master' frame buffer(s) when writes to the HRG1B happened, but after disassembling the 'driver', it was clear this wouldn't work. For one, reads from the HRG buffer need to be supported, and they must not reflect the merging, so a separate backing buffer is needed. Second, the HRG supports a distinct on/off capability, so again the separate buffer is needed, and the question of when to merge the buffers arises. Lastly, the computational overhead of doing the merging on demand caused overload and required some legerdemain.
Whereas the Olimex Duinomite-Mini board (which Harald is using) can easily accommodate the 9K frame buffer required, my color UBW32-based board is just too memory constrained. The color boards already have 3 frame buffers, and this makes a fourth. Initial experiments were done on the Olimex, where the memory was not a problem, but I really wanted it to work on color, too.
The first experiment I tried was to just see if there would be enough time to merge the image during the vertical blanking interval. Since I am upscaling the TRS-80 native 384x192 to 768x576, and then padding with blanking around to get to 800x600, there is a bunch of vertical retrace time where I could do the buffer merging. Maybe the processor is fast enough to do it on-the-fly?
Well, no, it's not. The buffers are small (9K), the operation is simple (OR'ing the hi-res buffer onto the main buffer), and it is done a word at a time on word-aligned quantities, but it's not enough. The major problem is that the vertical blanking interval is driven in software by a state machine that is in turn driven by an Interrupt Service Routine (ISR) at the horizontal line frequency. The vertical state machine is mostly line counters for the vertical 'front porch', sync period, and 'back porch'. So, although the vertical blanking period is a long time in toto, it is composed of a bunch of short times. The naive implementation requires all the work to be done within a single horizontal line period. In my first iteration, I wound up dividing the work into chunklettes, spread out across several horizontal line intervals. I generated a test pattern, and this scheme appeared to work.
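To make the chunking idea concrete, here is a minimal sketch in C of spreading a full-frame merge across several horizontal-line ticks. It is not the emulator's actual code: the buffer names, the `merge_begin`/`merge_step` functions, and the chunk size of 96 words are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 12u                 /* 384 pixels / 32 bits per word */
#define LINES          192u
#define FRAME_WORDS    (WORDS_PER_LINE * LINES)
#define CHUNK_WORDS    96u                 /* assumed budget per line interval */

static uint32_t trs_buf[FRAME_WORDS];      /* character-generated image */
static uint32_t hrg_buf[FRAME_WORDS];      /* HRG overlay image */
static uint32_t master_buf[FRAME_WORDS];   /* merged image that gets displayed */

static size_t merge_pos = FRAME_WORDS;     /* FRAME_WORDS means "idle" */

/* Call once when vertical blanking begins. */
void merge_begin(void) { merge_pos = 0; }

/* Call from each horizontal-line tick during blanking.
   Returns 1 while work remains, 0 once the whole frame is merged. */
int merge_step(void)
{
    assert(merge_pos <= FRAME_WORDS);
    size_t end = merge_pos + CHUNK_WORDS;
    if (end > FRAME_WORDS) end = FRAME_WORDS;
    for (; merge_pos < end; ++merge_pos)
        master_buf[merge_pos] = trs_buf[merge_pos] | hrg_buf[merge_pos];
    return merge_pos < FRAME_WORDS;
}
```

With these assumed numbers the frame takes 2304 / 96 = 24 ticks to merge; the real budget depends on how many blanking lines are available and how much of each line the ISR can spare.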
Having received the HRG1B driver and sample apps from Harald, I naturally had to immediately disassemble them. I have to say, the driver is a little piece of art. The goal is to extend the existing BASIC with some new keywords, and what they chose to do was hook a routine that is sometimes used during the BASIC execution process (it is RST 10h, used to skip whitespace to the next token). The hook observes whether it is being called from one of the interesting places, and either leaves if not, or carries on with its shenanigans. The shenanigans consist of looking for a '#', which is the prefix for the HRG extensions. As implemented, the HRG repurposes the existing BASIC keywords SET, RESET, POINT, OPEN, CLOSE, LINE, CLS, and CLEAR. If those are prefixed with '#', then the HRG driver performs its alternative implementation, parsing a parameter list and executing the functionality. When it has finished consuming as much as it can, it returns to the regular routine. So, in effect, the HRG commands act like whitespace to the command interpreter, and the driver consumes and executes those commands. Reusing the existing BASIC tokens simplifies the parsing, since those have already been converted to a one-byte code, but this is not required. In fact, Harald had made his own extension to the driver which introduced a new command 'CIRCLE', which necessitates a full string compare of 'circle' since that is not a standard token. After my fun with disassembly, I returned to the task at hand.
The video code was quite a mess because it is a legacy of the Maximite code that Geoff Graham had given me permission to use, and upon which I had mercilessly hacked. In that project, there were a bunch of different video modes supported, and also composite (CVBS) video -- none of which were relevant here, and had been cut out and replaced with the single 800x600 mode with funky pixel doubling and line tripling needed for the TRS80. Since I was about to add more complexity, I decided to take a couple days and clean that code up for intelligibility. Mostly this came in the form of lucid comments, and explicit computation of timing parameters from the specs. Some came from discarding vestiges of features no longer realized (headless, TFT LCD, etc). There's more to do, but that module is now a little bit tidier.
I still felt bad about merging the entire frame buffer each vertical retrace time, and thought about optimising this to only be done on output (either to the main screen buffer or to the HRG buffer), though this would be major surgery. However, my desire for color support (or at least thoroughly understanding the limitations) drew me in a different direction.
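The "HRG commands look like whitespace" trick can be modeled in a few lines of C. This is only a sketch of the control flow described above, not a translation of the Z80 driver: the token values are placeholders (the real one-byte BASIC codes are not given here), the function names are invented, and the "execution" is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder one-byte token codes, NOT the real BASIC values. */
enum { TOK_SET = 0xF0, TOK_RESET = 0xF1, TOK_CLS = 0xF2 };

static int hrg_commands_run;    /* counts commands this sketch "executed" */

/* If p starts a '#'-prefixed command, consume it (parameters run up to
   ':' or the end of the line) and return the number of bytes consumed.
   Returns 0 if p does not start an HRG command. */
static size_t hrg_try_consume(const uint8_t *p)
{
    if (p[0] != '#') return 0;
    if (p[1] != TOK_SET && p[1] != TOK_RESET && p[1] != TOK_CLS) return 0;
    size_t n = 2;
    while (p[n] != '\0' && p[n] != ':') ++n;   /* skip the parameter list */
    ++hrg_commands_run;                        /* a real driver draws here */
    return n;
}

/* Stand-in for the hooked "skip whitespace to the next token" routine. */
const uint8_t *skip_to_next_token(const uint8_t *p)
{
    assert(p != NULL);
    while (*p == ' ') ++p;             /* the original job: skip blanks */
    return p + hrg_try_consume(p);     /* the hook: swallow a '#' command */
}
```

From the interpreter's point of view, a line containing `#SET(1,2):` followed by more statements resumes at the `:` as if nothing had been there.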
So, a component of the system, the 'hypervisor' (borrowing jargon from the virtual machine industry) is realized with a 'dialog manager' facility that I created. This dialog manager is similar to the way things are done in Windows, and you create a screen with a 'template' that specifies the location of 'controls' that are things like static text, buttons, check boxes, list boxes, edit fields. You provide a 'dialog procedure' that handles 'messages' sent from the controls when things happen. I implemented a text version of all those controls that works on the TRS80 screen. It's a development boon to have this facility, but I know that it's heavy on the use of malloc(). But how heavy is heavy? The malloc() implementation in the standard library does not expose methods to let you know how much heap you've used, or do a 'heapwalk' to inspect all the allocated objects. This is something I've long wanted, and now push came to shove.
So, I implemented my own heap manager. It implements malloc(), free(), realloc(), and has some diagnostic features where you can inquire how much free space there is, and also do a heapwalk through all blocks (allocated and unallocated) to scrutinize heap use closely. There's some magic in the GNU linker that gcc uses (the --wrap option) that can 'wrap' a function by giving it an alternative name. This is particularly useful, because other things that have already been compiled use malloc(), and I need them to be redirected to my malloc() -- not the one in the standard library. This is a handy feature, but specific to the GNU toolchain. Anyway, I discovered that the dialog manager uses a lot of memory. Something like 12 KiB for the first hypervisor dialog. A full video screen is 1 KiB, so it's certainly not just the content. What could be taking up all that space?
Looking at the code, I found one culprit: v-tables. The dialog system is designed in an object-oriented manner, as might be expected, but I had put the vtable of the controls as members of the objects, rather than a single pointer to a const table (that is in ROM). That was foolish of me to start with -- what was I thinking -- and a bit painful to fix, but that gave me about 5K back right away. Half a frame buffer! There are more improvements I think I can make there (e.g. copy-on-write for text values -- why copy it to RAM if it never changes), but I'm past my RAM crisis for color support for the moment.
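For readers who have not used linker wrapping: with GNU ld, passing `--wrap=malloc` makes every reference to `malloc` resolve to `__wrap_malloc` instead. The sketch below shows a tiny first-fit heap with a heapwalk behind those names. It is an illustration under assumptions, not the author's heap manager: the arena size, the `unit_t` layout, and `heap_walk` are all invented here, and realloc() is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* One allocation unit; headers and payloads are both measured in units. */
typedef union unit {
    struct { size_t units; int used; } h;  /* header: payload length, state */
    max_align_t align;                     /* forces worst-case alignment */
} unit_t;

#define HEAP_UNITS 256u                    /* arbitrary size for the sketch */
static unit_t heap[HEAP_UNITS];
static int heap_ready;

static void heap_init(void)
{
    heap[0].h.units = HEAP_UNITS - 1;      /* one big free block */
    heap[0].h.used = 0;
    heap_ready = 1;
}

void *__wrap_malloc(size_t bytes)
{
    if (!heap_ready) heap_init();
    if (bytes > (HEAP_UNITS - 1) * sizeof(unit_t)) return NULL;
    size_t need = (bytes + sizeof(unit_t) - 1) / sizeof(unit_t);
    if (need == 0) need = 1;
    for (size_t i = 0; i < HEAP_UNITS; i += heap[i].h.units + 1) {
        if (heap[i].h.used || heap[i].h.units < need) continue;
        if (heap[i].h.units > need) {      /* split off the remainder */
            size_t rest = i + need + 1;
            heap[rest].h.units = heap[i].h.units - need - 1;
            heap[rest].h.used = 0;
            heap[i].h.units = need;
        }
        heap[i].h.used = 1;
        return &heap[i + 1];
    }
    return NULL;
}

void __wrap_free(void *p)
{
    if (p == NULL) return;
    unit_t *hdr = (unit_t *)p - 1;
    assert(hdr >= heap && hdr < heap + HEAP_UNITS && hdr->h.used);
    hdr->h.used = 0;
    for (size_t i = 0; i < HEAP_UNITS; ) { /* coalesce adjacent free blocks */
        size_t next = i + heap[i].h.units + 1;
        if (!heap[i].h.used && next < HEAP_UNITS && !heap[next].h.used)
            heap[i].h.units += heap[next].h.units + 1;
        else
            i = next;
    }
}

typedef void (*heap_visit_fn)(size_t payload_bytes, int used, void *ctx);

/* Visit every block, allocated or free; returns total free payload bytes. */
size_t heap_walk(heap_visit_fn visit, void *ctx)
{
    if (!heap_ready) heap_init();
    size_t free_bytes = 0;
    for (size_t i = 0; i < HEAP_UNITS; i += heap[i].h.units + 1) {
        size_t bytes = heap[i].h.units * sizeof(unit_t);
        if (!heap[i].h.used) free_bytes += bytes;
        if (visit) visit(bytes, heap[i].h.used, ctx);
    }
    return free_bytes;
}
```

Linking with something like `-Wl,--wrap=malloc -Wl,--wrap=free` redirects calls from already-compiled objects as well; calloc() and realloc() would need the same treatment in a complete version.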
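The difference between the two vtable layouts is easy to see in C. The struct and function names below are hypothetical (the dialog manager's real types are not shown in this log), and the four-entry table is just an example size.

```c
#include <assert.h>
#include <stddef.h>

struct control;

typedef struct control_vtbl {
    void (*paint)(struct control *self);
    void (*on_key)(struct control *self, int key);
    void (*on_focus)(struct control *self, int gained);
    void (*destroy)(struct control *self);
} control_vtbl;

/* Before: every object carries its own copy of all the function pointers. */
typedef struct control_fat {
    control_vtbl vtbl;
    int x, y, w, h;
} control_fat;

/* After: every object carries one pointer to a shared const table,
   which the linker can place in flash rather than RAM. */
typedef struct control {
    const control_vtbl *vtbl;
    int x, y, w, h;
} control;

static int paint_calls;
static void button_paint(control *self)             { (void)self; ++paint_calls; }
static void button_on_key(control *self, int key)   { (void)self; (void)key; }
static void button_on_focus(control *self, int got) { (void)self; (void)got; }
static void button_destroy(control *self)           { (void)self; }

static const control_vtbl button_vtbl = {
    button_paint, button_on_key, button_on_focus, button_destroy
};

/* Dispatch goes through the shared table. */
void control_paint(control *c)
{
    assert(c != NULL && c->vtbl != NULL);
    c->vtbl->paint(c);
}
```

Each object shrinks by the size of the table minus one pointer, so the saving scales with both the number of methods and the number of controls on a dialog.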
Returning to the HRG implementation, I have a new new idea: merge on a raster-by-raster basis. I.e., there is no master buffer, but rather there is the TRS-80 buffer, the HRG buffer, and just a single raster line buffer for the current raster line. Only a single line needs to be merged (prior to display), and no sophisticated (and bug-prone) code for 'optimized' merging needs to be implemented, and the implementation of HRG on/off becomes trivial (you either merge TRS80 and HRG, or just copy TRS80). But... it all needs to be completed during a few clock cycles at the beginning of the horizontal interrupt period.
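A minimal sketch of the per-raster-line merge, assuming 12 words of 32 bits per line and invented names (`raster_line`, `build_raster_line`). The point is how small the on/off logic becomes: either OR the two sources, or copy just one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORDS_PER_LINE 12u      /* 384 pixels / 32 bits per word */

static uint32_t raster_line[WORDS_PER_LINE];   /* the only merged storage */

/* Build the line about to be displayed from one line of each source.
   trs and hrg point at the first word of the matching line. */
const uint32_t *build_raster_line(const uint32_t *trs,
                                  const uint32_t *hrg, int hrg_on)
{
    assert(trs != NULL && (hrg != NULL || !hrg_on));
    if (hrg_on) {
        for (size_t i = 0; i < WORDS_PER_LINE; ++i)
            raster_line[i] = trs[i] | hrg[i];
    } else {
        memcpy(raster_line, trs, sizeof raster_line);
    }
    return raster_line;
}
```

Because neither source buffer is ever modified, reads from the HRG buffer naturally return unmerged data.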
At first, this seems plausible, because a horizontal line is just 12 words long. But there's still not a lot of time to dawdle, because there are just 256 CPU clocks from the moment that the interrupt is stimulated (by timer 3 maximum count being reached) before raster data is needed to be available for shifting out via SPI and DMA. Some of those cycles have to go into state machine logic and also to setting up the peripherals before the trigger comes (via output comparator that ends the horizontal sync pulse). 'Time' will tell.
As before, I have enough time to keep up on the monochrome Olimex board, but it's not quite enough for color. Looking at the screen in the failing color mode, it looks like it is almost there, but just shy of having enough time. Sigh. Maybe some assembler will help. At the least, I unrolled the loops, since it's just 12 sequential '*x++ = *y++ | *z++' operations.
As a side benefit of this approach, a thing that had vexed me for some time is effectively solved: the hardware is such that the rising edge of the horizontal sync pulse is what triggers the SPI/DMA to start shifting out dots. The problem is that this is not per-spec. The spec requires some 'horizontal sync back porch', which is some quiet time before the pixel data comes. In CVBS signals, this is where the color burst goes, but in VGA it's just a quiet time. This quiet time is simulated by left-padding the horizontal lines in the screen buffers with dummy bytes. This has always irritated me, because that space adds up: 192*4 = 768 bytes per buffer, and 4*768 = 3072 bytes for color with HRG (three color planes plus the HRG buffer). I want that 3K back! But by doing this raster line merging, only the raster buffer needs to have the padding, so that's a 3K savings in RAM. It's the little things in life....
Anyway, one morning I awoke with another distributed work solution to my color problem: only merge half the raster line, then set up SPI/DMA, then merge the other half. Merging only half first leaves enough time to set up SPI/DMA before the trigger comes in from the output comparator that starts shifting out dots, while at the same time providing enough work for the SPI/DMA to keep them busy while we finish up the second half of the raster line. This wound up being not too tricky to implement, and worked fine for both monochrome and color.
Now, I had to finish the hardware emulation of the API used to interface with the HRG board. As mentioned, Harald Fischer had a physical unit, and also he had the driver software. So I disassembled it. He also had some scans of the manual, which he sent as well. I'm so glad he did, because the API is a bit byzantine. First, the addressing scheme is a 16-bit quantity, but it's not linear. Rather, the 16-bit 'address' is actually a bitfield of character row, column, and raster line within the character cell. Second, each memory location is 6 bits, not 8. In retrospect, I'm not so surprised - the board is implemented with SSI TTL and the 6 bits reflect the 6-pixel-wide characters on the TRS-80, and I'm sure that it was convenient from the standpoint of tapping signal lines to keep that structure (yes; many hardware upgrades in those days required cutting traces and soldering. Plug-in cards are for babies!). But, I now have to emulate that addressing scenario and do a bunch of masking and bit-shifting. My favorite.
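To give a feel for the bitfield addressing, here is a decoding sketch in C. The exact bit positions are NOT taken from the HRG1B manual; the layout below (column in bits 0-5, row in bits 6-9, raster line in bits 10-13) is purely an assumed example. What is fixed by the TRS-80 screen geometry is the ranges: 64 columns, 16 rows, and 12 raster lines per 6-pixel-wide cell, giving 384x192.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned col;    /* character column, 0..63 */
    unsigned row;    /* character row, 0..15 */
    unsigned line;   /* raster line within the cell, 0..11 */
} hrg_addr;

/* Decode an address using the ASSUMED layout described above.
   Returns 1 on success, 0 if the raster line field is out of range. */
int hrg_decode(uint16_t addr, hrg_addr *out)
{
    assert(out != NULL);
    out->col  = addr & 0x3Fu;
    out->row  = (addr >> 6) & 0x0Fu;
    out->line = (addr >> 10) & 0x0Fu;
    return out->line < 12u;
}

/* Pixel coordinates of the leftmost dot of the addressed 6-bit group. */
unsigned hrg_pixel_x(const hrg_addr *a) { return a->col * 6u; }
unsigned hrg_pixel_y(const hrg_addr *a) { return a->row * 12u + a->line; }
```

Under this assumed layout, the last valid location (column 63, row 15, line 11) maps to pixel (378, 191), the final 6-dot group of the bottom scan line.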
Knowing that this would be a brain-melter, I put it off until I had a morning to come into it fresh, ready to formulate test cases for boundary conditions, and code up a bunch of bit masking and shifting, dealing with 6-bit-bytes that may or may not straddle a 32-bit word boundary. About 4 hours later, I had gotten it wired in, and ran a test program that came with the board (again, thanks to Harald's provision). This is a program called 'elips', which draws a series of concentric ellipses with ever widening minor axes.
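The straddling case is the fiddly part, so here is a sketch of reading and writing a 6-bit group within a 12-word line. It assumes pixel 0 of the line is the most significant bit of word 0 (matching an MSB-first shift register) and that the leftmost dot of the group is bit 5 of the value; both are assumptions for illustration, as are the names `put6` and `get6`.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_LINE 12u
#define PIXELS_PER_LINE (WORDS_PER_LINE * 32u)

/* Store 6 dots starting at pixel x (0..378). */
void put6(uint32_t *line, unsigned x, uint8_t bits)
{
    assert(x + 6u <= PIXELS_PER_LINE);
    unsigned word = x / 32u, off = x % 32u;   /* off counts from the MSB */
    uint32_t v = bits & 0x3Fu;
    const uint32_t mask = 0x3Fu;
    if (off <= 26u) {                         /* fits in one word */
        unsigned sh = 26u - off;
        line[word] = (line[word] & ~(mask << sh)) | (v << sh);
    } else {                                  /* straddles two words */
        unsigned spill = off - 26u;           /* dots landing in next word */
        unsigned sh = 32u - spill;
        line[word]     = (line[word] & ~(mask >> spill)) | (v >> spill);
        line[word + 1] = (line[word + 1] & ~(mask << sh)) | (v << sh);
    }
}

/* Fetch 6 dots starting at pixel x (0..378). */
uint8_t get6(const uint32_t *line, unsigned x)
{
    assert(x + 6u <= PIXELS_PER_LINE);
    unsigned word = x / 32u, off = x % 32u;
    if (off <= 26u)
        return (uint8_t)((line[word] >> (26u - off)) & 0x3Fu);
    unsigned spill = off - 26u;
    return (uint8_t)(((line[word] << spill) |
                      (line[word + 1] >> (32u - spill))) & 0x3Fu);
}
```

Since 32 is not a multiple of 6, groups starting at pixels such as 30 or 60 cross a word boundary, which is exactly where boundary-condition tests earn their keep.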
Hmm. Something's not quite right. Oh! The SPI on the PIC32 is MSB first, but the HRG1B is LSB first. I forgot this. When I was doing the character generator ROM, I preconditioned that data to be pre-reversed to accommodate. However, in this case I've got to do the bit reversal in software and then also map in reverse into the words of the raster line. sigh; more brain-melting.
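Reversing the order of a 6-bit group is a small operation on its own. A straightforward sketch in C (the name `reverse6` is made up; a 64-entry lookup table would be a reasonable alternative when cycles are tight):

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the low 6 bits of v: bit 0 swaps with bit 5, 1 with 4, 2 with 3. */
uint8_t reverse6(uint8_t v)
{
    assert((v & 0xC0u) == 0);        /* only 6-bit values are meaningful */
    uint8_t r = 0;
    for (int i = 0; i < 6; ++i) {
        r = (uint8_t)((r << 1) | (v & 1u));
        v >>= 1;
    }
    return r;
}
```

For example, 0x0B (binary 001011) becomes 0x34 (binary 110100).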
After about three more hours I had managed to get the bit reversals, shifts, and masks right.
9 minutes, 42 seconds. 36 ellipses.
And for fun, once I figured out meaningful parameters:
Hope you've got a little time for that one; something like 15 minutes. (-1, 45, 30 turns off hidden line removal, and takes about a third the time)