Shortly after my last log, I wired up the 6502's NMI pin to a vsync-style signal in my video output circuit (it actually marks the end of the visible region rather than the sync pulse itself). I added some basic NMI handler code - the same as the IRQ/BRK handler, but simpler because there's no need to distinguish BRK from IRQ, and no need to read the BRK instruction's argument byte - and tried it out.
It worked first time, which was satisfying but not really a surprise: the basic task switching was already done through software interrupts, so it was pretty natural that it would also work from hardware interrupts.
Here's a video of this in action, and a screenshot for quick reference:
What it's doing
The 32K of RAM is divided into eight pages. Four of the pages correspond to the area of RAM that the video circuit displays on the monitor. The RAM is not cleared on startup, hence the weird random background pattern you see.
The supervisor takes page 7 (the bottom quarter of the screen) for its own video output, and displays a counter showing the number of context switches that have occurred. It also takes page 1 to use for its zero page, stack, and general memory.
Then it spawns two instances of a test program. Each gets a page of non-video memory (page 2 and page 3) for stack and general use, plus a page of video memory (pages 4 and 5). These processes just show their process ID, then sit in a tight loop printing their register contents onto the screen. The NMIs interrupt them and the supervisor schedules a new process to run, using a least-recently-used queue.
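To make the scheduling policy concrete, here's a minimal sketch of a least-recently-used queue. It's a Python model rather than the actual 6502 supervisor code, and the `Scheduler` class, its method names and the user PIDs 1 and 2 are all made up for illustration.

```python
from collections import deque

class Scheduler:
    """Least-recently-used run queue: whichever process ran longest ago runs next."""

    def __init__(self, pids):
        self.queue = deque(pids)  # front of the queue = least recently run
        self.switches = 0         # models the context-switch counter on screen

    def next_process(self):
        pid = self.queue.popleft()  # take the least recently run process
        self.queue.append(pid)      # it now becomes the most recently run
        self.switches += 1
        return pid

# Hypothetical PIDs for the two test program instances.
sched = Scheduler([1, 2])
order = [sched.next_process() for _ in range(5)]
print(order, sched.switches)  # [1, 2, 1, 2, 1] 5
```

With only two processes this degenerates into simple alternation, which matches what the test shows on screen.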
So what's the problem?
This works fine in most cases, but because NMIs are non-maskable, I really need to ensure that all my code - including all the supervisor code - is OK to be interrupted. Easier said than done! Let's go through some of the issues and resolutions:
1. My NMI and IRQ handlers were not reentrant
As I was initially just writing a BRK handler, I didn't make it support being re-entered - the supervisor will never call BRK itself anyway. It's much simpler and more efficient to save the registers to fixed memory locations than to push them onto the stack. Still, with hardware interrupts a possibility, this code needed to use the stack instead.
As a technical detail, neither the NMI handler nor the BRK handler will ever actually re-enter itself. The supervisor never issues a BRK, and the hardware triggers the NMI at a low enough frequency that it can't fire again before the supervisor has handed control back to a user process. The critical case is a different one: the NMI firing while the BRK handler is still running. That's the case that needs to be guarded against, so it would be sufficient to make just one of the two handlers reentrant.
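The difference between the two save strategies can be sketched with a toy model. This is Python, not the real handler code, and it only tracks a single register (A); the class and function names are hypothetical.

```python
class FixedSlotSave:
    """Handler entry stores A in one fixed memory location (not reentrant)."""

    def __init__(self):
        self.saved_a = None

    def enter(self, a):
        self.saved_a = a  # silently overwrites any earlier save

    def leave(self):
        return self.saved_a


class StackSave:
    """Handler entry pushes A onto the stack (safe to nest)."""

    def __init__(self):
        self.stack = []

    def enter(self, a):
        self.stack.append(a)

    def leave(self):
        return self.stack.pop()


def brk_interrupted_by_nmi(save, user_a, supervisor_a):
    """Return the A value the BRK handler restores if an NMI fires mid-handler."""
    save.enter(user_a)        # BRK handler saves the user's A
    save.enter(supervisor_a)  # NMI fires and saves the supervisor's A
    save.leave()              # NMI handler restores and returns
    return save.leave()       # BRK handler restores "the user's" A


print(hex(brk_interrupted_by_nmi(FixedSlotSave(), 0x42, 0x99)))  # 0x99
print(hex(brk_interrupted_by_nmi(StackSave(), 0x42, 0x99)))      # 0x42
```

With the fixed slot, the user process would get the supervisor's register value back; with the stack, the nested saves unwind in the right order.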
2. NMI can occur while the supervisor itself is active
As it's possible for an NMI to occur while a BRK is being handled, the NMI handler needs to cope with interrupting the supervisor. The general behaviour needs to be different, because the supervisor is not currently activated like a normal user process. In fact, given a way to detect that the supervisor is running, the easiest thing to do is just exit the NMI as quickly as possible without doing any damage. The only purpose of the NMI is to forcefully interrupt user processes.
The difficulty is reliably detecting that it was the supervisor rather than a user process that was interrupted. At the hardware level, I didn't build in a way to read back from the PID register. I tried shadowing it in software, but this was futile because it's not possible to update the shadow and the actual PID simultaneously. There will always be a gap where they don't match, and an NMI in that gap is impossible to handle cleanly.
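The gap can be shown by listing the states an NMI could observe during a switch. This is a simplified Python model that assumes one store updates the real PID register and a separate store updates the shadow; the constants and function names are invented for the example.

```python
SUPERVISOR, USER = 0, 1

def switch_states(shadow_first):
    """States (real_pid, shadow_pid) visible after each store when
    switching from the supervisor to a user process."""
    real, shadow = SUPERVISOR, SUPERVISOR
    order = ("shadow", "real") if shadow_first else ("real", "shadow")
    states = []
    for target in order:
        if target == "shadow":
            shadow = USER
        else:
            real = USER
        states.append((real, shadow))  # an NMI could arrive right here
    return states

def mismatch_windows(shadow_first):
    """States in which the shadow lies about the real PID."""
    return [s for s in switch_states(shadow_first) if s[0] != s[1]]

print(mismatch_windows(shadow_first=True))   # [(0, 1)]
print(mismatch_windows(shadow_first=False))  # [(1, 0)]
```

Whichever store comes first, there is one state where the two disagree, and the handler has no way to tell that state apart from a consistent one.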
3. NMI pushes data to the stack
The first thing an NMI does is push the return address to the stack, followed by the flags. This means that, at all times, the stack pointer must be valid, with nothing important stored below it. Sure, that seems a reasonable assumption... but when switching tasks, I need to adjust the stack pointer to match wherever the new process's (or supervisor's) stack pointer is meant to be, and I can't do that atomically in the same instruction that changes the PID register - it has to come either before it, or after it.
If I adjust the stack pointer before setting the PID register to a user process, then an NMI in the meantime would corrupt part of the supervisor's stack, as for one instruction the stack pointer would have the user process's SP value while the PID register is still set for the supervisor. On the other hand, if I set the PID first, then there will be a moment when an NMI would corrupt the user process's stack, as the supervisor's stack pointer would still be set in SP and the NMI would push flags etc to the wrong location in the user's stack.
Similar concerns apply when setting the PID back to the supervisor PID - the stack pointer can't be changed simultaneously with the PID register, and this can lead to corruption.
It's possible to work around some specific instances of this issue, e.g. when the supervisor's stack doesn't actually contain any precious data, but in general I think this issue can't be solved in software.
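Here's a rough model of the two orderings when resuming a user process. It's a sketch in Python under some simplifying assumptions: the supervisor and user stacks occupy the same address range and are swapped by the PID register, live data is everything strictly above a stack's own saved SP, and the NMI writes exactly three bytes. The function names and SP values are made up.

```python
def nmi_push_addresses(sp):
    """Stack offsets written by an NMI: three bytes (PCH, PCL, P),
    each stored at SP before SP is decremented."""
    return [(sp - i) & 0xFF for i in range(3)]

def corrupts(owner_sp, sp_register):
    """True if an NMI would overwrite live data in the currently mapped stack.
    Assumption: live data is everything strictly above the owner's own SP."""
    return any(addr > owner_sp for addr in nmi_push_addresses(sp_register))

def gap_is_safe(order, sup_sp, user_sp):
    """Check the one-instruction gap while switching supervisor -> user."""
    if order == "sp_first":
        # SP already holds the user's value, supervisor stack still mapped.
        return not corrupts(owner_sp=sup_sp, sp_register=user_sp)
    if order == "pid_first":
        # User stack already mapped, SP still holds the supervisor's value.
        return not corrupts(owner_sp=user_sp, sp_register=sup_sp)
    raise ValueError(order)

for sup_sp, user_sp in [(0xC0, 0xF0), (0xF0, 0xC0)]:
    for order in ("sp_first", "pid_first"):
        print(hex(sup_sp), hex(user_sp), order, gap_is_safe(order, sup_sp, user_sp))
```

In this model each ordering is safe for one of the two example pairs and unsafe for the other, so neither ordering is safe for every combination of stack depths. That fits with being able to work around specific instances but not the general case.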
So overall there are some pretty big flaws there. It's only in rare cases that they'd actually happen, but they can't just be ignored.
What can we do about them? Here are the main mitigations I've considered:
1. Mask the NMIs somehow
Maybe I can OR the NMI signal with a masking signal that I can control, allowing the supervisor to block NMIs at critical times. This would operate much like the SEI instruction for regular IRQs, but would be done in a way that user processes won't (in the long term) be able to access.
One interesting option is to mask them whenever the PID is set to zero. This would ensure that the supervisor is not interrupted except possibly at times when it has temporarily set a user PID (e.g. to write to the page table), and in those cases I could take necessary precautions.
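As a sanity check on the gating idea, here's a small logic-level sketch in Python. It assumes the vsync-style signal is active-low like the NMI pin itself, that the supervisor is PID 0, and it models the 6502's NMI as triggering on a falling edge; `nmi_pin` and `count_nmis` are hypothetical names.

```python
def nmi_pin(vsync_n, pid):
    """Level on the CPU's active-low NMI pin: the source signal ORed with a
    mask that is high whenever the PID register is zero (supervisor)."""
    mask = 1 if pid == 0 else 0
    return vsync_n | mask

def count_nmis(samples):
    """Count falling edges on the gated pin over a sequence of
    (vsync_n, pid) samples. The pin is assumed to start high."""
    count, prev = 0, 1
    for vsync_n, pid in samples:
        level = nmi_pin(vsync_n, pid)
        if prev == 1 and level == 0:
            count += 1
        prev = level
    return count

# Pulse arrives while a user process (PID 1) is running: one NMI.
print(count_nmis([(1, 1), (0, 1), (1, 1)]))          # 1
# Pulse arrives entirely while the supervisor is running: no NMI.
print(count_nmis([(1, 0), (0, 0), (1, 0)]))          # 0
# Supervisor hands over while the pulse is still low: the NMI arrives late.
print(count_nmis([(1, 0), (0, 0), (0, 1), (1, 1)]))  # 1
```

The last case is worth keeping in mind: in this model, unmasking while the source is still low produces a falling edge on the pin, so the interrupt is deferred rather than dropped.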
2. Drive the NMIs from a controllable timer
This is kind of similar in effect - rather than interrupting at 50Hz, I could wire the NMI up to a resettable timer, e.g. using a 6522. Then I could make it so that the timer gets reset on BRK, before fully entering the supervisor. The only purpose of the NMI is to ensure that user processes yield from time to time, and so long as they're calling BRK occasionally, there's no need to interrupt them further. If they're not, then the NMI will do its job.
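The effect of a resettable timer can be sketched like this. It's an abstract Python model, not a model of the 6522's actual registers or timer modes; `SliceTimer`, `run`, the event names and the period of 4 ticks are all invented for illustration.

```python
class SliceTimer:
    """Countdown timer that raises an NMI on expiry and then restarts."""

    def __init__(self, period):
        self.period = period
        self.remaining = period

    def tick(self):
        """Advance one time unit; return True if the timer expired."""
        self.remaining -= 1
        if self.remaining == 0:
            self.remaining = self.period
            return True
        return False

    def reset(self):
        """Restart the countdown, as the BRK handler would on entry."""
        self.remaining = self.period


def run(events, period):
    """Count NMIs for a sequence of 'work' (one tick) and 'brk' events."""
    timer = SliceTimer(period)
    nmis = 0
    for event in events:
        if event == "brk":
            timer.reset()
        elif timer.tick():
            nmis += 1
    return nmis


# A process that never yields gets interrupted every 4 ticks.
print(run(["work"] * 10, period=4))                  # 2
# A process that calls BRK every 3 ticks is never interrupted.
print(run((["work"] * 3 + ["brk"]) * 5, period=4))   # 0
```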
3. Use IRQ instead of NMI
This would ensure that the supervisor never gets interrupted, or at least allow me to control at what points it can get interrupted. The reason I didn't do it this way in the first place is that user processes can very well set the interrupt-disable flag themselves and never have to yield, and there's no good way to prevent that - it's much harder to guard against than the other protection measures I'm planning to take in later prototypes.
That said, though it's lower on the priority list, I do have a requirement listed to be able to forcibly terminate processes that are not yielding, and this would be necessary even with NMIs for cases where an illegal instruction has been executed. So I might be willing to overlook this aspect of "bad behaviour" for now, and deal with it later on.
4. Change the PID/pagetable architecture
Part of the problem here is the way the PID needs to be set in order to configure the pagetable. It already makes some aspects of memory protection harder, so it might be something to change.
If the supervisor didn't need to temporarily set the PID to write to the page table, then the only cases where it would need to write to the PID would be on resuming a user process, and in the interrupt handler when one is interrupted. In these cases the supervisor's stack is not precious, so the SP atomicity issue goes away.
Combined with an interface to read the PID, this could also allow the NMI handler to reliably detect when it's interrupting the supervisor, and stop. So it would seem possible that it could solve all the issues above.
The whole point of the prototyping process is to uncover these kinds of issues, so it's doing its job. The issues can all be fixed, but I haven't yet thought through what the right fixes are going to be. I might fix them in a revised Prototype 0b (maybe 0c?), or I might just learn from the problems and factor solutions into Prototype 1. It's possible that some of the techniques to implement memory protection in Prototype 1 will help with these issues, or force certain solutions. We'll have to wait and see!