
More thoughts on the project

A project log for Improved Gigatron

We can make a much more powerful Gigatron by adding hardware such as a framebuffer, hardware syncs, PSG sound, and a coprocessor.

Courtney 08/10/2021 at 09:22

I still haven't started my Verilog CPU project. I've been wanting to build upon the Gigatron using a Digilent Cmod A7-35T FPGA board. I've gone through various ideas in my mind and have discarded many. I'd like to make something that can run vCPU code, but with a different architecture. I'm looking for pointers, comments, and thoughts on the various aspects and considerations below. I ask that I not be condescended to or told the obvious.

Separating I/O from the CPU using DMA -- One of the first changes I'd like to make would be to use some variety of DMA to read the frame buffer directly and honor the indirection table. The indirection table system is nice since it eliminates the need for blitting when you want special effects (virtualization is faster than copying entire lines). Since the video is 160 bytes wide and the X register addresses 256 bytes, you can change the indirection table to side-scroll one or more lines. Racer uses that for side-scrolling. If you don't want much of a background on the sides, you can take advantage of the register wrap-around and use the same background for both sides. So I'd want my controller to be table-aware. The sound, keyboard, and likely other I/O will also be done in hardware so everything can have direct access to the syncs and thus avoid hardware race conditions. There could be ways to mitigate software race conditions, such as creating a CPU halt line. Line quadding will be done in hardware, possibly using a BRAM buffer, so the SRAM is only read once for every 4 native scan lines.
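Here is a minimal Verilog sketch of that controller, assuming a simple request/ack handshake toward the memory arbiter and a two-byte-per-line indirection table at a placeholder address; the names, widths, and table layout are my own assumptions, not a finished design:

```verilog
// Sketch only: a table-aware line DMA that reads each virtual scanline from
// SRAM once, into a BRAM line buffer, and replays it for all four native
// scanlines of the quad.  Table layout, widths, and handshake are assumed.
module line_dma (
    input  wire        clk,
    input  wire        start_of_quad,   // pulse at the first of 4 native lines
    input  wire [6:0]  virt_line,       // virtual line number, 0..119
    // shared SRAM port (through the arbiter)
    output reg  [15:0] sram_addr,
    output reg         sram_req,
    input  wire [7:0]  sram_rdata,
    input  wire        sram_ack,
    // pixel replay side
    input  wire [7:0]  pixel_x,         // 0..159 during active video
    output reg  [5:0]  pixel_out        // 2:2:2 RGB
);
    localparam IDLE = 2'd0, RD_PAGE = 2'd1, RD_OFFS = 2'd2, RD_PIX = 2'd3;
    reg [1:0] state = IDLE;
    reg [7:0] line_page;                // indirection table entry: page...
    reg [7:0] line_offs;                // ...and starting X offset
    reg [7:0] fetch_x;
    reg [5:0] line_buf [0:159];         // BRAM line buffer

    // registered read so the buffer maps onto block RAM
    always @(posedge clk)
        pixel_out <= line_buf[pixel_x];

    always @(posedge clk) begin
        case (state)
            IDLE: if (start_of_quad) begin
                // table assumed at 0x0100, two bytes per virtual line
                sram_addr <= 16'h0100 + {8'd0, virt_line, 1'b0};
                sram_req  <= 1'b1;
                state     <= RD_PAGE;
            end
            RD_PAGE: if (sram_ack) begin
                line_page <= sram_rdata;
                sram_addr <= sram_addr + 16'd1;
                state     <= RD_OFFS;
            end
            RD_OFFS: if (sram_ack) begin
                line_offs <= sram_rdata;
                fetch_x   <= 8'd0;
                sram_addr <= {line_page, sram_rdata};
                state     <= RD_PIX;
            end
            RD_PIX: if (sram_ack) begin
                line_buf[fetch_x] <= sram_rdata[5:0];
                // 8-bit add wraps exactly like the X register, so table-based
                // side-scrolling keeps working for free
                sram_addr <= {line_page, line_offs + fetch_x + 8'd1};
                fetch_x   <= fetch_x + 8'd1;
                if (fetch_x == 8'd159) begin
                    sram_req <= 1'b0;
                    state    <= IDLE;
                end
            end
        endcase
    end
endmodule
```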

Then, of course, moving only the video to hardware introduces a new problem: everything else needs to be synced with the video controller. In the Gigatron proper, the ROM bit-bangs the video and syncs, and the user code is aware of the syncs. Two different ideas come to mind. One is to hang the other controllers off the video controller. Since it will be the DMA "bus master," it could take over the software-generated sound and do it the same way, and it would know when it is safe to poll the keyboard. Concurrent DMA would then be used not only for the video controller, but also for sound, keyboard, and possibly other I/O.

One issue I see with moving everything to DMA (and using the video controller as the bus master and arbiter) is software races. Doing all the I/O in hardware like that would remove any chance of hardware race conditions, but it could cause software races. So I'd likely need to find a way to add halting or spinlocks. For instance, to run legacy Gigatron vCPU code, I'd likely want a mode that pauses the CPU while the lines are drawn. That should make the timing similar enough to prevent races and allow original vCPU code to work as expected.

Of course, using interrupts could be another way to route the signals at the correct time and provide the needed code to service the ports when they are active. That would allow for more complex I/O than outlined above and make it easier to add other things. For instance, I could have a couple of interrupt-driven SPI ports. Interrupts are not a part of the SPI standard, nor are they forbidden. Discretion is left to the designer.

6-bit sound -- Since Blinkenlights are not planned, I might as well implement 6-bit sound. The custom hardware would work during the DMA time and would use the same memory locations as now. I'd like to try to mux this onto the color lines. Demuxing might be done using 2 multiplexer sets to decide what gets selected and what gets blanked or muted. Hopefully, interference won't be a problem. I'm unsure about the best way to do this; mainly I'd like to try it to save GPIO pins.
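As a rough sketch of the FPGA half of that idea (the external demux and the analog side are not shown, and all signal names are assumptions), the six color pins could simply carry the sound sample whenever the video is blanked, with a strobe telling the external latch when to grab it:

```verilog
// Sketch only: time-multiplexing the 6-bit sound sample onto the six color
// pins while the video is blanked, with a strobe so an external latch/demux
// (the "2 multiplexer sets" above) can route the pins to the audio DAC.
module color_sound_mux (
    input  wire       clk,
    input  wire       active_video,   // high while real pixels are being sent
    input  wire [5:0] pixel_rgb,      // 2:2:2 RGB from the video controller
    input  wire [5:0] sound_sample,   // 6-bit sample from the sound DMA / PSG
    output reg  [5:0] color_pins,     // the six shared physical pins
    output reg        sound_strobe    // external demux latches audio when high
);
    always @(posedge clk) begin
        if (active_video) begin
            color_pins   <= pixel_rgb;     // normal video
            sound_strobe <= 1'b0;
        end else begin
            color_pins   <= sound_sample;  // blanked region carries audio instead
            sound_strobe <= 1'b1;
        end
    end
endmodule
```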

Advanced memory unit/arbiter -- This unit should drive the 10 ns SRAM, present it synchronously, expose 2-3 "ports," and allow 16-bit SRAM transfers using the best available method (do them during the next cycle, during an unused video cycle or sync time, or halt the CPU). Its job would be to provide an abstraction between the RAM and every device that uses it.

I'm a bit unsure how to do this in an FPGA, and unsure how fast I want to run the CPU. The Gigatron does everything at 6.25 MHz. Decoupling the video from the CPU could allow it to go faster. The SRAM on the FPGA board is 10 ns, so it could theoretically run at up to 100 MHz, while the onboard crystal is 12 MHz. So the arbiter could be clocked at 100 MHz while the video runs at 6.25 (or 12.5, 25, or even 25.1xx) and the CPU at 6.25 (up to 25 or so). It seems the memory arbiter could then be a 3-channel, round-robin loop.

Writes seem to be the easiest: since they cross clock domains, they can simply be registered. Maybe there could be a semaphore/flag to let the memory unit know when a request is made, though the /WE line should be able to do that. Reads seem to be the hardest part. If you register the request and assert the /OE line, you have the turnaround period before the result lands in its respective register. I guess the output from RAM would not need to be registered, or, if it is registered, maybe I could sample from both ends of the register or stretch the slower clock until it is read. A combinational/async read would be faster, but I'm not sure how to guarantee it is sampled before the arbiter moves on to something else. Now, if clock-stretching is needed, perhaps it would be better to derive everything manually from counters running off the 100 MHz clock without messing with BUFGs, etc. Thus one of the counters could be decremented to add 10 ns at a time to the slower CPU clock until the result arrives at the register clocked at the faster speed.
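For what it's worth, here is a sketch of the round-robin arbiter as I read the idea: clocked at 100 MHz, one channel served per turn, reads registered on the following edge (so whether the slower CPU clock then needs stretching until its ack arrives is exactly the open question above). Channel names and the handshake are assumptions, and a real design would also gate /WE so it falls only after the address is stable:

```verilog
// Sketch only: a 3-port round-robin arbiter for 10 ns async SRAM at 100 MHz.
// Requests are assumed to be synchronized into this clock domain already;
// each read costs two arbiter cycles (address one cycle, capture the next).
module sram_arbiter (
    input  wire        clk100,
    // channel 0 = CPU, 1 = video/I-O DMA, 2 = spare
    input  wire        req0, we0,
    input  wire [15:0] addr0,
    input  wire [7:0]  wdata0,
    input  wire        req1, we1,
    input  wire [15:0] addr1,
    input  wire [7:0]  wdata1,
    input  wire        req2, we2,
    input  wire [15:0] addr2,
    input  wire [7:0]  wdata2,
    output reg  [7:0]  rdata0, rdata1, rdata2,
    output reg         ack0, ack1, ack2,
    // external 10 ns async SRAM
    output reg  [15:0] sram_a,
    inout  wire [7:0]  sram_d,
    output reg         sram_we_n,
    output reg         sram_oe_n
);
    reg [1:0] grant   = 2'd0;   // whose turn it is
    reg [1:0] pending = 2'd3;   // channel waiting for read data (3 = none)
    reg [7:0] wdata_q;

    // drive the data bus only during writes
    assign sram_d = sram_we_n ? 8'bz : wdata_q;

    always @(posedge clk100) begin
        {ack0, ack1, ack2} <= 3'b000;

        // capture read data for the channel granted on the previous edge
        if (pending != 2'd3) begin
            case (pending)
                2'd0: begin rdata0 <= sram_d; ack0 <= 1'b1; end
                2'd1: begin rdata1 <= sram_d; ack1 <= 1'b1; end
                2'd2: begin rdata2 <= sram_d; ack2 <= 1'b1; end
            endcase
            pending <= 2'd3;
        end

        // rotate the grant every cycle; serve the granted channel if it asks
        grant     <= (grant == 2'd2) ? 2'd0 : grant + 2'd1;
        sram_we_n <= 1'b1;
        sram_oe_n <= 1'b1;
        case (grant)
            2'd0: if (req0) begin
                sram_a <= addr0; wdata_q <= wdata0;
                sram_we_n <= ~we0; sram_oe_n <= we0;
                if (we0) ack0 <= 1'b1; else pending <= 2'd0;
            end
            2'd1: if (req1) begin
                sram_a <= addr1; wdata_q <= wdata1;
                sram_we_n <= ~we1; sram_oe_n <= we1;
                if (we1) ack1 <= 1'b1; else pending <= 2'd1;
            end
            2'd2: if (req2) begin
                sram_a <= addr2; wdata_q <= wdata2;
                sram_we_n <= ~we2; sram_oe_n <= we2;
                if (we2) ack2 <= 1'b1; else pending <= 2'd2;
            end
        endcase
    end
endmodule
```

Whether each slow-domain requester then busy-waits on its ack or has its clock stretched by the counter scheme above is left open; the arbiter itself doesn't care.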

Watchdog/snooper unit and halt line -- To avoid software races caused by having all the I/O done in hardware, there likely needs to be a halt line to pause the CPU. In a naive compatibility mode, one could assert the halt line during active display time. Even then, things would be slightly faster than a regular Gigatron in Mode 4, because dedicated hardware now handles things that currently consume processing time. If there is a hardware RNG, cycles are not spent creating random numbers. Not having Blinkenlights saves cycles. Hardware sound saves cycles.

However, being less naive, it could snoop the address lines to know when I/O is being updated and selectively halt for so many active scanlines. There is precedent for this in things such as Apple accelerator cards, or the 100 MHz FPGA 6502 board that does everything internally at 100 MHz, shadows the entire ROM and RAM into BRAM, and does board traffic at bus speed. For I/O-region writes, it writes to both BRAM and DRAM. I'm not sure about I/O-region reads, but I guess it would write to the BRAM as it uses the data. So if the sound registers, frame buffer, indirection table, or other I/O areas on the Giga-similar machine are written to, the watchdog/snooper can selectively halt the CPU.
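One possible reading of that snooper, as a hedged sketch: watch CPU writes, compare them against placeholder region boundaries (not the real Gigatron memory map), and hold the halt line during active video for a few scanlines after a hit:

```verilog
// Sketch only: address snooper and halt generator.  Region addresses and
// the hold-off length are placeholder assumptions, NOT the real memory map.
module write_snooper (
    input  wire        clk,
    input  wire        cpu_wr,           // CPU writes RAM this cycle
    input  wire [15:0] cpu_addr,
    input  wire        in_active_video,  // high during visible pixels
    input  wire        scanline_start,   // pulse at the start of each active line
    output wire        cpu_halt          // pause the CPU while asserted
);
    localparam HOLD_LINES = 4;           // how many lines to stay armed (assumed)

    // crude region decode -- placeholders only
    wire hit_table  = (cpu_addr[15:8] == 8'h01);   // indirection table page
    wire hit_screen = (cpu_addr[15:8] >= 8'h08);   // frame buffer pages
    wire hit_sound  = (cpu_addr[15:8] == 8'h00) && (cpu_addr[7:0] >= 8'hFA); // sound vars
    wire io_write   = cpu_wr && (hit_table || hit_screen || hit_sound);

    reg [2:0] hold = 3'd0;
    assign cpu_halt = (hold != 3'd0) && in_active_video;  // only stall during pixels

    always @(posedge clk) begin
        if (io_write)
            hold <= HOLD_LINES;                    // (re)arm the hold-off
        else if (scanline_start && hold != 3'd0)
            hold <= hold - 3'd1;                   // count down once per line
    end
endmodule
```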

"Microcode" store(s) -- This is to make it easier to do vCPU operations and allow the designer to arbitrarily assign them. It could have its own PC to not disturb the main native one, and jumps could be in relation to this PC. There could be an opcode to execute the vCPU opcode located at vPC with the microcode loading any operands.

In a way, this could count as microcode, since the LUT I'd likely use in place of the hardwired control unit would contain picocode. In technical parlance, microcode runs instructions, whereas picocode deals with the lowest level of controlling I/O and ALU functions to make instructions. I would use LUTs for both.

I don't know how to do this, but it would be nice to have an immediate version that uses the registers so native code can execute these. Maybe the CU can alias the registers as vPC[?] to be able to use a single microcode store for that and the main vCPU execution instruction.

One beauty of using a microcode store this way is that, if I wanted to, I could access extra native instructions, since BRAM is 9 bits wide. Things like a "return to main PC" instruction or a "jump to a different vCPU opcode" instruction might help. Or, if there is a secondary unit, the extra bit could mark instructions that target the secondary instruction set, if the Secondary Execution Unit idea below is used.

But I don't know how to handle syscalls. It makes little sense to jump from one microcode store to another when execution could just go there directly, and the indirection would take an extra cycle. Then again, I could make that part of the instruction encoding, so that calling the syscall instruction selects that code store instead.

A question: how many native instructions per vCPU/v6502 instruction would be good? That is, how much space should be reserved per vCPU/v6502 opcode (in powers of 2)? I ask since the natural way to address the store would be to use the low bits for the local address, the middle bits for the opcode number, and any bits above that to select which microcode store (such as 2 bits to cover 2 CPUs and a system-call store, with address space left over for one more).
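To make the question concrete, here is the address split I have in mind, sketched in Verilog with 8 slots per opcode as a purely illustrative value:

```verilog
// Sketch only: one way to slice the microcode-store address as asked above.
// Low bits are the micro-PC within an opcode's slot, middle bits are the
// vCPU/v6502 opcode, top bits pick the store (vCPU, v6502, syscalls, spare).
// SLOT_BITS = 3 (8 native instructions per opcode) is only an example value.
module ucode_addr #(
    parameter SLOT_BITS = 3                    // 2**SLOT_BITS native slots per opcode
)(
    input  wire [1:0]               store_sel, // 0=vCPU, 1=v6502, 2=syscalls, 3=spare
    input  wire [7:0]               opcode,    // byte fetched from [vPC]
    input  wire [SLOT_BITS-1:0]     upc,       // micro-PC, local to this opcode
    output wire [2+8+SLOT_BITS-1:0] addr       // address into the 9-bit-wide BRAM store
);
    assign addr = {store_sel, opcode, upc};
endmodule
```

With those example widths that is a 13-bit address, i.e. 8K microinstructions; at 9 bits wide that is about 72K bits, or two of the Artix-7's 36Kb block RAMs.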

New native instructions -- These could include 16-bit memory instructions, additional registers, additional memory modes, hardware shifts, a carry flag and instructions that use it, and a single-cycle RNG instruction. It would be nice if the vCPU had actual registers. On the RNG, maybe it could work via a register (AC?) and via RAM for compatibility. A dedicated hardware RNG would also make up for the change that comes from user code no longer being confined to the porches.
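For the RNG part, a free-running LFSR is the obvious single-cycle candidate; a hypothetical RND instruction would just copy its output into AC, and the same byte could be shadowed into the RAM location the current software RNG uses. A sketch, using a standard maximal-length 16-bit polynomial:

```verilog
// Sketch only: a free-running LFSR as the single-cycle hardware RNG.
// Polynomial x^16 + x^14 + x^13 + x^11 + 1 (maximal length).
module hw_rng (
    input  wire       clk,
    output wire [7:0] rng_byte
);
    reg  [15:0] lfsr = 16'hACE1;   // any nonzero seed
    wire        fb   = lfsr[15] ^ lfsr[13] ^ lfsr[12] ^ lfsr[10];

    always @(posedge clk)
        lfsr <= {lfsr[14:0], fb};  // shift in the feedback bit every cycle

    assign rng_byte = lfsr[7:0];   // what a hypothetical RND would copy into AC
endmodule
```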

I'd want to add more usable opcodes to make the emulator more efficient. For instance, a right-shift, a true left-shift, and maybe flags and related conditionals. Another register or two would be helpful, as well as additional access modes. Some 16-bit ops would be nice, though finding time for a 16-bit memory op would be a challenge. Since this would still be an 8/16 design, 16-bit transfers would take 2 cycles. If the following instruction doesn't use RAM, that slot could be used. With a memory-arbiter approach, three slots could be designed so that there is 1 slot for the CPU, 1 for I/O, and a spare. And if worse comes to worst, clock stretching or halting could be employed.

What native instructions would you find most helpful for improving code density?

Secondary Execution Unit -- It would be nice to use the operand space for additional instructions when encountering instructions that take no operands. But what should those instructions be? What instruction pairs could run at the same time and speed up the vCPU? The secondary EU would need its own instruction set. I would use 0 for its NOP, and it likely should not have jump instructions; instead of jumps, it should have predicated instructions. It probably should have no port instructions, except maybe for an optional separate port.

In designing a secondary unit, I'd probably want to triple-port the data registers, both so the two units can share the same register file and to help reduce the critical path. That way the second byte can be available as an operand and as an instruction at the same time, and be decoded as an instruction regardless, with the secondary unit's ALU gated off when it isn't one. Since I'd probably use a BRAM "ROM"-based decoder, it would likely save time to always decode the byte and have a line from the decoder determine whether the secondary ALU commits what was decoded.
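A sketch of just that gating logic, with a completely made-up encoding for the secondary instruction byte (2 predicate bits, 3 operation bits, 3 register bits, and an all-zero byte as the NOP):

```verilog
// Sketch only: commit gating for the secondary execution unit.  The operand
// byte is always decoded; its result is only committed when the main BRAM
// decoder says the primary opcode takes no operand and the secondary
// instruction's own predicate passes.  Field layout is invented here.
module secondary_gate (
    input  wire       clk,
    input  wire [7:0] operand_byte,          // second byte of the fetched pair
    input  wire       primary_uses_operand,  // from the main decoder BRAM
    input  wire       flag_z, flag_c,        // assumed predication sources
    output reg        sec_commit,            // write-enable toward the register file
    output reg  [2:0] sec_op,                // secondary ALU operation
    output reg  [2:0] sec_reg                // destination register select
);
    wire [1:0] pred    = operand_byte[7:6];
    wire       pred_ok = (pred == 2'b00)               // always
                      || (pred == 2'b01 &&  flag_z)    // if zero
                      || (pred == 2'b10 && ~flag_z)    // if not zero
                      || (pred == 2'b11 &&  flag_c);   // if carry
    wire       is_nop  = (operand_byte == 8'h00);      // 0 is the secondary NOP

    always @(posedge clk) begin
        sec_op     <= operand_byte[5:3];
        sec_reg    <= operand_byte[2:0];
        sec_commit <= ~primary_uses_operand && ~is_nop && pred_ok;
    end
endmodule
```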

A consideration here is to not remove all the unused constant functions in the main core. Using those instead of immediate constants frees the operand byte and so gives more opportunities to use the secondary core.

Possible ROM block copy opcode -- That could be used to increase data storage density. It could use Y:X as the starting destination address, and the operand could be the number of words to copy. I am not sure whether I should have a compressed version or not; that would be good for storing bitmaps in ROM, since pixels only use 6 bits, so 680 bytes' worth could be stored in 510 bytes, i.e. 255 ROM addresses.
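A sketch of the uncompressed version as a small state machine, assuming the source is the ROM word immediately after the opcode and that the normal pipeline is stalled while it runs:

```verilog
// Sketch only: state machine for the (uncompressed) ROM block copy.
// Y:X is the destination, the operand is the word count (0 would mean 256
// here), and each 16-bit ROM word yields two RAM bytes.  Names are assumed.
module rom_copy (
    input  wire        clk,
    input  wire        start,          // pulses when the opcode is decoded
    input  wire [15:0] rom_pc,         // address of the word after the opcode
    input  wire [7:0]  y_reg, x_reg,   // destination Y:X
    input  wire [7:0]  count,          // number of ROM words to copy
    input  wire [15:0] rom_data,       // ROM read data for rom_addr
    output reg  [15:0] rom_addr,
    output reg  [15:0] ram_addr,
    output reg  [7:0]  ram_wdata,
    output reg         ram_we,
    output reg         busy            // stalls the normal pipeline while set
);
    reg [15:0] dst;                    // running destination pointer
    reg [7:0]  words_left;
    reg        phase;                  // 0 = low byte next, 1 = high byte next

    initial busy = 1'b0;

    always @(posedge clk) begin
        ram_we <= 1'b0;
        if (start) begin
            rom_addr   <= rom_pc;
            dst        <= {y_reg, x_reg};
            words_left <= count;
            phase      <= 1'b0;
            busy       <= 1'b1;
        end else if (busy) begin
            // emit one byte per cycle to the RAM port
            ram_addr  <= dst;
            ram_wdata <= phase ? rom_data[15:8] : rom_data[7:0];
            ram_we    <= 1'b1;
            dst       <= dst + 16'd1;
            phase     <= ~phase;
            if (phase) begin           // high byte queued: move to the next word
                rom_addr   <= rom_addr + 16'd1;
                words_left <= words_left - 8'd1;
                if (words_left == 8'd1)
                    busy <= 1'b0;      // that was the last word
            end
        end
    end
endmodule
```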

Unlikely But Possible

Interrupts -- If I decide on hardware interrupts, I'd have to find a way to modify the CPU design to include them. One obvious challenge is that the Gigatron has a pipeline, so I'd need to figure out how to bubble the pipeline and stall the program counter. The reason is to prevent unintended code execution into or out of the interrupt; it eliminates the delay slot during interrupt transitions. Then, of course, I'd need to save the PC, look up the interrupt, maybe save the registers, restart the PC at the intended ISR address, and reverse the process on an IRET. I'd probably want to use register aliasing to avoid needing a stack, allow only 1 interrupt at a time, and possibly chain them at the end (maybe by jumping and letting the next IRET do the return if appropriate). Yet I see something that could cause a problem: I don't know what would happen if an interrupt occurs during or right before a jump, and whether there is any case where some intended code (such as the instruction in the branch delay slot) gets skipped.
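Here is how I would sketch just the entry/exit part of that, with a single interrupt level, no register saving, and a placeholder vector; the squash signal is what turns the already-fetched delay-slot instruction into a bubble:

```verilog
// Sketch only: interrupt entry/exit.  One level, no register saving, a
// placeholder vector address.  All names are assumptions.
module irq_entry (
    input  wire        clk,
    input  wire        irq,
    input  wire        iret,             // decoded "return from interrupt"
    input  wire [15:0] pc,               // current program counter
    output reg  [15:0] pc_override,      // value to force into the PC
    output reg         load_pc,          // force-load the PC this cycle
    output reg         squash_fetch,     // turn the fetched instruction into a NOP
    output reg         in_isr
);
    localparam [15:0] ISR_VECTOR = 16'h0010;   // placeholder

    reg [15:0] saved_pc;
    initial in_isr = 1'b0;

    always @(posedge clk) begin
        load_pc      <= 1'b0;
        squash_fetch <= 1'b0;
        if (irq && !in_isr) begin
            saved_pc     <= pc;            // address of the squashed instruction
            pc_override  <= ISR_VECTOR;
            load_pc      <= 1'b1;
            squash_fetch <= 1'b1;          // bubble instead of a delay slot
            in_isr       <= 1'b1;
        end else if (iret && in_isr) begin
            pc_override  <= saved_pc;      // resume where we left off
            load_pc      <= 1'b1;
            squash_fetch <= 1'b1;          // IRET's delay slot is bubbled too
            in_isr       <= 1'b0;
        end
    end
endmodule
```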

Double-speed ALU -- An idea is to clock the ALU twice as fast to leave room for up to 2 pairs of instructions. Then you'd change the instruction set to have paired instructions. That would complicate the CU some, due to having two time slots and up to 2 instructions per slot (see the secondary decoder above), so the most commonly paired instructions could run with a single opcode. That would simplify the microcode store idea and allow for a smaller ROM. The ROM copy idea, the secondary decoder, this idea, and the microcode/function stores could all make the entire ROM more likely to fit in the netlist. In a wired version, this could be done by using the doubled clock and giving the PC an extra bit: jumps/branches only set the upper bits, and the lowest bit feeds the control unit and determines which phase is active, i.e. opcode set A or opcode set B.
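A guess at the wired version's PC arrangement, purely illustrative:

```verilog
// Sketch only: the "extra PC bit" variant of the double-speed idea.  The PC
// gains a low phase bit clocked at 2x; jumps load only the upper bits, and
// the phase bit tells the control unit whether opcode set A or B is active.
module phased_pc (
    input  wire        clk2x,          // doubled clock
    input  wire        jump,           // taken jump/branch
    input  wire [15:0] jump_target,    // target in whole-instruction units
    output reg  [16:0] pc,             // {instruction address, phase}
    output wire        phase_b         // 0 = opcode set A, 1 = opcode set B
);
    initial pc = 17'd0;
    assign phase_b = pc[0];

    always @(posedge clk2x)
        pc <= jump ? {jump_target, 1'b0}   // jumps always land on phase A
                   : pc + 17'd1;
endmodule
```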
