I'm focusing here on the FPGA implementation of YGREC8. The instruction memory is made of SRAM blocks that need to be pre-loaded before the program starts. Some FPGA families provide this functionality but it is rarely convenient (you have to mess with the bitstream files). Other technologies (such as ProASIC and ASIC) don't preload the SRAM at all. Furthermore, you want to be able to upload any program, at any time, with no fuss. This log explains how it's done for #YGREC8.
The core can work in 2 states : after reset, it is in "upload" mode, where it waits for 512 bytes to be streamed through a dedicated port. If the checksum is correct (ok that makes 513 bytes, sorry), the core goes to "run" mode, where it executes the program until the next reset or shutdown.
The basic trick is to reuse the existing circuits. In particular, the PC circuit increments the 8 bits and the incrementer provides a "carry out" signal that shows an overflow happened. The PC is also directly connected to the instruction memory's address bus so we save a MUX and a counter or state machine.
In detail, it works like this :
- Upon reset, PC=0 and state is "upload".
- The circuit waits for 16 bits of data to be presented on the instruction memory's write port, followed by a "latch" strobe. The PC is then incremented (while the instruction sent to the pipeline is NOP and the writeback condition is disabled).
- If PC overflows, meaning PC=0 again, the last received word is compared to the CRC register. If it is correct, the mode goes to "run".
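The steps above can be modelled in a few lines of Python. This is only a behavioural sketch: the class and function names are made up, the "CRC" is replaced by a plain 16-bit sum because the real check is still TBD, and I assume the check word arrives right after the 256th instruction word and that a mismatch simply leaves the core in "upload" until the next reset.

```python
IMEM_WORDS = 256  # 8-bit PC -> 256 words of 16 bits = 512 bytes


class UploadModel:
    """Hypothetical behavioural model of the upload mode (not the real RTL)."""

    def __init__(self):
        self.imem = [0] * IMEM_WORDS
        self.reset()

    def reset(self):
        self.pc = 0
        self.check = 0        # stand-in for the "CRC register" (assumption: 16-bit sum)
        self.wrapped = False  # remembers the carry-out of the PC incrementer
        self.mode = "upload"

    def latch(self, word):
        """One "latch" strobe with a 16-bit word on the write port."""
        if self.mode != "upload":
            return
        word &= 0xFFFF
        if self.wrapped:
            # Assumption: the word following the PC overflow is the check value.
            if word == self.check:
                self.mode = "run"
            return
        self.imem[self.pc] = word
        self.check = (self.check + word) & 0xFFFF
        nxt = self.pc + 1
        self.wrapped = nxt > 0xFF  # carry out of the 8-bit incrementer
        self.pc = nxt & 0xFF


def upload(core, words):
    """Stream the words, then the matching check value; return the final mode."""
    for w in words:
        core.latch(w)
    core.latch(sum(words) & 0xFFFF)
    return core.mode
```

With a full 256-word program, `upload(UploadModel(), program)` ends in "run" mode, while a wrong check word leaves the model in "upload" mode.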
This system requires few added gates :
- The 16-bit write port is connected "somewhere"
- a few gates control the "state machine"
- some inhibit bits here and there
- some sort of CRC (TBD)
I'm pretty happy with this system :-)
The 16-bit words can be provided asynchronously, from a byte-wide port for example, or a serial port (SPI, RS232 or USB) with minimal handshake. The host computer must simply be able to control the /RESET pin and read the current status of the core (with some GPIO or ACK pin). This detail is independent from the system's principles.
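As an illustration of the byte-wide case, here is a minimal sketch of the adapter that pairs incoming bytes into 16-bit words before each "latch" strobe. The function name is hypothetical and the byte order (low byte first) is my assumption, nothing in the design fixes it yet.

```python
def words_from_bytes(stream):
    """Pair up bytes and yield 16-bit words (assumption: low byte arrives first)."""
    low = None
    for b in stream:
        if low is None:
            low = b & 0xFF  # hold the first byte until its partner arrives
        else:
            yield low | ((b & 0xFF) << 8)  # this is where the "latch" strobe would fire
            low = None
    # A trailing odd byte stays pending: no strobe without a full word.
```

For example, the four bytes 0x34, 0x12, 0x78, 0x56 give the two words 0x1234 and 0x5678.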
Preloading the data RAM blocks is a different story. The write bus and address bus are already tied to the core and adding a MUX would slow the whole circuit down. The written data must go through the normal datapath...
There is also the issue of granularity, as incoming words are 16 bits wide but memory blocks are 8 bits wide. It is not possible to split a 16-bit word into two parallel 8-bit bytes because there is (usually) only one write port. The whole design promotes economy and reuse, so a dedicated circuit seems impractical. Reusing the PC like above creates more problems than it solves.
There is one easy solution though : reuse the program upload system several times.
The idea is to allow the program to trigger a "soft reset" (to restart the upload FSM) and indicate which program to load and execute. In other words : create some sort of overlays.
In the simplest case, the whole program is split sequentially between several consecutive (temporal) overlays. The first one(s) contain all the data and initialise all the peripherals. When an overlay has done its work, it triggers a "soft reset" that requests 512 more bytes to upload and execute. This breaks the addressing problem in a flexible and cheap way.
In a more elaborate case, instead of receiving consecutive overlays, the program can control which overlay to receive and execute. One interesting way is to dedicate one GPIO register to the current overlay number, which can be read outside the core by the upload system. For example, it can latch the MSB of an address bus to a ROM or an SPI bus master.
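To make the ROM example concrete, here is a tiny sketch of how an external uploader could build its address. Everything here is an assumption for illustration: the function name, the 8-bit overlay register width, and a ROM layout with one 512-byte slot per overlay (where the check value is stored is left open).

```python
OVERLAY_BYTES = 512  # 256 instructions x 16 bits; the check value is not counted here


def rom_address(overlay, offset, overlay_bits=8):
    """Overlay number (from the GPIO register) gives the upper address bits,
    the uploader's own byte counter gives the lower 9 bits."""
    if not 0 <= offset < OVERLAY_BYTES:
        raise ValueError("offset must be within one overlay")
    overlay &= (1 << overlay_bits) - 1
    return (overlay << 9) | offset
```

Overlay 3, byte 5 then lands at address 3 × 512 + 5 = 1541 in this hypothetical layout.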
This opens another avenue for tweaks because on a secondary upload, the PC might not be cleared. The upload will then only fill the remaining high addresses of the instruction memory, leaving the lower addresses unchanged. This means : partial overlays are possible ! Complex webs of overlays can then call each other at will, in a FSM fashion...
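A quick way to convince yourself that partial overlays fall out "for free": if the PC keeps its value on a secondary upload, the words are written from that address until the 8-bit PC wraps, and everything below stays as it was. The sketch below only models that address behaviour (hypothetical names, checksum left out).

```python
def load_overlay(imem, words, start_pc=0):
    """Write words from start_pc until the 8-bit PC wraps around to 0.
    Returns how many words were consumed."""
    pc = start_pc & 0xFF
    used = 0
    for w in words:
        imem[pc] = w & 0xFFFF
        used += 1
        pc = (pc + 1) & 0xFF
        if pc == 0:  # carry out of the incrementer: upload is complete
            break
    return used
```

Starting at PC=0x80, only 128 words are consumed and addresses 0x00 to 0x7F keep their previous contents.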
The data SRAM blocks are then easily filled with tailored software that can be generated from a .hyx file. Each sequence or "stretch" can be converted by a program into a custom loop (several instructions) and a block of constant data, to get a decent density. Both SRAM blocks could mostly fit into one overlay.
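The first step of such a generator could look like the sketch below: group the bytes to preload into stretches of consecutive addresses. I assume the memory image has already been parsed into an {address: byte} dictionary (the .hyx parsing itself is not shown) and the function name is made up. Emitting the actual copy loop and constant block for each stretch is left out.

```python
def stretches(image):
    """Group an {address: byte} image into (start, [bytes...]) runs
    of consecutive addresses, sorted by start address."""
    runs = []
    for addr in sorted(image):
        if runs and addr == runs[-1][0] + len(runs[-1][1]):
            runs[-1][1].append(image[addr])  # extends the current stretch
        else:
            runs.append((addr, [image[addr]]))  # gap: start a new stretch
    return runs
```

Each resulting (start, bytes) pair would then become one loop plus one block of constants in the generated overlay.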
Triggering the overlay upload can be done in several ways but they must obey one rule : set the PC to the address where the instructions will be loaded. In this case, the "CALL PC" idiom is the most appropriate to become the "OVL" instruction because it sets the PC to the value contained in a register or an immediate. The old value of PC is lost. The overlay number must be set on GPIO pins by a previous instruction.
These overlays solve a nagging problem because the 8-bit instruction addressing space is really small... They also allow "bootstrap" features, by switching the source of instructions after a first overlay has done all the required initialisation. In the A3P family, there is a 128-byte FlashROM area that can contain such bootstrap code.
This leaves open the issue of reading the instruction space...