YGREC8 separates the program and data spaces. Like the Microchip PIC16, it's a "Harvard" architecture which is safe, secure and convenient (PC does not have to deal with unused LSB for example). This creates the problem of transferring data (constants such as tables) from the program (fixed) space, to the data space for example.
YGREC8 does not provide a way to access the program space, which is actually larger : 256 bytes of data RAM and 512 bytes of instructions (packed as 256 instructions). Using the IO space to read back the program space is not a good idea because the necessary circuits are too cumbersome. And the ability to probe the program space at will is not a good idea (from a security perspective).
Enters the Microchip PIC16F again. Its architecture has really awkward aspects but it uses a wide range of crazy tricks. For example, constant data tables are implemented with series of RETURN-like instructions that load a constant in the accumulator before returning. It's awkward but cheap.
Actually the only good way to access the program space is through PC. Doing otherwise messes with the program sequence/schedule or adds a read port so it's better to multiplex accesses in some way. And it's easy to control PC because we can easily jump there :
MOV R1 TableBase ADD R1 R2 ; adds the index into the table CALL R1 R2 ; jump into the table and link in R2
OK but then what ?
It is not possible to implement the PIC's retlw instruction as is, but a similar effect is possible within the YGREC8's datapath, by reusing tricks developed for the CALL instruction.
CALL is like a MOV that swaps two write buses : the result goes to PC, and NPC goes to the registers.
The new desired instruction (let's call it MOVJ) is also like a MOV but it also jumps : we want to write a value to a register (typically an Imm8 but a reg is also possible) while also writing a register to PC. There are the following problems :
- The opcode space is already saturated, unfortunately... I don't know which opcode to axe.
- Only one half of the coding space is used. Instructions are 16 bits long and we get only 8 bits.
- It's a subtly different datapath and it requires a new MUX because it's not possible to make two MOV at the same time. The DST output from the register set needs a direct connexion to the PC MUX input.
Supporting this MOVJ instruction adds little overhead (a bypass from DST to PC) but the other problems are more severe. If only we could use the whole width of the instructions and pack 2× more constants...
So @roelh comes along and suggests a different approach, which I have tried to avoid by fear of introducing complexity :
you might use the pipeline effect to call a single-instruction subroutine. This is done by Marcel: https://hackaday.io/project/20781/logs . A flag (set by the 'calling' instruction) is needed that tells that the single instruction is a value and not an instruction.
In my https://hackaday.io/project/11012-risc-relay-cpu I have a LDC (load constant) instruction, that loads the pc with an address and fetches the table value from program space. It takes advantage of the fact that the PC has a master and a slave section. The trick is that the master can contain the constant location while the slave still contains the original (incremented) PC, so it is very easy to restore the PC after the constant has been loaded.
Using such an instruction solves the density problem because it can potentially read 16 bits at once and prevent waste.
However, the YGREC8 is not pipelined, it's a single-cycle processor. It might be implemented with several phases but accessing the program memory and the data memory each use one cycle (simultaneously). The NPC value (which is the current PC plus one) is not a register but a combinatorial output from the incrementer.
There are two ways to solve that :
- add a transparent latch, to hold on the last incremented value, after NPC or before the incrementer. This prevents the next value (pointing into the table) from being fed back immediately.
But the timing will be difficult to get right...
- Inhibit the write to PC and MUX the next instruction address with the result bus. This adds another small latency on the critical datapath yet this is the preferred method (no timing uncertainty).
Both methods require a small additional circuit and also a two-cycles sequence. That part makes the design more complex, because the core exists in two modes : either normal execution, or data fetch (which is monostable and will be cleared at the next cycle. This requires the addition of a new state/mode bit that controls some circuitry.
Another side-effect comes with interrupts : can an IRQ (when finally developed) break the 2-cycles sequence ? One aspect says "yes" because the value will not change, the program remains constant. PC is also unchanged so an interrupt signal does not need to be masked during that sequence, the program will restart at the beginning of the sequence. However, the value on the result bus will be gone if the program resumes in the middle of the sequence so it must be restarted.
So the sequence is "interruptible" and "restartable", the 2nd cycle must be restarted (with very low cost) and the state/mode bit does not need to be preserved (because it must be re-run anyway). So should the IRQ be inhibited ? Probably not, for simplicity reasons.
But now there is a new issue : Reading the program memory gives two bytes and the core can only write one at a time. Furthermore, it's probably not possible to write anywhere at all because the opcode might not even hold enough information. The probable opcode that will be sacrificed is CALL PC, which makes no sense. Only 8 bits are left for the address (either a register value + condition or an Imm8) and there is no room left for a destination register number.
Using the Inhibit+MUX system, it is easy to shift the address right by one bit, so we get a 128-words range with byte granularity : the LSB is kept for the next cycle as a byte-select flag, in front of the byte-MUX. Oh, and since it makes no sense to run the instruction with an immediate value (because the result is constant), the Imm8 operand is meaningless, and this frees one bit in the opcode (the R/I8 flag). This flag can then be used as a MSB for the 9-bits address. We end up with the opcodes LDCL and LDCH.
We have nowhere to write this data so the destination must be implicit.
- The value must not got to a fixed register. It's BAD. Next.
- It could go to an IO register. That's slow and IOs are not really ready now. It could be a special IO register like IN 0 but the extra read instruction wastes one instruction in the program.
- It could go to a latch/FF to override the SRC operand of the following instruction : this makes a 3-cycles instruction sequence with IRQ off... it's better from a programming point of view but the electronics breaks the nice lean and clean architecture.
So ideally, a register destination would be best but this requires an opcode sacrifice...
There is still another trick for tiny constant tables when speed is not important. Let's imagine a 7-segments decoder, with 16 values. I'll simply reuse some code I have written at https://hackaday.io/project/26068-numitron-hexadecimal-display-module/log/67654-going-further
var o7S=[ A+B+C+D+E+F , // 0 B+C , // 1 A+B +D+E +G, // 2 A+B+C+D +G, // 3 B+C +F+G, // 4 A +C+D +F+G, // 5 A +C+D+E+F+G, // 6 A+B+C , // 7 A+B+C+D+E+F+G, // 8 A+B+C+D +F+G, // 9 A+B+C +E+F+G, // A C+D+E+F+G, // B A +D+E+F , // C B+C+D+E +G, // D A +D+E+F+G, // E A +E+F+G // F ];
The last entry is "F", made of segments A+E+F+G, or 1+16+32+64=113. So if we only had F, the function would be like
decode7seg: ; index in R1, link in R2 MOV R3 113 MOV PC R2 ; output in R3
Now, let's support the digit "E" : it differs from "F" by the segment D (value=8). My idea is something like this:
decode7seg: ; index in R1 MOV R3 121 ; "E" AND R1, 1 ; isolate the LSB of the index ADD PC 2 IFZ ; skip the next instruction if digit="E" XOR R3 8 ; "E"^"F"=8 MOV PC R2 ; return to link in R2 ; output in R3
The "ADD PC" idiom can be extended to any number of elements to skip, and not only fixed jumps. The index can be added to PC but don't forget to increment to prevent an endless loop.
decode7seg: ; index in R1 MOV R3 0 ; ADD R1, 1 ; pre-increment the index ADD PC R1 ; skip as many instructions to get to the right entry ; the actual table: XOR R3 57 ; "0" XOR R3 93 ; "1" XOR R3 20 ; "2" ; .... XOR R3 39 ; "D" XOR R3 8 ; "E" XOR R3 113 ; "F" MOV PC R2 ; return to link in R2 ; output in R3
The XORs are chained to create the appropriate value at the end :-)
It's a semi-decent solution because this still wastes more than 1/2 of the encoding space due to the opcodes of the immediate instructions.
And it's slow and has linear execution time.
And you'll need a script to create arbitrary sequences of code from arbitrary data.
It's a hack !
Another cool thing (and mitigation) would be to find a way to preload the constants in the Data RAM upon reboot. I have found a trick to reuse PC during power-up and load the instruction RAM but it's not so easy for the data. But with 512 bytes of instructions, it's still possible to cram some constants here and there.