Retro challenge 2017/04 project to create a TMS9900 compatible CPU core. Again in a month... Failure could be an option...
Real life has kept me busy... I did a few small tweaks to the TMS9900 core: I added support for the MPY (multiply) instruction. I also started to implement CPU status flag support more carefully, and changed the handling of the ST0 (Logical Greater Than) and ST1 (Arithmetic Greater Than) flags. ST0 is the MSB, i.e. bit 15 of the status register, while ST1 is bit 14. For whatever reason the TMS9900 generates these flags in what seems to be an inconsistent way: for both flags the "correct" behaviour only occurs with the compare instructions (C, CB, CI). With other instructions (except for the special handling of the ABS instruction, which I have not implemented yet), the flags become simpler and are mostly just versions of "non-zero" flags.
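To make the difference concrete, here is a small Python model of the two flag styles. It is only a sketch of the behaviour described above, not the VHDL from the core: the function names are made up for illustration, and the exact zero-comparison rules used for non-compare instructions are my reading of the data sheet.

```python
def to_signed(word):
    """Interpret a 16-bit word as a two's complement signed value."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def compare_flags(src, dst):
    """Flags after a word compare: source is compared against destination."""
    src &= 0xFFFF
    dst &= 0xFFFF
    return {
        "L>": src > dst,                        # ST0: unsigned comparison
        "A>": to_signed(src) > to_signed(dst),  # ST1: signed comparison
        "EQ": src == dst,                       # ST2
    }


def result_flags(result):
    """Flags after most other instructions: the result is compared against zero."""
    result &= 0xFFFF
    return {
        "L>": result != 0,            # assumption: simply "non-zero"
        "A>": to_signed(result) > 0,  # assumption: non-zero and MSB clear
        "EQ": result == 0,
    }
```

With a compare, >8000 against >0001 is logically greater but arithmetically smaller, so the two flags disagree. With a plain result of >8000 the logical flag is just "non-zero".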
I implemented the multiply instruction with the Xilinx Spartan 6 hardware multiplier, so this instruction (while not optimized) has a much higher performance than on the original CPU. According to the data sheet the multiplier can compute the multiply operation in 3 cycles - it could do it even faster in pipelined mode, but that is neither necessary nor useful on this CPU core.
I have three test programs that I use for testing: the original TI ROMs, my custom boot loader and an instruction test program I use for simulation. My custom boot loader has the ability to start the Defender game, if the ROM image for that cartridge has been loaded. Both of my own test programs work, but the TI ROMs still do not work. With the changes to the handling of flags, the Defender cartridge image now renders the first screen before somehow crashing. So some progress, but still no beef yet.
One more feature done! Now the CPU core supports interrupts, leaving the MPY and DIV instructions as the major missing features - and a whole bunch of debugging.
The CPU core now has 3 more I/O signals:
    int_req : in STD_LOGIC;                  -- interrupt request, active high
    ic03    : in STD_LOGIC_VECTOR(3 downto 0); -- interrupt priority for the request, 0001 is the highest (0000 is reset)
    int_ack : out STD_LOGIC;

In both my simulation framework and the actual FPGA implementation I set the ic03 bit vector to "0001", so all interrupts occur at the highest level. This is how interrupts are hardwired on the TI-99/4A.
int_ack is a signal that does not exist on the real TMS9900 CPU. In my case it is set to high while the CPU fetches the new workspace pointer from the interrupt vector table. This would allow external hardware to see that the CPU is vectoring to an interrupt, and also which interrupt it is. Maybe the actual CPU does something similar, but this was simple so I did it... At least useful for simulation runs.
It is amazing that there is almost no feature where the first implementation version wouldn't have several bugs: in my case when the CPU vectors to an interrupt it needs to modify its internal interrupt level (stored in 4 LSBs of the status register). I managed to implement two bugs in there first: initially I latched the interrupt priority code too early, so that when the CPU exited the interrupt service routine, it actually remained at a higher interrupt priority level than before the interrupt, because I altered interrupt priority before the previous level was stored to memory (as part of status register). So a flag bit was needed to modify the status register's interrupt priority field only after storing the previous contents to memory as part of the interrupt context switch.
The second problem was harder to debug, since it only occurred with FPGA run and not on simulation. When the CPU vectors to an interrupt, it must also adjust the current priority level not to the level that external hardware is requesting the interrupt for, but to a level below, as to block the same interrupt from firing over and over again. The following code does the right thing in the processing of the do_blwp3 state (this is part of the chain of states the CPU execution state machine marches through when entering the interrupt):
    if set_int_priority then
        st(3 downto 0) <= std_logic_vector(unsigned(ic03) - 1);
        set_int_priority <= False;
    end if;

So a four bit decrementer is required to compute the new interrupt priority.
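The same rule can be checked quickly in Python. This is a minimal sketch with hypothetical function names. It assumes that an interrupt is accepted when its level is numerically less than or equal to the mask in the 4 LSBs of ST, and it mirrors the 4-bit wrap of the VHDL decrementer.

```python
def accepts_interrupt(st, ic03):
    """True if a request at level ic03 passes the mask in the 4 LSBs of ST.

    Assumption: lower numbers mean higher priority, and a request is taken
    when its level is numerically <= the current mask.
    """
    return (ic03 & 0xF) <= (st & 0x000F)


def enter_interrupt(st, ic03):
    """Return ST with the mask set to one level below the requested level.

    Only the 4 LSBs change, like st(3 downto 0) in the VHDL snippet.
    The & 0xF models the wrap of a 4-bit decrementer.
    """
    return (st & 0xFFF0) | ((ic03 - 1) & 0xF)
```

With ic03 hardwired to 1, entering the interrupt sets the mask to 0, so the same level 1 request cannot fire again until the old status register is restored on return.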
Having interrupts working both in simulation and on the FPGA allowed me to also properly implement the IDLE instruction. This instruction simply waits for an interrupt. I added a new state to the CPU, which waits for an interrupt that has the same or higher priority than the CPU currently has.
On the TI-99/4A hardware that my FPGA implements, the interrupts originate from my TMS9918 video processor core on every frame, then pass through my TMS9901 core, which can further mask the interrupt, and finally go to the CPU core. Thus on the TI-99/4A clone the IDLE instruction (assuming the video processor and I/O controller are properly set up) becomes a "wait for next video frame" operation.
After some debugging I got the TMS9900 CPU core to run on the FPGA chip. That's very cool! Even in this completely unoptimised form it runs a fair deal faster than the TMS99105 processor shield I built before. The CPU core is not yet fully functional, it lacks interrupts and DIV/MPY instructions, but I can run my demo code on it. The demo runs equivalently on the new CPU core as it did on the actual TMS99105 chip:
I also tried to run the full TI-99/4A on it (as the peripheral set is the same), but it wants to go to never-never land. That is hardly surprising, as I have tested the CPU very little. It actually is more surprising that it does run the demo code correctly. That program has about 650 lines of assembly code, and it does exercise a fair amount of instructions - and interfacing to the TI-99/4A hardware on the FPGA chip, namely the interface to my TMS9918 implementation. That part of the code loads up fonts to the video memory of the TMS9918.
Before I was able to get this far I needed to integrate the TMS9900 core in a functional way to the rest of the logic I created earlier for my TI-99/4A clone. I had a bunch of difficulties in getting the CPU to run. Eventually I added debug registers to help me figure out what was not working. The CPU core component exposes a 48-bits wide debug bus, which contains three 16-bit fields:
I also added another 64-bit debug bus to my top-level VHDL module, which connects to the SRAM memory bus (my TMS9900 core uses the external SRAM as its memory, it does not use any internal block RAM). This 64-bit debug bus contains the 16-bit CPU address bus, the 18-bit SRAM address bus (there is a memory paging subsystem, and a bus multiplexer to connect to the 32-bit wide SRAM bus), a few status flags, and the last 16-bits read from the memory bus.
The beauty of the debug buses is that they are exposed through my USB memory controller interface, allowing me to see what the contents are. Normally those values would be flying by, but when the CPU core hits an opcode it does not understand it will enter a "stuck" state, and light up an LED. That allows me to see that it got stuck, and by reading the debug buses I can see what is wrong. That is the theory at least. It did help me debug the SRAM memory interface, so that my CPU core could read and write memory reliably. But when running the normal TI-99/4A ROMs, I can only see that the CPU gets stuck in a memory location it is not even supposed to fetch instructions from... That is not surprising.
But the nice thing about working with FPGAs is that I can next equip my CPU with more debug features - it seems it would be very useful to have a trace buffer, where I could store for example the last 1024 or so instructions and their addresses. I could also force the CPU stuck the minute (eh nanosecond really) it starts to execute instructions from a memory location it is not supposed to run code from.
I wrestled today with two of the remaining instructions, LDCR and STCR, but before explaining them, I did plug in the CPU to my EP994A project (TI-99/4A clone), basically replacing the external TMS99105 interface - and synthesis did pass!!!!!! Wow! I did a really stupid integration, just wiring signals in there in a semi logical way, as to force the logic synthesis to do something - and it did! The very first attempt succeeded! It will not work for sure, as the bus interface I created is different from the TMS99105 - it has vastly different timing, so I need to modify the integration logic quite a bit.
But here is the interesting stuff, a comparison of how many logic resources were consumed on the Xilinx XC6SLX9 FPGA, with "only" the TI-99/4A logic, and with the logic including the CPU (granted the CPU is bogus and the integration even more bogus - but we don't care, the ballpark is what matters):
| | TI-99/4A + external TMS99105 | TI-99/4A + new TMS9900 core |
|---|---|---|
| Number of slice registers used | 966 | 1248 |
| Number of slice LUTs | 1402 | 2663 |
| Number of slice LUTs % | 24% | 46% |
| LUTs used as logic | 1381 | 2636 |
| LUTs used as memory | 9 | 14 |
I like these results :) It means that the FPGA easily accommodated the CPU implementation with the rest of the TI-99/4A logic. In fact even if the CPU is totally bogus and much more logic is required, it will fit in. In fact on this relatively small FPGA there is enough space to add at least one more TMS9900 core. Also the instruction decode etc. is completely state machine based, so it does not use any of the memory blocks of the FPGA. The CPU could be partially microcoded by using a memory block to save logic if necessary.
Of course a working integration of the CPU will change the numbers - but integrating the on-board CPU is actually more straightforward than interfacing to an external CPU. And if it becomes complex I can always modify the CPU bus interface...
The DIV and MPY instructions are still not implemented, but I think I will next focus on getting the CPU integrated so that I can actually prove that it works in the FPGA implementation. This could be a lot of work...
LDCR and STCR instructions
Before doing the synthesis I spent quite a bit of time implementing and simulating two of the remaining four instructions, LDCR and STCR.
They were both nasty instructions to make. The bit serial CRU interface on the TMS9900 uses the address bus to tell the external world which bit is addressed. When writing or reading more than one bit - which is pretty much always the case with these instructions - the CPU must increment the address and shift bits appropriately. It also needs to separately handle transfers of 1 to 8 bits and of 9 to 16 bits. For 1 to 8 bits the CPU operates in "byte mode". For example, an LDCR of 5 bits using the auto increment addressing mode on R3 writes 5 bits from the address pointed to by R3 and auto increments R3 by 1. But if the bit count is higher (9 to 16), the auto increment is by two.
This gets more hairy in the opposite direction. For example, an STCR of 5 bits with R3 pointing to the destination will read 5 bits and do a byte write to the address pointed to by R3. Since the external bus is 16 bits wide, the CPU actually must do a read-modify-write cycle and modify either the low or high byte (depending on the LSB address bit).
To make things a little more involved, the number of bits transferred is encoded into 4 bits, with the value 0000 indicating 16 bits. So that needs to be handled properly too.
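The size rules above boil down to a few lines. The following Python sketch (the function name is hypothetical) decodes the 4-bit count field into the number of bits, the byte/word mode and the auto increment step:

```python
def cru_transfer_size(count_field):
    """Decode the 4-bit LDCR/STCR count field.

    Returns (bits, byte_mode, autoinc_step):
    - a field value of 0000 means 16 bits
    - 1 to 8 bits is "byte mode" with auto increment by 1
    - 9 to 16 bits is word mode with auto increment by 2
    """
    field = count_field & 0xF
    bits = 16 if field == 0 else field
    byte_mode = bits <= 8
    autoinc_step = 1 if byte_mode else 2
    return bits, byte_mode, autoinc_step
```

So a count of 5 gives a byte transfer, while a count field of 0 gives a full 16-bit word transfer.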
My implementation seems to do the appropriate things now, based on my limited experience of running on the real iron and reading the data sheet. The data sheet is really not verbose as to how these instructions work. The instructions also mess around with flags, but I did...
Today was a busy day in the office - I just had the time and energy to add 3 more instructions. They are from a new category - one that only allows a workspace register as a destination parameter, while the source operand can have all addressing modes, for example:
This category has five instructions in total (DIV and MPY are still missing) but COC, CZC and XOR are now done. Of these instructions XOR is the most familiar and supported by virtually all processors - it just does the XOR operation of source and destination and stores the result to destination while also setting 3 status flags.
COC and CZC are unusual instructions, I have not seen these on any other processor although I have programmed in assembler on many CPUs.
COC stands for "compare ones corresponding" and CZC stands for "compare zeros corresponding". Since they are comparison instructions, there is no actual data output other than the result of comparison which is stored in the zero flag.
I implemented both using new ALU operations, in VHDL as below. I don't think I have ever used these instructions, so this implementation follows from what I understood from the TMS9900 data sheet.
COC: alu_out <= ('0' & arg1 xor ('0' & arg2)) and ('0' & arg1);
CZC: alu_out <= ('0' & arg1 xor not ('0' & arg2)) and ('0' & arg1);
(The extra zero bits '0' are just garbage in the above to account for the fact that the ALU is actually 17 bits wide, in order to be able to generate the carry flag - which is not used by these instructions).
Both of these comparison instructions take the source operand (arg1 above) and make sure that the result indicates that in the destination operand there are one bits (for COC, or zero bits for CZC) in each location where there are one bits for the source. I did the core of the comparison with XOR, and arg1 is used as a bit mask to leave only the relevant bits. The standard result comparison to zero in the flag creation logic works.
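For anyone who wants to convince themselves that the XOR-and-mask trick works, here is a Python model of the two ALU expressions. It is a sketch with made-up function names: src plays the role of arg1, dst the role of arg2, and the 17-bit mask stands in for the ALU width.

```python
MASK17 = 0x1FFFF  # the ALU is 17 bits wide (16 data bits plus carry)


def coc(src, dst):
    """Compare ones corresponding: True (equal) if dst has 1s wherever src has 1s."""
    alu_out = ((src ^ dst) & src) & MASK17
    return alu_out == 0


def czc(src, dst):
    """Compare zeros corresponding: True (equal) if dst has 0s wherever src has 1s."""
    alu_out = ((src ^ (~dst & MASK17)) & src) & MASK17
    return alu_out == 0
```

The results agree with the plain boolean forms: COC is equal when `src & ~dst` is zero, and CZC is equal when `src & dst` is zero.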
I'm not really sure why they thought around 1977 that these are useful instructions... I can only assume they originated from the minicomputer architecture. These operations can so simply be implemented with basic boolean operations. Clearly this architecture was not designed with C compilers in mind - but that is evident from many other things as well, such as the lack of a proper hardware supported stack.
Whatever - 3 more instructions done and a little tested - and only 4 instructions remain!
I was hoping to complete the core in terms of instruction set today - but no such luck. But I did add a bunch of instructions:
The XOP instruction turned out to be the real deal, a proper mega instruction. I was thinking earlier, after adding the BLWP instruction, that it does not get more complex than that. The XOP is a kind of software interrupt, which transfers control via a table at address >0040. What is unique about the TMS9900 is that the XOP instruction supports a parameter, and the effective address of the parameter is put into register R11. This really converts the instruction into a door to many useful and compact constructs. It is sad that in the design of the TI-99/4A ROM they did not really make any provision for general use of this instruction.
An example: XOP *R3,3
This instruction activates the XOP number 3 (out of 16). The vector is calculated as >40+4*3, i.e. from >4C. From [4C] is loaded the new workspace pointer and from [4E] new program counter. Then no less than four values are stored into the new workspace: old values of PC, ST and W register, and finally the effective address of *R3 (which happens to be the contents of R3 which in this case would be a memory pointer).
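The vector arithmetic is easy to sanity check in Python. A tiny sketch, with a hypothetical function name:

```python
XOP_TABLE_BASE = 0x0040  # start of the XOP vector table


def xop_vector(n):
    """Return (address of new WP, address of new PC) for XOP number n (0..15)."""
    if not 0 <= n <= 15:
        raise ValueError("XOP number must be in the range 0..15")
    wp_addr = XOP_TABLE_BASE + 4 * n  # each vector is two 16-bit words
    return wp_addr, wp_addr + 2
```

For XOP 3 this gives >4C for the workspace pointer and >4E for the program counter, matching the example above.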
I modified the processing of the BLWP instruction to serve the Reset, XOP and BLWP use cases - I suspect that once I implement interrupts they will also use the same internal states, since interrupts are effectively a bit like XOPs or BLWPs, in that they also vector through a memory location, change context and save the previous state into the new context.
After these instructions there are only seven instructions left to do! They are COC, CZC, XOR, MPY, DIV, LDCR and STCR. Of these I want the multiply instruction MPY to use the Xilinx FPGA DSP block for good performance. In addition interrupt support needs to go in, but that should be easy at this point due to the BLWP/XOP support.
Ok so more progress for today. I implemented many key (unique but a little obscure) features of the TMS9900. Still testing under simulation. I decided to postpone actual hardware synthesis until I have all the instructions somehow implemented. After these additions there are not many instructions missing anymore.
I tested all of the above in simulation. Not comprehensively, especially regarding flags. But BLWP and RTWP work - I actually changed reset processing so that reset is done by forcing a BLWP from address 0. BLWP does a ton of stuff:
When doing the above, care must be taken to capture the old values of the registers W, PC and ST before overwriting them with the new ones.
RTWP is an easy instruction - it has no operands. But it also does plenty: it reverses BLWP by loading W, PC and ST from R13, R14 and R15.
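A small Python model shows why the ordering matters. This is only a sketch of the context switch, not the state machine in the core: the class name and the dict-based memory are assumptions made for illustration.

```python
class ContextModel:
    """Toy model of the W, PC and ST registers with word-addressed memory."""

    def __init__(self, mem):
        self.mem = mem  # dict: even byte address -> 16-bit word
        self.wp = 0
        self.pc = 0
        self.st = 0

    def reg_addr(self, n):
        """Workspace registers live in memory at WP + 2*n."""
        return (self.wp + 2 * n) & 0xFFFE

    def blwp(self, vector):
        # Capture the old context BEFORE overwriting anything.
        old_wp, old_pc, old_st = self.wp, self.pc, self.st
        new_wp = self.mem[vector]
        new_pc = self.mem[vector + 2]
        self.wp = new_wp
        # Old context goes to R13, R14 and R15 of the NEW workspace.
        self.mem[self.reg_addr(13)] = old_wp
        self.mem[self.reg_addr(14)] = old_pc
        self.mem[self.reg_addr(15)] = old_st
        self.pc = new_pc

    def rtwp(self):
        # Read all three values first, since changing WP moves R13..R15.
        wp = self.mem[self.reg_addr(13)]
        pc = self.mem[self.reg_addr(14)]
        st = self.mem[self.reg_addr(15)]
        self.wp, self.pc, self.st = wp, pc, st
```

Running blwp followed by rtwp should bring W, PC and ST back to exactly the values they had before the call.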
The shift instructions SLA, SRA, SRC and SRL operate on a workspace register. The shift count can be given as an immediate argument. If it is set to zero, the shift count is actually read from workspace register zero: the four LSBs of R0 are used as the shift count. And there is a catch there too - if those four LSBs of R0 are zero, the shift count is actually 16. I think for the shift instructions the carry and zero flags at least are set properly, but I am not sure yet about the other flags...
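The shift count selection can be written down as a short Python sketch (the function name is hypothetical):

```python
def shift_count(count_field, r0):
    """Resolve the effective shift count.

    count_field: the 4-bit count from the instruction word
    r0:          current contents of workspace register 0
    """
    count = count_field & 0xF
    if count != 0:
        return count          # immediate count 1..15
    count = r0 & 0xF          # count of zero: use the four LSBs of R0
    return count if count != 0 else 16  # and zero there means 16
```

So an immediate count always wins, R0 is only consulted when the count field is zero, and the result is always in the range 1 to 16.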
The single bit CRU instructions are also unique in that they use a special addressing mode that none of the other instructions use: the 8 LSBs of the instruction word become a sign extended offset to R12 for I/O bit addressing. Not only that - the 3 MSBs of the I/O address are always zero and the offset is left shifted by one... The instructions are SBO, SBZ and TB.
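As a rough Python sketch of that address calculation (the function names are hypothetical, and clearing bit 0 of the result is my assumption about how the bit address appears on the bus):

```python
def sign_extend8(value):
    """Sign extend the 8 LSBs of a value to a Python int."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def cru_bit_address(r12, opcode):
    """Address bus value for a single bit CRU instruction.

    The signed 8-bit offset from the opcode is shifted left by one and
    added to R12. The 3 MSBs of the result are forced to zero.
    Assumption: bit 0 is also cleared, since CRU bits sit on even addresses.
    """
    offset = sign_extend8(opcode) << 1
    return (r12 + offset) & 0x1FFE
```

With R12 at >0100, an offset of 5 addresses >010A, and an offset of -1 addresses >00FE.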
Again time constrained...
Still running the core in simulation, I added the support of byte operations. The TMS9900 has only one category of instructions which support byte operations: the dual operand instructions with all addressing modes. These are the most flexible instructions.
In principle byte operations are simple, because they are done by reading and writing 16-bit values (the bus only supports these (except with single bit CRU operations that I don't support yet)). So you read a 16-bit word, and put the relevant byte as the most significant byte. When writing to memory, you need to do a read-modify-write cycle, and put the relevant byte where it belongs.
For example, if at address >1000 you have a data word >1234, you have as bytes >12 at address >1000 and >34 at address >1001. Now if you do a MOVB to the destination address >1000 with source data of >55, the result will become >55 at >1000 and still >34 at >1001. Since the bus only supports 16-bit values, you have >5534 at >1000. Similarly, if you store >55 at >1001, the memory word at >1000 becomes >1255. Note that with TI assemblers the greater than sign > denotes a hexadecimal number.
Simple, right? In principle, yes, in practice not exactly. Since there is an exception. If the write destination is a workspace register, you always modify the high byte of the register. Conceptually for a programmer this is very simple. If you for example consider the add byte instruction and do AB @>1001,R2 and at >1000 you have >1234 then the memory word at >1000 will be read, the least significant byte >34 (since the LSB of the address was 1) will be shifted to the MSB with zero extension (i.e. the word >3400) and that will be added to the contents of the most significant byte of R2. So you preserve the least significant byte of R2.
But if you consider the above as a hardware designer, and keep in mind that the registers are actually in memory, you may need to special case direct register accesses to make sure you always deal with the high bytes of registers. This comes back to how the hardware stores effective addresses, as the least significant bit of the effective operand address calculation becomes a byte shifter control line. Now that I think about this, it actually may not be necessary to special case the registers... So it is useful to write these blog entries :)
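The byte handling described above can be modelled in a few lines of Python. This is a sketch with hypothetical function names, covering only the plain memory case (not the register special case): a read puts the addressed byte into the high byte of a 16-bit value, and a write merges a byte into the existing memory word.

```python
def read_byte_aligned(word, addr):
    """Return the byte at addr as the high byte of a 16-bit value (low byte zero)."""
    if (addr & 1) == 0:
        return word & 0xFF00          # even address: high byte, already in place
    return (word & 0x00FF) << 8       # odd address: low byte, shifted up


def merge_byte(old_word, addr, byte):
    """Read-modify-write: place byte into the word at the half selected by addr."""
    byte &= 0xFF
    if (addr & 1) == 0:
        return (byte << 8) | (old_word & 0x00FF)
    return (old_word & 0xFF00) | byte
```

Using the >1234 example: writing >55 to >1000 gives >5534, writing >55 to >1001 gives >1255, and reading the byte at >1001 yields >3400.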
Internally I use the following hardware block to handle read operand processing for bytes:
    -- Byte aligner
    process(ea, rd_dat, operand_mode, operand_word)
    begin
        -- We have a byte operation. If the data came from register,
        -- we don't need to do anything. If it came from memory,
        -- we will zero extend and possibly shift.
        if operand_word or operand_mode(5 downto 4) = "00" then
            read_byte_aligner <= rd_dat;
        else
            -- Not register operand. Need to check that EA is still valid.
            if ea(0) = '0' then
                read_byte_aligner <= rd_dat(15 downto 8) & x"00";
            else
                read_byte_aligner <= rd_dat(7 downto 0) & x"00";
            end if;
        end if;
    end process;
These are the byte instructions:
For both source and destination operands you have the 5 addressing modes, using R3 as example we have:
So this definitely is a CISC architecture, as you can do things like:

    AB *R3+,@TABLE(R2)
This reads the source byte from the address in R3 and increments R3 by one. It then retrieves the immediate 16-bit address operand TABLE, and adds that to R2 to form an indexed destination address. It then reads the byte from that destination address, adds it to the source byte, and writes that byte back. As explained before, the actual read operations on the memory bus are 16-bit operations, so there is byte shuffling going on simultaneously, depending on the actual effective addresses. Again since the workspace registers are actually in memory, the CPU core must also calculate...
Today and yesterday I had more time to work on the project. I refactored the code and learned some more VHDL. I also greatly improved my workflow by creating a Python script which takes a TMS9900 binary file and spits out the definition of a 64 word ROM in VHDL with the code. This allows for very quick (10 seconds or so) code changes and simulation reruns, without any manual work.
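A converter of this kind can be very short. The following is a minimal sketch, not the actual script: the function name, the VHDL type and constant names and the zero padding are all assumptions. The TMS9900 is big endian, so each pair of bytes becomes one 16-bit word with the first byte as the high byte.

```python
def bin_to_vhdl_rom(data, words=64, name="pgm_rom"):
    """Render a binary image as a VHDL constant array of 16-bit words.

    data:  raw bytes of the assembled program
    words: ROM size in 16-bit words (unused space is padded with zeros)
    name:  hypothetical name for the generated VHDL constant
    """
    if len(data) > words * 2:
        raise ValueError("binary does not fit in the ROM")
    padded = data.ljust(words * 2, b"\x00")
    lines = [
        f"type {name}_type is array (0 to {words - 1}) of std_logic_vector(15 downto 0);",
        f"constant {name} : {name}_type := (",
    ]
    for i in range(words):
        word = int.from_bytes(padded[2 * i:2 * i + 2], "big")
        separator = "," if i < words - 1 else ""
        lines.append(f'    x"{word:04X}"{separator}')
    lines.append(");")
    return "\n".join(lines)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        print(bin_to_vhdl_rom(f.read()))
```

Run with the binary file name as the only argument and redirect the output to a file that the ROM component can include or paste in.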
The TMS9900 supports a bunch of single operand instructions (i.e. the source and destination are the same, for example):
Here the R1 register is incremented, so the source is R1 and the destination is also R1. I refactored the VHDL code to calculate the effective address of the source operand while also properly handling all side effects, allowing the effective address to be used twice after operand calculation (once for the value read, a second time for the result write, with the computation in between).
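The idea of "compute the effective address once, including side effects, then reuse it" can be sketched in Python. This is a behavioural model only, with hypothetical names and a dict standing in for memory. The mode numbers follow the 2-bit T field of the instruction word (0 register, 1 indirect, 2 symbolic or indexed, 3 indirect with auto increment), and pc is assumed to point at the word following the opcode.

```python
def read_word(mem, addr):
    """Read a 16-bit word. Missing locations read as zero in this toy model."""
    return mem.get(addr & 0xFFFE, 0)


def effective_address(mode, reg, wp, pc, mem, byte_op=False):
    """Return (ea, new_pc) and apply any side effect (auto increment) to mem."""
    reg_addr = (wp + 2 * reg) & 0xFFFF   # registers live in memory at WP + 2*n
    if mode == 0:                        # Rn: the register itself
        return reg_addr, pc
    if mode == 1:                        # *Rn: register holds the address
        return read_word(mem, reg_addr), pc
    if mode == 2:                        # @ADDR or @ADDR(Rn)
        base = read_word(mem, pc)        # extension word after the opcode
        index = read_word(mem, reg_addr) if reg != 0 else 0
        return (base + index) & 0xFFFF, (pc + 2) & 0xFFFF
    if mode == 3:                        # *Rn+: use address, then bump register
        ea = read_word(mem, reg_addr)
        step = 1 if byte_op else 2
        mem[reg_addr] = (ea + step) & 0xFFFF
        return ea, pc
    raise ValueError("mode must be 0..3")
```

A single operand instruction then reads from the returned address, computes, and writes back to the same address, without evaluating the operand (and its auto increment) a second time.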
Now I added support for all of the addressing modes for the single operand instructions, so all of the following work (tested with the CLR instruction, which clears the operand). The asterisk * is the comment in TMS9900 assembler, but also used to flag indirect operations:
    CLR R5       * Clear R5
    CLR *R5      * Clear memory word pointed to by R5
    CLR *R5+     * As above, also increment R5 by 2 to point to next word
    CLR @MEM1    * Clear the word with the 16-bit address MEM1
    CLR @4(R5)   * Clear the word in the address R5+4

There are 14 single operand instructions. I implemented all of them except one, the BLWP instruction, which will probably be the next one. So as additional instructions I now have (with the full suite of address modes):
    B @LABEL     * Jump to 16-bit address LABEL
    BL @LABEL    * As above, but with link: PC stored to R11 first
    CLR R4       * Clear R4
    SETO *R5     * Set memory word at address R5 to >FFFF
    INV R9       * Invert bits of R9
    NEG R10      * Negate R10 (i.e. 0-R10)
    ABS R10      * Take the absolute value of R10
    SWPB R5      * Swap bytes of R5
    INC R1       * Increment R1 by 1
    INCT R1      * Increment R1 by 2
    DEC R1       * Decrement R1 by 1
    DECT R1      * Decrement R1 by 2
    X R3         * Execute the opcode in R3 (UNTESTED)

Some things to note from above:
I've added a whole bunch of new functionality into the CPU core:
    ********** TEST 3       ** Simulation output
    BOOT LI R3,>8340        ** write to 8306 data 8340 1000001101000000
         LI R0,>1234        ** write to 8300 data 1234 0001001000110100
         LI R1,1            ** write to 8302 data 0001 0000000000000001
         MOV R0,*R3         ** write to 8340 data 1234 0001001000110100
         MOV *R3+,R2        ** write to 8306 data 8342 1000001101000010
    *                       ** write to 8304 data 1234 0001001000110100
         A R1,R2            ** write to 8304 data 1235 0001001000110101
         MOV R2,R8          ** write to 8310 data 1235 0001001000110101
         MOV R1,*R3         ** write to 8342 data 0001 0000000000000001
         A R1,*R3           ** write to 8342 data 0002 0000000000000010
         MOV @>4,@>8344
         JMP BOOT

And below is the picture of the timing sequence of running the MOV @>4,@>8344 instruction:

The core does a few extra memory accesses (it reads register 0 needlessly twice) so the execution takes a whopping 6 memory reads and one memory write (IAQ signal marks opcode fetch - from yellow line onwards). Thus, despite the 100MHz clock, this instruction takes almost 500ns. I will remove the unnecessary R0 reads (that's an instruction decode artifact) later. For now I am just happy this works!