Retro challenge 2017/04 project to create a TMS9900 compatible CPU core. Again in a month... Failure could be an option...
Today was a busy day in the office - I just had the time and energy to add 3 more instructions. They are from a new category - one that only allows a workspace register as a destination parameter, while the source operand can have all addressing modes, for example:
This category has five instructions in total (DIV and MPY are still missing) but COC, CZC and XOR are now done. Of these instructions XOR is the most familiar and supported by virtually all processors - it just does the XOR operation of source and destination and stores the result to destination while also setting 3 status flags.
COC and CZC are unusual instructions, I have not seen these on any other processor although I have programmed in assembler on many CPUs.
COC stands for "compare ones corresponding" and CZC stands for "compare zeros corresponding". Since they are comparison instructions, there is no actual data output other than the result of comparison which is stored in the zero flag.
I implemented both using new ALU operations, in VHDL as below. I don't think I have ever used these instructions, so this implementation follows from what I understood from the TMS9900 data sheet.
COC: alu_out <= ('0' & arg1 xor ('0' & arg2)) and ('0' & arg1);
CZC: alu_out <= ('0' & arg1 xor not ('0' & arg2)) and ('0' & arg1);
(The extra '0' bits above are just padding to account for the fact that the ALU is actually 17 bits wide, so that it can generate the carry flag, which these instructions do not use.)
Both of these comparison instructions take the source operand (arg1 above) and check that the destination operand has one bits (for COC) or zero bits (for CZC) in every position where the source has a one bit. I did the core of the comparison with XOR, and arg1 is used as a bit mask to leave only the relevant bits. If the check passes, the masked result is zero, so the standard comparison of the result to zero in the flag creation logic works.
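To make the masking concrete, here is a minimal Python sketch of the two ALU expressions above, reduced to 16 bits. The function names `coc` and `czc` and the boolean return value (standing in for the zero flag) are assumptions made for illustration; they are not part of the VHDL core.

```python
MASK16 = 0xFFFF  # the real ALU is 17 bits wide; the carry bit is ignored here


def coc(src: int, dst: int) -> bool:
    """Model of COC: True when every 1 bit of src is also 1 in dst."""
    alu_out = ((src ^ dst) & src) & MASK16
    return alu_out == 0  # this is what the zero-compare in the flag logic sees


def czc(src: int, dst: int) -> bool:
    """Model of CZC: True when every 1 bit of src is 0 in dst."""
    alu_out = ((src ^ (~dst & MASK16)) & src) & MASK16
    return alu_out == 0
```

With the source mask >00F0, `coc` passes for a destination of >12F4 and `czc` passes for >1204, since the masked bits are all ones and all zeros respectively.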
I'm not really sure why they thought around 1977 that these are useful instructions... I can only assume they originated from the minicomputer architecture. These operations can so simply be implemented with basic boolean operations. Clearly this architecture was not designed with C compilers in mind - but that is evident from many other things as well, such as the lack of a proper hardware supported stack.
Whatever - 3 more instructions done and a little tested - and only 4 instructions remain!
I was hoping to complete the core in terms of instruction set today - but no such luck. But I did add a bunch of instructions:
The XOP instruction turned out to be the real deal, a proper mega instruction. After adding the BLWP instruction I thought that it does not get more complex than that. The XOP is a kind of software interrupt, which transfers control via a table at address >0040. What is unique about the TMS9900 is that the XOP instruction supports a parameter, and the effective address of the parameter is put into register R11. This really turns the instruction into a door to many useful and compact constructs. It is sad that the design of the TI-99/4A ROM did not really make any provision for general use of this instruction.
An example: XOP *R3,3
This instruction activates XOP number 3 (out of 16). The vector address is calculated as >40+4*3, i.e. >4C. The new workspace pointer is loaded from [4C] and the new program counter from [4E]. Then no less than four values are stored into the new workspace: the old values of the PC, ST and W registers, and finally the effective address of *R3 (which is simply the contents of R3, in this case a memory pointer).
I modified the processing of the BLWP instruction to serve the Reset, XOP and BLWP use cases - I suspect that once I implement interrupts they will also use the same internal states, since interrupts are effectively a bit like XOPs or BLWPs, in that they also vector through a memory location, change context and save the previous state into the new context.
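The XOP sequence can be sketched in a few lines of Python. This is only a behavioural model under stated assumptions: memory is a dict of 16-bit words keyed by even byte address, the function name `xop` is made up for this example, and any change to the status register is left out. The register slots used for the saved context (R13, R14, R15) follow the RTWP description elsewhere in these logs.

```python
def xop(mem: dict, wp: int, pc: int, st: int, number: int, ea: int):
    """Behavioural model of XOP: returns the new (WP, PC) pair."""
    vector = 0x0040 + 4 * number      # XOP 3 -> >004C
    new_wp = mem[vector]              # first word: new workspace pointer
    new_pc = mem[vector + 2]          # second word: new program counter

    # The old context is captured before anything is overwritten.
    mem[new_wp + 2 * 11] = ea         # new R11 <- effective address of operand
    mem[new_wp + 2 * 13] = wp         # new R13 <- old W
    mem[new_wp + 2 * 14] = pc         # new R14 <- old PC
    mem[new_wp + 2 * 15] = st         # new R15 <- old ST
    return new_wp, new_pc
```

Because the function receives the old W, PC and ST as arguments, the ordering problem of saving them before they are replaced does not arise here; in hardware they have to be latched explicitly.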
After these instructions there are only seven instructions left to do! They are COC, CZC, XOR, MPY, DIV, LDCR and STCR. Of these I want the multiply instruction MPY to use the Xilinx FPGA DSP block for good performance. In addition interrupt support needs to go in, but that should be easy at this point due to the BLWP/XOP support.
Ok so more progress for today. I implemented many key (unique but a little obscure) features of the TMS9900. Still testing under simulation. I decided to postpone actual hardware synthesis until I have all the instructions somehow implemented. After these additions there are not many instructions missing anymore.
I tested all of the above in simulation. Not comprehensively, especially regarding flags. But BLWP and RTWP work - I actually changed reset processing so that reset is done by forcing a BLWP from address 0. BLWP does a ton of stuff:
When doing the above, care must be taken to capture the old values of the registers W, PC and ST before overwriting them with the new ones.
RTWP is an easy instruction - it has no operands. But it also does plenty: it reverses BLWP by loading W, PC and ST from R13, R14 and R15.
The shift instructions SLA, SRA, SRC and SRL always shift a workspace register, but the shift count is flexible. It can be given as an immediate argument in the instruction. If that is set to zero, the shift count is actually read from workspace register zero: the four LSBs of R0 are used as the shift count. And there is a catch there too - if those four LSBs of R0 are zero, the shift count is actually 16. I think for the shift instructions the carry and zero flags at least are set properly, but I am not sure yet of the other flags...
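The shift count rule is easy to get wrong, so here is a small Python sketch of it. The function name `shift_count` and its arguments are hypothetical; it only models the selection of the count, not the shift itself or the flags.

```python
def shift_count(count_field: int, r0: int) -> int:
    """Resolve the effective shift count from the 4-bit count field and R0."""
    if count_field != 0:
        return count_field & 0xF      # count given directly in the instruction
    count = r0 & 0xF                  # otherwise use the four LSBs of R0
    return 16 if count == 0 else count
```

So a count field of zero with R0 = >1234 shifts by 4, while R0 = >1230 shifts by 16.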
The single bit CRU instructions are also unique in that they use a special addressing mode that none of the other instructions use: the 8 LSBs of the instruction word become a sign extended offset to R12 for I/O bit addressing. Not only that - the 3 MSBs of I/O address are always zero and the offset is left shifted by one... The instructions are:
Again time constrained...
Still running the core in simulation, I added the support of byte operations. The TMS9900 has only one category of instructions which support byte operations: the dual operand instructions with all addressing modes. These are the most flexible instructions.
In principle byte operations are simple, because they are done by reading and writing 16-bit values. The bus only supports these, except for the single bit CRU operations that I don't support yet. So you read a 16-bit word, and put the relevant byte as the most significant byte. When writing to memory, you need to do a read-modify-write cycle, and put the relevant byte where it belongs.
For example, if at address >1000 you have a data word >1234, you have as bytes >12 at address >1000 and >34 at address >1001. Now if you do a MOVB to the destination address >1000 with source data of >55, the result will become >55 at >1000 and still >34 at >1001. Since the bus only supports 16-bit values, you have >5534 at >1000. Similarly, if you store >55 at >1001, the memory word at >1000 becomes >1255. Note that with TI assemblers the greater than sign > denotes a hexadecimal number.
Simple, right? In principle, yes, in practice not exactly, since there is an exception. If the write destination is a workspace register, you always modify the high byte of the register. Conceptually for a programmer this is very simple. If you for example consider the add byte instruction and do AB @>1001,R2 and at >1000 you have >1234, then the memory word at >1000 will be read, the least significant byte >34 (since the LSB of the address was 1) will be shifted to the MSB with zero extension (i.e. the word >3400) and that will be added to the contents of the most significant byte of R2. So you preserve the least significant byte of R2.
But if you consider the above as a hardware designer, and keep in mind that the registers are actually in memory, you may need to special case direct register accesses to make sure you always deal with the high bytes of registers. This comes back to how the hardware stores effective addresses, as the least significant bit of the effective operand address becomes a byte shifter control line. Now that I think about this, it actually may not be necessary to special case the registers... So it is useful to write these blog entries :)
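The read-modify-write for a byte store can be sketched as follows. This is a minimal Python model, assuming memory is a dict of 16-bit words keyed by even byte address; `write_byte` is a name invented for this example.

```python
def write_byte(mem: dict, addr: int, value: int) -> int:
    """Store one byte using a 16-bit read-modify-write; returns the new word."""
    word_addr = addr & 0xFFFE             # the bus only sees word addresses
    old = mem.get(word_addr, 0)           # read cycle
    if addr & 1:
        new = (old & 0xFF00) | (value & 0xFF)           # odd address: low byte
    else:
        new = (old & 0x00FF) | ((value & 0xFF) << 8)    # even address: high byte
    mem[word_addr] = new                  # write cycle
    return new
```

Starting from >1234 at >1000, storing >55 at >1000 gives >5534 and storing >55 at >1001 gives >1255, matching the example above.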
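The AB example can be worked through with a short Python sketch. The helper names `read_byte_aligned` and `add_byte_to_register` are assumptions for illustration, the starting value of R2 in the usage note is arbitrary, and status flags are ignored.

```python
def read_byte_aligned(word: int, addr: int) -> int:
    """Place the addressed byte in the high byte, low byte zeroed."""
    if addr & 1:
        return (word & 0x00FF) << 8   # odd address: low byte moves up
    return word & 0xFF00              # even address: high byte stays put


def add_byte_to_register(reg_word: int, src_aligned: int) -> int:
    """Add a byte (already in the high byte) to the high byte of a register."""
    high = ((reg_word & 0xFF00) + src_aligned) & 0xFF00
    return high | (reg_word & 0x00FF)  # the low byte of the register survives
```

For AB @>1001,R2 with >1234 at >1000, `read_byte_aligned(0x1234, 0x1001)` yields >3400. If R2 happened to hold >10AB, the result written back would be >44AB.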
Internally I use the following hardware block to handle read operand processing for bytes:
-- Byte aligner
process(ea, rd_dat, operand_mode, operand_word)
begin
    -- We have a byte operation. If the data came from register,
    -- we don't need to do anything. If it came from memory,
    -- we will zero extend and possibly shift.
    if operand_word or operand_mode(5 downto 4) = "00" then
        read_byte_aligner <= rd_dat;
    else
        -- Not register operand. Need to check that EA is still valid.
        if ea(0) = '0' then
            read_byte_aligner <= rd_dat(15 downto 8) & x"00";
        else
            read_byte_aligner <= rd_dat(7 downto 0) & x"00";
        end if;
    end if;
end process;
These are the byte instructions:
For both source and destination operands you have the 5 addressing modes, using R3 as example we have:
So this definitely is a CISC architecture, as you can do things like:
This reads the source byte from the address in R3 and increments R3 by one. It then retrieves the immediate 16-bit address operand TABLE, and adds that to R2 to get an indexed destination address. It then reads the byte from that destination address, adds it to the source byte, and writes the resulting byte back. As explained before, the actual read operations on the memory bus are 16-bit operations, so there is byte shuffling going on simultaneously, ...
Today and yesterday I had more time to work on the project. I refactored the code and learned some more VHDL. I also greatly improved my workflow by creating a Python script which takes a TMS9900 binary file and spits out the definition of a 64 word ROM in VHDL with the code. This allows very quick (10 seconds or so) code changes and simulation reruns, without any manual work.
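The original script is not shown, so the following is only a sketch of what such a converter might look like. The function name `rom_to_vhdl`, the zero padding of unused words and the exact output layout are assumptions; the type and constant names mirror the `pgmRom` declaration used in these logs. Bytes are combined high byte first, since the TMS9900 is big-endian.

```python
def rom_to_vhdl(data: bytes, words: int = 64, name: str = "pgmRom") -> str:
    """Turn a raw TMS9900 binary into a VHDL constant array of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole word
    values = [int.from_bytes(data[i:i + 2], "big")
              for i in range(0, len(data), 2)]
    if len(values) > words:
        raise ValueError("binary does not fit in the ROM")
    values += [0] * (words - len(values))     # assumption: fill the rest with zeros

    lines = [
        f"type {name}Array is array(0 to {words - 1}) "
        "of STD_LOGIC_VECTOR (15 downto 0);",
        f"constant {name} : {name}Array := (",
    ]
    for i, value in enumerate(values):
        sep = "," if i < words - 1 else ""    # no comma after the last element
        lines.append(f'    x"{value:04X}"{sep}')
    lines.append(");")
    return "\n".join(lines)
```

A wrapper that reads the binary with `open(path, "rb").read()` and prints the result is enough to paste the ROM into the testbench after each assembler run.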
The TMS9900 supports a bunch of single operand instructions (i.e. the source and destination are the same, for example):
Here the R1 register is incremented, so the source is R1 and the destination is also R1. I refactored the VHDL code to calculate the effective address of the source operand once and to properly handle all side effects. This allows the effective address to be used twice after the operand calculation: once for the value read and a second time for the result write, with the computation in between.
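The idea of computing the effective address once, side effects included, can be sketched in Python. This is a behavioural model only, assuming memory is a dict of 16-bit words keyed by even byte address. The function name and arguments are invented here; the mode numbers follow the two-bit addressing mode field of the TMS9900 instruction word.

```python
def effective_address(mem: dict, wp: int, mode: int, reg: int,
                      index_word: int = 0, byte_op: bool = False) -> int:
    """Return the operand address and apply any auto-increment side effect."""
    reg_addr = (wp + 2 * reg) & 0xFFFF        # registers live in memory at W
    if mode == 0:                             # Rn: the register itself
        return reg_addr
    if mode == 1:                             # *Rn: register holds the address
        return mem[reg_addr]
    if mode == 2:                             # @ADDR or @ADDR(Rn)
        base = 0 if reg == 0 else mem[reg_addr]   # R0 means no indexing
        return (base + index_word) & 0xFFFF
    if mode == 3:                             # *Rn+: use, then increment
        ea = mem[reg_addr]
        mem[reg_addr] = (ea + (1 if byte_op else 2)) & 0xFFFF
        return ea
    raise ValueError("mode must be 0..3")
```

An instruction like INC *R5+ would call this once, read `mem[ea]`, compute, and write `mem[ea]` back; calling it a second time for the write would wrongly increment R5 twice.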
Now I added support for all of the addressing modes for the single operand instructions, so all of the following work (tested with the CLR instruction, which clears the operand). The asterisk * is the comment in TMS9900 assembler, but also used to flag indirect operations:
       CLR  R5          * Clear R5
       CLR  *R5         * Clear memory word pointed to by R5
       CLR  *R5+        * As above, also increment R5 by 2 to point to next word
       CLR  @MEM1       * Clear the word with the 16-bit address MEM1
       CLR  @4(R5)      * Clear the word in the address R5+4
There are 14 single operand instructions. I implemented all of them except one, the BLWP instruction, which probably will be the next one. So as additional instructions I now have (with the full suite of address modes):
       B    @LABEL      * Jump to 16-bit address LABEL
       BL   @LABEL      * As above, but with link: PC stored to R11 first
       CLR  R4          * Clear R4
       SETO *R5         * Set memory word at address R5 to >FFFF
       INV  R9          * Invert bits of R9
       NEG  R10         * Negate R10 (i.e. 0-R10)
       ABS  R10         * Take the absolute value of R10
       SWPB R5          * Swap bytes of R5
       INC  R1          * Increment R1 by 1
       INCT R1          * Increment R1 by 2
       DEC  R1          * Decrement R1 by 1
       DECT R1          * Decrement R1 by 2
       X    R3          * Execute the opcode in R3 (UNTESTED)
Some things to note from above:
I've added a whole bunch of new functionality into the CPU core:
********** TEST 3       ** Simulation output
BOOT   LI   R3,>8340    ** write to 8306 data 8340 1000001101000000
       LI   R0,>1234    ** write to 8300 data 1234 0001001000110100
       LI   R1,1        ** write to 8302 data 0001 0000000000000001
       MOV  R0,*R3      ** write to 8340 data 1234 0001001000110100
       MOV  *R3+,R2     ** write to 8306 data 8342 1000001101000010
*                       ** write to 8304 data 1234 0001001000110100
       A    R1,R2       ** write to 8304 data 1235 0001001000110101
       MOV  R2,R8       ** write to 8310 data 1235 0001001000110101
       MOV  R1,*R3      ** write to 8342 data 0001 0000000000000001
       A    R1,*R3      ** write to 8342 data 0002 0000000000000010
       MOV  @>4,@>8344
       JMP  BOOT
And below is the picture of the timing sequence of running the MOV @>4,@>8344 instruction:
The core does a few extra memory accesses (it reads register 0 needlessly twice) so the execution takes a whopping 6 memory reads and one memory write (the IAQ signal marks the opcode fetch - from the yellow line onwards). Thus, despite the 100MHz clock, this instruction takes almost 500ns. I will remove the unnecessary R0 reads (that's an instruction decode artifact) later. For now I am just happy this works!
Now the design implements a few more instructions, totalling five:
       LI   Rx,imm
       AI   Rx,imm
       LWPI imm
       LIMI imm4
       JMP  offset8
These instructions all have immediate operands and are two words long, except the JMP which is a single word instruction.
Above imm is a 16-bit immediate value, imm4 a four bit immediate value, Rx designates a register R0-R15, and offset8 an 8-bit signed offset.
The TMS9900 is an unusual processor in that it only has three registers directly accessible for the programmer, yet the programming model provides the programmer with 16 registers R0-R15. This is done by means of indirection: the register W points to a word aligned region of memory, where the 16 "workspace registers" are kept, taking 32 bytes. The hardware registers are:
PC program counter
W workspace pointer
ST status register
This architecture means that the memory bus gets very busy when executing instructions. The most advanced instruction I have implemented so far is AI (add immediate), where a constant immediate number is simply added to a workspace register. For example AI R3,1 would add the number 1 to workspace register R3. Simple, right? It is, but when you implement this part of the microprocessor core, a whole lot of states are needed. The VHDL code does roughly the following:
1. Fetch state, initiate the opcode fetch from address pointed to by the PC register
2. Start the memory cycle, also increment PC by 2 to point to next opcode
3. Wait for the memory cycle to finish
4. Decode state, examine the opcode that was fetched and write it to the instruction register IR. Here we see that the instruction is AI and go to the first state of AI processing
5. Immediate operand fetch, the AI instruction is a two word instruction, so at this point another fetch from PC is prepared.
6. Do the memory cycle, also increment PC by 2 to point to next opcode. Similar to steps 2&3.
7. Execute step starts: the AI instruction needs the old value of R3. For that we need to first calculate where R3 is. Thus we initiate an ALU cycle to add W and 3*2 (registers are 16 bits, i.e. two bytes).
8. The ALU has done the addition for the address of R3, so we can initiate a memory cycle from that address to fetch old contents of R3. The address of R3 is stored for later.
9. Once the R3 fetch is complete, another ALU cycle is initiated. This time it is the actual addition operation, so the contents of R3 and the immediate operand are forwarded to the ALU input registers. Also the ALU is configured for an add operation.
10. Finally a memory write cycle is started, to store the result value from the previous state 9 to the address calculated in step 8. The outgoing databus is driven with the data (embedded cores do not have three state databuses, instead there is a separate output bus and another input bus).
11. Wait for memory cycle completion. Once that is done, go back to state 1 for the next instruction.
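The steps above can be condensed into a small Python model that logs each memory cycle, which makes the bus traffic of a "simple" add visible. This is a sketch under assumptions: memory is a dict of 16-bit words keyed by even byte address, the function name `execute_ai` is invented, flags are ignored, and the wait states are not modelled. The register number is taken from the low four bits of the opcode word, as in the TMS9900 encoding of AI (>0220 plus the register number).

```python
def execute_ai(mem: dict, pc: int, wp: int):
    """Run one AI instruction; returns (new_pc, log of memory cycles)."""
    log = []

    def read(addr):
        log.append(("R", addr))
        return mem[addr]

    def write(addr, value):
        log.append(("W", addr))
        mem[addr] = value & 0xFFFF

    opcode = read(pc)                                # steps 1-4: fetch and decode
    pc = (pc + 2) & 0xFFFF
    imm = read(pc)                                   # steps 5-6: immediate operand
    pc = (pc + 2) & 0xFFFF
    reg_addr = (wp + 2 * (opcode & 0xF)) & 0xFFFF    # step 7: ALU computes W + 2*n
    old = read(reg_addr)                             # step 8: fetch old register value
    result = (old + imm) & 0xFFFF                    # step 9: the actual addition
    write(reg_addr, result)                          # steps 10-11: write back
    return pc, log
```

Even in this idealised form AI R3,1 needs three reads and one write, all on the same 16-bit bus.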
In order to perform the above operations, an ALU also needed to be added. The ALU is not yet complete, it just does a few operations and does not compute all the status flags.
I have successfully simulated the following program, and that proves that the LI, AI and JMP instructions work. TMS 9900 assemblers typically implement the NOP (no-operation) as a JMP to the next instruction, there is no bespoke opcode for that. The program below does not show the reset vectors.
* Erik Piehl (C) 2017 April
* test9900.asm
*
* Test program sequences to test drive the TMS9900 VHDL core.
*
       IDT 'TEST9900'
BOOT   NOP
       LI   R3,>ED07
LOOPPI AI   R3,>0001
       JMP  LOOPPI
SLAST  END  BOOT
When compiled, the following VHDL code implements the ROM memory containing the above program for simulation purposes:
-- Program ROM
type pgmRomArray is array(0 to 11) of STD_LOGIC_VECTOR (15 downto 0);
constant pgmRom : pgmRomArray := (
    x"8300", -- initial W
    x"0008", -- initial PC
    x"BEEF",
    x"BEEF",
    x"1000", -- BOOT: NOP
    x"02E0",
    x"83E0"...
After working on the CPU core a few hours to get started, I was able to complete my initial objectives:
Below is a picture from one of the first simulation runs. There are two instructions, like this:
5678 JMP >567A * effectively a NOP
567A JMP >5678 * branch back to previous line
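As a cross-check of these two instructions, the jump encoding can be sketched in Python. The function name `encode_jmp` is made up for this example. It assumes the signed 8-bit offset counts words relative to the address following the JMP, which is consistent with the NOP (a JMP to the next instruction) assembling to >1000.

```python
def encode_jmp(addr: int, target: int) -> int:
    """Encode a JMP located at addr that branches to target."""
    delta = target - (addr + 2)        # relative to the next instruction
    if delta % 2:
        raise ValueError("target must be word aligned")
    disp = delta // 2                  # the offset is counted in words
    if not -128 <= disp <= 127:
        raise ValueError("target out of range for an 8-bit offset")
    return 0x1000 | (disp & 0xFF)
```

Under these assumptions the JMP at >5678 encodes as >1000 and the JMP at >567A as >10FE.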