• 3 more instructions

    Erik Piehl, 11 hours ago

    Today was a busy day in the office - I just had the time and energy to add 3 more instructions. They are from a new category - one that only allows a workspace register as a destination parameter, while the source operand can have all addressing modes, for example:

    • COC R4,R3
    • CZC @TABLE(R2),R3
    • XOR *R7+,R2

    This category has five instructions in total (DIV and MPY are still missing) but COC, CZC and XOR are now done. Of these instructions XOR is the most familiar and supported by virtually all processors - it just does the XOR operation of source and destination and stores the result to destination while also setting 3 status flags.

    COC and CZC are unusual instructions. I have not seen them on any other processor, although I have programmed in assembler on many CPUs.

    COC stands for "compare ones corresponding" and CZC stands for "compare zeros corresponding". Since they are comparison instructions, there is no actual data output other than the result of comparison which is stored in the zero flag.

    I implemented both using new ALU operations, in VHDL as below. I don't think I have ever used these instructions, so this implementation follows from what I understood from the TMS9900 data sheet.

    COC: alu_out <= ('0' & arg1 xor ('0' & arg2)) and ('0' & arg1);

    CZC: alu_out <= ('0' & arg1 xor not ('0' & arg2)) and ('0' & arg1);

    (The extra '0' bits above are just padding, to account for the fact that the ALU is actually 17 bits wide in order to be able to generate the carry flag, which is not used by these instructions.)

    Both of these comparison instructions take the source operand (arg1 above) and check that the destination operand has one bits (for COC) or zero bits (for CZC) in every position where the source has a one bit. I did the core of the comparison with XOR, and arg1 is then used as a bit mask to leave only the relevant bits. If the masked result is zero the condition holds, so the standard compare-result-to-zero logic in the flag generation works as is.
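
    The two ALU expressions can be checked with a small behavioural model. This is only a Python sketch of the 16-bit part of the expressions above (the 17th carry bit is left out), and the function names `coc` and `czc` are made up for the illustration:

```python
MASK16 = 0xFFFF

def coc(src: int, dst: int) -> bool:
    """EQ flag for COC: true when every 1 bit of src is also 1 in dst."""
    alu_out = ((src ^ dst) & src) & MASK16
    return alu_out == 0

def czc(src: int, dst: int) -> bool:
    """EQ flag for CZC: true when every 1 bit of src is 0 in dst."""
    alu_out = ((src ^ (~dst & MASK16)) & src) & MASK16
    return alu_out == 0

if __name__ == "__main__":
    print(coc(0x00F0, 0x12F4))  # all four masked bits are ones in dst
    print(czc(0x00F0, 0x1204))  # all four masked bits are zeros in dst
```

    Seen this way, COC reduces to testing `src and not dst` against zero, and CZC to testing `src and dst` against zero.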

    I'm not really sure why they thought around 1977 that these are useful instructions... I can only assume they originated from the minicomputer architecture. These operations can so simply be implemented with basic boolean operations. Clearly this architecture was not designed with C compilers in mind - but that is evident from many other things as well, such as the lack of a proper hardware supported stack.

    Whatever - 3 more instructions done and a little tested - and only 4 instructions remain!

  • XOP, STST and external instructions

    Erik Piehl, a day ago

    I was hoping to complete the core in terms of instruction set today - but no such luck. But I did add a bunch of instructions:

    • XOP - extended operation. More about this below.
    • STST - store status word to a workspace register. I am kicking myself for not implementing this instruction before, as it enables very easy flag functionality verification: do a computation impacting flags, store flags into a register and do an immediate comparison. If mismatch, stop. The source material for the comparison needs to come from a genuine TMS9900 or more likely from the classic99 emulator - or from my TMS99105 based TI-99/4A clone. This should be an easy way to verify the behaviour of flags which is pretty involved on the TMS9900.
    • STWP - store workspace pointer to a workspace register.
    • IDLE, RSET, CKOF, CKON, LREX - These are so-called external instructions. They basically just show a status code on the bus. IDLE should also stop and wait for an interrupt but I am not doing that yet.

    The XOP instruction turned out to be the real deal, a proper mega instruction. After adding the BLWP instruction I thought that it does not get more complex than that. The XOP is a kind of software interrupt, which transfers control via a vector table starting at address >0040. What is unique about the TMS9900 is that the XOP instruction supports a parameter, and the effective address of the parameter is put into register R11 of the new workspace. This really turns the instruction into a door to many useful and compact constructs. It is sad that the design of the TI-99/4A ROM did not really make any provision for general use of this instruction.

    An example: XOP *R3,3

    This instruction activates XOP number 3 (out of 16). The vector address is calculated as >40+4*3, i.e. >4C. The new workspace pointer is loaded from [4C] and the new program counter from [4E]. Then no less than four values are stored into the new workspace: the old values of the PC, ST and W registers, and finally the effective address of *R3 (which happens to be the contents of R3, in this case a memory pointer).
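
    As a rough illustration of the sequence above, here is a Python sketch that models memory as a dictionary of 16-bit words keyed by byte address. The function name `xop_enter` and the example addresses are hypothetical, the R13/R14/R15 slots follow the BLWP convention, and any status flag changes done by the real instruction are ignored:

```python
def xop_enter(mem, number, operand_ea, pc, wp, st):
    """Model the context switch of XOP <operand>,<number>.

    mem maps even byte addresses to 16-bit words. Returns (new_pc, new_wp).
    """
    vector = 0x0040 + 4 * number          # XOP 3 -> >004C
    new_wp = mem[vector]                  # first vector word: workspace pointer
    new_pc = mem[vector + 2]              # second vector word: program counter
    mem[new_wp + 2 * 11] = operand_ea     # R11 <- effective address of operand
    mem[new_wp + 2 * 13] = wp             # R13 <- old W
    mem[new_wp + 2 * 14] = pc             # R14 <- old PC
    mem[new_wp + 2 * 15] = st             # R15 <- old ST
    return new_pc, new_wp

if __name__ == "__main__":
    mem = {0x004C: 0x8320, 0x004E: 0x0100}
    new_pc, new_wp = xop_enter(mem, 3, 0x8340, pc=0x0200, wp=0x8300, st=0x0000)
    print(hex(new_pc), hex(new_wp), hex(mem[0x8336]))
```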

    I modified the processing of the BLWP instruction to serve the Reset, XOP and BLWP use cases. I suspect that once I implement interrupts they will also use the same internal states, since interrupts are effectively a bit like XOPs or BLWPs: they also vector through a memory location, change context and save the previous state into the new context.

    After these instructions there are only seven instructions left to do! They are COC, CZC, XOR, MPY, DIV, LDCR and STCR. Of these I want the multiply instruction MPY to use the Xilinx FPGA DSP block for good performance. In addition interrupt support needs to go in, but that should be easy at this point due to the BLWP/XOP support.

  • BLWP, RTWP, Shifts and single bit I/O

    Erik Piehl, 2 days ago

    Ok so more progress for today. I implemented many key (unique but a little obscure) features of the TMS9900. Still testing under simulation. I decided to postpone actual hardware synthesis until I have all the instructions somehow implemented. After these additions there are not many instructions missing anymore.

    • Now the core can process the most complex TMS9900 instructions, BLWP and RTWP.
    • I also added all shift instructions SLA, SRA, SRC and SRL. These are fairly standard shift instructions as can be found in most processors.
    • The TMS9900 architecture has a unique serial I/O facility called the CRU interface. This interface supports single bit and multiple bit transfers, using 5 instructions overall. Now the core implements the single bit variety, with SBO, SBZ and TB instructions. The multiple bit instructions are not yet there (LDCR and STCR).

    I tested all of the above in simulation. Not comprehensively, especially regarding flags. But BLWP and RTWP work - I actually changed reset processing so that reset is done by forcing a BLWP from address 0. BLWP does a ton of stuff:

    • It has a source operand, which supports the normal slew of addressing modes.
    • Once the effective address of source operand is calculated, two 16-bit words are read from there: the new workspace pointer and the new PC.
    • As the CPU enters the new workspace, it saves the entire context of the CPU to the new workspace by writing old WP to R13, old PC to R14 and old flags (ST) to R15.
    • Finally the new workspace is entered and new execution pointer is established by loading W and PC.

    When doing the above, care must be taken to capture the old values of the registers W, PC and ST before overwriting them with the new ones.

    RTWP is an easy instruction - it has no operands. But it also does plenty: it reverses BLWP by loading W, PC and ST from R13, R14 and R15.
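
    The BLWP and RTWP pair can be sketched as a round trip in Python. This is a behavioural model only, with memory as a dictionary of 16-bit words keyed by byte address; the function names and the example addresses are invented for the illustration:

```python
def blwp(mem, ea, pc, wp, st):
    """Context switch through the two-word vector at effective address ea."""
    new_wp = mem[ea]
    new_pc = mem[ea + 2]
    # Capture the old register values before anything is overwritten.
    old_wp, old_pc, old_st = wp, pc, st
    mem[new_wp + 26] = old_wp   # R13
    mem[new_wp + 28] = old_pc   # R14
    mem[new_wp + 30] = old_st   # R15
    return new_pc, new_wp, st

def rtwp(mem, wp):
    """Reverse of blwp: reload PC, W and ST from R14, R13 and R15."""
    return mem[wp + 28], mem[wp + 26], mem[wp + 30]

if __name__ == "__main__":
    mem = {0x2000: 0x8320, 0x2002: 0x0400}
    pc, wp, st = blwp(mem, 0x2000, pc=0x0100, wp=0x8300, st=0x2000)
    print(hex(pc), hex(wp))
    print([hex(v) for v in rtwp(mem, wp)])
```

    Reset then becomes the same operation with `ea` forced to 0.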

    The shift instructions SLA, SRA, SRC and SRL are also flexible in that the operand to be shifted can be chosen flexibly with the full slew of addressing modes. The shift count can be given as an immediate argument. If it is set to zero, the shift count is actually read from workspace register zero, in which case the four LSBs of R0 are used as the shift count. And there is a catch there too: if those four LSBs of R0 are zero, the shift count is actually 16. I think the carry and zero flags at least are set properly for the shift instructions, but I am not sure yet about the other flags...
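
    The shift count rule is easy to get wrong, so here is a small Python sketch of it. The function name `shift_count` is made up; `count_field` stands for the 4-bit immediate count in the instruction and `r0` for the current contents of workspace register 0:

```python
def shift_count(count_field: int, r0: int) -> int:
    """Return the number of bit positions a shift instruction will use."""
    if count_field != 0:
        return count_field & 0xF      # immediate count 1..15
    low_nibble = r0 & 0xF             # count comes from the four LSBs of R0
    return low_nibble if low_nibble != 0 else 16

if __name__ == "__main__":
    print(shift_count(3, 0x1234))   # immediate count wins
    print(shift_count(0, 0x1234))   # four LSBs of R0
    print(shift_count(0, 0x1230))   # four LSBs are zero, so 16
```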

    The single bit CRU instructions are also unique in that they use a special addressing mode that none of the other instructions use: the 8 LSBs of the instruction word become a sign extended offset to R12 for I/O bit addressing. Not only that - the 3 MSBs of I/O address are always zero and the offset is left shifted by one... The instructions are:

    • SBO <offset> - write a one bit to R12+offset. This is done by driving the CRUOUT data line to one and issuing a clock pulse on CRUCLK. CRUOUT is only valid when CRUCLK is high. Since the core is intended to run at 100MHz, a single cycle CRUCLK pulse may be too fast, so I added a delay counter which keeps CRUCLK high for 4 cycles.
    • SBZ <offset> - the same as above but writes a zero bit.
    • TB <offset> - calculates the I/O bit address as above, and then samples the CRUIN signal. For this one I also allowed four cycles of stable address output before sampling CRUIN.
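
    The address calculation shared by SBO, SBZ and TB can be sketched in Python as follows. This is one reading of the description above (sign extended 8-bit offset, shifted left by one, added to R12, three MSBs forced to zero); the function name `cru_bit_address` is hypothetical, and clearing the lowest address bit is an assumption of the sketch:

```python
def cru_bit_address(r12: int, instruction: int) -> int:
    """Return the value driven on the address bus for a single bit CRU access."""
    offset = instruction & 0xFF
    if offset & 0x80:                 # sign extend the 8-bit offset
        offset -= 0x100
    # Offset counts bits, the bus address moves in steps of two.
    return (r12 + (offset << 1)) & 0x1FFE

if __name__ == "__main__":
    print(hex(cru_bit_address(0x0100, 0x1D01)))  # offset +1
    print(hex(cru_bit_address(0x0100, 0x1DFF)))  # offset -1
```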

  • Byte operations now supported - very CISCy

    Erik Piehl, 3 days ago

    Again time constrained...

    Still running the core in simulation, I added the support of byte operations. The TMS9900 has only one category of instructions which support byte operations: the dual operand instructions with all addressing modes. These are the most flexible instructions.

    In principle byte operations are simple, because they are done by reading and writing 16-bit values. The bus only supports 16-bit transfers, apart from the single bit CRU operations that I don't support yet. So you read a 16-bit word, and put the relevant byte into the most significant byte position. When writing to memory, you need to do a read-modify-write cycle, and put the relevant byte where it belongs.

    For example, if at address >1000 you have the data word >1234, you have as bytes >12 at address >1000 and >34 at address >1001. Now if you do a MOVB to the destination address >1000 with source data of >55, the result will be >55 at >1000 and still >34 at >1001. Since the bus only supports 16-bit values, you have >5534 at >1000. Similarly, if you store >55 at >1001, the memory word at >1000 becomes >1255. Note that with TI assemblers the greater than sign > denotes a hexadecimal number.
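
    The read-modify-write cycle for a byte store can be written out as a short Python sketch. Memory is modelled as a dictionary of 16-bit words keyed by even byte address, and `write_byte` is a name invented for the example:

```python
def write_byte(mem, addr: int, value: int) -> None:
    """Store one byte using only 16-bit reads and writes (big-endian words)."""
    word_addr = addr & 0xFFFE
    word = mem[word_addr]                                # read
    if addr & 1:
        word = (word & 0xFF00) | (value & 0xFF)          # odd address: low byte
    else:
        word = (word & 0x00FF) | ((value & 0xFF) << 8)   # even address: high byte
    mem[word_addr] = word                                # write back

if __name__ == "__main__":
    mem = {0x1000: 0x1234}
    write_byte(mem, 0x1000, 0x55)
    print(hex(mem[0x1000]))
    mem = {0x1000: 0x1234}
    write_byte(mem, 0x1001, 0x55)
    print(hex(mem[0x1000]))
```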

    Simple, right? In principle yes, in practice not exactly, since there is an exception: if the write destination is a workspace register, you always modify the high byte of the register. Conceptually this is very simple for a programmer. Consider for example the add byte instruction AB @>1001,R2 when the word at >1000 is >1234. The memory word at >1000 will be read, the least significant byte >34 (since the LSB of the address was 1) will be shifted to the MSB with zero extension (i.e. the word >3400), and that will be added to the most significant byte of R2. So the least significant byte of R2 is preserved.

    But if you consider the above as a hardware designer, and keep in mind that the registers are actually in memory, you may need to special case direct register accesses to make sure you always deal with the high bytes of registers. This comes back to how the hardware stores effective addresses, as the least significant bit of the effective operand address becomes a byte shifter control line. Now that I think about this, maybe it is actually not necessary to special case the registers... So it is useful to write these blog entries :)

    Internally I use the following hardware block to handle read operand processing for bytes:

    -- Byte aligner
    process(ea, rd_dat, operand_mode, operand_word)
    begin
        -- We have a byte operation. If the data came from register,
        -- we don't need to do anything. If it came from memory,
        -- we will zero extend and possibly shift.
        if operand_word or operand_mode(5 downto 4) = "00" then
            read_byte_aligner <= rd_dat;
        else
            -- Not register operand. Need to check that EA is still valid.
            if ea(0) = '0' then
                -- Even address: the byte is already in the high half.
                read_byte_aligner <= rd_dat(15 downto 8) & x"00";
            else
                -- Odd address: move the low byte to the high half.
                read_byte_aligner <= rd_dat(7 downto 0) & x"00";
            end if;
        end if;
    end process;

    These are the byte instructions:

    • AB - add bytes
    • CB - compare bytes
    • SB - subtract bytes
    • SOCB - set ones corresponding bytes (actually an OR operation)
    • SZCB - set zeros corresponding bytes (an AND NOT operation)
    • MOVB - move bytes

    For both source and destination operands you have the 5 addressing modes, using R3 as example we have:

    • R3
    • *R3
    • *R3+
    • @LABEL
    • @TABLE(R3)

    So this definitely is a CISC architecture, as you can do things like:

    AB *R3+,@TABLE(R2)

    This reads the source byte from the address held in R3 and increments R3 by one. It then retrieves the immediate 16-bit address operand TABLE, and adds that to R2 to get an indexed destination address. It then reads the byte from that destination address, adds the source byte to it, and writes the resulting byte back. As explained before, the actual read operations on the memory bus are 16-bit operations, so there is byte shuffling going on simultaneously,...
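
    The five addressing modes can be summarised with a Python sketch of the effective address calculation. This is a behavioural model and not the VHDL state machine: memory is a dictionary of 16-bit words keyed by byte address, the function name is made up, and the 2-bit mode numbering (0 register, 1 indirect, 2 symbolic or indexed, 3 indirect with auto increment) is my understanding of the TMS9900 encoding:

```python
def effective_address(mode, reg, wp, mem, index_word=0, byte_op=False):
    """Return the operand's effective address, applying auto increment."""
    reg_addr = (wp + 2 * reg) & 0xFFFF       # registers live in memory at W
    if mode == 0:                            # R3
        return reg_addr
    if mode == 1:                            # *R3
        return mem[reg_addr]
    if mode == 3:                            # *R3+
        ea = mem[reg_addr]
        step = 1 if byte_op else 2           # bytes step by 1, words by 2
        mem[reg_addr] = (ea + step) & 0xFFFF
        return ea
    # mode 2: @LABEL when reg is 0, otherwise @TABLE(R3)
    base = mem[reg_addr] if reg != 0 else 0
    return (index_word + base) & 0xFFFF

if __name__ == "__main__":
    # AB *R3+,@TABLE(R2) with TABLE = >3000, R3 = >2001, R2 = >0010
    mem = {0x8306: 0x2001, 0x8304: 0x0010}
    src = effective_address(3, 3, 0x8300, mem, byte_op=True)
    dst = effective_address(2, 2, 0x8300, mem, index_word=0x3000)
    print(hex(src), hex(mem[0x8306]), hex(dst))
```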


  • Workflow optimised, subroutines and single operand instructions!

    Erik Piehl, 04/16/2017 at 19:07

    Today and yesterday I had more time to work on the project. I refactored the code and learned some more VHDL. I also greatly improved my workflow by creating a Python script which takes a TMS9900 binary file and spits out the definition of a 64 word ROM in VHDL containing the code. This allows for very quick (10 seconds or so) code changes and simulation reruns, without any manual work.
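
    The script itself is not shown in these posts. As an illustration of the idea only, a converter could look roughly like the Python sketch below. The function name `binary_to_vhdl_rom` is invented, the `pgmRomArray` and `pgmRom` names are borrowed from the ROM listing in the "We can perform simple additions!" post, and the binary is assumed to hold big-endian 16-bit words:

```python
def binary_to_vhdl_rom(data: bytes, words: int = 64) -> str:
    """Render a binary image as a VHDL constant array of 16-bit words."""
    # Pad with zeros (or truncate) to exactly the requested ROM size.
    data = data.ljust(words * 2, b"\x00")[: words * 2]
    lines = [
        f"type pgmRomArray is array(0 to {words - 1}) of STD_LOGIC_VECTOR (15 downto 0);",
        "constant pgmRom : pgmRomArray := (",
    ]
    for i in range(words):
        word = int.from_bytes(data[2 * i : 2 * i + 2], "big")
        separator = "," if i < words - 1 else ""
        lines.append(f'    x"{word:04X}"{separator}')
    lines.append(");")
    return "\n".join(lines)

if __name__ == "__main__":
    # Reset vector (W = >8300, PC = >0008) in a tiny 4 word ROM.
    print(binary_to_vhdl_rom(bytes([0x83, 0x00, 0x00, 0x08]), words=4))
```

    A real script would read the bytes from the assembler's output file and write the text into the VHDL source before the simulation is rerun.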

    The TMS9900 supports a bunch of single operand instructions, i.e. instructions where the source and destination are the same. For example:

    INC R1

    Here the R1 register is incremented, so the source is R1 and the destination is also R1. I refactored the VHDL code to calculate the effective address of the source operand and to properly handle all side effects, allowing the effective address to be used twice after operand calculation (once for reading the value and a second time for writing the result, with the computation in between).

    Now I added support for all of the addressing modes for the single operand instructions, so all of the following work (tested with the CLR instruction, which clears the operand). The asterisk * is the comment in TMS9900 assembler, but also used to flag indirect operations:

    CLR R5    * Clear R5
    CLR *R5    * Clear memory word pointed to by R5
    CLR *R5+    * As above, also increment R5 by 2 to point to next word
    CLR @MEM1    * Clear the word with the 16-bit address MEM1
    CLR @4(R5)    * Clear the word in the address R5+4
    There are 14 single operand instructions. I implemented all of them except one, the BLWP instruction, which will probably be the next one. So as additional instructions I now have (with the full suite of addressing modes):
    B    @LABEL  * Jump to 16-bit address LABEL
    BL   @LABEL  * As above, but with link: PC stored to R11 first
    CLR  R4      * Clear R4
    SETO *R5     * Set memory word at address R5 to >FFFF
    INV  R9      * Invert bits of R9
    NEG  R10     * Negate R10 (i.e. 0-R10)
    ABS  R10     * Take the absolute value of R10
    SWPB R5      * Swap bytes of R5
    INC  R1      * Increment R1 by 1
    INCT R1      * Increment R1 by 2
    DEC  R1      * Decrement R1 by 1
    DECT R1      * Decrement R1 by 2
    X    R3      * Execute the opcode in R3 (UNTESTED)
    Some things to note from above:
    • All instructions above support all 5 addressing modes (although for B and BL the direct register operand does not really make sense)
    • BL is a subroutine call. TMS9900 does not support a hardware stack. Instead the previous PC is stored to register R11.
    • The absolute jump instruction B can be used to implement a return from subroutine, by B *R11. With these two the core can now handle subroutines. Although only one level can be handled, or R11 has to be stored elsewhere.
    • The almighty BLWP instruction is not yet done. This stores not only the PC, but also the workspace pointer W and status register ST, but I don't have that support yet :)
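
    The data operations in the list above (everything except B, BL and X) all follow the same read, compute, write pattern on one effective address. A Python sketch of that pattern, ignoring status flags, might look like this; the table `SINGLE_OPS` and the function `execute_single` are names made up for the example:

```python
MASK16 = 0xFFFF

# 16-bit results of the single operand data instructions (flags not modelled).
SINGLE_OPS = {
    "CLR":  lambda v: 0,
    "SETO": lambda v: MASK16,
    "INV":  lambda v: ~v & MASK16,
    "NEG":  lambda v: -v & MASK16,
    "ABS":  lambda v: (-v & MASK16) if v & 0x8000 else v,
    "SWPB": lambda v: ((v << 8) | (v >> 8)) & MASK16,
    "INC":  lambda v: (v + 1) & MASK16,
    "INCT": lambda v: (v + 2) & MASK16,
    "DEC":  lambda v: (v - 1) & MASK16,
    "DECT": lambda v: (v - 2) & MASK16,
}

def execute_single(mem, ea, mnemonic):
    """Read the word at ea, apply the operation, write the result back."""
    mem[ea] = SINGLE_OPS[mnemonic](mem[ea])

if __name__ == "__main__":
    mem = {0x830A: 0x1234}
    execute_single(mem, 0x830A, "SWPB")
    print(hex(mem[0x830A]))
```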

  • A, S, C - these are instructions...

    Erik Piehl, 04/14/2017 at 20:17

    I've added a whole bunch of new functionality into the CPU core:

    • All branch instructions. I had just JMP in the past, now I have:
      • JLT
      • JLE
      • JEQ (tested)
      • JHE
      • JGT
      • JNE (tested)
      • JNC (tested - need rechecking)
      • JOC (tested - need rechecking)
      • JNO
      • JL
      • JH
      • JOP
    • Now the ALU supports more functionality:
      • Add, Sub, Compare, AND, OR and AND NOT operations
      • Carry generation (needs much more testing)
      • Setting of condition codes ST0 through ST4. These need much more testing and the implementation is bogus for sure.
    • More convenient read and write operations in the core architecture.
    • Support for the whole slew of source operand address modes (R9 used as example)
      • R9 Workspace register addressing
      • *R9 Workspace register indirect addressing
      • *R9+ Workspace register indirect auto increment addressing
      • @LABEL Direct addressing (immediate operand is memory address)
      • @TABLE(R9) Indexed addressing (UNTESTED)
    • Support for the whole slew of destination operand addressing modes. Some of these do not work properly; I need to add more states to handle all cases and properly support side effects
      • R9 Workspace register addressing
      • *R9 Workspace register indirect addressing
      • *R9+ Workspace register indirect auto increment addressing BOGUS
      • @LABEL Direct addressing (immediate operand is memory address) ONLY WORKS FOR MOV INSTRUCTION
      • @TABLE(R9) Indexed addressing (UNTESTED, POTENTIALLY WORKS FOR MOV)
    • Since the core now supports all addressing modes (although, as listed above, some are bogus and some untested) I was able to add the dual operand instructions. These are mostly untested. Below are some examples.
      • Move: MOV *R3+,R2
      • Add: A R1,*R3
      • Sub: S R2,R3
      • Compare: C R2,R3 Doesn't work, flag support missing
      • Or: SOC R2,R3 Untested
      • And not: SZC R2,R3 Untested

      The following test program runs correctly in the simulator:
      ********** TEST 3 ** Simulation output
        LI  R3,>8340    ** write to 8306 data 8340 1000001101000000
        LI  R0,>1234    ** write to 8300 data 1234 0001001000110100
        LI  R1,1        ** write to 8302 data 0001 0000000000000001
        MOV R0,*R3      ** write to 8340 data 1234 0001001000110100
        MOV *R3+,R2     ** write to 8306 data 8342 1000001101000010
      *                 ** write to 8304 data 1234 0001001000110100
        A   R1,R2       ** write to 8304 data 1235 0001001000110101
        MOV R2,R8       ** write to 8310 data 1235 0001001000110101
        MOV R1,*R3      ** write to 8342 data 0001 0000000000000001
        A   R1,*R3      ** write to 8342 data 0002 0000000000000010
        MOV @>4,@>8344
        JMP BOOT
      And below is the picture of the timing sequence of running the MOV @>4,@>8344 instruction. The core does a few extra memory accesses (it needlessly reads register 0 twice), so the execution takes a whopping 6 memory reads and one memory write (the IAQ signal marks the opcode fetch, from the yellow line onwards). Thus, despite the 100MHz clock, this instruction takes almost 500ns. I will remove the unnecessary R0 reads (an instruction decode artifact) later. For now I am just happy this works!

  • We can perform simple additions!

    Erik Piehl, 04/12/2017 at 01:08

    Now the design implements a few more instructions, totalling five:

    LI Rx,imm
    AI Rx,imm
    LWPI imm
    LIMI imm4
    JMP offset8
    These instructions all have immediate operands and are two words long, except JMP, which is a single word instruction.

    Above, imm is a 16-bit immediate value, imm4 a four bit immediate value, Rx designates a register R0-R15, and offset8 an 8-bit signed offset.

    The TMS9900 is an unusual processor in that it only has three registers directly accessible for the programmer, yet the programming model provides the programmer with 16 registers R0-R15. This is done by means of indirection: the register W points to a word aligned region of memory, where the 16 "workspace registers" are kept, taking 32 bytes. The hardware registers are:

    PC program counter

    W workspace pointer

    ST status register

    This architecture means that the memory bus gets very busy when executing instructions. The most advanced instruction I have implemented so far is AI (add immediate), where a constant immediate number is simply added to a workspace register. For example AI R3,1 would add the number 1 to workspace register R3. Simple, right? It is, but when you implement this part of the microprocessor core, a whole lot of states are needed. The VHDL code does roughly the following:

    1. Fetch state, initiate the opcode fetch from address pointed to by the PC register

    2. Start the memory cycle, also increment PC by 2 to point to next opcode

    3. Wait for the memory cycle to finish

    4. Decode state, examine the opcode that was fetched and write it to the instruction register IR. Here we see that the instruction is AI and go to the first state of AI processing

    5. Immediate operand fetch, the AI instruction is a two word instruction, so at this point another fetch from PC is prepared.

    6. Do the memory cycle, also increment PC by 2 to point to next opcode. Similar to steps 2&3.

    7. Execute step starts: the AI instruction needs the old value of R3. For that we need to first calculate where R3 is. Thus we initiate an ALU cycle to add W and 3*2 (registers are 16 bits, i.e. two bytes).

    8. The ALU has done the addition for the address of R3, so we can initiate a memory cycle from that address to fetch old contents of R3. The address of R3 is stored for later.

    9. Once the R3 fetch is complete, another ALU cycle is initiated. This time it is the actual addition operation, so the contents of R3 and the immediate operand are forwarded to the ALU input registers. Also the ALU is configured for an add operation.

    10. Finally a memory write cycle is started, to store the result value from the previous state 9 to the address calculated in step 8. The outgoing databus is driven with the data (embedded cores do not have three state databuses, instead there is a separate output bus and another input bus).

    11. Wait for memory cycle completion. Once that is done, go back to state 1 for the next instruction.
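
    The same sequence can be condensed into a Python sketch that also counts bus accesses, which shows why the memory bus is so busy. This is a behavioural model rather than the state machine itself; the `Bus` class and `execute_ai` function are invented for the example, and the opcode pattern >022x for AI is taken from my reading of the TMS9900 instruction encoding:

```python
class Bus:
    """Word-wide memory that counts read and write cycles."""
    def __init__(self, contents):
        self.mem = dict(contents)
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.mem[addr]

    def write(self, addr, value):
        self.writes += 1
        self.mem[addr] = value & 0xFFFF

def execute_ai(bus, pc, wp):
    """Run one AI Rx,imm instruction and return the updated PC."""
    opcode = bus.read(pc)                    # steps 1-4: fetch and decode
    pc = (pc + 2) & 0xFFFF
    if (opcode & 0xFFF0) != 0x0220:
        raise ValueError("not an AI instruction")
    imm = bus.read(pc)                       # steps 5-6: immediate operand
    pc = (pc + 2) & 0xFFFF
    reg_addr = (wp + 2 * (opcode & 0xF)) & 0xFFFF   # step 7: W + 2*register
    old = bus.read(reg_addr)                 # step 8: fetch old register value
    result = (old + imm) & 0xFFFF            # step 9: the actual addition
    bus.write(reg_addr, result)              # steps 10-11: write back
    return pc

if __name__ == "__main__":
    bus = Bus({0x0100: 0x0223, 0x0102: 0x0001, 0x8306: 0xED07})
    pc = execute_ai(bus, 0x0100, 0x8300)     # AI R3,>0001
    print(hex(pc), hex(bus.mem[0x8306]), bus.reads, bus.writes)
```

    Even this simplest arithmetic instruction needs three reads and one write, because the register itself lives in memory.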

    In order to perform the above operations, an ALU also needed to be added. The ALU is not yet complete, it just does a few operations and does not compute all the status flags.

    I have successfully simulated the following program, and that proves that the LI, AI and JMP instructions work. TMS 9900 assemblers typically implement the NOP (no-operation) as a JMP to the next instruction, there is no bespoke opcode for that. The program below does not show the reset vectors.

    * Erik Piehl (C) 2017 April
    * test9900.asm
    * Test program sequences to test drive the TMS9900 VHDL core.
    	IDT 'TEST9900'
    	LI R3,>ED07
    	AI	R3,>0001
    When compiled, the following VHDL code implements the ROM memory containing the above program for simulation purposes:
            -- Program ROM
            type pgmRomArray is array(0 to 11) of STD_LOGIC_VECTOR (15 downto 0);
            constant pgmRom : pgmRomArray := (
                    x"8300", -- initial W
                    x"0008", -- initial PC
                    x"1000",                                -- BOOT: NOP
                    x"02E0", x"83E0"...

  • First simulation run

    Erik Piehl, 04/03/2017 at 12:37

    After working on the CPU core a few hours to get started, I was able to complete my initial objectives:

    • The CPU is able to process reset (i.e. fetch reset workspace pointer and initial program counter)
    • The CPU can fetch instructions
    • It can execute the unconditional branch instruction

    Below is a picture from one of the first simulation runs. There are two instructions, like this:

    5678 JMP >567A   * effectively a NOP
    567A JMP >5678   * branch back to previous line
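
    The two jumps above can be reproduced with a Python sketch of the JMP offset arithmetic. It assumes the 8-bit signed offset counts words relative to the address of the following instruction, which matches the NOP encoding x"1000" seen in the ROM listing of a later post; the function names are made up:

```python
def encode_jmp(pc: int, target: int) -> int:
    """Return the opcode word for a JMP located at pc going to target."""
    offset = (target - (pc + 2)) // 2        # word offset from next instruction
    if not -128 <= offset <= 127:
        raise ValueError("target out of range for JMP")
    return 0x1000 | (offset & 0xFF)

def jmp_target(pc: int, opcode: int) -> int:
    """Return the address a JMP opcode located at pc branches to."""
    offset = opcode & 0xFF
    if offset & 0x80:                        # sign extend the 8-bit offset
        offset -= 0x100
    return (pc + 2 + 2 * offset) & 0xFFFF

if __name__ == "__main__":
    print(hex(encode_jmp(0x5678, 0x567A)))   # jump to next instruction, a NOP
    print(hex(encode_jmp(0x567A, 0x5678)))   # branch back to previous line
```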