• VDP character cell address masking feature

    Erik Piehl, 11/01/2017 at 21:20

    I pushed an update to my TMS9918 VHDL core to GitHub, adding support for the undocumented but fairly widely known and used graphics mode 2 masking features. The lack of these features was the reason the megademo (see my previous update) systematically failed to display quite a few screens properly.

    With these fixes the megademo works much better, but there are still some problems (including the fact that the demo gets stuck at a certain point after running successfully through quite a few demo phases - the CPU core continues to run, but it appears to be in some kind of a loop that it cannot escape). So as always, fixing some bugs means it's time to fix the next bugs...

    The character masking feature appears in two places in the VHDL code, using the low bits of registers 4 and 3 as character cell masks. The example below illustrates the use of register 4 during character cell address calculation in graphics mode 2:

    -- Graphics mode 2. 768 unique characters are possible.
    -- Implement UNDOCUMENTED FEATURE: bits 1 and 0 of reg4 act as bit masks for the two
    -- MSBs of the 10 bit char code. This allows character set to be limited even in this mode.
    vram_out_addr <= reg4(2) -- MSB of the address
        & (char_addr(9 downto 8) and reg4(1 downto 0))  -- Character code with masks for bits 9 and 8
        & char_code & ypos(2 downto 0); -- 8 bit code and line in character
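
    For readers who do not read VHDL, the same address calculation can be sketched in C. This is only an illustrative model of the expression above; the function name and the way the arguments are split up are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative C model of the VHDL expression above (names are hypothetical).
 * reg4  : VDP register 4
 * third : bits 9..8 of the 10-bit character code (which third of the screen)
 * code  : 8-bit character code from the name table
 * line  : pixel row inside the character cell, 0..7
 * Returns the 14-bit VRAM address of the pattern byte. */
uint16_t gm2_pattern_addr(uint8_t reg4, uint8_t third, uint8_t code, uint8_t line)
{
    assert(third < 4 && line < 8);
    uint16_t base   = (uint16_t)(((reg4 >> 2) & 1u) << 13);  /* reg4(2): MSB of the address */
    uint16_t masked = (uint16_t)((third & reg4 & 3u) << 11); /* bits 9..8 ANDed with reg4(1..0) */
    return (uint16_t)(base | masked | ((uint16_t)code << 3) | line);
}
```

    With reg4 = >07 all 768 patterns are reachable. With reg4 = >04 the two top bits are forced to zero, so all three thirds of the screen share the first 256 patterns.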

  • Bug fixes and support for 512K cartridges

    Erik Piehl, 10/09/2017 at 14:58

    I did a couple of important bug fixes. I finally found, surprisingly quickly in the end, the bug that made the top pixel line look shifted. The problem was not actually on the top line: all the other scanlines of the picture were shifted right by one pixel. This can be seen in the picture below, for example by looking at the top pixels of the M character on the topmost line.

    I also modified the right border start setting to properly display border colour in 40 column text mode. In that mode the picture is 240 pixels wide, not 256 pixels as in all the other modes. Not dealing with this properly caused the VGA scanline doubler to show pixels that were not written to during screen refresh.

    Then I changed the memory mapping to support 512K cartridges. I did this by reallocating the mapping of the 1MB external memory to the TI-99/4A. Now 512K is allocated for paged cartridges (up from 64K), at the expense of reducing the SAMS compatible memory to 256K. Importantly, this allowed me to run the cool TI-99/4A megademo called "don't mess with Texas", and running that demo did reveal some bugs. Below is the video.

  • Speed control needed - and added

    Erik Piehl, 09/20/2017 at 21:30

    I wanted to continue my benchmarks and run my simple Basic program also under TI Extended Basic. That turned out to be impossible, as the keyboard repeat rate problem was much worse under Extended Basic than under the built-in Basic.

    It was time to do something about this. Instead of trying to hack the code (I tried briefly, but there was too much code to disassemble and understand), I went for a hardware solution. Execution speed on the TMS9900 is largely dependent on memory access speed, so I added a 6-bit delay counter, which lets me add up to 63 wait states per memory access. The Pepino FPGA board has 8 DIP switches, so I used three of them to select the number of wait states (I did this with a clocked latch, so the speed can be adjusted on the fly):

    • DIP switch 1 on: 63 wait states
    • DIP switch 2 on: 31 wait states
    • DIP switch 3 on: 8 wait states
    • All off: no wait states

    Switch 1 has priority, so if it is on there will be 63 wait states. I also took a quick look at the CPU's memory timing under simulation: with no wait states reads take 40ns and writes 60ns, with 63 wait states reads take 670ns.
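
    The switch decoding and the simulated timings can be summarised in a small C model. This is a sketch, not the VHDL: the bit assignment of the switches and the priority of switch 2 over switch 3 are assumptions, and the 10 ns per wait state is simply what the two simulated read timings (40 ns and 670 ns) imply.

```c
#include <assert.h>

/* Hypothetical bit assignment: bit 0 = DIP switch 1, bit 1 = switch 2, bit 2 = switch 3. */
unsigned wait_states_from_switches(unsigned sw)
{
    if (sw & 1u) return 63;   /* switch 1 has priority */
    if (sw & 2u) return 31;   /* assumed to take priority over switch 3 */
    if (sw & 4u) return 8;
    return 0;                 /* all off: no wait states */
}

/* Read cycle length in ns: 40 ns with no wait states plus 10 ns per wait state. */
unsigned read_cycle_ns(unsigned wait_states)
{
    assert(wait_states <= 63);   /* limit of the 6-bit delay counter */
    return 40u + 10u * wait_states;
}
```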

    Alas, it turned out that a 6-bit delay counter was too short, as I got these results when comparing the execution speed of my test program under TI Extended Basic:

    • Classic99 emulation: 1 min 11 s
    • 63 wait states: 24.7s, 2.9x faster
    • 31 wait states: 13.6s, 5.2x faster
    • 8 wait states: 5.6s, 12.7x faster
    • 0 wait states: 2.9s, 24.5x faster

    So even with the maximum of 63 wait states this thing goes too fast... Need to slow it down further. But not tonight. 
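
    As a rough estimate of how much longer the counter would need to be, the measurements above can be extrapolated linearly. This is only a back-of-the-envelope sketch: it assumes that run time grows linearly with the number of wait states (the 8 and 31 wait state measurements fit that line to within 0.1 s), and the function names are made up for this example.

```c
#include <assert.h>

/* Times are in tenths of a second to keep the arithmetic in integers. */
enum { T_0WS = 29, T_63WS = 247, T_TARGET = 710 };  /* 2.9 s, 24.7 s, 1 min 11 s */

/* Smallest wait state count whose extrapolated run time reaches the target. */
unsigned wait_states_for(unsigned target)
{
    assert(target >= T_0WS);
    unsigned num = (target - T_0WS) * 63u;
    unsigned den = T_63WS - T_0WS;
    return (num + den - 1u) / den;   /* round up */
}

/* Number of counter bits needed to count up to n. */
unsigned bits_needed(unsigned n)
{
    unsigned bits = 0;
    while (n) {
        bits++;
        n >>= 1;
    }
    return bits;
}
```

    Under that assumption wait_states_for(T_TARGET) comes out at 197, so an 8-bit delay counter should be enough to get down to the Classic99 speed.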

    Here is a video:

  • Nearly doubling the performance - 23x original TI-99/4A

    Erik Piehl, 09/17/2017 at 20:44

    I started to see how I could optimize the CPU.

    I looked at my memory interface code in the TMS9900 core, and realized I have been using very conservative timings - just to make sure that when debugging the CPU the memory interface does not cause problems. But now it is time to optimize!

    My TI Basic test program:

    10 for i=0 to 1000
    20 print i;" ";
    30 next


    It takes 160 seconds on a standard TI, and 11.6 seconds on the previous version of the CPU.
    I first tweaked the CPU memory interface on the read side, reducing the number of wait states.
    That took me from 11.6s to 8.9s, and after further tweaking the execution time dropped to 8.2s. All this just by reducing the wait states on the read side.
    Next I reduced the number of wait states on the write side. This brought the execution time down to 7.7s. The impact of reducing wait states on the write side is much smaller than on the read side, since the CPU mostly reads data and seldom writes it.
    After these changes I removed one extra "safety" state after each read (it was just there to make sure the bus interface has some time to settle after reads, but that is not really necessary as the main state machine adds a delay cycle anyway). That brought the time down to 7s. With these changes the execution time is only 60% of what it used to be! And the speed is now 22.9 times that of the original TI.
    As a final tweak I removed one extra "safety" state that was there after each write - for the same reason as the read cycles. That reduced run time to about 6.8s, so now the CPU runs my benchmark 23.5 times faster than the original TI.
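
    The speed-up figures are simply the run time on the standard TI divided by the run time on the FPGA. A small C helper (the name is made up for this example) reproduces them from the measured times:

```c
#include <assert.h>

/* Speed-up relative to a baseline, rounded to one decimal and returned
 * in tenths (229 means 22.9x). Both times are given in tenths of a second. */
unsigned speedup_x10(unsigned baseline_ds, unsigned time_ds)
{
    assert(time_ds > 0);
    return (20u * baseline_ds + time_ds) / (2u * time_ds);
}
```

    With the 160 s baseline this gives 13.8x for the previous version of the CPU (11.6 s), 22.9x at 7.0 s and 23.5x at 6.8 s.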

    Here is Parsec running at this new revised CPU:

    When doing these tests, I really appreciated the quick re-synthesis time, it only takes my PC a couple of minutes to do the synthesis, so test iterations are fast.
    I also took a look at how much FPGA capacity the current design takes - it takes 51% of the LUTs (look up tables), so there is plenty of space left. There are also some debug features included in here, and removing those would make the design smaller.

  • The keyboard repeat rate problem and fix

    Erik Piehl, 09/17/2017 at 19:45

    If you looked at the video I posted in the previous project log, you saw that I had great difficulty typing in Basic programs, because the keyboard repeat rate was just crazy when the CPU was running at 15x speed.

    I decided to tackle this problem by reading the TI ROM code from the excellent book "TI Intern". Page 21 looked promising, there was some kind of keyboard scanning routine delay:

    Time delay routine at >0498
    0498 LI 12,>04E2
    049C DEC 12
    049E JNE >049C
    04A0 B *11

    Unfortunately changing the above did not help: I modified the counter from >04E2 to >24E2, but it made no difference.

    After a little more searching (just for the word repeat in the book), I found a more promising piece of code. This time it was not in the Basic ROM, but in Basic GROM. GROM contains code in the interpreted GPL language, not TMS9900 machine code. I don't really know too much about GPL, but hey let's try changing it and see what happens:

    Pages 149 and 150 talk about the repeat counter GPL code. Memory location >830D is set to zero, and when it exceeds >FE, repeat occurs. After a repeat that location is decremented by >1E (or this is what I think the GPL code is doing). So the next attempt is to change the GPL code
    2A6B SUB @>830D,>1E 
    to a larger subtract, so that repeat would be slower. This actually helps! But the range is too small and sporadic repeats still occur, even SUB >FE is not enough. The parameter is byte sized, so I cannot subtract more than that. The FPGA CPU just goes too fast and the counter gets incremented from zero to FF too quickly.
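
    The repeat counter, as described above, can be modelled in a few lines of C. This is only a sketch of that description, not a translation of the GPL code, and the function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the repeat counter at >830D as described above (hypothetical helper).
 * The counter is incremented on every key scan while a key is held; once it
 * exceeds >FE the key repeats and `sub` is subtracted from the counter.
 * Returns the number of scans between two successive repeats. */
unsigned scans_between_repeats(uint8_t sub)
{
    assert(sub > 0);               /* with sub == 0 the key would repeat on every scan */
    uint8_t counter = 0;
    while (counter <= 0xFE)        /* initial delay before the first repeat */
        counter++;
    counter -= sub;

    unsigned scans = 0;
    while (counter <= 0xFE) {      /* count the scans until the next repeat */
        counter++;
        scans++;
    }
    return scans;
}
```

    The original >1E gives 30 scans between repeats and >FE gives 254, which is only about 8.5 times more and therefore less than the 15x speed-up of the CPU.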

    Then I got another idea: What if I could disable the repeat code altogether? At 2A4F there is CLR @>830D and it is a two byte opcode, just the same length as the INC opcode at 2A5F which takes care of counting the repeat up.

    What if we just copy the CLR opcode to 2A5F, overwriting the INC? Then key repeat counter never increments, and we should never get into trouble, right?
    2A4F contains 86 0D and this must be CLR @ opcode.
    2A5F contains 90 0D and this must be INC @ opcode. So I'll just put 86 in 2A60 and hope for the best. 

    That worked! No more repeats and keyboard is usable under TI Basic. The downside of this fix is that while it helps with TI Basic, I don't know if it helps in other programs such as TI Extended Basic, which may use their own code for key repeat - I guess I will see.

  • Success! FPGA based TI-99/4A working!

    Erik Piehl, 09/17/2017 at 07:41

    Finally I got my TMS9900 CPU to work enough that I can run original TI-99/4A software on my FPGA based TI-99/4A clone. Below you can find a link to my quick-and-dirty but rather long video about the whole project.

    Prior to this last working session I knew that I still needed to implement the divide instruction, so I went about doing it. I did that by first writing a very simple C program, and then converted that functionality to VHDL.

    #include <stdio.h>

    unsigned short tms9900_div(unsigned int dividend, int divisor) {
        unsigned short sa;      // source argument (divisor)
        unsigned short da0;     // destination argument (high 16 bits)
        unsigned short da1;     // destination argument (low 16 bits)
        printf("dividend: %u divisor: %d\n", dividend, divisor);
        // algorithm
        da0 = (dividend >> 16);
        da1 = dividend & 0xFFFF;
        sa = divisor;

        int st4;                // overflow flag: quotient does not fit in 16 bits
        if (
            (((sa & 0x8000) == 0 && (da0 & 0x8000) == 0x8000))
            || ((sa & 0x8000) == (da0 & 0x8000) && (((da0 - sa) & 0x8000) == 0))
            ) {
            st4 = 1;
        } else {
            st4 = 0;
            // actual division loop, here sa is known to be larger than da0.
            for(int i=0; i<16; i++) {
                // keep the bit shifted out of da0: the partial remainder needs 17 bits
                unsigned int rem = ((unsigned int)da0 << 1) | ((da1 >> 15) & 1);
                da1 <<= 1;
                if(rem >= sa) {
                    rem -= sa;
                    da1 |= 1;   // successful subtraction
                }
                da0 = rem;
            }
        }
        printf("quotient: %d remainder %d st4=%d\n", da1, da0, st4);
        if (sa != 0)            // skip the reference check for a zero divisor
            printf("checking: quotient %u remainder %u\n\n", dividend / sa, dividend % sa);
        return da1;
    }

    Getting this algorithm implementation to work took something like 15 minutes, so this was quickly done. The VHDL implementation did not take long either, although I did manage to introduce a few bugs. I had been putting off the implementation of the divide instruction a little since I thought it would take a long time, but actually it was quickly done.

    After implementing the divide instruction it was not smooth sailing yet, since the keyboard was not working properly. I traced the problem to the CRU interface instructions (LDCR and STCR). STCR, which reads from the external CRU and writes to a destination, returned bit-shifted data. As an example, the expected value for button '1' in my test program would have been >FEFF, but the read data was >FDFF, so there was a shift of one bit. I did multiple simulation runs with my VHDL test bed, but there it always worked. Finally, after some head scratching, this turned out to be a major timing error: the STCR instruction presented the address to read from on the first cycle, and already on the 2nd cycle following it (i.e. 10ns later) it was latching the data. Inside my FPGA TI-99/4A implementation that was way too fast, so I added a two clock cycle delay before sampling the CRUIN pin - and voilà, my TI-99/4A clone was running!

    The performance however turned out to be lower than expected: it only runs 15 times faster than the original TI, despite a 30 fold difference in clock speed (3.3MHz vs 100MHz). When I was creating the TMS9900 core my first priority was to get the bloody thing running, so I did not pay much attention to how many states each instruction has to flow through to perform its task. I do like to optimise though, and now that my TI clone is working, I can turn my attention to making it run even faster :)

    Source code can be found here:

    Link to GitHub (the FPGA CPU is in the soft CPU branch).

    And here is the video talking about the project a bit:

    Youtube link

  • Almost there!

    Erik Piehl, 09/13/2017 at 20:26

    After extensive debugging and comparison of execution logs between the FPGA CPU and the results of Classic99 emulator with the same ROMs, I found and fixed four bugs, one of them being quite nasty to find. But the results were very pleasing, now with my own boot ROM and Defender cartridge loaded I get this picture (story continues after the picture):


    For the first time the FPGA CPU renders the opening screen correctly! Interrupts were disabled (at hardware level) for this run. 

    Even more pleasing, I tested the bug fixes with the normal TI-99/4A ROMs, and got this boot picture for the very first time (story continues after the picture):


    Personally this was a wow moment! 

    So what were the bugs? Three related to flags, and one to addressing modes:

    • The logical greater than flag (ST0, also known as L>) was set incorrectly for the compare instructions (C and CB). Similarly the arithmetic greater than flag (ST1, also known as A>) was set incorrectly. I did not find this bug in the past, because in many scenarios the flags were set correctly. I had read the data sheet sloppily, and in the VHDL code I had accidentally swapped the source argument and destination argument inputs in the flag setting code when comparing their MSBs to detect certain conditions.
    • Related to the above, my flag setting code treated comparison (C and CB) and subtract (S and SB) instructions identically. For most CPUs this would be correct, but for the TMS9900 family the aforementioned flags ST0 and ST1 rather strangely only compare the result against zero for the subtract instruction. So I modified the code to properly distinguish S and C instructions, which required a number of changes.
    • In the data sheet the carry flag ST3 is documented for the subtract instruction to be set when "CARRY OUT" is set. However, "CARRY OUT" is not defined anywhere. I simply used ALU output bit 16 (i.e. the 17th bit of the ALU) as carry. This is fine for the addition instruction, but subtract actually inverts that bit. I guess in the original CPU implementation this was the most effective way to implement the ALU (normally done by inverting the number to be subtracted and tweaking carry so that an "add" operation becomes a "subtract").
    • Hardest of all to find, I could not understand why the compare bytes instruction "CB R5, @>6049" in the defender game cartridge set flags incorrectly with my FPGA CPU. I modified my boot ROM to run this instruction among the very first instructions, so that I could check the behaviour both under simulation and actual FPGA by running only a few instructions - and it worked properly. But the same instruction much later on - as instruction 11 460, did not set the flags properly. This was a very hard bug to find, but I finally found the problem after adding ALU input debug registers and making them available for my debug software. I could see that in the latter instance this instruction was producing different ALU inputs, despite the actual inputs being exactly the same. I finally traced down this problem to the operation of the byte aligner. It used an internal register simply called "EA" for effective address to perform the alignment of input bytes i.e. conversion of an input byte to 16-bit ALU input. Now this register was not set at all if the source operand was a register operand, i.e. in this case R5. Thus the byte alignment was random and depended on whatever code was being run before. The problem was actually generic to all instructions in the TMS9900 instruction set that used byte operands.
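
    The difference between the compare and subtract flags described in the list above can be sketched in C. This is based on the description above and on one reading of the TMS9900 data sheet; it is not the VHDL from the core, the overflow flag is left out, and all names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Status bits, TI numbering (bit 0 is the MSB of the status register). */
#define ST0_LGT 0x8000u  /* L> logical greater than    */
#define ST1_AGT 0x4000u  /* A> arithmetic greater than */
#define ST2_EQ  0x2000u  /* equal                      */
#define ST3_C   0x1000u  /* carry                      */

/* C sa,da: the source operand is compared against the destination operand. */
uint16_t flags_compare(uint16_t sa, uint16_t da)
{
    uint16_t st = 0;
    if (sa > da)                   st |= ST0_LGT;  /* unsigned comparison */
    if ((int16_t)sa > (int16_t)da) st |= ST1_AGT;  /* signed comparison   */
    if (sa == da)                  st |= ST2_EQ;
    return st;
}

/* S sa,da: da - sa is computed and the result is compared against zero. */
uint16_t flags_subtract(uint16_t sa, uint16_t da)
{
    uint32_t alu = (uint32_t)da - sa;          /* bit 16 is set on borrow */
    uint16_t res = (uint16_t)alu;
    uint16_t st  = 0;
    if (res != 0)          st |= ST0_LGT;
    if ((int16_t)res > 0)  st |= ST1_AGT;
    if (res == 0)          st |= ST2_EQ;
    if (!(alu & 0x10000u)) st |= ST3_C;        /* carry is the inverted bit 16 */
    return st;
}
```

    For example, comparing >FFFF against >0001 sets L> but not A>, while subtracting 5 from 3 sets L> (the result is non-zero) and leaves A> and carry clear.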

    After fixing all of the above the FPGA CPU runs the TI99/4A boot ROMs and renders the familiar boot picture! It then stops at address >0296 where it finds the opcode >3D06. This is a divide instruction, and the FPGA CPU does not support it yet, but rather simply stops and leaves the program counter pointing at the unimplemented instruction, making the problem easy to spot. I knew that this instruction was still not implemented, so I was happy to see that...


  • Debugging with Defender

    Erik Piehl, 09/06/2017 at 19:22

    Wow, it's been a really long while since I posted the last update here! Well, I have not given up on this project - quite the opposite. It's just that I haven't had time to work on this project in a long while. To my delight there have been more followers to this project in the meantime, so it is about time to show a sign of life.


    I have not made too much progress since the last update, the only thing I've done is add more support for debugging. Now when single stepping I record more information than in the past:

    • Program counter
    • Address of last write to memory
    • Data of last write to memory
    • Status register contents

    This stuff goes into a log file. The data is written by the Windows program running on the PC which controls the single stepping of the FPGA CPU. Basically it lets the CPU step one instruction, then it reads the above data (an example is below) and then continues with the next instruction.

    line:pc  :addr:data:st
       1:0028:83FA:9800:8DC0
       2:002C:83FC:0100:CDC0
       3:0030:83FE:8C02:8DC0
       4:0034:83E0:0020:CDC0
       5:005C:83E0:0020:CDC0
       6:005E:83E8:0000:29C0
       7:0060:83EC:0020:C9C0

    I compare this output of the FPGA based CPU to the output of the famous Classic99 emulator (I modified the emulator to record the same stuff). Then I wrote a Python script to compare the two files. This comparison cannot be done with a normal diff tool since there are some acceptable differences (for example my CPU sets the unused flag bits differently from a real TMS9900).
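
    The core of such a comparison can be sketched as follows. The actual script is written in Python and is not shown here; this C version is only an illustration of the idea, with made-up names, and it assumes that the unused status bits are bits 7 to 11 in TI numbering (mask >01F0).

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: status bits 7..11 (TI numbering, bit 0 = MSB) are unused on the
 * TMS9900, so only the bits in this mask are compared. */
#define ST_COMPARE_MASK 0xFE0Fu

typedef struct {
    unsigned line, pc, addr, data, st;
} step_t;

/* Parse one "line:pc:addr:data:st" record; returns 1 on success. */
int parse_step(const char *text, step_t *s)
{
    assert(text && s);
    return sscanf(text, "%u:%x:%x:%x:%x",
                  &s->line, &s->pc, &s->addr, &s->data, &s->st) == 5;
}

/* Two steps match when everything except the unused status bits agrees. */
int steps_match(const step_t *a, const step_t *b)
{
    return a->pc == b->pc && a->addr == b->addr && a->data == b->data
        && ((a->st ^ b->st) & ST_COMPARE_MASK) == 0;
}
```

    With this mask the status words >8DC0 and >8C00 count as equal, while >8DC0 and >CDC0 do not, because they differ in the A> flag.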

    In the past I've tried to do the analysis with TI ROMs, but unfortunately that doesn't produce any output before the FPGA gets stuck somewhere after correctly running a large number of instructions. Capturing the single step log is a slow process, due to the number of USB transactions needed - my debugging implementation is not that great in that respect. So I now decided to go with another strategy: rather than using the TI Basic ROMs, I'm trying to use the Defender game cartridge. Instead of the normal TI Basic routines firing up the game, I start the game "by hand" using a minimalistic boot loader. With the FPGA CPU that produces the following picture:

    This clearly is bogus as can be seen. For reference, my other FPGA project which uses a real TMS99105 CPU chip produces the following picture with the same ROMs loaded:

    So the positive thing is that the FPGA CPU does quite a few things right... Now I need to load this boot ROM / Defender combination into Classic99, capture the log and then make the comparison. For that I need to find out how to load my custom ROM in Classic99 instead of the normal Basic ROM...

    My motivation to use the defender game cartridge also comes from the fact that this game cartridge contains only a normal ROM chip, not a ROM + GROM combination. I hope that simplifies matters in debugging, as it should mean the GROM interface does not have to work perfectly for the game to work. The fonts seen in the pictures above are loaded from GROM to video memory by my boot code, so the GROM data is still initialised.

    Stay tuned, hopefully for not too long this time, as I am trying to make progress with debugging. With this long pause it takes a while to get back up to speed. Luckily I've become pretty good at taking notes - I can't trust my memory to serve me right in projects like this, with pauses of several months between work sessions.