• Support for the original keyboard

    Erik Piehl, 01/02/2018 at 20:08, 1 comment

    A quick addition of the day - this one was really easy to do as interfacing to a normal TI keyboard from the FPGA is way easier than communicating with the PC's keyboard through USB and the server process.

    The implementation quite literally only involved bringing out the keyboard row and column wires from my TMS9901 interface chip implementation inside the FPGA. There are no external active or passive components other than the keyboard switches, thanks to the internal pull-ups of the FPGA.

  • Stand-alone booting capability

    Erik Piehl, 12/30/2017 at 19:11, 0 comments

    An update at long last!

    The next step for the design is to make the FPGA system stand-alone, i.e. able to boot and operate without a host PC. A USB connection will still be needed, but only to provide power. Today I implemented a new feature: after reset, the FPGA logic loads 256K of data from the SPI flash ROM into the system's SRAM. That allows the system to get the TI-99/4A system ROMs and GROMs into the static RAM in the appropriate places. After the download, one of the DIP switches controls the CPU's automatic boot: if switch zero is set, the CPU in the FPGA will automatically boot and start executing the code that was transferred to SRAM.

    The 256K of data is divided into three regions:

    • The first 128K is written to SRAM from address zero upwards. The FPGA logic maps this area to the cartridge ROM slot of the TI-99/4A. This is a paged area of 8K pages. By default my scripts put the Extended Basic ROM code (16K) there.
    • The next 64K is written to SRAM from address 0x80000 onwards (i.e. at 512K). This is the area where GROM data is stored in my design. By default it holds 24K of system GROM first, followed by 32K of Extended Basic GROM code.
    • The last 64K is written to SRAM at address 0xB0000. This is my ROM area. It is largely unused, but the first 8K (at address 0xB0000) is the disk support DSR space, and another 8K block (at address 0xBA000) is mapped to address zero of the TMS9900 core's address space, thus containing the normal console ROM code.
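    The three regions above can be written down as a small table. This is only a sketch in C, not the FPGA logic itself: it assumes the regions sit back to back in the 256K flash image in the order listed, and the names boot_region and flash_to_sram are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* One contiguous chunk copied from SPI flash to SRAM at boot. */
typedef struct {
    uint32_t flash_offset;  /* offset inside the 256K boot image (assumed contiguous) */
    uint32_t size;          /* length in bytes */
    uint32_t sram_addr;     /* destination address in the 1M SRAM */
} boot_region;

static const boot_region boot_regions[] = {
    { 0x00000, 0x20000, 0x00000 },  /* 128K paged cartridge ROM area */
    { 0x20000, 0x10000, 0x80000 },  /* 64K GROM area */
    { 0x30000, 0x10000, 0xB0000 },  /* 64K ROM area (DSR, console ROM) */
};

/* Map an offset in the boot image to its SRAM address; -1 if out of range. */
long flash_to_sram(uint32_t offset) {
    for (size_t i = 0; i < sizeof boot_regions / sizeof boot_regions[0]; i++) {
        const boot_region *r = &boot_regions[i];
        if (offset >= r->flash_offset && offset - r->flash_offset < r->size)
            return (long)(r->sram_addr + (offset - r->flash_offset));
    }
    return -1;
}
```

    Under that assumption the console ROM, which ends up at SRAM address 0xBA000, would sit at offset 0x3A000 of the flash image, so flash_to_sram(0x3A000) gives 0xBA000.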

    The Pepino board has 1M of static RAM overall. I had forgotten that the board has actually 16 megabytes of SPI flash storage so there is plenty of potential here.

    The design of the SPI flash interface is from Magnus Karlsson, the designer of the Pepino FPGA board. I used the code from his Mac Plus example and modified it for my purposes. His code is written in Verilog while mine is in VHDL, so I wrote a standard VHDL component declaration to let me interface to the Verilog code from VHDL.

  • VDP character cell address masking feature

    Erik Piehl, 11/01/2017 at 21:20, 0 comments

    I pushed to GitHub an update to my TMS9918 VHDL core, adding support for the undocumented but fairly widely known and used graphics mode 2 masking features. The lack of this feature was the reason the megademo (see my previous update) consistently failed to display quite a few screens properly.

    With these fixes the megademo works much better, but there are still some problems. In particular, the demo gets stuck at a certain point after running successfully through quite a few demo phases: the CPU core continues to run, but it appears to be in some kind of loop that it cannot escape. So as always, fixing some bugs means it's time to fix the next bugs...

    The character masking feature appears in two places in the VHDL code, using the low bits of registers 4 and 3 as character cell masks. The example below illustrates the use of register 4 during character cell address calculation in graphics mode 2:

    -- Graphics mode 2. 768 unique characters are possible.
    -- Implement UNDOCUMENTED FEATURE: bits 1 and 0 of reg4 act as bit masks for the two
    -- MSBs of the 10 bit char code. This allows character set to be limited even in this mode.
    vram_out_addr <= reg4(2) -- MSB of the address
        & (char_addr(9 downto 8) and reg4(1 downto 0))  -- Character code with masks for bits 9 and 8
        & char_code & ypos(2 downto 0); -- 8 bit code and line in character
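    For readers who don't speak VHDL, the same concatenation can be modelled in C. This is a sketch under a few assumptions: reg4 is numbered with bit 0 as the least significant bit, char_addr is the 10-bit screen position whose top two bits select the screen third, and the function name gm2_pattern_addr is made up for illustration.

```c
#include <stdint.h>

/* 14-bit pattern address: 1 bit from reg4, 2 masked section bits,
   8-bit character code, 3-bit line within the character. */
uint16_t gm2_pattern_addr(uint8_t reg4, uint16_t char_addr,
                          uint8_t char_code, uint8_t ypos) {
    uint16_t msb     = (uint16_t)((reg4 >> 2) & 1u);                       /* reg4(2) */
    uint16_t section = (uint16_t)(((char_addr >> 8) & 3u) & (reg4 & 3u));  /* masked bits 9..8 */
    return (uint16_t)((msb << 13) | (section << 11) |
                      ((uint16_t)char_code << 3) | (ypos & 7u));
}
```

    With reg4 = 0x07 nothing is masked, so character 0x41 in the bottom third (char_addr = 0x200), line 5, reads address 0x320D. With reg4 = 0x04 both section bits are masked off and the same cell reads 0x220D, i.e. all three thirds share the first 256 patterns.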

  • Bug fixes and support for 512K cartridges

    Erik Piehl, 10/09/2017 at 14:58, 2 comments

    I did a couple of important bug fixes. I finally found, actually surprisingly quickly, the bug that made the top pixel line look shifted. The problem was not in the top line: all the other scanlines of the picture were shifted right by one pixel. This can be seen in the picture below, for example by looking at the top pixels of the M character on the topmost line.

    I also modified the right border start setting to properly display border colour in 40 column text mode. In that mode the picture is 240 pixels wide, not 256 pixels as in all the other modes. Not dealing with this properly caused the VGA scanline doubler to show pixels that were not written to during screen refresh.

    Then I changed the memory mapping to support 512K cartridges. I did this by reallocating how the 1MB of external memory maps into the TI-99/4A. Now 512K is allocated for paged cartridges (up from 64K). That came at the expense of reducing SAMS compatible memory to 256K. But importantly this allowed me to run the cool TI-99/4A megademo called "don't mess with Texas", and running that demo did reveal some bugs. Below is the video.

  • Speed control needed - and added

    Erik Piehl, 09/20/2017 at 21:30, 0 comments

    I wanted to continue my benchmarks and run my simple Basic program under TI Extended Basic as well. That turned out to be impossible, as the keyboard repeat rate problem was much worse under Extended Basic than under the built-in Basic.

    It was time to do something about this. Instead of trying to hack the code (I tried briefly, but there was too much code to disassemble and understand), it was time for a hardware solution. Execution speed on the TMS9900 is largely dependent on memory access speed. I added a 6-bit delay counter, which enabled me to add up to 63 wait states per memory access. The Pepino FPGA board has 8 DIP switches, so I used three of them to select the number of wait states (I did this with a clocked latch, so it is possible to adjust the speed on the fly):

    • DIP switch 1 on: 63 wait states
    • DIP switch 2 on: 31 wait states
    • DIP switch 3 on: 8 wait states
    • All off: no wait states

    Switch 1 has priority, so if it is on there will be 63 wait states. I also took a quick look at the CPU's memory timing under simulation: with no wait states reads take 40ns and writes 60ns, with 63 wait states reads take 670ns.
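    The switch decoding can be summarised in a few lines of C. This is a sketch, not the VHDL: the text only states that switch 1 has priority, so giving switch 2 priority over switch 3 is an assumption, and the 10 ns per wait state figure is inferred from the two simulated read times (40 ns and 670 ns). The function names are illustrative.

```c
/* Wait states selected by the three DIP switches (nonzero means "on"). */
int wait_states(int sw1, int sw2, int sw3) {
    if (sw1) return 63;   /* switch 1 has priority (stated in the text) */
    if (sw2) return 31;   /* assumed: switch 2 beats switch 3 */
    if (sw3) return 8;
    return 0;
}

/* Estimated read cycle in ns: 40 ns base plus an inferred 10 ns per wait state. */
int read_cycle_ns(int ws) {
    return 40 + 10 * ws;
}
```

    With this model read_cycle_ns(63) gives 670, matching the simulated figure.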

    Alas, it turned out that a 6-bit delay counter was too short, as I got these results when comparing execution speed under TI Extended Basic for my test program:

    • Classic99 emulation: 1 min 11 s
    • 63 wait states: 24.7s, 2.9x faster
    • 31 wait states: 13.6s, 5.2x faster
    • 8 wait states: 5.6s, 12.7x faster
    • 0 wait states: 2.9s, 24.5x faster

    So even with the maximum of 63 wait states this thing goes too fast... Need to slow it down further. But not tonight. 
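    The speed-up figures in the list are simply the Classic99 time (71 s) divided by each measured time. A small integer-only C helper, with a made-up name, reproduces them; times are passed in tenths of a second and the result is the factor times ten.

```c
/* Speed-up factor times ten, rounded to nearest.
   Both times are given in tenths of a second (71 s -> 710). */
unsigned speedup_x10(unsigned reference_ds, unsigned run_ds) {
    return (reference_ds * 10u + run_ds / 2u) / run_ds;
}
```

    For example speedup_x10(710, 247) gives 29, i.e. the 2.9x listed for 63 wait states.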

    Here is a video:

  • Nearly doubling the performance - 23x original TI-99/4A

    Erik Piehl, 09/17/2017 at 20:44, 1 comment

    I started to see how I could optimize the CPU.

    I looked at my memory interface code in the TMS9900 core, and realized I have been using very conservative timings - just to make sure that when debugging the CPU the memory interface does not cause problems. But now it is time to optimize!

    My TI Basic test program:

    10 for i=0 to 1000
    20 print i;" ";
    30 next


    Takes 160 seconds on a standard TI, and 11.6 seconds on the previous version of the CPU.
    I tweaked CPU memory interface first on the read side, reducing the number of wait states. 
    That took me from 11.6s to 8.9s, and then after further tweaking the execution time dropped to 8.2s. This just by reducing the wait states on the read side.
    Next I reduced the number of wait states on the write side. This brought the execution time down to 7.7s. The impact of reducing wait states on the write side is much smaller than on the read side, since the CPU mostly reads data and seldom writes it.
    After these changes I removed one extra "safety" state after each read (it was just there to make sure the bus interface has some time to settle after reads, but that is not really necessary as the main state machine anyway adds a delay cycle). That brought the time down to 7s. With these changes the execution time is only 60% of what it used to be! And the speed is now 22.9 times of the original TI.
    As a final tweak I removed one extra "safety" state that was there after each write - for the same reason as the read cycles. That reduced run time to about 6.8s, so now the CPU runs my benchmark 23.5 times faster than the original TI.

    Here is Parsec running on the revised CPU:

    When doing these tests, I really appreciated the quick re-synthesis time, it only takes my PC a couple of minutes to do the synthesis, so test iterations are fast.
    I also took a look at how much FPGA capacity the current design takes: it uses 51% of the LUTs (look-up tables), so there is plenty of space left. There are also some debug features included in here; removing those would make the design smaller.

  • The keyboard repeat rate problem and fix

    Erik Piehl, 09/17/2017 at 19:45, 2 comments

    If you looked at the video I posted in the previous project log, you saw that I had great difficulty typing in Basic programs, because the keyboard repeat rate was just crazy with the CPU running at 15x speed.

    I decided to tackle this problem by reading the TI ROM code in the excellent book "TI Intern". Page 21 looked promising, as there was some kind of keyboard scanning delay routine:

    Time delay routine at >0498
    0498 LI 12,>04E2
    049C DEC 12
    049E JNE >049C
    04A0 B *11

    Unfortunately changing the above did not help: I modified the counter from >04E2 to >024E2, but it made no difference.
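    To get a feel for how long that routine spins, here is a small C model of the LI / DEC / JNE loop. It is a sketch that only counts iterations (it says nothing about cycle timing), and the function name is made up. R12 is a 16-bit register, so the counter wraps modulo 65536.

```c
#include <stdint.h>

/* Count iterations of: LI R12,initial / DEC R12 / JNE loop. */
uint32_t delay_iterations(uint16_t initial) {
    uint16_t r12 = initial;
    uint32_t n = 0;
    do {
        r12--;               /* DEC R12, wraps from 0 to 0xFFFF */
        n++;
    } while (r12 != 0);      /* JNE repeats until the result is zero */
    return n;
}
```

    The stock value >04E2 gives 1250 iterations. The longest delay this loop can produce is 65536 iterations (starting from zero), roughly 52 times the original.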

    After a little more searching (just for the word repeat in the book), I found a more promising piece of code. This time it was not in the Basic ROM, but in Basic GROM. GROM contains code in the interpreted GPL language, not TMS9900 machine code. I don't really know too much about GPL, but hey let's try changing it and see what happens:

    Page 149 and 150 talk about repeat counter GPL code. Memory location >830D is set to zero and when it exceeds >FE, repeat occurs. After repeat that location is decremented by >1E (or this is what I think the GPL code is doing). So the next attempt is to change the GPL code
    2A6B SUB @>830D,>1E 
    to a larger subtract, so that repeat would be slower. This actually helps! But the range is too small and sporadic repeats still occur, even SUB >FE is not enough. The parameter is byte sized, so I cannot subtract more than that. The FPGA CPU just goes too fast and the counter gets incremented from zero to FF too quickly.
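    A rough C model shows why the subtract value has so little range. It follows my reading of the GPL code as described above (counter cleared on a new key, incremented on each scan, repeat when it exceeds >FE, then reduced by the subtract value), which may not match the real GROM code exactly. The function names are made up for illustration.

```c
#include <stdint.h>

/* Number of keyboard scans until the counter exceeds 0xFE. */
static unsigned scans_until_repeat(uint8_t counter) {
    unsigned scans = 0;
    while (counter <= 0xFE) {
        counter++;
        scans++;
    }
    return scans;
}

/* Scans from a fresh key press (counter cleared) to the first repeat. */
unsigned first_repeat_delay(void) {
    return scans_until_repeat(0);
}

/* Scans between later repeats when `sub` is subtracted after each one. */
unsigned repeat_interval(uint8_t sub) {
    return scans_until_repeat((uint8_t)(0xFF - sub));
}
```

    In this model the stock value >1E gives 30 scans between repeats and the largest useful value >FE gives 254, so the repeat can be slowed by at most about 8.5 times. That is less than the 15x speed-up of the CPU, which fits the sporadic repeats I still saw.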

    Then I got another idea: What if I could disable the repeat code altogether? At 2A4F there is CLR @>830D, and it is a two-byte opcode, the same length as the INC opcode at 2A5F that counts the repeat counter up.

    What if we just copy the CLR opcode to 2A5F, overwriting the INC? Then key repeat counter never increments, and we should never get into trouble, right?
    2A4F contains 86 0D and this must be CLR @ opcode.
    2A5F contains 90 0D and this must be INC @ opcode. So I'll just put 86 in 2A60 and hope for the best. 

    That worked! No more repeats and keyboard is usable under TI Basic. The downside of this fix is that while it helps with TI Basic, I don't know if it helps in other programs such as TI Extended Basic, which may use their own code for key repeat - I guess I will see.

  • Success! FPGA based TI-99/4A working!

    Erik Piehl, 09/17/2017 at 07:41, 2 comments

    Finally I got my TMS9900 CPU to work enough that I can run original TI-99/4A software on my FPGA based TI-99/4A clone. Below you can find a link to my quick-and-dirty but rather long video about the whole project.

    Prior to this last working session I knew that I still needed to implement the divide instruction, so I went about doing it. I did that by first writing a very simple C program, and then converted that functionality to VHDL.

    #include <stdio.h>

    unsigned short tms9900_div(unsigned int dividend, int divisor) {
        unsigned short sa;      // source argument
        unsigned short da0;     // destination argument (high 16 bits)
        unsigned short da1;     // destination argument (low 16 bits);
        printf("dividend: %u divisor: %d\n", dividend, divisor);
        // algorithm
        da0 = (dividend >> 16);
        da1 = dividend & 0xFFFF;
        sa = divisor;
        
        int st4;    // overflow flag: set when the quotient would not fit in 16 bits
        if (
            (((sa & 0x8000) == 0 && (da0 & 0x8000) == 0x8000))
            || ((sa & 0x8000) == (da0 & 0x8000) && (((da0 - sa) & 0x8000) == 0))
            ) {
            st4 = 1;
        } else {
            st4 = 0;
            // actual division loop, here sa is known to be larger than da0.
            for(int i=0; i<16; i++) {
                da0 = (da0 << 1) | ((da1 >> 15) & 1);
                da1 <<= 1;
                if(da0 >= sa) {
                    da0 -= sa;
                    da1 |= 1;   // successful subtraction
                }
            }
        }
        printf("quotient: %d remainder %d st4=%d\n", da1, da0, st4);
        if (divisor != 0)   // skip the native check when it would divide by zero
            printf("checking: quotient %u remainder %u\n\n", dividend/divisor, dividend % divisor);
        return da1;
    }

    Getting this algorithm implementation to work took something like 15 minutes. The VHDL implementation did not take long either, although I did manage to introduce a few bugs. I had been putting off the divide instruction a little since I thought it would take a long time, but in the end it was quickly done.
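    One edge case is worth checking in any shift-and-subtract divider that keeps the partial remainder in 16 bits: with a divisor above >8000, the top bit of the remainder can be shifted out before the comparison. The sketch below is a variant for experimenting, not the code that went into the VHDL. It keeps the partial remainder in a 32-bit variable so that bit survives, and returns the overflow flag instead of printing. The names div_result and div32by16 are made up.

```c
#include <stdint.h>

/* Result of an unsigned 32-by-16 divide. */
typedef struct {
    uint16_t quotient;
    uint16_t remainder;
    int      overflow;   /* 1 when the quotient would not fit in 16 bits */
} div_result;

div_result div32by16(uint32_t dividend, uint16_t divisor) {
    div_result r = {0, 0, 0};
    uint32_t rem = dividend >> 16;                 /* high word, widened to keep the carry */
    uint16_t lo  = (uint16_t)(dividend & 0xFFFFu); /* low word, becomes the quotient */

    if (rem >= divisor) {                          /* overflow; also catches divisor == 0 */
        r.overflow = 1;
        return r;
    }
    for (int i = 0; i < 16; i++) {
        rem = (rem << 1) | ((lo >> 15) & 1u);      /* shift the next dividend bit in */
        lo = (uint16_t)(lo << 1);
        if (rem >= divisor) {
            rem -= divisor;
            lo |= 1u;                              /* quotient bit is 1 */
        }
    }
    r.quotient = lo;
    r.remainder = (uint16_t)rem;
    return r;
}
```

    A quick test is dividing 0x80000000 by 0xFFFF, which should give quotient 0x8000 and remainder 0x8000.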

    After implementing the divide instruction it was still not smooth sailing, since the keyboard was not working properly. I traced the problem to the CRU interface instructions (LDCR and STCR). STCR, which reads from the external CRU and writes to a destination, returned bit-shifted data. As an example, the expected value for button '1' in my test program would have been >FEFF, but the data read was >FDFF, so there was a shift of one bit. I ran multiple simulations with my VHDL test bench, but there it always worked. Finally, after some head scratching, this turned out to be a major timing error: the STCR instruction presented the address to read from on the first cycle, and already on the following cycle (i.e. 10ns later) it was latching the data. Inside my FPGA TI-99/4A implementation that was way too fast, so I added a two clock cycle delay before sampling the CRUIN pin, and voilà, my TI-99/4A clone was running!

    The performance however turned out to be lower than expected: it only runs 15 times faster than the original TI, despite a 30-fold difference in clock speed (3.3MHz vs 100MHz). When I was creating the TMS9900 core my first priority was to get the bloody thing running, so I did not pay much attention to how many states each instruction has to go through to perform its task. I do like to optimise though, and now that my TI clone is working, I can turn my attention to making it run even faster :)

    Source code can be found here:

    Link to GitHub (the FPGA CPU is in the soft CPU branch).

    And here is the video talking about the project a bit:

    Youtube link