
Video-Terminal for the Forth OF816 Interpreter
04/09/2024 at 19:43 • 0 comments
In the previous post I mentioned that I am working on a port of the 32-bit Forth interpreter OF816 to my X65-SBC, an 8/16-bit retro computer that I am building. The software runs on the 65C816 processor of the X65-SBC, but so far the user's text input and output went over the USB/UART interface to a terminal emulator (e.g. putty) on a host PC.

As the next logical step, shown in the demo below, I have implemented a video text terminal using the VERA chip and the VGA output of the X65-SBC. VERA is the computer’s video chip, implemented in an FPGA. VERA has 128kB of internal VRAM and can be configured in various graphics modes, typically generating a 640×480-pixel screen. For the text terminal output I configure VERA to display 80 columns by 60 rows of visible characters. Each character is 8×8 pixels and has its own foreground and background color, each chosen from 16 colors.

The following short demo shows the OF816 code running on the X65-SBC computer with the video terminal output on the VGA port (for the purpose of a YouTube demo the VGA signal is captured in a PC and displayed in a live window). The text input to the OF816 software is (still) provided over the USB/UART (putty); using the computer's PS/2 keyboard is work in progress for next time.

Demo

VERA Mode Configuration

VERA is configured in the 80x60 character text mode with this code snippet (65c02 assembly):
```
    ; DCSEL=0, ADRSEL=0
    stz   VERA_CONTROL_REG
    ; Enable output to VGA 640x480, enable Layer0
    lda   #TV_VGA | LAYER0_ENABLE
    sta   VERA_VIDEO_REG
    ; DCSEL=0, ADRSEL=0
    stz   VERA_CONTROL_REG
    ; characters are 8x8, visible screen 80 columns, 60 rows.
    ; Complete screen is 128x128 characters, 8x8 font
    ; # Layer0 setup: Tile mode 1bpp, Map Width = 128 tiles, Map Height = 128 tiles
    ; ==> 16384 tiles, each 2B => 32768 B
    lda   #MAP_WH_128T << 6 | MAP_WH_128T << 4 | BPP_1
    sta   VERA_LAYER0_CONFIG_REG
    ; map entries start at address 0 of VRAM, and occupy 32kB
    lda   #mapbase_va
    sta   VERA_LAYER0_MAPBASE_REG
    ; tile (font) starts at 32kB offset
    lda   #(tilebase_va >> 11) << 2
    sta   VERA_LAYER0_TILEBASE_REG
```
The “map” size is 128×128 characters (tiles), but only 80×60 of them are visible on the screen. Using the registers VERA_LAYER0_HSCROLL_REG ($9F30) and VERA_LAYER0_VSCROLL_REG ($9F32) it is possible to smoothly scroll the 80×60 viewport over the larger 128×128 map. This feature is typically used in 2D scrolling games. VERA allows map widths and heights of 32, 64, 128 or 256 tiles. Tile width and height can each be configured to 8 or 16 pixels; for the text display we use 8×8-pixel tiles.
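As a cross-check of the register values written by the snippet above, here is a minimal Python sketch that computes them on a PC. The field encodings (2-bit map-size codes, colour-depth code 0 for 1 bpp, address bits 16:11 in TILEBASE, address bits 16:9 in MAPBASE) are assumptions taken from the public VERA programmer's reference, and the function names are purely illustrative:

```python
# Field encodings are assumptions based on the public VERA reference,
# not values taken from the X65-SBC sources.
MAP_SIZE_CODE = {32: 0, 64: 1, 128: 2, 256: 3}  # 2-bit code for map width/height
BPP_1 = 0                                       # colour-depth code for 1 bpp


def layer_config(map_w, map_h, bpp_code=BPP_1):
    """CONFIG register: map height in bits 7:6, map width in bits 5:4."""
    return (MAP_SIZE_CODE[map_h] << 6) | (MAP_SIZE_CODE[map_w] << 4) | bpp_code


def tilebase(addr, tile_w16=False, tile_h16=False):
    """TILEBASE register: VRAM address bits 16:11 in bits 7:2, tile size in bits 1:0."""
    return ((addr >> 11) << 2) | (int(tile_h16) << 1) | int(tile_w16)


def mapbase(addr):
    """MAPBASE register: VRAM address bits 16:9."""
    return addr >> 9


print(hex(layer_config(128, 128)))  # 0xa0
print(hex(tilebase(0x8000)))        # 0x40 (font at the 32kB offset, 8x8 tiles)
print(hex(mapbase(0x0000)))         # 0x0  (map at the start of VRAM)
```

With the map at VRAM address 0, the shifted MAPBASE value is 0 as well, so it matches the unshifted `#mapbase_va` load in the assembly.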

Memory requirements for the map are: 128 * 128 tiles = 16384 tiles. Each tile consumes 2 Bytes of the VRAM, so the map takes 32768 B = 32kB in total. VERA supports multiple Tile Modes that differ in colour depth and in the support for additional features (e.g. V-flip, H-flip). For the text display the most suitable mode is the “Tile mode 1 bpp (16 color text mode)“, as described in the VERA documentation:

In this mode the first byte of each map tile is the 8-bit character index (ASCII code), and the second byte contains a 4-bit background and a 4-bit foreground colour of the tile. To display text in the 80×60-character screen grid, you just set the character index and colours of the particular tiles.
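The map layout can be modelled in a few lines of Python. This is only a sketch on a simulated VRAM buffer: the helper names are made up for illustration, and the nibble order of the colour byte (background in the high nibble, foreground in the low nibble) is an assumption based on the VERA reference:

```python
MAP_COLS = 128        # map width in tiles (from the layer configuration)
MAP_ROWS = 128        # map height in tiles
BYTES_PER_TILE = 2    # character index + colour byte
MAPBASE = 0x0000      # the map starts at VRAM address 0


def tile_address(col, row):
    """VRAM address of the map entry for the given column and row."""
    return MAPBASE + (row * MAP_COLS + col) * BYTES_PER_TILE


def tile_entry(char, fg, bg):
    """Two map bytes: character index, then colours (assumed bg high nibble, fg low)."""
    return bytes([ord(char) & 0xFF, ((bg & 0x0F) << 4) | (fg & 0x0F)])


def put_text(vram, col, row, text, fg=1, bg=6):
    """Write a string into the simulated map, one tile per character."""
    for i, ch in enumerate(text):
        addr = tile_address(col + i, row)
        vram[addr:addr + BYTES_PER_TILE] = tile_entry(ch, fg, bg)


vram = bytearray(128 * 1024)                   # simulated 128kB VRAM
put_text(vram, 0, 1, "OK")
print(MAP_COLS * MAP_ROWS * BYTES_PER_TILE)    # 32768
print(hex(tile_address(0, 1)))                 # 0x100
print(vram[0x100:0x104].hex())                 # 4f614b61
```

Note that each row of the map is 128 tiles (256 bytes) long, even though only the first 80 tiles of a row are visible without scrolling.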

Font Data

The 8-bit character index in each tile points to an 8×8-pixel “picture” (glyph) that is drawn at the tile position. This is the font data, and the font needs to be loaded into VERA’s VRAM during the initialization phase. Each character in the font is 8×8 monochrome pixels, i.e. 8*8=64 bits, which is 8 Bytes. The font has 256 characters, so the font data is 256 * 8 = 2048 B = 2 kB in total. The picture below shows the first 576 Bytes of the font that I am using:

In the picture I have highlighted the 8 Bytes at positions 0x208 to 0x20F: 0x18 0x3C 0x66 0x7E 0x66 0x66 0x66 0x00. Since each character glyph takes 8 Bytes of the bitmap, dividing 0x208 by 8 gives the character index (ASCII code) of the glyph: 0x208 / 8 = 0x41 = 65 decimal, which is the capital letter ‘A’. We can visually decode the glyph bitmap data in a spreadsheet by converting each data byte to its binary representation and (for better visibility) replacing each ‘0’ with a ‘.’, as shown below. Now you can see the bitmap that VERA draws on the screen for each tile referencing character index 65.
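The same decoding can be done without a spreadsheet. The following Python sketch uses the eight bytes quoted above; the function names are illustrative only:

```python
# The 8 bytes found at offsets 0x208..0x20F of the font data.
GLYPH_A = bytes([0x18, 0x3C, 0x66, 0x7E, 0x66, 0x66, 0x66, 0x00])


def glyph_index(offset, bytes_per_glyph=8):
    """Character index of the glyph stored at a byte offset in the font."""
    return offset // bytes_per_glyph


def render_glyph(rows):
    """One text line per glyph row: '1' for a set pixel, '.' for a clear one."""
    return [format(b, "08b").replace("0", ".") for b in rows]


index = glyph_index(0x208)
print(index, chr(index))   # 65 A
for line in render_glyph(GLYPH_A):
    print(line)
```

The most significant bit of each byte is the leftmost pixel of the row, so the printed pattern shows the letter the right way round.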

Loading of Font Data to VERA in 65C02 vs. 65C816 CPU

The 2 kB of font data must be loaded into VERA from RAM during each system initialization. Let’s compare how this can be done in the software of the X65-SBC, depending on whether the computer is equipped with the 8-bit 65C02 processor or the newer 16-bit 65C816 processor.

The first presented code snippet is for the 16-bit 65C816 processor:
```
.P816     ; this is a code for 65C816.
    ; configure addressing ptr at the font data (tilebase), autoincrement
    lda   #<tilebase_va
    sta   VERA_ADDRESS_REG
    lda   #>tilebase_va
    sta   VERA_ADDRESS_M_REG
    lda   #((tilebase_va >> 16) & 1) | (1 << 4)
    sta   VERA_ADDRESS_HI_REG
    ; configure CPU register widths:
    rep   #SHORT_I  ; 16-bit index regs X, Y
    sep   #SHORT_A          ; 8-bit memory and accu
    .i16
    .a8
    ; copy font data to VRAM
    ldx   #0                ; X := 0
loop_font_cp:
    lda   _font8x8,x        ; A := _font8x8[x]
    sta   VERA_DATA0_REG    ; store A to VERA VRAM
    inx                     ; X++
    cpx   #SIZEOF_font8x8   ; X compare with 2048
    bne   loop_font_cp      ; branch back if not equal.
```
After the VERA setup code at the beginning, which takes 6 instructions, the copy itself takes just 6 more: one to initialize X and a 5-instruction loop (beginning at loop_font_cp). The loop is executed 2048 times, i.e. once for each byte of font data.

Now let's have a look at the same program, this time implemented in assembly for the 8-bit 65C02 processor:
```
    ; configure addressing ptr at the font data (tilebase), autoincrement
    lda   #<tilebase_va
    sta   VERA_ADDRESS_REG
    lda   #>tilebase_va
    sta   VERA_ADDRESS_M_REG
    lda   #((tilebase_va >> 16) & 1) | (1 << 4)
    sta   VERA_ADDRESS_HI_REG
    lda     #<_font8x8
    sta     PTR
    lda     #>_font8x8
    sta     PTR+1
    ldx     #0                 ; X := 0
loop_font_cp:
    ldy     #0                 ; Y := 0
loop_font_cp_inner:
    lda     (PTR),y            ; A := PTR[Y]
    sta     VERA_DATA0_REG     ; store A to VERA VRAM
    iny                        ; Y++
    cpy     #0                 ; compare Y to zero
    bne     loop_font_cp_inner  ; branch if not zero yet
    inc     PTR+1               ; PTR := PTR + 256
    inx                           ; X++
    cpx     #SIZEOF_font8x8/256   ; compare X to 2048/256
    bne     loop_font_cp          ; branch if not equal yet.
```
The copy loop starts at the label loop_font_cp. This time, however, we need two nested loops utilizing both the X and Y registers, and an additional 2-Byte variable PTR in RAM (in the zero page). In effect, the 8-bit X and Y registers together implement a 16-bit loop counter.

The cause of this “complexity explosion” in the assembly code is the limited 8-bit width of all registers in the 65C02 CPU. To implement a loop that runs 2048 times, you need (for example) both the X and Y registers, and then you need an additional variable to keep track of the position in the source array (the processor can only do 8-bit indexing, which is too small).
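To make the difference concrete, the two copy strategies can be modelled in Python with explicit register wrap-around. This is a behavioural sketch only, not cycle-accurate; the placeholder font data and the function names are assumptions for illustration:

```python
def copy_16bit_index(font):
    """65C816 style: a single 16-bit index register X walks the whole array."""
    out = bytearray()
    x = 0
    while True:
        out.append(font[x])           # lda _font8x8,x / sta VERA_DATA0_REG
        x = (x + 1) & 0xFFFF          # inx (16-bit wide)
        if x == len(font):            # cpx #SIZEOF_font8x8 / bne
            break
    return bytes(out)


def copy_8bit_nested(font):
    """65C02 style: Y indexes within a 256-byte page, PTR selects the page, X counts pages."""
    out = bytearray()
    ptr = 0                           # models the zero-page pointer PTR
    x = 0
    while True:
        y = 0
        while True:
            out.append(font[ptr + y]) # lda (PTR),y / sta VERA_DATA0_REG
            y = (y + 1) & 0xFF        # iny (wraps to 0 after 255)
            if y == 0:
                break
        ptr += 0x100                  # inc PTR+1
        x = (x + 1) & 0xFF            # inx
        if x == len(font) // 256:     # cpx #SIZEOF_font8x8/256
            break
    return bytes(out)


font = bytes(i & 0xFF for i in range(2048))   # placeholder data, not the real font
assert copy_16bit_index(font) == font
assert copy_8bit_nested(font) == font
```

Both functions produce the same 2048 bytes; the 8-bit version simply needs one more loop level and the extra pointer to get there.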

In comparison, the 16-bit 65C816 CPU is much easier and more efficient to program in assembly language (at least in this case). The loop that iterates 2048 times can run with just the X register, which is now 16 bits wide, leaving the Y register unused, and it uses a 16-bit indexed access (lda _font8x8,x) to get a byte of the font data into the accumulator. This coding efficiency is the reason why I prefer using the X65-SBC computer with the 16-bit 65C816 CPU, and do not look back at the 65C02.

In the end I am relatively satisfied with how the terminal output turned out. (Note below the inconspicuous error message “Stack u/f” from OF816 => there is more to explore here! 😎)
32-bit Forth for the X65 with 65C816 CPU
03/22/2024 at 21:11 • 0 comments

I am working on a port of the 32-bit FORTH interpreter OF816 to my X65-SBC computer with the 65C816 CPU. OF816 was created by mgcaret and is available on GitHub. I made a fork of the OF816 project and added a new branch for my work: x65-sbc.
The OF816 already supports a couple of 65C816-based systems: GoSXB, Apple IIgs, Neon816 and the W65C816SXB. I added a new subdirectory X65 in the platforms directory and initially copied from the Neon816 port, because it seemed the simplest.

The initial support for X65-SBC was not difficult to program. I modified the `_system_interface` routine in the `platform-lib.s` to work with the USB/UART registers that are implemented in NORA FPGA on the X65-SBC. After adjusting memory addresses in a linker script, I was surprised that this minimal OF816 port worked on the first try.

The screenshot above shows a terminal emulator on a Linux PC connected over USB/UART to the X65-SBC computer running the OF816 FORTH interpreter. Since I am just a beginner in the FORTH language, it took me a few lines to enter and run a Hello World loop.

This OF816 FORTH port currently communicates over the UART on the X65-SBC. Therefore the user interface is on the host PC for now, and the keyboard and video output on X65-SBC are not utilized.
This is just temporary, as I am already working on a basic virtual terminal (i.e. screen editor) for the X65-SBC, and the first "client" will be the OF816 interpreter.

Reading of 6502/65816 CPU Registers by a PC-based Debugger

03/03/2024 at 20:08 • 0 comments

Modern microcontrollers and microprocessors have built-in facilities that let external debuggers read and write registers, set breakpoints and generally take full control of the CPU. This is one of the main use cases of the standard JTAG interface. The 6502 and 65816 CPUs predate JTAG (the 6502 by more than a decade), and they contain no support for external debuggers whatsoever.

To overcome the lack of debugger support, the system bus controller "NORA" in X65 implements the necessary functions for a debugger running on a host PC (with Linux or Windows) and connected over the USB-C port. With this support in NORA a debugger can control the 6502/65816 CPU at the instruction level.

One of the basic functions of a debugger is to read out the contents of the programmer-visible CPU registers. This log entry describes how this is done in the X65 computer.


For the 8-bit 6502 CPU these are: A, X, Y, S (= SP) (all 8-bit), PC (16-bit), and flags (8-bit); shown below:

For the 16-bit 65816 CPU these are: A, X, Y, S (= SP) (all 16-bit), PC (24-bit), DBR (Data Bank Register, 8-bit), Direct Register (DH:DL, 16-bit), and the flags (8-bit); shown below:

Block Diagram of NORA's In-Circuit Debugger (ICD)

The block diagram below shows the concept of the In-Circuit Debugger (ICD) in the X65 computer. The NORA FPGA implements all interfaces between the 65xx CPU and the memory bus with the SRAM and the VERA and AURA FPGAs. The Verilog module in NORA that implements the CPU/memory interface is called bus_controller. Another module, the phaser, generates the CPU clock signal PHI2 and additional internal timing signals that must be in sync with the CPU execution. The third module mentioned in the picture is the icd_controller, which controls the bus_controller and the phaser during debug sessions and during system startup, and is idle at other times.

The icd_controller itself responds to commands received from an SPI Slave port, which is connected to physical SPI pins on the FPGA (signals ICD_CSn:in, ICD_MOSI:in, ICD_MISO:out, ICD_SCK:in). On the PCB these pins are connected to the FTDI USB/SPI+UART converter (FT2232H), and from there via a USB cable to a host PC running the debugger software.

The debugger software for X65 is implemented in Python 3 (with the pyftdi module) and it is available in the git repository in the subfolder x65pyhost. For simplicity of development, the debugger commands are not yet integrated in an IDE, but they are available as individual programs called from a command line.
The most commonly used commands (= python scripts) are listed in the table:

Command (script)	Description
do-cpureset.py	Reset and stop the 65xx CPU
do-cpurun.py	Run the CPU
do-cpustep.py	Stop the CPU and step it for the given number of instructions or CPU cycles
do-dump.py	Memory dump (including IO areas)
do-loadbin.py	Load a binary from specified host file into X65 memory
do-loadprg.py	Load a C64/CX16 program from a .PRG file into X65 memory
do-poke.py	Write memory location in X65 memory (incl. IO area)
do-readregs.py	Read CPU registers (the CPU must be stopped)

Register Reading Sequence

Reading of the 65xx CPU's internal registers is implemented in the do-readregs.py and do-cpustep.py scripts, which communicate with the ICD module in the NORA FPGA over the USB/SPI bridge on the PCB of the X65 computer. Let's discuss how NORA obtains the registers from the CPU. As mentioned previously, the 65xx CPUs have absolutely zero support for external debuggers wishing to access internal registers from outside of a program running in the processor.

The access is gained by forcing a predefined sequence of instructions into the CPU and observing the signals on the CPU Data and Address Buses. In each CPU cycle, the 65xx processor accesses the external memory for either a read or a write. By observing the CPU status signals VPA and VDA (or SYNC), it is possible to determine whether a given cycle is an opcode fetch. If it is an opcode fetch cycle, the debugger can instruct the ICD in NORA to feed the CPU a different opcode than the one that would be read from memory, thus altering the program flow at will.
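The decision made in each bus cycle can be sketched in Python as follows. The VPA/VDA decoding follows the 65816 datasheet (both signals high means opcode fetch); the function names and the way a forced opcode is passed in are hypothetical and do not reflect NORA's actual Verilog implementation:

```python
def is_opcode_fetch(vpa, vda):
    """On the 65816 a cycle is an opcode fetch when both VPA and VDA are high."""
    return bool(vpa) and bool(vda)


def cycle_kind(vpa, vda):
    """Human-readable classification of a 65816 bus cycle."""
    kinds = {
        (1, 1): "opcode fetch",
        (1, 0): "program (operand) read",
        (0, 1): "data access",
        (0, 0): "internal operation",
    }
    return kinds[(int(bool(vpa)), int(bool(vda)))]


def byte_for_cpu(vpa, vda, memory_byte, forced_opcode=None):
    """Byte presented on the CPU data bus: the forced opcode replaces the
    memory contents only during an opcode fetch cycle (hypothetical model)."""
    if forced_opcode is not None and is_opcode_fetch(vpa, vda):
        return forced_opcode
    return memory_byte


print(cycle_kind(1, 1))                               # opcode fetch
print(hex(byte_for_cpu(1, 1, 0xA9, forced_opcode=0x08)))  # 0x8 (PHP forced)
print(hex(byte_for_cpu(0, 1, 0xA9, forced_opcode=0x08)))  # 0xa9 (data cycle untouched)
```

On a plain 65C02 the single SYNC output plays the role of the VPA-and-VDA combination.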

The following instruction sequence is forced from the debugger to read out all registers:

```
orig_pc:
    PHP             ; we get PC, SP, Flags
    PLP             ; restores SP
    STA $2          ; block writes; we get A, and on 65816 we get B depending on flag M, and we get DH:DL (DPR) on the address bus
    STX $4440       ; block writes; we get X, and on 65816 we get XH depending on flag X, and we get DBR from the CPU Bank Address
    STY $6          ; block writes; we get Y, and on 65816 we get YH depending on flag X
    BRA orig_pc     ; jump back to the start to leave the CPU registers (PC) in the original state
```

The code (PHP-PLP-STA-STX-STY-BRA) is position-independent: it can be executed in any CPU state. The first two 65xx instructions are *PHP* (Push Flags to Stack) and *PLP* (Pull Flags from Stack). NORA captures a trace of the CPU Data Bus, Address Bus and Status signals while these instructions execute, and the debugger uses the trace to reconstruct the original PC (Program Counter) address, the SP (Stack Pointer) address, and the flags register.

The screenshot below shows the trace output of *do-cpustep.py* (on a Linux PC) when the 65xx CPU (actually a 65C816 in Emulation Mode) is executing the opening PHP-PLP sequence. The screenshot is annotated at the top with explanations of the trace columns, and at the bottom with the trace fields that correspond to the exposed CPU internal registers:

The original PC (Program Counter) is the op-code fetch address of the first instruction. The SP (Stack Pointer) is the write address in the third CPU cycle (Cyc #12), and the CPU flags are the data value written to the memory at that cycle. The instruction PLP is necessary to restore the contents of SP inside the CPU to the original value.
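The reconstruction described above can be sketched in Python. The `Cycle` record layout and the sample values are made up for illustration; the real trace format of do-cpustep.py is not shown in this log and may differ:

```python
from typing import NamedTuple


class Cycle(NamedTuple):
    """One captured bus cycle (hypothetical record layout)."""
    addr: int      # value on the address bus
    data: int      # value on the data bus
    write: bool    # True for a CPU write cycle
    fetch: bool    # True for an opcode fetch cycle


def decode_php(cycles):
    """Recover PC, SP and flags from the trace of a forced PHP instruction.

    PC is the address of the opcode fetch; SP is the address of the stack
    write, and the flags are the data byte written in that cycle.
    """
    fetch = next(c for c in cycles if c.fetch)
    push = next(c for c in cycles if c.write)
    return {"pc": fetch.addr, "sp": push.addr, "flags": push.data}


# Made-up example trace of a 3-cycle PHP in Emulation Mode.
trace = [
    Cycle(addr=0xE012, data=0x08, write=False, fetch=True),   # forced PHP opcode
    Cycle(addr=0xE013, data=0x00, write=False, fetch=False),  # dummy read
    Cycle(addr=0x01EA, data=0x34, write=True,  fetch=False),  # flags pushed to stack
]

regs = decode_php(trace)
print({name: hex(value) for name, value in regs.items()})
# {'pc': '0xe012', 'sp': '0x1ea', 'flags': '0x34'}
```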

The next screenshot shows the trace of the remaining instructions in the sequence: STA-STX-STY-BRA. The contents of registers A, X and Y are obtained from the data bus write cycles during the execution of the STA (Store Accumulator), STX (Store X-reg) and STY (Store Y-reg) instructions. So that these writes do not disturb the original memory contents, the memory write signal is silently blocked by NORA, although this is not reflected in the trace buffer.

The final instruction BRA (Branch Always) is necessary to restore the original contents of the PC register; it instructs the CPU to jump five instructions back.
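To see why the BRA lands exactly on orig_pc, the forced sequence can be assembled by hand. The sketch below uses the standard 65C02/65816 opcode encodings; that NORA supplies the operand bytes in the same way as the opcodes is an assumption, and the function name is illustrative:

```python
# Instruction encodings of the forced readout sequence (standard 65C02 opcodes).
SEQUENCE = [
    ("PHP",       [0x08]),
    ("PLP",       [0x28]),
    ("STA $2",    [0x85, 0x02]),         # direct-page store
    ("STX $4440", [0x8E, 0x40, 0x44]),   # absolute store, little-endian operand
    ("STY $6",    [0x84, 0x06]),         # direct-page store
]
BRA_OPCODE = 0x80


def build_readout_sequence():
    """Return the byte sequence including the final BRA back to the start."""
    body = [b for _, encoding in SEQUENCE for b in encoding]
    # The branch displacement is relative to the address after the 2-byte BRA,
    # so it must skip back over the body and over the BRA itself.
    displacement = -(len(body) + 2)
    return bytes(body + [BRA_OPCODE, displacement & 0xFF])


code = build_readout_sequence()
print(code.hex(" "))   # 08 28 85 02 8e 40 44 84 06 80 f5
```

The five instructions before the BRA occupy 9 bytes, so the branch goes back 11 bytes in total (displacement $F5), which returns the PC to its original value.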

65816 CPU

So far we discussed how to obtain registers from the 8-bit 6502 processor. The 16-bit 65816 processor has the common registers extended to 16 bits, plus it has some additional registers. The traces shown above were obtained on the 65816 CPU running in the 8-bit *Emulation Mode*, where the processor behaves like the 8-bit 6502. (Emulation Mode is indicated by the "e" in the "sta" field of the trace. The mode information is available on an output pin of the processor.)

When the 65816 processor switches to the 16-bit *Native Mode*, the additional registers DBR (Data Bank Register), PBR (Program Bank Register, part of PC) and DH:DL (Direct (Page) Register) get activated. The traces above mark where these registers are exposed as well. Then, depending on whether the new M and X flags in the CPU Flag Register are cleared, the Accumulator and the Index Registers X/Y are extended to 16 bits as well. The M-flag controls the width of the accumulator (A -> AH:AL) and of memory accesses (in all LD and ST-type instructions), and the X-flag controls the widths of registers X (XH:XL) and Y (YH:YL). These three important flags (Emulation or Native mode, M-flag, and X-flag) are available on CPU pins, therefore they are always visible in the trace buffer.

The screenshot below shows the trace of the sequence PHP-PLP-STA-STX-STY when the 65816 CPU is in the 16-bit *Native Mode* and the accumulator and index registers are also switched to 16-bit mode (i.e. M-flag=0, X-flag=0). (The BRA instruction is not shown for brevity, but it is also part of the forced sequence.) We can see that the execution of the STA, STX and STY opcodes gains one additional CPU cycle, during which the upper parts of the A/X/Y registers are written to the memory (these writes are blocked in NORA) and thereby exposed to the debugger. The contents of the new address registers PBR, DBR, and DH:DL are meanwhile also available on the address bus.
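How a debugger might combine the observed write cycles into register values can be sketched as follows. The function names and the sample data bytes are hypothetical; the only facts relied on are that the 65816 stores the low byte first and that the register widths follow the E, M and X flags as described above:

```python
def register_widths(emulation, m_flag, x_flag):
    """Return (accumulator_is_16bit, index_regs_are_16bit) for the given flags."""
    native = not emulation
    return (native and not m_flag, native and not x_flag)


def register_from_writes(write_data, wide):
    """Combine the data bytes of a store instruction's write cycles.

    The low byte is written first; a 16-bit register adds a second
    write cycle carrying the high byte.
    """
    low = write_data[0]
    if not wide:
        return low
    return low | (write_data[1] << 8)


# Native Mode with M=0 and X=0: all three registers are 16 bits wide.
acc_wide, idx_wide = register_widths(emulation=False, m_flag=0, x_flag=0)
print(acc_wide, idx_wide)                               # True True
print(hex(register_from_writes([0x34, 0x12], acc_wide)))  # 0x1234

# Emulation Mode: only one write cycle per store is present.
acc_wide, idx_wide = register_widths(emulation=True, m_flag=1, x_flag=1)
print(hex(register_from_writes([0x03], acc_wide)))        # 0x3
```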

Integration

The complete register readout sequence is integrated in the script do-cpustep.py. This script dumps the 65xx CPU trace buffer and then executes a user-specified number of following instructions. Finally, the script executes the readout sequence in the background and prints the CPU register values. The forced readout sequence is not shown in the printed trace, of course.

The screenshot below shows the user interface when the command "./do-cpustep.py 3" (step for 3 instructions) is started on a Linux host PC connected with a USB cable to the X65 computer. The last line shows the CPU register contents after the last executed instruction (and before "Upcoming", which has *not* been executed yet; it is just the debugger's prediction based on the PC). Notice how the displayed contents of the A register ($03) in the last line correspond to the LDA instruction (LDA #$03), and the contents of the SP register ($1e9) correspond to the push address in the first PHA instruction ($1ea, because the SP is decremented AFTER pushing).

Conclusion

In embedded systems, reading processor registers over a JTAG connection from a host PC is basic functionality that we take for granted. Using the "NORA" FPGA as the system controller in the X65 computer brings this functionality to the world of 6502/65816 processors as well.
