More Thoughts on Video Refresh

A project log for VDC-II

Commodore 8568-inspired (and mostly compatible) video core for driving VGA-type displays.

Samuel A. Falvo II, 05/12/2020 at 16:22

The VDC video modes all assume that the VDC can access arbitrary video RAM with impunity.  While you can fetch character and attribute data sequentially, resolving character codes to font data requires a potentially fresh hit to video memory, starting at either BASE+(16*code) or BASE+(32*code), depending on the configured character height.

Even if the character codes increase monotonically, the memory fetched will have 16- to 32-byte gaps in between referenced bytes.  This breaks the optimal access pattern for synchronous memories, all of which are optimized for sequential access.  A good rule of thumb for synchronous memories is that every non-sequential access costs you about 10 cycles of latency.
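To make the gaps concrete, here is a minimal Python sketch of the font address arithmetic described above.  The function name `font_address` and the base address `FONT_BASE` are my own placeholders for illustration; only the 16- and 32-byte strides come from the text.

```python
def font_address(base: int, code: int, bytes_per_glyph: int) -> int:
    """Address of the first font byte for a character code.

    bytes_per_glyph is 16 or 32, the two strides named in the text.
    """
    if bytes_per_glyph not in (16, 32):
        raise ValueError("stride must be 16 or 32 bytes per glyph")
    return base + bytes_per_glyph * code


# Hypothetical font base; any value shows the same gaps.
FONT_BASE = 0x2000

# Four consecutive character codes, 16 bytes per glyph.
addresses = [font_address(FONT_BASE, code, 16) for code in range(0x41, 0x45)]
gaps = [b - a for a, b in zip(addresses, addresses[1:])]

print([hex(a) for a in addresses])
print(gaps)
```

Even with consecutive codes, every font fetch lands a full glyph stride away from the previous one, so no two fetches are sequential from the memory's point of view.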

Although it is possible to sequentially fetch character and attribute data, they occupy different segments of video memory, necessitating two video base pointers and two processing engines.  Correspondingly, if you fetch 8 bytes of character data, you must also fetch 8 bytes of attribute data.  The two bursts of data must be synchronized with each other externally to the memory fetch units.

Character and attribute bursts can happen in any order (e.g., attributes can be fetched ahead of character codes), but they must always be adjacent.  Moreover, both character and attribute bursts must occur prior to font resolution, as attributes provide the 9th character code bit.

On the iCE40LP8K I'm currently targeting, a ping-pong line buffer like the one I used to implement the MGIA on the Kestrel-2 and -2DX would be prohibitively expensive.  A single line buffer of 256 characters would require 2048 DFFs (256 characters × 8 bits) and, thus, 2048 logic elements.  We would need two of these, so that the memory fetch logic can fill one buffer while the other is used for video refresh.  Note that the FPGA only has 7680 logic elements.
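The budget is easy to check with a few lines of Python.  This is only back-of-the-envelope arithmetic using the figures quoted above; the helper name `buffer_dffs` is made up for the example, and it assumes one DFF per stored bit with one byte per character column.

```python
LOGIC_ELEMENTS = 7680   # figure quoted in the text for the target FPGA
LINE_CHARS = 256        # widest line the register space allows
BITS_PER_CHAR = 8       # assumption: one byte per column, as in the 2048-DFF figure


def buffer_dffs(chars: int, bits_per_char: int = BITS_PER_CHAR, copies: int = 2) -> int:
    """Flip-flops needed for `copies` buffers of `chars` columns each."""
    return chars * bits_per_char * copies


full_line = buffer_dffs(LINE_CHARS)
share = 100 * full_line / LOGIC_ELEMENTS
print(f"{full_line} DFFs, about {share:.0f}% of the device")
```

A ping-pong pair of full-line buffers would eat more than half the device before any other logic is placed.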

Because they switch roles only on HSYNC boundaries, full-line buffers must be large enough to accommodate the widest display supported.  The VDC-II register space supports 256 characters (all 8 bits of R1 are significant).  If we couldn't accommodate a pair of line buffers large enough to support 256 characters, then we would need to ignore upper bits of register R1, which would break 8563 VDC compatibility.

Video data (resolved character/bitmap data plus corresponding attribute information) must be available when horizontal display enable asserts, since that's when we must start shifting out video data.

All of these problems interact.  Thankfully, besides the queue-based approach I discussed in a previous log, there's another approach to work around these matters.

Another Solution

Instead of using full-line buffers, we use a pair of ping-pong "strip" buffers.  Each strip is 4, 8, or 16 characters, depending mainly on externally imposed video memory latency requirements.  For the purposes of this description, let's assume a 4-character strip.

A strip buffer contains two bytes for each character column it supports: an attribute byte and a bitmap byte.  When attribute data is fetched, only the attribute bytes are updated.  When character data is fetched, only the bitmap bytes are updated.  The interface presented to the dot-shifter logic, however, always presents a 16-bit attribute/bitmap value pair.
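As a behavioral model only (not HDL, and not the actual VDC-II implementation), the strip buffer pair might be sketched like this in Python.  The class and method names are hypothetical, and placing the attribute in the high byte of the 16-bit pair is an arbitrary choice on my part.

```python
from dataclasses import dataclass, field
from typing import List

STRIP_CHARS = 4  # the 4-character strip assumed in the text


@dataclass
class StripBuffer:
    """One attribute byte and one bitmap byte per character column."""
    attrs: List[int] = field(default_factory=lambda: [0] * STRIP_CHARS)
    bitmaps: List[int] = field(default_factory=lambda: [0] * STRIP_CHARS)

    def write_attr(self, column: int, value: int) -> None:
        # Attribute fetches touch only the attribute bytes.
        self.attrs[column] = value & 0xFF

    def write_bitmap(self, column: int, value: int) -> None:
        # Character fetches touch only the bitmap bytes.
        self.bitmaps[column] = value & 0xFF

    def read_pair(self, column: int) -> int:
        # The dot shifter always sees a 16-bit attribute/bitmap pair.
        return (self.attrs[column] << 8) | self.bitmaps[column]


class PingPong:
    """Two strips: one feeds the dot shifter while the other is refilled."""

    def __init__(self) -> None:
        self._buffers = [StripBuffer(), StripBuffer()]
        self._display = 0

    @property
    def display(self) -> StripBuffer:
        return self._buffers[self._display]

    @property
    def fill(self) -> StripBuffer:
        return self._buffers[self._display ^ 1]

    def swap(self) -> None:
        self._display ^= 1


pp = PingPong()
pp.fill.write_attr(0, 0x8F)
pp.fill.write_bitmap(0, 0x3C)
pp.swap()
print(hex(pp.display.read_pair(0)))
```

With a 4-character strip, the pair holds 2 × 4 × 16 = 128 bits, a small fraction of what the full-line buffers would need.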

To minimize the time needed to provide the complete set of data for a strip, attribute data should be fetched first.  That way, when character data is fetched, we can stream data not only from video memory but also (in parallel) the strip buffer to provide the complete 9-bit character code to the font fetch unit.  The font fetch unit can then resolve the character code to a bitmap byte.  For this to work, font data must reside in fast FPGA block RAM.
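Here is a rough Python sketch of how the 9-bit character code could be assembled and turned into a font address.  Note the assumptions: I am taking attribute bit 7 as the bit that supplies the 9th code bit, and the `row` offset (scanline within the glyph) and function names are mine, added only to make the example complete.

```python
ALT_CHARSET_BIT = 7  # assumption: attribute bit that supplies the 9th code bit


def char_code9(attr: int, code: int) -> int:
    """Combine an attribute byte and a character byte into a 9-bit code."""
    ninth = (attr >> ALT_CHARSET_BIT) & 1
    return (ninth << 8) | (code & 0xFF)


def font_row_address(font_base: int, attr: int, code: int, row: int,
                     bytes_per_glyph: int = 16) -> int:
    """Font byte address for one scanline (`row`) of a glyph.

    bytes_per_glyph is the 16- or 32-byte stride from earlier in the text.
    """
    return font_base + bytes_per_glyph * char_code9(attr, code) + row


# Attribute 0x8F selects the upper half of the 512-glyph font.
print(hex(char_code9(0x8F, 0x41)))
print(font_row_address(0, 0x80, 0x01, row=3))
```

This is why the attribute has to be in the strip buffer before the character byte arrives: without it, the upper code bit is unknown and the font address cannot be formed.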

The following table illustrates the memory fetch access patterns with 0-wait-state memory on a pipelined Wishbone B4 interconnect to video RAM and an asynchronous strip buffer read port.  You would typically see this access pattern when placing character, attribute, and font data in block RAM.  Assuming we reference video memory at the same speed as the dot-clock, we can reload the strip with video data in just 13 pixels.  Note that a four-character strip with 8 pixels per character contains 32 pixels, giving us ample time to refill the strip buffer.  Four characters at 3 px/char would have only 12 pixels, so we would expect to see visual artifacts under those conditions.  You'd want at least 4 px/char in order to ensure a clean display.

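| Cycle | VRAM Address | VRAM Data | SBUF Read Address | SBUF Write Address |
|-------|--------------|-----------|-------------------|--------------------|
| 1     | ATTRPTR+0    |           |                   |                    |
| 2     | ATTRPTR+1    | a0        |                   | ATTR0              |
| 3     | ATTRPTR+2    | a1        |                   | ATTR1              |
| 4     | ATTRPTR+3    | a2        |                   | ATTR2              |
| 5     | CHARPTR+0    | a3        |                   | ATTR3              |
| 6     | CHARPTR+1    | ch0       |                   | CHAR0              |
| 7     | CHARPTR+2    | ch1       |                   | CHAR1              |
| 8     | CHARPTR+3    | ch2       |                   | CHAR2              |
| 9     | FONT(ch0)    | ch3       | PAIR0             | CHAR3              |
| 10    | FONT(ch1)    | bm0       | PAIR1             | CHAR0              |
| 11    | FONT(ch2)    | bm1       | PAIR2             | CHAR1              |
| 12    | FONT(ch3)    | bm2       | PAIR3             | CHAR2              |
| 13    |              | bm3       |                   | CHAR3              |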

The table below illustrates the same refresh attempt, assuming that both attribute and character matrix data are located in a HyperRAM chip, while font data continues to be confined to block RAM.  In this case, we see it takes 23 pixels to reload the strip buffer with video data, thanks to the HyperRAM access latency.  As you might imagine, four characters of 4 pixels each (16 pixels) will not be sufficient to refresh the display without artifacts; a four-character strip would need at least 6 px/char (24 pixels).  Therefore, if you intend to use a character-mode display with narrow characters, you should strive to keep the matrices inside VDC-II block memory space.  Where possible, use external memory resources only for bitmapped video modes, or make sure to use sufficiently wide characters.

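| Cycle | VRAM Address | VRAM Data | SBUF Read Address | SBUF Write Address |
|-------|--------------|-----------|-------------------|--------------------|
| 1     | ATTRPTR+0    |           |                   |                    |
| 2     | (wait)       |           |                   |                    |
| 3     | (wait)       |           |                   |                    |
| 4     | (wait)       |           |                   |                    |
| 5     | (wait)       |           |                   |                    |
| 6     | (wait)       |           |                   |                    |
| 7     | ATTRPTR+1    | a0        |                   | ATTR0              |
| 8     | ATTRPTR+2    | a1        |                   | ATTR1              |
| 9     | ATTRPTR+3    | a2        |                   | ATTR2              |
| 10    | CHARPTR+0    | a3        |                   | ATTR3              |
| 11    | (wait)       |           |                   |                    |
| 12    | (wait)       |           |                   |                    |
| 13    | (wait)       |           |                   |                    |
| 14    | (wait)       |           |                   |                    |
| 15    | (wait)       |           |                   |                    |
| 16    | CHARPTR+1    | ch0       |                   | CHAR0              |
| 17    | CHARPTR+2    | ch1       |                   | CHAR1              |
| 18    | CHARPTR+3    | ch2       |                   | CHAR2              |
| 19    | FONT(ch0)    | ch3       | PAIR0             | CHAR3              |
| 20    | FONT(ch1)    | bm0       | PAIR1             | CHAR0              |
| 21    | FONT(ch2)    | bm1       | PAIR2             | CHAR1              |
| 22    | FONT(ch3)    | bm2       | PAIR3             | CHAR2              |
| 23    |              | bm3       |                   | CHAR3              |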

As long as the strip is wide enough to cover the longest latency of the video memory, we simply switch strip buffers after rendering the last pixel of a strip.  Switching strip buffers should also commence fetching the next strip's worth of video data.
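The two tables suggest a simple timing model, sketched below in Python.  The formula is my own generalization from those two examples, not something verified against hardware: one cycle to issue the first address, then one cycle per character for each of the attribute, character, and font bursts, plus a fixed number of wait cycles on each of the two external bursts.  It also assumes one memory cycle per pixel, as the text does.

```python
import math


def refill_cycles(strip_chars: int, wait_states: int = 0) -> int:
    """Cycles to reload one strip under the model described above.

    wait_states is paid once per external burst (attributes, then
    characters); the font burst is assumed to hit zero-wait block RAM.
    """
    return 1 + 3 * strip_chars + 2 * wait_states


def min_pixels_per_char(strip_chars: int, wait_states: int = 0) -> int:
    """Narrowest character width that still hides the refill time."""
    return math.ceil(refill_cycles(strip_chars, wait_states) / strip_chars)


for chars in (4, 8, 16):
    for waits in (0, 5):
        print(chars, waits, refill_cycles(chars, waits),
              min_pixels_per_char(chars, waits))
```

For a 4-character strip this reproduces the 13- and 23-cycle figures from the tables.  Under the same model, wider strips amortize the fixed latency over more characters, which is one reason to consider 8- or 16-character strips with slow memory.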

This algorithm should work to keep the video data sequenced for video refresh while it is in the middle of the scanline.  The next issue to tackle is how to sequence the *first* strip, along the left-most edge of the display.  The CRTC doesn't have enough information to trigger the first strip fetch exactly 4-16 characters ahead of the left edge of the display.  The only events we can rely upon for this are:

  1. The negation of the display enable signal.
  2. The assertion of HSYNC.
  3. The negation of HSYNC.

I believe each of these events would serve a unique role.  We would accumulate the address increment value to the fetch pointers when display enable falls.  We would schedule the first strip fetch at the assertion of HSYNC.
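A small behavioral sketch in Python of the two roles described so far.  Everything here is hypothetical naming on my part (`FirstStripSequencer`, `step`, `fetches`), and the pointer and increment values are arbitrary; it only models edge detection on the two signals and what each edge triggers.

```python
from typing import List, Tuple


class FirstStripSequencer:
    """Reacts to display-enable and HSYNC edges, one sample per call."""

    def __init__(self, attr_ptr: int, char_ptr: int, increment: int) -> None:
        self.attr_ptr = attr_ptr
        self.char_ptr = char_ptr
        self.increment = increment
        self.fetches: List[Tuple[int, int]] = []  # scheduled first-strip fetches
        self._prev_de = False
        self._prev_hsync = False

    def step(self, display_enable: bool, hsync: bool) -> None:
        # Falling edge of display enable: accumulate the address increment.
        if self._prev_de and not display_enable:
            self.attr_ptr += self.increment
            self.char_ptr += self.increment
        # Rising edge of HSYNC: schedule the first strip fetch of the next line.
        if hsync and not self._prev_hsync:
            self.fetches.append((self.attr_ptr, self.char_ptr))
        self._prev_de = display_enable
        self._prev_hsync = hsync


seq = FirstStripSequencer(attr_ptr=0x0800, char_ptr=0x0000, increment=80)
for de, hs in [(True, False), (False, False), (False, True), (False, False)]:
    seq.step(de, hs)
print(seq.fetches)
```

Because display enable falls before HSYNC asserts, the pointers are already updated by the time the first strip fetch is scheduled.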

Bitmap mode can be implemented by simply not resolving character codes into font bitmaps.  "Monochrome mode" (that is, where one turns off attributes) can be implemented by having the attribute fetch logic just synthesize default attribute values based on register settings.
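Both special cases amount to bypassing one stage of the pipeline, which the following Python sketch tries to show.  The function names, the `default_attr` parameter, and the `font_lookup` callback are placeholders of mine, standing in for whatever the register settings and font fetch unit actually provide.

```python
from typing import Callable, Sequence


def fetch_attr(vram: Sequence[int], addr: int,
               attributes_enabled: bool, default_attr: int) -> int:
    """Attribute stage: read memory, or synthesize a default value."""
    if not attributes_enabled:
        # "Monochrome mode": no memory access, value comes from registers.
        return default_attr & 0xFF
    return vram[addr]


def resolve(byte: int, attr: int, bitmap_mode: bool,
            font_lookup: Callable[[int, int], int]) -> int:
    """Font stage: pass bitmap bytes through, or look up a glyph row."""
    if bitmap_mode:
        # Bitmap mode: the fetched byte already is the pixel data.
        return byte
    return font_lookup(attr, byte)


vram = [0x11, 0x22]
print(hex(fetch_attr(vram, 1, attributes_enabled=False, default_attr=0x0F)))
print(hex(resolve(0xA5, 0x00, bitmap_mode=True, font_lookup=lambda a, c: 0)))
```

In both cases the strip buffer and dot shifter see the same 16-bit pairs as in the normal character mode, so nothing downstream needs to change.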