
Sizing Up

A project log for QuickSilver Neo: Open Source GPU

A 3D Graphics Accelerator for FPGAs

Ruud Schellekens, 06/09/2016 at 20:46

Before going any further, we have a few bits to clarify. Specifically, the bits used to represent all the values in the pipeline. First I'll talk about the sizes for each individual variable in the system. Then, I'll look at a few of the main memories I need to implement the pipeline up to this point.

As always, QuickSilver is defined by the limitations of its platform. While higher resolutions would be nice, I'll stick to 640x480 and spend more time on producing prettier pixels, rather than more pixels. Besides, the 25MHz pixel clock is a nice even division of the 100MHz base clock of QuickSilver. With a resolution of 640x480, we need at least 10 bits for the X-coordinate, and 9 for the Y-coordinate. For symmetry's sake, let's put both at 10. I wanted to look at getting some sub-pixel precision, so I added 2 more bits to each to represent 'quarter' pixels, for 12 bits per coordinate. This may or may not work out in testing, so I'm not promising anything!
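The bit counts above can be checked with a few lines of Python. This is only a sketch: the helper names `bits_for` and `to_quarter_pixels` are made up for illustration, and rounding to the nearest quarter pixel is an assumption rather than something the design has settled on.

```python
SUBPIXEL_BITS = 2  # 2 extra bits give quarter-pixel steps


def bits_for(count: int) -> int:
    """Bits needed to address `count` positions, numbered 0 .. count-1."""
    return (count - 1).bit_length()


def to_quarter_pixels(coord: float) -> int:
    """Convert a pixel coordinate to quarter-pixel units (assumed round-to-nearest)."""
    return round(coord * (1 << SUBPIXEL_BITS))


x_bits = bits_for(640)             # minimum bits for X
y_bits = bits_for(480)             # minimum bits for Y
stored_bits = 10 + SUBPIXEL_BITS   # both coordinates padded to 10, plus sub-pixel bits
```

Here `x_bits` and `y_bits` come out as 10 and 9, and a coordinate such as 100.25 is stored as the integer 401.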

The Z-coordinate is a bit trickier. The reason to store Z even after projecting all triangles to a 2D plane is to order them on screen using a Z-buffer. This means we'll need enough precision to differentiate between two objects close to each other, and consistently decide which is closer to the camera. Lack of precision results in what is called "Z-fighting": two objects that are pretty much at the same depth fighting over which gets to be on top, with the outcome varying per frame, and even per pixel. Unfortunately, lack of Z-buffer precision is just something we have to deal with, even on more modern GPUs. So in this case I'll take the "shoulders of giants" approach and just copy what early 3D accelerators did: 16 bits for Z.
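To make the precision problem concrete, here is a small sketch that quantises depths to 16 bits. It assumes depth is a value in [0, 1) mapped linearly onto the 16-bit range; real pipelines usually store a non-linear depth, so treat the mapping and the name `quantise_depth` as illustrative only.

```python
Z_BITS = 16
Z_MAX = (1 << Z_BITS) - 1


def quantise_depth(z: float) -> int:
    """Map a depth in [0, 1) to a 16-bit integer (assumed linear mapping)."""
    return min(int(z * (1 << Z_BITS)), Z_MAX)


near = quantise_depth(0.50000)
almost_same = quantise_depth(0.50001)   # closer together than one 16-bit step
further = quantise_depth(0.50002)       # just far enough apart to be told apart

z_fight = near == almost_same           # same stored value: the order is ambiguous
resolved = near < further
```

The first two depths land on the same stored value, so the Z-buffer has no way to decide which surface is in front.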

(For some more background on the problems with z-accuracy, check out this article by Steve Baker: https://www.sjbaker.org/steve/omniv/love_your_z_buffer.html )

The VGA connection on the Nexys 2 board I'm using to develop QuickSilver only has a 332-RGB connection: 3 bits for red, 3 for green and 2 for blue, for a total of 8 bits. We could just accept this, but I do want a bit more precision than that. Besides, I also want to reuse these components when I implement texture mapping, using R and G as the U and V coordinates respectively. As a nice compromise between size and scope, I've decided on 12 bits for R and G, and 8 for B. This gives us a total of 4096x4096 texture space, with 8 bits for whatever (shading maybe?). And it all sums nicely to 32.
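A minimal sketch of how a 12/12/8 colour could be packed into one 32-bit word and cut down to the 3/3/2 output. The bit order (R in the top bits) and the plain truncation in `to_rgb332` are assumptions for illustration; the actual Shader is meant to use dithering instead of truncation.

```python
R_BITS, G_BITS, B_BITS = 12, 12, 8


def pack_rgb(r: int, g: int, b: int) -> int:
    """Pack 12-bit R, 12-bit G and 8-bit B into a 32-bit word (assumed layout: R|G|B)."""
    if not (0 <= r < 1 << R_BITS and 0 <= g < 1 << G_BITS and 0 <= b < 1 << B_BITS):
        raise ValueError("component out of range")
    return (r << (G_BITS + B_BITS)) | (g << B_BITS) | b


def unpack_rgb(word: int) -> tuple:
    """Inverse of pack_rgb."""
    r = (word >> (G_BITS + B_BITS)) & ((1 << R_BITS) - 1)
    g = (word >> B_BITS) & ((1 << G_BITS) - 1)
    b = word & ((1 << B_BITS) - 1)
    return r, g, b


def to_rgb332(r: int, g: int, b: int) -> int:
    """Keep only the top 3/3/2 bits of each component (truncation, no dithering)."""
    return ((r >> (R_BITS - 3)) << 5) | ((g >> (G_BITS - 3)) << 2) | (b >> (B_BITS - 2))


white = pack_rgb(0xFFF, 0xFFF, 0xFF)
```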

In summary, for the input triangles:

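| Component   | X  | Y  | Z  | R  | G  | B | Vertex | Triangle |
|-------------|----|----|----|----|----|---|--------|----------|
| Size (bits) | 12 | 12 | 16 | 12 | 12 | 8 | 72     | 216      |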
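The vertex and triangle totals follow directly from the per-component sizes. A quick check (the dictionary name is just for illustration):

```python
COMPONENT_BITS = {"X": 12, "Y": 12, "Z": 16, "R": 12, "G": 12, "B": 8}

vertex_bits = sum(COMPONENT_BITS.values())  # one vertex carries all six components
triangle_bits = 3 * vertex_bits             # a triangle is three vertices
```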

The PreCalc produces a few more values. Specifically, the m and n gradients, and the intermediate Q value. The gradients represent changes in the components over the screen. To represent these values, we can't just use integers; we need fractional numbers as well. These days, most computers use floating-point notation, similar to scientific notation. In it, you separate the digits of a value and its order of magnitude into two numbers: the significand and the exponent. In decimal you could write 123.45 as 1.2345 * 10². This representation is very flexible, allowing both very small and very large numbers in a very small number of bits. Unfortunately, working with it is rather complicated and expensive in hardware.

Instead we'll use the far simpler fixed-point notation. Basically, take whatever I write down and put the decimal point in a specific place. Using the same example as before, again in decimal, I say that I'll always use 4 digits before the decimal point, and 4 behind. 123.45 would then be written as 0123 4500; 0.015 will become 0000 0150; and integer 9001 becomes 9001 0000. It works the same in binary.

I'll use the following notation to represent fixed-point types: s.I.F, where 's' indicates a sign bit, 'I' is the number of integer bits, and 'F' is the number of fractional bits. For example, s.16.10 is a number with a sign bit, 16 integer bits (before the decimal point) and 10 fractional bits (after the decimal point), for a total of 27 bits. It can represent values from -65536 to 65536 in steps of just under 0.001.
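The s.I.F notation maps onto ordinary integer arithmetic: a value is stored as an integer scaled by 2^F. The sketch below assumes two's complement for the signed case and round-to-nearest on conversion; the function names are hypothetical helpers, not part of QuickSilver.

```python
def fixed_bits(int_bits: int, frac_bits: int, signed: bool = True) -> int:
    """Total width of an s.I.F (or I.F) fixed-point value."""
    return int(signed) + int_bits + frac_bits


def to_fixed(value: float, frac_bits: int) -> int:
    """Scale a real value to its raw fixed-point integer (assumed round-to-nearest)."""
    return round(value * (1 << frac_bits))


def from_fixed(raw: int, frac_bits: int) -> float:
    """Convert a raw fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)


def fixed_range(int_bits: int, frac_bits: int, signed: bool = True) -> tuple:
    """(lowest, highest, step) for the format, assuming two's complement when signed."""
    step = 1 / (1 << frac_bits)
    lowest = -(1 << int_bits) if signed else 0
    highest = (1 << int_bits) - step
    return lowest, highest, step


width = fixed_bits(16, 10)           # s.16.10
lo, hi, step = fixed_range(16, 10)
raw = to_fixed(1.5, 10)              # 1.5 * 1024
```

For s.16.10 this gives a 27-bit value with a step of 1/1024, and 1.5 is stored as the raw integer 1536.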

To represent the gradients, we need to be accurate up to the individual pixel. We should be able to specify a change of 1 over a full screen distance. This means we should add 10 fractional bits to each component's base width to get the gradient sizes. Additionally, while the coordinates before were always positive numbers, gradients can be both positive and negative, so we need a sign bit as well.
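A short sketch of why 10 fractional bits are the minimum here: a change of 1 across 640 pixels is a gradient of 1/640, which only survives as a non-zero fixed-point value once the step size is 1/1024 or finer. The code assumes the gradient is truncated toward zero when it is computed, which is an assumption about the hardware rather than a documented choice.

```python
def gradient_raw(delta: int, span: int, frac_bits: int) -> int:
    """Fixed-point gradient delta/span as a raw integer, truncated toward zero."""
    magnitude = (abs(delta) << frac_bits) // span
    return magnitude if delta >= 0 else -magnitude


def interpolate(start: int, grad_raw: int, steps: int, frac_bits: int) -> float:
    """Add the raw gradient once per pixel, as a scanline interpolator would."""
    acc = start << frac_bits
    for _ in range(steps):
        acc += grad_raw
    return acc / (1 << frac_bits)


with_10_bits = gradient_raw(1, 640, 10)   # 1024 // 640
with_9_bits = gradient_raw(1, 640, 9)     # 512 // 640, the gradient vanishes
falling = gradient_raw(-1, 640, 10)       # negative gradients need the sign bit
end_value = interpolate(0, with_10_bits, 640, 10)
```

With 9 fractional bits the gradient rounds down to zero and the component never changes; with 10 it is non-zero, although truncation means the accumulated value after 640 pixels still falls short of the ideal 1.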

The Triangle FIFO will need to store not just the gradients of the triangle, but also the current values of each interpolant as we are drawing it scanline by scanline. These values do need the full precision of each component, plus the 10 fractional bits, but don't need the sign bit.

In summary, for the pre-processed triangles:

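| Component   | M1     | M2     | M3     | MZ     | NZ     | MR     | NR     | MG     | NG     | MB    | NB    |
|-------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-------|-------|
| Format      | s10.10 | s10.10 | s10.10 | s16.10 | s16.10 | s12.10 | s12.10 | s12.10 | s12.10 | s8.10 | s8.10 |
| Size (bits) | 21     | 21     | 21     | 27     | 27     | 23     | 23     | 23     | 23     | 19    | 19    |

| Component   | Xa    | Xb    | Ycurr | Ymid | Ybtm | Zcurr | Rcurr | Gcurr | Bcurr | Q      |
|-------------|-------|-------|-------|------|------|-------|-------|-------|-------|--------|
| Format      | 10.10 | 10.10 | 10    | 10   | 10   | 16.10 | 12.10 | 12.10 | 8.10  | s10.10 |
| Size (bits) | 20    | 20    | 10    | 10   | 10   | 26    | 22    | 22    | 18    | 21     |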

In total, the Triangle FIFO will need to store 405 bits for each currently active triangle.
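As a check on the arithmetic, the sketch below derives each width from its format string and adds them up. The sum reaches 405 only when Q is left out, which suggests that the intermediate Q value is not kept in the FIFO; that reading is an inference from the numbers, not something stated in the tables.

```python
def width(fmt: str) -> int:
    """Bit width of a format such as 's10.10', '16.10' or '10'."""
    signed = fmt.startswith("s")
    body = fmt[1:] if signed else fmt
    return int(signed) + sum(int(part) for part in body.split("."))


GRADIENTS = {
    "M1": "s10.10", "M2": "s10.10", "M3": "s10.10",
    "MZ": "s16.10", "NZ": "s16.10",
    "MR": "s12.10", "NR": "s12.10", "MG": "s12.10", "NG": "s12.10",
    "MB": "s8.10", "NB": "s8.10",
}
CURRENT_STATE = {
    "Xa": "10.10", "Xb": "10.10",
    "Ycurr": "10", "Ymid": "10", "Ybtm": "10",
    "Zcurr": "16.10", "Rcurr": "12.10", "Gcurr": "12.10", "Bcurr": "8.10",
}

gradient_bits = sum(width(f) for f in GRADIENTS.values())
state_bits = sum(width(f) for f in CURRENT_STATE.values())
q_bits = width("s10.10")                  # intermediate value, assumed not stored
fifo_bits = gradient_bits + state_bits    # total per active triangle, without Q
```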

While we're working with bits, let's take a look at a few of the buffers and memories we'll need in our pipeline.

But first, a word on memory in FPGAs. An FPGA uses configurable logic blocks to implement your design, each containing a few look-up tables for logic and a few flip-flops. But to implement memory using those basic building blocks would be pretty wasteful and take up a large amount of space in a typical design. For that reason, FPGA vendors include dedicated memory separate from the programmable logic. In Xilinx devices, these are called Block RAM, or BRAM for short, and are 16kbits each. The Spartan 3E-1200 I'm using has 28 of these BRAM for a total of around 0.5Mbit. BRAM are pretty flexible, allowing for two accesses simultaneously ("True Dual-Port"), and can be configured as 16Kx1, 8Kx2, 4Kx4, 2Kx8, 1Kx16 or 512x32.

Let's start from the back (as always): The Scanline Buffer needs to store 640 pixels (the width of a scanline) at 8 bits per pixel. Because we want to be writing new pixels and displaying the old ones at the same time, we'll need double buffering. The dual-port nature of the BRAM comes in handy here, as we can have two independent ports. In total, we'll need 2x640 pixels at 8bpp, which will fit nicely in a single 2Kx8 BRAM, with room to spare.
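The double-buffered scanline can be modelled as one 2Kx8 memory split into two halves, with one half being drawn into while the other is displayed. In this sketch the top address bit selects the half (so each half starts on a 1024 boundary); that addressing scheme and the class itself are assumptions for illustration.

```python
class ScanlineBuffer:
    WIDTH = 640
    DEPTH = 2048  # one BRAM configured as 2K x 8

    def __init__(self) -> None:
        self.mem = bytearray(self.DEPTH)
        self.draw_half = 0  # half currently written by the renderer

    def _addr(self, half: int, x: int) -> int:
        if not 0 <= x < self.WIDTH:
            raise IndexError("x outside the scanline")
        return (half << 10) | x  # top address bit picks the half

    def write(self, x: int, pixel: int) -> None:
        """Renderer port: write into the half being drawn."""
        self.mem[self._addr(self.draw_half, x)] = pixel & 0xFF

    def read(self, x: int) -> int:
        """Display port: read from the other, finished half."""
        return self.mem[self._addr(1 - self.draw_half, x)]

    def swap(self) -> None:
        """Called at the end of a scanline to exchange the two halves."""
        self.draw_half = 1 - self.draw_half


buf = ScanlineBuffer()
buf.write(5, 0xAB)
before_swap = buf.read(5)  # display still sees the old (empty) line
buf.swap()
after_swap = buf.read(5)   # the freshly drawn pixel is now visible
```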

The next buffer is the UV Buffer, which stores the scanline data at full precision. Between the UV Buffer and Scanline Buffer sits the Shader (more on that later) which, for now, just uses dithering to reduce the colour depth to 8bpp. This buffer needs (again) double buffering of 640 pixels, this time at 32 bits each. 4 BRAM in total.

Next up is the Z Buffer. This buffer stores the nearest Z-value for each pixel in the scanline as we're drawing it. If we're drawing a pixel closer than that, we overwrite the Z-value at that point. If we have a pixel that is further away (behind the already written pixel) we just skip it instead. We don't need the values of the previous scanline for this, so a single 1Kx16 BRAM will suffice.
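The depth test described above comes down to one comparison per pixel. This sketch assumes that a smaller Z means closer to the camera and that the buffer is reset to the furthest value at the start of each scanline; `draw_span` is a hypothetical helper, not part of the pipeline.

```python
Z_FAR = 0xFFFF  # furthest representable 16-bit depth
WIDTH = 640


def draw_span(zbuf: list, colours: list, x_start: int, depths: list, colour: int) -> None:
    """Write each pixel of a span only if it is nearer than what is already stored."""
    for offset, z in enumerate(depths):
        x = x_start + offset
        if z < zbuf[x]:        # assumed convention: smaller Z is closer
            zbuf[x] = z
            colours[x] = colour


zbuf = [Z_FAR] * WIDTH       # reset once per scanline
colours = [0] * WIDTH

draw_span(zbuf, colours, 10, [100, 100, 100], colour=1)
draw_span(zbuf, colours, 11, [50, 200], colour=2)  # first pixel in front, second behind
```

After both spans, pixel 11 shows the second triangle while pixel 12 keeps the first, because the later pixel there was further away.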

Probably the biggest block of memory in the pipeline is the Triangle FIFO. This memory stores the triangles that have been processed by the PreCalc block and are currently being drawn to the screen. As we've seen above, each triangle needs at least 405 bits to store all its information. Unfortunately, while BRAM can be configured as very narrow and deep, they can't become very wide. Using plain BRAM at their widest setting of 32 bits, we would need at least 13 to fit the full 405-bit wide data. That's already half our total budget. Instead, I'll use 7 BRAM configured as 512x224, and spread the triangle data over two accesses for a total of 256 triangles. This is the maximum number of active triangles per scanline. I could spread the triangle over more accesses to reduce the BRAM count even further, but this would reduce the maximum number of triangles per scanline, and add additional overhead for each read and write. I think 256 triangles over 2 accesses in 7 BRAM is a nice balance.
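Spreading one triangle over two accesses can be sketched as splitting a 405-bit record into two 224-bit words stored at consecutive addresses. Which fields end up in which word is not specified here, so the low/high split and the address mapping below are assumptions.

```python
WORD_BITS = 224          # 7 BRAM side by side, 32 bits each
RECORD_BITS = 405        # bits per active triangle
WORDS_PER_TRIANGLE = 2
FIFO_DEPTH = 512         # words available in a 512-deep configuration

max_triangles = FIFO_DEPTH // WORDS_PER_TRIANGLE


def split_record(record: int) -> tuple:
    """Split a triangle record into (low word, high word)."""
    if record < 0 or record >> RECORD_BITS:
        raise ValueError("record does not fit in 405 bits")
    return record & ((1 << WORD_BITS) - 1), record >> WORD_BITS


def join_record(low: int, high: int) -> int:
    """Rebuild the record from its two words."""
    return (high << WORD_BITS) | low


def word_addresses(slot: int) -> tuple:
    """Addresses of the two words belonging to triangle slot `slot`."""
    if not 0 <= slot < max_triangles:
        raise IndexError("no such triangle slot")
    return 2 * slot, 2 * slot + 1


record = (1 << RECORD_BITS) - 1      # worst case: every bit set
low, high = split_record(record)
spare_bits = WORDS_PER_TRIANGLE * WORD_BITS - RECORD_BITS
```

Two words give 448 bits of storage per triangle, leaving 43 bits spare.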

In total, we're using 13 out of 28 BRAM thus far. As we expand the pipeline we'll use a few more for caching and such, but these are the most important ones.
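The BRAM budget can be tallied with a small helper that counts how many blocks a memory needs in a given configuration. The configuration picked for each buffer below is an assumption (four 2Kx8 blocks side by side is just one layout that gives 4 BRAM for the UV Buffer); only the resulting counts come from the text above.

```python
def brams_needed(depth: int, width: int, cfg_depth: int, cfg_width: int) -> int:
    """Blocks needed to build a depth x width memory from cfg_depth x cfg_width BRAM."""
    deep = -(-depth // cfg_depth)   # ceiling division
    wide = -(-width // cfg_width)
    return deep * wide


budget = {
    "Scanline Buffer": brams_needed(2 * 640, 8, 2048, 8),
    "UV Buffer": brams_needed(2 * 640, 32, 2048, 8),
    "Z Buffer": brams_needed(640, 16, 1024, 16),
    "Triangle FIFO": brams_needed(512, 224, 512, 32),
}
total = sum(budget.values())

# For comparison: one 405-bit access per triangle instead of two
single_access_fifo = brams_needed(256, 405, 512, 32)
```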
