C70100 GPU

The Goal: 3D rendering on an FPGA

I decided I wanted to make a graphics processor. It's been a long time since I made a processor of any kind, and this seems like it could be a fun twist.

My goal for now is to run it at 100 MHz and render basic 3D graphics. I am using the upgraded Artix-7 100T version of the Mercury 2.

No Cache - multiple memory banks instead

Why?  I figured that a cache would take more FPGA resources and likely wouldn't help with speed as much as just having specialized types of memory.

Although all 3 of the memory blocks available to the main processor are in the same address space, they are optimized for different things.

There is DMEM, IMEM, and RAM.  These play the roles of a data cache, an instruction cache, and main RAM, except that the first two are memories in their own right rather than caches.
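To make the "one address space, three banks" idea concrete, here is a minimal sketch of how a flat address could be decoded to a bank.  The base addresses and sizes are placeholders I made up for illustration; the real memory map is not given here.

```python
# Hypothetical memory map: every base address and size below is an
# illustrative assumption, not a value from the actual design.
REGIONS = [
    ("IMEM", 0x0000_0000, 0x0000_4000),  # instruction memory
    ("DMEM", 0x0001_0000, 0x0000_4000),  # data memory
    ("RAM",  0x8000_0000, 0x0008_0000),  # external RAM
]

def decode(address):
    """Return (bank name, offset inside that bank) for a flat address."""
    for name, base, size in REGIONS:
        if base <= address < base + size:
            return name, address - base
    raise ValueError(f"unmapped address {address:#010x}")
```

For example, `decode(0x0001_0004)` returns `("DMEM", 4)` with this placeholder map.  The point is only that software sees one address space while each bank can be built differently underneath.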


Parallel mathematical operations are achieved through 64 ALUs that are essentially separate from the main processor.  These ALUs are more like a coprocessor with DMA than an internal part of the processor.

Each ALU has a 32-bit accumulator.  They can do floating point and integer addition, subtraction, and multiplication.  The inverse function is only available for floating point.  And of course integer bitwise and shift operations are supported.

The ALUs have their own memory inside the FPGA that can exchange every ALU's 32-bit data in a single clock cycle.  With 64 ALUs at 32 bits each, the data bus is 2048 bits wide.  I have to say, I like that large number.  This is by far my biggest FPGA design ever, so it's fun.  Except for the synthesis and implementation...  which take 20 minutes.
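The 2048-bit figure is just 64 lanes of 32 bits side by side.  A small software model of that wide word might look like the sketch below.  Putting lane 0 in the lowest bits is my assumption; the actual lane ordering is not specified here.

```python
NUM_ALUS = 64                      # from the text
WORD_BITS = 32                     # from the text
BUS_BITS = NUM_ALUS * WORD_BITS    # 64 * 32 = 2048
WORD_MASK = (1 << WORD_BITS) - 1

def pack_bus(words):
    """Pack one 32-bit word per ALU into a single 2048-bit integer.

    Lane 0 goes in the lowest 32 bits (an assumed ordering).
    """
    if len(words) != NUM_ALUS:
        raise ValueError("expected exactly one word per ALU")
    bus = 0
    for lane, word in enumerate(words):
        bus |= (word & WORD_MASK) << (lane * WORD_BITS)
    return bus

def unpack_bus(bus):
    """Split a 2048-bit integer back into 64 32-bit words."""
    return [(bus >> (lane * WORD_BITS)) & WORD_MASK for lane in range(NUM_ALUS)]
```

In hardware this is wiring rather than a loop, which is why the whole exchange can fit in one clock cycle.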

  • 1 × Mercury 2
  • 1 × Your VGA adapter of choice

  • Need hardware rasterization... Architecture needs an upgrade

    Dylan Brophy, 10/11/2020 at 02:30 (3 comments)

    At 640x480 pixels, at 60 FPS, you have to write about 18 million pixels per second.

    So each pixel needs to be rasterized and written to RAM in only about 5 clock cycles (if clk = 100 MHz).  There is absolutely no way I can do that in software.  Having 64 ALUs will help me calculate the positions of all the fragments in the needed time, but not actually write the fragments to RAM.

    Suppose hardware rasterization is supported.  The CPU + ALUs compute the next set of fragments while the rasterizer writes pixels to RAM.  Even then, it would barely be fast enough to handle two triangles that each cover the whole screen.  Of course, you could have more triangles that together fill the screen, or several triangles overlapping in a confined space.  But what if you have a 3D world with many overlapping triangles?  It's possible for more fragments to be created than there are pixels on the screen, and all of them need to be compared with the depth buffer.  Although only a subset gets written to RAM, the comparison takes time...
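The arithmetic behind that budget, as a quick sketch.  The resolution, frame rate, and clock come from the text; the `overdraw` parameter (average number of fragments generated per screen pixel) is a name I introduced for illustration.

```python
WIDTH, HEIGHT, FPS = 640, 480, 60
CLOCK_HZ = 100_000_000  # 100 MHz target clock

pixels_per_second = WIDTH * HEIGHT * FPS           # 18,432,000
cycles_per_pixel = CLOCK_HZ / pixels_per_second    # roughly 5.4

def cycles_per_fragment(overdraw):
    """Clock cycles available per fragment when each screen pixel
    receives `overdraw` fragments on average (hypothetical parameter)."""
    return cycles_per_pixel / overdraw
```

With two full-screen triangles (`overdraw = 2`) the budget is already under 3 cycles per fragment, and every one of those fragments still needs a depth-buffer read and compare.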

    There are a few options:

    1. Decrease framerate
    2. Decrease resolution
    3. Increase clock speed
      1. Probably requires new, faster external RAM (hard to add to my board)
    4. Highly parallel comparing and writing
      1. Potentially requires more memory to be used in parallel
    5. Increase clock speed, but increase memory bandwidth by adding external RAM
      1. Certainly requires more memory

    I can add more external RAM to the Mercury board, but it may end up being slower or more expensive, because it would have to go through the 5 V level-shifter built into the board.

    I am not interested in rolling my own FPGA board because I lack the production capabilities.  I could make a carrier board for the Mercury 2 though, and I probably will.

  • Proposed Rendering Pipelines

    Dylan Brophy, 10/10/2020 at 04:31 (0 comments)

    Why would I look at this before having a fully functional processor?

    I need to know what operations I will be doing the most, and perhaps even write the software that will be rendering, before I design the instruction set.  By understanding the program I'll be running I can optimize the processor for that program.

    There are three main ways I want to be able to render:

    • Rasterizing my 3D primitives
    • Ray tracing 3D primitives
    • Rendering 2D textures/graphics


    Rasterization

    Here is my idea:

    1. Create a depth buffer somewhere, storing distance from the camera
    2. For each triangle, project the three vertices
      1. For each pixel in that triangle, compare its depth with the depth buffer and overwrite the stored depth if the pixel is closer to the camera
      2. Also compute the fragment color and write it to the color buffer if the pixel is closer to the camera
    3. Swap the video generator's address with that of the color buffer
      1. Old color buffer is new image, old image is new color buffer
      2. This achieves double buffering
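The steps above could be sketched in software roughly like this.  It is a model of the idea, not the hardware design: it assumes the vertices are already projected to screen space, uses one flat color per triangle instead of computing a per-fragment color, interpolates depth linearly in screen space, and uses the common edge-function inside test (a technique I picked for the sketch).  All function names are hypothetical.

```python
import math

def edge(ax, ay, bx, by, px, py):
    """Twice the signed area of triangle (A, B, P)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def draw_triangle(color_buf, depth_buf, tri, color):
    """Depth-tested fill of one triangle.

    tri holds three already-projected (x, y, depth) vertices.
    """
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return  # degenerate triangle, nothing to draw
    height, width = len(depth_buf), len(depth_buf[0])
    # Only visit the triangle's bounding box, clipped to the screen.
    min_x = max(math.floor(min(x0, x1, x2)), 0)
    max_x = min(math.ceil(max(x0, x1, x2)), width - 1)
    min_y = max(math.floor(min(y0, y1, y2)), 0)
    max_y = min(math.ceil(max(y0, y1, y2)), height - 1)
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5  # sample at the pixel centre
            w0 = edge(x1, y1, x2, y2, px, py) / area
            w1 = edge(x2, y2, x0, y0, px, py) / area
            w2 = edge(x0, y0, x1, y1, px, py) / area
            if w0 < 0 or w1 < 0 or w2 < 0:
                continue  # pixel centre is outside the triangle
            depth = w0 * z0 + w1 * z1 + w2 * z2
            if depth < depth_buf[y][x]:  # closer to the camera wins
                depth_buf[y][x] = depth
                color_buf[y][x] = color

def render_frame(triangles, front, back, clear_color=0):
    """Draw into the back buffer, then return the buffers swapped."""
    height, width = len(back), len(back[0])
    depth_buf = [[math.inf] * width for _ in range(height)]
    for row in back:
        row[:] = [clear_color] * width
    for tri, color in triangles:
        draw_triangle(back, depth_buf, tri, color)
    return back, front  # new front buffer, new back buffer
```

Calling `front, back = render_frame(triangles, front, back)` once per frame gives the double buffering from step 3: the swap is only an exchange of references, much like swapping the video generator's base address.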

    Ray Tracing

    Not sure if I ever want to really program this, but I want the option.

    1. For each pixel
      1. Find the nearest triangle (if any) that would render on that pixel
      2. Compute the color of that pixel using the triangle (or lack thereof)
      3. Write the color to the pixel
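A software sketch of the per-pixel loop body.  The text does not say which intersection test would be used; I picked the well-known Möller-Trumbore ray/triangle test for the sketch, and all names here are hypothetical.  Each triangle carries a single flat color.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore test: distance along the ray, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv_det = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv_det
    if u < 0 or u > 1:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv_det
    if v < 0 or u + v > 1:
        return None
    t = dot(e2, q) * inv_det
    return t if t > eps else None  # ignore hits behind the ray origin

def trace_pixel(origin, direction, scene, background=0):
    """Return the color of the nearest triangle hit, else the background.

    scene is a list of (triangle, color) pairs.
    """
    nearest, color = float("inf"), background
    for tri, tri_color in scene:
        t = ray_triangle(origin, direction, tri)
        if t is not None and t < nearest:
            nearest, color = t, tri_color
    return color
```

Note that every pixel loops over every triangle, so the cost grows with pixels times triangles unless some spatial structure is added on top.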

    2D Images

    Here we need the option of rotating images and computing depth (think 2D games like Starbound).  Possibly smooth lighting or cool graphical effects.

    1. Create a depth buffer and color buffer
    2. For each sprite instance
      1. For each pixel in the sprite compute a rotated position for it
        1. Shade and write color to the color buffer
    3. Swap the video generator's address with that of the color buffer
      1. Old color buffer is new image, old image is new color buffer
      2. This achieves double buffering

    This is surprisingly similar to the Rasterization pipeline...  Am I doing something wrong?  Please let me know if I am, I'm somewhat new to this :P
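For comparison, the sprite steps could be sketched like this.  It follows the list literally by forward-mapping each sprite pixel to a rotated screen position.  The function name, the `None`-means-transparent convention, and the single depth value per sprite are my assumptions, and shading is left out.

```python
import math

def draw_sprite(color_buf, depth_buf, sprite, cx, cy, angle, depth):
    """Draw a rotated sprite with a depth test.

    sprite: 2D list of colors, None meaning transparent (assumed convention).
    (cx, cy): screen position of the sprite's centre.
    angle: rotation in radians; depth: one depth for the whole sprite.
    """
    rows, cols = len(sprite), len(sprite[0])
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    for sy in range(rows):
        for sx in range(cols):
            color = sprite[sy][sx]
            if color is None:
                continue
            # Offset of this sprite pixel from the sprite's centre.
            dx = sx - (cols - 1) / 2
            dy = sy - (rows - 1) / 2
            # Standard 2D rotation, then move to the screen position.
            x = round(cx + dx * cos_a - dy * sin_a)
            y = round(cy + dx * sin_a + dy * cos_a)
            if 0 <= y < len(depth_buf) and 0 <= x < len(depth_buf[0]):
                if depth < depth_buf[y][x]:
                    depth_buf[y][x] = depth
                    color_buf[y][x] = color
```

One caveat with forward mapping: at angles that are not multiples of 90 degrees, rounding can leave small holes in the drawn sprite.  A common alternative is to loop over destination pixels and rotate backwards into the sprite, which makes this look even more like the triangle pipeline.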

  • Main Processor finally works!

    Dylan Brophy, 10/09/2020 at 15:53 (0 comments)

    I started this project a few weeks ago, so I already have a lot working.  Let me catch you up...

    Most of that time went into writing all of the components in VHDL and testing them in the simulator.  When I finally got to flash the board and see the VGA output, the video worked the first time, but there was no sign of the processor working.  That was particularly odd, because it worked fine in the simulator...

    My program (written in binary :P, so the code is not intelligible) essentially boiled down to this:

    Load 0xFF002000 into every 32-bit block in 256 bytes somewhere in video RAM

    That first line essentially told the ALUs to load that constant 0xFF002000 and store it into RAM 64 times.  It didn't work because the processor basically told the ALU to cancel the operation before it started :(
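In software terms, that one wide store behaves something like the sketch below: 64 ALUs times 4 bytes each gives the 256-byte block.  The function name, the base address argument, and the little-endian byte order are assumptions for illustration.

```python
NUM_ALUS = 64
BYTES_PER_WORD = 4
CONSTANT = 0xFF002000

def fill_block(video_ram, base):
    """Emulate one wide store: each ALU writes its accumulator to its own word.

    video_ram is a bytearray; base is a byte offset into it.
    Byte order (little-endian) is an assumption, not taken from the design.
    """
    accumulators = [CONSTANT] * NUM_ALUS  # every ALU loads the same constant
    for lane, value in enumerate(accumulators):
        offset = base + lane * BYTES_PER_WORD
        video_ram[offset:offset + BYTES_PER_WORD] = value.to_bytes(
            BYTES_PER_WORD, "little")
    return NUM_ALUS * BYTES_PER_WORD  # bytes written: 256
```

The loop is only there because Python is sequential; on the FPGA all 64 words go out over the wide bus together.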

    After fixing it, some order can be seen in the sea of random pixels:

    Now we can write to video memory in large blocks, and we can *probably* do any math operation we want, but the design is not Turing complete.  Next time I want to make something more interesting happen.




Yann Guidon / YGDES wrote 10/09/2020 at 15:50

I'm very curious !

Dylan Brophy wrote 10/09/2020 at 15:55

Thanks! First time posting on the IO in forever XD

Yann Guidon / YGDES wrote 10/09/2020 at 16:03

Well, welcome back, and don't go away this time ;-)

Dylan Brophy wrote 10/09/2020 at 16:49

No promises ;-P
