Rendering the World (Part 2)

A project log for Arduino Minecraft

Making a Minecraft clone that runs on the Teensy 4.1 in the Arduino environment.

Dylan Brophy, 05/24/2024 at 14:27, 2 Comments
I changed the screen resolution to 400x300 for now, which significantly impacts the framerate, but that's fine while I'm just running some tests.  First, I needed to fix the texture rendering.  The math I was using before was very monkey, and needed to be changed.  There is some code used to check whether a given pixel is inside each rendered quad, and this generates some coefficients.  These coefficients apply to the edges, not the vertices, however.  I found that it wasn't hard to do some linear transformations to extract the texture coordinates from them, and the new math is actually faster than the original code.  I also gave grass a different texture for the top, and now we have this rendering:
This is much better.  The quads in this image are diamond shaped and rotated, but the textures are almost perfect!  The shape of the terrain isn't too hard to make out, and with a more interesting scene, it could probably look a lot cooler.  We do see nearby textures bleeding into the dirt texture though.  To fix this, we can add a simple if block to prevent the texture coordinates from going out of bounds.  Certainly, this can't affect our render time much.  Certainly... right?
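For readers curious about the texture coordinate step mentioned above, here is a minimal sketch of one way to turn edge-style values into (u, v), assuming the quad projects to a parallelogram on screen. The names `Vec2`, `TexCoord`, `cross2`, and `quadTexCoord` are hypothetical and not taken from the project; the actual code may derive its coefficients differently.

```cpp
// Hypothetical types for this sketch; not from the project source.
struct Vec2 { float x, y; };
struct TexCoord { float u, v; };

// 2D cross product: the signed area term that an edge test evaluates.
static float cross2(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

// Maps pixel p to (u, v) inside a parallelogram with corner p0 and
// sides e1 = p1 - p0 (the u direction) and e2 = p3 - p0 (the v direction).
// Assumes the quad is not degenerate (cross2(e1, e2) != 0).
TexCoord quadTexCoord(Vec2 p0, Vec2 p1, Vec2 p3, Vec2 p) {
  Vec2 e1{p1.x - p0.x, p1.y - p0.y};
  Vec2 e2{p3.x - p0.x, p3.y - p0.y};
  Vec2 d{p.x - p0.x, p.y - p0.y};
  float inv = 1.0f / cross2(e1, e2);  // constant per quad, so compute it once
  // Writing d = u*e1 + v*e2 and crossing with e2 (or e1) isolates u (or v).
  return {cross2(d, e2) * inv, cross2(e1, d) * inv};
}
```

Values of u and v between 0 and 1 mean the pixel is inside the quad, so the same two numbers can serve as both the inside test and the texture lookup. For a diamond with corners (0,0), (4,2), and (-4,2), the center pixel (0,2) maps to (0.5, 0.5).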

At this point I was doing some profiling to see what's eating up my CPU time.  The first thing I wanted to know was the overhead of the map floodfill algorithm; turns out it's really tiny.  I knew it wouldn't be more than half, but I had no idea it would be this small:
Processed 1035 blocks of 16384 total blocks.
Total time: 43.820000ms
Render time: 43.351000ms
Time taken for 100 frames: 4412ms
Time per frame: 44120.000us
FPS: 22.7

Less than half a millisecond - or just over 1% of the render time.  Roughly 98.9% of the total time is spent in the 3D rendering code, and of that probably about 99% is spent on computing and drawing individual pixels - not even matrix multiplication or any actual 3D projection.  Ok, so let's see how much time we add by just adding some if blocks to check the texture coordinates:

// Almost all textures tile well.
if (kx > 1) kx -= 1;
else if (kx < 0) kx += 1;
if (ky > 1) ky -= 1;
else if (ky < 0) ky += 1;
Processed 1035 blocks of 16384 total blocks.
Total time: 46.214000ms
Render time: 45.750000ms
Time taken for 100 frames: 4652ms
Time per frame: 46520.000us
FPS: 21.5

That's a big increase, considering it's a simple if block.  The reason is that the fragment code does an unbelievable number of iterations, so that if block is probably running millions of times per frame.  All together the if block adds about 2.39ms, or about 5.5%.  That's a lot when you need under 16.7ms per frame for 60FPS.  If a simple if block like that adds that much time, then what about the other if blocks in the fragment code?

inline void fragmentShaderRaw(DisplayBuffer* display, int x, int y, uint8_t z, Color textureColor) {
  uint32_t displayIndex = display->width * y + x;

  uint8_t* depthLoc = &(display->depthArray[displayIndex]);
  if (*depthLoc < z)
    // Discard the fragment
    return;

  if (textureColor >> 4 == 0)
    // The fragment has no color; discard.
    return;

  Color fragColor = textureColor;

  // Apply this fragment to the framebuffer
  Color* outColor = &display->colorArray[displayIndex];
  if (fragColor >> 4 == 15)
    *outColor = fragColor;
  else {
    *outColor |= fragColor;
    *outColor &= 0x7;
  }
  *depthLoc = z;
}
So now, looking at this simple and rather streamlined function, we can see that in reality it takes a lot of time.  Most of these checks cannot be avoided.  However, I was able to optimize a lot of my quad drawing function, which did a lot of the same math many times in a tight loop.  I found that each multiplication I removed saved me about half a millisecond!  Here is my timing now, under the same test:
This is much better, especially for the higher resolution of 400x300.
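To illustrate the kind of saving described above, here is a hedged sketch of removing multiplications from a tight pixel loop. It assumes the inside test is a linear edge function of the form a*x + b*y + c; the `Edge` struct and both functions are hypothetical and are not the project's quad drawing code.

```cpp
#include <cstdint>

// Hypothetical edge function E(x, y) = a*x + b*y + c.
// A pixel counts as inside when E(x, y) >= 0.
struct Edge { int32_t a, b, c; };

// Naive version: two multiplications for every pixel.
int countInsideNaive(Edge e, int width, int height) {
  int inside = 0;
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
      if (e.a * x + e.b * y + e.c >= 0) ++inside;
  return inside;
}

// Incremental version: E is linear, so moving one pixel right adds a
// and moving one row down adds b. No multiplications remain in the loop.
int countInsideIncremental(Edge e, int width, int height) {
  int inside = 0;
  int32_t rowStart = e.c;  // E(0, 0)
  for (int y = 0; y < height; ++y) {
    int32_t value = rowStart;  // E(0, y)
    for (int x = 0; x < width; ++x) {
      if (value >= 0) ++inside;
      value += e.a;  // step to E(x + 1, y)
    }
    rowStart += e.b;  // step to E(0, y + 1)
  }
  return inside;
}
```

Both functions visit the same pixels and return the same count; only the arithmetic per pixel differs. Any quantity that varies linearly across the screen, such as texture coordinates on a flat quad under the same assumptions, could be stepped the same way.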

It is worth noting that, for these tests, the processor is running at 720MHz.  This is about a 20% overclock.  Here are the stats at different processor speeds - you could probably calculate this but a table is convenient:
Clock Speed | Frame Time | FPS

Some of these chips go as high as 1.08GHz.  Other chips I have can't go above 600MHz.  So there really is a chip lottery with the Teensy 4.1s.  The chip I was testing with here started crashing at 1.08GHz, so I did not put that speed in the table.  In any case, it is promising that the speeds are this good at that resolution.


Jade wrote 05/27/2024 at 14:55

You might want to look into raycasting-like DDA-style algorithms for this, rather than rendering polygons



Dylan Brophy wrote 05/27/2024 at 16:17

Thank you, I'll have to check this out!  It would definitely be great to get rid of the z-buffer.  While browsing last night I also found this DOOM port to Teensy 4.1, which seems to run quite well and may also offer some good optimization inspiration.  I haven't read through that yet, but it seems very interesting.

I definitely want to try to get the render time down because I still have to send the image to graphics memory, and that takes time too, not to mention simulating the world.  Hopefully I can use DMA for the image but not sure yet.
