Rendering the World (Part 1)

A project log for Arduino Minecraft

Making a Minecraft clone that runs on the Teensy 4.1 in the Arduino environment.

dylan-brophyDylan Brophy 05/23/2024 at 21:520 Comments

Here's where I'm at now with the rendering:

I can render some of the terrain's shape, but the texture coordinates are not handled properly, and I can currently only render one chunk.  This is the first part for getting the rendering to do what I want, after this I need to:
  1. Optimize the code more and fix the texture bugs, and ensure it works for more scenarios (Part 2)
  2. Expand it to handle multiple 16x16x64 chunks (Part 3)

Block Iteration Speed - How to Render Voxels Fast

I don't have screenshots, but originally my render for a whole chunk took over a second - so entirely unplayable.  This was for two reasons:

It was easy to make the chunk sides not render.  We don't want this because the player should never see the sides anyway - they would simply enter a new chunk.  I think this was the bulk of it, because when I switch to the loop again, it now only takes 82ms to render.  That's a 90%+ speed boost.

The harder part was eliminating the loop.  How do you render the blocks without checking each one?  Well, all the blocks that need to be rendered have two things in common.  First, they all occur within the camera's view.  Second, they all touch the same connected region of air.  By using something akin to a floodfill algorithm, we can find only blocks visible to the player, and ignore about 90% of the blocks in the chunk - never even iterating over them.

A floodfill algorithm has expensive overhead, especially when considering the need to check the camera's viewport, which involves two matrix multiplications for every block.  However, to iterate over every block in the chunk, it is 16x16x64 iterations, or 16384 iterations.  For many blocks that aren't visible, this may cause rendering, and even for non-rendered blocks, it involves a series of checks for rendering.  This is especially true of caves.  So, it is much better never to iterate over those blocks at all.  The overhead for the floodfill algorithm, as it turns out, is well worth it.

My floodfill algorithm processed only 1140 blocks, or 7% of the 16384 blocks in the chunk.  It isn't perfect and probably has a bug or two, but still this is very pleasing.

The algorithm is a modified Breadth-First-Search.  Here's the code:

// This essentially checks if the block is in the camera's view
bool isValidNode(int x, int y, int z) {
  Vector3 pos = { x, y, z };

  applyMat4fToVertex(&pos, &(settings.worldMatrix));

  pos.z += 0.5f;
  if (pos.z > 0 && pos.z < 0.1f)
    pos.z = 0.1f;

  float k = 1 / pos.z;
  pos.x = (pos.x + (pos.x < 0 ? 0.5f : -0.5f)) * k;
  pos.y = (pos.y + (pos.y < 0 ? 0.5f : -0.5f)) * k;

  applyMat4fToVertex(&pos, &settings.projectionMatrix);

  return pos.z >= 0 && pos.z < 260 && pos.x >= 0 && pos.y >= 0 && pos.x < width && pos.y < height;

void searchRender(uint8_t blocks[16][16][64], uint8_t camx, uint8_t camy, uint8_t camz) {
  // This function does a breadth-first search, where the graph's nodes are every block within the camera's viewport.
  // x, y, and z are the coordinates to start the search at.

  Vector3iQueue queue(128);
  uint16_t searchedBlocks[16][64];
  bzero(searchedBlocks, 16*64*2);

  queue.append(camx, camy, camz);

  int blocksProcessed = 0;
  uint8_t x, y, z;
  while (queue.get(x, y, z)) {
    // Fill it.

    for (uint8_t q = 1; q < 64; q*=2) {
      uint8_t i = x + ((q >> 0) & 1) - ((q >> 3) & 1);
      uint8_t j = y + ((q >> 1) & 1) - ((q >> 4) & 1);
      uint8_t k = z + ((q >> 2) & 1) - ((q >> 5) & 1);

      // Skip blocks outside this chunk
      if (i & 0xF0 || j & 0xF0 || k & 0x80)

      // Mark the block as checked
      if ((searchedBlocks[j][k] >> i) & 1)
      searchedBlocks[j][k] |= 1 << i;

      // Check if the block is in view:
      if (!isValidNode(i, j, k))

      // Should we render it, or fill it?
      uint8_t block = blocks[i][j][k];
      if (block > 0)
        renderBlock(i, j, k, block);
        queue.append(i, j, k);

  Serial.printf("Processed %i blocks of %i total blocks.\n", blocksProcessed, 16*16*64);

When it's all said and done, this function takes only ~28ms to run in my test world, allowing for faster than 30FPS.  This is almost 3x as fast as the naive loop.

Optimizing Further

There's more optimizing I can do: