Terribly unreliable

A project log for Terrible Cluster

5 Raspberry Pi Zeros. One custom USB hub. Endless disappointment.

ajlitt, 10/20/2017 at 20:08

One problem that's plagued the project is the USB boot process.  The fourth node almost always fails to boot on the first try.

Terrible Cluster relies on the rpiboot utility from the Raspberry Pi Foundation to serve the bootloader and kernel to each of the compute nodes.  rpiboot first watches for a USB device to be plugged in that matches a Zero in USB device boot mode.  It then sends down the first-stage bootloader, "bootcode.bin".  The first stage begins executing, recognizes that it was booted from USB, and then reinitializes the USB controller in device mode.  When rpiboot notices that the device has re-enumerated under a different device ID, it enters a USB fileserver mode.
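
To make the two-step handoff concrete, here is a minimal Python sketch of the decision the host has to make each time a device shows up.  It is not rpiboot's actual code: the identifiers `BOOT_ROM_ID` and `SECOND_STAGE_ID`, the `UsbDevice` type, and the `next_action` function are all hypothetical placeholders, and the real USB IDs are deliberately not filled in.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Placeholder identifiers (assumptions), NOT the real USB IDs of a Pi Zero.
BOOT_ROM_ID = "boot-rom"          # Zero waiting in USB device boot mode
SECOND_STAGE_ID = "second-stage"  # same Zero after bootcode.bin re-enumerates


class Action(Enum):
    SEND_BOOTCODE = auto()  # push the first-stage bootloader
    SERVE_FILES = auto()    # enter fileserver mode for this device
    IGNORE = auto()         # not a device we are responsible for


@dataclass(frozen=True)
class UsbDevice:
    port: str       # hub slot the device is plugged into (illustrative)
    device_id: str  # identifier the device enumerated with


def next_action(dev: UsbDevice) -> Action:
    """Pick the host-side step for a newly enumerated device."""
    if dev.device_id == BOOT_ROM_ID:
        return Action.SEND_BOOTCODE
    if dev.device_id == SECOND_STAGE_ID:
        return Action.SERVE_FILES
    return Action.IGNORE
```

The point of the sketch is that the same physical node appears to the host twice, under two different IDs, and each appearance calls for a different response.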

At this point, bootcode.bin goes through the same process of loading executables and config files as when booting from an SD card: start_cd.elf, config.txt, kernel.img, etc.  But instead of fetching the files from SD, it requests them from the USB host.  rpiboot sits in a loop waiting for file requests and sending files to the device until it receives a message saying that the device is done, or a transaction with the device fails.
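
The serving loop described above can be sketched roughly as follows.  This is an illustration under stated assumptions, not rpiboot's implementation: the `Command` values, the `TransferError` exception, the `serve_files` function, and the empty reply for a missing file are all invented for the example, and the USB transport is replaced by a plain `send` callable.

```python
from enum import Enum


class Command(Enum):
    READ_FILE = "read"  # device asks the host for a file by name
    DONE = "done"       # device has everything it needs


class TransferError(Exception):
    """Stands in for a failed USB transaction (hypothetical)."""


def serve_files(requests, files, send):
    """Serve file requests until the device is done or a transfer fails.

    requests: iterable of (Command, filename) pairs from the device
    files:    dict mapping filename to its contents as bytes
    send:     callable that transmits bytes, raising TransferError on failure
    Returns (completed, served_filenames).
    """
    served = []
    try:
        for command, name in requests:
            if command is Command.DONE:
                return True, served
            data = files.get(name)
            if data is None:
                send(b"")  # assumption: a missing file gets an empty reply
            else:
                send(data)
                served.append(name)
    except TransferError:
        return False, served  # a transaction with the device failed
    return False, served      # requests stopped without a DONE message
```

In a sketch like this, the failure I'm seeing would show up as `serve_files` returning `False` with `kernel.img` missing from the served list.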

This works fine for me when I'm booting one Pi at a time.  However, when I boot all four Pi Zero nodes at once, rpiboot often fails while sending the kernel to the fourth node that it boots.  rpiboot does not work through the nodes in a fixed order; instead it sends bootcode.bin to the first USB device it finds with a matching bootloader VEN/DEV (vendor/device ID).  It is always the fourth node to boot that fails, and the failure does not appear to favor a particular slot.  Once a node finishes loading all of the required files over USB, it boots reliably.  So this has everything to do with the USB boot process, and not the OS.

If I can't trust rpiboot to sit in the background and serve files to booting nodes, then this isn't going to be a useful system for learning about software deployment.  Sounds like a good debugging project for the weekend...