We wasted many hours reworking radios and LEDs on dozens of badges, and it turns out we could have fixed both in software. We've posted some details on Twitter and complained/celebrated in IRC over the past week, but we wanted to write down exactly what happened.
Roughly 20% of all badges were having radio failures on startup. Using some built-in debug features in our code, we could easily see that the PLL was not locking on these badges during startup. Very strange behavior. In some cases this was due to a loose load capacitor for the radio's 32 MHz oscillator. Easy fix, but not enough for most of the failed radios. We then discovered that the radios on failed badges would start after a flash (about 30 seconds from cold startup). It turns out our badges were starting too quickly and not giving the radio's transceiver time to start up. Moteino DualOptiboot has a similar fix and sets a delayed-startup fuse. Updating our bootloader has since fixed a dozen badges. We also tried adding a delay in setup() in the main badge code, but sadly that is too late in execution. The delay needs to occur very early, in the bootloader.
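For illustration, the kind of debug check that exposes a PLL that never locks can be sketched as a bounded poll. This is a rough sketch, not the actual badge firmware: `wait_for_pll_lock`, the two hook types, and the `fake_radio` simulator are all hypothetical names, and on real hardware the hooks would wrap whatever status read and delay the radio driver provides. It only reports the problem; as described above, the fix itself has to happen earlier, in the bootloader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware hooks (assumptions, not from the badge code):
 * one reports whether the transceiver's PLL has locked, the other waits. */
typedef bool (*pll_locked_fn)(void *ctx);
typedef void (*delay_ms_fn)(void *ctx, uint32_t ms);

/* Poll for PLL lock every step_ms until timeout_ms has elapsed.
 * Returns the number of milliseconds waited, or -1 if the PLL never locked. */
int32_t wait_for_pll_lock(pll_locked_fn locked, delay_ms_fn delay, void *ctx,
                          uint32_t step_ms, uint32_t timeout_ms)
{
    assert(locked != 0 && delay != 0 && step_ms > 0);
    uint32_t waited = 0;
    for (;;) {
        if (locked(ctx)) {
            return (int32_t)waited;   /* locked: report how long it took */
        }
        if (waited >= timeout_ms) {
            return -1;                /* gave up: log this as a radio failure */
        }
        delay(ctx, step_ms);
        waited += step_ms;
    }
}

/* Simulated radio so the logic can be exercised on a PC:
 * the PLL "locks" once now_ms reaches ready_at_ms. */
typedef struct {
    uint32_t now_ms;
    uint32_t ready_at_ms;
} fake_radio;

static bool fake_locked(void *ctx)
{
    fake_radio *r = (fake_radio *)ctx;
    return r->now_ms >= r->ready_at_ms;
}

static void fake_delay(void *ctx, uint32_t ms)
{
    fake_radio *r = (fake_radio *)ctx;
    r->now_ms += ms;
}
```

With the simulated radio, a transceiver that needs 30 ms to come up reports a 30 ms wait when polled every 10 ms, while one that never comes up within the timeout returns -1.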
40% of assembled badges were having LED failures. We're using WS2812B LEDs (at least that's what we ordered), which require specific timing. In code we bit-bang the colors out to meet the timing requirements. We had no issues during prototyping or on the first 20 badges, until we ran out of LEDs and switched to our large batch from China. We should have known better when the reels came labeled as WS2815B. Most of the badges with these WS2815Bs failed. Usually the left eye (LED #1) would work, but LEDs 2-8 would flash or show full white randomly. Our fix was to desolder the left eye and replace it with a new LED. Most of the time this fixed it, as the new LED would get a better connection to the pad.
Ultimately the fix was in software. Our timing of the 1s and 0s sent to the LEDs was just outside spec. The higher-quality LEDs easily dealt with it, but the WS2815Bs did not. We had messed with the high/low timings in code before, but nothing helped. Eventually we tried lowering the latch time to 45 µs, and suddenly all the broken badges worked! That's 17 badges off the rework pile and into the complete pile.
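When bit-banging, every pulse is a whole number of CPU cycles, so it helps to sanity-check cycle counts against the datasheet on paper before touching a soldering iron. The sketch below is illustrative only: the nominal figures are the ones commonly quoted from the WS2812B datasheet (other parts and datasheet revisions differ), the 16 MHz clock in the usage note is an assumption and not necessarily the badge's clock, and `cycles_to_ns` and `within_spec` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Nominal pulse widths in nanoseconds, as commonly quoted for the WS2812B.
 * Assumption: check the datasheet for the exact part on your reel. */
enum {
    T0H_NS = 400,   /* "0" bit, high time */
    T0L_NS = 850,   /* "0" bit, low time  */
    T1H_NS = 800,   /* "1" bit, high time */
    T1L_NS = 450,   /* "1" bit, low time  */
    TOL_NS = 150    /* allowed deviation on each of the above */
};

/* Duration in nanoseconds of `cycles` CPU cycles at f_cpu_hz,
 * rounded to the nearest nanosecond. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz)
{
    assert(f_cpu_hz > 0);
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL + f_cpu_hz / 2) / f_cpu_hz);
}

/* True if a measured or computed pulse is within tolerance of its nominal value. */
bool within_spec(uint32_t actual_ns, uint32_t nominal_ns)
{
    uint32_t diff = actual_ns > nominal_ns ? actual_ns - nominal_ns
                                           : nominal_ns - actual_ns;
    return diff <= TOL_NS;
}
```

At an assumed 16 MHz, one cycle is 62.5 ns, so a 6-cycle high pulse (375 ns) passes as a "0" bit while a 10-cycle one (625 ns) does not, and a 45 µs latch works out to 720 cycles. LEDs that tolerate out-of-spec pulses can hide this kind of error until a pickier batch shows up.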