I'm mainly writing this here so I've got something to refer to.
I've got the base system running how I like now. It boots, starts the USB gadget serial and ethernet, and registers a link-local IP address. I've got Avahi running, so I can find the board by hostname. So far so good. Boot time is currently sitting between 10 and 12 seconds.
Next I wanted to free up the serial port (i.e. no console logging on the serial port). It works OK, as long as you never ever use the serial port!
- If the kernel console uses the serial port: OK!
- If no one ever uses the serial port: OK!
- If the kernel console doesn't use the serial port, but you later open it with minicom, picocom or whatever: System hang!
We're talking a hard hang: the LED doesn't flash any more, etc.
I thought the kernel was panicking - but without the serial port console it's impossible to tell. I added some print statements in the relevant parts of the kernel and it looks like it's making it through all of the tty/serial code just fine. It actually crashes *after* minicom exits - this doesn't really match up with a panic. If there were a panic, it should happen during the exit sequence.
After some lucky Googling, I found this thread: http://lists.infradead.org/pipermail/linux-arm-kernel/2016-February/406730.html, which indicates that there's a problem with the PLL refcounting, which means the PLL can get turned off when the PL011 (serial) driver isn't in constant use.
I commented out the clock reference count decrement in the PL011 driver, and bam! Everything works. So that's not a proper fix but at least I know what the problem is now.
My next step is probably to switch to a 4.5/4.6-rc1 kernel (I'm currently on 4.5-rc7) and see if any of the recent bcm283x clock changes I've seen flying through the mailing list fix the problem.
What better way to spend a Bank Holiday Monday?