Debugging the Linux Boot Process

A project log for Mackerel 68k Computer

A computer engineering journey through the Motorola 68k CPU family

Colin MaykishColin Maykish 07/22/2022 at 21:116 Comments

It has been many hours of compiling, booting, and debugging and I have not made much progress. In the last log, I had the very start of the Linux kernel code running on the 68008, but it was failing early and without much information. I needed more output.

After digging through the kernel code, it became clear that printk() was the tool for the job. This function is the kernel's version of the familiar printf(), but it handles logs a little differently. Instead of directly outputting messages, printk() acts a log buffer until the kernel boots far enough to create a serial console. Once the console is available, any logs stored up from before that will be printed. I spent some time writing a simple console driver and attempting to get the kernel to use it, but my issues started before the console is even loaded.

To get around this, I cheated. I replaced the contents of the printk() function with dumb printf()-style code, immediately writing the log messages to the serial port and skipping all the buffering and console business. Now the kernel can at least talk. This is what it says:

5Linux version 4.4.0-uc0 (mackerel@c14f97401618) (gcc version 4.9.2 (GCC) ) #1 Fri Jul 22 20:27:01 UTC 2022

6Flat model support (C) 1998,1999 Kenneth Albanowski, D. Jeff Dionne
7On node 0 totalpages: 512
7free_area_init_node: node 0, pgdat 000b50cc, node_mem_map 000cd000
7  DMA zone: 4 pages used for memmap
7  DMA zone: 0 pages reserved
7  DMA zone: 512 pages, LIFO batch:0
7pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
7pcpu-alloc: [0] 0 6Built 0 zonelists in  order, mobility grouping off.  Total pages: 0
5Kernel command line: console=debug,earlyprintk=debug
6PID hash table entries: 1024 (order: 0, 4096 bytes)
6Dentry cache hash table entries: 1024 (order: 0, 4096 bytes)
6Inode-cache hash table entries: 1024 (order: 0, 4096 bytes)
6Memory: 1144K/0K available (613K kernel code, 31K rwdata, 80K rodata, 36K init, 45K bss, 4294966152K reserved, 0K cma-reserved)
5Virtual kernel memory layout:
    vector  : 0x00000000 - 0x00000400   (   1 KiB)
    kmap    : 0x00000000 - 0xffffffff   (4095 MiB)
    vmalloc : 0x00000000 - 0xffffffff   (4095 MiB)
    lowmem  : 0x00000000 - 0x00200000   (   2 MiB)
      .init : 0x000b6000 - 0x000bf000   (  36 KiB)
      .text : 0x00000400 - 0x00099bc0   ( 614 KiB)
      .data : 0x00099bc0 - 0x000b5a60   ( 112 KiB)
      .bss  : 0x000bf000 - 0x000ca674   (  46 KiB) 

This is as far as it gets. Occasionally it will also crash and show a stack trace related to the mm_init() code. It seems pretty clear that there is an issue related to the memory. My first thought is that there is something wrong with the image file I'm generating or with the kernel configuration settings related to ROM/RAM sizes and locations. Seeing the "0K available" message is concerning. I have 2 full megabytes of RAM available from 0x000000 to 0x200000. My bootloader copies the Linux image into RAM starting at 0x400 (leaving the vector addresses alone) and then jumps there.

There is definitely more debugging work I can do in this area, but the other problem I'm having is harder to solve: hardware instability. Unsurprisingly, a bunch of vintage components on perfboards soldered together and connected on a backplane is not the pinnacle of reliability I would like. My RAM board in particular seems finicky. Rearranging the individual chips in the sockets can change the stability. The crashes I see (outside of Linux kernel problems) show up as illegal instruction exceptions or address errors, more evidence that something is off with the RAM hardware. One other possibility is that MFP or decoder CPLD is showing up on the CPU buses when it shouldn't be. This would also look a lot like a RAM problem.

In a last ditch effort to avoid blaming the hardware, I built and booted an older version of Linux (3.10.0) with and older version of gcc, but I got almost identical results. Turns out the Linux kernel startup changes very slowly between versions. The strategy at this point is two-fold: I need to deep dive into the Linux image structure and ROM/RAM configurations and I need to spend some time stress testing the hardware and improving reliability. A good first step might be to design a PCB for this RAM board.


Colin Maykish wrote 07/28/2022 at 15:02 point

It's interesting that your board is also showing 0K available. The Linux code I have on GitHub is based on uCLinux 2016 using the 4.4 kernel. I've had a bit more luck downgrading to uCLinux 2013 and using the 3.10 kernel.

On 3.10, I saw an accurate number for free memory instead of 0K, but I'm still getting stuck in the mm_init() code.

I think you're right about the config_BSP() being important, but I'm not sure my board is even getting that far. To check this, I stubbed out a version of config_BSP() that assigns the function pointers for the hardware and timer init functions that just print out a statement, but they aren't being called before my CPU crashes.

I've tried enabling a few more debug options in the kernel config, and it's throwing some errors with page states and spinlocks.

If you're interested, I can post this 3.10 kernel for you to try. Would be nice to compare our results.

I'm still leaning toward this being a hardware or timing problem on my setup.

  Are you sure? yes | no

Stephen Moody wrote 07/28/2022 at 15:23 point

I would be happy to try the 3.10 version as well if you want. It would be interesting to get it up and running just to see how it performs on the 68000 and to learn a little more about linux. 

I did get to a point with one of the kernel options where it was printing "percpu init late" before stopping so it looks like it could be a memory configuration issue possibly.

I'll give your version a try and if it's failing in the same way I guess it can rule out a hardware issue.

  Are you sure? yes | no

Colin Maykish wrote 07/28/2022 at 19:44 point

I pushed the 3.10 version to:

The config should be similar to the one in 4.4 and there are a bunch of printks all over the place.

I managed to track the code execution as far as slob_units() in the slob_alloc code. It's trying to dereference a pointer that has a value of 0xffffffff and crashing. I'm thinking this may be a symptom of some earlier memory init issues rather than a bug to fix directly. All of this happens in the percpu_init_late() function that you've already seen, so maybe both of our systems are already in the same place. This would certainly indicate a config issue.

There are also two other allocator choices in the Linux config: SLAB and SLUB, but I don't fully understand the differences yet. Might be something else worth exploring.

One note: Linux 3.10 doesn't officially support gcc newer than 4. I built and used gcc-4.9.2 and binutils-2.25. Here's the dockerfile I used to set up the environment:

  Are you sure? yes | no

Stephen Moody wrote 07/28/2022 at 13:36 point

I had a quick look at your code, partly wondering if i could port it to my computer. You mentioned in the last post that you have commented out the config_BSP in setup_no.c file. I wonder if you may need to implement this to get any further.

Looking at where it's implemented on the uCsimm target (m68EZ328.c), it looks like it's setting up the tick timer in the config_BSP section. Not a Linux expert, but doesn't it require the timer to run? Not sure at what stage in the process that's needed though.

Out of curiosity, I update the printk function and built your code and tried it on my computer to see what happens. It looks like i get the exact same thing happening. I'm using an 8MB RAM area for the code and it also reports 0K available.

page extmem init6Memory: 7224K/0K available (613K kernel code, 33K rwdata, 80K rodata, 36K init, 45K bss, 4294960072K reserved, 0K cma-reserved)
5Virtual kernel memory layout:
    vector  : 0x00000000 - 0x00000400   (   1 KiB)
    kmap    : 0x00000000 - 0xffffffff   (4095 MiB)
    vmalloc : 0x00000000 - 0xffffffff   (4095 MiB)
    lowmem  : 0x00100000 - 0x00900000   (   8 MiB)
      .init : 0x001b6000 - 0x001bf000   (  36 KiB)
      .text : 0x00100000 - 0x00199400   ( 613 KiB)
      .data : 0x00199400 - 0x001b5a80   ( 114 KiB)
      .bss  : 0x001bf000 - 0x001ca654   (  46 KiB)
kmem cache

  Are you sure? yes | no

Colin Maykish wrote 07/24/2022 at 19:18 point

Unfortunately, I'm already running at 2 MHz, the lowest in-spec speed for the 68008 according to the data sheet. I've written some code to do basic memory checks across the whole 2MB memory space (read, write, zero, etc.) and I'm finding no issues.

I'm starting to lean more toward this being a Linux configuration issue. I've gotten a bit further into the boot process by switching to the SLOB allocator instead of the SLAB in the kernel config, but I don't yet know enough about this to know if that's a real fix or just masking the problem differently.

  Are you sure? yes | no

Gravis wrote 07/24/2022 at 07:08 point

After looking at some of the pictures of the boards, my gut tells me this is a timing issue, despite the RAM chips being 83MHz.  You may try running the CPU at a much slower speed to see if that makes a difference.

  Are you sure? yes | no