• APP CPU Startup weirdness

    Daniel, 10/13/2020 at 21:52

    APP CPU startup corrupts the heap

    I have been doing some more testing with the unhacked SDK approach and detected a heap corruption issue. At first I thought I had made a mistake when initializing the stack (which I had as well...), but that was not what was causing the problems.

    My second suspect was the CPU caching. I didn't turn the cache on, but maybe it's on by default. But no... cache wasn't the issue either.

    I then tried to simplify the problem as much as possible and used a trivial infinite loop as startup address. This doesn't even access the registers. But still, as soon as the APP CPU is started, there is damage on the heap.

    It seems that some initialization of the APP CPU writes to that memory, causing the corruption. It almost looks as if some code runs before the APP CPU jumps to the entry vector specified in APPCPU_CTRL_D_REG. Or it's an initialization sequence performed by the hardware. It doesn't really matter in the end, since I don't have any influence over it anyway.

    I found a comment in the SDK code indicating that there is in fact some kind of ROM code running before the CPU starts.

    /* Initialize heap allocator. WARNING: This *needs* to happen *after* the app cpu has booted.
       If the heap allocator is initialized first, it will put free memory linked list items into
       memory also used by the ROM. Starting the app cpu will let its ROM initialize that memory,
       corrupting those linked lists. Initializing the allocator *after* the app cpu has booted
       works around this problem.
    ...

    I didn't expect that, since writing the startup address into a register seems to be as low-level as it gets, but apparently there is code that is executed before that address is loaded. You always learn something new...
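    One way to narrow down which addresses get overwritten is to fill the suspect region with a known pattern before starting the APP CPU and scan it afterwards. The following is only a minimal sketch of that idea in plain C. The names canary_fill and canary_scan are made up for illustration, and on the target the region would have to be a RAM range that nothing else is using at that moment.

    ```c
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Pattern written to the region under test (arbitrary choice). */
    #define CANARY_BYTE 0xA5u

    /* Fill a memory region with the known pattern before the event under test. */
    void canary_fill(uint8_t *region, size_t len)
    {
        memset(region, CANARY_BYTE, len);
    }

    /* Scan the region and return the number of bytes that no longer hold the
       pattern. *first and *last receive the offsets of the first and last
       modified byte; they are only written if at least one byte changed. */
    size_t canary_scan(const uint8_t *region, size_t len,
                       size_t *first, size_t *last)
    {
        size_t changed = 0;

        for (size_t i = 0; i < len; i++) {
            if (region[i] != CANARY_BYTE) {
                if (changed == 0) {
                    *first = i;
                }
                *last = i;
                changed++;
            }
        }
        return changed;
    }
    ```

    The idea is to call canary_fill(), start the APP CPU, and then call canary_scan() on the same region. Note that a write which happens to store the pattern value itself would go unnoticed, so running it twice with different patterns is the safer test.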

    There is no way around SDK hacking

    Since we now know that we have to initialize the APP CPU before the heap is initialized, and thus before any user code runs, we have to modify the SDK to get the APP CPU running without damaging the heap.

    I decided to add a single function call in cpu_start() immediately before heap_caps_init() is called. That happens in cpu_start.c.

    I also put the code on GitHub: https://github.com/Winkelkatze/ESP32-Bare-Metal-AppCPU

  • Second thoughts about the SDK hacking approach

    Daniel, 10/13/2020 at 11:06

    Why didn't it work?

    I've been wondering why my SDK hacking attempts worked so poorly. But after thinking about it a bit, it should have been obvious from the start:

    Synchronization functions heavily depend on the scheduler!

    I wanted my bare-metal main to behave like an ordinary FreeRTOS task, yet a lot of things an ordinary task is expected to do (like waiting) are features of the scheduler. But we explicitly don't want the scheduler to be running, so we can't expect these things to work. From the FreeRTOS point of view, it's a bit like an interrupt handler trying to wait for a mutex (which is forbidden as far as I know).

    Possible solutions

    Sadly, there isn't much that can be done about that without digging deep into FreeRTOS and either teaching the synchronization functions to busy-wait while running on the second core or using a modified scheduler for the second core that only handles synchronization stuff but leaves the task alone otherwise.
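    To illustrate what "busy-wait" means here: a lock that never asks the scheduler for anything can be built from a single atomic flag. This is a generic C11 sketch and not how FreeRTOS or the SDK implement their primitives. The spin_mutex_* names are made up for this example.

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical busy-wait lock; it needs no scheduler to work. */
    typedef struct {
        atomic_flag locked;
    } spin_mutex_t;

    #define SPIN_MUTEX_INIT { ATOMIC_FLAG_INIT }

    /* Spin until the flag could be set by us, i.e. it was clear before. */
    void spin_mutex_lock(spin_mutex_t *m)
    {
        while (atomic_flag_test_and_set_explicit(&m->locked,
                                                 memory_order_acquire)) {
            /* busy-wait: the core does nothing else while waiting */
        }
    }

    /* Single attempt; returns true if the lock was taken. */
    bool spin_mutex_try_lock(spin_mutex_t *m)
    {
        return !atomic_flag_test_and_set_explicit(&m->locked,
                                                  memory_order_acquire);
    }

    /* Release the lock so the other core can take it. */
    void spin_mutex_unlock(spin_mutex_t *m)
    {
        atomic_flag_clear_explicit(&m->locked, memory_order_release);
    }
    ```

    The catch is on the other side: a FreeRTOS task holding such a lock must not block or get preempted for long, otherwise the bare-metal core spins for just as long. That is part of why simply swapping the primitives is not enough.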

    Thou shalt not cache!

    I thought the SDK hacking approach was at least good for getting the cache working and thus being able to execute code from flash. Turns out, it's not that simple. The cache loading functions also rely on FreeRTOS synchronization to prevent both cores from accessing the SPI flash at the same time. It also prevents the other core from accessing the cache while it's being updated.

    Since the synchronization is not working, as stated above, cache loads are not possible for now either.

  • Hacking the SDK

    Daniel, 10/10/2020 at 16:03

    It's finally hacking time ;)

    We try to patch the SDK, so we get control over the APP core, while still letting the SDK handle the initialization. Ideally, our main will behave similar to a FreeRTOS task while running bare metal.

    Approach

    First, we need to figure out what the CONFIG_FREERTOS_UNICORE definition does. After all, we want to replicate that behavior to some extent. So we search through the SDK code and look for code that is dependent on this definition. We can broadly sort the parts into the following categories:

    • CPU starting / init code
    • Memory regions
    • Interrupt allocation / Synchronization / Crosscore
    • FreeRTOS

    Since we want to keep things simple, we try to not touch the FreeRTOS part and focus on the other three.

    Now, we want everything else to think things are running in singlecore mode to avoid modules trying to create tasks on the APP core and to avoid unnecessary memory allocation by FreeRTOS. So we add our new define, which will 'counteract' the UNICORE define in certain places. I named this:

    CONFIG_FREERTOS_BAREMETAL_APP_CPU

    Therefore, our sdkconfig.h contains both

    #define CONFIG_FREERTOS_UNICORE 1
    #define CONFIG_FREERTOS_BAREMETAL_APP_CPU 1 

    Now, in order for our define to do something, we add some code to selected files, which will undefine UNICORE if our define is set.

    #ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
    #undef CONFIG_FREERTOS_UNICORE
    #endif
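
    To see the effect of that override in isolation, here is a small stand-alone sketch. The two defines at the top stand in for sdkconfig.h, and file_sees_unicore() is a made-up helper that just reports which branch the preprocessor selected in this file.

    ```c
    /* Stand-in for sdkconfig.h: both switches are set globally. */
    #define CONFIG_FREERTOS_UNICORE 1
    #define CONFIG_FREERTOS_BAREMETAL_APP_CPU 1

    /* Per-file override, placed after the config include in selected files. */
    #ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
    #undef CONFIG_FREERTOS_UNICORE
    #endif

    /* Hypothetical helper: returns 1 if this file still sees the UNICORE
       define, 0 if it now takes the dual-core branch. An undefined macro
       evaluates to 0 inside #if. */
    int file_sees_unicore(void)
    {
    #if CONFIG_FREERTOS_UNICORE
        return 1;
    #else
        return 0;
    #endif
    }
    ```

    The override only affects code that comes after it in the same translation unit. Any header included before the #undef still sees the single core configuration.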
    

    What to change?

    As I wrote earlier, we apply the patch to any file related to starting the CPU, the memory map, interrupt allocation or synchronization code. The definition is only used in a handful of places, so we can easily check every file and decide whether that part is relevant for us.

    Additionally, we need to modify the start code in cpu_start.c to start our main instead of the scheduler.

    Doesn't work!

    When I tried it as described above, the system immediately crashed during the dport access init while starting the second CPU. That was when I realized that a lot of the 'basic' SDK code, like interrupt allocation, actually depends on FreeRTOS functions (like mutexes) and on the portNUM_PROCESSORS definition instead of the UNICORE define. And sadly, portNUM_PROCESSORS is referenced quite often. So instead of continuing down that road, I decided to reduce the amount of initialization, so we won't have that problem.

    Minimal Hack

    Since I determined that getting the Interrupt allocation / Synchronization / Crosscore stuff working is a lot of work, I decided to ignore it for now.

    Without that, there are only the following files where I changed something:

    • cpu_start.c
      CPU starting, initialization
    • panic.c
      Fatal error message / Core dump (not required, but nice to have)
    • spiram.c
      Cache flushing (only for external RAM)
    • soc_memory_layout.c / .h
      Memory map / region definitions

    The most important change compared to our manual approach (the one without SDK hacking) is the change in the memory map definition. This way we get cache for our CPU and can execute from external flash. In cpu_start.c we add our code to undefine the UNICORE definition and modify the start_cpu1_default() function. app_cpu_bare_metal_main() is our own main function that is defined in the user code.

    #if !CONFIG_FREERTOS_UNICORE
    void start_cpu1_default(void)
    {
    #ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
        esp_cache_err_int_init();
        ESP_EARLY_LOGI(TAG, "Starting bare metal main on APP CPU.");
        app_cpu_bare_metal_main();
    #else
        // Wait for FreeRTOS initialization to finish on PRO CPU
        while (port_xSchedulerRunning[0] == 0) {
            ;
        }
    ...

    Limitations

    With this approach, we can execute from flash, so we are allowed to call SDK functions. However, we still need to be careful that the functions are not using mutexes / interrupts. DPort access is also limited, since we bypass the mutual access mitigation of the SDK. So whatever function we call, we first have to check whether it does anything 'forbidden'. Also, 'printf' doesn't work, so we have to fall back on the much more basic 'ets_printf'...


  • Running a task at max priority

    Daniel, 09/08/2020 at 11:34

    Method

    The idea is simple: just build the SDK in the normal dual-core configuration, but pin all tasks to the first core. We can then run a task on the second core, which should (in theory) run uninterrupted by anything else.

    But: We can't be sure nothing is running on the second core!

    The ESP32 ecosystem and firmware are complex. There are so many libraries that we can't be sure there is really nothing else running on that core.

    Testing

    The easiest way to test this would be to toggle an IO in an infinite loop and then look at the output with a scope or a logic analyzer. If it is in fact running uninterrupted, the signal should be a clean square wave. If something is interrupting it, it should get 'stuck' from time to time. Since I have neither at hand, I decided to measure the jitter in the execution time of a simple wait loop. With this method, I can't be sure it runs without interruptions, but I can at least detect some interruptions.

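    #include <stdint.h>
    #include <stdio.h>
    #include "esp_timer.h"
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"

    #define NUM_ROUNDS 1000

    void app_task(void *param)
    {
    	(void)param;
    	uint64_t min = UINT64_MAX;
    	uint64_t max = 0;
    	uint64_t sum = 0;
    	uint64_t last = 0;

    	// Note: this array takes 8 KB, so the task needs a large enough stack.
    	uint64_t times[NUM_ROUNDS];

    	printf("app_task started on core %i\n", xPortGetCoreID());

    	// esp_timer_get_time() returns the time since boot in microseconds.
    	last = esp_timer_get_time();

    	for(volatile uint32_t i = 0; i < NUM_ROUNDS; i++)
    	{
    		// Fixed amount of work; volatile keeps the compiler from removing it.
    		for(volatile uint32_t delay = 0; delay < 1000; delay++);

    		uint64_t now = esp_timer_get_time();
    		uint64_t diff = now - last;
    		last = now;

    		if (diff < min)
    		{
    			min = diff;
    		}

    		if (diff > max)
    		{
    			max = diff;
    		}

    		sum += diff;
    		times[i] = diff;
    	}

    	printf("app_task finished\nmin=%llu\nmax=%llu\navg=%llu\n", min, max, sum / NUM_ROUNDS);
    	for(volatile uint32_t i = 0; i < NUM_ROUNDS; i++)
    	{
    		printf("%u=%llu\n", (unsigned)i, times[i]);
    	}

    	// A FreeRTOS task function must never return; delete the task instead.
    	vTaskDelete(NULL);
    }
    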

    Results

    If the code ran (more or less) uninterrupted, the values in times should be almost the same (+/- 1). Only the first may be a bit off, due to the beginning of the loop. But looking at the output we get something like:

    87=120                                                                          
    88=120                                                                          
    89=120                                                                          
    90=121                                                                          
    91=120                                                                          
    92=130                                                                          
    93=120                                                                          
    94=120                                                                          
    95=120                                                                          
    96=120                                                                          
    97=121

    So it seems to take ~120 µs to execute one run (esp_timer_get_time() returns microseconds, not nanoseconds). However, run 92 took 130 µs. So unless esp_timer_get_time() itself takes longer from time to time, we got an interruption here! Sadly, I have no way to verify this. But since just a few runs take longer (always by about 10 µs), I guess it's some interrupt handling.
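
    Going through a thousand lines of output by eye is tedious, so the +/- 1 criterion from above can also be checked in code. This is a small sketch, not part of the original test. samples_min and count_outliers are made-up names, and the fastest run is taken as the reference for an uninterrupted run.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Smallest sample; used as the estimate for an uninterrupted run.
       Returns 0 for an empty array. */
    uint64_t samples_min(const uint64_t *samples, size_t count)
    {
        if (count == 0) {
            return 0;
        }

        uint64_t min = samples[0];
        for (size_t i = 1; i < count; i++) {
            if (samples[i] < min) {
                min = samples[i];
            }
        }
        return min;
    }

    /* Count the samples that exceed the fastest run by more than 'tolerance'
       timer ticks. Each such sample is a run that was probably interrupted. */
    size_t count_outliers(const uint64_t *samples, size_t count,
                          uint64_t tolerance)
    {
        uint64_t base = samples_min(samples, count);
        size_t outliers = 0;

        for (size_t i = 0; i < count; i++) {
            if (samples[i] - base > tolerance) {
                outliers++;
            }
        }
        return outliers;
    }
    ```

    In the task this could be called as count_outliers(times + 1, NUM_ROUNDS - 1, 1), skipping the first sample, which may be off due to the beginning of the loop.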

    Conclusion

    FAIL

    The simplest, most straightforward approach of just running a task at max priority does not seem to work. I don't know what causes the problems, but it seems something (maybe interrupt handling) is ruining my timing here. Also, we can't be sure the core is actually free, so I can't recommend this method.

  • Build the SDK in single core mode and launch the second core by hand

    Daniel, 08/28/2020 at 15:53

    Method

    At first, we try to not hack the SDK and to avoid any side effects due to the SDK or any library trying to launch a task on the second core. So we want the running system to be completely unaware of the second core running.

    So, to start with, we build everything with

    CONFIG_FREERTOS_UNICORE 1
    

    Since I'm using MicroPython as a platform, I need to make some further changes to assign everything to the PRO core. If everything is working, you should get a boot message telling you the ESP is running in single core mode.

    I (534) cpu_start: Pro cpu up.                                                  
    I (534) cpu_start: Application information:                                     
    I (534) cpu_start: Compile time:     Aug 25 2020 15:32:36                       
    I (538) cpu_start: ELF file SHA256:  0000000000000000...                        
    I (543) cpu_start: ESP-IDF:          v4.0.1                                     
    I (548) cpu_start: Single core mode                                             
    I (553) heap_init: Initializing. RAM available for dynamic allocation:          
    I (560) heap_init: At 3FFAFF10 len 000000F0 (0 KiB): DRAM                       
    I (566) heap_init: At 3FFB6388 len 00001C78 (7 KiB): DRAM                       
    I (572) heap_init: At 3FFB9A20 len 00004108 (16 KiB): DRAM                      
    I (578) heap_init: At 3FFBDB5C len 00000004 (0 KiB): DRAM                       
    I (584) heap_init: At 3FFCA270 len 00015D90 (87 KiB): DRAM                      
    I (590) heap_init: At 3FFE0440 len 0001FBC0 (126 KiB): D/IRAM                   
    I (597) heap_init: At 40078000 len 00008000 (32 KiB): IRAM                      
    I (603) heap_init: At 4009DFE4 len 0000201C (8 KiB): IRAM                       
    I (609) cpu_start: Pro cpu start user code                                      

    Looking into the datasheet, we see that the configuration for the second core is straightforward, using the DPORT_APPCPU_CTRL_* registers.

    To launch the CPU, we first ensure it's not already running by checking the CLKGATE register. If it is disabled, we reset the CPU, load the entry vector and start the CPU by enabling the CLKGATE. We also have to allocate a stack for the second core.

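    // An enabled clock gate means the APP CPU is already running.
    if (DPORT_REG_GET_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN))
    {
        printf("APP CPU is already running!\n");
        return;
    }
    
    // Allocate the stack once and reuse it across restarts.
    // Note: the Xtensa stack grows downward. If this pointer is loaded into a1
    // as-is, the stack grows into the memory below this block.
    if (!app_cpu_stack_ptr)
    {
        app_cpu_stack_ptr = heap_caps_malloc(1024, MALLOC_CAP_DMA);
        if (!app_cpu_stack_ptr)
        {
            printf("Failed to allocate the APP CPU stack!\n");
            return;
        }
    }
    
    // Pulse the reset bit to put the APP CPU into a defined state.
    DPORT_REG_SET_BIT(DPORT_APPCPU_CTRL_A_REG, DPORT_APPCPU_RESETTING);
    DPORT_REG_CLR_BIT(DPORT_APPCPU_CTRL_A_REG, DPORT_APPCPU_RESETTING);
    
    // Set the entry vector, then enable the clock gate to start the core.
    printf("Start APP CPU at %08X\n", (uint32_t)&app_cpu_init);
    ets_set_appcpu_boot_addr((uint32_t)&app_cpu_init);
    DPORT_REG_SET_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN);
    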

    According to the Tensilica spec, the first thing to do after start is to reset the 'Window' registers. This is a special feature of these CPUs: the general purpose registers are banked, and the banks can be switched with the 'Window' registers. We also need to initialize the stack pointer, which is in register A1 by convention. We then call our main for the APP CPU. After the main finishes, we let the second core turn off its own clock to halt it.

    static void IRAM_ATTR app_cpu_init()
    {
        // Reset the reg window. This will shift the A* registers around,
        // so we must do this in a separate ASM block.
        // Otherwise the addresses for the stack pointer and main function will be invalid.
        asm volatile (                                \
            "movi a0, 0\n"                            \
            "wsr  a0, WindowStart\n"                \
            "movi a0, 0\n"                            \
            "wsr  a0, WindowBase\n"                    \
            );
        // init the stack pointer and jump to main function
        asm volatile (                    \
            "l32i a1, %0, 0\n"            \
            "callx4   %1\n"                \
            ::"r"(&app_cpu_stack_ptr),"r"(app_cpu_main));
        DPORT_REG_CLR_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN);
    }
    

    And that's it! We can now have an app_cpu_main() function that runs completely independently of the rest of the system. I verified this works by incrementing a counter in the main function. Every time I start the APP core, I can verify if this counter has been incremented.
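
    The comment in app_cpu_init() about the A* registers "shifting around" is easier to follow with a small model of the register window. The sketch below only models the index arithmetic and assumes 64 physical registers with a window that moves in steps of four. phys_reg and window_base_after_call are made-up names for this illustration.

    ```c
    /* Assumption: 64 physical address registers, window moves in units of 4. */
    #define NUM_PHYS_AREGS 64u

    /* Physical register index behind the logical register a<n> (n = 0..15)
       for a given WindowBase value. */
    unsigned phys_reg(unsigned window_base, unsigned n)
    {
        return (window_base * 4u + n) % NUM_PHYS_AREGS;
    }

    /* WindowBase as seen by the callee after a windowed call.
       call_inc is 1 for call4/callx4, 2 for call8 and 3 for call12. */
    unsigned window_base_after_call(unsigned window_base, unsigned call_inc)
    {
        return (window_base + call_inc) % (NUM_PHYS_AREGS / 4u);
    }
    ```

    With WindowBase at 0, a callx4 moves the window by four registers, so the callee's a0 is the same physical register as the caller's a4. The same arithmetic explains why an operand the compiler placed in some a<n> no longer refers to the same value once WindowBase has been rewritten by hand.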

    Limitations

    No cache

    The APP CPU cache for external flash access is fixed at address 0x40078000, which is part of the allocatable memory, as you can see in the boot log. So we must not enable the CPU's cache. Therefore any code running on that core must be run from IRAM. That's not very nice, since you have to be very careful when calling any functions. If you only plan to...
