ESP32 RTOS + Bare Metal: Best of Both Worlds?

A project log for BlueRetro

Multiplayer Bluetooth controllers adapter for retro video game consoles

Jacques Gagnon, 02/27/2021 at 13:51

Ever since I finished working on the latency tests & improvements, I've been trying to free up the 2nd core from its FreeRTOS duty by running it bare metal, as originally demonstrated by @Daniel with #Bare metal second core on ESP32. I highly recommend reading that project's logs for more detail. In this log I will focus on describing how to refactor a complex application to use this hack.


My original goal was to free myself from a workaround I have been using since the beginning of the project for the bit-banged interfaces (Dreamcast, NES/SNES and Genesis). The issue with bit-banging on an ESP32 running meaningful code on both cores (this is an important nuance!), like the Bluetooth controller task on Core0 and the wired interface on Core1, is that the wired task will get interrupted either by the FreeRTOS tick interrupt on its own core or by some event that requires core cooperation, like flash or DPORT access, or something else on the first core. To be honest, I haven't made an in-depth analysis of the source of the interruptions, but they are caused either by interrupts on Core1 or by some event on Core0.

Original Issue

I can't easily use any of the ESP32 peripherals for NES/SNES and Genesis since the games themselves are bit-banging the protocol. The way controllers are polled varies greatly from game to game, and that would probably make using the I2S peripheral hard. Also, the high number of output lines for Genesis overall and for NES/SNES multitaps makes it impossible to use a single SPI slave, and we only get 2 on the ESP32. I can see how I could implement Dreamcast's maple protocol using 2 SPI peripherals, but again, to support 4 players there is not enough SPI hardware available.

Some examples of what happens when we get interrupted while bit-banging:

Two edges of the Genesis select signal are missed and the output is not updated.
Two edges of the SNES clock are missed, which results in the button outputs being shifted to the wrong cycles.
Multiple edges of the Dreamcast input are lost, resulting in a corrupted received packet.
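
To make the SNES case concrete, here is a small host-side model (not BlueRetro code) of a console reading 16 bits from an adapter that advances its output by one bit per clock edge. The function name `console_read` and the `missed_mask` parameter are made up for this sketch, and buttons are treated as active high for readability, whereas the real SNES data line is active low.

```c
#include <stdint.h>

/* Model of a 16-bit serial controller read.
 * The adapter drives `buttons` MSB first and moves to the next bit on
 * every clock edge it sees. If bit n of `missed_mask` is set, the adapter
 * was interrupted during clock edge n (0 = first edge) and did not
 * advance, so the console samples the same bit twice. */
uint16_t console_read(uint16_t buttons, uint16_t missed_mask)
{
    uint16_t seen = 0;
    int pos = 15; /* index of the bit currently driven on the data line */

    for (int edge = 0; edge < 16; edge++) {
        /* The console samples the data line... */
        seen = (uint16_t)((seen << 1) | ((buttons >> pos) & 1u));
        /* ...then clocks. The adapter only advances if it saw the edge. */
        if (!(missed_mask & (1u << edge))) {
            pos--;
        }
    }
    return seen;
}
```

With no missed edge the console sees exactly what was sent. With the first edge missed, a press on the second button (0x4000) shows up in the third button's slot (0x2000): every later button lands one cycle late.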

If you search a bit online on this subject, people will often recommend having a task on Core1 loop without ever yielding, while disabling interrupts on that core and disabling the Core1 idle task watchdog. This only works if Core0 is not doing anything significant. If you are running the Bluetooth controller task, you will quickly get watchdog timeouts on Core0!

So my original fix was to use the DPORT access locking functions (esp_dport_access_stall_other_cpu_start / end), which do two things. First, they disable interrupts on the current core by entering a critical section. Second, they generate an interrupt on Core0 that essentially makes Core0 loop doing nothing in a high-level interrupt (level 4) until we release the DPORT access lock. The functions exist to work around a silicon bug in pre-V3 ESP32 chips, but here I use them only for their locking property; we don't really care about DPORT access. The problem with the DPORT locking is that it sometimes takes too long to get the lock: the critical section depends on a mutex, and the locking function waits for Core0 to confirm it is stalled.

While this workaround works pretty well for the maple bus, it's not 100% perfect for the Genesis driver. I got fewer glitches, but I still got some, which is unacceptable when playing a game.

My hope was that removing FreeRTOS from the 2nd core would remove the need to stall the first core. It didn't exactly turn out that way...

Updating Example to Latest ESP-IDF

The original code from GitHub was based on ESP-IDF v4.0.2. A lot of rework has happened since then in the master branch to add support for the S2, S3 and C3 chips. Besides things having moved around, one of the problems that prevented it from working was that some initialization code for region protection that used to be inlined was now a function located in flash. Using the older inline version fixed that issue, since anything running on the bare metal Core1 can't access the flash as it would corrupt the cache.

Feasibility evaluation

Before going full steam ahead and putting a lot of work porting BlueRetro to work with one core bare metal, I did a few tests on the side to make sure it was worth it.

Did it remove the need for stalling Core0?

My first test was to confirm I didn't need to stall Core0 anymore. I used a simple SNES interface task loop with added debug output on one of the data lines: each time a clock edge is received, the code toggles the D1 line, so it's easy to spot when we miss a bit. I ran the test, couldn't detect any glitches, and concluded this proved Core0 stalling was not required anymore. So I continued with my next test.

However, I realized much later that I completely dropped the ball on this very important test. Running bare metal didn't remove the need to stall the first CPU; my test simply didn't replicate the conditions under which the first core impacts the second one!

The first problem is that I chose the worst system to test this on: the SNES interface is very resilient to skipped bits, and they are hard to spot with the logic analyzer or during gameplay. Dreamcast, on the other hand, fails right away at the peripheral identification phase since the packet is so long.

Then, using a simple project with two FreeRTOS tasks, I reproduced the wrong issue. I had the first core doing pretty much nothing and ran my interface in the 2nd task. I saw I was easily missing bits, but those were due to the FreeRTOS tick interrupt on the 2nd core, since I forgot to disable interrupts for my test. Then, thinking I was reproducing the issue, I tried the bare metal hack and saw no missed bits. This was only because, with FreeRTOS not running on that core, there was no interrupt registered on it!

Reproducing the Core0 interference requires meaningful work to be executed on that core, like the Bluetooth controller task. If I had only launched it, I would have spotted right away that the hack didn't help. The two months' worth of work that followed are the result of this invalid test early on. Spoiler: in the end it's good this test was flawed, because otherwise I wouldn't have continued working on this!

Can we still run interrupts on Core1?

While some interfaces require bit-banging, others like N64 (RMT), GameCube (RMT), PlayStation (SPI) and JVS (UART) use the ESP32 peripherals. I still want to run the interrupt handlers for those interfaces on Core1. BlueRetro being a universal adapter with auto-detection at run time, it's not possible to compile two versions. So even if running bare metal is mostly of no use for those interfaces, it still has to work.

The main issue here is the way the interrupt handler works: it stores a table of ISR function pointers for each core. The size of this table is determined at compile time, so with the CONFIG_FREERTOS_UNICORE config there is no provision to register anything for Core1.

I initially worked around this by taking over the level 2 interrupt handler as this level is not used by anything in the ESP-IDF. It worked well but I didn't like doing so because it requires hacking into the SDK.

Looking more into the interrupt handler, I noticed a hook that can be enabled to pre-handle interrupts for testing, and that also allows clearing them to prevent the regular handler from running at all! This is documented here in more detail. So by simply adding the -DXT_INTEXC_HOOKS flag I had a reliable way to register interrupts for Core1 without any hacking of the SDK!
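
The idea of a pre-handling hook that clears interrupts before the regular dispatcher sees them can be modelled on a PC. The sketch below is only a simplified stand-in, not the Xtensa port's code: `core1_isr`, `sdk_isr`, `level_hook`, `dispatch` and the demo handlers are all hypothetical names, and the assumption is that the hook receives a mask of pending interrupts and returns the mask that is still left to process (on the real port the hook slots live, to my understanding, in the `_xt_intexc_hooks` array).

```c
#include <stdint.h>

#define NUM_INTR 32

typedef void (*isr_fn_t)(void);
typedef uint32_t (*level_hook_t)(uint32_t pending);

isr_fn_t core1_isr[NUM_INTR]; /* hypothetical: our own Core1 handler table */
isr_fn_t sdk_isr[NUM_INTR];   /* stand-in for the SDK's compile-time table */
level_hook_t level_hook;      /* stand-in for the hook slot of one level */

int core1_hits, sdk_hits;     /* demo counters */
void demo_core1_isr(void) { core1_hits++; }
void demo_sdk_isr(void)   { sdk_hits++; }

/* Hook: run our own handlers and clear their bits so that the regular
 * dispatcher never sees those interrupts. */
uint32_t core1_hook(uint32_t pending)
{
    for (int n = 0; n < NUM_INTR; n++) {
        uint32_t bit = UINT32_C(1) << n;
        if ((pending & bit) && core1_isr[n]) {
            core1_isr[n]();
            pending &= ~bit;
        }
    }
    return pending;
}

/* Model of the regular dispatcher: give the hook first pick, then handle
 * whatever is still pending through the SDK table. */
void dispatch(uint32_t pending)
{
    if (level_hook) {
        pending = level_hook(pending);
    }
    for (int n = 0; n < NUM_INTR; n++) {
        if ((pending & (UINT32_C(1) << n)) && sdk_isr[n]) {
            sdk_isr[n]();
        }
    }
}
```

In this model, an interrupt registered in the private table is consumed by the hook, while anything else falls through to the regular table untouched.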

Can we avoid making changes to the ESP-IDF?

Having the interrupt hook without any SDK change made me want to do the same for the Core1 initialization hook. In the latest SDK the init function start_cpu0_default is a weak symbol so it's actually possible to replace it with our own version! So we can do the bare metal hack with a vanilla SDK!

How annoying is it to run everything from IRAM & ROM?

The big overall rule when running Core1 bare metal is that you can't make any access to the flash anymore. So all functions and data need to be in IRAM and DRAM respectively. But it's still possible to run functions from the ROM! Functions like ets_printf can be used, and a lot of the libc functions, like memcpy and memset, are stored there too. So running everything from IRAM/ROM really didn't sound like a big deal.

Porting BlueRetro Core1 functions to Bare Metal

So, thinking all my requirements were met, I started porting all of the project's wired drivers.

Replacing IDF GPIO & interrupt HAL

I didn't use many of the IDF drivers as they didn't really match my use case. However, I used some of the simplest helper functions for trivial things like configuring GPIOs and interrupt handlers. Those functions are located in flash, so I can't really use them anymore. I simply made alternate versions of those functions and placed them in IRAM (see gpio.c & intr.c). Most of the underlying functions used by the helpers were low-level functions that are inlined, so I could use those directly.

Replacing IDF Ring Buffer

For rumble feedback and keyboard scan code communication between the 2 cores, I was using the ring buffer provided by the SDK. However, this module relies heavily on FreeRTOS, so I had to replace it. I used liblfds to build a lock-free queue based around two bounded, single-producer, single-consumer queues. One tracks the free items while the other tracks the used ones. I isolated the new queue within its own IDF component to avoid having the lib's defines within the main project. The lib didn't support Xtensa CPUs, but I found that the MIPS defined values matched those needed by the ESP32.
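
The two-queue layout can be sketched as follows. This is not the liblfds-based code: it swaps in plain C11 atomics so it compiles anywhere, and every name in it (`spsc_t`, `msg_queue_t`, `msg_send`, `msg_recv`, the sizes) is made up for the example. One ring carries indices of free slots from the consumer back to the producer; the other carries indices of filled slots from the producer to the consumer, so each ring has exactly one writer and one reader.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define POOL_ITEMS 4   /* hypothetical pool size, must be a power of two */
#define ITEM_BYTES 16  /* hypothetical maximum message size */
_Static_assert((POOL_ITEMS & (POOL_ITEMS - 1)) == 0, "power of two");

typedef struct {
    _Atomic uint32_t head;       /* only written by the consumer side */
    _Atomic uint32_t tail;       /* only written by the producer side */
    uint8_t slot[POOL_ITEMS];    /* item indices in flight */
} spsc_t;

static bool spsc_push(spsc_t *q, uint8_t v)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == POOL_ITEMS) return false;          /* full */
    q->slot[tail % POOL_ITEMS] = v;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

static bool spsc_pop(spsc_t *q, uint8_t *v)
{
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return false;                       /* empty */
    *v = q->slot[head % POOL_ITEMS];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

typedef struct {
    spsc_t free_q;                        /* indices of unused items */
    spsc_t used_q;                        /* indices of filled items */
    uint8_t item[POOL_ITEMS][ITEM_BYTES];
    uint8_t len[POOL_ITEMS];
} msg_queue_t;

void msg_queue_init(msg_queue_t *q)
{
    atomic_init(&q->free_q.head, 0); atomic_init(&q->free_q.tail, 0);
    atomic_init(&q->used_q.head, 0); atomic_init(&q->used_q.tail, 0);
    for (uint8_t i = 0; i < POOL_ITEMS; i++) spsc_push(&q->free_q, i);
}

/* Producer core: take a free item, fill it, hand it over. */
bool msg_send(msg_queue_t *q, const void *data, uint8_t len)
{
    uint8_t idx;
    if (len > ITEM_BYTES || !spsc_pop(&q->free_q, &idx)) return false;
    memcpy(q->item[idx], data, len);
    q->len[idx] = len;
    return spsc_push(&q->used_q, idx);
}

/* Consumer core: copy a filled item out (out must hold ITEM_BYTES) and
 * recycle it. Returns the message length, or -1 if nothing is queued. */
int msg_recv(msg_queue_t *q, void *out)
{
    uint8_t idx;
    if (!spsc_pop(&q->used_q, &idx)) return -1;
    int len = q->len[idx];
    memcpy(out, q->item[idx], (size_t)len);
    spsc_push(&q->free_q, idx);
    return len;
}
```

Because neither side ever takes a lock or calls into an RTOS, the same functions can be called from a FreeRTOS task on one core and from bare metal code on the other. When the pool is exhausted `msg_send` simply fails, which is usually acceptable for things like rumble updates.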

Do not trust IRAM_ATTR!

I had some crashes related to flash cache corruption in my PSX/PS2 driver. However, every function had the IRAM_ATTR attribute and all data (including strings and const data) had the DRAM_ATTR attribute set. After investigating for a few days, I noticed the problem went away if I reduced the number of cases in my switch statement. None of the individual cases was the source of the issue, but once a certain number of them were enabled the issue showed up! It looks like, past some threshold, some of the constants were somehow getting placed in flash regardless of the IRAM_ATTR flag! I resolved this issue by adding a linker fragment file to the project and listing in it all my Core1 source files with the "noflash" attribute. This has the huge advantage of allowing string literals to be used as usual rather than through variables with DRAM_ATTR.
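
For reference, a linker fragment of this kind could look roughly like the following. This is an illustrative sketch rather than the project's actual file: the mapping name and the object names (`maple`, `psx`) are placeholders, and it assumes the sources live in the `main` component.

```text
[mapping:main]
archive: libmain.a
entries:
    maple (noflash)
    psx (noflash)
```

The fragment file then has to be listed in the component's CMakeLists.txt through the LDFRAGMENTS argument of idf_component_register so that the build system picks it up.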

Turns out we still need to stall Core0!

At that point in the porting process I finally gave the Dreamcast's maple bus interface a try and it didn't work at all! After debugging a bit with the logic analyzer, I immediately saw I was still missing bits! I then realized that my previous testing was invalid by looking back at my test branch.

I walked away from the project for a week, and then decided to try adding the Core0 stall back in the maple driver. I couldn't use the DPORT access functions anymore as those rely on FreeRTOS. Based on the DPORT code, I made a simplified version of the stalling mechanism. With this version the locking is only possible one way: Core1 stalls Core0. I used high interrupt level 5, as some of the panic code shares the DPORT level 4 interrupt. Also, the stall start function does not block anymore, as no mutex is required and I moved the validation that Core0 is stalled to the stall end function. This made the maple driver work again. I also gave the Genesis driver a try, and to my surprise Mortal Kombat 1 with 6 buttons was now 100% glitch free. Something I was not able to achieve before! Turns out I didn't waste my time!
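
The shape of such a one-way handshake can be modelled with two flags. The sketch below is a host-side approximation with hypothetical names (`stall_request`, `stall_ack`, `core0_stall_start`, `core0_stall_isr_poll`, `core0_stall_end`), not the actual BlueRetro implementation. On hardware the start function would also raise the cross-core interrupt, the ISR would spin on its poll, and the end function would presumably wait for the acknowledgement; here the end function only reports it so the model can run single-threaded.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool stall_request; /* set by Core1, watched by Core0's ISR */
atomic_bool stall_ack;     /* set by Core0 once it is spinning */

/* Core1: ask Core0 to stall. Returns immediately, no mutex involved.
 * (On real hardware this is where the interrupt would be triggered.) */
void core0_stall_start(void)
{
    atomic_store(&stall_ack, false);
    atomic_store(&stall_request, true);
}

/* Core0: one pass of the high-level ISR's spin loop. Confirms the stall
 * and returns true for as long as Core0 has to keep spinning. */
bool core0_stall_isr_poll(void)
{
    atomic_store(&stall_ack, true);
    return atomic_load(&stall_request);
}

/* Core1: release Core0. The check that Core0 really was stalled happens
 * here, at the end, instead of delaying the start of the transfer. */
bool core0_stall_end(void)
{
    bool confirmed = atomic_load(&stall_ack);
    atomic_store(&stall_request, false);
    return confirmed;
}
```

The point of this ordering is that the time-critical section on Core1 can begin right after the request is posted, and the cost of confirming the stall is paid only once the bit-banged transfer is over.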

ets_delay_us does not work anymore!

I don't often use the delay function, so it took me a while to even notice it was not working. When doing my full regression testing, I noticed the SEGA mouse didn't work in Wacky Worlds on the Genesis anymore. This game implements the SEGA Three-Wire handshake very poorly and requires emulating the exact timing a real SEGA mouse would take to toggle the ACK line. While debugging with my logic analyzer, I saw that the timing was completely off, as if the delay was not present.

Without a working delay, transmission happens as fast as possible, based only on the bit request (TR) and ack (TL) flags. Notice that the game's select (TR) is hardcoded to a specific length.

Turns out the ROM function ets_delay_us didn't work anymore! The disassembly of the function is available here and it's actually quite simple. This function depends on the cycle count from the xthal_get_ccount function and on the clock frequency setting located in the global variable g_ticks_per_us_pro. Based on my tests, the xthal_get_ccount output was fine, and even though I could access the content of g_ticks_per_us_pro without problem, I thought that maybe the ROM code somehow couldn't. So I reimplemented the function using the CONFIG_ESP32_DEFAULT_CPU_FREQ_MHZ define directly, and then my delay went back to normal!
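
A busy-wait delay of this kind boils down to a few lines. The version below is a sketch that compiles on a PC: `CPU_FREQ_MHZ` stands in for CONFIG_ESP32_DEFAULT_CPU_FREQ_MHZ, and `read_ccount` with its `fake_ccount` variable is a made-up substitute for the CPU cycle counter (xthal_get_ccount on the target). The fake counter deliberately starts close to its wrap-around point to show that the unsigned subtraction survives the rollover.

```c
#include <stdint.h>

#define CPU_FREQ_MHZ 240u /* stand-in for CONFIG_ESP32_DEFAULT_CPU_FREQ_MHZ */

/* Host stand-in for the cycle counter: advances a few "cycles" per read. */
uint32_t fake_ccount = 0xFFFFFF00u;
uint32_t read_ccount(void)
{
    fake_ccount += 7;
    return fake_ccount;
}

/* Spin until at least `us` microseconds' worth of cycles have elapsed.
 * The elapsed time is computed with unsigned arithmetic, so it stays
 * correct when the 32-bit counter wraps. `us * CPU_FREQ_MHZ` itself must
 * fit in 32 bits, which limits the delay to a few seconds. */
void delay_us(uint32_t us)
{
    uint32_t start = read_ccount();
    uint32_t ticks = us * CPU_FREQ_MHZ;

    while ((uint32_t)(read_ccount() - start) < ticks) {
        /* busy wait */
    }
}
```

Using a compile-time constant for the frequency avoids depending on any runtime variable, at the cost of being wrong if the CPU clock is ever changed dynamically.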

What Wacky Worlds expects is a long delay before ACKs


In the end, I think the major improvement came from replacing the DPORT access functions with my own Core0 stall function, which is much faster to initiate the stall. While having the second core free of FreeRTOS certainly helps, I bet I could have had the same result with FreeRTOS still running on it. I did successfully port BlueRetro to use the bare metal hack and everything works well. I kind of like it, so I'll keep it this way. If it turns out to be hard to maintain in the future, it would be very easy to revert anyway. The Genesis driver is now finally stable, so I can move on to implementing new features :) .