Dismissing the "new arch", the EVB, and Bitter fighting with Linux drivers (2022-10-09)

A project log for Linkia

Tiny Linux handheld with LoRa+WiFi+BT connectivity.

Reimu NotMoe, 10/21/2022 at 10:27

Original date: 2022-10-09

Related tweets: [1] [2] [3] [4] [5] [6] [7] [8]


Dismissing the "new arch"

Since suspend-to-RAM works perfectly on the latest X1000 SoM, there is no strong reason to keep a separate "EC" MCU anymore. Unfortunately, I had already bought 25 pcs of the PIC24 MCU, and they're EXPENSIVE. :(

So I decided to let the X1000 do everything as before. But it's not easy because of the missing/broken Linux drivers...

The EVB

Although I said that I was "too lazy to draw a EVB" (lol), a proper EVB is still needed to verify the functionality. So I drew one.

Bitter fighting with Linux drivers

Peripheral drivers

I spent a week getting Ethernet and I2S audio working.

The DMA problem

Early struggles

If you read my tweets carefully over the last season, you can see I was occasionally worried about the "DMA problems" of the mainline kernel. It's a very long story. Basically, I always had the feeling that DMA on the X1000 (and X1501) never worked correctly with the mainline kernel. This includes the standalone DMA controller (PDMA) and the bus-mastering DMA of certain peripherals.

I first noticed this problem in 2022-06. The symptom was very confusing: if you use SLAB/SLUB as the kernel's memory allocator, everything seems to work, but if you use SLOB, kernel oopses and panics appear as soon as something starts a DMA transfer. As someone who only had experience debugging driver problems on x86 processors at that point, I didn't have a single clue about what was happening. Does the SLOB allocator have bugs? Unlikely. Is the DRAM KGD broken? No, everything is fine with Ingenic's old 4.4 kernel. Then the problem must be in the DMA implementation itself.

We started by enabling a few of the kernel's memory debugging options, which constantly scan its data structures for corruption, and by enabling DMA for only one peripheral at a time. But the results were inconclusive: the memory corruption sometimes happened and sometimes didn't. I originally suspected that the "INCRxx" setting of the AHB bus of these bus-mastering peripherals caused the problems, but setting it to a lower value only made the data corruption less frequent. It was not entirely eliminated.

What else could I suspect? The only remaining thing would be the cache management code of the Linux kernel. I started reading the code in Ingenic's old 4.4 kernel, and yeah! That must be the problem! So I asked on the linux-mips mailing list:

In the past month, I was struggling with random memory corruptions and crashes on the Ingenic X1000. After some detailed testing, I need to point out, the current cache management routines seems to be incorrect for X1000, and maybe all X series SoCs. It mainly affects DMA operations. Every form of peripheral to RAM transfer will corrupt the RAM, and this includes the dwc2 and SFC's DMA and the PDMA controller. If all the DMAs are disabled (e.g. hard coding dma_capable = false in dwc2), it will be fine running CPU and I/O benchmarks for a week. If you have the hardware, you can enable the kernel data structures & memory debugging and see for yourself.

So I went back and looked at Ingenic's old 4.4 and 3.10 kernel sources. They used a separate file (sc-xburst.c) for the cache routines, which is based on an very old sc-mips.c. And there are two important macros, called MIPS_CACHE_SYNC_WAR and MIPS_BRIDGE_SYNC_WAR. They're both set to 1. However these macros are removed from the kernel long time ago. The line `mips_sc_ops.bc_wback_inv = mips_bridge_sync_war;' seems to be the key point.
Do you have any recommendations of what could be done to fix this problem?

Sadly, no one seemed to be interested in this topic. I was on my own, and I didn't have the expertise. This incident also seriously held back the launch of another project of mine, the X1501 Pico SoM. I had no choice but to suspect that the SLOB allocator really had problems.

Time flies

It was finally October (2022-10), and the DMA problem was still a mystery. I started to get desperate. Change to another SoC? It was already too late, and AFAIK nothing else can achieve such low power consumption. I got more and more desperate.

I desperately started reading everything in `arch/mips`. I'm not going to share what I did during those days since this article is already too long, but I was really not in good shape, believe me. Eventually, I found a way to disable the L2 cache. All the DMA problems appeared to be gone: I could run I/O benchmarks for 24 hours without problems. All? Well, not quite. Now only the non-bus-mastering peripherals still had occasional problems. So I started to read the driver code of the standalone DMA controller (PDMA), that is, `drivers/dma/dma-jz4780.c`. It took me half an hour to find a NULL pointer dereference bug in the code with my bare eyes and the kernel crash logs. And I fixed it.

Finally

Just when I thought I had fixed everything and could finally celebrate, another problem appeared: playing audio at a 44100 Hz sample rate stalled everything that was using the PDMA. Bruh... I really don't know what to say about it. It just sucks.

I got 10x more desperate. After a random nap on a random day, I decided to port the DMA driver from the old 4.4 kernel to mainline. The code is spaghetti, but TL;DR it finally works! Everything finally works! I tried more than 10 different ways to trigger the bugs, but they're finally gone!

Still, I'm speechless, and I don't know how to draw a conclusion from this incident.