As my software ecosystem began to grow, binary image sizes grew with it. For example, the most recent addition to the software list, a port of the PicoC C language interpreter (source), has a binary size of more than 64 KB. Adding a proper C library also increased the size of almost every binary image. Program load times became longer and longer, which made working with the system quite uncomfortable. So I decided to investigate the issue and optimize the system, at least for reading.
Profiling the load process
My system uses an SD card as its mass storage device. The SD card is connected over SPI to an ATmega640 microcontroller, which exposes SPI access ports in the address space accessible to the CPU. The CPU then reads and writes data in PIO mode. The microcontroller doesn't handle the SD card protocol; that is done by operating system code.
To profile loading times I hooked up a logic analyzer to several signals in my system: SS, SCK, CS# for the MCU and CS# for the RAM. This way I could deduce how much time the system spends on CPU<->RAM, MCU<->SPI and internal MCU operations. My first discoveries were:
- Reading one sector (512 bytes) takes roughly 60 ms.
- SPI communication takes a negligible fraction of the total time.
- Most of the time is spent on CPU<->RAM operations.
Conclusion: the operating system code is the bottleneck and should be improved.
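To put the 60 ms figure in perspective, here is a rough back-of-the-envelope estimate of what it means for load times. This is only a sketch: the function names are made up for illustration, and it assumes an image is read as whole 512-byte sectors with no filesystem or seek overhead, which a real load would add on top.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u  /* bytes per SD card sector */

/* Number of sectors needed to hold image_bytes, rounded up. */
uint32_t sectors_for(uint32_t image_bytes)
{
    return (image_bytes + SECTOR_SIZE - 1u) / SECTOR_SIZE;
}

/* Estimated load time in milliseconds, given the measured time
 * to read one sector. Ignores filesystem overhead (assumption). */
uint32_t load_time_ms(uint32_t image_bytes, uint32_t sector_read_ms)
{
    return sectors_for(image_bytes) * sector_read_ms;
}
```

With these assumptions, a 64 KB image (65536 bytes) is 128 sectors, so `load_time_ms(65536, 60)` gives 7680 ms, i.e. more than seven seconds just to read the binary.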
Reworking the code
When I analyzed the relevant code I found that a single SPI access operation went through several layers of procedure calls and was surrounded by conditionals that could enable debug logging. It seemed logical to organize the code this way when I was writing it, but ultimately it resulted in very bad performance. To improve things I collapsed several procedures into one, removed the conditionals and made the procedure static inline to save cycles on call/return.
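The shape of the collapsed code might look something like the sketch below. It is not my actual OS code: the `spi_port_t` register layout, the `SPI_BUSY` bit and the function names are hypothetical stand-ins for the ports the MCU exposes. The point is only that the per-byte transfer is a single static inline routine with no logging conditionals, so the sector loop compiles down to a tight sequence of port accesses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the SPI ports exposed by the MCU. */
typedef struct {
    volatile uint8_t data;    /* write: byte to send, read: byte received */
    volatile uint8_t status;  /* SPI_BUSY set while a transfer is running */
} spi_port_t;

#define SPI_BUSY 0x01u

/* One SPI byte exchange: no call layers, no debug conditionals. */
static inline uint8_t spi_xfer(spi_port_t *spi, uint8_t out)
{
    spi->data = out;                     /* start the transfer */
    while (spi->status & SPI_BUSY) { }   /* wait until it completes */
    return spi->data;                    /* fetch the received byte */
}

/* Read len bytes from the card; 0xFF is sent as the dummy byte,
 * as is usual for SD cards in SPI mode. */
void sd_read_block(spi_port_t *spi, uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = spi_xfer(spi, 0xFF);
}
```

Because `spi_xfer` is static inline, the compiler is free to paste it straight into the loop body, which is where the saved call/return cycles come from.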
When I tested the new version I saw a 100% speed improvement: a single sector read now took only 30 milliseconds, down from 60. This was already an achievement, but there was still room for improvement.
Implementing DMA: high expectations, low yield
My next step was to take the CPU out of the read path entirely. My circuitry was designed with a future DMA implementation in mind: I had enough signals wired to the MCU to let it master the bus. So I implemented DMA mode for SPI in the firmware and modified the operating system. Unfortunately, tests showed that sector read time dropped only slightly: from 30 to 25 ms. That was quite a disappointment. It appears that I underestimated the cost of internal MCU operations during the DMA process: address counting, port manipulation, etc. That code has also been improved, but this only reduced sector read time to 18 ms. Most of this gain came from running the SPI transfer in parallel with the firmware's internal logic operations.
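The parallelizing idea can be sketched roughly as follows. This is an illustration, not the real firmware: `spi_hw_t` and its `done` flag are a hypothetical stand-in for the MCU's SPI peripheral, and the plain store to `dst[i]` stands in for the address counting and port manipulation needed to put a byte on the bus. The trick is to kick off the next SPI transfer before doing the bookkeeping for the byte just received, so the two overlap in time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the MCU's SPI peripheral. */
typedef struct {
    volatile uint8_t data;  /* write starts a transfer, read returns result */
    volatile uint8_t done;  /* nonzero once the current transfer finished */
} spi_hw_t;

/* Copy len bytes from SPI to dst, overlapping each transfer
 * with the bookkeeping for the previous byte. */
void dma_read_overlapped(spi_hw_t *spi, uint8_t *dst, size_t len)
{
    if (len == 0)
        return;

    spi->data = 0xFF;                 /* start the first transfer */
    for (size_t i = 0; i < len; i++) {
        while (!spi->done) { }        /* wait for byte i to arrive */
        uint8_t byte = spi->data;     /* pick up the received byte */

        if (i + 1 < len)
            spi->data = 0xFF;         /* start byte i+1 immediately */

        /* Bookkeeping for byte i runs while byte i+1 is shifting in. */
        dst[i] = byte;
    }
}
```

In the naive version the store happens before the next transfer is started, so the SPI hardware sits idle during all the bookkeeping.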
Compared to the initial timings I achieved a dramatic improvement in performance: sector read time dropped from 60 to 18 ms, which makes reads roughly 3.3 times faster (about a 230% speed improvement). But this feat of optimization also revealed the ultimate limitation of using a general-purpose MCU as a peripheral controller. After all the optimization, actual transfers over SPI accounted for only 15% of the time. Almost everything else was lost to the sequential nature of the MCU's software-defined operations. No breakthrough optimization of this architecture is possible. It also showed that making a DMA controller out of a general-purpose MCU is generally a bad idea.
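For anyone who wants to check the numbers, the arithmetic behind these figures is simple enough to write down. The helper names below are just for illustration; the inputs are the measured sector times and the 15% SPI share mentioned above.

```c
#define SECTOR_BYTES 512.0  /* bytes per SD card sector */

/* How many times faster a read is after optimization. */
double speedup(double before_ms, double after_ms)
{
    return before_ms / after_ms;
}

/* Effective read throughput in bytes per second. */
double throughput_bytes_per_s(double sector_ms)
{
    return SECTOR_BYTES / (sector_ms / 1000.0);
}

/* Time per sector spent on actual SPI transfers, given
 * the fraction of the total time they account for. */
double spi_time_ms(double sector_ms, double spi_fraction)
{
    return sector_ms * spi_fraction;
}
```

`speedup(60, 18)` comes out at about 3.33, and throughput goes from roughly 8.5 KB/s at 60 ms per sector to roughly 28 KB/s at 18 ms. `spi_time_ms(18, 0.15)` is about 2.7 ms, which means around 15 ms of every sector read is still MCU overhead rather than data transfer.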
The next step in increasing SD card performance is to design a custom hardware controller for SPI, and possibly DMA, using TTL and/or CPLD chips.