
Tiny Inference Engines for MCU deployment

A project log for Generative AI on a Microcontroller

The Electronic Die of the Future

Tim, 11/19/2023 at 15:36

The big question now is how to implement our trained model on a microcontroller. Ideally the solution works with PyTorch (since I trained the models in it) and minimizes SRAM and flash footprint even on very small devices (no point in having a 6k-parameter model if the inference code alone takes 30k).
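To make the footprint concern concrete, here is a quick back-of-envelope calculation in plain Python. The numbers are purely illustrative: the 6k parameter count is from this project, while the 30k runtime size is an assumed figure for a generic engine, not a measurement of any tool listed below.

    PARAMS = 6_000                 # model size from this project

    weights_int8 = PARAMS * 1      # ~6 kB of flash if weights are stored as int8
    weights_fp32 = PARAMS * 4      # ~24 kB if they stay float32
    runtime_code = 30_000          # assumed size of a generic inference runtime

    print(f"int8 weights: {weights_int8 / 1024:.1f} kB")
    print(f"fp32 weights: {weights_fp32 / 1024:.1f} kB")
    print(f"runtime code: {runtime_code / 1024:.1f} kB")

    # With a 30 kB runtime, the engine rather than the model dominates the
    # flash budget - hence the requirement for a very small inference engine.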

I spent quite some time searching and reviewing various options. A short summary of my findings:

TensorFlow-based

TensorFlow Lite
TinyEngine from MCUNet – looks great, targeting ARM CM4.
CMSIS-NN – ARM-centric. Comes with examples, including one for a PyTorch-to-TFLite conversion via ONNX (a sketch of that path follows after this list).
TinyMaix – very minimalistic, can also be used on RISC-V.
NNoM – relatively active project. Small footprint and portable.
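Since I have not tried that conversion path yet, the following is only a sketch of how PyTorch → ONNX → TFLite could look. It assumes the onnx and onnx-tf packages for the middle step (the actual CMSIS-NN example may use a different tool), and the tiny two-layer model and file names are placeholders, not the model from this project.

    import torch
    import onnx
    import tensorflow as tf
    from onnx_tf.backend import prepare

    # Placeholder model - stands in for the actual trained network
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 8),
        torch.nn.ReLU(),
        torch.nn.Linear(8, 4),
    ).eval()

    # 1. Export the PyTorch model to ONNX with a fixed input shape
    dummy = torch.zeros(1, 16)
    torch.onnx.export(model, dummy, "model.onnx", opset_version=13)

    # 2. Convert the ONNX graph to a TensorFlow SavedModel
    prepare(onnx.load("model.onnx")).export_graph("model_tf")

    # 3. Convert the SavedModel to a TFLite flatbuffer with default
    #    (mostly weight) quantization enabled
    converter = tf.lite.TFLiteConverter.from_saved_model("model_tf")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    with open("model.tflite", "wb") as f:
        f.write(converter.convert())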

PyTorch-based

PyTorch Edge / ExecuTorch – PyTorch's answer to TensorFlow Lite. Seems to target intermediate systems; the runtime alone is around 50 kB...
microTVM – targeting CM4, but claims to be platform-agnostic.

MAX7800X Toolchain and Documentation – a proprietary toolchain to deploy models to the MAX78000 edge-NN devices.

Meta Glow – a machine-learning compiler that seems to target medium to large platforms.

ONNX-based

DeepC – open-source version of DeepSea. Very little activity, looks abandoned.
onnx2c – ONNX-to-C source code converter. Looks interesting, but also not very active (a minimal export sketch for this kind of tool follows after this list).
cONNXr – framework with a C99 inference engine. Also interesting and not very active.
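For the ONNX-based converters, everything starts from a plain ONNX export. Below is a minimal sketch of what I expect that to look like; the two-layer model is a stand-in, and the assumption that these tools want a static-shaped, checked graph at a moderate opset is mine, not taken from their documentation.

    import torch
    import onnx

    # Stand-in model, not the actual network from this project
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 8),
        torch.nn.ReLU(),
        torch.nn.Linear(8, 4),
    ).eval()

    # Export with a fixed batch size - small C engines typically have no
    # support for dynamic axes
    dummy = torch.zeros(1, 16)
    torch.onnx.export(model, dummy, "model.onnx", opset_version=13)

    # Sanity-check the exported graph before feeding it to onnx2c / cONNXr
    onnx.checker.check_model(onnx.load("model.onnx"))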

The Dilemma

The diversity of solutions, and the number of seemingly abandoned approaches, shows that deployment on tiny MCUs is far from having a one-size-fits-all solution.

Most of the early solutions for edge inference / TinyML are based on TensorFlow Lite; PyTorch only seems to be catching up more recently. There are also some solutions that convert models from the ONNX interchange format to C code.

The issue is clearly that it is very hard to combine easy deployment, flexibility and a very small footprint in one solution. To get the most out of small models, they need to be trained with the limitations of the inference code in mind (quantization-aware training). This means that training and inference cannot be implemented completely independently of each other; a rough sketch of what that looks like in PyTorch is shown below.
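As a reference for what that entanglement means in practice, here is a minimal sketch of PyTorch's eager-mode quantization-aware training flow. It is a generic example with made-up layer sizes, not the workflow required by any particular engine listed above; toolchains like the MAX78000 one ship their own QAT flow.

    import torch
    import torch.nn as nn
    import torch.ao.quantization as tq

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = tq.QuantStub()     # marks where activations become int8
            self.fc1 = nn.Linear(16, 8)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(8, 4)
            self.dequant = tq.DeQuantStub()

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.fc1(x))
            x = self.fc2(x)
            return self.dequant(x)

    model = TinyNet()
    model.qconfig = tq.get_default_qat_qconfig("qnnpack")  # int8 scheme for ARM-like targets
    tq.prepare_qat(model, inplace=True)     # inserts fake-quantization observers

    # ... the normal training loop runs here; the forward pass now simulates
    # int8 rounding, so the weights learn to cope with the reduced precision ...

    model.eval()
    quantized = tq.convert(model)           # swaps in real int8 modules after training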

An attractive workaround that seems to be gaining traction is to use compilers that generate model-specific inference code, such as microTVM and Glow. Both of these seem to be aimed at slightly more powerful microcontrollers (CM4 and up).

In terms of very small inference engines, onnx2c, TinyMaix and NNoM seem to be quite interesting. The last two are unfortunately based on TensorFlow Lite. TinyEngine seems to be the most optimized MCU inference engine, but it is highly reliant on the ARM architecture.

Plan/Next steps
