Running a PyTorch Model on the ESP32 S3

This log describes the steps I'm taking to get a model trained in PyTorch running on the ESP32 S3 (M5Stack Core S3).

I'm logging my actions while trying to move a model from PyTorch to the ESP32 S3. The S3 has a SIMD extension baked into its Xtensa CPU, allowing certain operations to be vectorized. The available operations were chosen to help accelerate (quantized) AI workloads.

One of the most important operations made available is a vectorized fused multiply-add on 8- or 16-bit data that loads and aligns the next batch of data in the same instruction. This is ideally suited for (quantized) convolutions and fully connected layers (matrix multiplications).
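As a mental model, here is a minimal scalar sketch of what one such multiply-accumulate step computes (ignoring the parallel load/align and the hardware accumulator details):

    def vmulas_s8_accx(acc, a, b):
        # One 16-lane int8 multiply-accumulate: all lane products
        # are summed into a single wide accumulator.
        assert len(a) == len(b) == 16
        return acc + sum(x * y for x, y in zip(a, b))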

Espressif offers two options to access this functionality: ESP-DSP and ESP-DL. I will get into how they both suck later, and show a third option: using assembly language to run the instructions directly.


My priority is to get a model to run fast enough on an ESP32 S3. This means we have to make some tough decisions... Even though you should always keep the quality of your code to a high standard, when it comes to pure optimization nothing is sacred. If you're lucky, the compiler will do most of the work optimizing the code, but when it comes to custom instruction sets like the one found in the ESP32 S3, we're out of luck. We have a few choices, and I would always advise starting with a high-level option, like an API that abstracts some of the internals, and working your way down if you run into obstacles. That's what I did here: I went from ESP-DL to ESP-DSP to raw assembly in order to get things done. Maintainability, reusability, readability? Out the window! And in return we get full control over the SIMD instructions that make this model go fast!


The AI framework built by Espressif, containing samples for face detection and face recognition, can be found here: esp-dl (github). It all looks great at first sight, and I started building an interface between PyTorch and the model format this framework expects. I built a custom QuantizationObserver to make sure the int8 quantization takes the ESP-DL framework's limitations into account, and converted my model. I quickly ran into some issues / bugs, probably on my side, that needed to be resolved.

So I started to look into the source code of the framework to see if I was using the proper data permutations, and hit a brick wall in the form of precompiled library files. No source other than the headers is shared with this project, making it a black box. It also requires you to use a custom model conversion tool for which no source is shared either. I can't work like that, so I ditched the idea of using this framework.


The DSP framework built by Espressif, esp-dsp (github), exposes an API that uses the extended (SIMD) instructions present on the ESP32 and ESP32-S3 SoCs. It has functionality like convolution and matrix multiplication, so I figured I could use it to create my own convolution / linear layers. In the end, a convolution or linear layer is just a reordering of data (unfold), a matrix multiplication, and a fold to the output shape. A matrix multiplication is just a set of dot products. There is a header, dspi_dotprod.h, that takes two 2D "images" (one input, one kernel) and performs a dot product on quantized data ([u]int8, [u]int16). It lets you set up strides for x and y, which should make it possible to define patches of a source image, keeping the source image's stride, without copying data around.
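To make that idea concrete, here is a minimal PyTorch sketch (shapes illustrative, float instead of int8) showing that a convolution really is just unfold + matmul + reshape:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 3, 8, 8)    # NCHW input
    w = torch.randn(16, 3, 3, 3)   # 16 output channels, 3x3 kernel

    cols = F.unfold(x, kernel_size=3)   # each 3x3x3 patch -> a column: (1, 27, 36)
    out = w.view(16, -1) @ cols         # all dot products in one matmul: (1, 16, 36)
    out = out.view(1, 16, 6, 6)         # fold back to the output shape

    assert torch.allclose(out, F.conv2d(x, w), atol=1e-5)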

I wrote a function to split an image into kernel-sized patches and ran them through this function, but I got horrible results. Luckily this project does have its source code available, assembly code to be exact, so I dove into it and realized that it falls back to slow code if the patch isn't aligned properly or has odd sizes, and imposes a slew of other constraints on the data before it takes the optimized path. Furthermore, the function does not saturate the result: if your dot product is larger than the output type's maximum, it simply takes the lower 8 bits instead of clamping to the output range.
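Here's a quick plain-Python illustration of that wraparound pitfall (values illustrative):

    acc = 300                             # dot product above the int8 max of 127
    wrapped = (acc + 128) % 256 - 128     # lower 8 bits, like the library: 44
    saturated = max(-128, min(127, acc))  # clamped to the output range: 127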


Because the ESP-DSP project has its source code available, I decided to look into it and use assembly language to do exactly what I need. And this is where a world of possibilities opened up. The star of the show: EE.VMULAS.S8.ACCX.LD.IP, an instruction that performs a fused multiply + add + load and lets me choose how the source-image pointer is incremented (with some acceptable restrictions). Now we are in Xtensa territory, which has a properly documented ISA. I will show later how I used these instructions to get my face detector running at insane speed...


Files:

  • XTensa Instruction Set (PDF, 4.50 MB, 05/13/2024)
  • Contains the ISA for the S3 extended instructions (PDF, 13.94 MB, 05/13/2024)


  • 4x4 Convolutions

    E/S Pronk, 4 days ago

    We're so used to using 3x3 convolutions we don't often think about switching it up, and why would we? The 3x3 convolution is very efficient, and 2 of them back to back with a nonlinearity between them usually outperform an equivalent 5x5 convolution. So why would you use 4x4 convolutions instead?

    Technically, a 4x4 kernel can be constructed by padding a 3x3 kernel with zeros, which means a 4x4 convolution can serve as a drop-in replacement. When you look at the SIMD instruction set available on the ESP32 S3, you quickly see that you are best off working with 16 bytes at a time, and that you avoid headaches by keeping your data accesses aligned to 16-byte boundaries. So for the first layer in my network I replaced the 3-input/16-output-channel 3x3 convolution with a 4-input-channel (RGB + one padding channel) 4x4 one, and retrained the network. The bigger convolution requires a bit more resources during training and slightly improves the accuracy of the network. But now I can load the kernel for a single output channel and multiply+accumulate it with a block of the source image in only 10 instructions (using NHWC format):

    # Load the wide accumulator ACCX with the bias (pointer increment 0).
    ee.ld.accx.ip %[bias],0
    # First 16 input bytes (4 px x 4 ch, NHWC); +128 is the source row stride here.
    ee.vld.128.ip q0,%[in],128
    # First 16 weight bytes of this output channel's 4x4x4 kernel.
    ee.vld.128.ip q4,%[weight],16
    ee.vld.128.ip q1,%[in],128
    # MAC q0*q4 into ACCX while loading the next 16 weight bytes into q5.
    ee.vmulas.s8.accx.ld.ip q5,%[weight],16,q0,q4
    ee.vld.128.ip q2,%[in],128
    ee.vmulas.s8.accx.ld.ip q6,%[weight],16,q1,q5
    ee.vld.128.ip q3,%[in],128
    ee.vmulas.s8.accx.ld.ip q7,%[weight],16,q2,q6
    # Final MAC needs no trailing load.
    ee.vmulas.s8.accx q3,q7

    Imagine all the shifting and masking needed to make this a 3x3x3 convolution.
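    To see why the zero-padded 4x4 kernel is a drop-in replacement for the 3x3, here is a minimal PyTorch check (shapes illustrative): the extra row and column of zero weights contribute nothing, so the 4x4 output matches the 3x3 output wherever both are valid.

    import torch
    import torch.nn.functional as F

    x  = torch.randn(1, 3, 8, 8)
    w3 = torch.randn(16, 3, 3, 3)

    # Embed the 3x3 kernel in the corner of an all-zero 4x4 kernel.
    w4 = torch.zeros(16, 3, 4, 4)
    w4[:, :, :3, :3] = w3

    out3 = F.conv2d(x, w3)   # (1, 16, 6, 6)
    out4 = F.conv2d(x, w4)   # (1, 16, 5, 5)
    assert torch.allclose(out4, out3[:, :, :5, :5], atol=1e-5)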

    Once you reach a point where the number of channels is a multiple of 16 you're out of the woods, as long as you're using NHWC :)

  • Quantization: PyTorch vs ESP32 S3

    E/S Pronk, 05/13/2024 at 19:13

    I'm working on a custom model, and I'm using PyTorch to train it. Most of the layers are custom, so I can't just export to some standard format and hope for the best. I'm going to duplicate the layers' logic in C on the ESP32, then use PyTorch to quantize my model weights.

    I would like to try the ESP-DL library from Espressif, but unfortunately they use a different quantization scheme than PyTorch and claim you can't use your own model with their API. This is not entirely true: there is no easy way to use your model with their quantization scheme, but you certainly can.

    The key thing to understand is how both quantization schemes work. PyTorch uses a zero-point and a scale:

    f32 = (i8 - zero_point) * scale

    while ESP-DL uses an exponent:

    f32 = i8 * (2 ** exponent) 

    which they claim is not compatible.

    We can make this work though, if we force PyTorch to use a zero-point with value 0 and a scale that is always 2 to the power of a (signed) int.
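    A quick sanity check of that equivalence (values are illustrative):

    q, zero_point, scale, exponent = 42, 0, 2 ** -5, -5
    assert (q - zero_point) * scale == q * (2 ** exponent)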

    Getting a zero-point of 0 is easy: we set the qconfig to use a symmetric quantization scheme. The scale is a little harder, but no rocket science either: we can subclass a suitable QuantizationObserver to produce qparams with a scale that is snapped to

    scale = 2 ** round( log2( scale ))

    Like so:

    import torch
    import torch.ao.quantization as Q

    class ESP32MovingAverageMinMaxObserver(Q.MovingAverageMinMaxObserver):
        def _calculate_qparams(self, min_val: torch.Tensor, max_val: torch.Tensor):
            s, z = super()._calculate_qparams(min_val, max_val)
            # The symmetric qscheme guarantees a zero-point of 0.
            assert (z == 0).all()
            # Snap the scale to the nearest power of two so it maps
            # directly onto ESP-DL's exponent representation.
            s = 2 ** s.log2().round().clamp(-128, 127)
            return s, z

    Then, when it is time to export the weights, we also export the ESP-DL exponent by simply taking the log2 of the weight tensor's scale.
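    A minimal sketch of that export step (esp_dl_exponent is a hypothetical helper name, assuming per-tensor quantized weights):

    import torch

    def esp_dl_exponent(qweight: torch.Tensor) -> int:
        # The observer above guarantees the scale is an exact power of
        # two, so the ESP-DL exponent is just log2(scale).
        return int(torch.tensor(qweight.q_scale()).log2().round())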





XieMaster wrote 17 hours ago:

This project is simply awesome!

I have recently been using the ESP32 S3 to run target-detection AI projects, but I am using the SSCMA open source project and the Edge Impulse platform to train the FOMO model.

