Face detector written mostly in assembly, model size roughly 25 kB (with 8-bit quantized weights). Sorry, can't share any details just yet…

SPEED!

My priority is to get a model to run fast enough on an ESP32 S3. This means we have to make some tough decisions... Even though you should always keep the quality of your code to a high standard, when it comes to pure optimization nothing is sacred. If you're lucky, the compiler will do most of the work optimizing the code, but when it comes to custom instruction sets like the one found in the ESP32 S3 we're out of luck. We have a few choices, and I would always advise starting with a high-level option, like an API that abstracts some of the internals, and working your way down if you run into obstacles. This is what I did here: I went from ESP-DL to ESP-DSP to raw assembly in order to get things done. Maintainability, reusability, readability? Out the window! And in return we get full control over the SIMD instructions that make this model go fast!

ESP-DL

The AI framework built by Espressif, containing samples on how to do face detection and face recognition, can be found here: esp-dl (github). This all looks great at first sight, and I started building an interface between PyTorch and the model format this framework expects. I built a custom QuantizationObserver to make sure the quantization to int8 takes into account the limitations of the ESP-DL framework, and converted my model. I quickly ran into some issues / bugs, probably on my side, that needed to be resolved.

So I started to look into the source code of the framework to see if I was using the proper permutations of data, and hit a brick wall in the form of library files. No source other than the headers is shared with this project, making it a black box. It requires you to use a custom model converter tool, for which no source is shared either. I can't work like that, so I ditched the idea of using this framework.

ESP-DSP

The DSP framework built by Espressif, which exposes an API using the extended instructions (SIMD) present on the ESP32 and ESP32-S3 SoCs: esp-dsp (github). It has functionality like convolution and matrix multiplication, so I figured I could use that to create my own convolution / linear layers. In the end, a convolution or linear layer is just a reordering of data (unfold), a matrix multiplication and folding to the output shape. A matrix multiplication is just a set of dot products. There is a header dspi_dotprod.h whose functions take in 2D "images" (one input, one kernel) and perform a dot product on quantized data ([u]int8, [u]int16). It suggests that you can set up the strides for x and y, which should make it possible to define patches of a source image, keeping the stride of the source image, without copying data around.
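The "unfold, then dot products" view of a convolution can be sketched in plain Python. This is only an illustration under simple assumptions (single channel, stride 1, no padding); the function names `unfold`, `dot` and `conv2d` are made up for this sketch and are not part of ESP-DSP.

```python
def unfold(image, height, width, kh, kw):
    """Collect every kh x kw patch of a row-major, single-channel image as a flat list."""
    patches = []
    for y in range(height - kh + 1):
        for x in range(width - kw + 1):
            patches.append([image[(y + dy) * width + (x + dx)]
                            for dy in range(kh) for dx in range(kw)])
    return patches


def dot(a, b):
    """Plain dot product of two equally sized lists."""
    return sum(p * q for p, q in zip(a, b))


def conv2d(image, height, width, kernel, kh, kw):
    """Stride-1, unpadded convolution expressed as unfold + one dot product per patch."""
    return [dot(patch, kernel) for patch in unfold(image, height, width, kh, kw)]


image = list(range(16))                    # 4x4 image holding 0..15
kernel = [1, 0, 0, 1]                      # 2x2 kernel: top-left + bottom-right
out = conv2d(image, 4, 4, kernel, 2, 2)    # flattened 3x3 output
print(out)
```

With several output channels, the kernels stack into a matrix and the list of dot products becomes a matrix multiplication.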

I wrote a function to split an image up into patches the size of the kernel and put them through this function, but I got horrible results. Luckily this project does have source code available, assembly code to be exact, so I dove into the code and realized that it would fall back to slow code if the patch isn't aligned properly, or has weird sizes, and that there is a slew of other constraints on the data before it would run an optimized path. Furthermore, the function does not saturate the result, meaning that if your dot product is larger than a certain maximum, it would simply take the lower 8 bits, instead of clamping it to the output range.
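To make the difference between "taking the lower 8 bits" and clamping concrete, here is a small Python sketch. The helper names `wrap_int8` and `saturate_int8` are hypothetical and only model the arithmetic; they are not ESP-DSP functions.

```python
def wrap_int8(value):
    """Keep only the lower 8 bits and reinterpret them as a signed byte."""
    low = value & 0xFF
    return low - 256 if low >= 128 else low


def saturate_int8(value):
    """Clamp the value to the signed 8-bit range [-128, 127]."""
    return max(-128, min(127, value))


for acc in (100, 300, -200):
    print(acc, wrap_int8(acc), saturate_int8(acc))
```

An accumulator of 300 wraps around to 44, while saturation gives 127, which is much closer to the intended value.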

EE.VMULAS.S8.ACCX.LD.IP

Because the ESP-DSP project has its source code available, I decided to look into it and use assembly language to do exactly what I need to do. And this is where the world of possibilities opened up. The star of the show: EE.VMULAS.S8.ACCX.LD.IP, an instruction that performs a fused multiply + add + load, and lets me advance the pointer into the source image at my own discretion (with some acceptable restrictions). Now we are in Xtensa territory, which has a properly documented ISA. I...

2x2 MaxPool in ESP32 S3 Assembly
E/S Pronk • 3 days ago • 0 comments
```
xor a8, a8, a8
xor a9, a9, a9
movi a7, {image_height}
movi a6, {image_width}
slli a12, a6, 4
or a13, a12, a12
addi a13, a13, -16
    
max_col:
    movi a6, {image_width}
    or a10, a8, a8
    add a8, a12, a8
    or a11, a9, a9
    addx2 a9, a12, a9
    
max_block:
    ee.vld.128.ip q0, a11, 16
    ee.vld.128.xp q1, a11, a13
    ee.vmax.s8.ld.incp q2, a11, q5, q0, q1
    ee.vmax.s8.ld.incp q3, a11, q6, q5, q2
    sub a11, a11, a12
    ee.vmax.s8 q7, q6, q3
    st.qr q7, a10, 0
    addi a10, a10, 16
    addi a6, a6, -2
    bnez a6, max_block
end_max_block:

    addi a7, a7, -2
    bnez a7, max_col
end_max_col:
```
I managed to get my ESP32 S3 Emulator to a level where it can run a lot of the SIMD instructions. I am implementing functionality as I need it, only implementing instructions when I see a reason to use them in the assembly code I'm writing. This makes the process a bit more manageable, because writing the emulator is mind-numbing, carpal tunnel inducing torture as it is...
Using HWC format with a number of channels that is a multiple of 16 helps with alignment. Each pixel takes exactly one EE.VLD.128.IP instruction for 16-channel data. The max pool uses these in conjunction with EE.VMAX.S8.LD.INCP, which calculates the element-wise maximum of 2 vectors containing 16 signed 8-bit integers each, while loading new data.
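As a reference to compare the assembly against, a 2x2 max pool over HWC data can be written in a few lines of Python. This is a sketch under my own assumptions: the input is a flat list in HWC order, height and width are even, and the output is packed densely (the assembly above may lay out its output differently). The name `maxpool2x2_hwc` is hypothetical.

```python
def maxpool2x2_hwc(data, height, width, channels=16):
    """2x2 max pool with stride 2 over a flat HWC list of signed 8-bit values."""
    row = width * channels                      # elements per image row
    out = []
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            base = y * row + x * channels       # top-left pixel of the 2x2 block
            for c in range(channels):
                out.append(max(data[base + c],                    # (y,   x)
                               data[base + channels + c],         # (y,   x+1)
                               data[base + row + c],              # (y+1, x)
                               data[base + row + channels + c]))  # (y+1, x+1)
    return out


# 2x2 image with 2 channels: pixels [1,-8], [5,2], [-3,7], [4,0]
pooled = maxpool2x2_hwc([1, -8, 5, 2, -3, 7, 4, 0], 2, 2, channels=2)
print(pooled)
```

With 16 channels, the inner loop over `c` corresponds to one 16-lane vector maximum.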
Biting the bullet...
E/S Pronk • 05/24/2024 at 23:01 • 0 comments
Writing assembly for the ESP32 S3 is fun, don't get me wrong. But it is also very frustrating when you have to debug it, rebuild, wait for the upload, monitor, crash, repeat. I know you can do emulation and gdb, but to be fair, I haven't ever used gdb before and I don't feel like learning to use it at this time.
At the same time, I'm still learning all the ins and outs of the instructions available on the Xtensa CPU. This basically boils down to reading and rereading the ISA over and over until it sticks.
I'm all for getting things done fast, with the least amount of resources spent, because I'm a prototyper making proof-of-concepts. But you do have to be smart about it. I noticed that as my program grows in complexity, so does the time debugging the assembly and its overhead (building, flashing).
So I decided to spend a considerable amount of time writing an ESP32 S3 emulator in Python. It must be able to read some assembly, execute it and show me the state of the registers. I can then import it in a notebook and basically start an interactive session with the emulator. When the assembly does what it is supposed to do, the emulator could output it to be assembled by the assembler, or maybe even assemble it itself.
This will give me 2 things:
- I will learn the details of every instruction in the ISA and get a very detailed overview of the CPU's inner workings.
- I can develop and debug much quicker
My prediction is that the time spent writing this emulator is very easy to win back, considering the amount of assembly code I want to write for this (and subsequent) projects.
The code will be available here. Keep in mind that its purpose is to serve the 2 goals stated above, not to be a complete or perfect emulator.
esp32s3_emu on github
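To give an idea of what "read some assembly, execute it and show the registers" can look like, here is a deliberately tiny interpreter sketch. It is not the esp32s3_emu code: the function `run`, the four supported instructions and the lack of 32-bit wraparound are all simplifications chosen for this illustration.

```python
def run(source, registers=None, max_steps=10_000):
    """Interpret a tiny subset of Xtensa-style assembly and return the register state."""
    regs = dict(registers or {})
    labels, program = {}, []
    for line in source.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):                 # label: remember the next instruction index
            labels[line[:-1]] = len(program)
        else:
            op, _, rest = line.partition(" ")
            program.append((op, [a.strip() for a in rest.split(",")]))

    pc = steps = 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        op, args = program[pc]
        pc += 1
        if op == "movi":                       # movi at, imm
            regs[args[0]] = int(args[1])
        elif op == "addi":                     # addi at, as, imm
            regs[args[0]] = regs[args[1]] + int(args[2])
        elif op == "add":                      # add ar, as, at
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "bnez":                     # bnez as, label
            if regs[args[0]] != 0:
                pc = labels[args[1]]
        else:
            raise NotImplementedError(op)
    return regs


demo = """
movi a2, 0
movi a3, 4
loop:
    add a2, a2, a3
    addi a3, a3, -1
    bnez a3, loop
"""
state = run(demo)
print(state)
```

Because the result is a plain dictionary of registers, it is easy to inspect in a notebook after each run.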
4x4 Convolutions
E/S Pronk • 05/17/2024 at 00:58 • 0 comments
We're so used to using 3x3 convolutions we don't often think about switching it up, and why would we? The 3x3 convolution is very efficient, and 2 of them back to back with a nonlinearity between them usually outperform an equivalent 5x5 convolution. So why would you use 4x4 convolutions instead?

Technically, a 4x4 kernel can be constructed by padding a 3x3 kernel with zeros, which means that 4x4 kernels can serve as drop-in replacements. When you look at the SIMD instruction set available on the ESP32 S3 you quickly see that you are best off working with 16 bytes at a time. Also, you can avoid headaches by keeping your data access aligned to 16-byte boundaries. So for the first layers in my network I replaced the 3 to 16 channel 3x3 convolution with a 4 channel (RGB+pad) 4x4 one, and retrained the network. The bigger convolution requires a bit more resources during training and improves the accuracy of the network slightly. But now, I can load the feature map for a single output channel and multiply+accumulate it with a block in the source image in only 10 instructions (using NHWC format):
```
ee.ld.accx.ip %[bias],0
ee.vld.128.ip q0,%[in],128
ee.vld.128.ip q4,%[weight],16
ee.vld.128.ip q1,%[in],128
ee.vmulas.s8.accx.ld.ip q5,%[weight],16,q0,q4
ee.vld.128.ip q2,%[in],128
ee.vmulas.s8.accx.ld.ip q6,%[weight],16,q1,q5
ee.vld.128.ip q3,%[in],128
ee.vmulas.s8.accx.ld.ip q7,%[weight],16,q2,q6
ee.vmulas.s8.accx q3,q7
```
Imagine all the shifting and masking needed in order to make this a 3x3x3 convolution.
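The arithmetic of the listing above can be modelled in Python as a bias plus four 16-element dot products. This is my reading of the code, not something stated in the log: I assume the 128-byte post-increment on the input pointer is the byte stride between image rows, and that the 64 weights are stored row by row. The names `ROW_STRIDE` and `conv4x4_at` are hypothetical.

```python
ROW_STRIDE = 128   # assumed bytes between image rows (the 128 in the listing)
VECTOR = 16        # one 128-bit load: 4 pixels x 4 channels of int8


def conv4x4_at(image, offset, weights, bias):
    """Accumulate bias + dot(patch, weights) for one 4x4x4 patch starting at `offset`."""
    acc = bias                                      # models ee.ld.accx.ip
    for row in range(4):
        start = offset + row * ROW_STRIDE
        pixels = image[start:start + VECTOR]        # models one ee.vld.128.ip on the input
        w = weights[row * VECTOR:(row + 1) * VECTOR]
        acc += sum(p * q for p, q in zip(pixels, w))   # models one multiply-accumulate
    return acc


image = [1] * (ROW_STRIDE * 4)         # four rows of constant pixels
weights = list(range(-32, 32))         # 64 int8 weights
result = conv4x4_at(image, 0, weights, bias=10)
print(result)
```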

Once you reach a point where the number of channels is a multiple of 16 you're out of the woods, as long as you're using NHWC :)
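The claim that a zero-padded 4x4 kernel is a drop-in replacement for a 3x3 one can be checked with a small Python sketch. Assumptions of this sketch: the kernel and patch are flat lists in HWC order, and the padding is added at the bottom, right and in the fourth channel. The helper names `pad_kernel`, `crop_patch` and `dot` are made up for the illustration.

```python
def pad_kernel(kernel3, k=3, c=3, K=4, C=4):
    """Zero-pad a k x k x c kernel (flat, HWC order) to K x K x C."""
    out = [0] * (K * K * C)
    for y in range(k):
        for x in range(k):
            for ch in range(c):
                out[(y * K + x) * C + ch] = kernel3[(y * k + x) * c + ch]
    return out


def crop_patch(patch4, k=3, c=3, K=4, C=4):
    """Extract the top-left k x k x c part of a K x K x C patch."""
    return [patch4[(y * K + x) * C + ch]
            for y in range(k) for x in range(k) for ch in range(c)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


patch4 = [(i * 7) % 256 - 128 for i in range(64)]   # arbitrary int8 4x4x4 patch
kernel3 = [(i * 5) % 11 - 5 for i in range(27)]     # arbitrary 3x3x3 kernel
kernel4 = pad_kernel(kernel3)

print(dot(crop_patch(patch4), kernel3), dot(patch4, kernel4))
```

Both dot products are identical, and the padded kernel is exactly 64 bytes, i.e. four 16-byte vectors. After retraining, the padded positions are free to take non-zero values.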
Quantization: PyTorch vs ESP32 S3
E/S Pronk • 05/13/2024 at 19:13 • 0 comments
I'm working on a custom model, and I'm using PyTorch to train it. Most of the layers are custom so I can't just export to some standard format and hope for the best. I'm going to duplicate the layers' logic in C on the ESP32, then use PyTorch to quantize my model weights.

I would like to try the ESP-DL library from Espressif, but unfortunately they use a different quantization scheme than PyTorch and claim you can't use your model with their API. This is not entirely true, it's just that there is no easy way to use your model with their quantization scheme, but you certainly can.
The key thing to understand is how both quantization schemes work. PyTorch uses a zero-point and a scale:
```
f32 = (i8 - zero_point) * scale
```
while ESP-DL uses an exponent:
```
f32 = i8 * (2 ** exponent) 
```
which they claim is not compatible.
We can make this work though, if we force PyTorch to use a zero-point with value 0 and a scale that is always 2 to the power of a (signed) int.
Getting a zero-point of 0 is easy: we have to set the qconfig to use a symmetric quantization scheme. The scale is a little bit harder, but no rocket science either: we can subclass a suitable observer and override how it calculates its qparams, so that the scale is updated to
```
scale = 2 ** round( log2( scale ))
```
Like so:
```
import torch
import torch.ao.quantization as Q


class ESP32MovingAverageMinMaxObserver(Q.MovingAverageMinMaxObserver):
    """Observer that snaps the scale to the nearest power of two."""

    def _calculate_qparams(self, min_val: torch.Tensor, max_val: torch.Tensor):
        s, z = super()._calculate_qparams(min_val, max_val)
        # Needs a symmetric qint8 scheme, otherwise the zero-point is not 0.
        assert (z == 0).all()
        # Round log2(scale) to an integer exponent, then rebuild the scale.
        s = 2 ** s.log2().round().clamp(-128, 127)
        return s, z
```
Then when it is time to export the weights we also export the exponent we use in ESP-DL by simply getting the log2 of the scale of the weight tensor.
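The effect of the power-of-two scale can be illustrated without PyTorch. The sketch below uses only the standard library; the function names and the example scale of 0.0123 are made up for the illustration. With a zero-point of 0 and `scale = 2 ** exponent`, the PyTorch formula `(i8 - zero_point) * scale` reduces to the ESP-DL form `i8 * 2 ** exponent`.

```python
import math


def power_of_two_scale(scale):
    """Round a scale to the nearest power of two; return (scale, exponent)."""
    exponent = round(math.log2(scale))
    exponent = max(-128, min(127, exponent))
    return 2.0 ** exponent, exponent


def quantize(x, exponent):
    """Float to int8 with zero-point 0 and scale 2 ** exponent, saturating."""
    q = round(x / 2.0 ** exponent)
    return max(-128, min(127, q))


def dequantize(q, exponent):
    """int8 to float: f32 = i8 * 2 ** exponent."""
    return q * 2.0 ** exponent


scale, exponent = power_of_two_scale(0.0123)   # example scale from an observer
print(scale, exponent)
print(quantize(0.5, exponent), dequantize(quantize(0.5, exponent), exponent))
```

The exported exponent is the only per-tensor number needed on the device; multiplying by the scale becomes a bit shift.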


isa-summary.pdf: Xtensa Instruction Set (PDF, 4.50 MB, 05/13/2024 at 18:48)

esp32-s3_technical_reference_manual_en.pdf: Contains the ISA for the S3 extended instructions (PDF, 13.94 MB, 05/13/2024 at 18:47)

Running a PyTorch Model on the ESP32 S3
