4x4 Convolutions

A project log for Running a PyTorch Model on the ESP32 S3

This log describes the steps I am taking to make a model trained in PyTorch work on the ESP32 S3 (M5Stack Core S3)

E/S Pronk • 05/17/2024 at 00:58

We're so used to 3x3 convolutions that we rarely think about switching it up, and why would we? The 3x3 convolution is very efficient, and two of them back to back with a nonlinearity in between usually outperform an equivalent 5x5 convolution. So why use a 4x4 convolution instead?

Technically, a 4x4 kernel can be constructed by padding a 3x3 kernel with zeros, which means it can serve as a drop-in replacement. When you look at the SIMD instruction set available on the ESP32 S3, you quickly see that you are best off working with 16 bytes at a time, and you avoid headaches by keeping your data accesses aligned to 16-byte boundaries. So for the first layer in my network I replaced the 3-to-16-channel 3x3 convolution with a 4-channel (RGB+pad) 4x4 one, and retrained the network. The bigger convolution requires a bit more resources during training and slightly improves the accuracy of the network. But now I can load the 4x4x4 weight kernel for a single output channel and multiply-accumulate it with a block of the source image in only 10 instructions (using NHWC format):

ee.ld.accx.ip %[bias], 0                              # preload the 40-bit accumulator with the bias
ee.vld.128.ip q0, %[in], 128                          # input row 0: 4 px x 4 ch = 16 bytes, step one image row
ee.vld.128.ip q4, %[weight], 16                       # weight row 0
ee.vld.128.ip q1, %[in], 128                          # input row 1
ee.vmulas.s8.accx.ld.ip q5, %[weight], 16, q0, q4    # acc += q0·q4, load weight row 1
ee.vld.128.ip q2, %[in], 128                          # input row 2
ee.vmulas.s8.accx.ld.ip q6, %[weight], 16, q1, q5    # acc += q1·q5, load weight row 2
ee.vld.128.ip q3, %[in], 128                          # input row 3
ee.vmulas.s8.accx.ld.ip q7, %[weight], 16, q2, q6    # acc += q2·q6, load weight row 3
ee.vmulas.s8.accx q3, q7                              # acc += q3·q7

Imagine all the shifting and masking needed to make this a 3x3x3 convolution.
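The instruction sequence above can be sketched in NumPy to see exactly what it computes. This is a minimal emulation, not the deployed code; it assumes 4 input channels and a 128-byte row stride (i.e. a 32-pixel-wide image, since 32 px × 4 ch = 128 bytes, matching the stride in the `ee.vld.128.ip` ops):

```python
import numpy as np

rng = np.random.default_rng(0)
W_IMG, C_IN = 32, 4            # assumed image width and channel count
img = rng.integers(-128, 128, size=(8, W_IMG, C_IN)).astype(np.int8)
weight = rng.integers(-128, 128, size=(4, 4, C_IN)).astype(np.int8)
bias = 1234

# Each ee.vld.128.ip pulls one kernel row's worth of data: 4 px * 4 ch
# = 16 bytes. ee.vmulas.s8.accx multiply-accumulates it against the
# matching 16 weight bytes into the accumulator, preloaded with the bias.
acc = bias
for row in range(4):                        # q0..q3 against q4..q7
    in_vec = img[row, 0:4, :].reshape(16).astype(np.int32)
    w_vec = weight[row].reshape(16).astype(np.int32)
    acc += int(in_vec @ w_vec)              # one vmulas per kernel row

# Cross-check against a straightforward 4x4x4 convolution at output (0, 0).
ref = bias + int(
    (img[0:4, 0:4, :].astype(np.int32) * weight.astype(np.int32)).sum()
)
assert acc == ref
```

In NHWC, one kernel row of the 4-channel 4x4 kernel is exactly one 16-byte register, which is why no shifting or masking shows up anywhere in the loop.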

Once you reach a point where the number of channels is a multiple of 16 you're out of the woods, since every pixel's channel vector then fills whole 16-byte registers, as long as you're using NHWC :)
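The drop-in claim from earlier, that a 3x3 kernel zero-padded to 4x4 computes the same thing, is easy to verify numerically. A NumPy sanity-check sketch (the helper name is mine, not from the project):

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' cross-correlation for a single-channel image."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=np.int32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = int((x[i:i+kh, j:j+kw] * k).sum())
    return out

rng = np.random.default_rng(1)
x = rng.integers(-128, 128, size=(8, 8)).astype(np.int32)
k3 = rng.integers(-128, 128, size=(3, 3)).astype(np.int32)
k4 = np.zeros((4, 4), dtype=np.int32)
k4[:3, :3] = k3                      # pad with a zero row and zero column

out3 = conv2d_valid(x, k3)           # 6x6 output
out4 = conv2d_valid(x, k4)           # 5x5 output (kernel is one px bigger)
assert (out4 == out3[:5, :5]).all()  # identical where both are defined
```

The padded kernel's valid output window is one pixel smaller in each dimension, but every value it produces matches the 3x3 result, which is why retraining at 4x4 starts from essentially the same function class.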