2x2 MaxPool in ESP32 S3 Assembly

A project log for Running a PyTorch Model on the ESP32 S3

This log describes the steps I am taking to make a model trained in PyTorch work on the ESP32 S3 (M5Stack Core S3)

E/S Pronk, 05/30/2024 at 23:10
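xor a8, a8, a8                  # a8 = 0: output row pointer
xor a9, a9, a9                  # a9 = 0: input row pointer
movi a7, {image_height}         # a7 = rows left to process
movi a6, {image_width}
slli a12, a6, 4                 # a12 = bytes per row (width * 16 channels)
or a13, a12, a12
addi a13, a13, -16              # a13 = row stride minus one pixel
max_col:
    movi a6, {image_width}      # a6 = columns left in this pair of rows
    or a10, a8, a8              # a10 = output pointer for this row
    add a8, a12, a8             # advance to the next output row
    or a11, a9, a9              # a11 = input pointer for this pair of rows
    addx2 a9, a12, a9           # advance the input by two rows
max_block:
    ee.vld.128.ip q0, a11, 16   # q0 = top-left pixel, a11 += 16
    ee.vld.128.xp q1, a11, a13  # q1 = top-right pixel, a11 -> bottom-left
    ee.vmax.s8.ld.incp q2, a11, q5, q0, q1  # q5 = max(q0, q1), q2 = bottom-left
    ee.vmax.s8.ld.incp q3, a11, q6, q5, q2  # q6 = max(q5, q2), q3 = bottom-right
    sub a11, a11, a12           # back to the top row, at the next 2x2 block
    ee.vmax.s8 q7, q6, q3       # q7 = maximum of all four pixels
    st.qr q7, a10, 0            # store the pooled pixel
    addi a10, a10, 16
    addi a6, a6, -2
    bnez a6, max_block

    addi a7, a7, -2
    bnez a7, max_col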
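
To cross-check the listing, here is a minimal Python sketch of what the two loops appear to do, as I read the assembly: the pooling runs in place, and the output rows keep the input row stride. The function names `vmax_s8` and `maxpool2x2_inplace` are made up for this illustration, and the sketch assumes an even height and width and exactly 16 channels per pixel.

```python
LANES = 16  # one 128-bit Q register holds 16 signed 8-bit channels


def vmax_s8(x, y):
    """Lane-wise maximum of two 16-lane vectors (models EE.VMAX.S8)."""
    return [max(a, b) for a, b in zip(x, y)]


def maxpool2x2_inplace(mem, image_height, image_width):
    """2x2 max pool over a flat HWC list with 16 channels per pixel.

    Results overwrite the start of each even input row, so output rows
    keep the input row stride (an assumption based on the listing).
    """
    row_bytes = image_width * LANES            # a12 = image_width << 4
    out_row = 0                                # a8
    in_row = 0                                 # a9
    for _ in range(image_height // 2):         # max_col loop
        out_ptr = out_row                      # a10
        in_ptr = in_row                        # a11
        out_row += row_bytes                   # add a8, a12, a8
        in_row += 2 * row_bytes                # addx2 a9, a12, a9
        for _ in range(image_width // 2):      # max_block loop
            below = in_ptr + row_bytes
            top_left = mem[in_ptr:in_ptr + LANES]
            top_right = mem[in_ptr + LANES:in_ptr + 2 * LANES]
            bottom_left = mem[below:below + LANES]
            bottom_right = mem[below + LANES:below + 2 * LANES]
            best = vmax_s8(vmax_s8(top_left, top_right),
                           vmax_s8(bottom_left, bottom_right))
            mem[out_ptr:out_ptr + LANES] = best
            in_ptr += 2 * LANES                # next 2x2 block
            out_ptr += LANES                   # next output pixel
    return mem
```

Overwriting the input is safe here because each store lands at or before the addresses that are still to be read.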

I managed to get my ESP32 S3 emulator to a level where it can run a lot of the SIMD instructions. I implement instructions only when the assembly code I'm writing actually needs them. This keeps the process a bit more manageable, because writing the emulator is mind-numbing, carpal-tunnel-inducing torture as it is...

Using HWC format with a number of channels that is a multiple of 16 helps with alignment. With 16 channels of 8-bit data, each pixel is exactly one EE.VLD.128.IP instruction. The max pool combines these loads with EE.VMAX.S8.LD.INCP, which calculates the element-wise maximum of two vectors of 16 signed 8-bit integers while loading new data.
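
The alignment argument can be checked with a few lines of Python. This is only a sketch: `pad_channels`, `hwc_offset`, `to_s8` and `vmax_s8_bytes` are hypothetical helper names, and it assumes int8 data in a buffer whose base address is itself 16-byte aligned.

```python
VECTOR_BYTES = 16  # a 128-bit load moves 16 int8 values


def pad_channels(channels):
    """Round a channel count up to the next multiple of 16."""
    return -(-channels // VECTOR_BYTES) * VECTOR_BYTES


def hwc_offset(y, x, width, channels):
    """Byte offset of pixel (y, x) in an int8 HWC buffer."""
    return (y * width + x) * channels


def to_s8(byte):
    """Reinterpret an unsigned byte (0..255) as a signed 8-bit value."""
    return byte - 256 if byte >= 128 else byte


def vmax_s8_bytes(x, y):
    """Signed lane-wise maximum of two 16-byte vectors, returned as bytes."""
    return bytes(a if to_s8(a) >= to_s8(b) else b for a, b in zip(x, y))


# With 16 channels, every pixel starts on a 16-byte boundary.
offsets = [hwc_offset(y, x, width=8, channels=16)
           for y in range(4) for x in range(8)]
all_aligned = all(offset % VECTOR_BYTES == 0 for offset in offsets)
```

The comparison has to be signed: the byte 0x80 is -128 and must lose against 0x7F, which is 127, even though it is the larger value when read as unsigned.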