
The problem with SIMD

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDES, 07/17/2021 at 17:34 (0 comments)

I am still evaluating the options available to re-run w26 and, hopefully, to run w32 in a reasonable time. So far, I am looking at these methods:

The last one would provide the largest speedup (in the 10K× range today) but it relies on a very wide SIMD programming model, like 32 lanes of 32 bits. Branches, conditions and the like create all kinds of disruptions and the performance drops (see https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming) for example). An RTX 3090 could perform 10496 integer ADD32 per cycle at 1.7GHz (about 17,800 Giga-adds/second), but detecting the 0, then branching and processing the result, stalls ALL the 31 other lanes.
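The lockstep problem can be sketched on a CPU by treating an array of 32 states as one SIMD group. The C sketch below is only an illustration under stated assumptions: `toy_step`, the 8-bit `WIDTH` and the zero test on `y` are hypothetical stand-ins, not the project's actual iteration or kernel. It compares a version that branches as soon as a lane sees a zero with a version that folds the comparison result into a counter as plain data, so that every lane follows the same instruction stream and the rare events can be handled after the batch.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 32                      /* one 32-lane group, as in the text */
#define WIDTH 8u                      /* toy word width, for illustration only */
#define MASK  ((1u << WIDTH) - 1u)

typedef struct { uint32_t x, y, c; } state_t;

/* Hypothetical add-with-carry step: a stand-in, not the project's code. */
static void toy_step(state_t *s)
{
    assert(s->x <= MASK && s->y <= MASK && s->c <= 1u);
    uint32_t t = s->x + s->y + s->c;  /* at most 2*MASK + 1, cannot overflow */
    s->x = s->y;
    s->y = t & MASK;
    s->c = t >> WIDTH;                /* carry out, 0 or 1 */
}

/* Give every lane a different, arbitrary starting point. */
static void init_lanes(state_t lanes[LANES])
{
    for (uint32_t i = 0; i < LANES; i++)
        lanes[i] = (state_t){ .x = i + 1u, .y = 0u, .c = 0u };
}

/* Divergent version: each lane tests and branches after every step.
   On lockstep hardware the other lanes would wait while one lane
   executes the body of the if. */
static void run_branchy(state_t lanes[LANES], uint32_t steps,
                        uint32_t zeros[LANES])
{
    for (uint32_t n = 0; n < steps; n++)
        for (uint32_t i = 0; i < LANES; i++) {
            toy_step(&lanes[i]);
            if (lanes[i].y == 0u)
                zeros[i]++;
        }
}

/* Uniform version: the comparison result (0 or 1) is added as data,
   so all lanes execute exactly the same operations. */
static void run_uniform(state_t lanes[LANES], uint32_t steps,
                        uint32_t zeros[LANES])
{
    for (uint32_t n = 0; n < steps; n++)
        for (uint32_t i = 0; i < LANES; i++) {
            toy_step(&lanes[i]);
            zeros[i] += (uint32_t)(lanes[i].y == 0u);
        }
}

/* Returns 1 if both versions end with identical states and counts. */
static int versions_agree(uint32_t steps)
{
    state_t a[LANES], b[LANES];
    uint32_t za[LANES] = {0}, zb[LANES] = {0};
    init_lanes(a);
    init_lanes(b);
    run_branchy(a, steps, za);
    run_uniform(b, steps, zb);
    for (uint32_t i = 0; i < LANES; i++)
        if (a[i].x != b[i].x || a[i].y != b[i].y ||
            a[i].c != b[i].c || za[i] != zb[i])
            return 0;
    return 1;
}
```

On a scalar CPU a compiler may well emit the same code for both loops; the distinction only matters on lockstep hardware, where the uniform form postpones any real processing of a hit until the whole group has finished its batch.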

Same issue with the VideoCore GPU of the Raspberry Pi SBCs. The Pi3+ has 3 available clusters of 4 pipelines that can compute 16 ADD32 at 400MHz: that's 76G add32 per second. But this GPU is designed for graphics computations and any disruption (from tests, branches, processing) reduces the real throughput by an order of magnitude.
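The peak figures quoted for both GPUs are simply lane count times clock frequency. As a quick check, here is a minimal C sketch; the helper names are made up for this note, and it assumes one add per lane per clock, which is an upper bound and not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Peak adds per second in units of 1e9 (Giga-adds/s),
   assuming one ADD32 per lane per clock cycle. */
static double peak_gadds(uint32_t lanes, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return (double)lanes * clock_ghz;
}

/* RTX 3090 figure from the text: 10496 lanes at 1.7 GHz. */
static double rtx3090_peak(void)
{
    return peak_gadds(10496u, 1.7);           /* about 17843 */
}

/* VideoCore figure from the text: 3 clusters x 4 pipelines x 16 adds
   at 400 MHz. */
static double videocore_peak(void)
{
    return peak_gadds(3u * 4u * 16u, 0.4);    /* about 76.8 */
}
```

An order-of-magnitude loss from tests and branches would bring the VideoCore number down to the range of a few Giga-adds per second, which is what makes plain CPU cores competitive here.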

I could enlist some laptops but they are quite demanding (power, noise, room, OS/HDD) and Intel's hyperthreading reduces the performance of individual tasks. OTOH, the Pi3B+ has 4 independent ARM cores at 1.4GHz, which I might overclock a bit (with the help of a fan). These are "normal cores" with no sequencing/scheduling/threading constraints, so even a little cluster would be cheap, efficient, unobtrusive, quiet and repurposable: if I have to invest in HW, I would like to use it for other, unrelated projects later. Having a cluster of 4 or 6 Pi3B+ would let me start with a POSIX implementation and then, once it is operational, explore how to accelerate it with the GPU (if possible at all).
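To give an idea of what a POSIX version could look like on one 4-core board, here is a minimal sketch using pthreads, with one independent worker per core. Everything in it is an assumption made for illustration: the `job_t` layout, the toy add-with-carry loop in `count_zeros` and the choice of counting zeros are placeholders, not the project's actual scan.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define CORES 4                       /* the Pi3B+ has 4 ARM cores */
#define WIDTH 8u                      /* toy word width, illustration only */
#define MASK  ((1u << WIDTH) - 1u)

typedef struct {
    uint32_t seed;                    /* in:  starting value for this worker */
    uint32_t steps;                   /* in:  number of iterations to run */
    uint32_t zeros;                   /* out: how many times y became 0 */
} job_t;

/* Hypothetical workload: a toy add-with-carry loop with a zero test. */
static uint32_t count_zeros(uint32_t seed, uint32_t steps)
{
    uint32_t x = seed & MASK, y = 0u, c = 0u, zeros = 0u;
    for (uint32_t n = 0; n < steps; n++) {
        uint32_t t = x + y + c;
        x = y;
        y = t & MASK;
        c = t >> WIDTH;
        if (y == 0u)                  /* this branch only costs this core */
            zeros++;
    }
    return zeros;
}

static void *worker(void *arg)
{
    assert(arg != NULL);
    job_t *job = arg;
    job->zeros = count_zeros(job->seed, job->steps);
    return NULL;
}

/* Runs one job per core. Returns 0 on success, -1 if a thread could
   not be created or joined. */
static int run_on_cores(job_t jobs[CORES])
{
    pthread_t tid[CORES];
    int started = 0, status = 0;
    for (int i = 0; i < CORES; i++) {
        if (pthread_create(&tid[i], NULL, worker, &jobs[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        if (pthread_join(tid[i], NULL) != 0)
            status = -1;
    return status;
}
```

Each worker owns its `job_t`, so no locking is needed, and a hit in one thread does not delay the three others. Build with `-pthread` on GCC or Clang.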

This also makes me think about implementing a TCP/IP dispatcher soon.
