After reading some online tutos (such as this excellent page about GCC at https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) I was ready to write my first program to fully exploit my modern computer. The i7-10750H supports AVX2, which extends SSEx to 256 bits wide. This means 8 parallel operations at 3.5GHz on a single core, and the processor has 12 logical cores... So that's a speedup of 8×16 compared to the initial version of my code. The CPU is advertised for reaching 5GHz but it seems Linux is configured by default for "power saving", a single thread will run easily at 3.6GHz, but the frequency drops a bit when more threads are added.
We already know that AVX2 works as 128-bit chunks internally so the 256-bit wide chunks wouldn't normally add extra performance but the loop is pretty tight (Screen Rant Pitch Meetings fans ?) and the latency is not insignificant so I have no hesitation to increase it further to 512 bits, or 16 parallel 32-bit computations. That CPU is meant to have a 45W TDP, right ?
I can get the code to process about 3.4 billion additions back-to-back per SIMD lane, or 55 billion adds per second per core. That could project to about 660 billion adds per second peak. It seems I can push the width to 24 words to get the runtime to decrease a bit, and still get 67G adds/second (800Ga/s for 12 threads, to be tested).
AVX2 is great for that kind of speedup (and CUDA even more) but I have not found how to check a condition on a result.
More realistically: with the full iterative code (sans condition), I can get 740M steps per second (per core and per lane). I use 24 lanes (3 ZMM registers in parallel/unrolled to mask the latency) so that's 17G steps per second per thread. AVX2 prevents the "turbo boost" and limits the clock to 3.5GHz when running at full scale but that's still about 200G steps per second (potentially). This would scan w26 in about 50 hours ideally...
But this performance will certainly drop again when I find how to compare and branch.
And the runtime increases by >60% when it runs on all available threads... it runs at about 450M steps/lane/sec, that's still 130G steps/s combined.