
The road to w32

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDES, 12/29/2021 at 09:35

The current version of my scanner is pscan_20211229.tgz. At the moment it runs w25 on the 12-thread laptop at about 500 semi-arcs (crossings) per second. w24 took 5 hours to complete in scalar mode on 12 threads, so I expect w25 to run for the whole day. The new laptop, the better approach (which scans only half of the states thanks to the symmetry) and better programming make w26 "reasonably accessible", unlike the 2 months taken by the previous linear scan of w26. It might run for about 5 days, so I can start it right away (tomorrow), because the next step will take longer than that to program.

The next step is the SIMD version. I am approaching it very progressively, with a first step that runs 2 scalar scans simultaneously in every thread. Managing the data flow and debugging all the corner cases takes a while, but in the end, replacing the scalar computation with wider SIMD intrinsics will be a snap. And even without the ultra-wide intrinsics, going from 1 to 2 scalar computations per cycle will better exploit the inherent parallelism of the i7 and might bring some more speed, maybe 50% (though the lousy hyperthreading of the early i7 would reduce this).
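To make the "2 scalar scans per thread" idea concrete, here is a minimal sketch in C. It is not the pscan code: the step function is my assumption of a PEAC-style iteration (add X, Y and the carry, keep w bits, shift the old X into Y), the type and function names are hypothetical, and the fixed step count is a simplification, since a real scan stops at a crossing and the two scans would finish at different times.

```c
#include <assert.h>
#include <stdint.h>

/* One scan state; the field names are illustrative, not taken from pscan. */
typedef struct { uint32_t x, y, c; } peac_state;

/* ASSUMED step: t = X + Y + C, then Y = X, X = t mod 2^w, C = carry out. */
static void peac_step(peac_state *s, unsigned w)
{
    assert(w >= 1 && w <= 32);
    const uint64_t mask = (UINT64_C(1) << w) - 1;
    uint64_t t = (uint64_t)s->x + s->y + s->c;
    s->y = s->x;
    s->x = (uint32_t)(t & mask);
    s->c = (uint32_t)(t >> w);
}

/* Advance two independent states in the same loop. Neither chain depends
 * on the other, so a superscalar core can overlap the two computations. */
static void peac_run2(peac_state *a, peac_state *b, unsigned w, uint64_t steps)
{
    assert(w >= 1 && w <= 32);
    const uint64_t mask = (UINT64_C(1) << w) - 1;
    uint32_t ax = a->x, ay = a->y, ac = a->c;
    uint32_t bx = b->x, by = b->y, bc = b->c;
    for (uint64_t i = 0; i < steps; i++) {
        uint64_t ta = (uint64_t)ax + ay + ac;
        uint64_t tb = (uint64_t)bx + by + bc;
        ay = ax;                     by = bx;
        ax = (uint32_t)(ta & mask);  bx = (uint32_t)(tb & mask);
        ac = (uint32_t)(ta >> w);    bc = (uint32_t)(tb >> w);
    }
    a->x = ax; a->y = ay; a->c = ac;
    b->x = bx; b->y = by; b->c = bc;
}
```

Calling `peac_run2(&a, &b, w, n)` should leave both states exactly where `n` calls of `peac_step` on each would, which gives a cheap way to check the paired version against the scalar one before going wider.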

With 256-bit SIMD, there is a direct speedup of 8, though AVX2 is notorious for slowing the i7's clock, so maybe 6 in practice. This makes w27 and w28 reasonably accessible. w25's log is about 800MB already, so when SIMD becomes operational, it will be time to work on the fusion program. w26's log weighs about 1.7GB, so all the data will fit in RAM and it will be quite fast to fuse. But the logs are piling up, larger and larger...
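The factor of 8 corresponds to eight 32-bit lanes in one 256-bit register. The sketch below shows that layout in portable C, without intrinsics: a structure of arrays with one slot per lane and an inner loop that a compiler may vectorize. The step is the same assumed PEAC-style iteration as in any scalar version (not the actual pscan code), the names are hypothetical, and 32-bit lanes only hold the sum X + Y + C without overflow for widths up to 31.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8   /* 8 lanes x 32 bits = 256 bits */

/* Structure of arrays: lane k owns x[k], y[k], c[k]. Illustrative names. */
typedef struct {
    uint32_t x[LANES];
    uint32_t y[LANES];
    uint32_t c[LANES];
} peac_lanes;

/* Advance all lanes by `steps` iterations of the ASSUMED step
 * t = X + Y + C, Y = X, X = t mod 2^w, C = carry out.
 * With w <= 31 the sum always fits in a 32-bit lane. */
static void peac_lanes_run(peac_lanes *s, unsigned w, uint64_t steps)
{
    assert(w >= 1 && w <= 31);
    const uint32_t mask = (UINT32_C(1) << w) - 1u;
    for (uint64_t i = 0; i < steps; i++) {
        for (int k = 0; k < LANES; k++) {
            uint32_t t = s->x[k] + s->y[k] + s->c[k];
            s->y[k] = s->x[k];
            s->x[k] = t & mask;
            s->c[k] = t >> w;
        }
    }
}
```

Every lane performs the same operations on different data, so the inner loop maps directly onto packed 32-bit add, AND and shift instructions once intrinsics replace it.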

But w28 is not really interesting (unless there is a big surprise?) and w32 is the goal. The SIMD version is only an intermediate step so that the algorithm can be easily translated to CUDA. Only then will I be able to complete w32. But the runtime will be massive and I want to make sure the log is complete, accurate and reproducible. I might spread the run over many smaller ranges, maybe 16M points each (like w24), and the next version of the scanner should support an "increment" value other than 1, so it can skip many values, quickly sample the statespace, and compare the results with the normally-scanned points. This will provide some "quality assurance" because I don't want any doubt to creep in about the validity and integrity of the logs.
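The "increment" idea can be sketched as follows, again in C and with hypothetical names. `scan_range` walks a range of starting points with a configurable increment, and `count_mismatches` checks that each sampled result equals the result the full scan (increment 1) recorded for the same point. `toy_point` is only a placeholder for the real per-point computation, which is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-point computation, e.g. "scan from this start and return a result". */
typedef uint64_t (*point_fn)(uint64_t start);

/* PLACEHOLDER for the real scan of one point: any deterministic function. */
static uint64_t toy_point(uint64_t start)
{
    return (start * UINT64_C(2654435761)) >> 7;
}

/* Scan the points first, first+increment, ... below `last`.
 * Stores at most `cap` results in `out` and returns how many were stored. */
static size_t scan_range(uint64_t first, uint64_t last, uint64_t increment,
                         point_fn f, uint64_t *out, size_t cap)
{
    assert(increment >= 1);
    size_t n = 0;
    for (uint64_t p = first; p < last && n < cap; p += increment)
        out[n++] = f(p);
    return n;
}

/* Compare a sampled scan against a full scan of the same range:
 * sampled[i] must equal full[i * increment]. Returns the mismatch count. */
static size_t count_mismatches(const uint64_t *full, size_t n_full,
                               const uint64_t *sampled, size_t n_sampled,
                               uint64_t increment)
{
    size_t bad = 0;
    for (size_t i = 0; i < n_sampled; i++) {
        uint64_t j = (uint64_t)i * increment;
        if (j >= n_full || full[j] != sampled[i])
            bad++;
    }
    return bad;
}
```

A quick pass with a large increment costs a small fraction of the full run, and a non-zero mismatch count against the full log of the same range would flag either a scanner bug or a damaged log. The same `first`/`last` bounds also serve to cut the statespace into ranges: if the starting points are indexed by a 32-bit value, ranges of 2^24 points would give 256 of them, though the symmetry mentioned earlier may reduce that.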

So the path is: scalar -> multithreaded -> SIMD -> CUDA, and I'm halfway through. The big leap will be CUDA (I have never used it), but all those preliminary steps will make the conversion easier, since CUDA seems to implement many of these paradigms.
