Close

I need a big cluster.

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

yann-guidon-ygdesYann Guidon / YGDES 07/02/2021 at 22:390 Comments

The complete scan of w26 is about to end in a few hours and I'm already thinking beyond that, and more particularly : HOW it would be possible. It is certainly out of reach of my laptop. I didn't even solve the issues with the "bidirectional" code but the 2x gain would have been in vain, considering the huge amount of computation that is required.

Despite being good to know, I'm not interested in w28, w29, w30 or w31 and I would like to directly validate and characterise w32, even as I imagine it's not going to fly far, but even knowing its "altitude" would be a big progress, in case the number of orbits is "low enough". The w32 state space represents a prominent target because a lot of platforms are 32-bit wide and using w26 adds some complexity and reduces the throughput. There are 3 XOR-related operations in the w26 code that would be eliminated if w32 had a sufficient quality : no unreachable state and a reduced number of orbits with similar size (no short orbit please).

So far my i7 has computed 2 million billions of cycles in 2 months for w26. From there I estimate that mapping w32 completely would require 2⁶⁴ cycles, or 2¹³=8K times more computations, which would amount to 16000 months, or 1300 years. This is just ridiculous but it's reminiscent of other challenges, such as the famed FPGA-based DES crackers, Bitcoin mining farms, Mersenne prime checkers, SETI, protein folders and many other distributed endeavours (scientific or not).

Memory-wise, it's not totally crazy but beyond my 16GB : I imagine that I'll consider only crossings of the Y=0 axis, of which there are 2³², or 4G. Then I want to know the coordinates of the precedent and next crossings (that's 2 uint32), the length of that arc (that's a uint64) and the known start of the list of arcs (the lowest of the arcs in the linked list thus formed) => that's 20 bytes. Total=80GB : the size of a modest SSD but also 5× my current installed RAM.

Fortunately, when laid out this way, the challenge is almost "trivially paralleled" and clearly CPU-bound: there would be 2³² subtasks started with an individual coordinate and each subtask must return the length of its assigned arc, plus the next X for which Y=0. There would be about 2⁶⁴ cycles to run and the more cores (even the simplest) the better. Given N cores, the total work can be divided into 2³² / N ranges that each core explores.

Knowing how one can compute the precedent arc, with the "backwards" code, it is tempting to compute them in parallel "to save time" but this risks duplicating the efforts. OTOH it is a safety measure, to double-check the validity of the work, at the cost of doubling the exploration time.

Some recent chips feature tens, even hundreds of small/simple cores so this would be the way to go. For example, a beefy GPU that would be programmed in CUDA, but I don't have that and have no idea how to get there. Maybe a multicore server processor like SUN's Niagara or Ampere's 128-core ARM processor. I have no access to a modern/recent supercomputer, academic or otherwise. Even a JS-based applet that works in the background would do the trick... but I don't see how to enroll/enlist enough volunteers. Particularly in an age where stealth miners are now automatically blocked.

Anyway the goal would be to determine the number of orbits that cross Y=0, and their length. This is inferred by the procedure explained above with the 2³² subtasks, then some data processing will string the arcs together and collate the data. From there it is easy to figure how many states have not been accessed, if it is the case.

I might get inspiration from the cryptocurrency miners, who have deployed a lot of methods to get cheap CPU time, but in return their abuses have been thwarted and it makes my work more difficult.

Discussions