
Another strategy to scan the statespace

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDES • 08/20/2021 at 21:57

The bidirectional scan is nice and a great exercise... but the backwards code is slower, so the speedup is below 2.

On top of that, it is not scalable to more CPUs, when I could easily use 4 in SMP, and I just got my hands on a 12-thread CPU.

A different approach is needed.

For the "massively parallel" systems (mainly FPGA or RPi) I considered a totally parallel algorithm with jobs, with a direct association of an arc for each job. Each arc would be characterised by Y=0, a start X and end X, and a length (cycle count). This has the advantage that no intercommunication is required : the arcs can be attributed to a compute node then a compute core, with a central authority and even a hierarchy of sub-attribution. The drawback is the crazy size of the reports and the slow algorithm to sort everything at the end.

With a common memory pool, using SMP or GPGPU, another approach is possible where the sorting/post-processing takes place almost in parallel. The first major change is that each compute node runs longer jobs: they don't stop when they find Y=0. They send the data (new X and length) and continue. A job is only stopped when the manager detects a collision, and then it gets reallocated a new starting point.
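A rough C sketch of such a long-running job, assuming the forward PEAC step as I understand it from the earlier logs ((X, Y, C) becomes ((X+Y+C) mod 2^w, X, carry out)); send_crossing() and the initial carry are placeholders, not the real code:

```c
#include <stdint.h>
#include <stdbool.h>

#define WIDTH 26
#define MASK  ((1u << WIDTH) - 1)

/* Placeholder: enqueue (X, length) for the manager thread. */
void send_crossing(uint32_t X, uint64_t length);

void scan_job(uint32_t start_X, volatile bool *cancelled)
{
    uint32_t X = start_X, Y = 0, C = 1;   /* initial carry: an assumption */
    uint64_t length = 0;

    while (!*cancelled) {
        uint32_t t = X + Y + C;           /* fits in 32 bits since w < 32 */
        Y = X;
        X = t & MASK;
        C = t >> WIDTH;
        length++;
        if (Y == 0) {                     /* crossing: report, don't stop */
            send_crossing(X, length);
            length = 0;                   /* each arc is reported separately */
        }
    }
}
```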

All the steps go forward, for two reasons:

  1. This is faster (the backwards version is maybe 30% slower)
  2. This makes the collision detection safer and easier, because the manager just has to compare with the starting X of the other jobs.

This is scalable from 2 to many more CPUs; the critical parts are the synchronisation and the scheduling. One thread has to manage all the messages from all the compute nodes, in a serial way to prevent race conditions, at least for small clusters.

A sort of hash table, with a size a few times larger than the number of jobs, will hold the starting X and length of the currently running jobs, and will also manage the merging of arcs. When a job finds an X that matches the starting X of another job, the first job gets cancelled and the other job inherits the starting parameters of the first one. It's more or less how you handle a linked list...
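Here is a rough sketch of that manager-side splice, with a plain array and a linear search standing in for the hash lookup detailed below; all the names are made up:

```c
#include <stdint.h>

#define NJOBS 12                 /* e.g. one job per hardware thread */

struct job {
    uint32_t start_X;            /* start of the (possibly merged) arc */
    uint64_t length;             /* accumulated length so far          */
    int      active;
};
struct job jobs[NJOBS];

/* Called serially by the manager: job j reports a Y=0 crossing
   at X, reached after len more steps. */
void on_crossing(int j, uint32_t X, uint64_t len)
{
    jobs[j].length += len;
    for (int k = 0; k < NJOBS; k++) {
        if (k != j && jobs[k].active && jobs[k].start_X == X) {
            jobs[k].start_X = jobs[j].start_X;  /* inherit j's start */
            jobs[k].length += jobs[j].length;   /* and its length    */
            jobs[j].active  = 0;                /* j gets cancelled...  */
            /* ...and the manager would allocate it a new start here */
            return;
        }
    }
}
```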

This saves a lot of memory space because it is proportional to the number of compute nodes, not the size of the statespace.

A 2-level table seems appropriate:

  1. The first level is a hash table. Low collision rates are desired, so the empty space should be maximised; it should still fit in L2 or L3 though. The only value to store is the corresponding thread number. For a small system with fewer than 255 CPUs, this fits in one byte. A one-megabyte table gives a collision rate of about nCPU/1M, or roughly 1/4096 in the worst case. For CUDA, a 16-bit word is required.
  2. A table of threads/jobs (starting at index 1 because 0 would mean "not used"?) where the temporary length and starting X are stored, as sketched after this list.
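A possible C layout for this 2-level table, with my own names and a deliberately naive hash:

```c
#include <stdint.h>

#define HASH_BITS 20                 /* 2^20 slots = 1 MB at one byte each */
#define HASH_SIZE (1u << HASH_BITS)

/* Level 1: 0 = empty slot, 1..255 = job number.
   CUDA would need uint16_t here to address more threads. */
uint8_t level1[HASH_SIZE];

/* Level 2: per-job start and temporary length; index 0 is "not used". */
struct job_entry {
    uint32_t start_X;
    uint64_t length;
} level2[256];

/* Naive hash: a real one should spread the start_X values evenly
   to keep the collision rate near nCPU/2^20. */
static inline uint32_t h(uint32_t X) { return X & (HASH_SIZE - 1); }

/* Register a freshly allocated job so other jobs can run into it. */
void register_job(uint8_t j, uint32_t start_X)
{
    level2[j].start_X = start_X;
    level2[j].length  = 0;
    level1[h(start_X)] = j;          /* collision handling omitted */
}
```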

However, to perform a better allocation for the new threads, a "scoreboard" of all the visited Y=0 points should be maintained. A single bit per point is enough. For w32, this means 4 Gbit, or 512 MB: that's going to thrash the memory like crazy... For w26, that's 8 MB, which fits in the L3 cache of the i7-10750 (leaving 4 MB for other uses).
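The scoreboard itself is just a flat bitmap, one bit per X value at Y=0; a trivial version for w26 (helper names are mine):

```c
#include <stdint.h>

#define W 26
uint32_t scoreboard[(1u << W) / 32];   /* 2^26 bits = 8 MB */

static inline void mark_visited(uint32_t X)
{
    scoreboard[X >> 5] |= 1u << (X & 31);
}

static inline int was_visited(uint32_t X)
{
    return (scoreboard[X >> 5] >> (X & 31)) & 1;
}
```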

To prevent the memory latency from stalling the computations, a decoupled/cancellable model lets the computation jobs enqueue their results in their FIFO and move on, without any other consideration.
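For example, a single-producer/single-consumer ring buffer per job would do. This is only a generic sketch using C11 atomics, not the actual implementation:

```c
#include <stdint.h>
#include <stdatomic.h>

#define FIFO_SIZE 1024                /* power of 2, arbitrary capacity */

struct crossing { uint32_t X; uint64_t length; };

struct fifo {
    struct crossing buf[FIFO_SIZE];
    _Atomic uint32_t head;            /* consumer (manager) side */
    _Atomic uint32_t tail;            /* producer (job) side     */
};

/* Producer side: returns 0 when full, so the job can decide
   whether to spin, yield or throttle itself. */
int fifo_push(struct fifo *f, struct crossing c)
{
    uint32_t t = atomic_load_explicit(&f->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&f->head, memory_order_acquire);
    if (t - h == FIFO_SIZE)
        return 0;
    f->buf[t & (FIFO_SIZE - 1)] = c;
    atomic_store_explicit(&f->tail, t + 1, memory_order_release);
    return 1;
}
```

The relaxed/acquire/release pairing keeps the producer's inner loop lock-free, so the scanning code never waits on the manager.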
