
Going both ways

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDES • 05/19/2021 at 01:30 • 0 Comments

I am now writing a program that runs the forward and backward iterations simultaneously on two cores, using the POSIX threads library.

The performance seems reasonable; I included a few little scheduling tricks to shave off a fraction of the run time here and there.

Forward alone:

[yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
Measuring the first orbit of the Pisano-Carry state space
 for WIDTH = 16 bits, or about 2147516416 iterations.
Starting to scan forward from Step 0 : X=1, Y=0, C=0 
Forward scan reached init state after 2147516414 steps and 33035 crossings
2.72user 0.00system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 1664maxresident)k
0inputs+0outputs (0major+71minor)pagefaults 0swaps

Backwards alone:

[yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
Measuring the first orbit of the Pisano-Carry state space
 for WIDTH = 16 bits, or about 2147516416 iterations.
Starting to scan backwards from Step 0 : X=1, Y=0, C=0 
Backwards scan reached init state after 2147516414 steps and 33035 crossings
4.63user 0.00system 0:04.64elapsed 99%CPU (0avgtext+0avgdata 1572maxresident)k
0inputs+0outputs (0major+68minor)pagefaults 0swaps

Backwards and forward simultaneously:

[yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
Measuring the first orbit of the Pisano-Carry state space
 for WIDTH = 16 bits, or about 2147516416 iterations.
Starting to scan forward from Step 0 : X=1, Y=0, C=0 
Starting to scan backwards from Step 0 : X=1, Y=0, C=0 
Forward scan reached init state after 2147516414 steps and 33035 crossings
Backwards scan reached init state after 2147516414 steps and 33035 crossings
7.63user 0.00system 0:04.76elapsed 160%CPU (0avgtext+0avgdata 1540maxresident)k
0inputs+0outputs (0major+70minor)pagefaults 0swaps

The overhead is marginal, but it explodes when other computations occupy the remaining two hardware threads. There is nothing to be done about that, though.

The big question is how to synchronise the cores. How do we know that the orbit is closed, and where?

So far the strategy is to run as fast as possible until X==1. This happens approximately 2^(W-1) times during a whole orbit. However the orbit has approximately 2^(2W-1) steps, so these "crossings" get rarer as W increases (for W=16 above: 2147516414 steps for 33035 crossings, or roughly 65000 steps between crossings).

For now, on this "rare" event, I test for Y==0 to end the loop, but this is also where the threads could compare their respective results, namely the value of Y tagged with the iteration count. C is most probably 1 at that point, so it is not worth recording.

Anyway this cuts down the number of comparisons, and for W=26 a crossing would happen on average every 2²⁵ iterations, or 32M steps. At almost 1G steps per second, that's about 30 times per second. A buffer of one second will help smooth things out for the moments when a lot of crossings occur on one thread while the other is in a long trajectory between distant crossings. So let's have a FIFO with 64 or 128 values.

Semaphores should be avoided as much as possible because they block the other thread, which could otherwise perform useful computations. However, safe synchronisation primitives must still be used to prevent the usual pitfalls of parallelism.

 
-o-O-0-O-o-
 

The desired behaviour is that the faster thread sends a sort of message (the step number and the value of Y) to the slower thread, which compares that Y with its own series of values. The messages are materialised and queued in a FIFO, which the "faster/sender" writes and the "slower/receiver" reads. This matches what Wikipedia describes in Non-blocking algorithm:

[...] some non-blocking data structures are weak enough to be implemented without special atomic primitives. These exceptions include:
...

So there is no need for semaphores, mutexes, futexes, Compare-And-Swap or other complex or blocking techniques, as long as proper memory-ordering hygiene is observed. That is not trivial either, but it is promising in the sense that performance will not be impacted.

A simple implementation is found on page 26 of Lecture 18: Fine-grained synchronization & lock-free programming, with a reference to the DDJ article Lock-Free Code: A False Sense of Security.

