Close

sdrawkcab gnioG

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

yann-guidon-ygdesYann Guidon / YGDES 05/13/2021 at 04:150 Comments

The log 22. Some more theory explored the idea of computing the sequence in reverse, with the hope of halving the time needed to compute a whole orbit. Today I made this microbenchmark: orbits_forwd2.c.

This is the result of several basic optimisations targeting OOO CPUs such as the i7 of my laptop, and I got pretty interesting results. First let's look at the basic backwards algorithm:

// reverse the split
tb = (C << WIDTH) | X;
Xb = Y;
Yb = tb - Y;

// Adjust the carry
Cb = Xb - Yb;
Cb = ((int32_t) Cb) >> WIDTH;
Yb += Cb; // N + -1 = N - 1
Cb &= 1;

 So far it's quite good, as it avoids branch prediction horrors by computing C directly. There is no access to the carry bit so it is computed by hand. The result of the comparison is sign-extended into the full register, giving either 0 or -1. This is added to selectively decrement Y and we're done.

But it's long and slow so there is no hope or benefit on this front, except for the claim and confirmation that indeed, the computation is reversible (as was supposed) and the theory is confirmed: if there is only one successor and one predecessor, then the values must loop and create an orbit. I checked with several programs up to w=16 and the algo is good.

The story doesn't end here though. The comparison part introduces more delay through dependencies so I tried to break them through substitutions. I was not disappointed and after some progressive enhancements, I came up with the version included in orbits_forwd2.c.

    // reverse the split
    t = C | X;
    X = Y;
// Adjust the carry
//  C = X - Y;  // normal comparison
//  C = Y - (t - Y); // Substitution
    C = (2*Y) - t;  // more substitution
    Y = t - Y;

    t = ((int32_t) C) >> WIDTH;
    Y += t; // N + -1 = N - 1
    C &= SIZE;

Among the enhancements, some lines have been moved up or down, t is used 2×, C's value is left unshifted for a direct use "in place" and successive substitutions provide a formulat that can be run in parallel, breaking the dependency as expected. This creates a weird situation where C and Y compute the opposite values : Y-t and t-Y. Little by little, this makes the formula look more rich and interesting than the forward formula...

Now let's speak numbers : through the optimisations, the backwards version is usually 50% slower than the forward version. Some more tweaks (smoother interactions with the loop body) bring the overhead to about 1/3 but this is never going to be as fast as the forward version. The fastest runs for w=16 I had are 2.88s for forward and 3.88 for backwards.

Furthermore, the computation seems to saturate the i7's integer pipelines and the 2nd thread's performance is affected. For example, if I run the microbenchmark while the 2 cores are already running the orbit search with w=26 and w=27, the microbenchmark's runtime is doubled. Or I don't understand Intel's "hyperthreading" and it's more lousy than I thought.

This means that for the orbit search to work efficiently, it must use one core to walk forward and another to walk backwards at full speed. Synchronisation then becomes a nightmare : how do we know when and where the two trajectories meet ?

The first idea is naturally to compute both directions in "lock-step": compute forward, compare, compute backwards, compare, rinse, repeat. This implies that all computations happen in the same CPU core but there is the risk of saturation of the units. So this might not accelerate the whole search.

Another idea is to run two cores in parallel, one in each direction, but synchronisation is a big issue. The comparisons can be minimised when Y=0 and for w=27, this could fit into 2^24=16M bytes, but still larger than my L3 cache so it's going to thrash a lot. Worse : cache coherency will further slow everything down and let's forget about the "huge TLB" support.

Anyway : the computation is reversible with a reasonable amount of effort but it doesn't seem to provide a practical advantage, considering today's computers architecture.

Discussions