Close

How to double the scan rate

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

yann-guidon-ygdesYann Guidon / YGDES 11/26/2021 at 17:430 Comments

As noted in 76. More efficient scan of the state space and reminded in 84. Compute better, compute more, there is a simple way to double the scan speed within a single loop, when trying to cover the whole orbit. To see how, let's look at a diagram from the old log 21. Some theory:

In the video, I used a different picture:

The symmetry and complementarity are apparent, which mean that when one orbit is being scanned, the other is implicitly scanned as well.

Let's say we have an orbit defined by the width W and the coordinates X, Y and C. The complementary orbit has coordinates X', Y' and C'. Because of the central symmetry, we have X + X' = 2^W -1 and Y + C + Y' = 2^W (because C+C'=1)

The existing code tests for intersection of the orbit with Y=0 (which is just a delayed version of X=0):

do {
    arclen++;
    C += X + Y;    Y = X;
    X = C & MASK;  C = C >> WIDTH;
} while (Y != 0);

This needs to be modified to test for 2 intersections: Y=0 and Y=MASK.

Oh and the Y = X part could... should be removed, and the variables renamed, thus unrolling the loop a bit.

do {
    arclen++;
    C += X + Y;
    X = C & MASK;  C = C >> WIDTH;
// test Y here
    arclen++;
    C += X + Y;
    Y = C & MASK;  C = C >> WIDTH;
// test X here
} while (1);

So it's back to the original "add X to Y and Y to X" dance. And it's already paying off. Do you remember the parallel version with forward and backward iterations? It scanned 2147516415 steps in 1.71 seconds. The unrolled version gets us close to that:

$ gcc -g -Wall -DWIDTH=16 -Os faster_02.c -o faster &&  \
> /usr/bin/time ./faster
Ended odd after 2147516415 steps
2.07user 0.00system 0:02.07elapsed 99%CPU 

Pretty good, right ? Here is the code.

  do {
    C += X + Y;
    Y = C & MASK;  C = C >> WIDTH;
    if (X==0) {
      if (Y==1) {
        arclen++;
        printf("Ended odd after %"PRIu64" steps\n", arclen);
        break;
      }
    }

    arclen+=2;
    C += X + Y;
    X = C & MASK;  C = C >> WIDTH;
    if (Y==0) {
      if (X==1) {
        printf("Ended even after %"PRIu64" steps\n", arclen);
        break;
      }
    }
  } while (1);

The unrolled version is pretty efficient, despite a longer, redundant code.

Compare the unrolled version to the following, earlier version, that ran in 2.23s:

  do {
    do {
      N++;
      C += X + Y;
      Y = X;
      X = C & MASK;
      C = C >> WIDTH;
    } while (Y != 0);
  } while (X != 1);

10% speed increase is nice to have.

Even arclen++; has been simplified to free up an adder in the pipeline. I increment by 2 on each loop, and "fix" on the exit of the odd condition. Unfortunately there was no noticeable effect on my platform. I keep it around just for the comfort of "if ever..."

But we're not yet there: despite the speedup, it's still missing one half of the comparisons. The comparisons keep execution units busy so more comparisons means more time to compute, but the OOO keep these out of the critical datapath so the performance should not be significantly affected.

.

.

.

Good news everyone !

The new version of the loop core with multiple tests has no effect on its speed.

loopstart:
    C += X + Y;
    Y = C & MASK;  C = C >> WIDTH;
    if (X==0)
      goto endofloopodd;
    if (X==MASK)
      goto endofloopodd_comp;

loopeven:
    C += X + Y;
    arclen+=2;
    X = C & MASK;  C = C >> WIDTH;
    if (Y!=MASK)
      goto endofloopeven_comp;
    if (Y!=0)
      goto loopstart;

The bad news is the hit when run in parallel: it increases for all running threads, for each added thread, on a i7-2620M:

$ /usr/bin/time ./faster4
2.05user 0.00system 0:02.08elapsed 98%CPU 

$ /usr/bin/time ./faster4 & /usr/bin/time ./faster4
2.16user 0.00system 0:02.17elapsed 99%CPU
2.16user 0.00system 0:02.19elapsed 98%CPU

$  /usr/bin/time ./faster4 & /usr/bin/time ./faster4 & /usr/bin/time ./faster4
2.17user 0.00system 0:02.17elapsed 99%CPU
2.73user 0.00system 0:02.73elapsed 99%CPU
2.75user 0.00system 0:02.76elapsed 99%CPU 

$ /usr/bin/time ./faster4 & /usr/bin/time ./faster4 & /usr/bin/time ./faster4 & /usr/bin/time ./faster4
2.91user 0.00system 0:02.92elapsed 99%CPU
2.91user 0.00system 0:02.93elapsed 99%CPU
2.94user 0.00system 0:02.97elapsed 99%CPU
2.94user 0.00system 0:02.97elapsed 98%CPU

In a sense, it tells a lot about the efficiency of the code, if the "hyperthreading" system of the i7 can't manage to run both threads at full speed. OTOH it's worrying...

Trying on a more recent i7-7600U:

1 thread: 1.75
2 thread: 1.75, 1.75
3 threads: 1.75, 2.05, 2.06
4 threads: 2.11, 2.11, 2.11, 2.11

The 2.11s run time remains constant even when overloading the system so it seems Intel got their shit together on the more recent generations. The 7600U will serve as a workhorse for longer runs but I have the 10750H available too.

Note: gcc optimisation levels don't affect the runtime, -Os = -O2 = -O3. But the runtime is 3× longer without them...

.

.

Now, I have another problem: all arcs start and end at Y=0. But the added test interrupts the arc at Y=MASK... Should I work with semi-arcs? I'll have to crunch the numbers...

.

_____________________________________________________________

Update 20211203 :

There was a bug:

    if (Y!=MASK)      goto endofloopeven_comp;

should be

    if (Y==MASK)      goto endofloopeven_comp;

and by miracle, another weird bug disappeared.

The performance is affected by a few percents, due to the possible compiler optimisations that are not possible, now that the circumvention of the secondary bug is gone. But  2.13s is still good, considering that it covers all semi-arcs. faster_05.c  is the new working version.

Discussions