
More results !

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDES — 05/03/2023 at 22:27

Things went incredibly well lately for this little peep-hole optimisation and benchmarking fest.

The final code clocks 130% faster than the original one ! With a weird discovery : the speedup is only 123% when the system is idle. I can only suppose that when I run another application at the same time, it messes with the Branch History Table of the branch predictor. Anyway, it's still a very, very good improvement. There is room for more, but it would be almost pointless : all the low-hanging fruit has been picked, and the use of x86 asm already makes the code less portable.

The most advanced version is coded in pure assembly language, bringing the throughput from 0.1276853 to 0.2847745 words per hertz. It uses branchless coding techniques, in particular direct reuse of the carry flag, as shown below:

uint64_t orbit_1_5(uint64_t modulo,
    uint64_t parity, uint64_t maxlen) {

  uint64_t len=0, X=1, Y=parity, nodulo=-modulo;

  asm volatile("\n\t"  // the asm code:
    ".p2align 4,,12\n"
"reloop15:\n\t"

// X to Y
    "movq %[nodulo], %%rsi \n\t"  // t = -modulo
    "addq      %[X],  %[Y] \n\t"  // Y += X;
    "addq      %[Y], %%rsi \n\t"  // t = - modulo + Y
    "cmovc   %%rsi,   %[Y] \n\t"  // if CC=0, move t to Y
    "adcq       $0,   %[X] \n\t"  // increment X if CC=0

    "jnz   Xnot0_15        \n\t" //  if (X == 0) {
    "cmpq        $1,  %[Y] \n\t"  //    if ((Y == 1)
    "je     Yeq1_15        \n\t"
    "cmpq %[maxlen],%[len] \n\t"  //     || (len > maxlen))
    "jb    Xnot0_15        \n\t"
"Yeq1_15:"
    "incq     %%rax        \n\t"  // return len+1;
    "ret                   \n"
"Xnot0_15:"                       // }

// Y to X
    "movq %[nodulo],  %%rsi \n\t"  // t = -modulo
    "addq        $2, %[len] \n\t"  // len+=2
    "addq      %[Y],   %[X] \n\t"  // X += Y;
    "addq      %[X],  %%rsi \n\t"  // t = X - modulo
    "cmovc   %%rsi,    %[X] \n\t"  // if CC=1, move t to Y
    "adcq       $0,    %[Y] \n\t"  // increment X if CC=1

    "jnz   reloop15        \n\t"   //  if (Y != 0) goto reloop;
    "cmpq        $1,  %[X] \n\t"   //  if (X != 1)
    "jnz   reloop15        \n\t"   //    goto reloop;

    : [Y] "+r" (Y), // outputs
      [X] "+r" (X),
      [len]  "+a" (len)  // pinned to rax
    : [nodulo] "r" (nodulo),  // input
      [maxlen] "r" (maxlen)
    : "cc", "rsi" // clobber
  );
  return len;
}

Getting there was epic...

I would rather have coded directly in asm, with NASM for example, but I try to keep the code portable and adaptable. I had to work around many coding limitations of my older GCC 6.4 (no branches to external labels allowed) and ridiculous semantic changes in GCC 11. Oh, and I had mistakenly introduced a bug... But this is solved, as my programs and scripts prove a match between the reference code and all the intermediate versions, up to the one above.

So this means that with a more-than-2× speedup, I can scan more moduli, or do it in a shorter time. A further gain comes from the early exit, whose threshold is set just above the length of a maximal orbit. We know that no orbit size exists between the maximal and perfect orbits, so there is no point scanning beyond the maximal size. This saves another 20% of total runtime (at most) and is an easy one to code.

So here are some stats using the "relative benchmark" developed earlier:

function      perf. ref   perf. new   ratio
orbit_0 :     0.1332254   0.1335702   0.0026046
orbit_0_1 :   0.1341385   0.1348277   0.0051707
orbit_0_2 :   0.1331803   0.1187875   -0.1080367
orbit_0_4 :   0.1348576   0.1621874   0.2026608
orbit_0_5 :   0.1275372   0.1386805   0.0873863  <-- unexplained drop for the reference code ??
orbit_1_0 :   0.1273422   0.1385944   0.0883720
orbit_1_1 :   0.1281276   0.2480188   0.9357474
orbit_1_2 :   0.1275311   0.2431667   0.9067352
orbit_1_3 :   0.1278160   0.2873751   1.2483585
orbit_1_4 :   0.1272402   0.2867738   1.2538537
orbit_1_5 :   0.1279600   0.2864957   1.2389944
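As far as I can tell, the ratio column is the relative gain (new − ref) / ref; a quick sanity check against the first and last rows of the table (my sketch, `rel_gain` is just an illustrative helper):

```c
#include <math.h>

// Relative gain as it appears in the table: (new - ref) / ref.
// A ratio of 1.0 means the new code is twice as fast.
static double rel_gain(double ref, double new_perf) {
    return (new_perf - ref) / ref;
}
```

For example, `rel_gain(0.1332254, 0.1335702)` gives roughly the 0.0026 of the orbit_0 row, and `rel_gain(0.1279600, 0.2864957)` gives roughly the 1.2390 of the orbit_1_5 row.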

The progress goes down and up because, before each enhancement can be applied, the code first needs adaptation and validation.

Small variations at the end are in the 1 or 2% ballpark, so orbit_1_5 is not really slower than orbit_1_4, as further tests have shown.

The big jumps come from:

  1. The branchless code
  2. The direct use of the carry flags

If iterations are guaranteed to come in pairs, it's even possible to go faster: the adcq $0 can be skipped and merged into the following addq, which becomes adcq %[Y], %[X].
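As a sketch (untested, and only valid if no flag-clobbering instruction such as the addq on len sits in between, since CF must survive from one half-iteration to the next):

```
// current: the carry is consumed alone, then the next addition starts fresh
"adcq       $0,   %[X] \n\t"   // X += CF
    ...
"addq      %[Y],  %[X] \n\t"   // X += Y

// merged: a single adcq consumes the carry and performs the next addition
"adcq      %[Y],  %[X] \n\t"   // X += Y + CF
```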

Oh, you wanted to try it yourself ? Then get the source code !

From there, going from 1 word every 8 clock cycles to 1 every 3.6, scanning the space up to 1 million will be easier. I'll soon see how fast it goes when scattered over 12 cores / 24 threads...
