
Branchless gPEAC

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDES • 05/02/2023 at 21:01 • 0 Comments

The latest (and first) test result is clear: a basic branchless iteration is 35% faster than the reference code.

...
46  2693509383  334015557  0.1240076   2693511054  448954961  0.1666802  0.3441131
47  2693530123  332309597  0.1233733   2693583097  449581581  0.1669084  0.3528731
48  2693527773  333995349  0.1239992   2693637400  449048669  0.1667072  0.3444211
49  2693514118  334861793  0.1243215   2693502259  449658083  0.1669418  0.3428230
# 50:  0.1233530   0.1664931   0.3497823

OYIIIISSSSSS....

Thanks to the benchmarking framework developed before, I could clearly and accurately get this encouraging result right away, which is only a first step on the way to objective peep-hole optimisation. Already, using this new code saves 1/4 of the computation time for a mass-scale scan...

The reference code is below:

uint64_t orbit_1(uint64_t modulo,   // !!! must be negative !
    volatile unsigned int *stop_flag) { //  uint64_t max) {
  uint64_t len=0, X = 1, Y = 0, t;

reloop:
    Y += X;
    t = Y + modulo;  // Y - -modulo : reduction
    if (!(t >> 63)) {
      Y = t;
      X++;
    }

    if (((Y == 1) && (X==0))
     || (*stop_flag != 0))
      goto return_1;

    X += Y;
    t = X + modulo; // X - -modulo : reduction
    if (!(t >> 63)) {
      X = t;
      Y++;
    }

    len+=2;
  if ((X != 1) || (Y != 0))
    goto reloop;
  return len;

return_1:
  len++;
  return len;
}

It's close to the very original source code, but already with one "optimisation": the modulo parameter is negative, so the inner code can use an addition opcode, which can be merged into a LEA when combined with other additions, following the coding trick of the log 122. Funky modulo.
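In practice this only means that the caller negates the modulo before calling the orbit function. A minimal sketch of such a wrapper (the name scan_one is mine, for illustration only):

uint64_t scan_one(uint64_t m, volatile unsigned int *stop) {
  // pass the "negative" modulo (2^64 - m) so the inner loop can
  // reduce with a plain addition instead of a subtraction
  return orbit_1((uint64_t)0 - m, stop);
}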

Already, we're setting the stage for later: x86 has no MIN/MAX opcode, but it has CMOV, which is a natural fit just after the ADD in a version written in asm.

What prevents the compiler from optimising the conditional jump into a conditional move is the conditional increment. Again, in asm, it's very simple: since the carry flag is not affected by the CMOV, one can issue an ADC instruction right after it. Three opcodes and you're done! Well, not really, since the carry flag must be complemented...
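In C, one way to nudge the compiler in that direction is to expose the carry of the reducing addition explicitly, for example with GCC/Clang's __builtin_add_overflow. The sketch below is mine, not the project's code, and it keeps the reference semantics only as long as the modulo stays below 2^63 (so the carry and the sign test agree):

#include <stdint.h>
#include <stdbool.h>

// One half-step of the orbit, reduction and conditional increment only
// (to be used after Y += X). Assumes modulo m < 2^63, passed negated
// (neg_modulo = 2^64 - m) as in orbit_1/orbit_2.
static inline void half_step_c(uint64_t *X, uint64_t *Y, uint64_t neg_modulo) {
  uint64_t t;
  bool carry = __builtin_add_overflow(*Y, neg_modulo, &t); // CF of the ADD
  if (carry)     // carry is set exactly when Y >= m : the reduction applies
    *Y = t;
  *X += carry;   // conditional increment folded into a plain addition
}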

We can also see that the result of the conditional incrementation leaves useful flags for the following test, X == 0 for example. But first, let's see how to make and test an intermediate version. Here is one that is 35% faster:

uint64_t orbit_2(uint64_t modulo,   // !!! must be negative !
    volatile unsigned int *stop_flag) { //  uint64_t max) {
  uint64_t len=0, X = 1, Y = 0, t, s;

reloop:
    Y += X;
    t = Y + modulo;  // Y - -modulo : reduction
    s = (int64_t)t >> 63;
    if (!s)
      Y = t;
    X += 1+s;

    if (((Y == 1) && (X==0))
     || (*stop_flag != 0))
      goto return_1;

    X += Y;
    t = X + modulo;
    s = (int64_t)t >> 63;
    if (!s)
      X = t;
    Y += 1+s;

    len+=2;
  if ((X != 1) || (Y != 0))
    goto reloop;
  return len;

return_1:
  len++;
  return len;
}

I only changed a few lines and the critical instructions are already looking very different:

addq    %rcx, %rdx
leaq    (%rdx,%rdi), %r8
movq    %r8, %r9
sarq    $63, %r9
cmove   %r8, %rdx
leaq    1(%rcx,%r9), %rcx

The unpredictable jump has disappeared, and the movq and sarq can be shaved off with direct asm code.
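As a hypothetical sketch of what that direct asm could look like (mine, not the final code), the sign-mask computation can be dropped by reusing the carry of the reducing ADD, since the CMOV leaves the flags alone:

// Hand-picked opcodes for one half-step: the ADD of the negated modulo
// leaves its carry in CF, the CMOV does not touch the flags, and the ADC
// picks them up again, so the explicit movq/sarq disappear.
// Assumes m < 2^63, passed negated in neg_mod as before.
static inline void half_step_asm(uint64_t *X, uint64_t *Y, uint64_t neg_mod) {
  uint64_t t;
  *Y += *X;                       // Y += X, as in the C versions
  __asm__ (
    "mov   %[y], %[t]  \n\t"
    "add   %[m], %[t]  \n\t"      // t = Y + (-m) ; CF=1 when Y >= m
    "cmovc %[t], %[y]  \n\t"      // Y = t when the reduction applies
    "adc   $0, %[x]    \n\t"      // X += CF, still valid after the CMOV
    : [x] "+r" (*X), [y] "+r" (*Y), [t] "=&r" (t)
    : [m] "r" (neg_mod)
    : "cc"
  );
}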
