Close

Please, GCC !

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

yann-guidon-ygdesYann Guidon / YGDES 05/24/2021 at 08:380 Comments

I get tired of fighting GCC's weird code generator that won't get a clue from my source code. I'm (still) using gcc6.4.1 (2017, I know) but I don't know how to make it understand basic things... This is critical at my level because any little performance win saves many days of computation !

I get variable performance tiers (sometimes 50% bumps) and I can't see how to attribute (or force) them. I made many tests and I simplified the source code to this point, trying to remove as many jumps as possible out of the inner loop:

  while (N < limit) {
    chunk=0;

    do {
      do {
        chunk++;
        // Compute forward :
        t = X + Y + C;
        Y = X;
        C = t >> WIDTH;
        X = t & m;
      } while (Y != 0);

      crossings++;
      if ((X == 1) && (C == 0)) {
        N+=chunk;
        // Something something
        return NULL;
      }
    } while(chunk < chunksize);
    N+=chunk;
    printf("Forward Step %"PRIu64" : X=%d, Y=%d, C=%d \n",
        N, X, Y, C);
  }

I can see unexplained jumps in performance and I looked at the generated asm, where I found some good and some less good surprises...

gcc -g -S -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_05.c -o pt_orbit.S

And I get this:

.LVL6:
        .loc 1 86 0
        cmpq    limit(%rip), %r15  => while (N < limit) {
        movl    12(%rsp), %r9d
        jnb     .L3                => go to end of main loop
.LVL7:
.L16:
        .loc 1 68 0 discriminator 1
        xorl    %esi, %esi   => esi = chunk = 0
        jmp     .L8          => wait ? Why jump here ??
.LVL8:
        .p2align 4,,10
        .p2align 3
.L5:
        .loc 1 112 0
        cmpq    %r14, %rsi   => Why compare here and not at the bottom of the loop as I wrote ?
        ja      .L20
.L4:        => Oh and this unwanted prefix has eaten from the alignment...
        .loc 1 96 0 discriminator 1
        movl    %ebp, %ebx   => wait, why a circular assignation ?
        movl    %r12d, %ebp  => X into ebp
.LVL9:
.L8:      => and the loop starts only now ?
        .loc 1 93 0 discriminator 1
        leal    (%rbx,%r9), %eax    => t = Y + C
        .loc 1 91 0 discriminator 1
        addq    $1, %rsi            => chunk++
.LVL10:
        .loc 1 93 0 discriminator 1
        addl    %ebp, %eax          => t += X
.LVL11:
        .loc 1 95 0 discriminator 1
        movl    %eax, %r9d          => wait, another assignation ? t goes to r9 now ?
.LVL12:
        .loc 1 96 0 discriminator 1
        movzwl  %ax, %r12d          => X = t & mask
        .loc 1 95 0 discriminator 1
        shrl    $16, %r9d           => C = t >> 16
.LVL13:
        .loc 1 101 0 discriminator 1
        testl   %ebp, %ebp           => loop again if ebp is not zero.
        jne     .L4
        .loc 1 103 0
        addl    $1, %r13d
.LVL14:
        .loc 1 104 0
        cmpl    $1, %r12d      => X == 1 ?
        jne     .L5
        testl   %r9d, %r9d    => C == 0 ?
        jne     .L5
.LVL15:
        .loc 1 106 0
        movl    $.LC1, %edi   => prepare printf...
        .loc 1 105 0
        addq    %r15, %rsi
.LVL16:
        .loc 1 106 0
        movl    %r13d, %edx
        xorl    %eax, %eax
.LVL17:
        call    printf

The movzwl thing is a good way to avoid moving the 16 bits of mask around, and it performs a move/copy to another register. But then WHY not exploit this and rename a few things around that ? EAX is not used after the movzwl so it should have saved one mov ! It should look like this :

   movzwl  %ax, %r12d          => X = t & mask
   shrl    $16, %eax           => C = t >> 16

So the inner loop should look a bit like this:

; rsi => chunk
; r12 => X
; eax => t, C
; ebp => Y

.L4:
   addl    %ebp,  %eax    => t = Y + C
   addq    $1,    %rsi    => chunk++
   addl    %r12d, %eax    => t += X
   movl    %r12d, %ebp    => Y = X
   movzwl  %ax,   %r12d   => X = t & mask
   shrl    $16,   %eax    => C = t >> 16

   testl   %ebp, %ebp     => loop again if ebp is not zero.
   jne     .L4

As promised, that's 5 opcodes for the computation, 1 counter update and 2 conditional jump instructions. That one should fly, right ? But should I have to switch to LLVM to get this ?

Meanwhile the orbit search for w=26 is still running at maybe half the potential top speed...*

 
-o-O-0-O-o-
 

Grumble !!!!

Same source code, different optimisation level:

[yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_05.c -o pt_orbit && /usr/bin/time ./pt_orbit
Measuring the first orbit of the Pisano-Carry state space
 for WIDTH = 16 bits, or about 2147516416 iterations.
Starting to scan forward from Step 0 : X=1, Y=0, C=0 
Forward scan reached init state after 2147516415 steps and 33035 crossings
4.04user 0.00system 0:04.15elapsed 97%CPU (0avgtext+0avgdata 1420maxresident)k
0inputs+0outputs (0major+69minor)pagefaults 0swaps
[yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -Os pt_orbit_05.c -o pt_orbit && /usr/bin/time ./pt_orbit 
Measuring the first orbit of the Pisano-Carry state space
 for WIDTH = 16 bits, or about 2147516416 iterations.
Starting to scan forward from Step 0 : X=1, Y=0, C=0 
Forward scan reached init state after 2147516415 steps and 33035 crossings
2.40user 0.00system 0:02.42elapsed 99%CPU (0avgtext+0avgdata 1528maxresident)k
0inputs+0outputs (0major+69minor)pagefaults 0swaps

To achieve this I also reused a trick I just found and used in the "desired asm" code above:

C += X + Y;
Y = X;
X = C & MASK;
C = C >> WIDTH;

Removing t helps a bit... The result is the fastest version so far, the last best time was around 2.7s. The asm dump follows:

.L2:
        .loc 1 84 0
        cmpq    limit(%rip), %rbp
        jnb     .L6
        xorl    %esi, %esi           => chunk = 0
.LVL8:
.L3:
        .loc 1 93 0 discriminator 1
        movl    %r14d, %eax         => whaaat ? again ??
        movl    %r12d, %r14d        => Y = C
.LVL9:
        .loc 1 89 0 discriminator 1
        incq    %rsi                => chunk++
.LVL10:
        addl    %eax, %ebx          => C += Y
.LVL11:
        .loc 1 91 0 discriminator 1
        addl    %r12d, %ebx         => C += X
.LVL12:
        .loc 1 93 0 discriminator 1
        movzwl  %bx, %r12d          => X = C & mask
.LVL13:
        .loc 1 94 0 discriminator 1
        shrl    $16, %ebx           => C = C >> 16
.LVL14:
        .loc 1 104 0 discriminator 1
        testl   %r14d, %r14d        => Y == 0 ?
        jne     .L3
        .loc 1 106 0
        incl    %r15d               => crossings++
.LVL15:
        .loc 1 107 0
        cmpl    $1, %r12d
        jne     .L4
        testl   %ebx, %ebx
        jne     .L4

This version of the code gets about 700-750M steps per second on my old i7.

Discussions