Close

Carry on with GCC

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDESYann Guidon / YGDES 05/27/2021 at 17:160 Comments

The last logs made me discover that GCC has some builtins to manage overflow which might make a difference for the x86 users since the carry is so critical with this algorithm. So let's explore the results.

The first thing is that it doesn't explicitly support adding with the carry flag itself, unless there is some unknown GCC magic under the hood. The ADDC instruction is an implicit ternary add, or "triadd" (sorry for the neologism) that is not supported on all processors though it plays a huge role in multi-precision maths. This is critical because the algo should perform C+X+Y in one operation only. This ensures the result is squarely bound in the 0-0x1FFFF range (and not 0x1FFE). If you perform it in two steps, you end up with two carries to manage and more code to write, hence a slower execution with justifies using a larger register...

Some more web search shows that GCC can generate ADDC when doing double-precision adds but not as a single "triadd". This is interesting though because maybe using a double-precision variable for X and Y could solve the problem...

Anyway GCC seems to keep one of its promises :

uint32_t addmultiple(
  uint32_t a, uint32_t b, uint32_t c) {
  unsigned int t, o;

  o = __builtin_sadd_overflow (a, b, &t);
  return t+c+o;
}

gives the following asm:

gcc -Wall -g -Os -S carrybuiltin.c -o carrybuiltin.S =>

addmultiple:
        xorl    %eax, %eax ==> clear the MSB in preparation for seto
        addl    %esi, %edi
        seto    %al        ==> set the LSB
        addl    %edi, %eax ==> instead of adcl edx, edi
        addl    %edx, %eax
        ret

The use of SETO is a first step ! However GCC doesn't get the clue for the next ADDL, maybe because I explicitly use an idiom that puts the carry into an integer variable. GCC not dumb: GCC see "add integer variables", GCC add integer variables.  

All hope is not lost though:

#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

uint32_t addmultiple(
  uint32_t a, uint32_t b, uint32_t c) {
  unsigned int t, o;
  o = ( (t=a+b) > a);
  return t+c+o;
}

====>

leal    (%rdi,%rsi), %eax
cmpl    %eax, %edi   \
adcl    %edx, %eax   /  should be merged as addl
ret

Ah, that's much better! The carry is merged almost by magic! But this is not "there" yet: the CMP is redundant with the ADD instruction. And because GCC WANTS to return the result in EAX, it insists on using LEA for the computation, which does not affect the carry flag. Sometimes, being smart can play against you. So you must be extra-extra-smart, at the risk of coding some obscure kludge...

So let's kludge even more !

uint32_t addmultiple(
  uint32_t a, uint32_t b, uint32_t c) {
  a+=b;
  return a+c+ (a<b); // BEWARE of the comparison direction!
}
====> gcc -Wall -g -Os -S carrybuiltin.c -o carrybuiltin.S
addmultiple:
        addl    %esi, %edi
        movl    %edx, %eax ==> harmless move
        adcl    %edi, %eax
        ret 

There ! There ! I can't believe it but it's there ! There is an extra movl but this is not relevant because this ultra-short code snippet is not representative and will disappear in a larger function through countless register renaming.

Just a warning though: get the comparison right! If you invert it, GCC can't optimise and will emit a longer instruction sequence.

Unfortunately, it seems to be only a "one-trick poney" that even refuses to repeat its feat:

uint32_t addmultiple(
  uint32_t a, uint32_t b, uint32_t c, uint32_t d) {

  a+= b;
  a+= c+ (a<b);
  a+= d+ (a<c);
  return a;
}

addmultiple:
        addl    %esi, %edi
        adcl    %edx, %edi => Good !
        xorl    %eax, %eax
        cmpl    %edi, %edx
        seta    %al
        addl    %ecx, %edi
        addl    %edi, %eax => Bad !
        ret

Come on, GCC !

 
-o-O-0-O-o-
 

Edit:

OK one reason why GCC couldn't use adcl a second time is because my source code was wrong. It must have been:

  a+= b;
  a+= c+ (a<b);
  e+= d+ (a < (c+ (a<b)));
  return a+e;

That changes everything ! GCC correctly infers the common sub-expression and does not compute it twice but fails to infer adcl again anyway.

Le sigh.

I suppose I have to write in asm directly now.

Discussions