Pisano-carry Checksum algorithm

Add X to Y and Y to X, says the song. And carry on.

Similar projects worth following
Another purely additive, non-cryptographic, fast and minimalistic checksum.

Aimed at being a 32-bit checksum for very low resources,
- very small, using only XOR and ADD,
- without the flaws of Adler32 and Fletcher, yet the same size and complexity,
- almost as good as CRC32, which it may replace when the memory footprint is critical,
- very easy to implement in digital circuits or microcontrollers with 16-bit accumulators,
- suitable for the detection of trivial corruption in network packets or files.

Disclaimer: This is a linear, reversible checksum so it's not appropriate for cryptography or hash tables. This is not even the fastest checksum ever.
But it's still quite good.

The most common application uses 2×16 bits. The 2×8 bits version doesn't work, and 2×32 bits seems untestable for now.

If you wonder why, just click and read the log.

The original algo/project name was "Fibonacci checksum" but it later appeared that it was not the most accurate choice because Leonardo Pisano (the real name of Fibonacci) is the name associated to the periodicity of the sequence under modulo.


In 2012, Piano Magic released their album "Life Has Not Finished with Me Yet". One song contains a weird repeating pattern...

Glen Johnson's lyrics are often cryptic and more evocative than objective, but any geek's mind would cling on this mantra at the end:

Add X to Y and Y to X

This is really weird but... Why ? What's the point in this obscure "song" with no precise theme or clear subject ? And what does it do ? This last question is the most easily answered : just follow the damned algorithm.

C'est parti...

X=1, Y=0
Y+=X => 0+1=1
X+=Y => 1+1=2
Y+=X => 1+2=3
X+=Y => 2+3=5
Y+=X => 3+5=8
X+=Y => 5+8=13
X+=Y => 8+13=21
Y+=X => 13+21=34
X+=Y => 21+34=55

No need to go further, most of you should have recognised, the famous Fibonacci sequence.

This gave me a compelling idea to modify the old Fletcher & Adler algorithms, keeping their very low footprint and minimalistic computational complexity. Both of these well known algos use a pair of values and have a similar structure. The trick is that rearranging the data dependency graph provides the equivalent of a minimalistic polynomial checksum, because the result is fed back on itself, in a more robust way than Fletcher's algo.

At first glance, this new checksum loop's body becomes something like :

Y += ( X ^ datum[i  ] );
X += ( Y ^ datum[i+1] );

This loop body is totally trivial to unroll. As trivial is the design of the corresponding digital circuit. This early version seemed to contain the whole checksum entropy in the last computed value but now X and Y are part of the signature. And the really important enhancement is the use of the carry!

For superscalar 32 bits CPUs, the following code seems to work well though the missing carry hardware (and/or lack of language support) requires more instructions to emulate:

t = X + Y + C;    Y = X ^ data;
C = t >> 16;      X = t & 0xFFFF;

In this worst case, without support of a carry flag, that's 5 basic operations (not counting memory loads) that fit in 4 registers and 3 cycles, to process 2 bytes. Not too bad. I'll let you deal with alignment. But is it really safe or safer ?

The following logs will show how the idea evolves and the performance increases, through discussions about carry wrap-around, register widths, scheduling, avalanche, parallelism, orbits, structure, black holes...


1. The variable
2. Datapath
3. Adler32 weakness
4. Easy circuit
5. Periodicity
6. A promising 32-bit checksum
7. Progress
8. Why
9. Orbital mechanics
10. Orbits, trajectories and that damned carry bit
11. Two hairy orbits
12. How to deal with black holes
13. Anteriority
14. Moonlighting as a PRNG
15. A new name
16. More orbits !
17. First image
18. Structure and extrapolations
19. Even more orbits !
20. Start and stop
21. Some theory
22. Some more theory
23. Even more theory.
24. A little enhancement
25. sdrawkcab gnioG
26. Further theorising
27. What is it ?
28. A bigger enhancement
29. Going both ways
30. Can you add states ?
31. Please, GCC !
32. _|_||_||_|______|______|____|_|___|_________|||_|__
33. Carry on with GCC
34. A good compromise



  • Endian (who uses Big Endian anyway today ?)
  • Alignment (align your memory access and/or blocks, instead, doh! )
  • 8-bit version/16-bit signature

Ironically, this structure reminds me a LOT of a 2-tap LFSR. So a more "powerful" version would use a Mersenne LFSR-inspired structure, with additional registers : Z, W, V... In several 2005 articles, I have explored "Mersenne Twisters" with 3 and 4 taps but even that would be overkill for such a basic application where throughput is the...

Read more »


Compare the original (bit-limited) algo with the new (non-masking) algo.

x-csrc - 1.29 kB - 06/18/2021 at 00:42



Faster checksumming routine, with 16 and 32 bits versions.

x-csrc - 1.30 kB - 05/25/2021 at 11:42



runs both forward and backwards iterations in parallel, sometimes they meet in the middle.

x-csrc - 8.32 kB - 05/25/2021 at 04:10



Microbenchmark that compares the running time of forward and backwards computation.

x-csrc - 1.78 kB - 05/13/2021 at 03:21



same as orbit_count.c but with command line options to resume the search at a given point.

x-csrc - 1.99 kB - 05/07/2021 at 06:03


View all 20 files

  • The new non-masking algorithm

    Yann Guidon / YGDESa day ago 0 comments

    The last log A good compromise proposed a new approach and data organisation that is reasonably safe, fast  and scalable.  I had to test it so I wrote this program: test_w26.c

    It almost went according to the plan, as you can see in the code below:

    // The original algo (optimised)
    C += X + Y;
    Y = X + M_;
    X = C & PISANO_MASK;
    C = C >> PISANO_WIDTH;
    // The new algo (crude)
    t = X + Y + C + M_;
    C = ((t ^ X ^ Y) >> PISANO_WIDTH) & 1;
    Y = X;
    X = t;

    Of course I wanted to test the equivalence so the new code is not optimised, and it's only a matter of time until it gets squished. The only disappointment is that there is one more value to XOR than I wanted, but it's a minor cost, particularly since this can be optimised out, because one XOR is redundant (just use one more temp var).

    But the good news is :

    It works as intended on (almost) the first try !

    With only 26 bits that are "known good", the variables can be 32 or 64 bits large, and the algo doesn't spill precious data/entropy in the void of a missed carry out flag. With 2 32-bit registers, you can be safe to checksum blocks containing 2^52 zeroes.

    Now I have to reorganise and optimise this.

  • A good compromise

    Yann Guidon / YGDES2 days ago 0 comments

    At this moment, the search for w26 is only at the 73% mark so it's not officially "good" but very probably so. I doubt I can ever confirm or refute w32 so, this far, w26 is the best I can have confidence in. And w32 is very unlikely "good" anyway. So what can I do with w26 ? And how ?

    I think it's not as useless as I originally thought, on the contrary, and even more so since my last encounters with GCC. I want something that is quite simple, robust, and which works anywhere without ridiculous compiler wrangling and weird gymnastics.

    Remember that even though w26 works on 26 bits, its period is double that ! So even with a pair of 32-bit registers, I get a 2⁵⁶ period without effort. This should be enough, right ? And with such a size, you'd be wiser to chunk your data, such as 16M blocks just to be safe.

    So let's see how to design a checksum that works well for 32 and 64 bits, using 32-bit wide registers.

    • First, we deal with a 26 bits wide accumulator/adder, but we can see that it's ridiculous to scrub the remaining 6 MSB, because they contain useful entropy. The existing codes mask the MSB, mostly for convenience (in shifting the carry back to the LSB), but this would work against the checksum's robustness, so let's use the whole 32 bits now. There is no harm in using a bit that is not the MSB for the carry loopback.
    • One would be tempted to use the whole 32 bits of the register to mix the input data stream, but even though it would double the bandwidth, I hesitate and I prefer sticking to 16 bits of mixing, despite the extra alignment/masking/shifting mess. With this arrangement, the system has one chance in approx. 2¹⁰ (1024) to fall in a "funnel" (when the state falls in a "trajectory" that becomes undistinguishable from an orbit, making the system non-reversible hence more likely to lose information).

    So far we have split the 32 bits into 3 zones :

    • 6 MSB : sort of "guard bits", they come for free and are good to keep around.
    • 10 middle bits : the MSB of the 26-bit Pisano iterator
    • 16 LSB : where the iterator gets its state mixed.

    There is some headroom, and should work well in most cases. But we have to solve one last issue : we can't get the carry bit.

    In fact we can't get it directly because the previous methods would either mask the "guard" MSB or shift the computation such that the carry bit appears on the carry flag. But this bit is not lost. The iterator virtually operates on the bits 0 to 25 so we could detect a carry when the bit #26 changes state. So the carry is computed by

     (((X^Y)>>26) & 1)

    which is a decent code sequence on any decent 32-bit CPU.

    In hardware, it's even more simple, particularly if you can access intermediary results from the adder's internal nodes.

    The corresponding source code adds some latency, there is a string of 3 dependent instructions which reduces the speed a bit, but it's not horrible either. Furthermore we can use ADDs everywhere, and not worry about missing a carry in the MSB anymore in case we add the carry separately. Overall, it's a good design gain and the whole thing, whether you use 32 bits (the last result) or 64 bits (the last result concatenated with the precedent) has a lot of headroom.

    From there, we can make this macro:

    Z = Z + W + dat16 + (((Z^W) >> 26) & 1)

    Then substitute Z and W for X and Y alternatively.

    PISANO_MACRO(Y, X, data[0])
    PISANO_MACRO(X, Y, data[1])

    There is still room for "enhancement" with the x86 breed, using the RCx (Rotate through Carry) opcodes. For example RCL 6 will put the bit 26 into the carry bit, and you just need to ADC to save one  ADD opcode. So the sequence would be

    MOV T, X
    XOR T, Y
    ADD Y, dat
    RCL T, 6
    ADC X, Y

    I wish I could avoid the first MOV... but I have to admit its inevitableness and test the hell out of it....

  • Carry on with GCC

    Yann Guidon / YGDES05/27/2021 at 17:16 0 comments

    The last logs made me discover that GCC has some builtins to manage overflow which might make a difference for the x86 users since the carry is so critical with this algorithm. So let's explore the results.

    The first thing is that it doesn't explicitly support adding with the carry flag itself, unless there is some unknown GCC magic under the hood. The ADDC instruction is an implicit ternary add, or "triadd" (sorry for the neologism) that is not supported on all processors though it plays a huge role in multi-precision maths. This is critical because the algo should perform C+X+Y in one operation only. This ensures the result is squarely bound in the 0-0x1FFFF range (and not 0x1FFE). If you perform it in two steps, you end up with two carries to manage and more code to write, hence a slower execution with justifies using a larger register...

    Some more web search shows that GCC can generate ADDC when doing double-precision adds but not as a single "triadd". This is interesting though because maybe using a double-precision variable for X and Y could solve the problem...

    Anyway GCC seems to keep one of its promises :

    uint32_t addmultiple(
      uint32_t a, uint32_t b, uint32_t c) {
      unsigned int t, o;
      o = __builtin_sadd_overflow (a, b, &t);
      return t+c+o;

    gives the following asm:

    gcc -Wall -g -Os -S carrybuiltin.c -o carrybuiltin.S =>
            xorl    %eax, %eax ==> clear the MSB in preparation for seto
            addl    %esi, %edi
            seto    %al        ==> set the LSB
            addl    %edi, %eax ==> instead of adcl edx, edi
            addl    %edx, %eax

    The use of SETO is a first step ! However GCC doesn't get the clue for the next ADDL, maybe because I explicitly use an idiom that puts the carry into an integer variable. GCC not dumb: GCC see "add integer variables", GCC add integer variables.  

    All hope is not lost though:

    #include <stdint.h>
    #include <inttypes.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    uint32_t addmultiple(
      uint32_t a, uint32_t b, uint32_t c) {
      unsigned int t, o;
      o = ( (t=a+b) > a);
      return t+c+o;
    leal    (%rdi,%rsi), %eax
    cmpl    %eax, %edi   \
    adcl    %edx, %eax   /  should be merged as addl

    Ah, that's much better! The carry is merged almost by magic! But this is not "there" yet: the CMP is redundant with the ADD instruction. And because GCC WANTS to return the result in EAX, it insists on using LEA for the computation, which does not affect the carry flag. Sometimes, being smart can play against you. So you must be extra-extra-smart, at the risk of coding some obscure kludge...

    So let's kludge even more !

    uint32_t addmultiple(
      uint32_t a, uint32_t b, uint32_t c) {
      return a+c+ (a<b); // BEWARE of the comparison direction!
    ====> gcc -Wall -g -Os -S carrybuiltin.c -o carrybuiltin.S
            addl    %esi, %edi
            movl    %edx, %eax ==> harmless move
            adcl    %edi, %eax

    There ! There ! I can't believe it but it's there ! There is an extra movl but this is not relevant because this ultra-short code snippet is not representative and will disappear in a larger function through countless register renaming.

    Just a warning though: get the comparison right! If you invert it, GCC can't optimise and will emit a longer instruction sequence.

    Unfortunately, it seems to be only a "one-trick poney" that even refuses to repeat its feat:

    uint32_t addmultiple(
      uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
      a+= b;
      a+= c+ (a<b);
      a+= d+ (a<c);
      return a;
            addl    %esi, %edi
            adcl    %edx, %edi => Good !
            xorl    %eax, %eax
            cmpl    %edi, %edx
            seta    %al
            addl    %ecx, %edi
            addl    %edi, %eax => Bad !

    Come on, GCC !



    OK one reason why GCC couldn't use adcl a second time is because my source code was wrong. It must have been:

      a+= b;
      a+= c+ (a<b);
      e+= d+ (a < (c+ (a<b)));
      return a+e;

    That changes everything ! GCC correctly infers the common sub-expression and does not compute it twice but fails to infer adcl again anyway.

    Le sigh.

    I suppose I have to write in asm...

    Read more »

  • _|_||_||_|______|______|____|_|___|_________|||_|__

    Yann Guidon / YGDES05/25/2021 at 04:12 7 comments

    That's the surprisingly aesthetic output from the debug version of the latest program:

    This is halfway between the look of a cellular automaton and a procedurally-generated maze on an 8-bit console... Every "_" is a datum enqueued by the "slow backwards" thread and every "|" is enqueued by the "fast forward" thread.

    And it runs fast !

    $ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -Os pt_orbit_07.c -o pt_orbit && /usr/bin/time ./pt_orbit
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan backwards from Step 0 : X=1, Y=0, C=0 
    Starting to scan forward from Step 0 : X=1, Y=0, C=0 
      EUREKA !!!!  Backwards scan reached forward at X=45918 Y=0 C=0
    after 822785867 + 1324730547 = 2147516414 steps
     and 12706 + 20329 = 33035 crossings
    C'est fini.
    3.09user 0.00system 0:01.55elapsed 199%CPU

    The "fast-forward" loop alone runs at about 2.4s and running the slower version in parallel saved 0.9s ! Amdahl's law applies of course and it's not possible to get to 1.2s because the slower part will always hold everything else back. But my flexible synchronisation method with 2 FIFOs works well, at least to keep the speed maxed.

    However it doesn't work as expected all the time and there are many misses. I'll have to solve that but you can already play at home with pt_orbit_07.c.

    Enjoy !

  • Please, GCC !

    Yann Guidon / YGDES05/24/2021 at 08:38 0 comments

    I get tired of fighting GCC's weird code generator that won't get a clue from my source code. I'm (still) using gcc6.4.1 (2017, I know) but I don't know how to make it understand basic things... This is critical at my level because any little performance win saves many days of computation !

    I get variable performance tiers (sometimes 50% bumps) and I can't see how to attribute (or force) them. I made many tests and I simplified the source code to this point, trying to remove as many jumps as possible out of the inner loop:

      while (N < limit) {
        do {
          do {
            // Compute forward :
            t = X + Y + C;
            Y = X;
            C = t >> WIDTH;
            X = t & m;
          } while (Y != 0);
          if ((X == 1) && (C == 0)) {
            // Something something
            return NULL;
        } while(chunk < chunksize);
        printf("Forward Step %"PRIu64" : X=%d, Y=%d, C=%d \n",
            N, X, Y, C);

    I can see unexplained jumps in performance and I looked at the generated asm, where I found some good and some less good surprises...

    gcc -g -S -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_05.c -o pt_orbit.S

    And I get this:

            .loc 1 86 0
            cmpq    limit(%rip), %r15  => while (N < limit) {
            movl    12(%rsp), %r9d
            jnb     .L3                => go to end of main loop
            .loc 1 68 0 discriminator 1
            xorl    %esi, %esi   => esi = chunk = 0
            jmp     .L8          => wait ? Why jump here ??
            .p2align 4,,10
            .p2align 3
            .loc 1 112 0
            cmpq    %r14, %rsi   => Why compare here and not at the bottom of the loop as I wrote ?
            ja      .L20
    .L4:        => Oh and this unwanted prefix has eaten from the alignment...
            .loc 1 96 0 discriminator 1
            movl    %ebp, %ebx   => wait, why a circular assignation ?
            movl    %r12d, %ebp  => X into ebp
    .L8:      => and the loop starts only now ?
            .loc 1 93 0 discriminator 1
            leal    (%rbx,%r9), %eax    => t = Y + C
            .loc 1 91 0 discriminator 1
            addq    $1, %rsi            => chunk++
            .loc 1 93 0 discriminator 1
            addl    %ebp, %eax          => t += X
            .loc 1 95 0 discriminator 1
            movl    %eax, %r9d          => wait, another assignation ? t goes to r9 now ?
            .loc 1 96 0 discriminator 1
            movzwl  %ax, %r12d          => X = t & mask
            .loc 1 95 0 discriminator 1
            shrl    $16, %r9d           => C = t >> 16
            .loc 1 101 0 discriminator 1
            testl   %ebp, %ebp           => loop again if ebp is not zero.
            jne     .L4
            .loc 1 103 0
            addl    $1, %r13d
            .loc 1 104 0
            cmpl    $1, %r12d      => X == 1 ?
            jne     .L5
            testl   %r9d, %r9d    => C == 0 ?
            jne     .L5
            .loc 1 106 0
            movl    $.LC1, %edi   => prepare printf...
            .loc 1 105 0
            addq    %r15, %rsi
            .loc 1 106 0
            movl    %r13d, %edx
            xorl    %eax, %eax
            call    printf

    The movzwl thing is a good way to avoid moving the 16 bits of mask around, and it performs a move/copy to another register. But then WHY not exploit this and rename a few things around that ? EAX is not used after the movzwl so it should have saved one mov ! It should look like this :

       movzwl  %ax, %r12d          => X = t & mask
       shrl    $16, %eax           => C = t >> 16

    So the inner loop should look a bit like this:

    ; rsi => chunk
    ; r12 => X
    ; eax => t, C
    ; ebp => Y
       addl    %ebp,  %eax    => t = Y + C
       addq    $1,    %rsi    => chunk++
       addl    %r12d, %eax    => t += X
       movl    %r12d, %ebp    => Y = X
       movzwl  %ax,   %r12d   => X = t & mask
       shrl    $16,   %eax    => C = t >> 16
       testl   %ebp, %ebp     => loop again if ebp is not zero.
       jne     .L4

    As promised, that's 5 opcodes for the computation, 1 counter update and 2 conditional jump instructions. That one should fly, right ? But should I have to switch to LLVM to get this ?

    Meanwhile the orbit search for w=26 is still running at maybe half the potential top speed...*


    Grumble !!!!

    Same source code, different optimisation level:

    [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_05.c -o pt_orbit && /usr/bin/time ./pt_orbit
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan forward from Step 0 : X=1, Y=0, C=0 
    Forward scan reached init state after 2147516415 steps...
    Read more »

  • Can you add states ?

    Yann Guidon / YGDES05/23/2021 at 03:42 0 comments

    Going further on the path explained in log 27. What is it ?, I'm exploring the question of addition.

    The Pisano-Carry state space has the operation "next" and "previous" state but addition is not natural, or more precisely it does not correspond to something we already know. Getting the "successor" is not strictly equivalent to adding 1, because there is no such "number" in this object: states are coordinates and have 2 elements (and half). We already know addition between integers, between complex numbers, between numbers in a Galois field (such a N^m or polynomials over Z²)... But here it's something different.

    Another idea is that if you can compute "next", you can also compute "next next" and so on... This could maybe speed up the search/exploration of orbits but would that be faster than individual steps ? And how would we know that the intermediary steps would (or would not) visit a given state ?

    Let's go back to the code's definition of one step:

    t = X + Y + C;    Y = X ^ data;
    C = t >> 16;      X = t & 0xFFFF;

    In a more mundane way, this could be written as:

    #define SIZE (1 << WIDTH)
    t = X + Y + C;
    Y = X;
    C = t / SIZE;
    X = t mod SIZE;

    Computing 2 steps though is not so easy because X and Y are swapped all the time:

    #define SIZE (1 << WIDTH)
    t  = X + Y + C;
    Y1 = X;
    C1 = t / SIZE;
    X1 = t % SIZE;
    u  = X1 + Y1 + C1;
    Y2 = X1;
    C2 = u / SIZE;
    X2 = u % SIZE;

    The div/mod operations don't make the sequence easy to simplify but, as in the Adler32 advanced version, there should be a way to perform it at the very end of the sequence only. If this is true, then the general algorithm could be sped up further...

    (To be continued...)

  • Going both ways

    Yann Guidon / YGDES05/19/2021 at 01:30 0 comments

    I am now writing a program that runs both backwards and forwards iterations simultaneously on 2 cores, using the POSIX threads library.

    The performance seems reasonable, I included some little scheduling tricks to save a fraction of performance here and there.

    Forward alone:

    [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan forward from Step 0 : X=1, Y=0, C=0 
    Forward scan reached init state after 2147516414 steps and 33035 crossings
    2.72user 0.00system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 1664maxresident)k
    0inputs+0outputs (0major+71minor)pagefaults 0swaps

     Backwards alone :

    [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan backwards from Step 0 : X=1, Y=0, C=0 
    Backwards scan reached init state after 2147516414 steps and 33035 crossings
    4.63user 0.00system 0:04.64elapsed 99%CPU (0avgtext+0avgdata 1572maxresident)k
    0inputs+0outputs (0major+68minor)pagefaults 0swaps

    Backwards and forward simultaneously:

    [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan forward from Step 0 : X=1, Y=0, C=0 
    Starting to scan backwards from Step 0 : X=1, Y=0, C=0 
    Forward scan reached init state after 2147516414 steps and 33035 crossings
    Backwards scan reached init state after 2147516414 steps and 33035 crossings
    7.63user 0.00system 0:04.76elapsed 160%CPU (0avgtext+0avgdata 1540maxresident)k
    0inputs+0outputs (0major+70minor)pagefaults 0swaps

    The overhead is marginal but explodes when other computations occupy the remaining 2 threads. There is nothing to do there though.

    The big question is how to synchronise the cores ? How do we know the orbit is closed and where ?

    So far the strategy is to run as fast as possible until X==1. This happens approximately 2^(W-1) during a whole orbit. However the orbit has approx. 2^(2W-1)  steps so the "crossings" get rarer as W increases.

    For now, in this "rare" event, I test for Y==0 to end the loop, but this is where the threads could compare their respective results, namely the Y tagged with the iteration count. C is most probably 1 so not worth registering.

    Anyway this cuts down the number of comparisons and for W=26 this would happen in average every 2²⁵ iteration, or 32M steps. At almost 1G steps per second, that's about 30 times per second. A buffer of one second will help smooth things for the moments when a lot of crossings occur on one thread while the other is in a long trajectory between distant crossings. So let's have a FIFO with 64 or 128 values.

    Semaphores should be avoided as much as possible because they block the other thread which could perform useful computations. However safe synchro primitives must be used to prevent the usual problems in parallelism.


    The desired behaviour is that the fastest thread sends a sort of message (step number and value of Y) to the slower thread, which will compare Y with its own series of values. The messages are materialised and queued by a FIFO, which the "faster/sender" writes and the "slower/receiver" reads. This matches what Wikipedia describes in Non-blocking algorithm:

    [...] some non-blocking data structures are weak enough to be implemented without special atomic primitives. These exceptions include:
    • a single-reader single-writer ring buffer FIFO, with a size which evenly divides the overflow of one of the available unsigned integer types, can...
    Read more »

  • A bigger enhancement

    Yann Guidon / YGDES05/16/2021 at 02:17 0 comments

    The log 24. A little enhancement explored the idea of a marginally "safer" version which would compensate for the altered carry bit that would lead to a "funnel". The result was not stellar and a huge hit on performance. Meanwhile I'm still looking for a method to deal with 32-bit wide data to further enhance the performace-per-cycle but the theory is pessimistic on this front.

    Today, a new idea percolated and it might bring both "safety" and "performance" together by reusing the existing building blocks in new, clever ways, with the magic of the fractional format. Indeed when we look at the classic code:

    t = X + Y + C;    Y = X ^ data;
    C = t >> 16;      X = t & 0xFFFF;

    We see that the useful values are in the least significant bits. There is no need to think further if the platform is 16-bit wide or supports this computation mode: the result masking and the carry shifting are performed implicitly by the hardware. It gets ugly with wider registers because no language (apart from ASM) lets you access the carry bit and you have to clean up after the computations.

    But we can save the masking by pre-shifting the values! For example, for a 32-bit register, the above code would work with the higher half, the 16 MSB. The new code would become :

    t = X + Y + C;
    Y = X ^ data;

    Wow ! We shaved one half of the code ! It's now even more simple and lean ! And we can deal with 16-bit data on the LSB half, out of reach from the dirty funnels. The carry propagation from the lower half will only happen every other time, with very very low chance of creating a funnel. It has not disappeared but as they say, "engineering is the art of pushing problems where they are not an issue".

    So we have the new split design, with the lower 16 bits getting the mixed data, the higher bits running the usual PRNG sequence with a few nudges, but avoiding the known theoretical traps, and a checksum that can be as wide as 64 bits now (when X and Y are concatenated).

    The scheme can be repeated with wider registers: for example it becomes easy to process 32 bits per step with a 64-bit register, and so on. As long as you keep the Pisano machinery in the MSB, you can arrange the data as you like. I'll let you figure how to deal with 48-bit chunks of data ;-)

    But wait ! Where has the C variable gone ? In the following line:

    t = X + Y + C;

    C is usually the carry generated by the addition, which becomes useful again with the new MSB alignment. So this is solved by reusing this carry flag. End of story, right ?

    Weeeeeellllllll, the carry should be inserted at the bottom of the Pisano field, at bit 16 for the 32 bits version, so the above code is misleading and no architecture can do that. x86 has a CMOV instruction and ARM has/had(?) predicated opcodes so it is still possible to conditionally add 0x10000 to Y for example.

    But in C, we don't get that sweet little carry bit. The wraparound could be detected by a hackmem-esque trick, such as

    C = ((t<X) && (t<Y))? 0x10000 : 0 ;


    C = (((t-X) & (t-Y)) >> 16) & 0x10000;

    But this increases the computation time and loop size, as well as introduce a few off-by-one issues.

    Furthermore we might miss the cases where a carry is generated in the intermediary result of the addition (because, apart from x86 with its LEA opcode, no other architecture provides a 3-addend addition and IIRC it doesn't even affect the carry flag).

    So once again we're back to the original code, where the mask&shift can work in parallel. And by not masking C we can feed some of the LSB sum back into the LSB to mix with the incoming data, to further increase the mix...

    t = X + Y + C;    Y = X + (uint16_t)data;
    C = t >> 16;      X = t & 0x7FFFFFFF;

    In this version, the 16-bit Pisano part would be in bits 15-30. Bit 31 would be the carry and the MSB of the 16 bits of data would eat into the LSB but that's not considered as a significant problem here.

    At this point, not masking the MSB of X is tempting but that would discard all the already-explored...

    Read more »

  • What is it ?

    Yann Guidon / YGDES05/15/2021 at 18:56 0 comments

    There is a question that bugs me for weeks now : what type of mathematical object is it ?

    I have a quite decent knowledge of the usual types : integers, reals, complex numbers, and even some Galois fields over Z², for example. Quaternions, vectors and matrices are other types that are well defined. But the object under consideration puzzles me. Is it a ring, a field, a group, or something else ?

    Let's start with the obligatory Wikipedia quote:

    In mathematics, rings are algebraic structures that generalize fields: multiplication need not be commutative and multiplicative inverses need not exist. In other words, a ring is a set equipped with two binary operations satisfying properties analogous to those of addition and multiplication of integers. Ring elements may be numbers such as integers or complex numbers, but they may also be non-numerical objects such as polynomials, square matrices, functions, and power series.

    Formally, a ring is an abelian group whose operation is called addition, with a second binary operation called multiplication that is associative, is distributive over the addition operation, and has a multiplicative identity element.

    So here is the new riddle : does addition even exist in this object ? The answer is easy for the previously cited types yet here, the elements are not numbers, but coordinates in a 2.5D closed space. Whether the carry counts as one individual dimension is still being debated. The description does not fit directly with a vector system. The behaviour is a bit reminiscent with a very weird Galois field but here, the "generator" generates only one half of a symmetrical set of coordinates (which I call orbits here). Really I know nothing of this kind.

    So what can we say so far about this family of objects ?

    • It is a family of sets defined by a single parameter (called here w, or width)
    • Each set is organised as a vector space with coordinates X=[0..2w-1], Y=[0..2w-1] and C=[0..1], hence cardinality of 2w+1 elements.
    • Each element has a predecessor and successor, defined by a reversible formula (with forward and backwards directions, already defined in other logs).
    • Elements (0,0,0) and (2w-1,2w-1,1) are their own precursor and successor.
    • Except for w=1, the space shows a symmetry and duality (invertibility?), where all elements (x,y,c) have an inverse value (2w-(x+1), .2w-(y+1), 1-c), with the same behaviour.
    • Since each point has a successor and a precursor, and the number of points is finite, the series of successors must be finite too, leading the series back to its first element, and this creates an "orbit".
    • Some values of w create a system with only two orbits (as calculated in this log) and these are the ones I'm seeking:
      2, 3, 4, 10, 16
      Other numbers may exist above 25 but have not been discovered so far.


    Another interesting detail is the way the successor is calculated: Y gets the value of X. So in a way, when we map the coordinates on a 2D plane, each step represents a sort of rotation on this plane. This system makes me think of the logistic map r×x×(1-x), such as:

    Of course it is not directly related, if only because here the values are integer, modulo 2w and every overflow causes an increment of x. The other detail is that there is no parabola, no multiply/squaring and the guiding function is Fibonacci's series. Is there a link I'm missing yet?

  • Further theorising

    Yann Guidon / YGDES05/14/2021 at 21:23 0 comments

    PS: this log has bogus data so forget about it... but 6 looks interesting anyway because it has 6 orbits of equal lengths.


    More than two weeks of computations for w=26 and w=27 have given some time to think on the larger scale and a better look at the available data. But it's getting reeeeeaaaaallllllllyyyyyyy long now and I'm mostly wishing that the w=26 and/or w=27 fail soon. At only 5%, 27 can still fail but so far 26 looks promising.

    For now we have this list of potentially valid sizes :

    1 2 3 4 6 10 16 (26?)   -> 6 is false...

    It already struck me that 6 + 10 = 16, and now 10 + 16 = 26. Going backwards, 10 = 6 + 4 and 6 = 4 + 2 (hello, kids' maths). Now it's obviously the Fibonaccy sequence, but doubled ! But what happens with 1 and 3 ? Why do they work but not 5 ?

    From there, we can postulate some predictions/hypothesis:

    • 32 is not a valid size. Too bad ! 16 is valid because it is the double of 8 which is a Fibonaccy number. No other number in the sequence is a power of two, sadly. Finding 16 on the first try was a very lucky draw!
    • The next valid sizes would be 42 (of course !), 68, 110 ...

    These speculations need to be verified but at least the theory progresses, and it is dearly needed. At this rate, completing w=26 will take two more months (and I don't imagine for w=27). I must find a trick to make it only one month. Furthermore, the longer it takes, the higher the chances of flukes/miscalculations accumulate. They are rare but not impossible at this scale

    So even halving the run time is very important and I'm back to the concepts of the last logs, where 2 cores compute roughly one half of the orbit and meet "somewhere" near the middle. And I think I have solved one major issue I had with the previous ideas: there is no need to store all the paths, we only need to compare the last N values. Indeed, the orbit is closed at one point, not potentially random values, and the sequence is run backwards and forward. Now the question is : how many points and which should be kept in the history ?

    We can synchronise the 2 threads so they know what the other has "consumed", and discard the elements that have been compared. My favourite sync method is not semaphores or read/write flags, but a pair of mailboxes where one thread write and the other can only read, which simplifies the cache coherency and updates.

View all 35 project logs

Enjoy this project?



Yann Guidon / YGDES wrote 05/24/2021 at 03:52 point

Oh, nice !

  Are you sure? yes | no

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates