Pisano-carry Checksum algorithm

Add X to Y and Y to X, says the song. And carry on.

Another purely additive, non-cryptographic, fast and minimalistic checksum.

Aimed at being a 32-bit checksum for very low resources,
- very small, using only XOR and ADD,
- without the flaws of Adler32 and Fletcher, yet the same size and complexity,
- almost as good as CRC32, which it may replace when the memory footprint is critical,
- very easy to implement in digital circuits or microcontrollers with 16-bit accumulators,
- suitable for the detection of trivial corruption in network packets or files.

Disclaimer: This is a linear, reversible checksum so it's not appropriate for cryptography or hash tables. This is not even the fastest checksum ever.
But it's still quite good.

The basic application uses 2×16 bits. 2×32 bits and larger are possible with the non-masking algorithm.

If you wonder why, just read the logs.

The original algo/project name was "Fibonacci checksum" but it later appeared that this was not the most accurate choice, because Leonardo Pisano (Fibonacci's real name) is the name associated with the periodicity of the sequence under a modulus: the Pisano period.


In 2012, Piano Magic released their album "Life Has Not Finished with Me Yet". One song contains a weird repeating pattern...

Glen Johnson's lyrics are often cryptic and more evocative than objective, but any geek's mind would cling to this mantra at the end:

Add X to Y and Y to X

This is really weird but... why? What's the point of this obscure "song" with no precise theme or clear subject? And what does it do? This last question is the most easily answered: just follow the damned algorithm.

Here we go...

X=1, Y=0
Y+=X => 0+1=1
X+=Y => 1+1=2
Y+=X => 1+2=3
X+=Y => 2+3=5
Y+=X => 3+5=8
X+=Y => 5+8=13
Y+=X => 8+13=21
X+=Y => 13+21=34
Y+=X => 21+34=55

No need to go further: most of you will have recognised the famous Fibonacci sequence.

This gave me a compelling idea to modify the old Fletcher & Adler algorithms, keeping their very low footprint and minimalistic computational complexity. Both of these well known algos use a pair of values and have a similar structure. The trick is that rearranging the data dependency graph provides the equivalent of a minimalistic polynomial checksum, because the result is fed back on itself, in a more robust way than Fletcher's algo.

At first glance, this new checksum loop's body becomes something like :

Y += ( X ^ datum[i  ] );
X += ( Y ^ datum[i+1] );

This loop body is totally trivial to unroll, and the design of the corresponding digital circuit is just as trivial. This early version seemed to concentrate the whole checksum entropy in the last computed value, but now X and Y are both part of the signature. And the really important enhancement is the use of the carry!
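For illustration, this early carry-less loop body can be wrapped into a complete routine. This is only a sketch: the seed values (X=1, Y=0), the 16-bit data layout and the way X and Y are folded into a 32-bit result are my assumptions, not a reference implementation.

```c
#include <stdint.h>
#include <stddef.h>

// Sketch of the early, carry-less variant: two 16-bit words per iteration.
// Seed (X=1, Y=0) and output folding are illustrative assumptions.
uint32_t fibo_checksum(const uint16_t *datum, size_t n_pairs)
{
    uint16_t X = 1, Y = 0;
    for (size_t i = 0; i < 2 * n_pairs; i += 2) {
        Y += X ^ datum[i];      // mix one word into Y
        X += Y ^ datum[i + 1];  // then the next into X
    }
    return ((uint32_t)X << 16) | Y;  // both X and Y are part of the signature
}
```

Note how even an all-zero input makes the state evolve, because the non-zero seed keeps feeding the Fibonacci recursion.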

For superscalar 32-bit CPUs, the following code seems to work well, though the missing carry hardware (and/or lack of language support) requires more instructions to emulate it:

t = X + Y + C;    Y = X ^ data;
C = t >> 16;      X = t & 0xFFFF;

In this worst case, without support for a carry flag, that's 5 basic operations (not counting memory loads) that fit in 4 registers and 3 cycles to process 2 bytes. Not too bad. I'll let you deal with alignment. But is it really safe, or safer?
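A self-contained sketch of this portable worst-case version, consuming one 16-bit datum per iteration; the seed and the final packing of X and Y are my choices for illustration, not part of the original text:

```c
#include <stdint.h>
#include <stddef.h>

// Portable carry emulation: t is 32 bits wide so the carry out of the
// 16-bit sum survives in bit 16. Seed and result packing are assumptions.
uint32_t pisano_carry16(const uint16_t *data, size_t n)
{
    uint32_t X = 1, Y = 0, C = 0, t;
    for (size_t i = 0; i < n; i++) {
        t = X + Y + C;         // the ternary add ("triadd")
        Y = X ^ data[i];       // mix the datum into the other half
        C = t >> 16;           // emulated carry flag
        X = t & 0xFFFF;        // masked 16-bit result
    }
    return (X << 16) | Y;
}
```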

The following logs will show how the idea evolves and the performance increases, through discussions about carry wrap-around, register widths, scheduling, avalanche, parallelism, orbits, structure, black holes...


1. The variable
2. Datapath
3. Adler32 weakness
4. Easy circuit
5. Periodicity
6. A promising 32-bit checksum
7. Progress
8. Why
9. Orbital mechanics
10. Orbits, trajectories and that damned carry bit
11. Two hairy orbits
12. How to deal with black holes
13. Anteriority
14. Moonlighting as a PRNG
15. A new name
16. More orbits !
17. First image
18. Structure and extrapolations
19. Even more orbits !
20. Start and stop
21. Some theory
22. Some more theory
23. Even more theory.
24. A little enhancement
25. sdrawkcab gnioG
26. Further theorising
27. What is it ?
28. A bigger enhancement
29. Going both ways
30. Can you add states ?
31. Please, GCC !
32. _|_||_||_|______|______|____|_|___|_________|||_|__
33. Carry on with GCC
34. A good compromise
35. The new non-masking algorithm
36. Add or XOR ?
37. A 32/64-bit version
38. Closing the trajectories



  • Endian (who uses big-endian today anyway?)
  • Alignment (align your memory accesses and/or blocks instead, doh!)
  • 8-bit version/16-bit signature

Ironically, this structure reminds me a LOT of a 2-tap LFSR. So a more "powerful" version would use a Mersenne-LFSR-inspired structure with additional registers: Z, W, V... In several 2005 articles, I explored "Mersenne Twisters" with 3...



Compare the original (bit-limited) algo with the new (non-masking) algo.

x-csrc - 1.29 kB - 06/18/2021 at 00:42



Faster checksumming routine, with 16 and 32 bits versions.

x-csrc - 1.30 kB - 05/25/2021 at 11:42



runs both forward and backwards iterations in parallel, sometimes they meet in the middle.

x-csrc - 8.32 kB - 05/25/2021 at 04:10



Microbenchmark that compares the running time of forward and backwards computation.

x-csrc - 1.78 kB - 05/13/2021 at 03:21



same as orbit_count.c but with command line options to resume the search at a given point.

x-csrc - 1.99 kB - 05/07/2021 at 06:03



  • Closing the trajectories

    Yann Guidon / YGDES · a day ago

    As the mapping of the whole orbit of w=26 has exceeded 80%, it might be too late to revive the "optimised" dual-orbit walker, but I am still puzzled by the cause of its flaw. Indeed, the program pt_orbit_07.c does not work as intended: it fails to "meet in the middle" approximately every other time. I thought my design was good, so what happens?

    Then I imagined the traces of 2 lines drawn on a bitmap. I assumed the algo would behave like this:

    But it appears that it can do so as well:

    So I guess one of the lines should have a slope of 1/2 instead of 1, as above, to ensure that nothing is missed. To maximise our chances, both lines would have a slope of 1/2, but even that is not enough:

    To make sure that the lines share points, at least one must have a width of 2 pixels. If both do, it's even better.

    Whatever the mutual alignment, there are 2 chances to catch the crossing.

    In the algo, the "width" translates into the number of previous values to check, on top of the current result. So ideally, both the forward and backwards iterators should keep at least one previous value to compare.

  • A 32/64-bit version

    Yann Guidon / YGDES · 3 days ago

    The new "non-masking" algorithm exploits some little tricks (explained in 34. A good compromise) to provide arbitrarily long checksums while using "only" a generator running on 26 bits (with a period of 2^51, or 2,251,799,813,685,248). When using 32-bit registers, it provides a baseline checksum of 32 bits, and the extended result can bring the full checksum to 64 bits.

    This is appropriate for the (non-crypto) signature of files for example.

    The input data is mixed in chunks of only 16 bits, because the generator has only 26 bits. Mixing larger chunks would not make as much sense in corner cases, such as when reading long sequences of constant data.

    The 2 new XOR gates increase the computation cost slightly, but it is still far cheaper than a comparable CRC32, with or without tables. And the source code is inherently more portable than the earlier versions that rely on hardware carry support.

    // The original lovely but masking algo:
    C += X + Y;  // ADC
    Y  = X + M_;
    X  = C & PISANO_MASK;
    C  = C >> PISANO_WIDTH;
    // The new non-masking algo:
    Cu += Xu + Yu;  // ADC
    p   = Xu ^ Yu;
    Yu  = Xu + M_;
    Xu  = Cu;
    Cu  = ((p ^ Cu) >> PISANO_WIDTH) & 1; // ROL 6

    The "masking" algo is amazingly short and beautifully efficient but limits the number of useful bits to 26×2. This means that only about 0.05% of the available coding space is used. Not wasting these bits is worth a few extra instructions, I think, unless a better/higher width than 26 is found.

    The Xu = Cu line will be removed by more advanced, unrolled versions of the code where variable renaming optimisations are possible.
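Folding the non-masking loop above into a routine gives something like the following sketch. The seed, the 16-bit data feed and the 64-bit signature layout (the last two states concatenated) are my assumptions, not the project's reference code:

```c
#include <stdint.h>
#include <stddef.h>

#define PISANO_WIDTH 26

// Non-masking Pisano-carry step from the log above, iterated over a
// 16-bit stream. Seed and signature layout are illustrative assumptions.
uint64_t pisano_nonmask(const uint16_t *M, size_t n)
{
    uint32_t Xu = 1, Yu = 0, Cu = 0, p;
    for (size_t i = 0; i < n; i++) {
        Cu += Xu + Yu;                          // ADC
        p   = Xu ^ Yu;
        Yu  = Xu + M[i];
        Xu  = Cu;
        Cu  = ((p ^ Cu) >> PISANO_WIDTH) & 1;   // recover the bit-26 carry
    }
    return ((uint64_t)Xu << 32) | Yu;           // extended 64-bit result
}
```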

  • Add or XOR ?

    Yann Guidon / YGDES · 3 days ago

    While studying the orbits of the Pisano-carry generator is good, let's not forget that the state will be mixed with more arbitrary bits. This "mixing" operation can be performed with a XOR or an ADD and both have their advantages and drawbacks, which are more or less prominent depending on the exact configuration.

    I tend to like XOR because it is so simple, uses less silicon area and is generally faster (at least not slower) than ADD. I also like it because it does not generate a carry signal that must then be taken into account. In turn, this makes the new "non-masking" algorithms more efficient, because the loopback bit is not affected, which saves one XOR operation, though there is no real impact in hardware (XOR gates are smaller than DFFs).

    OTOH the addition has more "avalanche effect" from the input data and, unless there is a wrap-around, the result is always greater than the addends, whereas XOR can make the value jump up and down, particularly if the MSBs are not yet set. In the worst case, this could prevent the whole state from evolving correctly. The ADD increases the spread of the checksums when the input data diverge very little (see RFC 3309 about Adler32's weakness).

    This last argument makes the ADD method the preferred choice.
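A toy probe (not project code) of the argument above: XOR-mixing the same word twice cancels it out, while ADD keeps accumulating.

```c
#include <stdint.h>

// Toy mixers, not project code: contrast XOR and ADD on a bare state.
static uint32_t mix_xor(uint32_t s, uint16_t d) { return s ^ d; }
static uint32_t mix_add(uint32_t s, uint16_t d) { return s + d; }

// Feeding the same datum twice: XOR cancels back to the initial state,
// ADD accumulates and keeps the state moving.
```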

  • The new non-masking algorithm

    Yann Guidon / YGDES · 5 days ago

    The last log, A good compromise, proposed a new approach and data organisation that is reasonably safe, fast and scalable. I had to test it, so I wrote this program: test_w26.c

    It almost went according to the plan, as you can see in the code below:

    // The original algo (optimised)
    C += X + Y;
    Y = X + M_;
    X = C & PISANO_MASK;
    C = C >> PISANO_WIDTH;
    // The new algo (crude)
    t = X + Y + C + M_;
    C = ((t ^ X ^ Y) >> PISANO_WIDTH) & 1;
    Y = X;
    X = t;

    Of course I wanted to test the equivalence, so the new code is not optimised; it's only a matter of time until it gets squished. The only disappointment is that there is one more value to XOR than I wanted, but it's a minor cost, particularly since it can be optimised out: one XOR is redundant (just use one more temp var).

    But the good news is :

    It works as intended on (almost) the first try !

    With only 26 bits that are "known good", the variables can be 32 or 64 bits wide, and the algo doesn't spill precious data/entropy into the void of a missed carry-out flag. With 2 32-bit registers, you can safely checksum blocks containing 2^52 zeroes.

    Now I have to reorganise and optimise this.

  • A good compromise

    Yann Guidon / YGDES · 6 days ago

    At this moment, the search for w26 is only at the 73% mark, so it's not officially "good", but very probably so. I doubt I can ever confirm or refute w32 so, for now, w26 is the best I can have confidence in. And w32 is very unlikely to be "good" anyway. So what can I do with w26? And how?

    I think it's not as useless as I originally thought, on the contrary, and even more so since my last encounters with GCC. I want something that is quite simple, robust, and which works anywhere without ridiculous compiler wrangling and weird gymnastics.

    Remember that even though w26 works on 26 bits, its period is double that ! So even with a pair of 32-bit registers, I get a 2⁵⁶ period without effort. This should be enough, right ? And with such a size, you'd be wiser to chunk your data, such as 16M blocks just to be safe.

    So let's see how to design a checksum that works well for 32 and 64 bits, using 32-bit wide registers.

    • First, we deal with a 26-bit-wide accumulator/adder, but it would be ridiculous to scrub the remaining 6 MSBs, because they contain useful entropy. The existing code masks the MSBs mostly for convenience (shifting the carry back to the LSB), but this works against the checksum's robustness, so let's use the whole 32 bits now. There is no harm in using a bit that is not the MSB for the carry loopback.
    • One would be tempted to use the whole 32 bits of the register to mix the input data stream, but even though it would double the bandwidth, I hesitate and prefer sticking to 16 bits of mixing, despite the extra alignment/masking/shifting mess. With this arrangement, the system has one chance in approx. 2¹⁰ (1024) of falling into a "funnel" (when the state falls into a "trajectory" that becomes indistinguishable from an orbit, making the system non-reversible and hence more likely to lose information).

    So far we have split the 32 bits into 3 zones :

    • 6 MSB : sort of "guard bits", they come for free and are good to keep around.
    • 10 middle bits : the MSB of the 26-bit Pisano iterator
    • 16 LSB : where the iterator gets its state mixed.

    There is some headroom, and it should work well in most cases. But we have to solve one last issue: we can't get the carry bit.

    In fact, we can't get it directly, because the previous methods would either mask the "guard" MSBs or shift the computation such that the carry appears in the carry flag. But this bit is not lost. The iterator virtually operates on bits 0 to 25, so we can detect a carry when bit #26 changes state. So the carry is computed by

     (((X^Y)>>26) & 1)

    which is a short code sequence on any decent 32-bit CPU.

    In hardware, it's even simpler, particularly if you can access intermediate results from the adder's internal nodes.

    The corresponding source code adds some latency: there is a string of 3 dependent instructions, which reduces the speed a bit, but it's not horrible either. Furthermore, we can use ADDs everywhere and not worry about missing a carry in the MSBs any more when we add the carry separately. Overall it's a good design gain, and the whole thing, whether you use 32 bits (the last result) or 64 bits (the last result concatenated with the previous one), has a lot of headroom.

    From there, we can make this macro:

    Z = Z + W + dat16 + (((Z^W) >> 26) & 1)

    Then substitute Z and W with X and Y alternately.

    PISANO_MACRO(Y, X, data[0])
    PISANO_MACRO(X, Y, data[1])
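Assembling the pieces above gives something like this minimal sketch; the seed and the final folding of X and Y into one word are my assumptions, not the project's reference code:

```c
#include <stdint.h>
#include <stddef.h>

// One Pisano-carry step on full 32-bit registers: the bit-26 carry is
// recovered from the XOR of the two state halves, as derived above.
#define PISANO_MACRO(Z, W, dat16) \
    (Z) = (Z) + (W) + (dat16) + ((((Z) ^ (W)) >> 26) & 1)

uint32_t checksum_w26(const uint16_t *data, size_t n_pairs)
{
    uint32_t X = 1, Y = 0;  // illustrative seed
    for (size_t i = 0; i < n_pairs; i++) {
        PISANO_MACRO(Y, X, data[2 * i]);      // alternate the roles of
        PISANO_MACRO(X, Y, data[2 * i + 1]);  // X and Y, as in the text
    }
    return X ^ Y;  // illustrative folding; X and Y could also be concatenated
}
```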

    There is still room for "enhancement" on the x86 breed, using the RCx (rotate through carry) opcodes. For example, RCL by 6 will put bit 26 into the carry flag, and you just need ADC to save one ADD opcode. So the sequence would be

    MOV T, X
    XOR T, Y
    ADD Y, dat
    RCL T, 6
    ADC X, Y

    I wish I could avoid the first MOV... but I have to admit it is inevitable, and test the hell out of it...

  • Carry on with GCC

    Yann Guidon / YGDES · 05/27/2021 at 17:16

    The last logs made me discover that GCC has some builtins to manage overflow, which might make a difference for x86 users since the carry is so critical in this algorithm. So let's explore the results.

    The first thing is that GCC doesn't explicitly support adding with the carry flag itself, unless there is some unknown magic under the hood. The ADDC instruction is an implicit ternary add, or "triadd" (sorry for the neologism), that is not supported on all processors, though it plays a huge role in multi-precision maths. This is critical because the algo should perform C+X+Y in one operation only: this ensures the result is squarely bounded in the 0-0x1FFFF range (and not 0x1FFFE). If you perform it in two steps, you end up with two carries to manage and more code to write, hence slower execution, which justifies using a larger register...

    Some more web search shows that GCC can generate ADDC when doing double-precision adds but not as a single "triadd". This is interesting though because maybe using a double-precision variable for X and Y could solve the problem...

    Anyway GCC seems to keep one of its promises :

    uint32_t addmultiple(
      uint32_t a, uint32_t b, uint32_t c) {
      unsigned int t, o;
      o = __builtin_sadd_overflow (a, b, &t);
      return t+c+o;
    }

    gives the following asm:

    gcc -Wall -g -Os -S carrybuiltin.c -o carrybuiltin.S =>
            xorl    %eax, %eax ==> clear the MSB in preparation for seto
            addl    %esi, %edi
            seto    %al        ==> set the LSB
            addl    %edi, %eax ==> instead of adcl edx, edi
            addl    %edx, %eax

    The use of SETO is a first step! However GCC doesn't take the hint for the next ADDL, maybe because I explicitly use an idiom that puts the carry into an integer variable. GCC not dumb: GCC see "add integer variables", GCC add integer variables.
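For completeness, GCC also offers the type-generic __builtin_add_overflow (available since GCC 5), which has proper unsigned semantics with unsigned operands; whether a given GCC version turns it into ADC is another question. A compilable probe:

```c
#include <stdint.h>

// Same probe with the type-generic overflow builtin (GCC >= 5):
// o receives the carry out of a+b, with unsigned wrap-around semantics.
uint32_t addmultiple_generic(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t t;
    unsigned o = __builtin_add_overflow(a, b, &t);
    return t + c + o;
}
```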

    All hope is not lost though:

    #include <stdint.h>
    #include <inttypes.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    uint32_t addmultiple(
      uint32_t a, uint32_t b, uint32_t c) {
      unsigned int t, o;
      o = ( (t=a+b) > a);
      return t+c+o;
    }
    leal    (%rdi,%rsi), %eax
    cmpl    %eax, %edi   \
    adcl    %edx, %eax   /  should be merged as addl

    Ah, that's much better! The carry is merged almost by magic! But it's not quite there yet: the CMP is redundant with the ADD instruction. And because GCC WANTS to return the result in EAX, it insists on using LEA for the computation, which does not affect the carry flag. Sometimes being smart can play against you. So you must be extra-extra-smart, at the risk of coding some obscure kludge...

    So let's kludge even more !

    uint32_t addmultiple(
      uint32_t a, uint32_t b, uint32_t c) {
      a += b;
      return a+c+ (a<b); // BEWARE of the comparison direction!
    }
    ====> gcc -Wall -g -Os -S carrybuiltin.c -o carrybuiltin.S
            addl    %esi, %edi
            movl    %edx, %eax ==> harmless move
            adcl    %edi, %eax

    There! There! I can't believe it, but it's there! There is an extra movl, but this is not relevant, because this ultra-short code snippet is not representative and the move will disappear in a larger function through register renaming.

    Just a warning though: get the comparison right! If you invert it, GCC can't optimise and will emit a longer instruction sequence.

    Unfortunately, it seems to be a "one-trick pony" that refuses to repeat its feat:

    uint32_t addmultiple(
      uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
      a+= b;
      a+= c+ (a<b);
      a+= d+ (a<c);
      return a;
    }
            addl    %esi, %edi
            adcl    %edx, %edi => Good !
            xorl    %eax, %eax
            cmpl    %edi, %edx
            seta    %al
            addl    %ecx, %edi
            addl    %edi, %eax => Bad !

    Come on, GCC !



    OK, one reason why GCC couldn't use adcl a second time is that my source code was wrong. It should have been:

      a+= b;
      a+= c+ (a<b);
      e+= d+ (a < (c+ (a<b)));
      return a+e;

    That changes everything! GCC correctly infers the common sub-expression and does not compute it twice, but it still fails to infer adcl.

    Le sigh.

    I suppose I have to write in asm...


  • _|_||_||_|______|______|____|_|___|_________|||_|__

    Yann Guidon / YGDES · 05/25/2021 at 04:12

    That's the surprisingly aesthetic output from the debug version of the latest program:

    This is halfway between the look of a cellular automaton and a procedurally-generated maze on an 8-bit console... Every "_" is a datum enqueued by the "slow backwards" thread and every "|" is enqueued by the "fast forward" thread.

    And it runs fast !

    $ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -Os pt_orbit_07.c -o pt_orbit && /usr/bin/time ./pt_orbit
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan backwards from Step 0 : X=1, Y=0, C=0 
    Starting to scan forward from Step 0 : X=1, Y=0, C=0 
      EUREKA !!!!  Backwards scan reached forward at X=45918 Y=0 C=0
    after 822785867 + 1324730547 = 2147516414 steps
     and 12706 + 20329 = 33035 crossings
    C'est fini.
    3.09user 0.00system 0:01.55elapsed 199%CPU

    The "fast-forward" loop alone runs in about 2.4 s, and running the slower version in parallel saved 0.9 s! Amdahl's law applies, of course, and it's not possible to get down to 1.2 s because the slower part will always hold everything else back. But my flexible synchronisation method with 2 FIFOs works well, at least to keep the speed maxed.

    However it doesn't always work as expected and there are many misses. I'll have to solve that, but you can already play at home with pt_orbit_07.c.

    Enjoy !

  • Please, GCC !

    Yann Guidon / YGDES · 05/24/2021 at 08:38

    I'm getting tired of fighting GCC's weird code generator, which won't take a clue from my source code. I'm (still) using GCC 6.4.1 (2017, I know) but I don't know how to make it understand basic things... This is critical at my level because any little performance win saves many days of computation!

    I get variable performance tiers (sometimes 50% jumps) and I can't see how to attribute (or force) them. I made many tests and simplified the source code to this point, trying to move as many jumps as possible out of the inner loop:

      while (N < limit) {
        do {
          do {
            // Compute forward :
            t = X + Y + C;
            Y = X;
            C = t >> WIDTH;
            X = t & m;
          } while (Y != 0);
          if ((X == 1) && (C == 0)) {
            // Something something
            return NULL;
          }
        } while (chunk < chunksize);
        printf("Forward Step %"PRIu64" : X=%d, Y=%d, C=%d \n",
            N, X, Y, C);
      }

    I can see unexplained jumps in performance and I looked at the generated asm, where I found some good and some less good surprises...

    gcc -g -S -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_05.c -o pt_orbit.S

    And I get this:

            .loc 1 86 0
            cmpq    limit(%rip), %r15  => while (N < limit) {
            movl    12(%rsp), %r9d
            jnb     .L3                => go to end of main loop
            .loc 1 68 0 discriminator 1
            xorl    %esi, %esi   => esi = chunk = 0
            jmp     .L8          => wait ? Why jump here ??
            .p2align 4,,10
            .p2align 3
            .loc 1 112 0
            cmpq    %r14, %rsi   => Why compare here and not at the bottom of the loop as I wrote ?
            ja      .L20
    .L4:        => Oh and this unwanted prefix has eaten from the alignment...
            .loc 1 96 0 discriminator 1
            movl    %ebp, %ebx   => wait, why a circular assignation ?
            movl    %r12d, %ebp  => X into ebp
    .L8:      => and the loop starts only now ?
            .loc 1 93 0 discriminator 1
            leal    (%rbx,%r9), %eax    => t = Y + C
            .loc 1 91 0 discriminator 1
            addq    $1, %rsi            => chunk++
            .loc 1 93 0 discriminator 1
            addl    %ebp, %eax          => t += X
            .loc 1 95 0 discriminator 1
            movl    %eax, %r9d          => wait, another assignation ? t goes to r9 now ?
            .loc 1 96 0 discriminator 1
            movzwl  %ax, %r12d          => X = t & mask
            .loc 1 95 0 discriminator 1
            shrl    $16, %r9d           => C = t >> 16
            .loc 1 101 0 discriminator 1
            testl   %ebp, %ebp           => loop again if ebp is not zero.
            jne     .L4
            .loc 1 103 0
            addl    $1, %r13d
            .loc 1 104 0
            cmpl    $1, %r12d      => X == 1 ?
            jne     .L5
            testl   %r9d, %r9d    => C == 0 ?
            jne     .L5
            .loc 1 106 0
            movl    $.LC1, %edi   => prepare printf...
            .loc 1 105 0
            addq    %r15, %rsi
            .loc 1 106 0
            movl    %r13d, %edx
            xorl    %eax, %eax
            call    printf

    The movzwl thing is a good way to avoid moving the 16-bit mask around, and it performs a move/copy to another register at the same time. But then WHY not exploit this and rename a few things around it? EAX is not needed after the movzwl, so it should have saved one mov! It should look like this:

       movzwl  %ax, %r12d          => X = t & mask
       shrl    $16, %eax           => C = t >> 16

    So the inner loop should look a bit like this:

    ; rsi => chunk
    ; r12 => X
    ; eax => t, C
    ; ebp => Y
       addl    %ebp,  %eax    => t = Y + C
       addq    $1,    %rsi    => chunk++
       addl    %r12d, %eax    => t += X
       movl    %r12d, %ebp    => Y = X
       movzwl  %ax,   %r12d   => X = t & mask
       shrl    $16,   %eax    => C = t >> 16
       testl   %ebp, %ebp     => loop again if ebp is not zero.
       jne     .L4

    As promised, that's 5 opcodes for the computation, 1 counter update and 2 instructions for the conditional jump. That one should fly, right? But do I have to switch to LLVM to get this?

    Meanwhile the orbit search for w=26 is still running at maybe half its potential top speed...


    Grumble !!!!

    Same source code, different optimisation level:

    [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_05.c -o pt_orbit && /usr/bin/time ./pt_orbit
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan forward from Step 0 : X=1, Y=0, C=0 
    Forward scan reached init state after 2147516415 steps...

  • Can you add states ?

    Yann Guidon / YGDES · 05/23/2021 at 03:42

    Going further on the path explained in log 27. What is it ?, I'm exploring the question of addition.

    The Pisano-Carry state space has "next" and "previous" operations, but addition is not natural; more precisely, it does not correspond to something we already know. Getting the "successor" is not strictly equivalent to adding 1, because there is no such "number" in this object: states are coordinates with 2 elements (and a half). We already know addition between integers, between complex numbers, between elements of a Galois field (such as GF(2^m), polynomials over Z/2Z)... But here it's something different.

    Another idea: if you can compute "next", you can also compute "next next" and so on... This could maybe speed up the search/exploration of orbits, but would it be faster than individual steps? And how would we know whether the intermediate steps visit a given state?

    Let's go back to the code's definition of one step:

    t = X + Y + C;    Y = X ^ data;
    C = t >> 16;      X = t & 0xFFFF;

    In a more mundane way, this could be written as:

    #define SIZE (1 << WIDTH)
    t = X + Y + C;
    Y = X;
    C = t / SIZE;
    X = t % SIZE;

    Computing 2 steps at once, though, is not so easy, because X and Y are swapped all the time:

    #define SIZE (1 << WIDTH)
    t  = X + Y + C;
    Y1 = X;
    C1 = t / SIZE;
    X1 = t % SIZE;
    u  = X1 + Y1 + C1;
    Y2 = X1;
    C2 = u / SIZE;
    X2 = u % SIZE;

    The div/mod operations don't make the sequence easy to simplify but, as in the advanced version of Adler32, there should be a way to perform them only at the very end of the sequence. If this is true, then the general algorithm could be sped up further...

    (To be continued...)

  • Going both ways

    Yann Guidon / YGDES · 05/19/2021 at 01:30

    I am now writing a program that runs both backwards and forwards iterations simultaneously on 2 cores, using the POSIX threads library.

    The performance seems reasonable, I included some little scheduling tricks to save a fraction of performance here and there.

    Forward alone:

    [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan forward from Step 0 : X=1, Y=0, C=0 
    Forward scan reached init state after 2147516414 steps and 33035 crossings
    2.72user 0.00system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 1664maxresident)k
    0inputs+0outputs (0major+71minor)pagefaults 0swaps

     Backwards alone :

    [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan backwards from Step 0 : X=1, Y=0, C=0 
    Backwards scan reached init state after 2147516414 steps and 33035 crossings
    4.63user 0.00system 0:04.64elapsed 99%CPU (0avgtext+0avgdata 1572maxresident)k
    0inputs+0outputs (0major+68minor)pagefaults 0swaps

    Backwards and forward simultaneously:

    [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_02.c -o pt_orbit && /usr/bin/time ./pt_orbit 
    Measuring the first orbit of the Pisano-Carry state space
     for WIDTH = 16 bits, or about 2147516416 iterations.
    Starting to scan forward from Step 0 : X=1, Y=0, C=0 
    Starting to scan backwards from Step 0 : X=1, Y=0, C=0 
    Forward scan reached init state after 2147516414 steps and 33035 crossings
    Backwards scan reached init state after 2147516414 steps and 33035 crossings
    7.63user 0.00system 0:04.76elapsed 160%CPU (0avgtext+0avgdata 1540maxresident)k
    0inputs+0outputs (0major+70minor)pagefaults 0swaps

    The overhead is marginal but explodes when other computations occupy the remaining 2 hardware threads. There is nothing to do about that, though.

    The big question is how to synchronise the cores. How do we know the orbit is closed, and where?

    So far the strategy is to run as fast as possible until X==1. This happens approximately 2^(W-1) times during a whole orbit. However the orbit has approx. 2^(2W-1) steps, so the "crossings" get rarer as W increases.

    For now, in this "rare" event, I test for Y==0 to end the loop, but this is where the threads could compare their respective results, namely the Y tagged with the iteration count. C is most probably 1, so it is not worth recording.

    Anyway this cuts down the number of comparisons, and for W=26 this would happen on average every 2²⁵ iterations, or 32M steps. At almost 1G steps per second, that's about 30 times per second. A buffer of one second will help smooth things over for the moments when a lot of crossings occur on one thread while the other is in a long trajectory between distant crossings. So let's have a FIFO with 64 or 128 values.

    Semaphores should be avoided as much as possible because they block the other thread, which could be performing useful computations. However, safe synchronisation primitives must be used to prevent the usual problems of parallelism.


    The desired behaviour is that the faster thread sends a sort of message (step number and value of Y) to the slower thread, which compares Y with its own series of values. The messages are materialised and queued by a FIFO, which the "faster/sender" writes and the "slower/receiver" reads. This matches what Wikipedia describes under Non-blocking algorithm:

    [...] some non-blocking data structures are weak enough to be implemented without special atomic primitives. These exceptions include:
    • a single-reader single-writer ring buffer FIFO, with a size which evenly divides the overflow of one of the available unsigned integer types, can...
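Under the assumptions of that quote, such a FIFO can be sketched as below. The field names and payload are mine, and a production version on weakly ordered CPUs still needs C11 atomics or memory barriers around the index updates:

```c
#include <stdint.h>

// Single-reader/single-writer ring FIFO with free-running indices.
// FIFO_SIZE is a power of two so the index wrap-around stays consistent.
#define FIFO_SIZE 128

typedef struct {
    volatile uint32_t head;   // bumped by the producer only
    volatile uint32_t tail;   // bumped by the consumer only
    uint64_t slot[FIFO_SIZE]; // e.g. a Y value tagged with its step count
} fifo_t;

static int fifo_push(fifo_t *f, uint64_t v)   // producer side
{
    if (f->head - f->tail == FIFO_SIZE) return 0;  // full
    f->slot[f->head % FIFO_SIZE] = v;
    f->head++;                 // publish after the slot is written
    return 1;
}

static int fifo_pop(fifo_t *f, uint64_t *v)   // consumer side
{
    if (f->head == f->tail) return 0;              // empty
    *v = f->slot[f->tail % FIFO_SIZE];
    f->tail++;
    return 1;
}
```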

