0%
0%

# Pisano-carry Checksum algorithm

Add X to Y and Y to X, says the song. And carry on.

Similar projects worth following
651 views
Another purely additive, non-cryptographic, fast and minimalistic checksum.

Aimed at being a 32-bit checksum for very low resources,
- very small, using only XOR and ADD,
- without the flaws of Adler32 and Fletcher, yet the same size and complexity,
- almost as good as CRC32, which it may replace when the memory footprint is critical,
- very easy to implement in digital circuits or microcontrollers with 16-bit accumulators,
- suitable for the detection of trivial corruption in network packets or files.

Disclaimer: This is a linear, reversible checksum so it's not appropriate for cryptography or hash tables. This is not even the fastest checksum ever.
But it's still quite good.

The most common application uses 2×16 bits. The 2×8 bits version doesn't work, and 2×32 bits seems untestable for now.

If you wonder why, just click and read the log.

The original algo/project name was "Fibonacci checksum" but it later appeared that it was not the most accurate choice because Leonardo Pisano (the real name of Fibonacci) is the name associated to the periodicity of the sequence under modulo.

-o-O-0-O-o-

In 2012, Piano Magic released their album "Life Has Not Finished with Me Yet". One song contains a weird repeating pattern...

Glen Johnson's lyrics are often cryptic and more evocative than objective, but any geek's mind would cling on this mantra at the end:

`Add X to Y and Y to X`

This is really weird but... Why ? What's the point in this obscure "song" with no precise theme or clear subject ? And what does it do ? This last question is the most easily answered : just follow the damned algorithm.

C'est parti...

```X=1, Y=0
Y+=X => 0+1=1
X+=Y => 1+1=2
Y+=X => 1+2=3
X+=Y => 2+3=5
Y+=X => 3+5=8
X+=Y => 5+8=13
X+=Y => 8+13=21
Y+=X => 13+21=34
X+=Y => 21+34=55
...```

No need to go further, most of you should have recognised http://oeis.org/A000045, the famous Fibonacci sequence.

This gave me a compelling idea to modify the old Fletcher & Adler algorithms, keeping their very low footprint and minimalistic computational complexity. Both of these well known algos use a pair of values and have a similar structure. The trick is that rearranging the data dependency graph provides the equivalent of a minimalistic polynomial checksum, because the result is fed back on itself, in a more robust way than Fletcher's algo.

At first glance, this new checksum loop's body becomes something like :

```Y += ( X ^ datum[i  ] );
X += ( Y ^ datum[i+1] );
i+=2;```

This loop body is totally trivial to unroll. As trivial is the design of the corresponding digital circuit. This early version seemed to contain the whole checksum entropy in the last computed value but now X and Y are part of the signature. And the really important enhancement is the use of the carry!

For superscalar 32 bits CPUs, the following code seems to work well though the missing carry hardware (and/or lack of language support) requires more instructions to emulate:

```t = X + Y + C;    Y = X ^ data;
C = t >> 16;      X = t & 0xFFFF;```

In this worst case, without support of a carry flag, that's 5 basic operations (not counting memory loads) that fit in 4 registers and 3 cycles, to process 2 bytes. Not too bad. I'll let you deal with alignment. But is it really safe or safer ?

The following logs will show how the idea evolves and the performance increases, through discussions about carry wrap-around, register widths, scheduling, avalanche, parallelism, orbits, structure, black holes...

-o-O-0-O-o-

-o-O-0-O-o-

NoFix:

• Endian (who uses Big Endian anyway today ?)
• 8-bit version/16-bit signature

-o-O-0-O-o-

Ironically, this structure reminds me a LOT of a 2-tap LFSR. So a more "powerful" version would use a Mersenne LFSR-inspired structure, with additional registers : Z, W, V... In several 2005 articles, I have explored "Mersenne Twisters" with 3 and 4 taps but even that would be overkill for such a basic application where throughput is the only point. The only effect would be to delay the feedback and reduce the growth factor from φ=1.6something down to maybe 1.3. The most important point is that...

### orbits_forwd2.c

Microbenchmark that compares the running time of forward and backwards computation.

x-csrc - 1.78 kB - 05/13/2021 at 03:21

### orbit_resume.c

same as orbit_count.c but with command line options to resume the search at a given point.

x-csrc - 1.99 kB - 05/07/2021 at 06:03

### Pisano16x2.c

A simple function that computes the checksum of a buffer.

x-csrc - 815.00 bytes - 04/29/2021 at 01:21

### test_Pisano16x2.c

a dumb main() to verify that Pisano16x2.c works.

x-csrc - 469.00 bytes - 04/29/2021 at 01:21

### orbit_count.c

Measures the first orbit of the Pisano-Carry state space (the long way)

x-csrc - 1.28 kB - 04/27/2021 at 05:08

• ### A bigger enhancement

Yann Guidon / YGDES13 hours ago 0 comments

The log 24. A little enhancement explored the idea of a marginally "safer" version which would compensate for the altered carry bit that would lead to a "funnel". The result was not stellar and a huge hit on performance. Meanwhile I'm still looking for a method to deal with 32-bit wide data to further enhance the performace-per-cycle but the theory is pessimistic on this front.

Today, a new idea percolated and it might bring both "safety" and "performance" together by reusing the existing building blocks in new, clever ways, with the magic of the fractional format. Indeed when we look at the classic code:

```t = X + Y + C;    Y = X ^ data;
C = t >> 16;      X = t & 0xFFFF;```

We see that the useful values are in the least significant bits. There is no need to think further if the platform is 16-bit wide or supports this computation mode: the result masking and the carry shifting are performed implicitly by the hardware. It gets ugly with wider registers because no language (apart from ASM) lets you access the carry bit and you have to clean up after the computations.

But we can save the masking by pre-shifting the values! For example, for a 32-bit register, the above code would work with the higher half, the 16 MSB. The new code would become :

```t = X + Y + C;
Y = X ^ data;```

Wow ! We shaved one half of the code ! It's now even more simple and lean ! And we can deal with 16-bit data on the LSB half, out of reach from the dirty funnels. The carry propagation from the lower half will only happen every other time, with very very low chance of creating a funnel. It has not disappeared but as they say, "engineering is the art of pushing problems where they are not an issue".

So we have the new split design, with the lower 16 bits getting the mixed data, the higher bits running the usual PRNG sequence with a few nudges, but avoiding the known theoretical traps, and a checksum that can be as wide as 64 bits now (when X and Y are concatenated).

The scheme can be repeated with wider registers: for example it becomes easy to process 32 bits per step with a 64-bit register, and so on. As long as you keep the Pisano machinery in the MSB, you can arrange the data as you like. I'll let you figure how to deal with 48-bit chunks of data ;-)

But wait ! Where has the C variable gone ? In the following line:

```t = X + Y + C;
```

C is usually the carry generated by the addition, which becomes useful again with the new MSB alignment. So this is solved by reusing this carry flag. End of story, right ?

Weeeeeellllllll, the carry should be inserted at the bottom of the Pisano field, at bit 16 for the 32 bits version, so the above code is misleading and no architecture can do that. x86 has a CMOV instruction and ARM has/had(?) predicated opcodes so it is still possible to conditionally add 0x10000 to Y for example.

But in C, we don't get that sweet little carry bit. The wraparound could be detected by a hackmem-esque trick, such as

`C = ((t<X) && (t<Y))? 0x10000 : 0 ;`

or

`C = (((t-X) & (t-Y)) >> 16) & 0x10000;`

But this increases the computation time and loop size, as well as introduce a few off-by-one issues.

Furthermore we might miss the cases where a carry is generated in the intermediary result of the addition (because, apart from x86 with its LEA opcode, no other architecture provides a 3-addend addition and IIRC it doesn't even affect the carry flag).

So once again we're back to the original code, where the mask&shift can work in parallel. And by not masking C we can feed some of the LSB sum back into the LSB to mix with the incoming data, to further increase the mix...

```t = X + Y + C;    Y = X + (uint16_t)data;
C = t >> 16;      X = t & 0x7FFFFFFF;```

In this version, the 16-bit Pisano part would be in bits 15-30. Bit 31 would be the carry and the MSB of the 16 bits of data would eat into the LSB but that's not considered as a significant problem here.

At this point, not masking the MSB of X is tempting but that would discard all the already-explored...

• ### What is it ?

Yann Guidon / YGDES21 hours ago 0 comments

There is a question that bugs me for weeks now : what type of mathematical object is it ?

I have a quite decent knowledge of the usual types : integers, reals, complex numbers, and even some Galois fields over Z², for example. Quaternions, vectors and matrices are other types that are well defined. But the object under consideration puzzles me. Is it a ring, a field, a group, or something else ?

 In mathematics, rings are algebraic structures that generalize fields: multiplication need not be commutative and multiplicative inverses need not exist. In other words, a ring is a set equipped with two binary operations satisfying properties analogous to those of addition and multiplication of integers. Ring elements may be numbers such as integers or complex numbers, but they may also be non-numerical objects such as polynomials, square matrices, functions, and power series. Formally, a ring is an abelian group whose operation is called addition, with a second binary operation called multiplication that is associative, is distributive over the addition operation, and has a multiplicative identity element.

So here is the new riddle : does addition even exist in this object ? The answer is easy for the previously cited types yet here, the elements are not numbers, but coordinates in a 2.5D closed space. Whether the carry counts as one individual dimension is still being debated. The description does not fit directly with a vector system. The behaviour is a bit reminiscent with a very weird Galois field but here, the "generator" generates only one half of a symmetrical set of coordinates (which I call orbits here). Really I know nothing of this kind.

• It is a family of sets defined by a single parameter (called here w, or width)
• Each set is organised as a vector space with coordinates X=[0..2w-1], Y=[0..2w-1] and C=[0..1], hence cardinality of 2w+1 elements.
• Each element has a predecessor and successor, defined by a reversible formula (with forward and backwards directions, already defined in other logs).
• Elements (0,0,0) and (2w-1,2w-1,1) are their own precursor and successor.
• Except for w=1, the space shows a symmetry and duality (invertibility?), where all elements (x,y,c) have an inverse value (2w-(x+1), .2w-(y+1), 1-c), with the same behaviour.
• Since each point has a successor and a precursor, and the number of points is finite, the series of successors must be finite too, leading the series back to its first element, and this creates an "orbit".
• Some values of w create a system with only two orbits (as calculated in this log) and these are the ones I'm seeking:
`2, 3, 4, 10, 16`
Other numbers may exist above 25 but have not been discovered so far.

.

Another interesting detail is the way the successor is calculated: Y gets the value of X. So in a way, when we map the coordinates on a 2D plane, each step represents a sort of rotation on this plane. This system makes me think of the logistic map r×x×(1-x), such as:

Of course it is not directly related, if only because here the values are integer, modulo 2w and every overflow causes an increment of x. The other detail is that there is no parabola, no multiply/squaring and the guiding function is Fibonacci's series. Is there a link I'm missing yet?

• ### Further theorising

Yann Guidon / YGDES2 days ago 0 comments

PS: this log has bogus data so forget about it... but 6 looks interesting anyway because it has 6 orbits of equal lengths.

-o-O-0-O-o-

More than two weeks of computations for w=26 and w=27 have given some time to think on the larger scale and a better look at the available data. But it's getting reeeeeaaaaallllllllyyyyyyy long now and I'm mostly wishing that the w=26 and/or w=27 fail soon. At only 5%, 27 can still fail but so far 26 looks promising.

For now we have this list of potentially valid sizes :

`1 2 3 4 6 10 16 (26?)   -> 6 is false...`

It already struck me that 6 + 10 = 16, and now 10 + 16 = 26. Going backwards, 10 = 6 + 4 and 6 = 4 + 2 (hello, kids' maths). Now it's obviously the Fibonaccy sequence, but doubled ! But what happens with 1 and 3 ? Why do they work but not 5 ?

From there, we can postulate some predictions/hypothesis:

• 32 is not a valid size. Too bad ! 16 is valid because it is the double of 8 which is a Fibonaccy number. No other number in the sequence is a power of two, sadly. Finding 16 on the first try was a very lucky draw!
• The next valid sizes would be 42 (of course !), 68, 110 ...

These speculations need to be verified but at least the theory progresses, and it is dearly needed. At this rate, completing w=26 will take two more months (and I don't imagine for w=27). I must find a trick to make it only one month. Furthermore, the longer it takes, the higher the chances of flukes/miscalculations accumulate. They are rare but not impossible at this scale

So even halving the run time is very important and I'm back to the concepts of the last logs, where 2 cores compute roughly one half of the orbit and meet "somewhere" near the middle. And I think I have solved one major issue I had with the previous ideas: there is no need to store all the paths, we only need to compare the last N values. Indeed, the orbit is closed at one point, not potentially random values, and the sequence is run backwards and forward. Now the question is : how many points and which should be kept in the history ?

We can synchronise the 2 threads so they know what the other has "consumed", and discard the elements that have been compared. My favourite sync method is not semaphores or read/write flags, but a pair of mailboxes where one thread write and the other can only read, which simplifies the cache coherency and updates.

• ### sdrawkcab gnioG

Yann Guidon / YGDES3 days ago 0 comments

The log 22. Some more theory explored the idea of computing the sequence in reverse, with the hope of halving the time needed to compute a whole orbit. Today I made this microbenchmark: orbits_forwd2.c.

This is the result of several basic optimisations targeting OOO CPUs such as the i7 of my laptop, and I got pretty interesting results. First let's look at the basic backwards algorithm:

```// reverse the split
tb = (C << WIDTH) | X;
Xb = Y;
Yb = tb - Y;

Cb = Xb - Yb;
Cb = ((int32_t) Cb) >> WIDTH;
Yb += Cb; // N + -1 = N - 1
Cb &= 1;```

So far it's quite good, as it avoids branch prediction horrors by computing C directly. There is no access to the carry bit so it is computed by hand. The result of the comparison is sign-extended into the full register, giving either 0 or -1. This is added to selectively decrement Y and we're done.

But it's long and slow so there is no hope or benefit on this front, except for the claim and confirmation that indeed, the computation is reversible (as was supposed) and the theory is confirmed: if there is only one successor and one predecessor, then the values must loop and create an orbit. I checked with several programs up to w=16 and the algo is good.

The story doesn't end here though. The comparison part introduces more delay through dependencies so I tried to break them through substitutions. I was not disappointed and after some progressive enhancements, I came up with the version included in orbits_forwd2.c.

```    // reverse the split
t = C | X;
X = Y;
//  C = X - Y;  // normal comparison
//  C = Y - (t - Y); // Substitution
C = (2*Y) - t;  // more substitution
Y = t - Y;

t = ((int32_t) C) >> WIDTH;
Y += t; // N + -1 = N - 1
C &= SIZE;```

Among the enhancements, some lines have been moved up or down, t is used 2×, C's value is left unshifted for a direct use "in place" and successive substitutions provide a formulat that can be run in parallel, breaking the dependency as expected. This creates a weird situation where C and Y compute the opposite values : Y-t and t-Y. Little by little, this makes the formula look more rich and interesting than the forward formula...

Now let's speak numbers : through the optimisations, the backwards version is usually 50% slower than the forward version. Some more tweaks (smoother interactions with the loop body) bring the overhead to about 1/3 but this is never going to be as fast as the forward version. The fastest runs for w=16 I had are 2.88s for forward and 3.88 for backwards.

Furthermore, the computation seems to saturate the i7's integer pipelines and the 2nd thread's performance is affected. For example, if I run the microbenchmark while the 2 cores are already running the orbit search with w=26 and w=27, the microbenchmark's runtime is doubled. Or I don't understand Intel's "hyperthreading" and it's more lousy than I thought.

This means that for the orbit search to work efficiently, it must use one core to walk forward and another to walk backwards at full speed. Synchronisation then becomes a nightmare : how do we know when and where the two trajectories meet ?

The first idea is naturally to compute both directions in "lock-step": compute forward, compare, compute backwards, compare, rinse, repeat. This implies that all computations happen in the same CPU core but there is the risk of saturation of the units. So this might not accelerate the whole search.

Another idea is to run two cores in parallel, one in each direction, but synchronisation is a big issue. The comparisons can be minimised when Y=0 and for w=27, this could fit into 2^24=16M bytes, but still larger than my L3 cache so it's going to thrash a lot. Worse : cache coherency will further slow everything down and let's forget about the "huge TLB" support.

Anyway : the computation is reversible with a reasonable amount of effort but it doesn't seem to provide a practical advantage, considering today's computers architecture.

• ### A little enhancement

Yann Guidon / YGDES7 days ago 0 comments

As we have already explored, a system with width=w will have

• w bits for X
• w bits for Y
• 1 bit for C

The whole state space is also divided into two groups :

1. the orbits (2 ideally, of equal length)
2. the trajectories, which lead to one orbit at the next step.

The 1-step trajectories might irk some purists because they would be seen as a "funnel".

In a way this is true because this makes the system non-reversible, as half of the alterations (from the user datastream) will make the state fall in the "diagonal" band of states that leads to the funnel. Switching orbit is not an issue since it is reversible.

There is one solution though : we have already seen that the orbits always fall in a "triangle", where C=0 when X>Y and C=1 when X<Y. So the trick would be to adjust/correct C after Y is altered, so the state remains in either of the orbits. But where would this correction fall in this code ?

```t = X + Y + C;    Y = X ^ data;
C = t >> 16;      X = t & 0xFFFF;```

In the cases I see, this will simply override, or overwrite, C. At first glance it seems to throw one bit of entropy away but if we consider the condition of C, it is not really lost since it depends on X and Y so if Y changes, C should too. So we could write

`C = sign(X-Y)`

But not anywhere, because 2 parallel operations are taking place simultaneously. However C is updated on the 2nd line and since it will be overwritten, that statement can be replaced and re-organised:

```t = X + Y + C;    Y = X ^ data; // ^ or +, as you like
X = t & 0xFFFF;
C = (((X-Y) >> 16) & 1```

I'm not sure yet if/how I can run the masking of X simultaneously with the comparison. x86 has comparison instructions that would greatly simplify the last line, which might be obscure to the compilers. Let's also try with:

```t = X + Y + C;    Y = X ^ data;
X = t & 0xFFFF;
C = (X<Y)?1:0```

That last line should be translated in two x86 opcodes. If carries are used in asm, just move things around:

```C = sign(X-Y);   // just a CMP
t = X + Y + C;   // ADDC
Y = X ^ data;    // XOR
X = t & 0xFFFF;  // might be avoided in 16-bits mode
```

This is obviously slower than the classic version, with 3 cycles, and must be verified as equivalent with the original code.

-o-O-0-O-o-

What are the costs and benefits ?

• The immediate cost is the added subtraction that compares X and Y, slowing the computation down in the critical datapath.
• Another cost is the loss of one diagonal line, since C is now explicitly computed but with a partial formula
• The benefit is having a truly reversible computation and a potentially more solid error detection but this assertion must be tested and verified.

In practice, the balance must be carefully chosen. So far, before this alternate version is thoroughly tested and compared, it seems that the original, simple version, is still the most attractive, at least for short or medium data streams. The "enhanced version" without the funnels must show a significantly increased robustness to justify the added speed cost.

.

• ### Even more theory.

Yann Guidon / YGDES7 days ago 0 comments

The last log 22. Some more theory examined the orbits of the systems with w=0, w=1 and w=2. Let's look at the orbits of w=3 now:

This system has 2 orbits with 35 steps each and shows more complexity than the previous ones. The ranges are [0..7] now:

 Orbit 1```Starting at 1,0,0 1 - 1,1,0 2 - 2,1,0 3 - 3,2,0 4 - 5,3,0 5 - 0,5,1 6 - 6,0,0 7 - 6,6,0 8 - 4,6,1 9 - 3,4,1 10 - 0,3,1 11 - 4,0,0 12 - 4,4,0 13 - 0,4,1 14 - 5,0,0 15 - 5,5,0 16 - 2,5,1 17 - 0,2,1 18 - 3,0,0 19 - 3,3,0 20 - 6,3,0 21 - 1,6,1 22 - 0,1,1 23 - 2,0,0 24 - 2,2,0 25 - 4,2,0 26 - 6,4,0 27 - 2,6,1 28 - 1,2,1 29 - 4,1,0 30 - 5,4,0 31 - 1,5,1 32 - 7,1,0 33 - 0,7,1 34 - 0,0,1 35 - 1,0,0``` Orbit 2 ```Starting from 7,0,0 1 - 7,7,0 2 - 6,7,1 3 - 6,6,1 4 - 5,6,1 5 - 4,5,1 6 - 2,4,1 7 - 7,2,0 8 - 1,7,1 9 - 1,1,1 10 - 3,1,0 11 - 4,3,0 12 - 7,4,0 13 - 3,7,1 14 - 3,3,1 15 - 7,3,0 16 - 2,7,1 17 - 2,2,1 18 - 5,2,0 19 - 7,5,0 20 - 4,7,1 21 - 4,4,1 22 - 1,4,1 23 - 6,1,0 24 - 7,6,0 25 - 5,7,1 26 - 5,5,1 27 - 3,5,1 28 - 1,3,1 29 - 5,1,0 30 - 6,5,0 31 - 3,6,1 32 - 2,3,1 33 - 6,2,0 34 - 0,6,1 35 - 7,0,0```

This time, most of the points on the lower diagonal (n,n,0) belong to orbit 1 and the others to orbit 2, but this feature is rather unique to this configuration.

Furthermore I don't see a clear pattern to identify the middle of the orbit. With an odd number of steps, 35/2=17.5 falls between (0,2,1) and (3,0,0), which looks a lot like 2^(w-1)-1 but this looks like a fluke so far.

.

• ### Some more theory

Yann Guidon / YGDES05/08/2021 at 17:31 0 comments

The last log is only scratching the surface and the redaction was interrupted so I resume here.

• A216783: Number of maximal triangle-free graphs with n vertices.
1, 2, 3, 4, 6, 10, 16, 31, 61, 147,...

• A329695: Number of excursions of length n with Motzkin-steps avoiding the consecutive steps UD, HU and DH.
1, 2, 2, 3, 4, 6, 10, 16, 28, 48, 85, 152,...

I don't know if they are really related but somewhere, something should match and these are some of the least implausible. I have not found a matching underlying theory yet so I'm fishing for whatever can be found. The results of the next brute-force searches will tell us more and refine the fishing expedition.

So a lot of responsibility remains on the shoulders of the brute-force search. And the "theory" could help reduce the strain. For example, the "backwards equation" can double the speed of the the exhaustive search for orbits !

Consider this: if the orbit has a forward and a backwards direction, then both ways can be explored simultaneously and meet at the middle. This is not really a win when taken from the perspective of the total amount of computations. However the steps can be run in parallel !

The computation is still very serial in nature (particularly because each step of both computations must be compared) but that's where Out Of Order Execution shines, because it can run several instructions in parallel and reorganise the execution flow at will. Indeed the pipeline will be better used than in the purely serial code that is currently running on my old dual-core i7. So I should focus on the backwards equation. The diagram below shows the principle, where both paths are one half shorter:

But then, one more idea comes: What if we could determine the coordinate of the middle of the orbit in advance ? From there, it would be easier to only compute one half and meet at the quarter of the orbit, which would further half the computation effort.

Then, from there, what keeps me from going further and bissecting the orbit further and further and... After all the orbit length is close to a power of two so it would be a good match. But this would work only if an ideal orbit exist, so what makes these orbits ideal in the first place ?

There is another way to see it, not with the "top-down" approach but "bottom-up". Let's start with the simplest, smallest system and see how it evolves.

-o-O-0-O-o-

The case with width=0 only contains 2 stuck states so there is nothing much to learn there.

-o-O-0-O-o-

With width=1, we have 8 states with X and Y oscillating between 0 and 1.

`(1,0,0)=>(1,1,0)=>(0,1,1)=>(0,0,1)=>(1,0,0)`

Wait, what ? We have a 4-long orbit ? Anyway the binary sequence looks legit. If we exclude (0,0,0) and (1,1,1) the other orbit should be 2-long, right ?

```(1,0,1)=>(0,1,1)=>(0,0,1)=>(1,0,0)
(0,1,0)=>(1,0,0)=>(1,1,0)=>(0,1,1)=>(0,0,1)=>(1,0,0)```

No, it falls on the "diagonal" and there is only one orbit, and the two trajectories enter the loop at a different place.

-o-O-0-O-o-

For width=2, there are 32 states, 2 orbits and X is modulo 4 (spans 0..3), 2 symmetrical orbits of length=9 and 2×6 trajectories.

Let's look at both orbits in parallel:

```(1,0,0)=>(1,1,0)=>(2,1,0)=>(3,2,0)=>(1,3,1)=>(1,1,1)=>(3,1,0)=>(0,3,1)=>(0,0,1)=>(1,0,0)
(2,0,0)=>(2,2,0)=>(0,2,1)=>(3,0,0)=>(3,3,0)=>(2,3,1)=>(2,2,1)=>(1,2,1)=>(0,1,1)=>(2,0,0)```

Neat ! Now let's rotate the 2nd orbit a little to start with (2,3,1):

`(2,3,1)=>(2,2,1)=>(1,2,1)=>(0,1,1)=>(2,0,0)=>(2,2,0)=>(0,2,1)=>(3,0,0)=>(3,3,0)=>(2,3,1)`

If you invert the carry ( 0<=>1 ) and the X and Y ( 0<=>3, 1<=>2 ) you find the same sequence as the first orbit ! This was not possible or true with width=1. QED.

-o-O-0-O-o-

What else can we learn from this ? Another noticeable feature is that going backwards in the orbit is possible, despite the carry. The last log 21. Some theory has deduced the antecedent from...

• ### Some theory

Yann Guidon / YGDES05/02/2021 at 13:54 0 comments

This computer is still searching for the loop length for the w26 and w27 configuration but it only goes as far as 7,5% of the expected range in 3 days & half. At 2% per day, we'll still be there in a month and half... Which leaves some time to think about the theory.

First, let's have a look at OEIS and refine the search because we have new data:

• 2 is actually valid (I didn't bother checking before), I'll have to check with 1
• All widths from 17 to 25 have been invalidated.

While looking at https://oeis.org/search?q=2,3,4,10,16 I can exclude the sequences that contain 17 to 25. One curious find is A293632 :

Least integer k such that k/Fibonacci(n) >= 3/4 : 1, 1, 2, 3, 4, 6, 10, 16, 26, 42, 67, 108, 175

The Fibonacci word is mentioned but I have no idea what this sequence means or if it is actually relevant. However several sequences with the same prefix have the number 26 so I'm hopeful that the current computation will provide a positive result. But it will be too long... Something else must be done while my laptop endlessly churns numbers. So let's go to the basics.

In the log 18. Structure and extrapolations, I mentioned obvious geometric/topologic features. For example in the w4 version :

The upper half is almost identical to the lower half, because they correspond to the result of the same operation. There is a shift because to get to the lower part with C=1, X has an extra +1.

Then we also have this complementarity, with a central symmetry, within one square/half:

(Here I have cropped the top-most line to make it more obvious and get an odd number of lines)

So in the best/desired case, once get have one coordinate for one orbit, we also have the corresponding point for the other orbit at the opposite from the centre.

The above picture is taken from the upper square, where we see that the upper-right triangle contains the diagonal, which explains why we find another power of two in the length of the orbits.

```                        orbit
Width     states       length    ratio(%)    decomposition
2             32            9     28.125     2^3 +  2^1 -1
3            128           35     27.34      2^5 +  2^2 -1
4            512          135     26.36      2^7 +  2^3 -1
10       2097152       524799     25.02     2^19 +  2^9 -1
16    8589934592   2147516415     25.0003   2^31 + 2^15 -1```

For a width w, the orbits has 2^(2w-1) + 2^(w-1) -1 states, the middle term is the diagonal line and the -1 is the stuck point.

When it is multiplied  by two, to account for the two complementary orbits, the formula is 2^(2w) + 2^w -2 and this is coherent with the diagram where the diagonal (with only trajectories that lead to the orbits) is removed:

And we find another central symmetry, coming from the already existing symmetry within a square and the copy/shift of the lower triangle.

This duality (in the "ideal case") works because the 2^w range behaves like a "ring" where 0 is equivalent to 2^w -1. How/why, I don't know, and I have no idea why the carry makes the orbits fill the space, and only for specific values of w, while this does not happen in the classic Pisano periods (where the main orbit fills only 3w/2 states).

-o-O-0-O-o-

There is another crucial question : does every point have an antecedent/predecessor ? If there is a definitive positive answer, then all versions must have orbits. This is also very important because the answer will also tell if the function has funnels, per Bob Jenkins' definition (go read his article if you don't know it already !).

We already know that each point has only one successor, given by the very definition of the sequence:

```X' = X + Y (mod 2^w)
Y' = X```

So, given X' and Y', we can deduce the antecedent:

```X = Y'
Y = X' - Y' (mod 2^w)```

In the example below, we see a geometric construction that finds the antecedent of (20,6) through some projections:

Because X goes into Y in the following iteration, there is not much choice in the antecedent. This is a very precious because from there, we can deduce a few useful things !

-o-O-0-O-o-

For example, let's look at the a point...

• ### Start and stop

Yann Guidon / YGDES04/28/2021 at 20:13 0 comments

I have already explored all the "low hanging fruits" I have found but I'll need more time to come up with a proper perspective and theory. However, practical implementations have some coding constraints that I will address in this log.

Let's have a look at this revised dependency graph, which I revised to consider the aspects covered here:

I tried to minimise the total number of operations. This would probably be adapted for a hardware implementation, of course, but let's first design a first working software version.

The first thing to consider is the last one, usually called the "Finish" step, where the final result is handled. Already, X, Y and C contain the whole state for the system but

1. C is usually discarded, so should be better combined with the remaining bits
2. X could be OK by itself but adding it to Y should increase the Hamming distance in the case of an alteration that hits the last word and/or the checksum.

So the Finish contains a simple addition with carry in, and no carry out.

• If you only require 16 bits of signature, this sum is all you need.
• However if you desire 32 bits, you can concatenate X and Y.

Going up from there, we have the loop body. The X-Y swap has been moved up but this doesn't change much here.

Then we have the Start step where the variables are initialised. As we have learned, the carry must be set to a value that is coherent with Y, in order to prevent a "stuck state". So in fact, any value is OK except  { Y=FFFF, C=1} and { Y=0, C=0}. I have used 0x1234 to prove my point but you may choose another non-forbidden value.

One input value is pre-loaded into X and the pointer can be pre-incremented in the loop to save one instruction and prevent an off-by-one access (and/or TLB lookup for a block that will not be used).

From there, writing the code is quiet like a walk in the woods...

```#include <stdint.h>
#include <inttypes.h>

#define PISANO_WIDTH (16)
#define SIZE  (1<<PISANO_WIDTH)

uint32_t Pisano_Checksum(
uint16_t *buffer,  // points to the data to scan
int size           // number of 16-bit WORDS to scan
) {
uint32_t
// Start
X = *buffer,
Y = 0x1234, // anything except 0 or FFFF
C = 1,
t;

// Loop
while (size) {
size--;
buffer++;

t = X + Y + C;   X = Y ^ *buffer;
C = t >> PISANO_WIDTH;  Y = t & MASK;
}
// Finish
return C+Y+(X << PISANO_WIDTH);
}
```

Aaaand that's it.

I provided a tiny main() to verify the basics.

```gcc -Wall test_Pisano16x2.c -o test_pisano && ./test_pisano
Checksum: 8573B7DA```

What remains now to be done ?

• ### Even more orbits !

Yann Guidon / YGDES04/27/2021 at 05:11 0 comments

With a new smaller program orbit_count.c, I try to check the length of the first orbit for widths larger than 16.

Remember what we have found previously : for the configuration to be "good", the first orbit must have at least 2^(width*2 -1) different states, or > 25% of all possible states.

The first 4 are disappointing:

• 17 : 52.942.979 (0.6% of the 8.589.934.592 taht would be required to pass the test)
• 18 : 172.662.654  (instead of 34.359.738.368 to pass, or 0.5%)
• 19 : 171.681.219 (0.12%, much less than 137.438.953.472)
• 20 : 317.330.936 (0.06%, barely a blip compared to 549.755.813.888)
• 21 gets longer (4%) but is still far from 2199.023.255.552 :
`1m53s for 87.960.719.859 iterations`

22 and 23 might be interesting... I have no idea how long they'll run but it's good to have 4 cores on the laptop, so I can run 2 heavy tasks while writing this comfortably. However I can only run about 30G iterations per core per minute and w22 might run about 8800G (4h and half) and w23 about 35000G iterations (17h)...

-o-O-0-O-o-

Later :

• 22 is approx. one half of the expected length (48.27% of 8.796.093.022.208):
```Orbit length: 4.246.390.747.269
8353.49user 1.13system 2:19:54elapsed 99%CPU ```
• 23 failed half-way (almost precisely 50%) from 35.184.372.088.832
```Orbit length: 17.592.032.552.735
30717.44user 4.18system 8:33:59elapsed 99%CPU```
• 24 didn't last long : 0,95% of 140.737.488.355.328
```Orbit length: 1.340.357.111.846
4427.70user 0.05system 1:14:00elapsed 99%CPU```
• 25 disappointed too: 0.0015% !
```WIDTH = 25 bits, or about 562.949.953.421.312 iterations.
Orbit length: 8.882.847.330.803
17867.77user 3.16system 5:52:36elapsed 84%CPU```
• 26 is running, but I have no idea how much time it will take to try 2.251.799.813.685.248 iterations.
6191m55s, Step 202.653.736.894.464 : X=60945700, Y=54013062, C=0 (9%)
8224m44s, Step 270.686.018.863.104 : X=30001905, Y=43460744, C=1
9578m24s, Step 316.109.592.985.600 : X=24653047, Y=21525614, C=0  (14%)
10089m46s, Step 332121231065088 : X=39691475, Y=26670666, C=0
... 605320.20user 67.81system 230:16:09elapsed 73%CPU
1226m08s, Step 374.383.709.257.728 : X=56485448, Y=64077162, C=1  (16.6%)
3416m13s,  Step 447.501.232.504.832 : X=7984208, Y=8052197, C=1
• 5057m39s ,Step 503.782.483.951.616 : X=14136462, Y=63349888, C=1  (22.3%)
6569m12s, Step 558.345.748.480.000 : X=18544557, Y=11503616, C=0

• 27 is running too, maybe it will stop before 9.007.199.254.740.992... That's 9 million billions.
5948m55s, Step 195.094.594.453.504 : X=13025821, Y=12504916, C=0 (2,1%)
7981m58s, Step 263.126.876.422.144 : X=85800334, Y=53796446, C=0
9335m43s, Step 308.550.450.544.640 : X=132594465, Y=19377793, C=0 (3,4%)
9847m10s, Step 324.493.369.147.392 : X=109974339, Y=52277567, C=0
... 590770.05user 61.77system 226:12:32elapsed 72%CPU
1029m41s, Step 354.661.219.434.496 : X=70991304, Y=49909254, C=0  (3.9%)
3115m51s  Step 414447164194816 : X=18026315, Y=45706791, C=1
4757m31s Step 463169273200640 : X=71156624, Y=61825695, C=0 (5.14%)

.

Stay tuned. But so far I have not found another "good configuration". I'd love if Width=32 was good but for now there is no reasonable way to prove this except with the brute-force approach.

20210507 : computations started on 20210427 and now I must reboot my computer. Fortunately I have saved here some checkpoint values but I have no program that accepts them. Sigh.

....

OK it was  pretty straight-forward! I made orbit_resume.c  so I could resume the computations. Log on !

______

Note: the search runs on a i7-2620M CPU@ 2.70GHz (with boost speed claimed at 3.4GHz, only 3.2G if 2 cores are active). OK this is already 10 years old but the cache is 4MB and the speed is better than the i5 of the same generation.

Now the question is : how does Intel share the integer pipelines with its "hyperthreading".

Share

## Similar Projects

Project Owner Contributor

### Recursive Range Reduction (3R) HW&SW CODEC

Yann Guidon / YGDES

Project Owner Contributor

### Game of Life bit-parallel algorithm

Yann Guidon / YGDES

Project Owner Contributor

### 3RGB image lossless compression format

Yann Guidon / YGDES

Project Owner Contributor

### σα code

Yann Guidon / YGDES

# Does this project spark your interest?

Become a member to follow this project and never miss any updates