French article in GNU/Linux Magazine France

If you wonder why, just click and read the log.

For the what, check Definitions and Draft as well as PEAC vs LFSR for a comparison.

The original algo/project name was "Fibonacci checksum" but it later appeared that it was not the most accurate choice because Leonardo Pisano (the real name of Fibonacci) is the name associated to the periodicity of the sequence under modulo. All I did was add the tiny detail of "end-around carry", or "1-complement addition", which changed everything.

 
-o-O-0-O-o-
 

As for how I came to this system...

In 2012, Piano Magic released their album "Life Has Not Finished with Me Yet". One song contains a weird repeating pattern...

Glen Johnson's lyrics are often cryptic and more evocative than objective, but any geek's mind would cling on this mantra at the end:

"Add X to Y and Y to X"

This is really weird but... Why ? What's the point in this obscure "song" with no precise theme or clear subject ? And what does it do ? This last question is the most easily answered : just follow the damned algorithm.

C'est parti...

X=1, Y=0
Y+=X => 0+1=1
X+=Y => 1+1=2
Y+=X => 1+2=3
X+=Y => 2+3=5
Y+=X => 3+5=8
X+=Y => 5+8=13
X+=Y => 8+13=21
Y+=X => 13+21=34
X+=Y => 21+34=55
...

No need to go further, most of you should have recognised http://oeis.org/A000045, the famous Fibonacci sequence.

This gave me a compelling idea to modify the old Fletcher & Adler algorithms, keeping their very low footprint and minimalistic computational complexity. Both of these well known algos use a pair of values and have a similar structure. The trick is that rearranging the data dependency graph provides the equivalent of a minimalistic polynomial checksum, because the result is fed back on itself, in a more robust way than Fletcher's algo.

At first glance, this new checksum loop's body becomes something like :

Y += ( X + datum[i  ] );
X += ( Y + datum[i+1] );
i+=2;

This loop body is totally trivial to unroll. As trivial is the design of the corresponding digital circuit. This early version seemed to contain the whole checksum entropy in the last computed value but now X and Y are part of the signature. And the really important enhancement is the use of the carry!

For superscalar 32 bits CPUs, the following code seems to work well though the missing carry hardware (and/or lack of language support) requires more instructions to emulate:

t = X + Y + C;    Y = X + data;
C = t >> 16;      X = t & 0xFFFF;

A more efficient variation exists which does not use a temporary variable:

C += X + Y;
Y = X + data;
X = C & MASK;
C = C >> WIDTH;

In this worst case, without support of a carry flag, that's 5 basic operations (not counting memory loads) that fit in 4 registers and 3 cycles, to process 2 bytes. Not too bad. I'll let you deal with alignment. But is it really safe or safer?

The following logs will show how the idea evolves and the performance increases, through discussions about carry wrap-around, register widths, scheduling, avalanche, parallelism, orbits, structure, and escaping black holes...

 
-o-O-0-O-o-
 

Logs:
1. The variable
2. Datapath
3. Adler32 weakness
4. Easy circuit
5. Periodicity
6. A promising 32-bit checksum
7. Progress
8. Why
9. Orbital mechanics
10. Orbits, trajectories and that damned carry bit
11. Two hairy orbits
12. How to deal with black holes
13. Anteriority
14. Moonlighting as a PRNG
15. A new name
16. More orbits !
17. First image
18. Structure and extrapolations
19. Even more orbits !
20. Start and stop
21. Some theory
22. Some more theory
23. Even more theory.
24. A little enhancement
25. sdrawkcab gnioG
26. Further theorising
27. What is it ?
28. A bigger enhancement
29. Going both ways
30. Can you add states ?
31. Please, GCC !
32. _|_||_||_|______|______|____|_|___|_________|||_|__
33. Carry on with GCC
34. A good compromise
35. The new non-masking algorithm
36. Add or XOR ?
37. A 32/64-bit version
38. Closing the trajectories
39. Stats about missed crossings
40. Application : Displace LFSR in BIST
41. I need a big cluster.
42. Another theoretical rambling
43. An individual job and all the necessary processing
44. Test jobs
45. A job in VHDL
46. The problem with SIMD
47. Protocol and format
48. Just a test
49. Multiple ways, and rings
50. The PolarFire way
51. The train of thoughts and the red tail light
52. Ring de-buffering
53. Think Tetris
54. Adding addressing
55. Theorizing again
56. Better mixing
57. The PolarFire jobs
58. Face-off - how it could happen
59. What about EAC ?
60. Cheap tricks for better results
61. The case for additive mixing might be closed.
62. Additions and consequences
63. PEAC seen as a collection of 2-tap LFSRs
64. The new version of PEAC16x2
65. Modular equivalence
66. Swap, parity etc.
67. At last !
68. Comparing the entropy with LFSR
69. To reach 10TOPS, CUDA is required.
70. Another strategy to scan the statespace
71. Scanning with multiple processors
72. PEAC is NOT Pisano !
73. Explaining the mathematical basis of PEAC, the easy way.
74. Marsaglia was (almost) there
75. A different increment
76. More efficient scan of the state space
77. Application as a scrambler
78. PEAC generator: the French video
79. Taxonomy
80. And the English version !
81. PEAC vs LFSR
82. Avalanche of the scrambler
83. A practical use of PEAC : packet transmission
84. Compute better, compute more
85. How to double the scan rate
86. A more convenient definition of arcs
87. Semi-arcs: they work!
88. Back to the jobs
89. Packing and unpacking numbers
90. Post-processing
91. Post-processing: the strategy is coming
92. Post-processing: the strategy looks good
93. Going SIMD
94. New packing: p7k
95. New packing at work
96. Parallel scan, at the thread level
97. w16 in .79s
98. The road to w32
99. Of the difficulty to microbenchmark on modern processors
100. Fusion
101. Summing it up...
102. w26 is (even more likely) maximal !
103. Fusion 2: the manual tinkering
104. Fusion 3: the better algorithm
105. Fusion 4: it works
106. More mathematics
107. Aiming for w27
108. TODO
109. w27: the results
110. What about w28?
111. There could be a way... or not.
112. w32 in CUDA
113. Look ma! more mathing!
114. wow.
115. Hashing with npo2 PEAC
116. Thank you Bob!
117. I just realised...
118. Hashing with generalised PEAC (ep.2)
119. Hashing with generalised PEAC (ep.3)
120. Hashing with generalised PEAC (ep.4)
121. Hashing with generalised PEAC (ep.5)
122. Funky modulo
123. More questions on gPEAC
124. Definitions
125. Just another hypothesis.
126. Line scrambling with PEAC
127. Funnels and dead ends
128. Related ideas
129. New scanner record format
130. Do you speak (insert programming language name here) ?
131. Draft
132. Currently in (French) kiosks
133. Composition
134. Parity with Fibonacci
135. The second (as originally intended) application
136. A weird mystery is almost solved.
137. More scrambling
138. Coarse/retry scanning
139. Heuristic breakthrough
140. gPEAC to 1 million
141. The end of benchmarking.
142. Userland relative benchmarking on POSIX
143. Differential benchmarking
144. Branchless gPEAC
145. More results !
146. More sieving
147. C-ving
148. More C-ving
149. Tighter C-ving
150. The new gPEAC scanner
151. Even smarter C-ving, towards a reverse sieve
152. Boosted C-ving
153. 500000 is maximal
154. w32 "Just in case"
155. More candidates and 741454
156. The good numbers
157. A missed opportunity
158. Getting to 1 million
159. Delta coding and some stats
160. 999999 is maximal
161. PEAC for line encoding
162. French article(s)
.
.

 
-o-O-0-O-o-
 

NoFix:

 
-o-O-0-O-o-
 

Ironically, this structure reminds me a LOT of a 2-tap LFSR. So a more "powerful" version would use a Mersenne LFSR-inspired structure, with additional registers : Z, W, V... In several 2005 articles, I have explored "Mersenne Twisters" with 3 and 4 taps but even that would be overkill for such a basic application where throughput is the only point. The only effect would be to delay the feedback and reduce the growth factor from φ=1.6something down to maybe 1.3. The most important point is that ALL input bits must have an effect that lasts until the end of the checksum loop, or else errors can easily creep in (and this is where Adler is better than Fletcher).
(update: see log 63. PEAC seen as a collection of 2-tap LFSRs )