
PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Another purely additive, non-cryptographic, fast and minimalistic checksum.

Computes a 32-bit checksum with a pair of 16-bit registers:
- very small, using only ADD,
- without the flaws of Adler32 and Fletcher, yet the same size and complexity,
- almost as good as CRC32, which it may replace when the memory footprint is critical,
- very easy to implement in digital circuits as well as microcontrollers,
- suitable for detecting trivial corruption in network packets or files,
- not patented,
- inappropriate for cryptography or hash tables.

2×32 bits and larger are possible with the non-masking algorithm using w26.

If you wonder why, just click and read the log.

The original algo/project name was "Fibonacci checksum", but that later appeared not to be the most accurate choice, because Leonardo Pisano (the real name of Fibonacci) is the name associated with the periodicity of the sequence under a modulus. All I did was add the tiny detail of the "end-around carry" (also known as ones'-complement addition), and that changed everything.

 
-o-O-0-O-o-
 

In 2012, Piano Magic released their album "Life Has Not Finished with Me Yet". One song contains a weird repeating pattern...

Glen Johnson's lyrics are often cryptic and more evocative than objective, but any geek's mind would cling to this mantra at the end:

"Add X to Y and Y to X"

This is really weird but... why? What's the point of this obscure "song" with no precise theme or clear subject? And what does it do? The last question is the easiest to answer: just follow the damned algorithm.

Here we go...

X=1, Y=0
Y+=X => 0+1=1
X+=Y => 1+1=2
Y+=X => 1+2=3
X+=Y => 2+3=5
Y+=X => 3+5=8
X+=Y => 5+8=13
Y+=X => 8+13=21
X+=Y => 13+21=34
Y+=X => 21+34=55
...

No need to go further, most of you should have recognised http://oeis.org/A000045, the famous Fibonacci sequence.

This gave me a compelling idea: modify the old Fletcher & Adler algorithms, keeping their very low footprint and minimalistic computational complexity. Both of these well-known algos use a pair of values and have a similar structure. The trick is that rearranging the data dependency graph provides the equivalent of a minimalistic polynomial checksum, because the result is fed back on itself in a more robust way than in Fletcher's algo.

At first glance, this new checksum loop's body becomes something like:

Y += ( X ^ datum[i  ] );
X += ( Y ^ datum[i+1] );
i+=2;

This loop body is totally trivial to unroll, and the corresponding digital circuit is just as trivial to design. This early version seemed to concentrate the whole checksum entropy in the last computed value, but now X and Y are both part of the signature. And the really important enhancement is the use of the carry!

For superscalar 32-bit CPUs, the following code seems to work well, though the missing carry hardware (and/or the lack of language support) requires more instructions to emulate it:

t = X + Y + C;    Y = X ^ data;
C = t >> 16;      X = t & 0xFFFF;

A more efficient variation exists which does not use a temporary variable:

C += X + Y;
Y = X ^ data;
X = C & MASK;
C = C >> WIDTH;

In this worst case, without support for a carry flag, that's 5 basic operations (not counting memory loads) that fit in 4 registers and 3 cycles to process 2 bytes. Not too bad. I'll let you deal with alignment. But is it really safe, or safer?
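Pulling these pieces together, a complete routine might look like the sketch below. This is just my assembly of the snippets above, not a reference implementation: the seed values and the final folding of X and Y into a single 32-bit word are assumptions for illustration.

#include <stdint.h>
#include <stddef.h>

#define WIDTH 16
#define MASK  0xFFFF

// Sketch of the checksum loop over 16-bit words,
// using the temporary-free variant shown above.
uint32_t peac_checksum(const uint16_t *data, size_t len)
{
    uint32_t X = 1, Y = 0, C = 0;   // seed values: an assumption
    for (size_t i = 0; i < len; i++) {
        C += X + Y;                 // add both halves plus the pending carry
        Y  = X ^ data[i];           // mix one data word in
        X  = C & MASK;              // keep the low 16 bits
        C  = C >> WIDTH;            // end-around carry for the next step
    }
    return (X << WIDTH) | Y;        // fold the pair into 32 bits
}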

The following logs will show how the idea evolves and the performance increases, through discussions about carry wrap-around, register widths, scheduling, avalanche, parallelism, orbits, structure, and escaping black holes...

 
-o-O-0-O-o-
 

Logs:
1. The variable
2. Datapath
3. Adler32 weakness
4. Easy circuit
5. Periodicity
6. A promising 32-bit checksum
7. Progress
8. Why
9. Orbital mechanics
10. Orbits, trajectories and that damned carry bit
11. Two hairy orbits
12. How to deal with black holes
13. Anteriority
14. Moonlighting as a PRNG
15. A new name
16. More orbits !
17. First image
18. Structure and extrapolations
19. Even more orbits !
20. Start and stop
21. Some theory
22. Some more theory
23. Even more theory.
24. A little enhancement
25. sdrawkcab gnioG
26. Further theorising
27. What is it ?
28. A bigger enhancement
29. Going both ways
30. Can you add states ?
31. Please, GCC !
32. _|_||_||_|______|______|____|_|___|_________|||_|__
33. Carry on with GCC
34. A good compromise
35. The new non-masking algorithm
36. Add or XOR ?
37. A 32/64-bit version
38. Closing the trajectories
39. Stats about missed crossings
40. Application : Displace LFSR in BIST
41. I need a big cluster.
42. Another theoretical rambling
43. An individual job and all the necessary processing
44....


Files:

• zoumout2.tgz - files to compute and render the 16 zoom-out videos (x-compressed-tar, 2.82 kB, 09/14/2021 at 04:19)
• Marsiglia _ add with carry RNG 1177005878.pdf - some great theory from 1991 (PDF, 1.76 MB, 09/14/2021 at 04:17)
• html_slides.tgz - files for log 73 (x-compressed-tar, 1.27 kB, 09/06/2021 at 01:18)
• zoumout.tgz - all the files (except the music) necessary to rebuild the zoom-out video (x-compressed-tar, 2.91 kB, 08/16/2021 at 10:49)
• bidir_orbit_01.c - finally, a bidirectional scanner that works (x-csrc, 5.96 kB, 08/15/2021 at 08:49)

  • Marsaglia was (almost) there

    Yann Guidon / YGDES, 3 days ago

    I am re-reading the 1991 paper from Marsaglia & Zaman, "A New Class of Random Number Generators", and it is very good.

    There are some important details that are very useful and also many significant differences, but the overlap is worth it.

    Let's start with the differences:

    • The authors focus on extremely long sequences, suitable for Monte-Carlo simulations, so the sequence lengths are on the order of 10^100 to 10^500. OTOH, I'm still limited to 2^52 or so...
    • They focus on "Lagged Fibonacci" with 20 to 50 intermediate registers to reach these extreme lengths, while I focus on only 2 registers (the special case with lags of 1 and 2).
    • The focus on extremely long sequences makes them miss the case for CRC/checksums.
    • What I call PEAC, they call "add-with-carry generators".

    That explains why they don't focus on the same configurations as I do.

    However they expose a lot of very pertinent ideas.

    • Part 2 starts with the Fibonacci sequence modulo 10 (without calling it Pisano, which might be a more recent naming convention) and this is almost exactly what is done with PEAC, except for the modulus. A lot of things match the PEAC generator, until they change the lags for much longer sequences (sadly).
    • They introduce more iteration types: subtract-with-borrow and others, including one with the potential to fill the whole statespace. I'm not interested, because that would increase the complexity of the checksum algo.
    • They show why this system works: each digit is like a new term in the expansion that approximates a rational number. And for PEAC, the ratio is φ!
    • They prove why the sequences have 2^(2n) + 2^n - 2 states (because the "lags" are 2 and 1).

    Overall they bring a serious mathematical background, though I'm not sure it would be useful for my case and I wouldn't know how to use it anyway. It would help to characterise w32 or even w64, but they seem to treat lag=2 as an exception, somehow...

    I feel somewhat vindicated, and the paper shares many similarities with the script I'm working on for a soon-to-be-released video.

    I'm still in uncharted territory, but if Marsaglia came close 30 years earlier, it is very encouraging.

  • Explaining the mathematical basis of PEAC, the easy way.

    Yann Guidon / YGDES, 09/05/2021 at 23:24

    I'm trying to make more videos to show the maths behind the algorithm. It's not easy, but I'm progressing. The script of a video is easy to write but can easily go awry; the trick seems to be to prepare all the support material first and then comment on it.
    So I've been programming some HTML/JS pages to use later.

    It seems that the easiest path to explain PEAC is to first go back to Fibonacci, then to Pisano, and then modify it to get PEAC. From there, the link with Fletcher is easy to make.

    1) http://ygdes.com/~whygee/PEAC/fibo.html shows the Fibonacci sequence. There is nothing new there, it is already very well known, but I set the stage for the next slides with two things:

    • I show the sequence of numbers in an X-Y graph, with increasingly sparse points along a diagonal line of constant slope.
    • I show the convergence of the ratio of consecutive numbers toward φ=1.6180339887498948482045868343656381177203091798057628621...

    This is a reminder that φ is an irrational number, like π (though unlike π it is algebraic, being a root of x² = x + 1). In fact, by its definition, it's the "most irrational" of all, because its continued fraction is the simplest possible (all 1s). Irrationals are very interesting because they can be the source of (almost) infinite sequences of finite values.

    From there, there is hope that a pseudo-random number generator (PRNG) could be derived from the Fibonacci sequence.

    2) http://ygdes.com/~whygee/PEAC/pisano.html is a modification of the above page, with the integer values bounded to the 0..15 range: this is the Pisano sequence modulo 16. This modulus was chosen because we computer guys love powers of 2.

    By representing the sequence of values in X-Y form, we see the dots filling only a small proportion of the whole statespace. Here, X is the current value and Y is the previous one. Except near the origin, most dots look equally spaced. Larger moduli will not fill the space any better: it is well known that for our moduli (powers of 2), the sequence loops after (3/2)×modulus steps. Indeed, the program stops after 24 cycles for a modulus of 16. This is far from a good pseudo-random number generator... Other moduli can be tested, but none yields a sequence with a length anywhere near the square of the modulus. See sequence A000045 in the OEIS, and the related Wikipedia page.
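    For the curious, this 24-step loop is easy to verify with a few lines of C (a quick sketch of the page's principle, not its actual JS code):

    unsigned x = 1, y = 0, len = 0;
    do {                                 // one Fibonacci step modulo 16
        unsigned t = (x + y) & 15;
        y = x;
        x = t;
        len++;
    } while (x != 1 || y != 0);          // until the seed (1,0) recurs
    // len is now 24, the Pisano period of 16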

    But we're almost there !

    3) Now let's modify the algorithm a little. http://ygdes.com/~whygee/PEAC/pisanoEAC.html adds a little twist:

    Every time the sum wraps around, the next computation adds 1 to the sum.

    This single trick bumps the sequence length to 135 values, which is a bit more than 16²/2=128.

    The sequence fills only half of the statespace; the other half is filled by an identical/complementary sequence. Although this is not ideal, it is "more than good enough" for many applications.
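    A quick C sketch (my own check, mirroring what pisanoEAC.html does) confirms the 135-step orbit. The warm-up steps are there because some states sit on short "tails" that merely lead into an orbit:

    #define WIDTH 4
    #define MASK  ((1u << WIDTH) - 1)
    unsigned X = 1, Y = 0, C = 0, t;
    for (int i = 0; i < 1000; i++) {     // warm-up: settle onto an orbit
        t = X + Y + C;
        Y = X;  X = t & MASK;  C = t >> WIDTH;
    }
    unsigned x0 = X, y0 = Y, c0 = C, len = 0;
    do {                                 // count steps until the state recurs
        t = X + Y + C;
        Y = X;  X = t & MASK;  C = t >> WIDTH;
        len++;
    } while (X != x0 || Y != y0 || C != c0);
    // len is now 135 for w4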

    The real problem is to find moduli whose statespace contains only 2 orbits. For the powers of two, only a few have been proved so far:
    2², 2³, 2⁴, 2¹⁰, 2¹⁶. The video below shows some of these "good" sizes, with each colour corresponding to an individual sequence, or "orbit":

    Other sizes (non-power of 2) are not practical, and sizes beyond 2²⁵ get extremely hard to compute. However the width of 16 bits is very convenient and is the basis for the PEAC checksum algorithm. More sizes are being explored.

    When a suitable width is found, it can be used as the basis for a checksum. A "good" checksum is a PRNG that is perturbed by external data. This perturbation has been studied (addition is the safest method) and the stability has been proved, so PEAC could be an enhanced replacement for the classic Fletcher checksum algorithms, which have a very similar structure and complexity but worse behaviour in many corner cases.

  • PEAC is NOT Pisano !

    Yann Guidon / YGDES, 08/29/2021 at 23:11

    The Pisano period is defined by taking the Fibonacci sequence, modulo some integer.

    One could think that PEAC differs only because the modulo is not applied in the classical way: End-Around Carry adds an offset of one to the modulo result. But it goes much further than that.

    If you compute EAC on an 8-bit byte, you get the equivalent of modulo 255 (offset by 1).
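    This equivalence is easy to check exhaustively; here is my own little test (not project code). Adding two bytes with end-around carry always lands in the same class modulo 255, with 0 and 255 acting as the same value:

    #include <assert.h>
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++) {
            unsigned s = a + b;
            s = (s & 255) + (s >> 8);          // end-around carry
            assert(s % 255 == (a + b) % 255);  // same residue mod 255
        }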

    Now, if you collapse the statespace and remove the diagonal (which is never visited), you end up with 257 values!

    So PEAC works modulo 2^n + 1!

    The w4 statespace shown above is 16 positions wide (0 to 15) and 17 high (0 to 16).

    This is explained by the formula used to draw the diagram: X is plotted horizontally (mod 16), but the vertical coordinate is Y (mod 16) + C (mod 2).

    C is not collapsed immediately, which increases the size of the statespace. This separation of C makes a real difference and explains why this object is not easy to describe with the common mathematical tools.

    From there, it sounds harder to believe that PEAC can be broken down into already existing cyclic groups, meaning that PEAC seems to "stand on its own" as a separate kind of field (?).

    I also notice that 2^n + 1 looks like a Mersenne number, except that Mersenne numbers are 2^n - 1. In the example above, w4 => 17, which is prime, and w8 => 257, which is prime too; yet w4 is ideal while w8 is not. Another counterexample is w10 => 1025, which is obviously not prime. So primality has no predictive value for the quality of the statespace...

  • Scanning with multiple processors

    Yann Guidon / YGDES, 08/29/2021 at 22:29

    Today I made a new prototype to explore and display the behaviour of the program that scans the PEAC statespace from many starting points. Have a look at http://ygdes.com/~whygee/a2_02.html

    Here are a couple of videos to show the behaviour when the number of parallel jobs is modified. In the worst case, with only two jobs, it takes a lot of time (all the steps to scan the whole orbit).


    When more jobs are possible, it becomes interesting... Here, each colour represents a job and not an orbit.

    Some arcs are extremely short, others (the rare ones) are very long, and those waste most of the time.

  • Another strategy to scan the statespace

    Yann Guidon / YGDES, 08/20/2021 at 21:57

    The bidirectional scan is nice and a great exercise... but the backwards code is slower so the speedup is below 2.

    On top of that, it does not scale to more CPUs: I could easily use 4 in SMP, and I just got my hands on a 12-thread CPU.

    A different approach is needed.

    For the "massively parallel" systems (mainly FPGA or RPi) I considered a totally parallel algorithm with jobs, with a direct association of an arc for each job. Each arc would be characterised by Y=0, a start X and end X, and a length (cycle count). This has the advantage that no intercommunication is required : the arcs can be attributed to a compute node then a compute core, with a central authority and even a hierarchy of sub-attribution. The drawback is the crazy size of the reports and the slow algorithm to sort everything at the end.

    With a common memory pool, using SMP or GPGPU, another approach is possible where the sorting/post-processing takes place almost in parallel. The first major change is that each compute node does longer jobs: they don't stop when they find Y=0, they send the data (new X and length) and continue. A job is stopped when the manager encounters a collision, and a new starting point is then allocated.

    All the steps go forward, for two reasons:

    1. This is faster (the backwards version is maybe 30% slower)
    2. This makes the collision detection safer and easier, because the manager just has to compare with the starting X of the other jobs.

    This is scalable from 2 to many more CPUs; the critical parts are the synchronisation and the scheduling. One thread has to manage all the messages from all the compute nodes, serially, to prevent race conditions, at least for small clusters.

    A sort of hash table, with a size a few times larger than the number of jobs, will hold the starting X and length of the currently running jobs, and will also manage the merging of arcs. When a job finds an X that matches the start of another job, the first job is cancelled and the other job is updated with the starting parameters of the first. It's more or less how you handle a linked list...

    This saves a lot of memory space because it is proportional to the number of compute nodes, not the size of the statespace.

    A 2-level table seems appropriate (a rough sketch follows the list):

    1. The first level is a hash table. Low collision rates are desired, so the empty space should be maximised, though it should still fit in L2 or L3. The only value to store is the corresponding thread number; for a small system with fewer than 255 CPUs, this fits in one byte. A one-megabyte table gives a collision rate of about nCPU/1M, or roughly 1/4096 in the worst case. For CUDA, a 16-bit word is required.
    2. A table of threads/jobs (starting at index 1, because 0 would mean "not used"?), where the temporary length and start are stored.
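    A rough C sketch of this 2-level structure (all names and sizes are hypothetical placeholders, sized for a small SMP machine):

    #include <stdint.h>

    #define HASH_SIZE (1u << 20)     // ~1M entries to keep the load factor low
    #define MAX_JOBS  255            // a job index fits in one byte

    typedef struct {
        uint64_t start_x;            // X at the arc's starting Y=0 crossing
        uint64_t length;             // cycle count accumulated so far
    } job_t;

    uint8_t hash_tab[HASH_SIZE];     // level 1: 0 = empty, else job index
    job_t   jobs[MAX_JOBS + 1];      // level 2: index 0 means "not used"

    static inline uint32_t hash_x(uint64_t x)
    {
        return (uint32_t)((x * 0x9E3779B97F4A7C15ull) >> 44);  // top 20 bits
    }

    When a job reports the X of a fresh Y=0 crossing, the manager hashes it: a hit on another job's starting X triggers the arc merge described above.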

    However, to better allocate the new threads, a "scoreboard" of all the visited Y=0 crossings should be maintained. A single bit per crossing is enough. For w32, this means 2^32 bits, i.e. 512MB: that's going to thrash the memory like crazy... For w26 it's 8MB, which fits in the L3 cache of the i7-10750H (leaving 4MB for other uses).

    To prevent the memory latency from stalling the computations, a decoupled/cancellable model lets the computation jobs enqueue their results in their FIFO and move on, without any other consideration.

  • To reach 10TOPS, CUDA is required.

    Yann Guidon / YGDES, 08/20/2021 at 00:28

    I broke the bank when I found a second-hand laptop with an RTX 3070 inside. The expected speedup could be around 2000! That brings the w32 computation into the 6-month ballpark.

    Already, going from a 2nd-gen, 2-core/4-thread i7-2620M to a 10th-gen i7-10750H with 6 cores that run even faster is a huge improvement. Having 12 threads at my disposal directly triples my usual throughput.

    Unfortunately the RAM didn't scale as well: I'm still stuck at 16GB, but it is claimed to be upgradeable well beyond that. When I have money. And DDR4 should be a bit faster than DDR3.

    The dual SSD in RAID-0 is a nice and welcome touch. This will help a LOT for FPGA synthesis. The 12MB L3 cache will also help on that application.

    The king of the show, though, is the onboard RTX 3070 with 8GB of dedicated GDDR6. Its power totally dwarfs everything else in my proximity, in a sleek laptop format...

    The 5888 CUDA cores run at 1.7GHz, letting me expect a peak of 10 tera-adds per second, which is about 2^43/s. For comparison, the bidirectional scan can compute 630M loops per second, on the order of 2^29. That's a jump of 2^14. I have no working code at the moment, but it gives a good idea of the insane speedup to expect.

    From there, I don't see how I could get even better performance: to merely double it, I'd have to buy another similar system, which is not interesting. Maybe rent a bitcoin mining rig or something like that. But for now, this totally blows away the other approaches, and I'm going to shelve the FPGA and RPi projects. The PolarFire FPGA is at least 10 times slower and could swallow w26 in maybe 6 hours, while the RTX should finish it in 20 to 30 minutes.

    The most significant factor for performance will be the efficiency of the code. I'll have to dig into the CUDA specs and tools. That will make the difference between 2 and 3 months of computation.

    The problem will be the operating system. I don't know if/how I'll be able to dual-boot it. I need the W10 to run the FPGA/EDA tools but I also need a bare Linux for the rest.

    sigh...

  • Comparing the entropy with LFSR

    Yann Guidon / YGDES, 08/17/2021 at 11:35

    A strange consideration occurred to me: an LFSR is characterised by its (maximal) period of (2^n)-1, for n bits of state. The LFSR is meant to output one bit per cycle, so it generates a stream of (2^n)-1 bits. So far, so good.

    The PEAC algo uses 2n+1 bits of state to generate a sequence of 2^(2n-1) + 2^(n-1) - 1 numbers. Each number has n bits, so one orbit yields n×(2^(2n-1) + 2^(n-1) - 1) bits; for n=16 that is 2^35 + 2^19 - 16 bits, well over 2^(2n).

    In practice,

    • CRC32 uses 32 bits to generate 4,294,967,295 bits per unique sequence.
    • PEAC16x2 uses 33 bits to generate one of two sequences of 2,147,516,415 numbers. But these numbers are 16 bits wide, so this is equivalent to 34,360,262,640 bits, right?

    Well, this is not fair, because the 32-bit LFSR can also be used to generate a sequence of period 4,294,967,295×32 = 137,438,953,440 bits by chunking its output into packets of 32. However, this sequence would simply be the basic sequence replicated 32 times with a different alignment. There is no extra data or entropy; it's simply a presentation trick, which can be useful in some very uncritical cases.

    PEAC behaves differently, as it does not seem to work in GF(2). The log "63. PEAC seen as a collection of 2-tap LFSRs" has built a bridge with GF(2), but I suspect it works in something like GF(65535), even though 65535 is not prime (nor a prime power, so strictly speaking it would be a ring rather than a Galois field). Can there be other useful applications for such structures with a non-prime modulus?

    Anyway, PEAC16x2 does not need a re-framing trick to generate its 34,360,262,640 bits, though I have not yet checked whether there are internal repetitions.

    Another detail to remember is that PEAC16x2 does have 2 orbits, and the remaining states directly lead to one of them, so there is some untapped potential.

  • At last !

    Yann Guidon / YGDES, 08/15/2021 at 11:23

    Finally, I made it work! You can download and run bidir_orbit_01.c and configure it to cover the orbit of your choice. It will use 2 CPUs/cores/threads to step forward and backwards at the same time.

    This time, I did my best to keep it simple. I implemented only one FIFO, but it would still miss some joins. What is the solution? Since I coded the FIFO from scratch, I can modify its behaviour and keep the last element in the queue. This ensures that all new values are compared with the last one!

    All the joins are correctly detected now, so I can finally think of running it on w26, once I have added a progress bar and other resume features. And this time I'll run it on a different computer to prevent accidental interruptions.

    It's about time w26 got validated!!!

  • Swap, parity etc.

    Yann Guidon / YGDES, 08/12/2021 at 04:59

    Loop unrolling should provide a significant speedup, at least for Pentium-class computers (superscalar and such). The big question so far is how many steps a loop should contain and, in particular (to echo the end of the last log), whether this number should be odd or even.

    For more advanced core families, the bottleneck will be the memory bandwidth, along with the cast of the memory words to 16 bits. Even this can be mitigated on a 32-bit platform by doing the chunking ourselves.

    Let's take the basic element of the unrolled loop, which computes X and Y:

    // odd
    Y += X;
    X += *(buffer+n);
    // even
    X += Y;
    Y += *(buffer+n+1);
    

    Both pairs of instructions can execute in parallel without hazard, but this relies on x86's ability to fetch from memory and then compute in the same instruction. This hits a limitation, however: x86 can't manage both 32-bit and 16-bit data in the same instruction easily or fast (damned opcode prefixes!!!). In fact, I know of no architecture that can do this natively, without some sort of adaptation instruction.

    Even the Pentium has this limitation, where the dreaded 0x66 and 0x67 prefixes inhibit superscalar execution. In fact, the mixing step mixes 2 sizes in one line, adding a uint16_t to a uint32_t accumulator. That's a ticking bomb...

    Fortunately this is easy to solve with the use of a temporary memory-caching register:

    uint32_t M = *(uint32_t *)(buffer+n);
    // odd
    Y += X;
    X += (M & 0xFFFF);
    M >>= 16;
    // even
    X += Y;
    Y += M;
    

    Here I assume a Y2K+ x86 or comparable processor core. There are more instructions, but they merely make the previous version explicit, splitting it into actual operations, and they actually speed things up on 32-bit cores by removing the weird size mixing.

    This could even be extended to 64 bits, loading four 16-bit words at a time and shuffling them ourselves: the memory interface gets less pressure, but it moves to the ALUs (a sketch follows the opcode breakdown below).

    1a: M = *(buffer+n);  => LOAD [basereg + cstoffset]
    1b: Y = Y + X;        => ADD
    2a: t = M & 0xFFFF;   => AND or equivalent
    2b: M = M >> 16       => SHR or equivalent
    3a: X = X + t;        => ADD
    4a: X = X + Y;        => ADD
    4b: Y = Y + M;        => ADD
    

    That's 7 basic opcodes to process 4 bytes, or almost 2 opcodes per byte. Due to the data dependencies, at least 4 cycles are required, so it's 1 cycle per byte (ideally). To actually sustain about 2 opcodes per byte in practice, loop overhead included, the above block should be unrolled at least 7 times.
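    The 64-bit extension mentioned earlier could look like this (my extrapolation, not the project's code); renormalisation is still postponed, exactly as in the 32-bit block:

    uint64_t M = *(uint64_t *)(buffer+n);  // four 16-bit words per load
    // odd
    Y += X;
    X += (M & 0xFFFF);
    M >>= 16;
    // even
    X += Y;
    Y += (M & 0xFFFF);
    M >>= 16;
    // odd
    Y += X;
    X += (M & 0xFFFF);
    M >>= 16;
    // even
    X += Y;
    Y += M;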

    This now creates a new conflict: if the "swap" idea is used at the end of the loop to renormalise both X and Y in turn, the extra word will misalign the pointer. The performance penalty could be higher than the cost of renormalising both X and Y.

    Anyway, what is this "swap" idea? It's simply a way to renormalise only one variable per loop, just as the reference code already does, though not explicitly. Each loop swaps X and Y after renormalising only one of them, so adding some more blocks like the above would be pretty efficient.

    Now, this makes the loop size odd, as discussed earlier. This is not desirable, but if the load alignment constraints are taken into account, the load/store unit might slow down every other unrolled iteration. I suspect most modern cores handle unaligned loads gracefully these days, but I'm not willing to risk the potential penalty on simple cores, for backwards compatibility or any other reason.

    And now I see that the log's title is out of sync with the contents.

  • Modular equivalence

    Yann Guidon / YGDES, 08/09/2021 at 09:43

    The last log, "64. The new version of PEAC16x2", has shown one "reference version" with EAC applied to both registers at every cycle, which looks quite inefficient and breaks the promise of "high performance". That slow version, however, serves to establish a baseline for the equivalence of hardware and software implementations: the slow "baseline" version is equivalent to the previous, faster version, which is harder to implement with classic circuits.

    OTOH, thanks to modular properties, we know that ((x mod m) + y) mod m = (x + y) mod m, so many "renormalisation" operations can be delayed and postponed. This is one of the known optimisations of the Adler and Fletcher algos. It is also shown in Pisano16x2_v5.c with 3 versions of the same algorithm.
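    As a tiny illustration of that identity (my own example, separate from Pisano16x2_v5.c): for any buffer of uint16_t words, summing into a wide accumulator and reducing once at the end gives the same residue as reducing at every step.

    const uint32_t m = 65535;
    uint32_t eager = 0;
    uint64_t lazy  = 0;
    for (size_t i = 0; i < len; i++) {
        eager = (eager + data[i]) % m;   // renormalise at every step
        lazy += data[i];                 // postpone the renormalisation
    }
    // eager == lazy % m, whatever the data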

    And during the unrolling, something new happened: one of the registers' values explodes for lack of renormalisation. The result remains good as long as the computation is short, but it clearly does not behave as expected.

    That's when I realised that the original algorithm, step by step, renormalises X and Y in turn, ensuring that the results never overflow. Now, with the steps duplicated, the variable swaps cancel out and only one variable gets renormalised.

    • The basic solution is to renormalise both variables at the end of each loop, though the performance would be bad, reducing the benefits of unrolling.
    • The next alluring solution is to unroll by an odd number, like 3, 5 or 7 steps per loop. 3, 7 and 15 are Mersenne numbers which are covered by http://homepage.divms.uiowa.edu/~jones/bcd/mod.shtml but the modulo computations are still chunky...
    • A power of two is certainly desirable, so the remaining solution is to be careful with the variable renaming. Some graph colouring is now necessary. But is there a solution?

    Aaaand... there is no solution: despite using 3 variables in the loop, only 2 perform the actual computations. The only way to handle this is an odd number of steps per loop. I'll have to use tricks from  https://homepage.divms.uiowa.edu/%7Ejones/bcd/divide.html...



Discussions

Yann Guidon / YGDES wrote 5 days ago:

George Marsaglia and Arif Zaman got quite a lot figured out in 1991:
https://projecteuclid.org/download/pdf_1/euclid.aoap/1177005878

Why didn't this catch on? Maybe because they focused on extremely large sequences with tens of lag registers, and didn't look closely enough at just a pair of registers. They also didn't look at the link with checksums.


Yann Guidon / YGDES wrote 09/08/2021 at 07:28:

Funny!

https://en.wikipedia.org/wiki/Lagged_Fibonacci_generator

Toward a universal random number generator, G.Marsaglia, A.Zaman


Yann Guidon / YGDES wrote 08/05/2021 at 23:24:

https://www.researchgate.net/profile/Michael_Greenwald/publication/3334567_Performance_of_checksums_and_CRCs_over_real_data/links/53f4b15f0cf2888a749113c5/Performance-of-checksums-and-CRCs-over-real-data.pdf

"In practice, TCP trailer sums outperform even Fletcher header sums."

Ouch.


Yann Guidon / YGDES wrote 08/08/2021 at 00:51:

"Ones complement (with EAC) Fletcher, however, has a weakness that sometimes offsets its probabilistic advantage: since bytes containing either 255 or 0 are considered identical by the checksum, certain common pathological cases cause total failure of Fletcher-255."

Re-ouch.


Yann Guidon / YGDES wrote 08/05/2021 at 23:09:

http://users-tima.imag.fr/cdsi/guyot/Cours/Oparithm/english/Modulo.htm and others say that special adders exist to compute mod 2^n-1.

But this is not necessary because the carry is explicitly integrated in the next cycle.


Yann Guidon / YGDES wrote 08/05/2021 at 22:20:

https://www.cs.auckland.ac.nz/compsci314s1t/resources/Checksums.pdf is a nice beginner-level overview of error-detection basics. Polynomial checks are mentioned, as well as ones' complement / end-around carry, and Fletcher is well explained, but there is no sign of Fibonacci.


Yann Guidon / YGDES wrote 05/24/2021 at 03:52:

Oh, nice!

https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html#Integer-Overflow-Builtins

