
PEAC: Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Another purely additive, non-cryptographic, fast and minimalistic checksum.

Computes a 32-bit checksum with a pair of 16-bit registers,
- very small, using only ADD,
- without the flaws of Adler32 and Fletcher, yet the same size and complexity,
- almost as good as CRC32, which it may replace when the memory footprint is critical,
- very easy to implement in digital circuits as well as microcontrollers,
- suitable for the detection of trivial corruption in network packets or files,
- not patented,
- inappropriate for cryptography or hash tables.

2×32 bits and larger are possible with the non-masking algorithm using w26.

If you wonder why, just click and read the log.

The original algo/project name was "Fibonacci checksum", but it later appeared that this was not the most accurate choice, because Leonardo Pisano (the real name of Fibonacci) is the name associated with the periodicity of the sequence under a modulus. All I did was add the tiny detail of the "end-around carry", as in one's-complement addition, which changed everything.

 
-o-O-0-O-o-
 

In 2012, Piano Magic released their album "Life Has Not Finished with Me Yet". One song contains a weird repeating pattern...

Glen Johnson's lyrics are often cryptic and more evocative than objective, but any geek's mind would latch onto this mantra at the end:

"Add X to Y and Y to X"

This is really weird but... why? What's the point of this obscure "song" with no precise theme or clear subject? And what does it do? This last question is the most easily answered: just follow the damned algorithm.

Here we go...

X=1, Y=0
Y+=X => 0+1=1
X+=Y => 1+1=2
Y+=X => 1+2=3
X+=Y => 2+3=5
Y+=X => 3+5=8
X+=Y => 5+8=13
Y+=X => 8+13=21
X+=Y => 13+21=34
Y+=X => 21+34=55
...

No need to go further, most of you should have recognised http://oeis.org/A000045, the famous Fibonacci sequence.

This gave me a compelling idea to modify the old Fletcher & Adler algorithms, keeping their very low footprint and minimalistic computational complexity. Both of these well known algos use a pair of values and have a similar structure. The trick is that rearranging the data dependency graph provides the equivalent of a minimalistic polynomial checksum, because the result is fed back on itself, in a more robust way than Fletcher's algo.

At first glance, this new checksum loop's body becomes something like :

Y += ( X ^ datum[i  ] );
X += ( Y ^ datum[i+1] );
i+=2;

This loop body is totally trivial to unroll, and the corresponding digital circuit is just as trivial to design. This early version seemed to contain the whole checksum entropy in the last computed value, but now X and Y are both part of the signature. And the really important enhancement is the use of the carry!

For superscalar 32-bit CPUs, the following code seems to work well, though the missing carry hardware (and/or the lack of language support for it) requires more instructions to emulate:

t = X + Y + C;    Y = X + data;
C = t >> 16;      X = t & 0xFFFF;

A more efficient variation exists which does not use a temporary variable:

C += X + Y;      // accumulate both halves plus the pending carry
Y = X + data;    // inject the next data word
X = C & MASK;    // keep the low WIDTH bits (MASK = 2^WIDTH - 1, e.g. 0xFFFF)
C = C >> WIDTH;  // the carry wraps around into the next iteration

In this worst case, without support of a carry flag, that's 5 basic operations (not counting memory loads) that fit in 4 registers and 3 cycles, to process 2 bytes. Not too bad. I'll let you deal with alignment. But is it really safe or safer?
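
To make the fragments above concrete, here is a minimal, self-contained C sketch of a checksum routine built from that loop body; the function name, the seed values and the final folding of X and Y into one 32-bit word are illustrative assumptions, not the project's reference code:

#include <stdint.h>
#include <stddef.h>

/* Sketch only: 16-bit PEAC-style checksum with an explicit end-around carry.
   Seeds and the final X/Y folding are illustrative choices. */
uint32_t peac16_sketch(const uint16_t *data, size_t nwords)
{
    uint32_t X = 1, Y = 0, C = 0;       /* X, Y: 16-bit halves, C: pending carry */

    for (size_t i = 0; i < nwords; i++) {
        C += X + Y;                     /* sum both halves with the pending carry */
        Y  = X + data[i];               /* inject one 16-bit word                 */
        X  = C & 0xFFFF;                /* keep the low 16 bits                   */
        C >>= 16;                       /* wrap the carry into the next iteration */
    }
    return (X << 16) | (Y & 0xFFFF);    /* combine the X,Y pair into a signature  */
}

This follows the 5-operation loop above, with MASK = 0xFFFF and WIDTH = 16.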

The following logs will show how the idea evolves and the performance increases, through discussions about carry wrap-around, register widths, scheduling, avalanche, parallelism, orbits, structure, and escaping black holes...

 
-o-O-0-O-o-
 

Logs:
1. The variable
2. Datapath
3. Adler32 weakness
4. Easy circuit
5. Periodicity
6. A promising 32-bit checksum
7. Progress
8. Why
9. Orbital mechanics
10. Orbits, trajectories and that damned carry bit
11. Two hairy orbits
12. How to deal with black holes
13. Anteriority
14. Moonlighting as a PRNG
15. A new name
16. More orbits !
17. First image
18. Structure and extrapolations
19. Even more orbits !
20. Start and stop
21. Some theory
22. Some more theory
23. Even more theory.
24. A little enhancement
25. sdrawkcab gnioG
26. Further theorising
27. What is it ?
28. A bigger enhancement
29. Going both ways
30. Can you add states ?
31. Please, GCC !
32. _|_||_||_|______|______|____|_|___|_________|||_|__
33. Carry on with GCC
34. A good compromise
35. The new non-masking algorithm
36. Add or XOR ?
37. A 32/64-bit version
38. Closing the trajectories
39. Stats about missed crossings
40. Application : Displace LFSR in BIST
41. I need a big cluster.
42. Another theoretical rambling
43. An individual job and all the necessary processing
44....

Read more »

Files:

- fusion_20220109.tgz: contains dump7ck, pscan_10 and fusion, good enough to test up to w26. (x-compressed-tar, 7.13 kB, 01/09/2022 at 13:34)
- log.22.p7k.bz2: p7k log of w22, compressed with bzip2. (x-bzip, 37.18 MB, 01/08/2022 at 00:30)
- logs_p7k_03-21.tbz: p7k logs of w3 to w21, compressed with bzip2. (x-bzip-compressed-tar, 34.55 MB, 01/08/2022 at 00:27)
- dump7ck.tgz: dumps a binary p7k file into a human-readable ASCII stream to stdout. (application/x-compressed-tar, 2.06 kB, 01/03/2022 at 09:30)
- pscan_20211229.tgz: enhanced version of the threaded scanner. (x-compressed-tar, 5.09 kB, 12/29/2021 at 08:39)

View all 51 files

  • There could be a way... or not.

    Yann Guidon / YGDES, a day ago, 0 comments

    Several logs, such as 106. More mathematics, have tried to "crack" the problem with "classic maths", using equations and such. I'm not gifted at this approach but I do my best, and so far it has not been fruitful. The big problem is that there is no nice and tidy equation, as in the Fibonacci and Pisano cases. There is always this increment that comes up at seemingly random intervals, though with a probability of 1/2.

    Using the probability, it might be possible to estimate or approximate a result, though it falls apart "very quickly" as the errors accumulate. So I can't find one equation for the value of the Nth step.

    However I just realised that there could be a way to short-circuit a lot of uncertainties. The main goal so far is to determine IF a given width provides a pair of orbits of maximal length. And we happen to know the main characteristics of such an orbit, in particular the length (L). So we can define a cyclic equation, given a hypothetical function PEAC(n) that returns the value of the nth step:

    PEAC(n) = PEAC(n+L)

    This should be true for all values of n from 0 to L-1. Let's now borrow the sampling idea from prime number folks: given a sufficiently large number of random n (10s ? 100s ?), we can have a sufficiently high confidence that PEAC(n) is cyclic with a period of L (or one of its multiples, as might be the case for w6 and w24) without scanning it completely...

    The real problem is how to compute PEAC(n). It is derived from the Pisano formula, with the random increment. The trick would be to be sure of the number of increments during a full orbit. From my early explorations, it seems that the fraction of steps that cross 0 is not exactly 1/2, but it converges soon enough. So at least we would have a sufficiently close estimate, and sampling more points / n would increase the confidence.

    Another problem is that the Binet formula contains powers of φ, which is irrational, so computing its nth power increases the required precision to an unrealistic point, where computation and storage of the value exceed reason. Buuuuut.... φ is irrational, not transcendental, and contains sqrt(5). I just realised that all its even powers make a perfectly nice integer ! It is not a significant limitation to only consider even powers so yeah, I am relieved: no infinite precision arithmetic...

    Exponentiation in modulo arithmetic is well known (again, it's used all the time in the prime number domain, for RSA computations for example). So there might be a way after all, even though I have not collected all the clues. Yet.

    _____________________________________

    Oh I might have overlooked a little detail. Let's look at the known equations:

    F(n) = ( φ^n - (-φ)^(-n) ) / √5
         = ( ((1+√5)/2)^n  - ((1-√5)/2)^n ) / √5
         = ( (1+√5)^n − (1−√5)^n ) / (√5 * 2^n)

    and

    Pis(n,m) = F(n) mod m
             = (( φ^n - (-φ)^(-n) ) / √5 ) mod m

    That little pesky irrational divisor breaks my plans again. But there might be a way again, if we work modulo m×√5... or something like that.

    Maybe the pattern of the carries can be deduced from the Continued fraction expansion...

    OK I'm so bad at mathing that I overlooked that (1+√5)^n would still be irrational, even for even powers, simply because of the added 1, and (a+b)²=a²+2ab+b². The 2ab term is what keeps the sum from ever being an integer...

    Damnit. Another dead end.

  • What about w28?

    Yann Guidon / YGDES, 2 days ago, 0 comments

    w26 was swift and w27 was a walk in the park. w28 should be similar, but longer... Is it useful however ?

    In itself, I don't expect anything from w28, just as with w27 and w29. It's not the result that counts, but how it is obtained, since I develop the tools further each time and can better extrapolate and refine the ideas. Oh and if w28 has some unexpected property, it's still good to know. I have the hunch that w28 is not remarkable but it's important to be sure, as well.

    What can I predict ? From what I have seen, the fusion should take about 12-15 minutes, the complete log file will be almost 7GB, and the scan will take about 40 days using the two hexacore processors (24 threads in parallel overall).

    What can be done to speed this up again ? Well, a lot has already been done:

    • crunched the theory to the point where only one half of the steps need to be computed, and going backwards was not efficient enough (and breaks the assertions of fusion).
    • optimised the scheduling of the instructions (though even that is quite fluctuating)
    • parallelised the heck out of it, using both multiple computers and multiple threads, thanks to the "separation" of the computations.

    What's left ? AVX2 and then CUDA. The first will bring a 4× to 8× speedup, and the second will bring the last boost.

    Considering that w28 will take about 40 days as is, and the SIMD/AVX2 version will bring 4x, w28 would take only about 10 days or a week if all goes well. This buys me 30 days to convert my code to break even. There are some challenges to adapt the existing scalar algorithm to SIMD, so the intermediary step of "multi-scalar" lets me get the housekeeping algos right.

    I have no doubt that AVX2 will be easy to implement. w28, w29 and w30 are reachable. But CUDA is a totally different beast... I'm watching tutorials on YouTube for now.

  • w27: the results

    Yann Guidon / YGDES, 4 days ago, 0 comments

    In the previous log, Aiming for w27, I told how the scanning was performed. 3489660928 bytes (3.5GB) were generated by two 12-thread computers in scalar mode... Here are the results!

    Let's start with the basics: the sum.

    $ ./dump7ck -S log.27.p7k
    opened log.27.p7k
    Sum = 9007199321849850
    

    We would expect this value:

    $ echo $(( 2**((2*27)-1) + 2**(27-1) -1 ))
    9007199321849855

    Pretty close ! There is a very, very small difference, but it is there nonetheless.

    Now let's fuse all the semi-arcs...

     LOOP FOUND: len=545890867990900 arcs=8134767 at 134217705
     LOOP FOUND: len=545890867990900 arcs=8136577 at 134217724
     LOOP FOUND: len=272945433995450 arcs=4068239 at 134217710
     LOOP FOUND: len=545890867990900 arcs=8134487 at 134217718
     LOOP FOUND: len=545890867990900 arcs=8130516 at 134217723
     LOOP FOUND: len=545890867990900 arcs=8130560 at 134217725
     LOOP FOUND: len=545890867990900 arcs=8138879 at 134217726
     LOOP FOUND: len=545890867990900 arcs=8135180 at 134217632
     LOOP FOUND: len=272945433995450 arcs=4066985 at 134217666
     LOOP FOUND: len=545890867990900 arcs=8133247 at 134217720
     LOOP FOUND: len=545890867990900 arcs=8136918 at 134217648
     LOOP FOUND: len=545890867990900 arcs=8133306 at 134217711
     LOOP FOUND: len=272945433995450 arcs=4066387 at 134217721
     LOOP FOUND: len=545890867990900 arcs=8134107 at 134217659
     LOOP FOUND: len=545890867990900 arcs=8134882 at 134217719
     LOOP FOUND: len=545890867990900 arcs=8132133 at 134217727
     LOOP FOUND: len=545890867990900 arcs=8136446 at 134217695
     LOOP FOUND: len=545890867990900 arcs=8134112 at 134217709
     65 passes
    789.18user 55.77system 6:08.18elapsed 229%CPU
    

    There are 3 pairs of orbits  of length 272945433995450 and 15 pairs of orbits of length 545890867990900.

    And that's it. w27 is almost well behaved but there are 5 or so states that have not been reached, probably forming a very short pair of orbits... It's not recommended since w26 behaves better.

    And this heavy run took only about 6 minutes. This is because I used a pair of SSDs over USB3, one for reading and the other for writing, and vice versa. The bulk of the time was taken by sort, while the fusion program was clearly I/O bound. Maybe, later, I'll try with 3 SSDs because fusion reads 2 streams...

  • TODO

    Yann Guidon / YGDES, 6 days ago, 0 comments

    While w27 is about to complete, the same question arises again: what to do next ?

    The most important thing to do is to transform the code to "multi-scalar" and then SIMD (SSE3, AVX2), as a way to prepare the CUDA code. This is slowly progressing though because other practical aspects become clear as I ramp up the computational power.

    The next big thing is a way to seamlessly manage all the computation resources. A networked dispatcher would be good since I already use 3 computers and must allocate the ranges by hand. I have a bunch of Raspberry Pi boards, that still need to be clustered (#Clunky McCluster) and it will be a mess to manage them or let them participate. So here is #CHOMP. But efficient distribution of the workload is not a really critical issue.

    A third development axis is to make a better fusion program. The existing one has a constraint: it can only work on a set of semi-arcs that is already complete. Running the fusion program on intermediate, incomplete sets could help soon, as I get partial contents from various computers. A partial fusion would reduce a bit the size of the data before amalgamating the various parts from different computers. The script must be changed too, because it stops only when the output file is empty, but partial fusion will give a file with an unchanged length. Which is equivalent to the existing case when the file remains empty...

    This last development axis seems to be the simplest/fastest to complete, in maybe a few days, and can help a bit during very large computations, since partial fusion can be run progressively.

  • Aiming for w27

    Yann Guidon / YGDES, 01/13/2022 at 21:42, 0 comments

    Now that w26 is solved, I enter the vast void of the gap until w32. w27 is not exciting in itself but it's a good way to test the tools and methods, right ? So this time I'm trying to distribute the workload. I have the 12-thread i7 and the older 8-thread i7 that could offload some work...

    The .p7k files are trivially concatenated and the fusion program might even gracefully drop any duplicate semi-arcs, and I'll see if I can run the fusion program on a partial output (I don't expect much but it will be a small speedup in the end).

    At this moment, I'm benchmarking the i7-10750H with a fraction of the total work: 2^22, or 4 million semi-arcs. w27 has 128 million, so the full scan is 32 times this benchmark, which leaves 31 times more work to do. I want to know in advance how much of the dataset I will dedicate to each computer. Alone, the 10750H would take about two weeks, so maybe we could shorten this by one third, or 5 days ?

    The i7-10750H is running a stock Fedora 35 and the governor is "powersave", limiting the clock to 3.8GHz. So I'm trying to see how to configure this relatively recent feature of Linux, on a pretty recent platform... Can I get to 4.8GHz ? It is advertised as being able to reach 5GHz on a single thread though I don't know how the TDP is managed on this particular laptop.

    So I installed the kernel-tools package and can now run cpupower.

    cpupower frequency-set --governor performance

    does nothing,  the clock is still between 3.7 and 3.8GHz...

    • the i7-7600U runs at about 33 semi-arcs per second
    • the i7-10750H runs at approx. 120-130 semi-arcs per second

    AAAaaand this is the moment where I realise that the little Thinkpad has only a dual-core, 4-threaded processor... So despite the same clock speed, it can only provide 1/3 of the big boy's speed. Sigh. Anyway, overall this makes 16 threads running in parallel, with the puny Thinkpad getting 1/4th of the overall work...

    Maybe by the time this run is over, I'll have developed a SIMD version, which promises 4 to 8× speedup.

    _____________________________________________________

    The 10750H finished 4Mi semi-arcs (4194304) in 10h30, while the 7600U is only about 1/4 of the way through... I'll now schedule partial runs of one or two days on each computer, as each one finishes.

    ____________________________________________________

    The i7-7600U took 34h45m to complete the 4Mi semi-arcs (4194304), which is 3.3× slower... while the 10750H took 62h52m to compute 24Mi (25165824) semi-arcs.

    Since I just got another hexacore, there is no point in using the 7600U again. I'll run the 10750H again on 32Mi more semi-arcs while I configure the Ryzen 5600H to run the last half of the computations.

    ____________________________________________________

    The Ryzen is operational. On w27 it runs about 7140 semi-arcs per minute... I'll see how fast it can complete the 33554432 semi-arcs.

    ____________________________________________________

    w27 is completely scanned ! I will not give run times because they are totally messed up (I started some programs in parallel to attempt to make them all finish at the same time) and that's why I should try to develop #CHOMP. I have 6 partial files that are easy to concatenate.

    The Ryzen had a first run of 33554432 semi-arcs that took 113h at 882%CPU because I started another run in parallel : 16777216 semi-arcs in 71h @700%CPU.

    Meanwhile the 10750H completed 33554432 semi-arcs in 85h at 1172%CPU then 16777216 semi-arcs in 44h @1148%CPU (non overlapping then).

    And I wasn't even there when they both finished...

  • More mathematics

    Yann Guidon / YGDES, 01/13/2022 at 04:43, 0 comments

    Some astute observers might still wonder why I insist on taking the "long road", with explicit computation of all terms of the series, calculating all the steps of every (semi-)arc. Well, to the best of my ability and knowledge, it is NOT possible to easily predict the length of an orbit, to name just one of the many limitations I have found. And it is important to know why, since evaluating w32 is still a huge investment in time, resources and development.

    Well, PEAC is a modified Pisano cyclic series, which itself is a Fibonacci series under a given Modulo.

    From there, one would think it easy to predict the behaviour of such a system, because Fibonacci is simply an exponential with a growth rate of φ. But φ is irrational, so there can never be an exact value. The first main problem with predicting PEAC is the uncertainty between the integer/discrete values and the irrational-based prediction.

    The Binet formula gives the value of the Nth Fibonacci number. It is great when you want to calculate this but is not easily adapted when we want to find a precise, different value, given a set of pretty stringent conditions.

    There would be 2 adaptations to do :

    • adapt Binet to the Pisano situation, where values are mod 2^w
    • then adapt this to consider the carries.

    And that's quite a mess... I don't have the maths tools to solve that kind of nasty equations.

    However I can do some pen&papering where I observe the behaviour of small systems. As the system size increases, estimates get harder and harder, to the point of not being easily decidable.

    There are several types of questions we may want to answer:

    • how many orbits does one system contain ?
    • what is the length of a given orbit ?
    • how to predict the Nth value of one orbit ?
    • how long is a given arc, or semi-arc ?

    Let's start again with the very basics: the Fibonacci series is characterised by a growth factor of φ. The rate is thus φ raised to the power of the number of steps we want to walk. So it goes roughly like this:

    F(n) = φ^n

    More precisely, let's look at Binet's formula:

    F(n) = ( φ^n - (-φ)^(-n) ) / √5
         = ( ((1+√5)/2)^n  - ((1-√5)/2)^n ) / √5
         = ( (1+√5)^n − (1−√5)^n ) / (√5 * 2^n)

    So far, so good. It's a well known matter.
    The formula has been applied to real numbers and even complex numbers.

    You could try to solve for F(n)=x for example.

    Going from Fibonacci to Pisano is less documented, however. Given a modulus m, this would be

    Pis(n,m) = F(n) mod m
             = (( φ^n - (-φ)^(-n) ) / √5 ) mod m

    https://en.wikipedia.org/wiki/Pisano_period explains the period of the corresponding sequence using "general linear group GL2(ℤn) of invertible 2 by 2 matrices in the finite ring ℤn of integers modulo n." Ouch.

    Given the period p=π(m), we get :

    Pis(n,m) = Pis(n+p,m)

    The system is turned from a linear/exponential to a discrete cyclic series. So we can reduce the study to the range of n = 0 to p-1

    But how do we get π(m) ? What is the formula ? Obviously we are concerned with computer "words" of k bits, i.e. moduli that are powers of 2, and there is this special case:

    If m = 2^k, then π(m) = 3×2^(k–1) = (3×2^k)/2 = 3m/2

    This is well explained in the Wikipedia article and easy to test.
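
    As a quick sanity check, this special case can be brute-forced with a throwaway C snippet (mine, not one of the project's tools), by walking the Fibonacci pair modulo m until it returns to its initial state:

    #include <stdio.h>
    #include <stdint.h>

    /* Brute-force pi(m): iterate the Fibonacci pair (a,b) mod m until it
       comes back to the initial state (0,1). Throwaway sketch, only meant
       to check pi(2^k) = 3*2^(k-1) for small k. */
    static uint64_t pisano_period(uint64_t m)
    {
        uint64_t a = 0, b = 1, p = 0;
        do {
            uint64_t t = (a + b) % m;
            a = b;
            b = t;
            p++;
        } while (a != 0 || b != 1);
        return p;
    }

    int main(void)
    {
        for (int k = 1; k <= 16; k++) {
            uint64_t m = 1ULL << k;
            printf("pi(2^%d) = %llu (expected %llu)\n", k,
                   (unsigned long long)pisano_period(m),
                   (unsigned long long)(3ULL << (k - 1)));
        }
        return 0;
    }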

    By definition, the Pisano series starts with the Fibonacci numbers themselves, until one of them reaches m. The last numbers in the sequence are also easy to compute, by the way. And between these extremes is where the headache starts.

    Binet's formula is less powerful here. You can "compute" the n'th value of the series but how can you know, for example, how many times a given value appears, or where ?

    For example for the Pisano series of 2^4,

    • the values 4, 6, 10, 12 and 14 do not appear
    • but 3, 7, 11 and 15 appear 1×
    • 0 and 8 appear 2×
    • 1, 5, 9 and 13 appear...
    Read more »

  • Fusion 4: it works

    Yann Guidon / YGDES, 01/09/2022 at 10:58, 0 comments

    So here is the (slightly edited) log of the first run with the new program:

    $ cat pass.sh 
    #!/bin/bash
    
    # arguments :
    # $1 = filename of source .p7k
    # $2 = filename of destination .p7k
    
    TMP_SRC="tmp_src.p7k"
    TMP_DST="tmp_dst.p7k"
    LC_ALL=C sort -z -k 1.1,1.5  $1 > $TMP_SRC
    LC_ALL=C sort -z -k 1.6,1.10 $1 > $TMP_DST
    ./fusion U $TMP_SRC $TMP_DST $2
    
    
    $ cat fuse03.sh
    ./pass.sh log.03.p7k log.03.pass1.p7k
    ./pass.sh log.03.pass1.p7k log.03.pass2.p7k
    ./pass.sh log.03.pass2.p7k log.03.pass3.p7k
    ./pass.sh log.03.pass3.p7k log.03.pass4.p7k
    ./pass.sh log.03.pass4.p7k log.03.pass5.p7k
    
    $ . fuse03.sh
    index=0 :  fuse
    index=1 :  copy src
    index=2 :  fuse
    index=3 :  copy dst
    index=4 :  fuse
    index=5 :  nothing
    index=6 :  nothing
    index=7 :  nothing
     writing 5  
    
    index=1 :  fuse
    index=3 :  fuse
    index=5 :  copy dst
    index=6 :  nothing
    index=7 :  nothing
     writing 3  
    
    index=5 :  fuse
    index=6 :  copy dst
    index=7 :  nothing
     writing 2  
    
    index=6 :  fuse
    index=7 :  nothing
     writing 1  
    
    index=7 :  LOOP FOUND: len=35 arcs=8 at 7
     writing 0  
    

    I have to run more tests but it seems to work very nicely.

    A better version:

    $ cp log.03.p7k tmp.p7k
    
    $ while [ -s tmp.p7k ]; do
    >   ls -al tmp.p7k
    >   ./pass.sh tmp.p7k tmp.p7k
    > done
     208  9 janv. 12:16 tmp.p7k
     130  9 janv. 12:16 tmp.p7k
     78  9 janv. 12:16 tmp.p7k
     52  9 janv. 12:16 tmp.p7k
     26  9 janv. 12:16 tmp.p7k
     LOOP FOUND: len=35 arcs=8 at 7
    

    The loop is ready for working with other sizes :-)

    [yg]$ cat fuse.sh 
    
    TMP="tmp.p7k"
    TMP_SRC="tmp_src.p7k"
    TMP_DST="tmp_dst.p7k"
    
    cp log.05.p7k tmp.p7k
    
    while [ -s $TMP ]; do
     # ls -al $TMP
      LC_ALL=C sort -z -k 1.1,1.5  $TMP > $TMP_SRC
      LC_ALL=C sort -z -k 1.6,1.10 $TMP > $TMP_DST
      ./fusion U $TMP_SRC $TMP_DST $TMP
    done
    
    [yg]$ ./fuse.sh 
     LOOP FOUND: len=21 arcs=1 at 5
     LOOP FOUND: len=21 arcs=1 at 10
     LOOP FOUND: len=21 arcs=1 at 15
     LOOP FOUND: len=21 arcs=1 at 30
     LOOP FOUND: len=21 arcs=2 at 25
     LOOP FOUND: len=84 arcs=4 at 21
     LOOP FOUND: len=84 arcs=3 at 22
     LOOP FOUND: len=84 arcs=5 at 28
     LOOP FOUND: len=84 arcs=8 at 29
     LOOP FOUND: len=84 arcs=6 at 31
    
    

    I seem to be getting one half of the orbit numbers :

    $ cp ../ref.p7k/log.06.p7k tmp.p7k
    $ ./fuse.sh 
     LOOP FOUND: len=693 arcs=23 at 62
     LOOP FOUND: len=693 arcs=21 at 63
     LOOP FOUND: len=693 arcs=20 at 60
    

    In the log 16. More orbits ! I found 6 orbits for w6 but only 3 are reported here.

    This is explained by the fact that the new scanner scans the semi-arcs, that is : each semi-arc walks 2 arcs at the same time. So the number of loops/orbits must be doubled.

    $ cp ../ref.p7k/log.07.p7k tmp.p7k; ./fuse.sh
     LOOP FOUND: len=45 arcs=1 at 79
     LOOP FOUND: len=234 arcs=2 at 110
     LOOP FOUND: len=234 arcs=2 at 121
     LOOP FOUND: len=390 arcs=6 at 114
     LOOP FOUND: len=234 arcs=7 at 99
     LOOP FOUND: len=1170 arcs=13 at 119
     LOOP FOUND: len=1170 arcs=20 at 126
     LOOP FOUND: len=1170 arcs=16 at 117
     LOOP FOUND: len=1170 arcs=15 at 122
     LOOP FOUND: len=1170 arcs=20 at 118
     LOOP FOUND: len=1170 arcs=26 at 127
    

    w7: 22 orbits instead of 26.

    $ cp ../ref.p7k/log.08.p7k tmp.p7k; ./fuse.sh |wc -l
    167
    

    w8 can find only 167 of the estimated 572 orbits, all of length 115. But it doesn't matter since we're only interested in state spaces with only one pair of orbits.

    $ cp ../ref.p7k/log.09.p7k tmp.p7k; ./fuse.sh |wc -l
    53

    w9 only shows orbits with 200, 260 and 2600 steps.

    $ cp ../ref.p7k/log.10.p7k tmp.p7k; ./fuse.sh
     LOOP FOUND: len=524799 arcs=1024 at 1023

    w10 looks perfect, as expected.

    $ cp ../ref.p7k/log.11.p7k tmp.p7k; ./fuse.sh
     LOOP FOUND: len=2922 arcs=5 at 1795
     LOOP FOUND: len=1046076 arcs=1038 at 2047
     LOOP FOUND: len=1046076 arcs=1005 at 2046
    

    w11 finds only 6 orbits out of 8 but the lengths do not match 5844 or 179 (1046076 is correct though)

    $ cp ../ref.p7k/log.12.p7k tmp.p7k; ./fuse.sh
     LOOP FOUND: len=71107 arcs=34 at 3894
     LOOP FOUND: len=71107 arcs=35 at 4071
     LOOP FOUND: len=2062103 arcs=1032 at 4093
     LOOP FOUND: len=2062103 arcs=986 at 4094
     LOOP FOUND: len=2062103 arcs=990 at 4095
    ...
    Read more »

  • Fusion 3: the better algorithm

    Yann Guidon / YGDES, 01/08/2022 at 08:52, 0 comments

    After playing by hand with real data in the log 103. Fusion 2: the manual tinkering, I have a better grasp of what I'm dealing with, and how. The odd/even idea is interesting but in the end, the algorithm boils down to: "Is a semi-arc already output ?" Each new entry, with its pair of semi-arcs, can output nothing, one of the semi-arcs, or the fused pair. It's as simple as that.

    Meanwhile, since not all semi-arc pairs are fused, the rate of shrinking of the files is... lower than 50% per pass. But still probably more than 25%. This doubles the number of passes but that's still OK. I want the fusion program to be simple and efficient enough that it does not require crazy optimisations that would take ages to fine-tune, just so it could run a bit faster for a run that takes a few hours at most.

    Anyway, the data show that most merges occur at the beginning of the list, the first entries have a much higher chance of being merged, and the last ones often linger... The idea to perform a pair of passes with sorts in increasing and decreasing directions looks good, in this light.

    The RAM requirements are reasonable, since only 1 bit per semi-arc is necessary. 4Gi points require 512MiB for w32, so I'm good up to w36 with 16GiB installed. Each semi-arc is identified by its starting point so that's all that counts... A priori, no need of extra bonus flags with this new version.
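
    For illustration only, such a 1-bit-per-semi-arc scoreboard could be sketched like this in C; the names and layout are mine, not taken from the actual fusion program:

    #include <stdint.h>
    #include <stdlib.h>

    /* One bit per semi-arc, indexed by its starting point: 2^w bits,
       i.e. 2^(w-3) bytes (512 MiB for w32). Illustrative sketch only. */
    typedef struct {
        uint8_t  *bits;
        uint64_t  count;          /* number of semi-arcs = 2^w */
    } arc_map;

    static int arcmap_init(arc_map *m, int w)
    {
        m->count = 1ULL << w;
        m->bits  = calloc((m->count + 7) / 8, 1);
        return m->bits != NULL;
    }

    static void arcmap_mark(arc_map *m, uint64_t start)    /* semi-arc already output */
    {
        m->bits[start >> 3] |= (uint8_t)(1u << (start & 7));
    }

    static int arcmap_test(const arc_map *m, uint64_t start)
    {
        return (m->bits[start >> 3] >> (start & 7)) & 1;
    }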

    But if the algo works well for "maximal orbits", the behaviour is not yet clear for the weird cases of multiple orbits with various lengths. I'll have to test it for w5 for example!


    So let's just do it.

    By the way, w5 has 16 orbits, of lengths 4, 42 and 84.

    sorted by src:
     0  1 1  1
     1 14 1 14
     2 28 1 14
     3  8 1  8
     4 21 1  6
     5  5 1 21
     6 16 1  8
     7 29 1 13
     8 22 1 10
     9 24 1  8
    10 10 1 21
    11 18 1 57
    12 31 1  7
    13 19 1 19
    14 26 1 20
    15 15 1 21
    16 17 1  3
    17  7 1 12
    18  4 1 17
    19 27 1  5
    20 25 1 16
    21 11 1  4
    22  3 1 66
    23  9 1 29
    24  2 1  9
    25 20 1  5
    26 12 1 41
    27  6 1 17
    28 23 1 24
    29 13 1  7
    30 30 1 21
    31  0 1  1
    
    Sorted by dst:
    31  0 1  1
     0  1 1  1
    24  2 1  9
    22  3 1 66
    18  4 1 17
     5  5 1 21
    27  6 1 17
    17  7 1 12
     3  8 1  8
    23  9 1 29
    10 10 1 21
    21 11 1  4
    26 12 1 41
    29 13 1  7
     1 14 1 14
    15 15 1 21
     6 16 1  8
    16 17 1  3
    11 18 1 57
    13 19 1 19
    25 20 1  5
     4 21 1  6
     8 22 1 10
    28 23 1 24
     9 24 1  8
    20 25 1 16
    14 26 1 20
    19 27 1  5
     2 28 1 14
     7 29 1 13
    30 30 1 21
    12 31 1  7
    
    side by side:
     0  1 1  1    31  0 1  1
     1 14 1 14     0  1 1  1
     2 28 1 14    24  2 1  9
     3  8 1  8    22  3 1 66
     4 21 1  6    18  4 1 17
     5  5 1 21     5  5 1 21
     6 16 1  8    27  6 1 17
     7 29 1 13    17  7 1 12
     8 22 1 10     3  8 1  8
     9 24 1  8    23  9 1 29
    10 10 1 21    10 10 1 21
    11 18 1 57    21 11 1  4
    12 31 1  7    26 12 1 41
    13 19 1 19    29 13 1  7
    14 26 1 20     1 14 1 14
    15 15 1 21    15 15 1 21
    16 17 1  3     6 16 1  8
    17  7 1 12    16 17 1  3
    18  4 1 17    11 18 1 57
    19 27 1  5    13 19 1 19
    20 25 1 16    25 20 1  5
    21 11 1  4     4 21 1  6
    22  3 1 66     8 22 1 10
    23  9 1 29    28 23 1 24
    24  2 1  9     9 24 1  8
    25 20 1  5    20 25 1 16
    26 12 1 41    14 26 1 20
    27  6 1 17    19 27 1  5
    28 23 1 24     2 28 1 14
    29 13 1  7     7 29 1 13
    30 30 1 21    30 30 1 21
    31  0 1  1    12 31 1  7
    

    Here we go.

    Let's use a real scoreboard now, by the way:

    0                 15                  31
    |                 |                   |
    .... .... .... .... .... .... .... ....
    
     0  1 1  1    31  0 1  1    =>  31 1 2 2
    +... .... .... .... .... .... .... ...*
     1 14 1 14     0  1 1  1     =>   1 14 1 14
    ++.. .... .... .... .... .... .... ...*
     2 28 1 14    24  2 1  9   => 24 28 2 23
    +++. .... .... .... .... .... *... ...*
     3  8 1  8    22  3 1 66  =>  22 8 2 74
    ++++ .... .... .... .... ..*. *... ...*
     4 21 1  6    18  4 1 17  => 18 21 2 23
    ++++ +... .... .... ..*. ..*. *... ...*
    
     5  5 1 21     5  5 1 21   => orbit closed ! len=21
    ++++ ++.. .... .... ..*. ..*. *... ...*
     6 16 1  8    27  6 1 17  =>  27 16  2 25
    ++++ +++. .... .... ..*. ..*. *..* ...*
     7 29 1 13    17  7 1 12   => 17 29 2 25
    ++++ ++++ .... .... .**. ..*. *..* ...*
     8 22 1 10     3  8 1  8  => 8 22 1 10  (3 already out)
    ++++ ++++ +... .... .**. ..*. *..* ...*
     9 24 1  8    23  9 1 29  =>  23 24 2 37
    ++++ ++++ ++.. .... .**. ..** *..* ...*
    10 10 1 21    10 10 1 21  => Orbit closed ! len=21
    ++++ ++++ +++....
    Read more »

  • Fusion 2: the manual tinkering

    Yann Guidon / YGDES, 01/08/2022 at 03:02, 0 comments

    That log's going to be a long one...

    The log 100. Fusion was about the ideas and principles but all the mental simulation will not replace actual data. So I started to play with small logs to get an idea of the correctness of my assertions.

    And... you can guess that it didn't go as planned. I had overlooked a nuance or two that break the idealistic algorithm I had in mind. It's not totally thrown away though it must be refined and updated with new principles.

    It took a few tries to get something off the ground but at least, now, I see what was missing from my first idea: the odd/even parity of a semi-arc's position. If you fuse all pairs of semi-arcs, you have to be careful because they are provided in random order. So if you process them as they come, every second semi-arc you encounter may be at an "odd" position, that is, with a start and end index that will be (or has already been) eliminated through fusion.

    Imagine, you have these elements that make an orbit :

    1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-1

    You can pair them as they come :

    • 1-2 paired with 2-3, eliminate 2 and get 1-3
    • 3-4 4-5 => 3-5, eliminate 4
    • 5-6 6-7 => 5-7, remove 6
    • 7-8 8-1 => 7-1, eliminate 8

    We get 1-3 3-5 5-7 7-1 and with the above algo, we get fewer sequences: 1-5 (drop 3) 5-1 (drop 7), and the last pass finds that we get to 1-1 which is a closed orbit.

    But the numbers are not provided in such a neat sorted way. It's a pretty good random order, in fact, which creates some problems, particularly in the context where 1) we can only access the list of semi-arcs sequentially 2) RAM can only contain a limited amount of data compared to the stream size. So let's have a look at a simple case: w3.

    The list of semi-arcs sorted by the "source" or beginning is:

    0 1 1  1
    1 6 1  6
    2 7 1 10
    3 2 1  5
    4 5 1  3
    5 3 1  4
    6 4 1  5
    7 0 1  1

    When sorted in order of "destination", or end of the semi-arc:

    7 0 1  1
    0 1 1  1
    3 2 1  5
    5 3 1  4
    6 4 1  5
    4 5 1  3
    1 6 1  6
    2 7 1 10

    Together, side by side :

    0 1 1  1    7 0 1  1
    1 6 1  6    0 1 1  1
    2 7 1 10    3 2 1  5
    3 2 1  5    5 3 1  4
    4 5 1  3    6 4 1  5
    5 3 1  4    4 5 1  3
    6 4 1  5    1 6 1  6
    7 0 1  1    2 7 1 10

    This is what the fusion program will see, and will process sequentially.

    The basic algorithm is quite simple:

    if the entry is valid and contains no removed value, then
       *  output the new entry calculated as
              dst.src  src.dst  src.arcs+dst.arcs  src.len+dst.len
       *  mark entries that contain src.src to be removed/ignored

    But it won't work. So let's see how things would be ideally, by ignoring the constraint of "only going forward", and let's fuse the entries by hand:

    0 1 1  1    7 0 1  1   fuses to 7  1  2 2 , remove 0
    let's get the line with dest.src=1:
    6 4 1  5    1 6 1  6   => 1  4  2 11,  remove 6
    Now let's find the line with dest.src=4:
    5 3 1  4    4 5 1  3   => 4  3  2  7  remove 5
    and now, dest.src=3:
    2 7 1 10    3 2 1  5   => 3  7  2 15  => remove 2
    

    The start and end are the same number = 7 so we're good. But in practice, in the order of the stream, this gives:

    0 1 1  1    7 0 1  1   => 7  1  2 2  => remove 0
    1 6 1  6    0 1 1  1  removed by 0
    2 7 1 10    3 2 1  5   => 3  7  2 15  => remove 2
    3 2 1  5    5 3 1  4  removed by 5
    4 5 1  3    6 4 1  5  removed by 6
    5 3 1  4    4 5 1  3   => 4  3  2  7  remove 5
    6 4 1  5    1 6 1  6   => 1  4  2 11  remove 6
    7 0 1  1    2 7 1 10  removed by 2

    The result is a shuffled mess, as well:

    7  1  2  2
    3  7  2 15 
    4  3  2  7
    1  4  2 11

    So how do we get there from the original sequence ?

    First, we have to define an origin, a standard starting point. This is typically the entry starting at 0, so it ends at 1, and implicitly it is fused with the semi-arc that starts at MASK (the equivalent of -1 if signed numbers existed).

    Second, we must use a few inferences to remove some entries from partial information. Let's fuse the first 2 lines:

    src           dst
    0 1 1  1    7 0 1  1     => 7  1  2  2  =>   remove 0
    1 6 1  6    0 1 1  1   dst.src removed by 0
    

    Since 0-1 on the 2nd line is already processed by the first line (since that line contains dst.src < src.src), not only can we drop that line, but we can...

    Read more »

  • w26 is (even more likely) maximal !

    Yann Guidon / YGDES, 01/04/2022 at 00:06, 0 comments

    As I supposed, w26 is "good", or "maximal" !

    The computation of the 64 million semi-arcs is complete and the sum is precisely 2251799847239679, as expected and according to the formula.

    Since the sum is perfect, I'm tempted to say that the fusion is not required for confirmation but w24 too has a perfect sum yet is known to not be maximal. OTOH, during the linear scan, w26 has shown signs that it is maximal so the fusion should be another confirmation rather than a surprise.

    Anyway this is a great milestone that is finally achieved (let's see the results of the fusion), yet w32 is 4096 times more difficult to reach. The quest is far from over.

    .

    Update 20220109:

    Look at 105. Fusion 4: it works for the final confirmation.

View all 111 project logs


Discussions

Ezra Wolf wrote 6 days ago

Wow !


Yann Guidon / YGDES wrote 2 days ago

Hi Ezra, feel free to elaborate :-)


Yann Guidon / YGDES wrote 01/17/2022 at 08:49

Correction to some of my statements here and there (and my videos) : φ is not transcendental but an irrational number that is a solution to the quadratic equation x² − x − 1 = 0. Mea culpa, but at least I learned something from one of the Bogdanoff brothers' countless errors and confusions :-D


Yann Guidon / YGDES wrote 01/14/2022 at 16:55

Oh my !

https://en.wikichip.org/wiki/x86/adx

Now, THAT's a hack :-)

And I can see how it could increase the speed of some algorithms, even this one...


Tim wrote 01/14/2022 at 19:44

On every other architecture, yes. On x86 it does not add notably to the entropy :)



Yann Guidon / YGDES wrote 10/23/2021 at 21:33

Interesting !

https://research.nccgroup.com/2021/10/15/cracking-random-number-generators-using-machine-learning-part-1-xorshift128/

https://research.nccgroup.com/2021/10/18/cracking-random-number-generators-using-machine-learning-part-2-mersenne-twister/


[deleted]

[this comment has been deleted]

Yann Guidon / YGDES wrote 10/13/2021 at 14:39

so far, half a year and 50£.


Yann Guidon / YGDES wrote 09/14/2021 at 04:16

George Marsaglia and Arif Zaman got quite a lot figured out in 1991:
https://projecteuclid.org/download/pdf_1/euclid.aoap/1177005878

Why didn't this catch on ? They focus on extremely large sequences with tens of registers of lag, and didn't look enough at just a pair of registers. They also didn't look at the link with checksums.


Yann Guidon / YGDES wrote 09/08/2021 at 07:28

Funny !

https://en.wikipedia.org/wiki/Lagged_Fibonacci_generator

Toward a universal random number generator, G.Marsaglia, A.Zaman


Yann Guidon / YGDES wrote 08/05/2021 at 23:24

https://www.researchgate.net/profile/Michael_Greenwald/publication/3334567_Performance_of_checksums_and_CRCs_over_real_data/links/53f4b15f0cf2888a749113c5/Performance-of-checksums-and-CRCs-over-real-data.pdf

"In practice, TCP trailer sums outperform even Fletcher header sums."

Ouch.


Yann Guidon / YGDES wrote 08/08/2021 at 00:51

"Ones complement (with EAC) Fletcher, however, has a weakness that sometimes offsets its probabilistic advantage: since bytes containing either 255 or 0 are considered identical by the checksum, certain common pathological cases cause total failure of Fletcher-255."

Re-ouch.


Yann Guidon / YGDES wrote 08/05/2021 at 23:09

http://users-tima.imag.fr/cdsi/guyot/Cours/Oparithm/english/Modulo.htm and others say that special adders exist to compute mod 2^n-1.

But this is not necessary because the carry is explicitly integrated in the next cycle.


Yann Guidon / YGDES wrote 08/05/2021 at 22:20

https://www.cs.auckland.ac.nz/compsci314s1t/resources/Checksums.pdf is a nice beginner-level overview of error detection basics. Polynomial checks are mentioned, as well as 1s complement / end-around carry; Fletcher is well explained, but there is no sign of Fibonacci.


Yann Guidon / YGDES wrote 05/24/2021 at 03:52

Oh, nice !

https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html#Integer-Overflow-Builtins

