
PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Another purely additive, non-cryptographic, fast and minimalistic checksum.

Computes a 32-bit checksum with a pair of 16-bit registers:
- very small, using only ADD,
- without the flaws of Adler32 and Fletcher, yet with the same size and complexity,
- almost as good as CRC32, which it may replace when the memory footprint is critical,
- very easy to implement in digital circuits as well as on microcontrollers,
- suitable for detecting trivial corruption in network packets or files,
- not patented,
- inappropriate for cryptography or hash tables.

2×32 bits and larger are possible with the non-masking algorithm using w26.

If you wonder why, just click and read the log.

The original algo/project name was "Fibonacci checksum", but it later appeared that this was not the most accurate choice, because Leonardo Pisano (the real name of Fibonacci) is the name associated with the periodicity of the sequence under a modulus. All I did was add the tiny detail of the "end-around carry" (or "ones'-complement addition"), which changed everything.

 
-o-O-0-O-o-
 

In 2012, Piano Magic released their album "Life Has Not Finished with Me Yet". One song contains a weird repeating pattern...

Glen Johnson's lyrics are often cryptic, more evocative than objective, but any geek's mind would latch onto the mantra at the end:

"Add X to Y and Y to X"

This is really weird but... why? What's the point of this obscure "song" with no precise theme or clear subject? And what does it do? That last question is the easiest to answer: just follow the damned algorithm.

Here we go...

X=1, Y=0
Y+=X => 0+1=1
X+=Y => 1+1=2
Y+=X => 1+2=3
X+=Y => 2+3=5
Y+=X => 3+5=8
X+=Y => 5+8=13
Y+=X => 8+13=21
X+=Y => 13+21=34
Y+=X => 21+34=55
...

No need to go further: most of you will have recognised http://oeis.org/A000045, the famous Fibonacci sequence.

This gave me a compelling idea: modify the old Fletcher & Adler algorithms, keeping their very low footprint and minimal computational complexity. Both of these well-known algos use a pair of values and have a similar structure. The trick is that rearranging the data dependency graph provides the equivalent of a minimalistic polynomial checksum, because the result is fed back on itself, in a more robust way than in Fletcher's algo.

At first glance, the body of this new checksum loop becomes something like:

Y += ( X ^ datum[i  ] );
X += ( Y ^ datum[i+1] );
i+=2;

This loop body is totally trivial to unroll. Just as trivial is the design of the corresponding digital circuit. This early version seemed to contain the whole checksum entropy in the last computed value, but now X and Y are both part of the signature. And the really important enhancement is the use of the carry!

For superscalar 32-bit CPUs, the following code seems to work well, though the missing carry hardware (and/or the lack of language support) requires more instructions to emulate it:

t = X + Y + C;    Y = X + data;
C = t >> 16;      X = t & 0xFFFF;

A more efficient variation exists that does not use a temporary variable:

C += X + Y;
Y = X + data;
X = C & MASK;
C = C >> WIDTH;

In this worst case, without support for a carry flag, that's 5 basic operations (not counting the memory loads) that fit in 4 registers and 3 cycles to process 2 bytes. Not too bad. I'll let you deal with alignment. But is it really safe, or safer?
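
Wrapped into a function, the loop might look like this minimal sketch (the init values, the 16-bit input granularity and the folding of X and Y into the 32-bit result are illustrative assumptions, not fixed choices of the project):

#include <stdint.h>
#include <stddef.h>

#define WIDTH 16
#define MASK  0xFFFFu

/* Checksum a buffer of 16-bit words with the carry-based loop above.
   The init values and the result folding are illustrative assumptions. */
uint32_t peac16(const uint16_t *data, size_t len)
{
    uint32_t X = 1, Y = 0, C = 0;
    for (size_t i = 0; i < len; i++) {
        C += X + Y;         /* the carry wraps around into the next step */
        Y  = X + data[i];   /* inject the datum */
        X  = C & MASK;
        C >>= WIDTH;
    }
    return ((Y & MASK) << 16) | X;  /* both registers form the signature */
}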

The following logs will show how the idea evolves and the performance increases, through discussions about carry wrap-around, register widths, scheduling, avalanche, parallelism, orbits, structure, and escaping black holes...

 
-o-O-0-O-o-
 

Logs:
1. The variable
2. Datapath
3. Adler32 weakness
4. Easy circuit
5. Periodicity
6. A promising 32-bit checksum
7. Progress
8. Why
9. Orbital mechanics
10. Orbits, trajectories and that damned carry bit
11. Two hairy orbits
12. How to deal with black holes
13. Anteriority
14. Moonlighting as a PRNG
15. A new name
16. More orbits !
17. First image
18. Structure and extrapolations
19. Even more orbits !
20. Start and stop
21. Some theory
22. Some more theory
23. Even more theory.
24. A little enhancement
25. sdrawkcab gnioG
26. Further theorising
27. What is it ?
28. A bigger enhancement
29. Going both ways
30. Can you add states ?
31. Please, GCC !
32. _|_||_||_|______|______|____|_|___|_________|||_|__
33. Carry on with GCC
34. A good compromise
35. The new non-masking algorithm
36. Add or XOR ?
37. A 32/64-bit version
38. Closing the trajectories
39. Stats about missed crossings
40. Application : Displace LFSR in BIST
41. I need a big cluster.
42. Another theoretical rambling
43. An individual job and all the necessary processing
44....


scan_1x_02.c

A basic single-threaded scanner for the smaller widths.

x-csrc - 2.53 kB - 12/04/2021 at 23:17


faster_05.c

A reference code for the scan algorithm: unrolled 2× and handles semi-arcs. Condition bug is solved.

x-csrc - 1.77 kB - 12/03/2021 at 21:57


p4ck.c

packs 2 decimal digits in one byte

x-csrc - 1.37 kB - 12/03/2021 at 18:15


kc4p.c

unpacks a byte into two decimal digits

x-csrc - 1.04 kB - 12/03/2021 at 18:15


snippets_v3.c

basic PRNG, checksum, scrambler/descrambler (with and without carry) using PEACw16 in C

x-csrc - 4.39 kB - 10/10/2021 at 14:28



  • Post-processing: the strategy looks good

    Yann Guidon / YGDES, 2 hours ago

    I slept on the last log 91. Post-processing: the strategy is coming and I just realised that the join/coalesce/merge/fusion program can work in either direction, so there is only one program to write, yay. Only one command-line option needs to be added, to allow fusions with forward or backward arcs. Otherwise, since there are 2 synchronised streams, the fusion program doesn't care whether they are sorted in increasing or decreasing order.

    So far, the rest of the post-processing algorithm relies heavily on the features of the stock GNU sort program. It provides all the required options (-n, -k) and a load test just showed that it uses all the available processors without having to enable any flag/option. The only remaining issue is that it only deals with text data, which more than doubles the required storage space. Later, I'll see if/how I can re-encode the logs with the "MIDI-like" format (7 bits of data per byte).

    Anyway, processing w32 will take maybe 4 or 5 passes with files that exceed the RAM size, each pass halving the size of the dataset. I use an external SSD for development; I could add a second one to double the bandwidth. I should check whether the laptop has enough high-speed ports available.

    Each pass contains a pair of fusion operations:

    • First, sort the data set from the primary SSD in increasing order of Xorigin
    • Create a copy that is sorted in increasing order of Xend
    • Run the fusion program (to be coded): it "streams" the contents of both files above to create a new stream with half the size on the secondary SSD. The program contains a RAM-resident scoreboard: 1 bit set for every entry that has been processed/removed from the output stream.
    • Sort the result again, this time in decreasing order
    • Run the fusion again.
    • If the output stream still contains arcs, loop again.

    The development burden seems reasonable; I suppose most of the time will be spent sorting the huge log files. Sadly, sort can't work on p4k files directly, so I must develop yet another file/encoding format... p7ck?

    The post-processing method seems clear enough, so I can return to developing the parallel scanners ;-)

  • Post-processing: the strategy is coming

    Yann Guidon / YGDES, 17 hours ago

    The last log Post-processing mentions some constraints for the processing required after the parallel scans. I have been thinking more about this lately and there is some progress.

    First, the format of the logs: each line now contains 4 decimal values,

    1. The destination/end of the semi-arc
    2. The origin/start of the semi-arc
    3. The length/number of steps of that semi-arc
    4. The count of semi-arcs.

    For w4: (slightly edited/formatted for readability)

    $ cat w4.log
    3   1 18 1
    11  2  6 1
    7   3  5 1
    10  4 14 1
    6   5  4 1
    14  6  5 1
    8   7 10 1
    9   8  3 1
    13  9 24 1
    12 10  4 1
    5  11  8 1
    2  12 16 1
    4  13  7 1
    15 14  9 1
    0  15  1 1

    The 2nd column is just the increasing sequence of integers, starting at 1 because we start with C=0, and X=0 Y=0 will loop back to this "stuck state".

    Corollary: the reported length of the orbit (sum of all the semi-arcs) is off-by-one since one semi-arc is missing, this is the special case going from C=1 X=0 Y=0 to C=0 X=1 Y=0.

    The last column will be incremented little by little as lines are coalesced. This is the hard part...

    One critical aspect is that the coalescence should be easy to "stream" (or "strip-mine"), because the logs are way larger than my 16GB of RAM. Operations should be performed on files, i.e. streams of linear data. To ease this, I'll use existing POSIX tools where possible (to reduce coding effort), in particular sort (once again, edited/formatted):

    $ sort -g w4.log
     0 15  1 1
     2 12 16 1
     3  1 18 1
     4 13  7 1
     5 11  8 1
     6  5  4 1
     7  3  5 1
     8  7 10 1
     9  8  3 1
    10  4 14 1
    11  2  6 1
    12 10  4 1
    13  9 24 1
    14  6  5 1
    15 14  9 1

    (see? no line starting at 1, but it's implicit)

    But even sort has its limits. So the files should have a limited size, for example 1GB per chunk.

    For w32, with a total log size that could reach 100GB, the working dataset should be spread over 256 files (approx. 500MB each). It's easy to program: take the 8 MSB of the origin index and send the line to the corresponding file, as sketched below.

    256 is also a good value because Linux limits the number of open files (see ulimit -n), typically to 1024. Working with 256 files per dataset (input and output) leaves enough headroom without having to tweak the OS configuration.
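
    A sketch of that dispatch (the record layout and the names are assumed, not the actual format):

      #include <stdio.h>
      #include <inttypes.h>

      /* Send one semi-arc record to 1 of 256 chunk files, indexed
         by the 8 MSB of its origin. chunk[] holds the open files. */
      void dispatch(FILE *chunk[256], uint32_t end, uint32_t origin,
                    uint64_t len, uint32_t count)
      {
        fprintf(chunk[origin >> 24], "%" PRIu32 " %" PRIu32 " %" PRIu64 " %" PRIu32 "\n",
                end, origin, len, count);
      }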

    Sorting all the lines by their destination is very important: by streaming the unsorted and the sorted files together, we can fuse consecutive semi-arcs without random accesses. Having both sorted and unsorted versions will double the required storage capacity, but the dataset should shrink quickly through the fusion process... Eventually, an unsorted block could be sorted "in place" through pipes that stream the data directly to our program. UNIX has a few tricks available for us, but I would like to keep it simple.

    The algorithm revolves around sort doing the hard work, so that a simple program can handle the fusion of semi-arcs (at first). The goal is to avoid lookups or seek() through the whole dataset. Here is how it is possible: let's look at the first lines of the unsorted and sorted files.

    $ cat w4.log      $ sort -g w4.log
    3   1 18 1        0 15  1 1
    11  2  6 1        2 12 16 1
    7   3  5 1        3  1 18 1
    

    The first line has a little problem because it is missing the tricky transition from C=1 X=0 Y=0 to C=0 X=1 Y=0. We'll keep it as is, to mark the "primary orbit" (which is open). So let's look at the 2nd line: X=2 is reached by coming from X=12. So we can replace those 2 lines with a single one that reads:

    11 12 22 2

    X=12 goes to X=11 in 6+16=22 steps and 1+1=2 semi-arcs.

    However, the fusion means that the unsorted stream will encounter the fused entry a second time, and it should not be processed again. Our program should contain a scoreboard that marks each value as used so it will be skipped. For w32, it can be held in 512MB (1 bit per value) and leave enough room for the rest. Interestingly, the scoreboard is written out of order but read back in order, so the penalty is not enormous.
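
    The scoreboard itself is simple; here is a sketch, assuming w32, a flat 512MB allocation and made-up names:

      #include <stdint.h>
      #include <stdlib.h>

      static uint8_t *scoreboard;   /* 1 bit per possible origin: 2^32/8 = 512MB */

      int scoreboard_alloc(void)
      {
        scoreboard = calloc((size_t)1 << 29, 1);
        return scoreboard != NULL;
      }

      void mark_used(uint32_t x) { scoreboard[x >> 3] |= 1u << (x & 7); }
      int  is_used(uint32_t x)   { return (scoreboard[x >> 3] >> (x & 7)) & 1; }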

    Fusion can only go forwards because otherwise the stream would have to modify its own history. It's not impossible if...


  • Post-processing

    Yann Guidon / YGDES, 3 days ago

    At this moment, the process of validating huge statespaces goes like this, in 2 phases:

    • Scanning the semi-arcs: the heavy, parallel phase distributes ranges of arcs to scan among whatever processing units are available. Each range then creates a log file that contains the length of each consecutive arc, as well as the coordinates of the crossings.
    • The post-processing phase takes all the log files and reconstructs the orbit(s) by concatenating arcs. This is where we get to know whether the corresponding statespace is "maximal", or ideal.

    The latter part is the subject of this log: once we get all those log files (I estimate 50GB for w32), what can be done?

    The first and easiest process that comes to mind is to compute the sum of the lengths of all the semi-arcs. This sum should have a certain value that we have already validated. The sum alone is not a fail-proof method, however: we have seen cases, such as w6 and its 6 orbits, where no small orbit sits isolated in a sea of huge orbits (unlike w12 or w14), so the sum matches even though the statespace is not maximal.

    Summing all the arcs should be easy: a script computes the sum for each log file, then all these sums are added together. The filename itself could even contain the per-file sum, computed when the scanner writes the results to disk.
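
    As a sketch, the per-file sum could look like this (assuming the 4-value line format shown in the log above; the names are made up):

      #include <stdio.h>
      #include <inttypes.h>

      /* Sum the semi-arc lengths (3rd field) of one log file. */
      uint64_t sum_arc_lengths(FILE *f)
      {
        uint64_t end, origin, len, count, total = 0;
        while (fscanf(f, "%" SCNu64 "%" SCNu64 "%" SCNu64 "%" SCNu64,
                      &end, &origin, &len, &count) == 4)
          total += len;
        return total;
      }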

    To get a taste of what might happen, I tested with the smallest widths:

    $ for i in 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ; do
      gcc -Wall -Os -DWIDTH=$i scan_1x_02.c -o scan
      ./scan
    done
    
    Width=2 Total= 8 vs 8 (missing 0)
    Width=3 Total= 34 vs 34 (missing 0)
    Width=4 Total= 134 vs 134 (missing 0)
    Width=5 Total= 524 vs 526 (missing 2)
    Width=6 Total= 2078 vs 2078 (missing 0) <=== hey !
    Width=7 Total= 8156 vs 8254 (missing 98)
    Width=8 Total= 19204 vs 32894 (missing 13690)
    Width=9 Total= 130659 vs 131326 (missing 667)
    Width=10 Total= 524798 vs 524798 (missing 0)
    Width=11 Total= 2095073 vs 2098174 (missing 3101)
    Width=12 Total= 8390625 vs 8390654 (missing 29)
    Width=13 Total= 33558524 vs 33558526 (missing 2)
    Width=14 Total= 134225868 vs 134225918 (missing 50)
    Width=15 Total= 536775077 vs 536887294 (missing 112217)
    Width=16 Total= 2147516414 vs 2147516414 (missing 0)
    Width=17 Total= 8589998504 vs 8590000126 (missing 1622)
    Width=18 Total= 34359868344 vs 34359869438 (missing 1094)
    

    Except for w6, all the widths with more than one pair of orbits show a mismatched sum. This means that the sum is not the definitive answer, but it is definitely part of it: if the sum doesn't match the expected total, some orbits don't cross the Y=0 line and the corresponding statespace is not maximal.

    ...

    The second part is still required, since the sum is only a first check. It is less easy to implement, particularly when the logs exceed the size of the RAM: partial lookups over multiple passes become necessary.

    Thanks to YouTube's suggestions for reminding me of this nice video.

    See you in the next log.

  • Packing and unpacking numbers

    Yann Guidon / YGDES, 4 days ago

    The log 88. Back to the jobs considered the need and the means to compress the scanning logs, because the files could exceed 50GB for w32. I looked at the idea of a base94 encoding but...

    I think I have found a better method: packing two decimal digits into a single byte, as a pair of BCD digits. This is better than base94 because the latter would still use one full byte for each separator (space, carriage return, etc.).

    The resulting byte stream still has some potential for compression, with bzip2 for example. The delicate part is how to map the extra codes, but ASCII makes it rather easy.

    $ cat 1234
    0123456789
    1.2,3 4
    $ ./p4ck 1234 > 1234.p4k
    $ od -An -v -w1 -t x1 1234.p4k 
     10
     32
     54
     76
     98
     1a
     2e
     3c
     4b
     fa
    $ ./kc4p 1234.p4k 
    0123456789
    1.2,3 4
    ?

    It seems to work. Mapping is managed with minimal effort, as noted in the source file:

      A simple filter that packs 2 decimal digits into one byte,
       by truncating the MSB.
    
      notes:
       * LF=0x0A is directly mapped, equivalent to "*", ":", "J", "Z"...
       * CR=0x0D is directly mapped, equivalent to "]", "-", "=", "M"...
       * Space is 0x20 and collides with "0", remapped to ";"
       * '.' is mapped to '>'
       * ',' is mapped to '<'
    
  The last char may be mapped to 15/0xF/ShiftIn
    if the input stream's size is odd. The decoder
    will output a single '?' in this case.

    You can find the source code in p4ck.c and kc4p.c. If you need a different mapping, edit the source code or use tr.
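
    For illustration, the core of the packing can be sketched as below. This is a simplification of p4ck.c, not the actual code: only the space remap is handled here, while the real filter implements the complete table above.

      #include <stdio.h>

      /* Keep the low nibble of each input byte ('0'..'9' = 0x30..0x39
         become 0x0..0x9) and pack 2 chars per output byte, first char
         in the low nibble, matching the od dump above. */
      int main(void)
      {
        int a, b;
        while ((a = getchar()) != EOF) {
          b = getchar();
          if (b == EOF) b = 0x0F;   /* odd length: pad with 0xF (ShiftIn) */
          if (a == ' ') a = ';';    /* space collides with '0': remap */
          if (b == ' ') b = ';';
          putchar(((b & 0x0F) << 4) | (a & 0x0F));
        }
        return 0;
      }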


    Edit:

    I had to test, and it's quite convincing. I generated a log using w16, which looks like:

    ...
    25537 27445
    60251 8174
    52725 16132
    56907 1430
    16989 55241
    45423 47572
    48670 58543
    8348 15748
    18630 18818
    14930 9559
    5439 8871
    34938 2839
    20149 58986
    ...
    

    Then applied p4ck, which directly cut the size in half. Then I compressed them with gzip and bzip2:

    $ l w16.log*
    758979  3 déc.  23:41 w16.log
    316391  3 déc.  23:41 w16.log.bz2
    365463  3 déc.  23:41 w16.log.gz
    379490  3 déc.  23:42 w16.log.p4k
    312992  3 déc.  23:42 w16.log.p4k.bz2
    334504  3 déc.  23:42 w16.log.p4k.gz
    

    The size difference between w16.log.p4k.bz2 and w16.log.bz2 is marginal (about 1%), so bzip2 gets very close to the entropy right from the start. gzip works too, but remains farther from the entropy, since it still gets a boost from p4ck.

    The size difference between w16.log.p4k.bz2 and w16.log.p4k is more noticeable but still small (17%): bzip2 spends a lot of effort to shave off that last 1/6th of redundancy, when simply repacking the digits already brings us near the entropy.

    Conclusion: p4ck is a practical, efficient and lightweight filter to compress the arc logs for archival. For the transmission of huge archives, where every gigabyte counts, the files can be packed further with tar and bzip2.

  • Back to the jobs

    Yann Guidon / YGDES, 6 days ago

    The exploration and validation of w26 and w32 have been the focus of the last 6 months, and the progress is such that I now consider w26 within my reach: I have better software, which is going to get even better, and better computers. I still have to put all the pieces together to complete the puzzle, though.

    Let's look at the past logs about the subject:

    41. I need a big cluster.
    43. An individual job and all the necessary processing
    44. Test jobs
    46. The problem with SIMD
    47. Protocol and format
    48. Just a test
    70. Another strategy to scan the statespace
    71. Scanning with multiple processors

    After the debacle of w26, it appears that the computations should be piecewise verifiable and easy to start, restart, or distribute... So I stick with the first approach, where each computer receives an allocated range to compute. Each semi-arc in that range is computed sequentially and the results are stored in files.

    This reduces the need for a central scoreboard, with its save/restore/synchronisation issues. OTOH the post-processing stage remains daunting: there is a trade-off to find between the compactness of the result files and the ease of processing and examination. Bases higher than decimal, or even hexadecimal, make this possible: a look at the ASCII table shows that all the codes after Space (0x20) and before DEL (0x7F) are printable, so a text editor can examine the files, and standard tools like grep and sort help with the processing.

    Another desirable feature is that the result files can be seamlessly concatenated. So the format is simple: one line per semi-arc.

    With 94 codes per byte, the density is closer to binary than to a decimal representation, though a space is still required to separate the fields. Going from 3 fields to 2 is easily performed with sort and sed: first sort on Xinit, then remove it. The carriage return can't be removed either, but bzip2 can squeeze out some more room when needed.

    I have configured a system where I can store uncompressed logs easily and rather fast (400MB/s read & write), so post-processing will still be possible with, say, 50GB of logs.

    At this moment, it seems best to simply start the computations and define a simple output format, to get useful data that can be later processed and compared. I'll design the processing scripts/programs later...

  • Semi-arcs: they work!

    Yann Guidon / YGDES, 7 days ago

    In this project, we are now used to seeing the pattern of w4, shown below. Go to http://ygdes.com/~whygee/PEAC/a2_07.html to run it yourself.

    The list on the right contains all the "crossings", where Y=0 or Y=15 (or MASK, in the code and other cases). The orbit loops after 135 iterations and 16 crossings, with semi-arcs of the following lengths: 18, 5, 10, 3, 24, 7, 14, 4, 16, 6, 8, 4, 5, 9, 1, 1.

    Definition: semi-arcs are like the previous arcs, but they can end at Y=0 or Y=15.

    In the previous logs, we have explored and checked the "complementarity" of the dual orbits. Due to the central symmetry, we have X'=MASK-X, Y'=MASK-Y and C'=1-C. The orbits evolve strictly identically.

    So it should make sense that, at any moment, the coordinates can be "complemented" without affecting the orbit's properties. In the following code, the coordinates are complemented at each crossing where Y=MASK. Look for yourself at http://ygdes.com/~whygee/PEAC/a2_07b.html

    It works as expected, even though it sounds counter-intuitive to me. But the numbers spoke. The patterns change significantly, but the semi-arcs have the same lengths as in the original case. The central symmetry is still here as well. It's uncanny, but it's there.

    The big improvement is that both orbits are now covered with only one orbit's worth of computation, and ALL the semi-arcs start with the same Y value. This makes the scan 2× more efficient, while still giving the same results as if the 2 complementary orbits were scanned.

    The "crossing" pattern is quite different from the original orbit's, but I have tested the equivalence with other sizes and the maths match again. Compare for w10:

    http://ygdes.com/~whygee/PEAC/a2_06.html (normal)
    http://ygdes.com/~whygee/PEAC/a2_06b.html (semi-arcs)

    The simplification is significant for the scheduling algorithm. We don't have to check extra conditions: the new "semi-arcs" start at Y=0 and end at Y=0 or Y=MASK, Y is cleared during scheduling and that information can be discarded. Just don't forget to invert X, of course, as in the sketch below.
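
    A sketch of one scheduling step (the starting value C=1 at a Y=0 crossing is an assumption, taken from the note about C in the next log below; the names are made up):

      #include <stdint.h>

      #define WIDTH 16
      #define MASK  ((1u << WIDTH) - 1)

      /* Scan one semi-arc from a crossing at Y=0. On a Y=MASK crossing,
         X is complemented so the next semi-arc starts at Y=0 again.
         Returns the length and updates the crossing X in place. */
      uint64_t semi_arc(uint32_t *Xp)
      {
        uint32_t X = *Xp, Y = 0, C = 1;   /* C=1 at a Y=0 crossing: an assumption */
        uint64_t len = 0;
        do {
          len++;
          C += X + Y;
          Y = X;
          X = C & MASK;
          C >>= WIDTH;
        } while (Y != 0 && Y != MASK);
        if (Y == MASK)
          X = MASK - X;   /* "invert X" at the complementary crossing */
        *Xp = X;
        return len;
      }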

    From there the obvious consequence is simple: if all the semi-arcs can be joined into a single loop to create one orbit, then the width is maximal. We don't even need to count the lengths, though they are useful for further studies and validations. The validation of this property has other far-reaching implications that will become obvious later, little by little...

  • A more convenient definition of arcs

    Yann Guidon / YGDES, 11/28/2021 at 01:47

    The last log ends on a fundamental question about the arcs and their dual/complementary nature. I wonder if part of the difficulty comes from the initial, arbitrary decision to define an arc as a sequence that starts and ends at Y=0: the complementary condition becomes Y=MASK, so there are 2 tests.

    This could be simplified with the condition X=Y, because the diagonal is a special case: it is possible with both C=1 and C=0. The 2 exceptions are of course X=Y=C=0, and X=Y=MASK with C=1. This makes the loop stop 2× more often but still doesn't solve the problem, and I have already tested that 2 tests instead of 1 don't slow the loop down. Add to this that the condition X=Y occurs one step after the condition Y=0, so we're back to the initial definition (though the value of C is ignored, because Y=0 only happens when C=1, while X=Y can occur with either value of C).

    The real question is: what to do when an arc crosses its complementary arc, and how to detect it.

    Another, more fundamental question is: in "what direction" does the complementary orbit go, compared to its reference orbit? I supposed it was the reverse direction, because of the central symmetry. This could have helped speed things up as well, because "going backwards" is possible but slower. Oh wait, it was already covered in the log Some more theory 7 months ago. The answer is no: both orbits "go" in the same direction, so computing one orbit (and arc) also computes its complement. See the following video.

    The code runs there: http://ygdes.com/~whygee/PEAC/a2_04.html

    Do ALL arcs cross their complement first, or can they sometimes finish without crossing it?

    According to http://ygdes.com/~whygee/PEAC/a2_05.html, this is mixed.

    For w4, the ratio of blue vs green is quite dramatic and it would certainly be more balanced for larger configurations. Indeed, for w10, look at http://ygdes.com/~whygee/PEAC/a2_06.html and see the ratio getting better, at 532/493.

    In the context of a parallel, multithreaded scan, there is the possibility that an arc is started, finds a complementary crossing, and that crossing has already been claimed by another thread, so the first thread should be stopped anyway. This also makes it a bit more complex to pre-allocate a set of initial X values to a thread, to be queued in advance to prevent stalls. But it sounds possible...

  • How to double the scan rate

    Yann Guidon / YGDES, 11/26/2021 at 17:43

    As noted in 76. More efficient scan of the state space and recalled in 84. Compute better, compute more, there is a simple way to double the scan speed within a single loop when trying to cover the whole orbit. To see how, let's look at a diagram from the old log 21. Some theory:

    In the video, I used a different picture:

    The symmetry and the complementarity are apparent, which means that when one orbit is being scanned, the other is implicitly scanned as well.

    Let's say we have an orbit defined by the width W and the coordinates X, Y and C. The complementary orbit has coordinates X', Y' and C'. Because of the central symmetry, we have X + X' = 2^W - 1 and Y + C + Y' = 2^W (because C + C' = 1).

    The existing code tests for intersection of the orbit with Y=0 (which is just a delayed version of X=0):

    do {
        arclen++;
        C += X + Y;    Y = X;
        X = C & MASK;  C = C >> WIDTH;
    } while (Y != 0);

    This needs to be modified to test for 2 intersections: Y=0 and Y=MASK.

    Oh, and the Y = X part could... should be removed, with the variables renamed, thus unrolling the loop a bit.

    do {
        arclen++;
        C += X + Y;
        X = C & MASK;  C = C >> WIDTH;
    // test Y here
        arclen++;
        C += X + Y;
        Y = C & MASK;  C = C >> WIDTH;
    // test X here
    } while (1);

    So it's back to the original "add X to Y and Y to X" dance, and it's already paying off. Do you remember the parallel version with forward and backward iterations? It scanned 2147516415 steps in 1.71 seconds. The unrolled version gets close to that:

    $ gcc -g -Wall -DWIDTH=16 -Os faster_02.c -o faster &&  \
    > /usr/bin/time ./faster
    Ended odd after 2147516415 steps
    2.07user 0.00system 0:02.07elapsed 99%CPU 

    Pretty good, right? Here is the code.

      do {
        C += X + Y;
        Y = C & MASK;  C = C >> WIDTH;
        if (X==0) {
          if (Y==1) {
            arclen++;
            printf("Ended odd after %"PRIu64" steps\n", arclen);
            break;
          }
        }
    
        arclen+=2;
        C += X + Y;
        X = C & MASK;  C = C >> WIDTH;
        if (Y==0) {
          if (X==1) {
            printf("Ended even after %"PRIu64" steps\n", arclen);
            break;
          }
        }
      } while (1);
    

    The unrolled version is pretty efficient, despite the longer, redundant code.

    Compare it to the following earlier version, which ran in 2.23s:

      do {
        do {
          N++;
          C += X + Y;
          Y = X;
          X = C & MASK;
          C = C >> WIDTH;
        } while (Y != 0);
      } while (X != 1);

    A 10% speed increase is nice to have.

    Even arclen++ has been simplified, to free an adder in the pipeline: I increment by 2 on each loop and "fix" the count on the odd exit. Unfortunately there was no noticeable effect on my platform; I keep it anyway, for the comfort of "if ever...".

    But we're not there yet: despite the speedup, one half of the comparisons is still missing. Comparisons keep execution units busy, so more comparisons mean more work, but the out-of-order engine keeps them off the critical datapath, so performance should not be significantly affected.


    Good news, everyone!

    The new version of the loop core with multiple tests has no effect on its speed.

    loopstart:
        C += X + Y;
        Y = C & MASK;  C = C >> WIDTH;
        if (X==0)
          goto endofloopodd;
        if (X==MASK)
          goto endofloopodd_comp;
    
    loopeven:
        C += X + Y;
        arclen+=2;
        X = C & MASK;  C = C >> WIDTH;
        if (Y==MASK)
          goto endofloopeven_comp;
        if (Y!=0)
          goto loopstart;

    The bad news is the hit when run in parallel: the runtime increases for all running threads with each added thread, on an i7-2620M:

    $ /usr/bin/time ./faster4
    2.05user 0.00system 0:02.08elapsed 98%CPU 
    
    $ /usr/bin/time ./faster4 & /usr/bin/time ./faster4
    2.16user 0.00system 0:02.17elapsed 99%CPU
    2.16user 0.00system 0:02.19elapsed 98%CPU
    
    $  /usr/bin/time ./faster4 & /usr/bin/time ./faster4 & /usr/bin/time ./faster4
    2.17user 0.00system 0:02.17elapsed 99%CPU
    2.73user 0.00system 0:02.73elapsed 99%CPU
    2.75user 0.00system 0:02.76elapsed 99%CPU 
    
    $ /usr/bin/time ./faster4 & /usr/bin/time ./faster4 & /usr/bin/time ./faster4 & /usr/bin/time ./faster4
    2.91user 0.00system 0:02.92elapsed 99%CPU
    2.91user 0.00system 0:02.93elapsed 99%CPU
    ...

  • Compute better, compute more

    Yann Guidon / YGDES, 11/24/2021 at 07:47

    There are 2 main, complementary methods to reduce the time needed to scan a whole statespace:

    1. compute more arcs in parallel
    2. test for 2 conditions to detect complementary/symmetrical orbits.

    The first method yields the best speedup, relying on multi-processor parallelism. So far I can use POSIX threads, but I also plan to use CUDA on an RTX 3070. Before I can set up that software chain, I am also exploring SIMD: AVX promises 256-bit packed integers that GCC seems to handle mostly fine. https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html looks "reasonable", except that I need a bit more than the basic operations: the carry wrap-around and the conditional jumps on comparisons will require some intrinsics or even assembly...

    SIMD code is quite important because this is inherently how GPGPUs work. Developing the AVX version will help make a better CUDA implementation.
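
    To give an idea, here is a sketch of the core step with GCC's vector extensions (the 8-lane shape and the names are assumptions, not the project's actual SIMD code):

      #include <stdint.h>

      #define WIDTH 16
      #define MASK  0xFFFFu

      typedef uint32_t v8u32 __attribute__((vector_size(32)));  /* 8×32 bits: AVX2 */

      /* One PEAC step on 8 independent arcs: all operations are
         element-wise, each lane keeps its own end-around carry. */
      static inline void peac_step8(v8u32 *X, v8u32 *Y, v8u32 *C)
      {
        v8u32 t = *C + *X + *Y;
        *Y = *X;          /* the delayed register */
        *X = t & MASK;    /* low 16 bits of each lane */
        *C = t >> WIDTH;  /* carry, fed back on the next step */
      }

    The step itself vectorises cleanly; it is the per-lane crossing tests (Y=0 or Y=MASK) that need a movemask-style intrinsic or a scalar peek at the lanes.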

    The second method was introduced in 76. More efficient scan of the state space and only brings a speedup of (almost) 2, but even that is significant: what if a computation ran 3 months instead of 5? I still have to get the details right though, and this is the priority, because the result will be the basis of the parallel versions.

  • A practical use of PEAC : packet transmission

    Yann Guidon / YGDES, 10/30/2021 at 02:07

    One very delicate subject is how to send data and commands over radio or wire, for example.

    I know this subject has been rehashed for decades now, and "rolling codes" are the norm, or at least should be.

    Usually, a data packet is sent over a serial link with a fixed structure, in this order:

    1. preamble
    2. a sequence number
    3. data packet type, or other meta-information
    4. size field, if the payload is variable
    5. payload
    6. CRC

    Often, a scrambler is applied to the digital signal to spread its spectrum, that is: to avoid long runs of constant states.

    Each field has a role. Now, let's totally reconsider them and simplify all this!

    1. The preamble is optional but, for Ethernet for example, serves to
      1. synchronise the PLL
      2. detect the real beginning of the packet

      In this log, we will consider this framing to be external, we simply deal with a number of bytes that have been handled by the appropriate circuit (might be a simple serial port for example).

    2. The sequence number is usually sequential, thus predictable and monotonic, and there the PEAC generator is ideal.
    3. Meta-information is application dependent and not covered
    4. Payload size : depends on the framing system.
    5. Payload ...
    6. The CRC field, or checksum: can be replaced by the PEAC checksum.

    But wait!

    Let's not forget the scrambler: this is another layer that could use the PEAC scrambler.

    But wait again!

    Why use both a PEAC checksum and a PEAC scrambler? This is redundant! (at least in certain cases)

    Now let's consider the case of digital radio packets, like your door opener's. The airwaves contain all sorts of noise and other transmissions, which creates all kinds of trouble for data integrity, authenticity, and many other potential problems. It becomes a combinatorial problem, and the receiver must reject any malformed packet, yet with minimum complexity. The protocol must not allow packets to be easily forged, not just because of possible intrusions, but also so that trivial technical problems (reflections? repeats? slight alterations?) become easier to spot.

    From there, a different approach is possible, which differs from the classic one: the scrambler becomes integrated in the process and also serves as the checksum. This changes the packet structure a bit:

    1. pseudo-random sequence number : ensures that the rest of the packet has more randomness from packet to packet
    2. payload (fixed size determined by explicit or implicit framing)
    3. constant field : with a size proportional to the desired collision probability

    And... that's it! Well, almost, because I can't write a point 0: the scrambler is initialised by a constant that also depends on the packet type. Or channel. Or address... Or whatever you like: change the init value and you get a totally different "world"!

    In a radio system with several receivers, those that don't have the appropriate init value will reject the messages. Broadcast messages can be managed at the framing level, with a different size for example, which implies a different init value.

    The pseudo-random sequence number helps because it gets caught in the scrambler and ensures that the rest of the packet gets mixed, even if the data are similar. The PRSN is mixed with a constant init value; since it is the first data to be scrambled, it is almost as if it were not mixed, and it could be precomputed if needed.

    The constant field is a "known value" that replaces the CRC/checksum field. This constant gets scrambled with the rest and will catch any error propagated by the descrambler. It could even act as an address field as well, because any mismatch causes rejection! Its size can be as large as the address and checksum fields of a classic packet combined, and it can be extended at will to reduce the collision probability, without increasing the size of the scrambler!
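
    To make this concrete, here is a minimal sketch of such a scrambler (the additive output, the injection point and the init convention are assumptions; snippets_v3.c may do it differently):

      #include <stdint.h>

      #define WIDTH 16
      #define MASK  0xFFFFu

      typedef struct { uint32_t X, Y, C; } peac_t;

      /* iv selects the "world": packet type, channel, address... */
      void peac_init(peac_t *s, uint16_t iv) { s->X = iv; s->Y = 0; s->C = 0; }

      static void peac_step(peac_t *s, uint16_t w)   /* absorb one word */
      {
        uint32_t t = s->C + s->X + s->Y;
        s->Y = s->X + w;
        s->X = t & MASK;
        s->C = t >> WIDTH;
      }

      /* The output depends on the state, then the plaintext feeds the
         state, so any corruption propagates to the end of the packet. */
      uint16_t scramble(peac_t *s, uint16_t w)
      { uint16_t o = (uint16_t)((w + s->X) & MASK); peac_step(s, w); return o; }

      uint16_t descramble(peac_t *s, uint16_t o)
      { uint16_t w = (uint16_t)((o - s->X) & MASK); peac_step(s, w); return w; }

    The receiver simply descrambles the whole packet and rejects it if the trailing constant field doesn't match the expected value.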

    Any new message will seem to vary wildly from the previous one, even if very similar data are transmitted. The "init value" might even look like a key, but the system is not even remotely cryptographic: it can only fool a superficial...



Discussions

Yann Guidon / YGDES wrote 10/23/2021 at 21:33

Interesting!

https://research.nccgroup.com/2021/10/15/cracking-random-number-generators-using-machine-learning-part-1-xorshift128/

https://research.nccgroup.com/2021/10/18/cracking-random-number-generators-using-machine-learning-part-2-mersenne-twister/



[this comment has been deleted]

Yann Guidon / YGDES wrote 10/13/2021 at 14:39

so far, half a year and 50£.


Yann Guidon / YGDES wrote 09/14/2021 at 04:16

George Marsaglia and Arif Zaman got quite a lot figured out in 1991:
https://projecteuclid.org/download/pdf_1/euclid.aoap/1177005878

Why didn't this catch on? They focused on extremely large sequences with tens of lagged registers, and didn't look closely enough at just a pair of registers. They also didn't make the link with checksums.


Yann Guidon / YGDES wrote 09/08/2021 at 07:28

Funny!

https://en.wikipedia.org/wiki/Lagged_Fibonacci_generator

Toward a universal random number generator, G.Marsaglia, A.Zaman


Yann Guidon / YGDES wrote 08/05/2021 at 23:24

https://www.researchgate.net/profile/Michael_Greenwald/publication/3334567_Performance_of_checksums_and_CRCs_over_real_data/links/53f4b15f0cf2888a749113c5/Performance-of-checksums-and-CRCs-over-real-data.pdf

"In practice, TCP trailer sums outperform even Fletcher header sums."

Ouch.


Yann Guidon / YGDES wrote 08/08/2021 at 00:51

"Ones complement (with EAC) Fletcher, however, has a weakness that sometimes offsets its probabilistic advantage: since bytes containing either 255 or 0 are considered identical by the checksum, certain common pathological cases cause total failure of Fletcher-255."

Re-ouch.


Yann Guidon / YGDES wrote 08/05/2021 at 23:09

http://users-tima.imag.fr/cdsi/guyot/Cours/Oparithm/english/Modulo.htm and others say that special adders exist to compute mod 2^n-1.

But this is not necessary here, because the carry is explicitly reintegrated in the next cycle.


Yann Guidon / YGDES wrote 08/05/2021 at 22:20

https://www.cs.auckland.ac.nz/compsci314s1t/resources/Checksums.pdf is a nice beginner-level overview of error-detection basics. Polynomial checks are mentioned, as well as ones'-complement / end-around carry; Fletcher is well explained, but there is no sight of Fibonacci.


Yann Guidon / YGDES wrote 05/24/2021 at 03:52

Oh, nice!

https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html#Integer-Overflow-Builtins

