
PEAC: Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Another purely additive, non-cryptographic, fast and minimalistic checksum.

Aimed at being a 32-bit checksum for very low resources,
- very small, using only XOR and ADD,
- without the flaws of Adler32 and Fletcher, yet the same size and complexity,
- almost as good as CRC32, which it may replace when the memory footprint is critical,
- very easy to implement in digital circuits or microcontrollers with 16-bit accumulators,
- suitable for the detection of trivial corruption in network packets or files.

Disclaimer: This is a linear, reversible checksum so it's not appropriate for cryptography or hash tables. This is not even the fastest checksum ever.
But it's still quite good.

The basic application uses 2×16 bits. 2×32 bits and larger are possible with the non-masking algorithm.

If you wonder why, just click and read the log.

The original algo/project name was "Fibonacci checksum" but it later appeared that this was not the most accurate choice, because Leonardo Pisano (the real name of Fibonacci) is the name associated with the periodicity of the sequence under a modulus (the "Pisano period"). All I did was add the tiny detail of the "end-around carry", or "1's-complement addition", which changed everything.

 
-o-O-0-O-o-
 

In 2012, Piano Magic released their album "Life Has Not Finished with Me Yet". One song contains a weird repeating pattern...

Glen Johnson's lyrics are often cryptic and more evocative than objective, but any geek's mind would latch onto this mantra at the end:

"Add X to Y and Y to X"

This is really weird but... Why? What's the point of this obscure "song" with no precise theme or clear subject? And what does it do? The last question is the most easily answered: just follow the damned algorithm.

Here we go...

X=1, Y=0
Y+=X => 0+1=1
X+=Y => 1+1=2
Y+=X => 1+2=3
X+=Y => 2+3=5
Y+=X => 3+5=8
X+=Y => 5+8=13
Y+=X => 8+13=21
X+=Y => 13+21=34
Y+=X => 21+34=55
...

No need to go further, most of you should have recognised http://oeis.org/A000045, the famous Fibonacci sequence.
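If you want to see the mantra run, here is a throwaway C sketch (mine, not from the song) that reproduces the trace above:

#include <stdio.h>

int main(void)
{
    unsigned X = 1, Y = 0;
    for (int i = 0; i < 5; i++) {
        Y += X; printf("%u ", Y);  /* add X to Y */
        X += Y; printf("%u ", X);  /* and Y to X */
    }
    return 0;  /* prints 1 2 3 5 8 13 21 34 55 89 */
}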

This gave me a compelling idea: modify the old Fletcher & Adler algorithms while keeping their very low footprint and minimalistic computational complexity. Both of these well-known algos use a pair of values and have a similar structure. The trick is that rearranging the data dependency graph provides the equivalent of a minimalistic polynomial checksum, because the result is fed back on itself in a more robust way than in Fletcher's algo.

At first glance, this new checksum's loop body becomes something like:

Y += ( X ^ datum[i  ] );
X += ( Y ^ datum[i+1] );
i+=2;

This loop body is totally trivial to unroll, and the corresponding digital circuit is just as simple to design. This early version seemed to concentrate the whole checksum entropy in the last computed value, but now X and Y are both part of the signature. And the really important enhancement is the use of the carry!
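(As for the unrolling, it really is mechanical; a sketch, with alignment and the odd trailing byte left out:)

Y += ( X ^ datum[i  ] );
X += ( Y ^ datum[i+1] );
Y += ( X ^ datum[i+2] );
X += ( Y ^ datum[i+3] );
i+=4;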

For superscalar 32-bit CPUs, the following code seems to work well, though the missing carry hardware (and/or lack of language support) requires more instructions to emulate it:

t = X + Y + C;    Y = X ^ data;
C = t >> 16;      X = t & 0xFFFF;

A more efficient variation exists that does not use a temporary variable:

C += X + Y;
Y = X ^ data;
X = C & MASK;
C = C >> WIDTH;

In this worst case, without support for a carry flag, that's 5 basic operations (not counting memory loads) that fit in 4 registers and 3 cycles, to process 2 bytes. Not too bad. I'll let you deal with alignment. But is it really safe, or safer?
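To make the above concrete, here is how the pieces could assemble into a complete routine. To be clear, this is only my sketch of the loop above, not the project's reference code: the function name, the starting values and the packing of X and Y into the result are assumptions.

#include <stdint.h>
#include <stddef.h>

/* Sketch only: checksums a buffer of 16-bit words with the carry-based
   loop shown above. Initial values and output packing are assumptions. */
uint32_t peac16_sketch(const uint16_t *data, size_t len)
{
    uint32_t C = 0;          /* holds the sum, then the end-around carry */
    uint32_t X = 1, Y = 0;   /* same starting point as the Fibonacci trace */
    for (size_t i = 0; i < len; i++) {
        C += X + Y;          /* X + Y + pending carry */
        Y  = X ^ data[i];    /* mix in the next word */
        X  = C & 0xFFFF;     /* the low 16 bits become the new X */
        C  = C >> 16;        /* keep the carry for the next iteration */
    }
    return (Y << 16) | X;    /* X and Y are both part of the signature */
}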

The following logs will show how the idea evolves and the performance increases, through discussions about carry wrap-around, register widths, scheduling, avalanche, parallelism, orbits, structure, and escaping black holes...

 
-o-O-0-O-o-
 

Logs:
1. The variable
2. Datapath
3. Adler32 weakness
4. Easy circuit
5. Periodicity
6. A promising 32-bit checksum
7. Progress
8. Why
9. Orbital mechanics
10. Orbits, trajectories and that damned carry bit
11. Two hairy orbits
12. How to deal with black holes
13. Anteriority
14. Moonlighting as a PRNG
15. A new name
16. More orbits !
17. First image
18. Structure and extrapolations
19. Even more orbits !
20. Start and stop
21. Some theory
22. Some more theory
23. Even more theory.
24. A little enhancement
25. sdrawkcab gnioG
26. Further theorising
27. What is it ?
28. A bigger enhancement
29. Going both ways
30. Can you add states ?
31. Please, GCC !
32. _|_||_||_|______|______|____|_|___|_________|||_|__
33. Carry on with GCC
34. A good compromise
35. The new non-masking algorithm
36. Add or XOR ?
37. A 32/64-bit version
38. Closing the trajectories
39. Stats about missed crossings
40. Application : Displace LFSR in BIST
41. I need a big cluster.
42. Another theoretical rambling
43. An individual job and all the necessary processing
44. Test...


Files:

dispatch_02.c: enabling computation of arcs in w32 (x-csrc, 2.42 kB, 07/18/2021 at 02:15)

PEAC_core.tgz: first test of a VHDL version (x-compressed-tar, 1.54 kB, 07/13/2021 at 02:41)

job_logs.tgz: log files for w3 up to w20 (x-compressed-tar, 18.09 MB, 07/09/2021 at 23:49)

gold_01.c: first working scanner to explore the behaviour of the "jobs" (serial version) (x-csrc, 1.47 kB, 07/09/2021 at 22:14)

test_w26.c: compare the original (bit-limited) algo with the new (non-masking) algo (x-csrc, 1.29 kB, 06/18/2021 at 00:42)

(24 files in total are attached to the project.)

  • Think Tetris

    Yann Guidon / YGDES, 4 hours ago

    For this system I have used and abused several allegories to guide the design: token ring, trains on a track, JTAG... And now I'm adding Tetris (with only one column) and 1D cellular automata, because the system reminds me of the "sandpile" 2D CA I used to play with 15 years ago.

    The transition rule is probably the most basic you can find:

    if node[n] has data and node[n+1] is empty, then
       node[n+1] copies node[n], then node[n] is cleared

    In other words, huh, pixels (animation not shown):

    When applied to a whole row of nodes, the final effect is that all the grey/occupied nodes will shift to the right, toward the reading end of the chain.
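    Here is a tiny C model of that rule (my own sketch, not part of the project's code). The old state is snapshotted first, so all nodes appear to update simultaneously, like a proper synchronous CA:

    #include <string.h>

    #define NODES 16

    /* One synchronous step: an occupied node whose right neighbour
       was empty moves its content one position to the right. */
    static void step(int node[NODES])
    {
        int old[NODES];
        memcpy(old, node, sizeof old);
        for (int n = 0; n < NODES - 1; n++) {
            if (old[n] != 0 && old[n + 1] == 0) {
                node[n + 1] = old[n];  /* copy forward... */
                node[n] = 0;           /* ...then clear */
            }
        }
    }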

    Now, the insertion of new data adds a bit of complexity to this bare system but it's still manageable. The insertion part can be seen as added states to the CA and the trick is to modify them adequately. The nodes' state must also consider the need to send a message when a job is finished.

    (tbc)

  • Ring de-buffering

    Yann Guidon / YGDES, a day ago

    The last log, The train of thoughts and the red tail light, introduced the principles for adapting a "token ring" idea to a JTAG-like chain. At the end, a major issue was identified: the host is slower than the nodes, which risk overflowing the output buffers. The protocol must be solid and needs to consider all corner cases...

    And then I realised it was ridiculous to waste memory (that will always be too small or too large) to have a copy of all the states. No FIFO is required because the chain is the (variable length) FIFO itself: it must be flushed by the host anyway, at its own pace (10MHz is fine).

    So all the nodes and the intercom are still synchronous to the computations, but the chain may only "shift" if the host requests to read one word. This is where it gets fun...

    First, there must be a backwards signal from each node to the previous one to enable shifting by one position. This means that the "ready" signal will ripple through the chain in the reverse direction. Several such signals may be present in the chain at the same time, because the ripple takes at least 2µs to propagate through the whole chain (for 1000 nodes, for example).

    Second, the "ready" signals may stall when/where a node wants to insert data in the chain. There is no need to check that the X previous nodes are not intending to speak at the same time: if one node wants to speak it must wait for the condition of a "empty" type word in its own buffer, and the presence of the "ready signal". That's it. Of course the node must send contiguous words because otherwise its message might be "interrupted".

    Commands (write packets) are inserted in a special buffer at the start of the chain, which is shifted every time the host reads the other end.

    To ease the management, the status of all the nodes can be condensed into a word (or a few) that counts how many nodes are waiting to send data. The host must read many, many empty words before one useful message appears at the end. To reduce the latency, the host should "poll" the chain by reading as often as possible, letting data move towards the output before the host even knows something has arrived. Smaller subchains reduce the latency/wait. Any node that remains idle is wasted...

    Maybe the "ready token" can be automatically generated by the end of the chain when it receives an empty word, allowing the chain to move forward by itself and reduce the latency.

    I should draw diagrams and think further, but I am glad this roadblock is getting solved appropriately.

  • The train of thoughts and the red tail light

    Yann Guidon / YGDES, 2 days ago

    You might have guessed already that this post is not directly related to the main subject of this project but I'm not feeling (yet) the need to spin it off to a new separate project. So here it is: I think I have solved the problem with the "token ring".

    A "ring" has been chosen for many reasons:

    • it is directly scalable, just plug more units and they will deal with addressing themselves.
    • a ring is probably the most routable topology, giving the most freedom to the P&R software of the FPGA. With manual placement, a Hamiltonian path can be built.
    • There might be 2 separate rings because each is synchronous to the node's clock, and some nodes might run faster with the use of the Math block.
    • Latency is not critical in this application so having 1000 nodes in series does not change much. At a supposed speed of 400MHz, 1000 nodes are scanned at 400KHz, or every 2.5µs. Good enough.
    • I need the fewest wires, gates and control signals. The system must be really lightweight, because a gate used for the interconnect is a gate wasted for raw computation. Saving one gate will allow one more compute node to be instantiated.
    • The same wires will transport the commands from the host controller, and the results of the jobs.

    Now, if the basic idea of a ring bus or a token ring interconnect sounds simple and easy, the reality is much less so. In particular, classic ring networks use "slots" of fixed sizes, so a data packet hops on the ring when one slot is free. For our jobs this does not work as well, because there are 2 sizes of packets:

    • The "write" packets that load a new value for computation on a given node:
      • w bits of address (plus a command word for extra features if needed, such as reset or force read),
      • w bits of value.
    • The "read" packets are sent by the node when the job succeeded:
      • w bits of address (plus some extra data if needed)
      • w bits for the resulting X value
      • 2w bits for the cycle counter

    Let's say we have a datapath of w bits: then there are already 6 types of packets, plus the empty type of packet. The ring is also synchronous to the main compute clock to keep things simple. And to keep up with 400MHz, the packets HAVE to be simple.
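    To fix the idea, the 6 + 1 packet types could be tagged like this (the names are mine, purely illustrative):

    /* Hypothetical tags for the 7 packet types described above. */
    enum packet_type {
        PKT_EMPTY,       /* the idle filler */
        PKT_WRITE_ADDR,  /* "write": w bits of address (+ command) */
        PKT_WRITE_VAL,   /* "write": w bits of value */
        PKT_READ_ADDR,   /* "read": w bits of address (+ extra data) */
        PKT_READ_X,      /* "read": w bits of resulting X */
        PKT_READ_N_LO,   /* "read": low half of the cycle counter */
        PKT_READ_N_HI    /* "read": high half of the cycle counter */
    };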

    The challenge is to insert the words into the shift register of the ring without any risk of creating collisions, because a sender would have to look up to 4 units upstream to see if they request to send data. This is not desirable and has left me scratching my head for a long while...

    The first inspiration was the token ring system but this couldn't work as is. The idea sounds good however and the trick is to adapt how the "token" is managed, to make sure there are no collisions. Anyway the principle remains the same: whichever node "has" the token can insert any stream of data on the ring.

    The ring is in fact not a ring at all: it is open at both ends (a bit like a JTAG scan chain):

    • on one end, the host enqueues "write" commands,
    • on the other end, the host reads a stream of "read" packets.

    so the classic "token ring" system must be adapted. The behaviour, nature and algorithm of the token is changed a bit...

    • The host enqueues requests (2 words) on the "ring" (going through a RAM buffer because messages must have consecutive words and the host might be slow)
    • The host terminates the string of requests with a "token" packet, then fills the "ring" with empty packets.
       
    • Each node looks for an "address write" packet and compares the address with its own (constant) position.
    • If a match is found, the cycle counter is cleared, Y is cleared, Xinit is written from the ring and the job can start.
       
    • When the job is over, the node stops and waits for the appearance of the "token" type packet.
    • When encountered, this token is held in place while the node serialises (MUX4...) the 4 words of the packet.
    • When the 4 words are sent, the token is released, so more nodes downstream can insert their data after this node's.
       
    • At the end of the chain, the write packets are omitted and the read packets are stored in a buffer/RAM so the host can read the results later.
    • When the end receives...

  • The PolarFire way

    Yann Guidon / YGDES, 2 days ago

    Digging further into the MPF300, I get a better taste of its potential as a number cruncher. Summary: it tastes good, with an estimated run time for w26 of about 5 hours. That's a direct speedup of about 200 and I'm not even pushing it much.

    The key to this is the 924 "Math Blocks" that can run at about 500MHz, providing more than 400 billion additions per second. Not bad for a $500 board: that's about 1 Gadd/s/$. I found an eBay auction that was even sweeter. It would take much more work to reach that level with a cluster of RPis, after the whole HW and SW are operational AND the GPUs are harnessed.

    The 924 "Math Blocks" are described in some detail in PolarFire_FPGA_Fabric_UG0680_V7.pdf (PDF) and some very interesting details make them attractive, for example by including the necessary DFF buffers, the carry signals and an ADD width of 48 bits. This means that these blocks are appropriate for both the Add/accumulator datapath AND the cycle incrementer. What is missing is the zero detection, which normally requires the use of external logic, probably going through embedded carry logic of the LUT. This part might decrease the speed.

    Speed matters, and the delay of the Math block is certainly different from that of the native LUT logic, which is still plentiful. When using the Math block, the DFFs and Add logic are already implemented; some glue is still required for read/write, but this does not fill the 300K LUT4s, the rest of which can also implement a slightly slower, discrete version. That could bring the total number of parallel units into the 1500 range, eventually 2000 for small w. This provides a significant speedup with reasonable effort...

    The big roadblock for now is how to control, write and read all the registers. LUT4s are not practical for implementing wide MUXes with low latency, and it seems that the RAM blocks of the PolarFire family don't have a hard FIFO control block...

  • Multiple ways, and rings

    Yann Guidon / YGDES, 3 days ago

    As I try to re-validate w26 and look at the unreachable w32, I have identified several methods to get more computations done per second, including a cluster of RPis that recently led to the creation of a new project/fork: #Clunky McCluster...

    I also explore the possibilities offered by the PolarFire FPGA I recently got. I am stuck at the internal network level and am considering a weird hierarchical token ring...


    Edit:

    So far, I favour 2 approaches: the RPi cluster, and FPGA acceleration. CUDA running on GPGPUs rented in the cloud from AWS or Google could have great potential for huge calculations, but at an unknown price, which increases with the time used, which in turn increases with development on a platform I can't really "get my greasy hands on". And the clock is ticking while credits evaporate...

    OTOH: I already own a great PolarFire board and a collection of RPi boards of various mileages. I can develop at my own pace and reuse the hardware I already own, as well as the skills, for other future projects, when I want, how I want.

    I don't know yet which one I will choose: each has its own strengths, but I have a tendency to spread out and dissipate the focus/efforts.

    The FPGA/PolarFire route has a direct speedup factor of around 1000, as I think I can fit 1000, eventually 2000, computing circuits on this large chip. The constraints are pretty clear and there are a few well-understood major challenges to overcome. But once they are solved, that's it: it will stream results. The only real issue I see is scalability: it's a fixed-size solution and I'd have to buy more boards to get faster results.

    The RPi side can evolve more easily: you can extend a cluster, swap in faster boards with more cores, progressively tune the software and eventually exploit the GPU. This is more flexible but the initial payout is low. The speedup would start at 5 or 10, with the GPU eventually bringing another 10× or so. And if I succeed in using the Pi's GPU, I can then port the system to CUDA and get that sweet 10K speedup one day...

    All these ways share a common framework and algorithms for managing and reconstructing the orbits, so this is another branch to work on, on POSIX for now.

    But for now I have not chosen. On one hand, I have started the spin-off project #Clunky McCluster; OTOH, I keep searching for an appropriate intra-chip network topology that uses the fewest gates.

  • Just a test

    Yann Guidon / YGDES, 7 days ago

    While trying to estimate the size of the complete set of arc data for w32, I wrote dispatch_02.c and let it run for a minute or so...

    $ gcc -Wall -DWIDTH=32 -Os dispatch_02.c -o dispatch && ./dispatch
    Run on W32 from 42 to 1500
       42 -> X=2656939159   N=5996278162 l=5
       43 -> X=2529572078   N=1751451951 l=5
       44 -> X=4217682635   N=9651841000 l=5
       45 -> X= 528636580   N=9545525216 l=5
       46 -> X= 626411012   N=855984539 l=5
       47 -> X=1331873985   N=1897705442 l=5
       48 -> X=3644464132   N=2553832670 l=5
       49 -> X= 955492511   N=2834782532 l=5
       50 -> X=2905244640   N=10158815173 l=5
       51 -> X=3977561810   N=570944009 l=5
       52 -> X=1285427445   N=1916254633 l=5
       53 -> X=2722001243   N=8057574457 l=5
       54 -> X=1572286052   N=2556104354 l=5
       55 -> X=1183375166   N=3596834417 l=5
       56 -> X=1423688005   N=2467949744 l=5
       57 -> X=1992353645   N=1488849858 l=5
       58 -> X= 208628631   N=3750365940 l=5
       59 -> X=3057416295   N=269839798 l=5
       60 -> X=3795627370   N=5996278162 l=5
       61 -> X=2510464790   N=251891606 l=4
       62 -> X=1770417005   N=8459247511 l=5
       63 -> X= 740091212   N=9545525216 l=5
       64 -> X=3908705066   N=691897578 l=5
       65 -> X=2100730390   N=3511641358 l=5
       66 -> X=4175190107   N=5996278162 l=5
       67 -> X=3476866509   N=1904497197 l=5
       68 -> X=1024457701   N=3162422531 l=5
       69 -> X= 939616518   N=855984539 l=5
       70 -> X=1593588803   N=5245804944 l=5
       71 -> X=  21251938   N=1049143365 l=5
       72 -> X=  53655981   N=5873919413 l=5
    ^C
    

    Using a MIDI-like "variable length code", I see an average of 5 bytes to encode the length of an arc.

    This means that 4Gi arcs × (4+5 bytes) ≈ 40GB, not including duplicates for verification. The file(s) may be streamed, but the tabulation will require more than 64GB of RAM.

  • Protocol and format

    Yann Guidon / YGDES, 7 days ago

    As the need for distributed computation increases, as well as the variety of platforms, I have to make sure that all the systems (dispatchers, calculators and tabulators) can communicate efficiently and reliably.

    For w32 I expect that the whole collection of arcs will require tens of gigabytes of storage (and transmission) so the format of the exchange files must be both compact and convenient. I do not expect a particular size for the files, or the duration of computation, because a range of coordinates can vary wildly depending on the available time and computational power.

    Each file describes a collection of contiguous arcs, from a start to an end index (inclusive), so each arc's origin is implicit (there is no need to include it in the file). For convenience, the start and end are given in zero-padded hexadecimal in the file name, separated by a hyphen. For security and traceability (when the range is computed by a third party), a non-guessable key/ID can be added. And the file name extension is the width. Like, for w32:

    00001234-00005678.Sd5KbIqbrw7.w32
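    Building such a name is a one-liner; a sketch, with the values taken from the example above:

    #include <stdio.h>

    int main(void)
    {
        char name[64];
        /* range start/end, non-guessable key, width */
        snprintf(name, sizeof name, "%08x-%08x.%s.w%d",
                 0x1234u, 0x5678u, "Sd5KbIqbrw7", 32);
        puts(name);  /* prints 00001234-00005678.Sd5KbIqbrw7.w32 */
        return 0;
    }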

    The format is expected to be good up to w32 so the file will contain "records"/structs with one uint32_t for the end coordinate of each arc. No need to overthink this because the permutations of the arcs are really unpredictable.

    OTOH, as shown in the log 44. Test jobs, the arc lengths have a characteristic distribution where 1/3 of the arcs have a length longer than the total number of arcs, so the lengths will not fit in a uint32_t slot. There is no chance they will fill a uint64_t slot either, because that is the total length of all the arcs combined. As for 48 bits: there is no theory yet that can predict the maximum arc length for a given width. A primitive variable-length code would work nicely (1 bit for continuation, 7 bits of mantissa) but the records would not be easy to tabulate unless there is an encoder and decoder program. Or maybe we could simply select the size "per block". Fine...
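    The code itself is tiny; here is my sketch of such a pair (I assume the least significant 7-bit group comes first, whereas MIDI sends the most significant group first):

    #include <stdint.h>

    /* Encode v as 7-bit groups, LSB first; the top bit of each byte
       flags a continuation. Returns the number of bytes written. */
    static int vlc_encode(uint64_t v, uint8_t *out)
    {
        int n = 0;
        do {
            uint8_t b = v & 0x7F;
            v >>= 7;
            out[n++] = b | (v ? 0x80 : 0x00);
        } while (v != 0);
        return n;
    }

    /* Decode one value; *len reports the number of bytes consumed. */
    static uint64_t vlc_decode(const uint8_t *in, int *len)
    {
        uint64_t v = 0;
        int shift = 0, n = 0;
        uint8_t b;
        do {
            b = in[n++];
            v |= (uint64_t)(b & 0x7F) << shift;
            shift += 7;
        } while (b & 0x80);
        *len = n;
        return v;
    }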

    So I have to write the translators, most probably as a C library. The original idea used a fixed-width structure supplemented with bzip2 compression, but some binary format is required anyway.

    Let's not forget to add a checksum to the block as well ;-)

    __________________________________________________________

    Anyway, the above format is for transmission and storage. During computation, the block is held in RAM as fully expanded structs.

    There could be several levels of dispatchers, with the root-level one

    • handling all the storage
    • talking with the contributing machines
    • allocating the ranges according to the relative speeds of the contributors

    and the leaf-level dispatchers that handle the jobs directly, depending on the target hardware (POSIX multi-core or CUDA or FPGA or ...).

    The leaf-level dispatcher(s) will receive a job ID with a range, corresponding to the available memory size, computational performance, expected run time... and because each job has an unknown/random run time, the results will be gathered more or less "out of order". When a given number of consecutive arcs are completed (let's say, 16K? or 1 hour? whichever comes first?), these results are compacted and written to a file and/or sent over the network. So it works with a sort of "sliding window" of results.

    Now the communication between the leaf-level dispatcher and the individual jobs needs to be defined as well. Of course we still assume a constant width. This has already been covered with the logs 43. An individual job and all the necessary processing and 45. A job in VHDL.

    • Input: Xinit
    • Output: Xout, N, and probably also Xinit, so the dispatcher knows where to put the result (see the sketch below).
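    As a C struct, one such record might look like this (field names invented for illustration):

    #include <stdint.h>

    /* Hypothetical result record for one job, for widths up to 32:
       Xinit is echoed back so out-of-order results can be placed. */
    struct job_result {
        uint32_t Xinit;  /* input: where the arc starts */
        uint32_t Xout;   /* output: where the arc lands */
        uint64_t N;      /* output: cycle counter (2w bits) */
    };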

    The out-of-order results make the leaf-level dispatcher system less trivial, though it's nothing to fear for a POSIX version. This creates other kinds of issues in HW because data (Xinit) might be duplicated and waste resources...

  • The problem with SIMD

    Yann Guidon / YGDES, 07/17/2021 at 17:34

    I am still evaluating the options available to re-run w26 and hopefully run w32 in a reasonable time. So far, I am looking at these methods:

    • BOINC-style distributed computation
    • FPGA
    • Cluster of Raspberry Pi
    • HPC-style GPU in CUDA

    The last one would provide the most speedup (in the 10K× range today) but it relies on a very wide SIMD programming model, like 32 lanes of 32 bits. Branches, conditions and the like create all kinds of disruptions and the performance drops (see https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming) for example). An RTX 3090 could perform 10496 integer ADD32 operations at 1.7GHz (about 17000 Giga-adds per second), but detecting the 0, then branching and processing the result, stalls ALL of the 31 other lanes.

    The same issue hits the VideoCore GPU of the Raspberry Pi SBCs. The Pi3+ has 3 available clusters of 4 pipelines that can each compute 16 ADD32 at 400MHz: that's 76G add32 per second. But this GPU is designed for graphics computations, and any disruption (tests, branches, extra processing) reduces the real throughput by an order of magnitude.

    I could enlist some laptops, but they are quite demanding (power, noise, room, OS/HDD) and Intel's hyperthreading reduces the performance of individual tasks. OTOH, the Pi3B+ has 4 independent ARM cores at 1.4GHz, which I might overclock a bit (with the help of a fan). And these are "normal" cores with no sequencing/scheduling/threading constraints, so even a little cluster would be cheap, efficient, unintrusive, quiet and repurposable. Because if I have to invest in HW, I would like to use it later for other, unrelated projects. Having a cluster of 4 or 6 Pi3B+ would let me start with a POSIX implementation and then, when it is operational, explore how to accelerate it with the GPU (if possible at all).

    Which makes me also think about implementing a TCP/IP dispatcher soon.

  • A job in VHDL

    Yann Guidon / YGDES, 07/13/2021 at 02:42

    I am still evaluating the options for a massively parallel scanner, described in the previous logs. In case I can program a high-density FPGA, I created a VHDL version of one "core", available in PEAC_core.tgz with a GHDL script.

    I provided a simple wrapper that iterates over the arcs, just like the C version, but slower because of the simulation. The point is validation, and I spotted a single stupid error in the core. But that was only the easy part. The challenge is to connect hundreds or thousands of these small counters without wasting resources or latency.

    If I fit 1K cores on an FPGA, the communication bandwidth will be crazy. I might need memory buffers, and probably local schedulers that would manage their own sub-groups of cores. After all, each core is simply a counter, initialised by an external resource, that increments every time it finishes a job, until a maximum value is reached. It's also a matter of fanout: data must enter and exit (while keeping synchronisation) and FPGAs don't like large buses. I might create a sort of "ring bus", a parallel token ring, probably byte-wide to keep the resources low, while cores are clustered and their results are put in FIFOs...

    Simulating this in GHDL will be fun!

  • Test jobs

    Yann Guidon / YGDES, 07/09/2021 at 22:19

    Following the previous log, An individual job and all the necessary processing, I wrote the program gold_01.c to check the behaviour before increasing parallelism.

    The first thing it did was confirm that the result C=1 occurs only when the result X=0, so we can skip storing this data point.
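    For reference, here is my reconstruction in C of what one "job" computes (gold_01.c is the authoritative version; the stop condition and initial values below are deduced from the outputs, so treat this as a sketch):

    #include <stdint.h>

    #define WIDTH 3                        /* 3 for the example below */
    #define MASK  ((1ULL << WIDTH) - 1)

    /* Follow one arc: start at (X=Xinit, Y=0, C=0) and iterate the PEAC
       step until Y returns to 0, then report the landing X and the step
       count N. Reproduces e.g. "1 6 6" and "2 0 11" for w=3.
       (Xinit=0 with no carry is the degenerate fixed point, not scanned.) */
    static uint64_t run_arc(uint64_t Xinit, uint64_t *Xout)
    {
        uint64_t X = Xinit, Y = 0, C = 0, N = 0;
        do {
            C += X + Y;      /* X + Y + previous carry */
            Y  = X;          /* no data word to mix in during a scan */
            X  = C & MASK;
            C >>= WIDTH;     /* the end-around carry */
            N++;
        } while (Y != 0);
        *Xout = X;
        return N;
    }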

    Let's now look at the output for the smallest job: w=3

    1 6 6
    2 0 11
    3 2 5
    4 5 3
    5 3 4
    6 4 5
    7 7 35

    This one is quite weird, because 7 loops back to 7 in 35 cycles, but we can follow the primary loop easily:

    1 -> 6 -> 4 -> 5 -> 3 -> 2 -> 0

    And since 0 is reached with C=1, it goes back to 1 in 7 crossings (and 29 steps, w3 is clearly asymmetrical).

    From there, it's very easy to script for the other sizes.

    $ for i in 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
     do echo -n "$i: "
        gcc -Wall -DWIDTH=$i -Os gold_01.c -o gold
        /usr/bin/time ./gold > job_$i.log
     done
    
    3: Total: 69
    0.00user 0.00system 0:00.00elapsed 91%CPU (0avgtext+0avgdata 1424maxresident)k
    4: Total: 269
    0.00user 0.00system 0:00.00elapsed 87%CPU (0avgtext+0avgdata 1384maxresident)k
    5: Total: 797
    0.00user 0.00system 0:00.00elapsed 93%CPU (0avgtext+0avgdata 1392maxresident)k
    6: Total: 4157
    0.00user 0.00system 0:00.00elapsed 94%CPU (0avgtext+0avgdata 1428maxresident)k
    7: Total: 16079
    0.00user 0.00system 0:00.00elapsed 90%CPU (0avgtext+0avgdata 1368maxresident)k
    8: Total: 23114
    0.00user 0.00system 0:00.00elapsed 90%CPU (0avgtext+0avgdata 1316maxresident)k
    9: Total: 258059
    0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 1320maxresident)k
    10: Total: 1049597
    0.00user 0.00system 0:00.00elapsed 50%CPU (0avgtext+0avgdata 1388maxresident)k
    11: Total: 4190147
    0.00user 0.00system 0:00.00elapsed 83%CPU (0avgtext+0avgdata 1524maxresident)k
    12: Total: 16781251
    0.03user 0.00system 0:00.03elapsed 96%CPU (0avgtext+0avgdata 1548maxresident)k
    13: Total: 67117049
    0.08user 0.00system 0:00.09elapsed 98%CPU (0avgtext+0avgdata 1512maxresident)k
    14: Total: 268451737
    0.36user 0.00system 0:00.36elapsed 99%CPU (0avgtext+0avgdata 1624maxresident)k
    15: Total: 1070073927
    1.43user 0.00system 0:01.44elapsed 99%CPU (0avgtext+0avgdata 1928maxresident)k
    16: Total: 4295032829
    4.96user 0.00system 0:04.97elapsed 99%CPU (0avgtext+0avgdata 2396maxresident)k
    17: Total: 17179997009
    23.58user 0.01system 0:23.64elapsed 99%CPU (0avgtext+0avgdata 3352maxresident)k
    18: Total: 68719736689
    110.75user 0.03system 1:51.12elapsed 99%CPU (0avgtext+0avgdata 5528maxresident)k
    19: Total: 274803255069
    486.96user 0.11system 8:09.19elapsed 99%CPU (0avgtext+0avgdata 9612maxresident)k
    20: Total: 1099443499868
    1880.05user 0.36system 31:27.70elapsed 99%CPU (0avgtext+0avgdata 17692maxresident)k
    

    Of course, past w16, the running time and the memory usage explode.

    Interestingly, the total number of iterations quadruples for each increment of w, but the size of the log file only doubles:

           44 10 juil. 00:50 job_3.log
          113 10 juil. 00:50 job_4.log
          251 10 juil. 00:50 job_5.log
          557 10 juil. 00:50 job_6.log
         1226 10 juil. 00:50 job_7.log
         2753 10 juil. 00:50 job_8.log
         5921 10 juil. 00:50 job_9.log
        12423 10 juil. 00:50 job_10.log
        27659 10 juil. 00:50 job_11.log
        58670 10 juil. 00:50 job_12.log
       122025 10 juil. 00:50 job_13.log
       264080 10 juil. 00:50 job_14.log
       559598 10 juil. 00:50 job_15.log
      1161492 10 juil. 00:50 job_16.log
      2449479 10 juil. 00:51 job_17.log
      5194057 10 juil. 00:53 job_18.log
     10763126 10 juil. 01:01 job_19.log
     22192296 10 juil. 01:32 job_20.log
    

    The raw data (19MB) is available here if you don't want to spend half an hour computing the larger ones.

    I visualised the output data with gnuplot. Here is a plot of the length of the arcs of w16:

    It looks quite random, and the lengths range from 3 to 718818, or about 11× the number of arcs (which is not excessive).

    Here is the distribution (obtained by sorting the lengths):

    Does that make it an exponential curve?

    Half of the arcs are shorter than 45000, or about 2/3 of the size.

    Of the 65535 arcs, only 19746 are duplicates (about 1/3).

    __________________________________________________...


(53 project logs in total.)


Discussions

Yann Guidon / YGDES wrote on 05/24/2021 at 03:52:

Oh, nice !

https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html#Integer-Overflow-Builtins

