Close
0%
0%

Another Table-Based Stream Scrambler

Non-reversible, non-cryptographic scrambler for PRNG, 16 bits at a time.

Similar projects worth following
Reinventing RC4 but wider, faster and even worse.

I'm working on PRNGs these days.

I have made a replacement for the POSIX rand() and srand() functions (see log a POSIXy rand() ) using a mixed structure: a basic 32-bit LFSR PRNG has its LSB mixed with a PEAC 16x2 scrambler, outputting 16 bits (15 due to POSIX) with little effort and 65 bits of internal state. I want to make it even smoother and the best way is to add another scrambler with a very different structure. Naturally I remembered the RC4 stream cypher and a table-based system looks interesting, with its very large state and apparent simplicity. Though I'm not doing a cypher and I have forgotten the details. And memory accesses are expensive, particularly 16bit ones.

Anyway, a 256-byte table is nice but my algo works with 16 bits at once so a sort of "parallel RC4" would be required. I don't want to have to deal with key initialisation so let's just borrow the SBoxes of AES : one is forward, the other is reverse, let's put them in parallel as they are reciprocal permutations. They must be pretty good, I assume.

So with this system, I parallelize two table lookups, which is "good" for modern microprocessors, and the state is extended to 512 bytes, a bit like RC4A. But wait : if one has to swap values, then BOTH tables should be accessed by the same two indices. So the pair of tables is turned into a double-width, dual-port LUT (but I can do that with a FPGA, no biggie). This leads to a new structure that provides 2×16=32 bits per dual access: now the table works as an expansion table!

The extra 16 bits can be reused :

  • as extra parameters for the "swap" which is now a 32-bit rotation / circular shift
  • as offsets for the address inputs

With this extra feedback, the system becomes almost self-sufficient and can output a reasonable stream when fed with a constant (or short cyclical) value. Using a rotation instead of a swap is another "nice touch" that upsets the usual balance of a permutation, yet preserves the total number of set bits.

Even though the LUT starts with a pair of AES SBoxes which are permutations, the goal is only to fill the table with a well-behaved array with as many 1s as 0s. The rotation keeps that balance but the system does not rely on the table being a perfect permutation. In some degenerate cases, the entry should have at least a few bits set, no 0xFFFFFFFF or 0x00000000, that's all.

This is not cryptographically secure though, and the primary design assumption was that the input provided a half-decent entropy source (think LFSR or congruential PRNG). But with a careful design (the method to offset the address bits), it can become autonomous.

Looking back at RC4, this new algo can compete on throughput (only one read and one write per byte) though it certainly has glaring loopholes (and the code has many word casts). Yet it's still "good enough" for a semi-decent PRNG and the implementation in FPGA is a breeze.

I'm sure enhancements will appear, as I have identified a point or two to adjust...

.

 
-o-O-0-O-o-
 

Logs:
1. First log, second stage.
2. Startup
3. v2
4. v3
5. Initialisation
6. ARX
7. v4
8. 3:2 compression
9. Initialisation: the ugly
10. Entropic principles
11. v4.2
12. Feedback
13. so long v4.3, welcome v4.4
14. ...
15. ...
16. ...
17. ...
18. ...
19. ...

v4.4.tgz

see log 13

x-compressed-tar - 3.51 kB - 02/21/2025 at 00:03

Download

v4.tgz

see log 11

x-compressed-tar - 3.44 kB - 02/19/2025 at 04:51

Download

AddEntropy.tgz

created a non-deterministic function to add one long word of entropy to the LUT

x-compressed-tar - 2.98 kB - 02/13/2025 at 02:50

Download

ATBSS.v3.tgz

v3

x-compressed-tar - 2.54 kB - 02/12/2025 at 16:01

Download

scrambler_startup.tgz

show the LUT state as the scrambler tries to bootstrap from lowest possible entropy.

x-compressed-tar - 14.03 kB - 02/11/2025 at 15:55

Download

View all 7 files

  • so long v4.3, welcome v4.4

    Yann Guidon / YGDES02/20/2025 at 23:06 0 comments

    v4.3 is coming but going as fast...

    the "unilateral mix" is a total catastrophe 🫣

    iteration: 501 LUT:
     190  ⣿⣿⣿⣿⣿⣿⡿⣿⣿⣿⣿⣧⢀⠐⣿⣿⣿⢿⣾⣹⠿⣿⡖⣮⢈⠀⠁⢥⣊⠄⣻⣿
      84  ⢾⣷⣮⣿⢿⠺⠐⠀⠀⠄⣩⢚⢀⠠⠀⠀⠀⢠⡅⢔⠀⠠⡀⢀⠀⠀⠁⡶⠀⠀⣿⣿
     143  ⠵⠱⢻⡻⢗⠿⣩⠶⠈⢌⠢⢀⣿⣿⠠⠂⠛⣟⡅⡘⣿⣿⡥⠩⠝⡟⠂⡂⡿⣿⣅⣽
     108  ⡶⡿⠇⡕⢌⢠⢶⠑⣿⣿⢀⢐⠀⠀⣿⣿⠀⠐⠂⠀⠘⢀⢽⣳⠀⠀⢼⣾⠀⠀⣻⣵
     120  ⡑⡄⠯⡧⡀⠀⢼⢭⣰⢭⠕⠁⢌⠂⣟⠜⣟⡝⢓⠟⡊⢑⢭⠠⠔⠀⣿⣿⢤⡑⢯⡙
     115  ⣿⣿⠄⠡⠫⠣⣓⠒⡲⡅⠉⠇⢀⠀⢟⣏⣷⠌⠉⡰⢷⠦⠂⠄⣩⣯⠁⠲⣿⣿⠀⠀
      98  ⠐⣢⠪⡔⣿⣿⠀⡀⠀⠀⢍⣨⢄⡀⠆⡃⣎⡾⢠⠹⣪⣗⠠⣄⠀⠀⣿⣯⠇⠃⠢⠀
     170  ⣏⣽⣫⢮⣱⣺⠀⠀⡁⣧⣚⣳⣐⢺⣐⠨⣿⣿⣿⢿⣡⠂⣿⣿⠩⠴⢿⣼⣿⣿⣿⣿
     122  ⣿⣿⢃⠅⢴⠑⣿⣟⣿⣿⣗⢋⠠⣁⠀⠀⠄⡀⣿⣿⣿⡯⠀⠀⡍⡀⡶⢮⡀⠀⠐⢀
     102  ⣉⢰⢺⢆⣫⠶⠄⠉⣡⡶⢽⢫⠁⠀⠌⡐⢬⠛⡊⡣⡿⣍⠤⠀⠀⠀⣢⢈⠠⡘⡕⡷
     112  ⣿⣷⣮⣪⢤⢵⠀⠀⡼⠨⣿⢠⢽⣓⢟⡯⢬⡝⠀⠀⠀⠀⠠⠈⠀⠄⡯⣯⠠⡲⡛⠣
     105  ⡗⣼⠀⠁⢵⡏⠟⣘⠕⢍⠈⠡⠀⠑⣹⣷⡀⠀⠀⠀⠀⠀⠀⢉⣿⣷⣼⣑⢽⣧⢨⣪
     119  ⠋⠰⣿⢽⠀⠀⠥⢈⡠⡺⡪⠣⡨⡋⠌⠐⣿⣿⣹⣧⣯⣿⡁⡉⠀⠐⠀⠀⣿⠞⣲⣋
     139  ⣦⣏⠄⠂⠄⢈⠖⢻⣖⣟⡽⣡⣾⢶⠁⡄⠨⣼⣐⣯⣿⣿⣻⣼⠠⠐⣿⣿⢈⠔⠾⢑
     130  ⣭⣿⠫⠇⣯⠫⢈⡢⠀⠀⣿⣿⠂⡡⠀⠀⢂⠁⡀⠀⣻⣷⢽⣲⡔⠁⣿⣿⢍⠯⢿⣿
     191  ⣿⣿⠀⠈⠀⠀⣿⡻⢝⣷⠀⠀⣿⣿⣿⣿⣿⣯⣿⣿⢌⡋⣿⣿⣿⣿⣟⣽⣿⣿⣿⣿
    Total 2048

    the "move" is not anisotropic at all. I thought that the entropy of the indices would reduce this problem but not at all.

    I moved to a LFSR32 so there is now "enough entropy" to mix 16 bits and use as either a swap/bi-selector, or anisotropic move.

    The "enhanced mixer" is like that :

    That's ... 13 boolean ops if written in C...

    Try at home with Falstad's CircuitJS

    OTOH the "swapper" doesn't care what the values of the bits are.

    Simulate with this link.

    That's 7 boolean instructions only, half the above and doing the same job...

    So we get this:

    uint16_t S1 = S >> 8;
    uint16_t S2 = ~S1;
    uint16_t X2 = (X & S1) | (Y & S2);
    uint16_t Y2 = (X & S2) | (Y & S1);

    And yes, the S>>8 eats 3 bits that the barrel shifter uses, I hope it will be solved later.

    Source code is there and the new diagram is below:

    And now the result is back to "good", breaking the zero-entropy situation easily.

     LUT:
       0  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
       0  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
       0  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
       0  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
       0  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
       0  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
       0  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
       0  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
     256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    ...
    Read more »

  • Feedback

    Yann Guidon / YGDES02/20/2025 at 02:02 0 comments

    One of the working principles of the scrambler is that it should be able to work, or at least recover by itself, in catastrophic situations, such as when one of the "sources of entropy" is dysfunctional. These sources are:

    1. D, the input data, which is not at all under the system's control. Expect it to be lousy, though my dual-stage random() if nifty.
    2. S, the shift register/LFSR : so far I have implemented a degenerate/minimalist/barely sufficient LFSR8 to bootstrap the design and test its limits.
    3. FG (merging 2 bytes from 2 LUT entries) mixes D then affects R. The eventual carry spills over the new FG (out of the barrel shifter) and gets recycled for the next cycle. That's the red loop below:

    This loops is important to supplement the deterministic sequence of the LFSR, which is obviously predictible, but it's only here to "kick" things enough to get the entropy to increase. From there, the LUT gets scrambled even more and FG can have more entropy in return. So the "red loop" of FG needs a careful design. For now, FG is the MSB of the output of the barrel shifter, plus some carry-in from R. It's not much and I don't know if more is required, but FG "leaks" to R in the second cycle, after addition to the LSB of the barrel shifter.

    So R depends on 4 somewhat independent table lookups and 2 previous values of D. Reconstructing the full equation for the value of R ends up with having to know D and the barrel shifter's output. But the latter depends only on the LFSR which remains independent, unaffected by other data, and can't leak directly or easily. So all one has to do is "prime" the filter with a random state of the LFSR. The thing is : there are 1+8+5=14 bits consumed by the pipeline, only 16K combinations to consider. A LFSR32 will provide a longer sequence of these 16K substates but it was not meant as a strong protection either, but a tool to ensure that the LUT's entropy would always increase. The "non-leaking" part, or decorrelation, is just a bonus.

    BTW it's "decorrelated" because there is no way to reconstruct the LFSR's state from R, which also means I should remove the 1-bit carry-in link. Now I must find a different source for them, maybe implement some sort of PEAC with it ?

    Anyway now the diagram should be clearer, with the decorrelated LFSR path at the bottom, and the "FG loop" at the top.

  • v4.2

    Yann Guidon / YGDES02/19/2025 at 04:52 0 comments

    As explained in logs 7. v4 and 8. 3:2 compression here is the current result:

    So it's a bit simplified, with a newcomer: the boolean mixer in front of the write ports.

    v4.3 should provide a much more decent entropy source for the 14 bits consumed during each cycle, such as LFSR32. Maybe consume one word from D alternately ? Or enlarge D to 32 bits ?

    Source code: v4.tgz (not fully tested though)

  • Entropic principles

    Yann Guidon / YGDES02/18/2025 at 23:05 0 comments

    I only have "some familiarity" with cypher design: I won't dare to call this "knowledge", since owning Schneier's book "Applied Cryptography" won't make me a specialist. However I certainly do have experience and understanding of LGA, the algorithms, their behaviour...

    This scrambler owes as much to my FHP3 code as to RC4.

    Why is it relevant ?

    It is an argument to prove that even in the worst cases (such as all counters reset and the LUT in minimal entropy), the system will naturally evolve toward maximum entropy. This is important to resist practical attacks: a self-stabilising, resilient algorithm will converge to the desired state.

    The key point to consider it that the LUT behaves a bit like LGA: a sort of cellular automaton that

    obeys statistical mechanics, mimics small-scale particle dynamics, like atoms in the air bouncing off each others. LGA conserve the total kinetic energy and number of the particles, but through reshuffling explores new possible configurations, thus leading to an increase of entropy. It can be made reversible or irreversible, depending on the "nudges" from the algorithm.

    In our case, the LUT reshuffles/swaps bits and keeps their numbers equal, so it is a bit permutation, unlike RC4 that is a byte permutation.

    The factorial of 256 is about 10^507, or almost 1685 bits, so RC4's LUT can be considered as a good entropy pool, that's why it was used as RNG in Linux and BSD. 1685 is a good portion of the 2048 bits of storage.

    Note that a RC4 RNG stream skips the first 1024 bytes, and more recently the 3072 first bytes. ChaCha20 replaces RC4 now.

    Our algorithm has 256×16=4096 bits, half of them being set at any time. Being double the size is good and easy, but it is not bound to keeping the permutations of bytes: two separate bytes or words could have the same value so it dramatically increases the number of valid configurations, letting the entropy grow to even higher utilisation of the available bits.

  • Initialisation: the ugly

    Yann Guidon / YGDES02/18/2025 at 19:29 0 comments

    Just to check some assumptions and see how entropy increases in the LUT, I added some popcount instructions in the braille display and run the test again...

      0  64  LUT:
     128  ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
      80  ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128  ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
     128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭
    ...
    Read more »

  • 3:2 compression

    Yann Guidon / YGDES02/18/2025 at 09:28 0 comments

    There are 8 non-degenerate boolean operations with 2 inputs. For 3 inputs, it's much higher so it's harder to choose... And I'm still looking for the magic non-linear operation that is cheap, efficient and balanced.

    The goal is to generate R(16) and FG(16) from Z(32) and C(17). For now, R=Y+C and it's underwhelming. The last log proposed this :

    diff = X & ~Y;
    Y |= diff;
    X &= ~diff;

    But this creates an imbalance of bit density/probability, with Y > X. It's not a real problem in the LUT datapath except when A=B, where X and Y must have the same value to prevent conflicts.

    -----------------------------

    Well now that I think of it, if X=Y then X&~Y=0 so there is no imbalance or conflict... D'oh. So it's pretty safe in fact. Now, to prevent a bias in the LUT, A<B must be strictly as equiprobable as A>B. Good luck with that with the lousy LFSR.

    -----------------------------

    This imbalance could be dithered by using a third parameter (C in this case) that selects which outputs gets the bits set, and the other output gets reset, bit by bit.

    diff1 =  X & ~Y &  C;
    diff2 = ~X &  Y & ~C;
    
    Y |=  diff1;
    X &= ~diff1;
    
    X |=  diff2;
    Y &= ~diff2;

    It's cheap to implement as physical gates but it takes a bunch of instructions to execute on a CPU...

    -----------------------------

    Then there is another strategy with two "reciprocal MUX", or a "swap" operation: for each bit, C selects whether X and Y are swapped. That's a good way to get a well scrambled R and FG.

    .

  • v4

    Yann Guidon / YGDES02/18/2025 at 01:10 0 comments

    v4 is slowly taking shape, trying to replicate the success of sharing the addition between two consecutive words : as for the indices A & B, R and FG are treated in one single 32-bit operation, where C is only 17 bits wide, so there is little chance of carry-out.

    The initial carry-in comes from (S&1) and it's bad. The little LFSR8 gets squeezed to extract 14 bits, including the LSB that is reused 3 times, which is not good® !

    Can you spot the other changes ?

    uint16_t Scramble_round(uint16_t D) {
      uint8_t S=LFSR8;
      LFSR8 >>= 1;
      if (S & 1)
        LFSR8 ^= 0x8e;
    
      uint32_t C = D + FG + (S&1);
    
      uint8_t A = C ^ S;
      uint8_t B = C >> 8;
    
      uint16_t X = LUT16[A];
      uint16_t Y = LUT16[B];
    
      uint32_t Z = X | (Y << 16);
    
      Z = (Z >> (S&31)) | (Z << ((-S)& 31));
    
      uint32_t R = Z + C; // 32+17 bits
      FG = (uint16_t)(R >> 16);
    
      LUT16[A] = (uint16_t)Z;
      LUT16[B] = Z >> 16;
    
      return (uint16_t)R;
    }

    The code looks shorter and simpler, which is not necessarily better, but faster is good, and this is possible with fewer byte-wide operations.

    I have removed the once-defining byte-dephasing structure. In fact the rotation already does quite a lot, but now the result R is only "protected" from a whole 16-bit word written back to the LUT, by a (not so simple) addition.

    I have an idea to apply a non-linear boolean operation that preserves the number of bits. This would "swap" certain bits from 2 parts of the Z word:

    diff = X & ~Y;
    Y |= diff;
    X &= ~diff;

    • On the LUT datapath, before writeback, it relies on A!=B, which would introduce an Enigma-like flaw 0.4% of the time.
    • On the FG / R output path, it would make an inappropriate imbalance of bit values.

    Well, it's a start...

  • ARX

    Yann Guidon / YGDES02/16/2025 at 19:58 0 comments

    v3 already has significant enhancements over a "pure ARX" design or even RC4 :

    • The LUT is NOT a byte permutation. It's a bit permutation. This makes the output useless to recover the input/index.
    • The index addition for A and B is chained/merged hence co-dependent. Getting one index from the output does not directly provide the other index, particularly since A is XORed by the LFSR.
    • The LUT is an expansion table and only one half is output, the other is fed back.

    But it's not perfect yet.

    https://en.wikipedia.org/wiki/Rotational_cryptanalysis describes the class of cyphers based on the Add/Rotate/XOR operators, ARX in short. It is the basis of Salsa20, Speck, ChaCha20, ThreeFish, Blake, Skein, Cubehash...

    Only the (truncated carry) Add is "not linear", so the cypher requires a lot of rounds. It also hints how its cryptanalysis works, in particular because XOR, ADD and ROT are reversible. Some other cyphers combine non-reversible operators : OR and AND. v4 would benefit from this.

    Since the design of #PEAC I have a fondness for End-Around-Carry, which wipes the LSB computation that degenerates to a simple XOR, that is easy to correlate. So adding a carry-in and carry-out to the design will strengthen it. That's a second enhancement for v4.

    A third easy enhancement is the replacement of the lousy LFSR8.

  • Initialisation

    Yann Guidon / YGDES02/12/2025 at 19:54 0 comments

    One way to provide entropy to the LUT is to connect the entropy source to the block's input. The quality does not need to be great but the effect is slow.

    Now if you have a block of external data (a network packet, a keystroke, the usual deal), you may want to hit the LUT directly. It should be faster, right ?

    512 bytes is also 128× 32-bit words but you can't write them like that in the LUT. You have to respect the balance between 1s and 0s. Thus given an entropy word, you must XOR two locations : if you flip two bits, the overall bitcount is preserved.

    Well, no, it doesn't work like that.

    XORing is the easy bit of course. But keeping the bit balance is something else : one could use popcnt (__builtin_popcountl(x) if you want it to work with GCC) twice (once before and once after the XOR) but this would be slow, so let's turn the difficulty into an opportunity.

    Let's have our entropy word E that XORs a value L in the LUT. This gives the new word N=E^L. Fine.

    • Some bits from E will be cleared : they are given by R=E&L.
    • Some bits from E will be set : they are given by S=E&(~L) = E&N.

    The trick is to not bother with how many bits do this or that, keeping a count would be cumbersome, and another beautiful thing is : we don't care how many words it will take to scan to find a location where a corresponding bit can be toggled.

    So we will be scanning the word located after the place we just XORed, and try to see if each one fits a mask that even partially matches what has just been done.

    For each new word N, we try the masks S and R:

    • S shows which bits were set so now they indicate which ones must be reset : S2 = S & N.
    • R indicates cleared bits so now they set it : R2 = R & ~N

    Now get X = R2 | S2 and you can get N2 = X^N which can be rewritten back to the LUT.

    And don't forget to update the masks:

    R &= ~R2

    S &= ~S2

    If R | S is not 0 then move on to the next word.

    Of course this assumes that the LUT is already "properly initialised" with enough entropy. The word or the mask could be rotated once in a while ( to increase the chances of a match).

    This also promotes the use of very long words, like 64 bits, to amortize the costs.

    Oh and checking the actual number of bits of the LUT is a safety measure that must not be forgotten.

    --------------------------------------------------------------

    Actual code : it works well !

    when feeding the "entropy" 0x123456789ABCDEF 256 times, I get this result:

    ⢾⢓⡿⣣⣿⡣⣪⠗⠚⠐⠅⠃⠍⠀⢀⠃⠚⢐⠭⠣⠿⠀⢈⠇⡿⡿⣶⣿⣿⣿⣿⣿
    ⡥⡯⢺⣔⣉⢯⡷⡸⡯⡬⢾⢴⣊⣘⣿⠽⢈⢙⡛⠯⣟⣶⡦⣖⢹⡨⡃⣑⣸⡈⡁⣑
    ⢻⡨⠉⣭⣻⣐⣇⣱⡖⢟⢦⠦⠤⢧⡘⣏⡝⣿⣏⣓⠜⠯⡑⠆⢪⢈⠲⠬⢀⣐⣗⡩
    ⡜⡧⣕⣓⣾⠯⣭⢞⢨⠝⡀⠂⢈⠄⡡⣕⣅⠔⡂⠨⠂⣺⣌⠤⠺⣟⢥⣷⡿⢅⠑⣙
    ⣄⡻⢾⢑⡉⠫⠂⡊⠑⢡⠒⠌⢵⡋⠻⢗⣬⣞⣕⣻⢘⠱⡆⠮⢯⠺⢁⣃⠉⠬⡈⢐
    ⡩⡁⡈⠐⠶⠓⠰⡛⠵⠆⢄⢁⢙⢮⠃⠋⣈⠻⡂⡒⢯⠩⡺⣖⠹⡀⠰⢀⠦⠆⠉⠈
    ⢅⣏⢵⢛⢘⠿⡀⠘⡠⠸⡼⡐⣇⠐⡆⠗⣥⢕⠁⢂⠂⠂⠀⠜⡀⠦⡀⠁⠉⠍⠀⠂
    ⠙⡠⡌⠣⠉⠭⠉⠂⣼⢟⢡⣐⢔⠂⡰⡉⣜⡸⡦⠐⢀⠠⡱⣋⢡⣜⢄⣂⢞⠆⡊⢀
    ⡬⣫⡄⢑⢡⠨⡈⠊⣅⣿⠔⡂⢌⠙⡬⠒⣀⣷⠄⡂⠮⠉⠐⣃⣍⣾⠒⡊⠝⠘⡞⠂
    ⠰⠀⣙⢙⡒⡦⢁⡈⢕⠂⠢⢪⠸⠄⡍⣘⠘⡩⡝⠕⠆⢠⠱⡈⣧⢞⢀⣮⡘⠏⣀⠢
    ⠠⡭⡅⠕⣆⢢⠠⢈⡗⣷⣟⣻⣮⠛⣍⡕⢸⠁⠢⠊⠡⡤⢋⢾⢱⠀⠒⢨⣶⠳⢕⣗
    ...
    Read more »

  • v3

    Yann Guidon / YGDES02/12/2025 at 09:55 0 comments

    Leaks.

    v2 still leaks internal states.

    Well, data must come from somewhere anyway, right ?

    And when there is no external entropy source, we're left with this lousy LFSR8 to run the show in the worst case.

    Anyway, this system is not encryption because there is no key, just an evolving internal state.

    Now, as a standalone PRNG, it's getting better but the internal state could still be somehow deduced. The original intent is for the algo to be a scrambler so it expects "some sort of entropy" at the input, amplifying it along the way with a huge RC4-like talble. So it's "just a filter" that isolates one PRNG from the outside. Yet I still need an internal 8-bit "entropy source"...

    It appears that the link with cryptography is strong though, as trying to make this system "more random-like" means trying to "break it" like a cryptographer would do. So here I am, bikeshedding it to oblivion, not even caring what the result will be or what it would lead to.

    So my concern is : the result is a direct reading of the table, merging 2 independent half-reads. It's fine as long as the table is considered "unkown and unknowable". But considering that it takes several hundreds of cycles for the table to be well shuffles/scrambled, it's also as many values that can be directly extracted and correlated, hence the output could be one step closer to be predicted.

    So the output should not be a direct read of the LUT. It's sad since it's good to have an algorithm with very low latency, as it cascades well.

    • One way forward is to extract the 16-bit result/output from the 32-bit output of the barrel shifter. But this only moves the problem somewhere else since this value will be written to the LUT as well.
    • Another method would be to XOR the output of the B LUT, since the A LUT is xored at the address input. Recovery from 0-entropy would be faster but the LUT would stop being a "bit permutation" area, which would upset the strict balance of 1s and 0s throughout the table, and also require an even larger 16-bit LSFR ...
    • Of course the algorithm could be run several times before delivering its final results. Loop over the LUT 2, 3, 4 times and XOR the results together. XOR or ADD is possible because it does not have to obey to the bit count preservation rule.

    ...

    Overall the whole algo (like RC4) is mostly "diffusion" and there is little "confusion" happening, except for the address calculations. The barrell shifter is a small-scale diffusor while the LUT is a large-scale diffusor. Confusion is meant to happen in the LUT thanks to a well-brewed table, but only from input to output.

    Moreover, there are 32 bits to scramble and the barrell shifter only uses 32 combinations out of the factorial(32) ways to shuffle the bits around. We got a 8-bit LFSR and even then only 5 bits are used.

    So far the current solution is to take 16 bits out of the barrel shifter and add the pair of addresses (before the XOR).

    There is one more adder yet the diagram looks simpler.

    R is a function of the current and last pair of lookups. Why an addition ? It could be a XOR but I'd like more avalanche.

    No data enters the LUT, which is still a pure permutation of bits. A better shuffling would be welcome.

    Source is there: ATBSS.v3.tgz

    Mixing (worst case) is not complete after 2k iterations but ok at 4k.

View all 13 project logs

Enjoy this project?

Share

Discussions

Yann Guidon / YGDES wrote 03/10/2025 at 01:04 point

C(n,k) = n! / k!×(n-k)!

  Are you sure? yes | no

Yann Guidon / YGDES wrote 02/23/2025 at 02:38 point

https://oeis.org/A000984

I'm not sure I could compute binomial(2*n,n) = (2*n)!/(n!)^2
where n=2048...

  Are you sure? yes | no

Yann Guidon / YGDES wrote 02/22/2025 at 00:52 point

Theo de Raadt on arc4random
https://www.youtube.com/watch?v=aWmLWx8ut20
https://www.youtube.com/watch?v=gp_90-3R0pE

  Are you sure? yes | no

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates