-
so long v4.3, welcome v4.4
02/20/2025 at 23:06 • 0 comments
v4.3 is coming but going as fast...
the "unilateral mix" is a total catastrophe 🫣
iteration: 501 LUT: 190 ⣿⣿⣿⣿⣿⣿⡿⣿⣿⣿⣿⣧⢀⠐⣿⣿⣿⢿⣾⣹⠿⣿⡖⣮⢈⠀⠁⢥⣊⠄⣻⣿ 84 ⢾⣷⣮⣿⢿⠺⠐⠀⠀⠄⣩⢚⢀⠠⠀⠀⠀⢠⡅⢔⠀⠠⡀⢀⠀⠀⠁⡶⠀⠀⣿⣿ 143 ⠵⠱⢻⡻⢗⠿⣩⠶⠈⢌⠢⢀⣿⣿⠠⠂⠛⣟⡅⡘⣿⣿⡥⠩⠝⡟⠂⡂⡿⣿⣅⣽ 108 ⡶⡿⠇⡕⢌⢠⢶⠑⣿⣿⢀⢐⠀⠀⣿⣿⠀⠐⠂⠀⠘⢀⢽⣳⠀⠀⢼⣾⠀⠀⣻⣵ 120 ⡑⡄⠯⡧⡀⠀⢼⢭⣰⢭⠕⠁⢌⠂⣟⠜⣟⡝⢓⠟⡊⢑⢭⠠⠔⠀⣿⣿⢤⡑⢯⡙ 115 ⣿⣿⠄⠡⠫⠣⣓⠒⡲⡅⠉⠇⢀⠀⢟⣏⣷⠌⠉⡰⢷⠦⠂⠄⣩⣯⠁⠲⣿⣿⠀⠀ 98 ⠐⣢⠪⡔⣿⣿⠀⡀⠀⠀⢍⣨⢄⡀⠆⡃⣎⡾⢠⠹⣪⣗⠠⣄⠀⠀⣿⣯⠇⠃⠢⠀ 170 ⣏⣽⣫⢮⣱⣺⠀⠀⡁⣧⣚⣳⣐⢺⣐⠨⣿⣿⣿⢿⣡⠂⣿⣿⠩⠴⢿⣼⣿⣿⣿⣿ 122 ⣿⣿⢃⠅⢴⠑⣿⣟⣿⣿⣗⢋⠠⣁⠀⠀⠄⡀⣿⣿⣿⡯⠀⠀⡍⡀⡶⢮⡀⠀⠐⢀ 102 ⣉⢰⢺⢆⣫⠶⠄⠉⣡⡶⢽⢫⠁⠀⠌⡐⢬⠛⡊⡣⡿⣍⠤⠀⠀⠀⣢⢈⠠⡘⡕⡷ 112 ⣿⣷⣮⣪⢤⢵⠀⠀⡼⠨⣿⢠⢽⣓⢟⡯⢬⡝⠀⠀⠀⠀⠠⠈⠀⠄⡯⣯⠠⡲⡛⠣ 105 ⡗⣼⠀⠁⢵⡏⠟⣘⠕⢍⠈⠡⠀⠑⣹⣷⡀⠀⠀⠀⠀⠀⠀⢉⣿⣷⣼⣑⢽⣧⢨⣪ 119 ⠋⠰⣿⢽⠀⠀⠥⢈⡠⡺⡪⠣⡨⡋⠌⠐⣿⣿⣹⣧⣯⣿⡁⡉⠀⠐⠀⠀⣿⠞⣲⣋ 139 ⣦⣏⠄⠂⠄⢈⠖⢻⣖⣟⡽⣡⣾⢶⠁⡄⠨⣼⣐⣯⣿⣿⣻⣼⠠⠐⣿⣿⢈⠔⠾⢑ 130 ⣭⣿⠫⠇⣯⠫⢈⡢⠀⠀⣿⣿⠂⡡⠀⠀⢂⠁⡀⠀⣻⣷⢽⣲⡔⠁⣿⣿⢍⠯⢿⣿ 191 ⣿⣿⠀⠈⠀⠀⣿⡻⢝⣷⠀⠀⣿⣿⣿⣿⣿⣯⣿⣿⢌⡋⣿⣿⣿⣿⣟⣽⣿⣿⣿⣿ Total 2048
the "move" is not anisotropic at all. I thought that the entropy of the indices would reduce this problem but not at all.
I moved to an LFSR32 so there is now "enough entropy" to mix 16 bits and use them as either a swap/bi-selector or an anisotropic move.
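For reference, one LFSR32 step could look like this minimal sketch, mirroring the LFSR8 step used so far; the feedback mask (taps 32, 22, 2, 1 from the usual maximal-length tables) is my own illustrative choice, not necessarily the project's:

#include <stdint.h>

static uint32_t LFSR32 = 1;          // must be seeded with a non-zero value

// One right-shifting Galois step, same structure as "if (S & 1) LFSR8 ^= 0x8e;"
static inline uint32_t lfsr32_step(void)
{
    uint32_t lsb = LFSR32 & 1;       // bit shifted out this cycle
    LFSR32 >>= 1;
    if (lsb)
        LFSR32 ^= 0x80200003u;       // taps 32, 22, 2, 1
    return lsb;
}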
The "enhanced mixer" is like that :
That's ... 13 boolean ops if written in C...
Try at home with Falstad's CircuitJS
OTOH the "swapper" doesn't care what the values of the bits are.
Simulate with this link.
That's only 7 boolean instructions, half of the above, doing the same job...
So we get this:
uint16_t S1 = S >> 8;              // selector bits extracted from S
uint16_t S2 = ~S1;                 // complementary selector
uint16_t X2 = (X & S1) | (Y & S2); // per-bit MUX: pick each bit from X or Y
uint16_t Y2 = (X & S2) | (Y & S1); // reciprocal selection
And yes, the S>>8 eats 3 bits that the barrel shifter also uses; I hope this will be solved later.
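As a quick sanity check, here is a small standalone test (my own sketch, not part of the project sources) that the bi-selector preserves the total number of set bits for every possible selector value:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t X = 0xC0DE, Y = 0x1234;          // arbitrary test values
    for (uint32_t s = 0; s < 0x10000; s++) {  // try every 16-bit selector
        uint16_t S1 = (uint16_t)s;
        uint16_t S2 = ~S1;
        uint16_t X2 = (X & S1) | (Y & S2);    // per-bit bi-selector
        uint16_t Y2 = (X & S2) | (Y & S1);    // reciprocal selection
        if (__builtin_popcount(X) + __builtin_popcount(Y) !=
            __builtin_popcount(X2) + __builtin_popcount(Y2)) {
            printf("bit balance broken for S1=%04X\n", (unsigned)S1);
            return 1;
        }
    }
    printf("bit balance preserved for all 65536 selector values\n");
    return 0;
}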
Source code is there and the new diagram is below:
And now the result is back to "good", breaking the zero-entropy situation easily.
LUT: 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ Total 2048 iteration: 51 LUT: 42 ⡟⣭⠀⠀⢀⠆⠀⠀⣿⢲⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡿⣿⠀⠀ 23 ⠀⠀⠀⠀⠀⠀⠂⣪⠀⠀⠀⠀⠀⠀⠀⠀⢐⠁⠀⠀⠀⠀⠀⠀⣿⣸⠀⠀⠀⠀⠀⠂ 10 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⢁⠀⠀⠀⠀⠀⠀ 0 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 33 ⣿⢃⠀⡃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠐⠀⣼⠇ 26 ⠀⠀⠀⠀⠁⣸⠀⠀⠀⠀⠀⠀⠀⠂⣿⢒⠠⠆⠀⠀⠀⡝⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 14 ⠀⠀⠀⠀⠀⣽⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠅⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 4 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ 228 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣽⣿⣿⣿⣿⣿⣿⠀⠅⣿⣿⣿⣽⣿⣿⣿⣿⠀⡍ 256 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ 226 ⠀⡾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⡼⣿⣿⠀⣻⣿⣿ 237 ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣇⣺⣿⣿⡂⠂⣿⣿⣿⣿⣿⣿⣿⣿⣿⣻⣿⣿⣿⣿ 242 ⣿⣿⣿⣿⣾⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⢻⣿⣿⣿⡿⣹⣿⣿⣿⣿ 247 ⣿⣽⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡏⢷⣿⣬⣿⣿⣿⣿⣿⣿⣿⣿ 232 ⣿⣿⢀⠀⢱⡿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠧⣿ 228 ⣿⣿⣿⣿⣿⣿⣿⢼⣿⣿⣿⣿⣿⣿⣿⣿⣿⢢⣿⣯⣿⢺⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀ Total 2048 iteration: 501 LUT: 133 ⠿⢡⠢⣗⠂⣸⣿⣿⢐⠀⡜⠀⣭⣿⠁⢀⢸⠇⡏⢼⠄⡀⠃⣠⣏⢺⠃⣿⡗⣿⣏⣿ 113 ⣿⣿⠀⠀⠀⡏⣿⢺⣾⣦⡟⣑⠃⢸⠇⢰⢠⠅⠀⣸⣿⣿⠈⠀⠀⡝⣗⣚⠀⠄⢐⠀ 89 ⣿⣻⠆⡄⣌⠂⠀⡔⠀⠀⠀⡒⣔⡽⠀⠜⠀⠀⢰⠁⠀⠀⢀⠀⣿⣿⣾⠀⣻⡄⢇⢹ 90 ⡀⠌⣿⢁⠐⠂⡀⢀⠨⠅⡜⣘⢅⠎⠀⠀⠇⢸⣧⣹⠐⠞⣈⠽⢈⠀⣼⠲⠀⠀⣿⢴ 89 ⣃⣼⣺⠵⠁⢰⡄⠄⠀⠀⡏⣾⣺⡯⠀⠀⠀⡎⡿⡭⢨⡲⠀⠄⠀⠀⡇⡀⠘⠁⠀⡻ 81 ⣺⡟⢀⠀⠀⠀⠀⡂⢸⡄⠀⠀⢠⠀⣿⢒⢂⠁⠟⠨⠀⡝⠁⢺⢀⠀⠇⢴⠇⣺⢸⡏ 82 ⠁⢰⣿⣠⠀⣽⠀⠀⠀⠀⠿⠰⢰⡇⠇⠀⣾⣿⣸⠅⠀⠰⠀⠀⠀⠀⡩⢄⢰⠗⠈⡇ 143 ⣯⣾⢠⠉⢻⣟⣿⣟⡟⢰⠀⠀⡛⣸⢂⡎⠀⠀⡔⠄⣿⢣⣏⣹⣽⡿⢿⣿⠐⡘⣭⣇ 150 ⠆⢈⣾⠅⣞⣉⠋⣏⣿⣿⣿⢺⣾⢁⢸⣃⣿⣿⠠⡿⡈⡮⠸⡟⠊⡯⣿⣿⢸⠃⢐⠀ 158 ⢀⢞⣸⡟⣼⡘⣿⣿⠀⣦⣵⡿⠹⣿⣿⣿⠀⠀⣿⣿⢞⢂⣾⠃⢀⢀⣿⣿⡏⢰⠟⣾ 160 ⠇⢡⣿⢾⣿⣿⣿⣟⣿⢣⠀⠈⣿⣿⣸⡧⢰⡏⡮⣮⣿⠁⠇⠂⡾⡛⣧⢧⠐⡏⢾⡙ 132 ⠃⣝⣿⣻⡷⠡⠀⡜⢼⠉⣿⣿⠠⠟⢰⣱⢼⠌⣿⣹⡟⢲⠀⡸⢀⠆⣽⢁⠁⣜⢠⡮ 136 ⡠⠇⣿⣿⣨⡷⣿⣾⠃⢴⣼⠇⢮⣏⡆⣾⢠⡁⡔⡄⠀⠇⣿⢤⣸⢧⠀⠃⢰⠄⣿⣽ 193 ⣿⣿⣿⣿⢸⠃⣿⣿⣿⣿⣾⣾⢢⠀⡿⢃⣾⠂⣿⢗⡾⠇⠰⡌⡿⣻⣿⣿⣿⣿⣿⣽ 165 ⣷⣿⢀⠀⠏⣾⣿⣻⣿⣾⣿⣿⢡⢅⡼⠼⢸⠀⢽⠅⠐⠃⢀⡯⣿⣿⡟⡚⣿⣿⣿⢶ 134 ⡋⠄⡇⣬⣿⣿⠀⠀⣟⣧⣻⢿⠻⠫⠢⠃⢸⠃⠀⠁⣟⣓⢿⡝⠀⡸⡟⣿⣿⣹⠀⠄ Total 2048 iteration: 1001 LUT: 118 ⡧⢒⡟⣓⢩⣄⠁⢾⢄⠄⡜⠀⡟⡼⢸⡀⡎⠀⢑⣿⠘⣛⢀⡇⣢⠇⡩⣼⡢⡺⢰⢯ 117 ⠹⠦⠡⡏⠀⡏⢸⠘⢹⠠⡿⡋⡇⣀⡃⣿⣾⠐⢀⡘⠃⣧⡅⣱⠐⡼⣿⢂⢐⢁⠟⣸ 102 ⣼⡇⢛⢺⠸⡂⢰⠇⢠⠣⠁⠹⣾⡗⣄⠗⠀⠀⣝⣄⠇⢀⢸⠆⡦⠾⣾⠀⡄⡋⠄⠀ 108 ⣩⣠⠀⡇⠀⠐⠈⠷⢐⡃⡿⠂⢲⢐⠁⠢⢟⢵⣧⣹⣪⡁⣦⢠⡞⡻⣘⠚⡞⢻⠀⠄ 132 ⢤⣼⠸⠂⣟⣪⣾⠈⣙⢺⢦⠰⡯⣏⡺⣙⢹⣻⢘⡿⢨⡲⣾⢃⣀⠐⠌⠩⣣⣣⣔⠀ 119 ⢠⠫⣟⣸⢠⠠⢷⢫⠀⡼⣪⢪⠁⢀⣿⢒⢀⢻⠟⣞⢀⢛⠁⢺⢀⠌⡼⢰⠿⢱⠡⣾ 118 ⣹⠕⠀⠸⢡⠇⣄⠂⡏⠼⣾⠜⢇⣾⢲⡗⡿⡆⡰⠇⠀⠰⠠⣹⣑⢻⣿⣽⢰⠙⠀⠐ 133 ⢇⠧⡘⠀⢷⡿⣳⢫⠸⢂⠁⣟⡀⡯⢭⡑⢿⠿⢈⣁⢆⣺⣟⣿⠀⡀⣯⣿⣜⣦⠀⡂ 160 ⣽⠇⢳⣿⢟⢫⡞⣾⣿⡸⣿⢉⢀⠱⣟⣽⣽⣯⠟⣣⣾⠧⠃⣸⣌⠇⣿⢻⠰⡯⢐⠀ 163 ⠐⡕⣿⣟⣼⡘⣿⣽⢇⢼⢸⠗⢾⣻⡍⡾⠀⠅⣿⣿⡧⣠⠾⢾⢁⢿⡏⣿⢺⡇⣯⢕ 120 ⢷⣼⠸⠡⡘⣏⠆⣱⠀⠀⠀⠈⠂⠄⢷⢝⠇⣼⣰⡝⣿⠁⢲⣵⢠⡗⢕⢅⣵⢹⡸⣻ 147 ⠮⠁⠐⠋⠨⡴⢙⣼⢲⣽⢾⢫⣧⣳⢺⣷⠑⣵⢯⢻⣾⠗⣝⣿⣰⡑⣌⠇⠪⡅⣴⠿ 102 ⠻⡪⡀⢳⡋⣾⡘⠀⡠⠐⢤⠀⢀⡇⣽⢁⢠⡁⣤⠉⠀⡸⢜⣛⣱⣳⣯⣿⠀⠘⡀⠀ 132 ⡹⡿⢿⣇⢠⠙⡇⢠⢳⣾⠡⣬⠃⡄⠨⠽⠫⣽⣏⣶⣿⣽⠀⠈⡉⣻⠀⠢⠮⡙⡷⠤ 143 ⠰⡴⣮⡂⢫⢿⡉⣟⡗⢡⣿⣿⢰⢢⡘⠈⡽⣧⣼⡞⢡⠆⡟⣇⢾⠂⢥⢠⠃⡿⡛⣾ 134 ⣢⠅⢠⢀⠗⣧⡿⠆⢟⡿⢇⢘⢁⣻⣿⣏⢷⡻⠀⠈⡵⢨⢬⠀⣏⣿⠅⡈⣸⣋⢯⢣ Total 2048 iteration: 1501 LUT: 126 ⡏⢼⡍⠠⡐⠏⢣⣻⢋⠔⢪⣆⣹⡾⢰⠺⠪⡟⠀⢂⡇⡊⣸⣾⣢⠇⠋⣆⢘⣏⢠⣭ 123 ⢶⢾⣿⠁⠗⢳⢷⡿⡽⠳⠡⣋⢪⠍⡱⢂⠊⡣⢠⠥⡾⠃⡅⣱⢍⠇⡖⡇⠀⠢⣊⢃ 140 ⣅⣿⣧⣟⣭⣀⢐⠄⢇⣾⣆⡾⢞⠥⣿⢤⣽⢱⣠⠱⣾⠀⣡⣎⠃⢄⡇⢰⣝⡹⣶⠍ 132 ⣿⣶⠀⡇⠇⣦⠂⠪⡜⡘⣼⠯⣅⢌⣪⢏⢟⢵⣾⢄⢣⣫⣦⣥⠀⢰⣨⠌⡨⠹⠝⣯ 131 ⢾⣻⠗⠐⠰⡟⠁⢳⠢⢠⢙⡏⢌⣽⠏⡌⠼⣭⡑⡦⣅⡠⣿⣸⡗⢽⣃⣟⠣⠺⠠⡅ 126 ⡿⢹⣞⡯⣘⢚⢔⢮⡞⠎⢃⡅⡤⢉⢢⠀⡬⠅⠇⣓⡛⣿⠇⡮⢊⠄⠖⡰⢸⢓⢴⡟ 130 ⡙⢛⣎⢃⣑⢹⡬⢘⢏⣺⣅⡎⣣⢃⠱⣼⡈⢂⡰⠇⠰⢵⡞⣰⣮⣧⡏⣿⡈⢸⢀⠥ 136 ⣾⢜⣬⣻⢻⣞⡮⡶⢳⣑⠡⠗⢫⠅⡦⣴⠀⠞⢀⣼⣛⣰⡞⢁⠎⢠⣿⣩⠎⠘⣏⠥ 119 ⡏⣻⡟⡊⢟⢫⠈⣰⣊⠊⡏⣰⢴⠈⡎⡌⣑⡊⠈⡯⢸⡅⡸⠆⡰⡥⡃⣹⡴⡂⢡⠓ 137 ⢺⠾⠇⡑⢏⢺⠼⢱⣦⡹⠵⡈⡣⡼⢻⡫⢂⣼⠣⣋⡇⢼⡭⣙⢁⢿⣁⣐⠹⢇⠿⡅ 123 ⡜⠱⡇⡶⢪⡍⠟⡠⣼⠀⠷⣅⡯⢂⡁⢠⡗⢘⠂⡍⣓⠀⣊⠳⣿⢘⡦⢎⣻⡸⡇⢺ 125 ⣿⢀⠱⢅⠅⡘⣿⡳⠃⢻⠐⢽⡶⡸⢲⡍⡸⡩⡾⡟⠵⢀⢧⠺⠌⢈⡳⢠⣽⢹⠄⠰ 114 ⣘⠂⠆⠠⡫⣴⠈⣸⡠⠓⢤⠀⢳⠂⢅⡵⠗⡄⡹⣘⡈⠓⢪⣷⡏⣴⣭⠇⠖⡟⡛⢲ 119 ⠀⡀⢞⠇⢻⠐⠀⢅⢓⣿⠡⣬⠷⡹⡀⡟⠢⣸⢺⣭⡉⣱⠀⠨⢻⠥⡞⡝⢷⣺⠒⢰ 132 ⡿⠱⢾⡹⠶⣎⡤⢂⢰⢁⡎⣱⡟⢚⡧⢆⢼⠀⣳⢘⡟⠋⢧⣾⠢⢾⠡⣀⢐⡬⡿⠾ 135 ⠩⣘⣟⣟⢄⡇⡏⠰⢜⠼⠞⢞⢿⡦⣵⣘⢇⡒⡖⡻⢶⠌⡻⢘⢰⢱⣈⠈⣵⠤⣿⡃ Total 2048 ...
Cool.
-
Feedback
02/20/2025 at 02:02 • 0 comments
One of the working principles of the scrambler is that it should be able to work, or at least recover by itself, in catastrophic situations, such as when one of the "sources of entropy" is dysfunctional. These sources are:
- D, the input data, which is not at all under the system's control. Expect it to be lousy, though my dual-stage random() is nifty.
- S, the shift register/LFSR : so far I have implemented a degenerate/minimalist/barely sufficient LFSR8 to bootstrap the design and test its limits.
- FG (merging 2 bytes from 2 LUT entries) mixes D then affects R. The eventual carry spills over the new FG (out of the barrel shifter) and gets recycled for the next cycle. That's the red loop below:
This loop is important to supplement the deterministic sequence of the LFSR, which is obviously predictable, but it's only here to "kick" things enough to get the entropy to increase. From there, the LUT gets scrambled even more and FG can get more entropy in return. So the "red loop" of FG needs a careful design. For now, FG is the MSB of the output of the barrel shifter, plus some carry-in from R. It's not much and I don't know if more is required, but FG "leaks" to R in the second cycle, after addition to the LSB of the barrel shifter.
So R depends on 4 somewhat independent table lookups and 2 previous values of D. Reconstructing the full equation for the value of R ends up with having to know D and the barrel shifter's output. But the latter depends only on the LFSR, which remains independent, unaffected by other data, and can't leak directly or easily. So all one has to do is "prime" the filter with a random state of the LFSR. The thing is: there are 1+8+5=14 bits consumed by the pipeline, only 16K combinations to consider. An LFSR32 will provide a longer sequence of these 16K substates, but it was not meant as a strong protection either, rather as a tool to ensure that the LUT's entropy would always increase. The "non-leaking" part, or decorrelation, is just a bonus.
BTW it's "decorrelated" because there is no way to reconstruct the LFSR's state from R, which also means I should remove the 1-bit carry-in link. Now I must find a different source for them, maybe implement some sort of PEAC with it ?
Anyway now the diagram should be clearer, with the decorrelated LFSR path at the bottom, and the "FG loop" at the top.
-
v4.2
02/19/2025 at 04:52 • 0 comments
As explained in logs 7. v4 and 8. 3:2 compression, here is the current result:
So it's a bit simplified, with a newcomer: the boolean mixer in front of the write ports.
v4.3 should provide a much more decent entropy source for the 14 bits consumed during each cycle, such as an LFSR32. Maybe consume one word from D alternately? Or enlarge D to 32 bits?
Source code: v4.tgz (not fully tested though)
-
Entropic principles
02/18/2025 at 23:05 • 0 comments
I only have "some familiarity" with cypher design: I won't dare to call this "knowledge", since owning Schneier's book "Applied Cryptography" won't make me a specialist. However I certainly do have experience and understanding of LGA, the algorithms, their behaviour...
This scrambler owes as much to my FHP3 code as to RC4.
Why is it relevant?
It is an argument to prove that even in the worst cases (such as all counters reset and the LUT in minimal entropy), the system will naturally evolve toward maximum entropy. This is important to resist practical attacks: a self-stabilising, resilient algorithm will converge to the desired state.
The key point to consider is that the LUT behaves a bit like an LGA: a sort of cellular automaton that obeys statistical mechanics and mimics small-scale particle dynamics, like atoms in the air bouncing off each other. An LGA conserves the total kinetic energy and number of particles, but through reshuffling it explores new possible configurations, thus leading to an increase of entropy. It can be made reversible or irreversible, depending on the "nudges" from the algorithm.
In our case, the LUT reshuffles/swaps bits and keeps their count constant, so it is a bit permutation, unlike RC4 which is a byte permutation.
The factorial of 256 is about 10^507, or almost 1685 bits, so RC4's LUT can be considered a good entropy pool; that's why it was used as an RNG in Linux and BSD. 1685 bits is a good portion of the 2048 bits of storage.
Note that an RC4 RNG stream skips the first 1024 bytes, and more recently the first 3072 bytes. ChaCha20 has replaced RC4 nowadays.
Our algorithm has 256×16=4096 bits, half of them set at any time. Being double the size is good and easy, but it is also not bound to a permutation of bytes: two separate bytes or words could have the same value, which dramatically increases the number of valid configurations, letting the entropy grow to an even higher utilisation of the available bits.
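A quick standalone computation (my own check, not project code) puts numbers on this: log2(256!) for RC4's byte permutation, versus log2 of the number of ways to place exactly 2048 ones in a 4096-bit pool:

#include <math.h>
#include <stdio.h>

// log2(n!) computed through lgamma(), precise enough for this comparison
static double log2_factorial(int n) { return lgamma(n + 1.0) / log(2.0); }

int main(void)
{
    double rc4  = log2_factorial(256);                               // ~1684 bits
    double ours = log2_factorial(4096) - 2.0 * log2_factorial(2048); // ~4090 bits
    printf("log2(256!)          = %.1f bits\n", rc4);
    printf("log2(C(4096, 2048)) = %.1f bits\n", ours);
    return 0;
}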
-
Initialisation: the ugly
02/18/2025 at 19:29 • 0 comments
Just to check some assumptions and see how entropy increases in the LUT, I added some popcount instructions to the braille display and ran the test again...
0 64 LUT: 128 ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 80 ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ Total 2000
On the first insertion, the LUT has already lost 48 bits !
The total must always remain 2048, by definition, so there is something very wrong.
I have an excuse, though: the initial pattern 0xAAAA is one of the "worst cases", which is not expected to happen. But the loss of bits is not acceptable.
...............
Aaaaand it was a simple stupid bug, it's solved.
Or not.
8 112 LUT: 113 ⣈⣚⣬⣾⠱⠂⠭⠧⠠⠄⠭⠄⠭⠉⠠⠍⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 120 ⢔⠐⠋⣏⢐⠣⡐⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 121 ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠡⠭⠅⠭⠬⠉⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⣈⣚⣬⣾⠱⠂⠭⠧⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ 128 ⢔⠰⠋⣯⢝⠯⡔⣰⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭⠭ Total 2018
Now 30 bits are lost when the insertion algo gives up ...
.....
Aaaand I didn't validate an index at the start of the function ! It works so much better now !
----
FYI here is how the LUT looks when initialised with the AES LUT:
LUT: 141 ⠪⠳⡁⡼⡲⠿⢭⡻⠘⢺⠞⡳⢕⡷⡘⢥⣟⠘⠠⠁⢓⠷⣎⡓⢁⣾⢻⢯⢯⣓⣻⠾ 134 ⡼⣢⢳⢂⡙⣡⢂⡽⣋⣺⡗⡩⣿⠧⢇⢸⠜⣕⣆⢬⠣⢒⠤⣗⢤⣌⣮⢔⣱⠺⣣⢠ 134 ⠬⢟⡻⣽⢌⢋⠚⠖⢖⠞⢢⡟⠓⢿⡝⣤⣶⠜⡤⢕⢍⢵⡃⢹⠢⠹⣺⣨⢣⠙⡦⠍ 111 ⡀⠄⡖⢧⢑⠓⠶⢣⡐⡈⣩⢎⠔⠅⢚⣊⠾⠇⡫⠊⢒⢀⡡⢲⡵⣳⣃⠗⢩⢚⠕⠽ 121 ⠺⡁⣸⢃⢾⡔⠴⡊⢆⡋⡰⡶⣈⡪⠎⢐⢬⠪⢔⡛⡬⢮⣤⢛⡭⡑⠵⢳⢞⡗⢊⢄ 130 ⡴⠫⠸⢩⡠⠀⠨⣵⣽⠐⣵⣼⣙⢙⣪⡫⡮⡲⠍⣣⠦⣞⠯⡙⢗⡢⣅⡤⣍⡨⢄⣧ 123 ⢈⢨⣨⣷⣓⣒⠀⣻⣄⠣⣜⡥⢫⠛⡂⢅⢿⠥⢴⣹⡨⠂⠅⡿⣘⠨⢛⡜⠥⣏⠆⣐ 125 ⢨⠩⡔⢓⡎⠠⣇⣇⣢⢊⡟⣍⡇⡘⠂⢽⢡⣜⣗⢞⣝⣪⠃⠑⠁⠈⠋⣿⣂⢻⡳⢪ 137 ⡚⣥⢉⡄⠉⠋⠡⣴⡧⡯⠷⢏⣬⠤⣲⠏⢏⢤⢺⢗⣧⡾⣦⡝⢸⠴⢜⡭⢶⡉⠻⠻ 129 ⢎⠰⣔⢁⠼⡧⠒⣬⢷⠒⣕⡒⠝⢈⢅⣀⢲⠦⣹⣶⠟⣘⣰⠌⡌⣮⠽⡮⣯⡃⡶⣫ 118 ⠧⢰⢹⠚⡊⡚⠹⡂⡍⡡⡑⠆⢥⠔⣁⡬⡷⢢⢟⢫⠲⣔⡆⠲⣒⢉⡈⢍⣞⢴⡋⡹ 139 ⣼⢷⠮⣠⡞⠟⡣⡵⢦⣅⢪⢭⡹⡦⠐⣑⣊⡴⣫⠮⢠⢼⣾⣲⡸⠵⣥⡺⡪⣖⢼⡀ 125 ⡏⣚⣭⡸⣐⠕⠛⡖⣀⡌⠇⢖⢧⢜⠙⢦⢙⣰⠊⣭⠈⠼⡩⡏⠗⡣⢀⣝⣴⣃⡯⣂ 131 ⠰⠸⠩⡞⡿⢝⣑⠶⡉⡠⢝⠃⡢⢾⡅⡆⡕⠱⢵⠝⡺⠯⣏⣙⢋⢆⣡⢡⣌⡍⣷⣎ 130 ⢐⢱⢰⣸⡛⣈⡥⠉⣖⡱⡒⣩⢽⣆⢘⢌⣠⣋⣳⡎⣛⢇⡜⣱⢃⣦⠫⠭⣉⡐⠱⣯ 120 ⠏⣄⡓⢑⠄⣁⡾⡅⣚⣟⠿⢶⢮⠢⠖⡰⢱⠡⡱⣉⠌⡕⠳⡇⠭⢘⠑⠬⡄⣛⡽⠎ Total 2048
The variability is pretty wide, from 111 to 141 bits per line. This should get smoothed out by 500 insertions of 64-bit words...
500 11 LUT: 132 ⡖⢂⢔⣈⡆⢵⢛⣎⠱⠿⡈⠝⢻⡧⢍⢠⠻⠟⡤⠛⢹⣮⣈⢂⠩⠗⡦⠽⣻⠾⣞⠲ 126 ⣔⣢⢏⣧⢆⣑⣒⡼⡈⣳⢤⢼⠫⠦⠗⣤⢭⠌⡉⡧⠷⡉⠋⡋⠛⢝⢌⣙⠍⠊⢍⡩ 127 ⣙⢑⠆⣤⢧⣯⢷⢜⢓⠘⠤⣄⡴⠏⡛⣡⡴⣯⣑⠓⠻⠠⠥⡷⡦⡦⣛⠛⢘⠰⠵⡒ 124 ⢜⡗⠖⣒⡄⢗⠳⡷⠱⢨⣅⠥⡘⡨⣊⠜⢫⣧⢟⢜⢝⢃⠑⡸⡍⠐⡾⡇⠃⡬⠇⣺ 125 ⢴⢎⡱⠐⡛⡀⣿⢂⢆⠶⣫⠸⢹⡞⢀⡷⡻⣁⠂⣫⠅⢡⢴⡙⡀⣏⣩⢠⣓⣾⡛⢑ 131 ⠚⢪⡌⠀⣳⡣⡈⡧⠔⠇⡠⢬⠊⢯⣵⠻⢋⣹⢛⣿⣗⣖⡛⢅⢲⡭⠲⢠⠗⡁⠿⣼ 136 ⡅⢒⣣⡋⢨⣾⠃⢆⢪⣬⢌⢴⠗⢜⣲⡻⡭⡱⣢⡁⡸⢽⠽⢼⡽⠳⣊⡏⢪⡹⠱⢝ 140 ⠏⣞⠷⠶⣍⡥⣹⠳⣍⢗⣽⠍⣈⣱⡖⡏⣿⢗⡃⠅⢨⠱⣳⢞⠀⡨⢾⣒⣶⡞⢈⡩ 118 ⡦⢊⢝⢾⠉⢧⠬⠪⢉⡽⡜⡠⡷⠘⢦⣁⢮⠬⠸⠇⡉⡟⡠⢃⠤⡈⠆⠻⠊⣟⣔⡧ 121 ⢋⢳⣑⣐⣔⡠⡉⡈⡟⣙⠺⢇⠺⢏⣢⠨⠉⠫⣥⣴⡵⣲⠀⠚⣦⣜⠈⠃⢫⡝⡏⡤ 126 ⠃⠘⡥⣙⢰⡿⢄⡰⠑⢘⡃⣝⡑⠯⢹⢅⣾⡧⢒⠪⣨⣐⠅⣿⣸⣉⡦⣖⢰⢥⡡⢶ 126 ⣭⢗⡳⡣⣪⠅⣩⣋⠐⡠⢎⢼⠔⡺⠖⣀⡰⢲⠯⠣⣯⡖⢖⢖⢍⡍⣔⣴⡲⠹⠘⢙ 129 ⡢⢾⠋⢏⠌⢆⣻⡧⠫⣈⠴⡔⡠⣻⠟⡞⠝⢍⡍⡇⢜⠊⠊⢓⣫⡺⢪⢈⡣⣅⡳⡮ 124 ⡱⡧⢣⡾⡐⢱⣚⠫⢦⢘⠼⢅⢊⠌⡇⢏⠐⠱⣱⡘⣹⡭⣯⡱⣥⠆⠖⠍⣟⡒⢔⢈ 129 ⣔⣬⡘⡣⡌⢜⢝⡮⠹⠕⢙⢲⣱⣡⠉⠌⠩⠧⢝⠤⡠⢷⣟⠚⣞⣐⡠⣟⠿⣘⡱⢱ 134 ⢸⡴⣩⡹⢒⣆⡆⠒⢢⡭⢏⡭⡐⣐⢛⢃⡝⢏⡰⢺⡍⡽⡢⣭⠵⠞⡒⣯⣝⢾⠑⢜ Total 2048
Well the distribution didn't budge significantly, so AES is a good start, requiring only a nominal number of insertions.
I also changed the way entropy is added, in case it's low quality.
int AddEntropy(uint64_t *p, uint64_t E, int index, int max_mask)
{
    // first pass with the entropy word itself
    int i = Loop_AddEntropy(p, E, index, max_mask);
    if (i < 0)
        return i;
    // second pass with the complement, to compensate for low-quality entropy
    return Loop_AddEntropy(p, ~E, i, max_mask);
}
This doubles the effect of any call while also compensating for bad entropy (like very small numbers or a counter). And Loop_AddEntropy() now refuses to run if E=0.
Here is the LUT initialised with a counter (0 to 255 and complement):
LUT: 128 ⣿⠀⣾⠁⣽⠂⣼⠃⣻⠄⣺⠅⣹⠆⣸⠇⢿⡀⢾⡁⢽⡂⢼⡃⢻⡄⢺⡅⢹⡆⢸⡇ 128 ⣷⠈⣶⠉⣵⠊⣴⠋⣳⠌⣲⠍⣱⠎⣰⠏⢷⡈⢶⡉⢵⡊⢴⡋⢳⡌⢲⡍⢱⡎⢰⡏ 128 ⣯⠐⣮⠑⣭⠒⣬⠓⣫⠔⣪⠕⣩⠖⣨⠗⢯⡐⢮⡑⢭⡒⢬⡓⢫⡔⢪⡕⢩⡖⢨⡗ 128 ⣧⠘⣦⠙⣥⠚⣤⠛⣣⠜⣢⠝⣡⠞⣠⠟⢧⡘⢦⡙⢥⡚⢤⡛⢣⡜⢢⡝⢡⡞⢠⡟ 128 ⣟⠠⣞⠡⣝⠢⣜⠣⣛⠤⣚⠥⣙⠦⣘⠧⢟⡠⢞⡡⢝⡢⢜⡣⢛⡤⢚⡥⢙⡦⢘⡧ 128 ⣗⠨⣖⠩⣕⠪⣔⠫⣓⠬⣒⠭⣑⠮⣐⠯⢗⡨⢖⡩⢕⡪⢔⡫⢓⡬⢒⡭⢑⡮⢐⡯ 128 ⣏⠰⣎⠱⣍⠲⣌⠳⣋⠴⣊⠵⣉⠶⣈⠷⢏⡰⢎⡱⢍⡲⢌⡳⢋⡴⢊⡵⢉⡶⢈⡷ 128 ⣇⠸⣆⠹⣅⠺⣄⠻⣃⠼⣂⠽⣁⠾⣀⠿⢇⡸⢆⡹⢅⡺⢄⡻⢃⡼⢂⡽⢁⡾⢀⡿ 128 ⡿⢀⡾⢁⡽⢂⡼⢃⡻⢄⡺⢅⡹⢆⡸⢇⠿⣀⠾⣁⠽⣂⠼⣃⠻⣄⠺⣅⠹⣆⠸⣇ 128 ⡷⢈⡶⢉⡵⢊⡴⢋⡳⢌⡲⢍⡱⢎⡰⢏⠷⣈⠶⣉⠵⣊⠴⣋⠳⣌⠲⣍⠱⣎⠰⣏ 128 ⡯⢐⡮⢑⡭⢒⡬⢓⡫⢔⡪⢕⡩⢖⡨⢗⠯⣐⠮⣑⠭⣒⠬⣓⠫⣔⠪⣕⠩⣖⠨⣗ 128 ⡧⢘⡦⢙⡥⢚⡤⢛⡣⢜⡢⢝⡡⢞⡠⢟⠧⣘⠦⣙⠥⣚⠤⣛⠣⣜⠢⣝⠡⣞⠠⣟ 128 ⡟⢠⡞⢡⡝⢢⡜⢣⡛⢤⡚⢥⡙⢦⡘⢧⠟⣠⠞⣡⠝⣢⠜⣣⠛⣤⠚⣥⠙⣦⠘⣧ 128 ⡗⢨⡖⢩⡕⢪⡔⢫⡓⢬⡒⢭⡑⢮⡐⢯⠗⣨⠖⣩⠕⣪⠔⣫⠓⣬⠒⣭⠑⣮⠐⣯ 128 ⡏⢰⡎⢱⡍⢲⡌⢳⡋⢴⡊⢵⡉⢶⡈⢷⠏⣰⠎⣱⠍⣲⠌⣳⠋⣴⠊⣵⠉⣶⠈⣷ 128 ⡇⢸⡆⢹⡅⢺⡄⢻⡃⢼⡂⢽⡁⢾⡀⢿⠇⣸⠆⣹⠅⣺⠄⣻⠃⣼⠂⣽⠁⣾⠀⣿ Total 2048
and the result after adding "entropy" (counter from 1 to 256):
256 46 LUT: 136 ⡥⠟⡲⡌⣉⢟⢉⣳⢚⠕⢍⣱⠖⡱⡶⠬⠞⢁⢞⣢⡖⡩⣻⡹⠈⣙⢜⡻⡕⢢⡟⡽ 124 ⢔⢌⠱⠿⣟⡟⠁⣬⠪⡓⣎⣁⠠⢠⣾⠓⠖⣫⠠⣫⡮⠳⠄⣨⠨⣁⠅⣘⣖⣪⢶⡯ 125 ⡰⡅⣋⢦⠠⣛⠽⢔⢏⢸⠴⡙⣟⠤⣂⠩⣉⠽⢮⢅⡠⢚⢵⣴⠢⠣⣝⣸⢟⡡⡂⠫ 118 ⢼⣀⡲⠿⡑⢁⠁⠞⢟⠝⣣⢽⢓⢄⠠⠷⡀⣝⠜⡒⡬⡪⣏⣈⢈⠲⡸⠰⣤⠇⣞⢴ 144 ⡵⢀⢧⣏⠓⢰⡡⢛⡮⣅⣡⡝⣸⢻⡇⢝⢏⢣⣰⡥⣜⢽⢯⠥⢏⣥⣰⡥⣜⢽⢯⠥ 127 ⢎⣥⣰⡥⣜⢽⢯⠥⡰⠚⠏⢚⠣⡂⡐⣚⠫⣼⡔⠔⣍⡋⡥⡎⠲⣿⡖⠠⢵⠏⠥⡎ 120 ⢩⠦⡠⢙⠕⠋⡥⠐⣱⢓⡦⠒⣱⢹⣘⡯⡱⢴⣴⢳⡡⢕⡑⡷⣲⠎⠂⢕⣡⢠⡺⡨ 127 ⡇⢳⡽⡏⠯⡇⡭⢕⢸⣌⢊⠔⣐⢺⣐⠨⢀⠬⢩⡟⣟⡤⢐⠦⡵⣢⢱⣖⣥⠌⡥⡗ 124 ⣿⠗⡂⠹⣒⢟⠑⢺⣈⠕⢯⣱⢌⡀⣺⡂⢻⠝⠚⡍⢕⡱⠙⠤⢓⣹⠰⡼⠢⡊⡜⢬ 118 ⡦⠄⠠⣑⢼⠆⠞⠄⢪⣽⡞⠡⣏⢱⢃⠣⡤⢿⣅⠕⣏⣂⣮⡾⡀⡂⢍⠑⠓⡂⣞⢖ 131 ⢲⣩⡪⣪⢤⢵⠁⡡⠮⣲⣿⢠⠕⡺⢗⢡⢙⠖⠀⡟⣪⢅⡈⡞⢬⠮⣣⡻⣮⢬⠟⣘ 137 ⢬⣍⣣⡻⣯⢬⠟⣘⡒⡣⠈⢤⠑⡃⣰⠧⠾⡸⠜⣱⢙⡋⡚⠣⡞⢧⣼⣑⢵⣧⣝⢩ 123 ⣃⠄⡫⢢⢟⢴⡜⢃⠶⣩⢔⠝⡨⡋⢡⡸⠶⢇⠯⢑⣖⢑⡖⣪⠵⢚⡧⠬⢅⣱⣔⢐ 127 ⣂⣄⢼⣴⠵⢡⠡⠵⠽⢃⡃⠋⣊⡞⣞⣊⠨⣼⢁⢵⠇⢼⠌⢞⣔⢶⠮⡟⢜⣀⣱⡫ 137 ⣱⠻⠫⠇⢌⢻⡶⢙⠨⠠⢍⡼⢣⣮⠀⡮⡳⣟⠧⢙⣭⢕⢽⣲⠳⣁⡼⢏⡆⢘⣷⡣ 130 ⢤⠝⣨⢏⡼⣺⡐⡰⡓⢜⢕⡾⢋⠅⢢⢯⢣⡪⠳⢭⠕⠙⡥⠈⡳⢭⡌⢒⣾⣷⢘⣅ Total 2048
Thus in the worst case of pathologically low entropy, the algorithm needs more iterations, probably 400 or 500, to achieve a semblance of uniform scrambling. A proper "entropy source" would be a whole sentence or "passkey", cycling through chunks of 8 bytes at a time.
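For illustration, a passkey could be fed to the AddEntropy() call above in 8-byte chunks as in the sketch below; LUT64 (the LUT seen as 64-bit words), its size and the max_mask value are my assumptions, not the project's actual interface:

#include <stdint.h>
#include <string.h>

extern uint64_t LUT64[64];   // 256 x 16 bits seen as 64 words of 64 bits (assumption)
int AddEntropy(uint64_t *p, uint64_t E, int index, int max_mask);

// Feed a passphrase to the LUT, 8 bytes at a time.
// Returns the last index used, or -1 if an insertion gave up.
int Seed_From_Passkey(const char *key)
{
    int index = 0;
    size_t len = strlen(key);
    for (size_t off = 0; off < len; off += 8) {
        uint64_t chunk = 0;
        size_t n = (len - off) < 8 ? (len - off) : 8;
        memcpy(&chunk, key + off, n);          // zero-padded last chunk
        index = AddEntropy(LUT64, chunk, index, 63);
        if (index < 0)
            return -1;                         // no fallback, as noted above
    }
    return index;
}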
-
3:2 compression
02/18/2025 at 09:28 • 0 comments
There are 8 non-degenerate boolean operations with 2 inputs. For 3 inputs, the count is much higher so it's harder to choose... And I'm still looking for the magic non-linear operation that is cheap, efficient and balanced.
The goal is to generate R(16) and FG(16) from Z(32) and C(17). For now, R=Y+C and it's underwhelming. The last log proposed this :
diff = X & ~Y;   // bits set in X but not in Y
Y |= diff;       // move them into Y...
X &= ~diff;      // ...and clear them from X
But this creates an imbalance of bit density/probability, with Y > X. It's not a real problem in the LUT datapath except when A=B, where X and Y must have the same value to prevent conflicts.
-----------------------------
Well now that I think of it, if X=Y then X&~Y=0 so there is no imbalance or conflict... D'oh. So it's pretty safe in fact. Now, to prevent a bias in the LUT, A<B must be strictly as equiprobable as A>B. Good luck with that with the lousy LFSR.
-----------------------------
This imbalance could be dithered by using a third parameter (C in this case) that selects, bit by bit, which output gets the bit set while the other output gets it cleared.
diff1 = X & ~Y & C;   // where C=1: bits to move from X to Y
diff2 = ~X & Y & ~C;  // where C=0: bits to move from Y to X
Y |= diff1;  X &= ~diff1;
X |= diff2;  Y &= ~diff2;
It's cheap to implement as physical gates but it takes a bunch of instructions to execute on a CPU...
-----------------------------
Then there is another strategy with two "reciprocal MUX", or a "swap" operation: for each bit, C selects whether X and Y are swapped. That's a good way to get a well scrambled R and FG.
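A sketch of that per-bit swap, using the classic XOR-mask trick (my own formulation; variable names are only for illustration):

uint16_t t  = (X ^ Y) & C;  // positions where C requests a swap and X, Y differ
uint16_t X2 = X ^ t;        // gets Y's bit where C=1, keeps its own where C=0
uint16_t Y2 = Y ^ t;        // reciprocal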
-
v4
02/18/2025 at 01:10 • 0 comments
v4 is slowly taking shape, trying to replicate the success of sharing the addition between two consecutive words: as for the indices A & B, R and FG are treated in one single 32-bit operation, where C is only 17 bits wide, so there is little chance of carry-out.
The initial carry-in comes from (S&1) and it's bad. The little LFSR8 gets squeezed to extract 14 bits, including the LSB that is reused 3 times, which is not good® !
Can you spot the other changes?
uint16_t Scramble_round(uint16_t D)
{
    uint8_t S = LFSR8;                 // sample the LFSR state
    LFSR8 >>= 1;
    if (S & 1)
        LFSR8 ^= 0x8e;                 // Galois LFSR8 step
    uint32_t C = D + FG + (S & 1);     // 17-bit sum: data + feedback + carry-in
    uint8_t A = C ^ S;                 // first index, whitened by the LFSR
    uint8_t B = C >> 8;                // second index
    uint16_t X = LUT16[A];
    uint16_t Y = LUT16[B];
    uint32_t Z = X | ((uint32_t)Y << 16);        // merge both lookups
    Z = (Z >> (S & 31)) | (Z << ((-S) & 31));    // 32-bit barrel rotation
    uint32_t R = Z + C;                // 32+17 bits
    FG = (uint16_t)(R >> 16);          // feedback for the next round
    LUT16[A] = (uint16_t)Z;            // write both halves back
    LUT16[B] = Z >> 16;
    return (uint16_t)R;
}
The code looks shorter and simpler, which is not necessarily better, but faster is good, and this is possible with fewer byte-wide operations.
I have removed the once-defining byte-dephasing structure. In fact the rotation already does quite a lot, but now the result R is only "protected" from the whole 16-bit word written back to the LUT by a (not so simple) addition.
I have an idea to apply a non-linear boolean operation that preserves the number of bits. This would "swap" certain bits from 2 parts of the Z word:
diff = X & ~Y;
Y |= diff;
X &= ~diff;
- On the LUT datapath, before writeback, it relies on A!=B, which would introduce an Enigma-like flaw 0.4% of the time.
- On the FG / R output path, it would make an inappropriate imbalance of bit values.
Well, it's a start...
-
ARX
02/16/2025 at 19:58 • 0 comments
v3 already has significant enhancements over a "pure ARX" design or even RC4:
- The LUT is NOT a byte permutation. It's a bit permutation. This makes the output useless to recover the input/index.
- The index addition for A and B is chained/merged hence co-dependent. Getting one index from the output does not directly provide the other index, particularly since A is XORed by the LFSR.
- The LUT is an expansion table and only one half is output, the other is fed back.
But it's not perfect yet.
https://en.wikipedia.org/wiki/Rotational_cryptanalysis describes the class of cyphers based on the Add/Rotate/XOR operators, ARX in short. It is the basis of Salsa20, Speck, ChaCha20, Threefish, BLAKE, Skein, CubeHash...
Only the (truncated carry) Add is "not linear", so the cypher requires a lot of rounds. It also hints at how its cryptanalysis works, in particular because XOR, ADD and ROT are reversible. Some other cyphers combine non-reversible operators: OR and AND. v4 would benefit from this.
Since the design of #PEAC I have a fondness for End-Around-Carry, which prevents the LSB of the addition from degenerating into a simple XOR that is easy to correlate. So adding a carry-in and carry-out to the design will strengthen it. That's a second enhancement for v4.
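As a reminder of the effect, here is a minimal sketch (my own illustration, not v4 code) of a 16-bit addition with end-around carry, where the carry-out is folded back into the LSB:

#include <stdint.h>

// One's-complement-style addition: the carry out of bit 15 is added back
// into bit 0, so the LSB is no longer a plain XOR of the operands.
// With two 16-bit operands a single fold is enough.
static inline uint16_t add_eac16(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;      // 17-bit intermediate sum
    return (uint16_t)(s + (s >> 16));  // fold the carry back in
}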
A third easy enhancement is the replacement of the lousy LFSR8.
-
Initialisation
02/12/2025 at 19:54 • 0 comments
One way to provide entropy to the LUT is to connect the entropy source to the block's input. The quality does not need to be great but the effect is slow.
Now if you have a block of external data (a network packet, a keystroke, the usual deal), you may want to hit the LUT directly. It should be faster, right ?
512 bytes is also 128× 32-bit words but you can't write them like that into the LUT. You have to respect the balance between 1s and 0s. Thus given an entropy word, you must XOR two locations: if you flip two bits (one from 0 to 1 and one from 1 to 0), the overall bitcount is preserved.
Well, no, it doesn't work like that.
XORing is the easy bit of course. But keeping the bit balance is something else : one could use popcnt (__builtin_popcountl(x) if you want it to work with GCC) twice (once before and once after the XOR) but this would be slow, so let's turn the difficulty into an opportunity.
Let's have our entropy word E that XORs a value L in the LUT. This gives the new word N=E^L. Fine.
- Some bits will be cleared by E: they are given by R = E & L.
- Some bits will be set by E: they are given by S = E & ~L = E & N.
The trick is to not bother with counting how many bits go each way (keeping a count would be cumbersome), and another beautiful thing is: we don't care how many words we will have to scan before finding a location where a corresponding bit can be toggled.
So we scan the words located after the place we just XORed, and check whether each one fits a mask that at least partially matches what has just been done.
For each new word N, we try the masks S and R:
- S shows which bits were set, so now it indicates which ones must be reset elsewhere: S2 = S & N.
- R indicates which bits were cleared, so now it indicates which ones must be set: R2 = R & ~N.
Now get X = R2 | S2 and you can get N2 = X^N, which can be written back to the LUT.
And don't forget to update the masks:
R &= ~R2
S &= ~S2
If R | S is not 0 then move on to the next word.
Of course this assumes that the LUT is already "properly initialised" with enough entropy. The word or the mask could be rotated once in a while (to increase the chances of a match).
This also promotes the use of very long words, like 64 bits, to amortize the costs.
Oh and checking the actual number of set bits in the LUT is a safety measure that must not be forgotten.
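Here is a minimal sketch of the scan described above, under my own assumptions about naming (LUT64 being the LUT seen as 64 words of 64 bits) and with a single wrap-around pass; the real implementation is in AddEntropy.tgz below:

#include <stdint.h>

#define LUT_WORDS 64                 // 256 x 16 bits seen as 64-bit words (assumption)
extern uint64_t LUT64[LUT_WORDS];

// Returns the index following the last modified word, or -1 when the
// compensation masks could not be emptied within one full scan.
int Insert_Balanced(uint64_t E, int index)
{
    uint64_t L = LUT64[index];
    uint64_t N = E ^ L;              // plain XOR at the starting location
    uint64_t R = E & L;              // bits that went 1 -> 0: must be set again elsewhere
    uint64_t S = E & N;              // bits that went 0 -> 1: must be cleared elsewhere
    LUT64[index] = N;

    for (int scanned = 1; scanned < LUT_WORDS && (R | S) != 0; scanned++) {
        int i = (index + scanned) % LUT_WORDS;
        uint64_t W  = LUT64[i];
        uint64_t S2 = S & W;         // set bits that can be cleared here
        uint64_t R2 = R & ~W;        // cleared bits that can be set here
        LUT64[i] = W ^ (R2 | S2);    // toggle them in one XOR
        R &= ~R2;                    // shrink the remaining masks
        S &= ~S2;
        index = i;                   // remember the last touched location
    }
    return ((R | S) == 0) ? (index + 1) % LUT_WORDS : -1;
}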
--------------------------------------------------------------
Actual code : it works well !
When feeding the "entropy" 0x123456789ABCDEF 256 times, I get this result:
⢾⢓⡿⣣⣿⡣⣪⠗⠚⠐⠅⠃⠍⠀⢀⠃⠚⢐⠭⠣⠿⠀⢈⠇⡿⡿⣶⣿⣿⣿⣿⣿ ⡥⡯⢺⣔⣉⢯⡷⡸⡯⡬⢾⢴⣊⣘⣿⠽⢈⢙⡛⠯⣟⣶⡦⣖⢹⡨⡃⣑⣸⡈⡁⣑ ⢻⡨⠉⣭⣻⣐⣇⣱⡖⢟⢦⠦⠤⢧⡘⣏⡝⣿⣏⣓⠜⠯⡑⠆⢪⢈⠲⠬⢀⣐⣗⡩ ⡜⡧⣕⣓⣾⠯⣭⢞⢨⠝⡀⠂⢈⠄⡡⣕⣅⠔⡂⠨⠂⣺⣌⠤⠺⣟⢥⣷⡿⢅⠑⣙ ⣄⡻⢾⢑⡉⠫⠂⡊⠑⢡⠒⠌⢵⡋⠻⢗⣬⣞⣕⣻⢘⠱⡆⠮⢯⠺⢁⣃⠉⠬⡈⢐ ⡩⡁⡈⠐⠶⠓⠰⡛⠵⠆⢄⢁⢙⢮⠃⠋⣈⠻⡂⡒⢯⠩⡺⣖⠹⡀⠰⢀⠦⠆⠉⠈ ⢅⣏⢵⢛⢘⠿⡀⠘⡠⠸⡼⡐⣇⠐⡆⠗⣥⢕⠁⢂⠂⠂⠀⠜⡀⠦⡀⠁⠉⠍⠀⠂ ⠙⡠⡌⠣⠉⠭⠉⠂⣼⢟⢡⣐⢔⠂⡰⡉⣜⡸⡦⠐⢀⠠⡱⣋⢡⣜⢄⣂⢞⠆⡊⢀ ⡬⣫⡄⢑⢡⠨⡈⠊⣅⣿⠔⡂⢌⠙⡬⠒⣀⣷⠄⡂⠮⠉⠐⣃⣍⣾⠒⡊⠝⠘⡞⠂ ⠰⠀⣙⢙⡒⡦⢁⡈⢕⠂⠢⢪⠸⠄⡍⣘⠘⡩⡝⠕⠆⢠⠱⡈⣧⢞⢀⣮⡘⠏⣀⠢ ⠠⡭⡅⠕⣆⢢⠠⢈⡗⣷⣟⣻⣮⠛⣍⡕⢸⠁⠢⠊⠡⡤⢋⢾⢱⠀⠒⢨⣶⠳⢕⣗ ⡟⣿⣱⣿⣚⣜⡷⢸⢀⠀⠤⠈⠀⡢⣈⠢⡏⣿⣵⣿⢺⢜⣷⠝⣹⡷⣷⡿⡖⣻⣮⣸ ⢕⣕⢱⠖⠕⡧⠈⠊⡸⠢⡚⣭⣪⣘⢲⣵⣿⣷⠧⣮⡻⣡⢯⣏⠀⠩⣐⠙⢄⢒⣄⠢ ⣂⠉⠹⠼⡂⣐⢆⢠⠧⣶⣚⣫⡏⣽⢆⠪⡍⢯⡤⢏⣼⣇⡰⣗⠳⢨⡄⣾⠔⣣⣨⠸ ⣞⡖⠻⡭⡫⣜⢞⣣⠙⠔⠾⢎⣠⣀⢻⣉⠑⠝⠲⣞⣻⡐⢿⣽⣆⡲⣳⠯⢦⣽⢟⣧ ⡗⣫⣑⠏⣎⢿⢯⡖⣸⣯⡺⣬⣷⣛⢕⡩⣲⡳⡚⣬⡦⣋⣿⣽⠟⠈⢿⠾⡪⢴⣿⣷
And the starting point was the "worst case" so in practice fewer than 256 words would be necessary. I played with a few other initial bit patterns and it seems to work too.
Anyway, at least input some LFSR, congruential thing, or varying numbers, because a constant input creates extra clusters & patterns that sometimes can't be broken; in the very worst case, the function returns -1 (lacking any fallback method).
code here : AddEntropy.tgz
-
v3
02/12/2025 at 09:55 • 0 comments
Leaks.
v2 still leaks internal states.
Well, data must come from somewhere anyway, right ?
And when there is no external entropy source, we're left with this lousy LFSR8 to run the show in the worst case.
Anyway, this system is not encryption because there is no key, just an evolving internal state.
Now, as a standalone PRNG, it's getting better but the internal state could still be somehow deduced. The original intent is for the algo to be a scrambler, so it expects "some sort of entropy" at the input, amplifying it along the way with a huge RC4-like table. So it's "just a filter" that isolates one PRNG from the outside. Yet I still need an internal 8-bit "entropy source"...
It appears that the link with cryptography is strong though, as trying to make this system "more random-like" means trying to "break it" like a cryptographer would do. So here I am, bikeshedding it to oblivion, not even caring what the result will be or what it would lead to.
So my concern is: the result is a direct reading of the table, merging 2 independent half-reads. It's fine as long as the table is considered "unknown and unknowable". But considering that it takes several hundred cycles for the table to be well shuffled/scrambled, that's also as many values that can be directly extracted and correlated, hence the output could be one step closer to being predicted.
So the output should not be a direct read of the LUT. It's sad since it's good to have an algorithm with very low latency, as it cascades well.
- One way forward is to extract the 16-bit result/output from the 32-bit output of the barrel shifter. But this only moves the problem somewhere else since this value will be written to the LUT as well.
- Another method would be to XOR the output of the B LUT, since the A LUT is XORed at the address input. Recovery from 0-entropy would be faster but the LUT would stop being a "bit permutation" area, which would upset the strict balance of 1s and 0s throughout the table, and also require an even larger 16-bit LFSR...
- Of course the algorithm could be run several times before delivering its final results. Loop over the LUT 2, 3, 4 times and XOR the results together. XOR or ADD is possible because it does not have to obey the bit count preservation rule.
...
Overall the whole algo (like RC4) is mostly "diffusion" and there is little "confusion" happening, except for the address calculations. The barrel shifter is a small-scale diffusor while the LUT is a large-scale diffusor. Confusion is meant to happen in the LUT thanks to a well-brewed table, but only from input to output.
Moreover, there are 32 bits to scramble and the barrel shifter only uses 32 combinations out of the factorial(32) ways to shuffle the bits around. We have an 8-bit LFSR and even then only 5 bits are used.
So far the current solution is to take 16 bits out of the barrel shifter and add the pair of addresses (before the XOR).
There is one more adder yet the diagram looks simpler.
R is a function of the current and last pair of lookups. Why an addition ? It could be a XOR but I'd like more avalanche.
No data enters the LUT, which is still a pure permutation of bits. A better shuffling would be welcome.
Source is there: ATBSS.v3.tgz
Mixing (worst case) is not complete after 2k iterations but ok at 4k.