Log 24 explored a little enhancement: a marginally "safer" version that would compensate for the altered carry bit leading to a "funnel". The result was not stellar and took a huge hit on performance. Meanwhile I'm still looking for a method to deal with 32-bit-wide data to further increase the performance-per-cycle, but the theory is pessimistic on this front.
Today, a new idea percolated and it might bring both "safety" and "performance" together by reusing the existing building blocks in new, clever ways, with the magic of the fractional format. Indeed when we look at the classic code:
t = X + Y + C; Y = X ^ data; C = t >> 16; X = t & 0xFFFF;
We see that the useful values are in the least significant bits. There is no need to think further if the platform is 16 bits wide or supports this computation mode: the result masking and the carry shifting are performed implicitly by the hardware. It gets ugly with wider registers, because no language (apart from ASM) lets you access the carry bit, so you have to clean up after the computations.
But we can save the masking by pre-shifting the values! For example, with a 32-bit register, the above code would work in the upper half, the 16 MSB. The new code becomes:
t = X + Y + C; Y = X ^ data;
Wow! We shaved off half of the code! It's now even simpler and leaner! And we can deal with 16-bit data in the LSB half, out of reach of the dirty funnels. Carry propagation from the lower half will only happen every other time, with a very, very low chance of creating a funnel. The problem has not disappeared but, as they say, "engineering is the art of pushing problems where they are not an issue".
So we have the new split design: the lower 16 bits get the mixed data, the higher bits run the usual PRNG sequence with a few nudges while avoiding the known theoretical traps, and the checksum can now be as wide as 64 bits (when X and Y are concatenated).
The scheme can be repeated with wider registers: for example, it becomes easy to process 32 bits per step with a 64-bit register, and so on. As long as you keep the Pisano machinery in the MSB, you can arrange the data as you like. I'll let you figure out how to deal with 48-bit chunks of data ;-)
But wait! Where has the C variable gone? In the following line:
t = X + Y + C;
C is usually the carry generated by the addition, which becomes useful again with the new MSB alignment. So this is solved by reusing this carry flag. End of story, right ?
Weeeeeellllllll, the carry should be inserted at the bottom of the Pisano field, at bit 16 for the 32-bit version, so the above code is misleading: no architecture can do that directly. x86 has a CMOV instruction and ARM has (or had?) predicated opcodes, so it is still possible to conditionally add 0x10000 to Y, for example.
But in C, we don't get that sweet little carry bit. The wraparound could be detected by a HAKMEM-esque trick, such as
C = ((t < X) && (t < Y)) ? 0x10000 : 0;
C = (((t - X) & (t - Y)) >> 16) & 0x10000;
But this increases the computation time and the loop size, and introduces a few off-by-one issues.
Furthermore, we might miss the cases where a carry is generated in the intermediate result of the addition (because, apart from x86 with its LEA opcode, no other architecture provides a 3-addend addition, and IIRC LEA doesn't even affect the carry flag).
So once again we're back to the original code, where the mask & shift can work in parallel. And by not masking C, we can feed some of the LSB sum back into the LSB to mix with the incoming data, increasing the mixing further...
t = X + Y + C; Y = X + (uint16_t)data; C = t >> 16; X = t & 0x7FFFFFFF;
In this version, the 16-bit Pisano part sits in bits 15-30. Bit 31 holds the carry, and the MSB of the 16 data bits eats into the Pisano LSB, but that is not considered a significant problem here.
At this point, not masking the MSB of X is tempting, but that would discard all the already-explored maths. I'll have to test that...