At this moment, the search for w26 is only at the 73% mark so it's not officially "good" but very probably so. I doubt I can ever confirm or refute w32 so, this far, w26 is the best I can have confidence in. And w32 is very unlikely "good" anyway. So what can I do with w26 ? And how ?
I think it's not as useless as I originally thought, on the contrary, and even more so since my last encounters with GCC. I want something that is quite simple, robust, and which works anywhere without ridiculous compiler wrangling and weird gymnastics.
Remember that even though w26 works on 26 bits, its period is double that ! So even with a pair of 32-bit registers, I get a 2⁵⁶ period without effort. This should be enough, right ? And with such a size, you'd be wiser to chunk your data, such as 16M blocks just to be safe.
So let's see how to design a checksum that works well for 32 and 64 bits, using 32-bit wide registers.
- First, we deal with a 26 bits wide accumulator/adder, but we can see that it's ridiculous to scrub the remaining 6 MSB, because they contain useful entropy. The existing codes mask the MSB, mostly for convenience (in shifting the carry back to the LSB), but this would work against the checksum's robustness, so let's use the whole 32 bits now. There is no harm in using a bit that is not the MSB for the carry loopback.
- One would be tempted to use the whole 32 bits of the register to mix the input data stream, but even though it would double the bandwidth, I hesitate and I prefer sticking to 16 bits of mixing, despite the extra alignment/masking/shifting mess. With this arrangement, the system has one chance in approx. 2¹⁰ (1024) to fall in a "funnel" (when the state falls in a "trajectory" that becomes undistinguishable from an orbit, making the system non-reversible hence more likely to lose information).
So far we have split the 32 bits into 3 zones :
- 6 MSB : sort of "guard bits", they come for free and are good to keep around.
- 10 middle bits : the MSB of the 26-bit Pisano iterator
- 16 LSB : where the iterator gets its state mixed.
There is some headroom, and should work well in most cases. But we have to solve one last issue : we can't get the carry bit.
In fact we can't get it directly because the previous methods would either mask the "guard" MSB or shift the computation such that the carry bit appears on the carry flag. But this bit is not lost. The iterator virtually operates on the bits 0 to 25 so we could detect a carry when the bit #26 changes state. So the carry is computed by
(((X^Y)>>26) & 1)
which is a decent code sequence on any decent 32-bit CPU.
In hardware, it's even more simple, particularly if you can access intermediary results from the adder's internal nodes.
The corresponding source code adds some latency, there is a string of 3 dependent instructions which reduces the speed a bit, but it's not horrible either. Furthermore we can use ADDs everywhere, and not worry about missing a carry in the MSB anymore in case we add the carry separately. Overall, it's a good design gain and the whole thing, whether you use 32 bits (the last result) or 64 bits (the last result concatenated with the precedent) has a lot of headroom.
From there, we can make this macro:
Z = Z + W + dat16 + (((Z^W) >> 26) & 1)
Then substitute Z and W for X and Y alternatively.
PISANO_MACRO(Y, X, data) PISANO_MACRO(X, Y, data)
There is still room for "enhancement" with the x86 breed, using the RCx (Rotate through Carry) opcodes. For example RCL 6 will put the bit 26 into the carry bit, and you just need to ADC to save one ADD opcode. So the sequence would be
MOV T, X XOR T, Y ADD Y, dat RCL T, 6 ADC X, Y
I wish I could avoid the first MOV... but I have to admit its inevitableness and test the hell out of it....