Adding bytes together modulo 256 (for example) is not enough. The problem is the MSB : carries from lower bits or more MSB from other data items will overwrite the data, and/or "carry out/away" the datum. This is a well-known problem that Adler solves with a modulo-prime accumulator, so the MSB is fed back to the LSB "in a certain way". But this does not come for free.
So let's see what comes for free : use the widest possible registers and keep some headroom to store the entropy that would be "carried out" at the MSB. Ideally we use a 64-bit number where the checksummed data is accumulated at the lower half (32 bits at a time, screw endianness !) and the remaining high 32 bits work as an entropy buffer. At the end of the checksum loop, the two halves can be combined with a simple XOR if fewer than 64 bits are required.
uint64_t X, Y; uint32_t *buff; // init: X = 1; Y = 42; // loop body: Y += (X + buff[i ]); X += (Y + buff[i+1]); i+=2; // finish : Z is the LAST destination register. Result = (uint32_t) (Z ^ (Z>>32));
Experience says that the tricky part is the cast from a smaller data size but RISC processors should handle this well:
- Either you have a 32-bit CPU and the 64 bits are split into a lower and higher register. The higher register gets updated with a "Add with carry" opcode whenever available.
- Or you have a 64-bit CPU and the core first loads the 32-bit datum with the appropriate instruction "unsigned load 32 bits" that aligns the datum to the LSB of the register and clears the higher half.
For x86, it's not exactly easy but the 32-bits version is trivial thanks to the "ADDC" opcode, so it's safe to assume the trick creates no architectural bottleneck, for superscalar or OOO cores.
Some approximate calculations show that it takes about 44 iterations (or 22 X & Y passes) to fill 32 bits starting a value 1. http://oeis.org/A000045/list lists only until the 40th value (which is about 100 millions). But this is not really a problem because lower bits of the enormous number will still linger inside these 32 bits, as deformed echoes from the value.