(Previously posted in the main detail page of the project)
When you need to reduce a sequence of data to a single number, there are roughly three families of algorithms with varying trade-offs. Their names are sometimes used interchangeably, but I try to associate them correctly here:
- Checksums output a small number, usually 8, 16, or 32 bits, where "hopefully" at least one checksum bit changes if a data bit changes. Checksums must be really fast to compute and use the least possible resources. If the checksum doesn't match the data, an error is detected, the data is discarded and an error is raised. Due to the small size of the signature, certain alterations will inevitably be missed, but this is not the main constraint. Such a small size corresponds to an "error model" with a very low BER (Bit Error Rate) punctuated by catastrophic failures, mainly hardware faults or the effects of a computer bug. Causes may be a badly seated connector, vibrations, a power supply glitch, a hot neutron that flips one or more RAM bits, EMI (ElectroMagnetic Interference) or a faulty medium (a scratch on a magnetic or optical surface), to name a few. In other words: checksums protect small to medium data blocks during storage and/or transmission.
- Hash values are more complex because they are computed over small to medium blocks, called "keys", and index a hash table that must be evenly filled. This adds statistical constraints on the transform from the data to the hash, such as: changing one input bit should ideally change half of the output hash bits. Hashes are usually 32 to 64 bits wide, sometimes 128 bits to reduce collisions even further (rule of thumb: double the size to resist a Birthday attack). A subset of the hash bits should be suitable to address a hash table with fewer entries. The algorithm may be reversible and/or linear (like FNV) but newer hashes are designed to resist hash-flooding attacks. In practice, this category of algorithms applies to higher-level processes than checksums, but it is still "under the hood" in parsers, compilers, interpreters, databases, domain name servers, Internet routers, anti-spam filters, spellcheckers...
- Cryptographic Message Digests, or digital signatures, must also change exactly half of the output bits when a single input bit changes, but they add even more severe constraints to allow authentication and prevent extremely sophisticated forgeries. The process is slow, complex, not "easily" reversible and highly non-linear, the statistics are critical, and the signatures are at least 128 bits wide, with 256 bits getting common. These algorithms undergo intense scrutiny, follow strict norms and are more visible, or at least sit at higher software levels.
In this project, we deal with the first, "dumb" class of reduction algorithms, because checksums ensure the integrity of many low-level protocols. Their speed, size and simplicity dictate the performance of the systems, where cost is also a big constraint. Another attribute is the potential for parallelism. Checksums usually don't bother locating or correcting the detected alteration(s): for "forward error correction", SEC/DED or Reed-Solomon codes must be implemented.
The most widely used checksums are, in increasing order of complexity and integrity (list slightly borrowed from Wikipedia):
- XORsum: XOR all the bytes or words in parallel.
Crude and fast, but it misses countless trivial cases. Bleh.
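For reference, here is what that looks like (a minimal C sketch, function name is mine):

```c
#include <stdint.h>
#include <stddef.h>

/* XOR all bytes together: the crudest checksum.
   Two flips of the same bit position cancel out, and
   reordering the bytes changes nothing. */
uint8_t xorsum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= data[i];
    return sum;
}
```

Swapping two bytes, or flipping the same bit in two different bytes, yields the exact same sum: those are the "countless trivial cases" it misses.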
- Add: see RFC1071, Computing the Internet Checksum...
This is the historical origin of the word "checksum", because all the data are added together. It's a bit better than XOR because additions cause a little "avalanche" through the carries, but they are commutative and can't distinguish swapped bytes, for example. It's used for the TCP checksum, among others.
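The Internet flavour folds the carries back into the sum (one's-complement addition). A sketch, assuming an even byte count and ignoring byte-order details that RFC1071 discusses at length:

```c
#include <stdint.h>
#include <stddef.h>

/* One's-complement sum of 16-bit words, in the spirit of RFC 1071. */
uint16_t inet_checksum(const uint16_t *words, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += words[i];
    /* fold the carries back into the low 16 bits */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Since addition is commutative, shuffling the words leaves the result unchanged: the flaw mentioned above.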
- Fletcher: see RFC1146, TCP Alternate Checksum Options
It tries to address the commutativity issue by working with two chained sums. The resulting checksum size is doubled but it still contains many obvious flaws.
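The chaining is what makes it position-sensitive: the second sum accumulates the running value of the first, so a byte's contribution depends on where it sits. A Fletcher-16 sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Fletcher-16: two chained sums modulo 255.
   sum2 accumulates sum1 at every step, so swapped bytes
   produce a different result (unlike a plain addition). */
uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t)((sum2 << 8) | sum1);
}
```

(Real implementations defer the modulo over many iterations for speed; the per-byte reduction here is only for clarity.)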
- Adler: see RFC1950, ADLER32
More robust than Fletcher thanks to its modulo-prime operation, but it still has some flaws.
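The structure is the same as Fletcher's; the difference is the modulus, a prime, which mixes the values a bit better. A compact Adler-32 sketch (zlib's real code batches the reductions):

```c
#include <stdint.h>
#include <stddef.h>

/* Adler-32 (RFC 1950): Fletcher's two chained sums, but
   reduced modulo 65521, the largest prime below 2^16,
   and with sum "a" seeded at 1 instead of 0. */
uint32_t adler32(const uint8_t *data, size_t len)
{
    const uint32_t MOD = 65521;
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % MOD;
        b = (b + a) % MOD;
    }
    return (b << 16) | a;
}
```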
- CRC: Cyclic Redundancy Check, based on binary polynomial division
This is the go-to standard for data integrity, speed and complexity: it catches many types of errors and it is very flexible, but it has its own kinds of flaws (such as a stuck state at 0). It's used practically everywhere, in many forms.
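Since CRC32 comes up again below: the classic byte-wise version relies on a 1KB lookup table, but the same polynomial can be computed one bit at a time with no table at all, at the cost of 8x more iterations. A sketch of that bitwise form (the reflected IEEE 802.3 polynomial used by ZIP and zlib):

```c
#include <stdint.h>
#include <stddef.h>

/* Table-less, bit-at-a-time CRC-32 (polynomial 0xEDB88320, reflected).
   The all-ones initial value avoids the stuck-at-0 state:
   with a zero register, leading zero bytes would go unnoticed. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}
```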
- Multiplicative/polynomial checksums, such as FNV and countless others.
A bit better, but the multiply makes them slower than Adler.
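FNV-1a is the canonical example: one XOR and one multiply by a prime per input byte. The multiply spreads each byte's influence across the whole word, which is exactly the "confusion & diffusion" discussed below:

```c
#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
   The multiply diffuses every input byte over all 32 output bits,
   at the price of one multiplication per datum. */
uint32_t fnv1a(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;      /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;            /* FNV prime */
    }
    return h;
}
```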
It's always a compromise. Adler seems to strike a good balance of speed, integrity and complexity, as it is used in zlib, while CRC32 is used for the ZIP format. CRC32 used to be everywhere, but its 1KB lookup table wastes room and becomes a bottleneck on the most recent CPUs.
Multiplicative checksums are a bit too heavy for only a marginal gain in sensitivity to alterations. They might be suitable for hash tables because their "confusion & diffusion" is higher, but multiplies are still not free, and computing one per input datum is quite extreme. Multiplies can be emulated by "shift and add" operations, but a barrel shifter (even used for a fixed amount) is still not free in a processor (though it is a simple rewiring of the bits when implemented in hardware).
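To illustrate the "shift and add" emulation: a multiply by a known constant decomposes into one shift-and-add per set bit of that constant. Taking the FNV prime as a hypothetical example (function name is mine):

```c
#include <stdint.h>

/* Multiply by the FNV prime 16777619 = 0x01000193 using only shifts
   and adds: 0x01000193 has bits 24, 8, 7, 4, 1 and 0 set, so the
   product is the sum of six shifted copies of x. This is roughly
   what a compiler emits on a multiplier-less core. */
uint32_t mul_fnv_prime(uint32_t x)
{
    return (x << 24) + (x << 8) + (x << 7) + (x << 4) + (x << 1) + x;
}
```

Six adds and five shifts per input datum: cheap in hardware, where each fixed shift is free wiring, but a noticeable serial chain in software.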
The domain of checksums seems to have been frozen for the last few years, and I hadn't seen an algorithm that is as simple as Fletcher while being almost as good as CRC32.
Then I listened to a Piano Magic song yet another time...