As shown in the previous log A promising 32-bit checksum, I have come up with a nice opcode sequence that fits great with a 80(2)86:
; Checksumming 4 bytes in 6 opcodes ; X=BX, Y=DX lodsw ax addc dx, bx xor bx, ax lodsw ax addc bx, dx xor dx, ax
The magic is with the opcode LODSW that both loads a value from DS:SI into AX, and increments SI. It could be coded otherwise with fewer opcodes but still contains as many actual operations:
; Checksumming 4 bytes in 6 opcodes ; X=BX, Y=DX addc dx, bx xor bx, [di] addc bx, dx xor dx, [di+2] ; and here you have to deal with ; manually incrementing DI etc.
The rest is some basic dataflow graph hacking, as shown below :
The question is open whether this is still valid for 32 bits registers or 64 bits. I know it's not working for 8 bits and I'll have to research... If needed.
So far, it's a nice little hash that fits the bill for small datagrams, with very little init to do, only one temp register, the X and Y can be merged (XOR) to give a 16-bits checksum, or just left as is and concatenated to provide 32 bits.
Working with carry is possible with x86 or POWER processors, but painful with Alpha/MIPS/SPARC/RISCV style RISC cores.
OTOH this updated structure makes the digital circuit even smaller and easier !
- The XOR uses fewer gates than a full adder, it's easier to place&route...
- Following the teachings of Robert Jenkins, this transform is fully reversible: this garantees that no information is lost, hence ALL input data leave a trace in the checksum. New data do not "overwrite" older ones so the block size is not a first order consideration.
- No MUX ! Swapping occurs at the end of the "round" (let me borrow block cypher terminology ;-) ) and there is no MUX or weird bus crossing.
- This circuit is placed between the inputs and outputs of 33 D-FlipFlops, which may not have set or reset inputs. I added the OR and AND gates in the critical datapath to "inject" the constants X=1 and Y=0 in the circuit but your implementation may dump them at will.
- I don't see how it could be faster...
- Add a small LUT and you start to build a Feistel round for a dumb block cypher :-D
But as noted before : I only aim at minimalism and efficiency, that is : the best detection and speed per gate or instruction. If one bit toggles at the input, I only care that it toggles at least one output bit. And this is already a win: so far, this checksum surpasses ADLER32 and Fletcher, without requiring more resources, and it is almost as good as CRC32 without the hassles.