Swap, parity etc.

Loop unrolling should provide a significant speedup, at least for Pentium-class computers (superscalar and such). The big question so far is how many steps a loop should contain and, to echo the end of the last log, whether this number should be odd or even.

For more advanced core families, the bottleneck will be the memory bandwidth and the extraction of 16-bit words from memory. Even this can be mitigated on a 32-bit platform by doing the chunking ourselves.

Let's take the basic element of the unrolled loop, which computes X and Y:

// odd
Y += X;
X += *(buffer+n);
// even
X += Y;
Y += *(buffer+n+1);
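
To check the arithmetic by hand, the two steps above can be wrapped in a small function. This is only a sketch: the name `mix_words`, the 32-bit accumulators and the absence of any renormalisation are assumptions made for the example, not the project's actual code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: runs the odd/even pair over `count` 16-bit words
   (count is assumed to be even). No renormalisation is done here, so the
   32-bit accumulators simply wrap around on overflow. */
void mix_words(const uint16_t *buffer, size_t count,
               uint32_t *px, uint32_t *py)
{
    uint32_t X = *px, Y = *py;
    for (size_t n = 0; n + 1 < count; n += 2) {
        /* odd */
        Y += X;
        X += *(buffer + n);
        /* even */
        X += Y;
        Y += *(buffer + n + 1);
    }
    *px = X;
    *py = Y;
}
```

Starting from X = 1 and Y = 0 with the words 1, 2, 3, 4, the steps give X = 12 and Y = 10.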

Both pairs of instructions can execute in parallel without hazard, but this relies on the x86's ability to fetch from memory and compute in the same instruction. This hits a limitation, however: x86 can't handle both 32-bit and 16-bit data in the same instruction easily or quickly (damned opcode prefixes!). In fact, I know of no architecture that can do this natively, without using some sort of adaptation instruction.

Even the Pentium has this limitation, where the dreaded 0x66 and 0x67 prefixes inhibit superscalar execution. In fact, the mixing step mixes 2 sizes in one line, adding a uint16_t to a uint32_t accumulator. That's a ticking bomb...

Fortunately this is easy to solve with the use of a temporary memory-caching register:

uint32_t M = *(uint32_t *)(buffer+n); // one 32-bit load covers two 16-bit words (little-endian order assumed)
// odd
Y += X;
X += (M & 0xFFFF);
M >>= 16;
// even
X += Y;
Y += M;
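
One way to convince ourselves that the split version computes the same thing as the mixed-size one is to run both over the same buffer. The sketch below makes several assumptions of its own: the names `mix_16` and `mix_32` are hypothetical, the target is little-endian, no renormalisation is done, and `memcpy` replaces the pointer cast to sidestep alignment and aliasing questions (compilers usually turn it into a single load).

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Mixed-size version: 16-bit words are added straight into 32-bit accumulators. */
void mix_16(const uint16_t *buffer, size_t count, uint32_t *px, uint32_t *py)
{
    uint32_t X = *px, Y = *py;
    for (size_t n = 0; n + 1 < count; n += 2) {
        Y += X;  X += buffer[n];       /* odd  */
        X += Y;  Y += buffer[n + 1];   /* even */
    }
    *px = X;  *py = Y;
}

/* Split version: one 32-bit load per pair of words, then mask and shift.
   Assumes a little-endian target, so buffer[n] sits in the low half of M. */
void mix_32(const uint16_t *buffer, size_t count, uint32_t *px, uint32_t *py)
{
    uint32_t X = *px, Y = *py;
    for (size_t n = 0; n + 1 < count; n += 2) {
        uint32_t M;
        memcpy(&M, buffer + n, sizeof M);
        Y += X;  X += (M & 0xFFFF);    /* odd  */
        M >>= 16;
        X += Y;  Y += M;               /* even */
    }
    *px = X;  *py = Y;
}
```

On a little-endian machine both functions should leave identical values in X and Y for any even `count`.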

Here I assume a Y2K+ x86 or comparable processor core. There are more instructions, but they only make explicit the operations that the previous version already implied, and they actually speed things up on 32-bit cores by removing the weird size mixing.

This could even be easily extended to 64 bits, loading four 16-bit words at a time and then shuffling them ourselves. The memory interface unit is under less pressure, but the work moves to the ALUs.
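
A minimal sketch of what the 64-bit variant might look like, under the same caveats: `mix_64` is a hypothetical name, the accumulators are kept at 32 bits, the target is assumed little-endian, `count` is assumed to be a multiple of 4 and renormalisation is left out.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 64-bit variant: one load feeds four 16-bit steps.
   The extra shifts and masks are the work that moves to the ALUs. */
void mix_64(const uint16_t *buffer, size_t count, uint32_t *px, uint32_t *py)
{
    uint32_t X = *px, Y = *py;
    for (size_t n = 0; n + 3 < count; n += 4) {
        uint64_t M;
        memcpy(&M, buffer + n, sizeof M);              /* 4 words at once */
        Y += X;  X += (uint32_t)(M & 0xFFFF);          /* word n   */
        X += Y;  Y += (uint32_t)((M >> 16) & 0xFFFF);  /* word n+1 */
        Y += X;  X += (uint32_t)((M >> 32) & 0xFFFF);  /* word n+2 */
        X += Y;  Y += (uint32_t)(M >> 48);             /* word n+3 */
    }
    *px = X;  *py = Y;
}
```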

1a: M = *(buffer+n);  => LOAD [basereg + cstoffset]
1b: Y = Y + X;        => ADD
2a: t = M & 0xFFFF;   => AND or equivalent
2b: M = M >> 16;      => SHR or equivalent
3a: X = X + t;        => ADD
4a: X = X + Y;        => ADD
4b: Y = Y + M;        => ADD

That's 7 basic opcodes to process 4 bytes, or almost 2 opcodes per byte (7/4 = 1.75). Due to the data dependencies, at least 4 cycles are required, so it's 1 cycle per byte (ideally). In order to get 2 opcodes per byte in practice, the above block should be unrolled at least 7 times.

This creates a new conflict: if the "swap" idea is used at the end of the loop to renormalise X and Y in turn, the extra word will misalign the pointer. The performance penalty could be higher than the cost of renormalising both X and Y.

Anyway, what is this "swap idea"? It's simply a way to renormalise only one variable per loop, just as the reference code already does, though not explicitly. Each loop renormalises only one variable and then swaps X and Y, so adding some more blocks like the one above would be pretty efficient.
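
The swap only works out if the loop body holds an odd number of steps, because the even step is just the odd step with X and Y exchanged. The sketch below illustrates this with a body of 3 steps and compares it against a plain word-by-word alternation. The function names are hypothetical, `count` is assumed to be a multiple of 3, and the renormalisation itself is deliberately left out since it is not described here.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain alternation of the two steps, one word at a time. */
void mix_alternate(const uint16_t *buffer, size_t count,
                   uint32_t *px, uint32_t *py)
{
    uint32_t X = *px, Y = *py;
    for (size_t n = 0; n < count; n++) {
        if ((n & 1) == 0) { Y += X;  X += buffer[n]; }  /* 1st, 3rd, ... word */
        else              { X += Y;  Y += buffer[n]; }  /* 2nd, 4th, ... word */
    }
    *px = X;  *py = Y;
}

/* Loop body of 3 steps (an odd number) followed by a swap. */
void mix_swap3(const uint16_t *buffer, size_t count,
               uint32_t *px, uint32_t *py)
{
    uint32_t X = *px, Y = *py;
    size_t loops = 0;
    for (size_t n = 0; n + 2 < count; n += 3, loops++) {
        Y += X;  X += buffer[n];
        X += Y;  Y += buffer[n + 1];
        Y += X;  X += buffer[n + 2];
        /* a real loop would renormalise one variable here (not shown) */
        uint32_t t = X;  X = Y;  Y = t;   /* exchange the roles of X and Y */
    }
    if (loops & 1) { uint32_t t = X;  X = Y;  Y = t; }  /* undo the last swap */
    *px = X;  *py = Y;
}
```

Both functions should return the same X and Y, which shows that the same loop body can be reused on every pass even though the steps keep alternating.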

Now, this makes the loop size odd, as discussed earlier. This is not desirable: once the load alignment constraints are taken into account, the load/store unit might slow every other unrolled loop down. I suspect most modern cores handle unaligned loads gracefully these days, but I'm not willing to risk the potential penalty on simple cores, whether for backwards compatibility or any other reason.
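
The "every other loop" effect can be seen with a bit of address arithmetic. This is a small illustrative helper, not part of the project: the name `loop_start_is_aligned` is made up, and the buffer itself is assumed to start on a 4-byte boundary.

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 when the first 32-bit load of loop number `loop` falls on a
   4-byte boundary, given that each loop consumes `words_per_loop`
   16-bit words and the buffer starts 4-byte aligned. */
int loop_start_is_aligned(size_t words_per_loop, size_t loop)
{
    size_t byte_offset = loop * words_per_loop * sizeof(uint16_t);
    return (byte_offset % sizeof(uint32_t)) == 0;
}
```

With 3 words per loop the starting offsets are 0, 6, 12, 18... bytes, so loops 1, 3, 5... start on a 2-byte boundary only, while an even word count keeps every loop aligned.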

And now I see that the log's title is out of sync with the contents.
