As we have already explored, a system with width=w will have
- w bits for X
- w bits for Y
- 1 bit for C
The whole state space is also divided into two groups :
- the orbits (2 ideally, of equal length)
- the trajectories, which lead to one orbit at the next step.
The 1-step trajectories might irk some purists because they would be seen as a "funnel".
In a way this is true because this makes the system non-reversible, as half of the alterations (from the user datastream) will make the state fall in the "diagonal" band of states that leads to the funnel. Switching orbit is not an issue since it is reversible.
There is one solution though : we have already seen that the orbits always fall in a "triangle", where C=0 when X>Y and C=1 when X<Y. So the trick would be to adjust/correct C after Y is altered, so the state remains in either of the orbits. But where would this correction fall in this code ?
t = X + Y + C; Y = X ^ data; C = t >> 16; X = t & 0xFFFF;
In the cases I see, this will simply override, or overwrite, C. At first glance it seems to throw one bit of entropy away but if we consider the condition of C, it is not really lost since it depends on X and Y so if Y changes, C should too. So we could write
C = sign(X-Y)
But not anywhere, because 2 parallel operations are taking place simultaneously. However C is updated on the 2nd line and since it will be overwritten, that statement can be replaced and re-organised:
t = X + Y + C; Y = X ^ data; // ^ or +, as you like X = t & 0xFFFF; C = (((X-Y) >> 16) & 1
I'm not sure yet if/how I can run the masking of X simultaneously with the comparison. x86 has comparison instructions that would greatly simplify the last line, which might be obscure to the compilers. Let's also try with:
t = X + Y + C; Y = X ^ data; X = t & 0xFFFF; C = (X<Y)?1:0
That last line should be translated in two x86 opcodes. If carries are used in asm, just move things around:
C = sign(X-Y); // just a CMP t = X + Y + C; // ADDC Y = X ^ data; // XOR X = t & 0xFFFF; // might be avoided in 16-bits mode
This is obviously slower than the classic version, with 3 cycles, and must be verified as equivalent with the original code.
What are the costs and benefits ?
- The immediate cost is the added subtraction that compares X and Y, slowing the computation down in the critical datapath.
- Another cost is the loss of one diagonal line, since C is now explicitly computed but with a partial formula
- The benefit is having a truly reversible computation and a potentially more solid error detection but this assertion must be tested and verified.
In practice, the balance must be carefully chosen. So far, before this alternate version is thoroughly tested and compared, it seems that the original, simple version, is still the most attractive, at least for short or medium data streams. The "enhanced version" without the funnels must show a significantly increased robustness to justify the added speed cost.