X+X = 2X

A project log for µδ code

A binary maths hack, for use in DSP code and lossless compression

Yann Guidon / YGDES 07/09/2016 at 16:01

Before we start, let's just remember and examine one of the fundamental rules of arithmetic. If you take 2 non-negative numbers and add them, the sum is greater than or equal to each of the addends. This is the rule that makes the #Recursive Range Reduction (3R) HW&SW CODEC work.

In a more general way, if you have two numbers of N bits, you need N+1 bits to uniquely represent the sum. Since X+X=2X, 2^N + 2^N = 2^(N+1): doubling the range costs one more bit. This extra bit is the carry bit, and it is required if you ever want to restore one of the original addends by subtracting the other back from the sum.
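To make the carry rule concrete, here is a minimal Python sketch. The 8-bit width and the helper names `add_keep_carry` and `add_drop_carry` are my own choices for illustration, not something taken from the 3R codec. It shows that the worst-case sum needs N+1 bits, and that the "sum is at least as large as each addend" rule stops holding as soon as the carry is thrown away.

```python
N = 8                        # operand width in bits (arbitrary choice for this sketch)
MAX_OPERAND = (1 << N) - 1   # largest N-bit value: 255

def add_keep_carry(a, b):
    """Full sum of two N-bit values; the result may need N+1 bits."""
    return a + b

def add_drop_carry(a, b):
    """Sum truncated to N bits, i.e. the carry bit is thrown away."""
    return (a + b) & MAX_OPERAND

# Worst case: both operands at their maximum.
worst = add_keep_carry(MAX_OPERAND, MAX_OPERAND)
print(worst, worst.bit_length())           # 510 needs 9 bits, i.e. N+1

# With the carry, the sum is never smaller than an addend...
a, b = 200, 100
print(add_keep_carry(a, b) >= max(a, b))   # True
# ...but once the carry is dropped, the rule no longer holds.
print(add_drop_carry(a, b) >= max(a, b))   # False (300 wraps to 44)
```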

The same goes for subtraction: the difference requires a "borrow" bit. You can't avoid this, even though there are some cases where you can work modulo 2^N.
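A small sketch of the subtraction case, again assuming 8-bit unsigned operands and hypothetical helper names (`sub_with_borrow`, `add_back_mod`). The first helper splits the difference into its low N bits plus a borrow flag. The second shows one situation where working modulo 2^N is enough: if you still have the subtrahend and know the original value fits in N bits, the wrapped difference alone lets you get it back.

```python
N = 8
MASK = (1 << N) - 1   # 0xFF, keeps the low N bits

def sub_with_borrow(a, b):
    """Return (low N bits of a - b, borrow flag) for N-bit unsigned a and b."""
    diff = a - b
    borrow = 1 if diff < 0 else 0
    return diff & MASK, borrow

def add_back_mod(diff_low, b):
    """Undo the subtraction modulo 2^N, without looking at the borrow."""
    return (diff_low + b) & MASK

low, borrow = sub_with_borrow(100, 200)
print(low, borrow)              # 156 1: the true difference (-100) needs the borrow
print(add_back_mod(low, 200))   # 100: the original value is recovered modulo 2^N
```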

We are used to dealing with the carry and borrow bits, but things can quickly get out of hand! Imagine you want to compute a 1024-point integer FFT: each result will be the sum of 2^10 numbers, adding 10 bits to the original sample size. If you're dealing with CD quality, 16+10=26 bits, so it fits in the 32-bit registers of common CPUs or DSPs.

Now if you want to use 24-bit samples, you're screwed: 24+10=34 bits don't easily fit in 32-bit registers. Will you resort to slower 32-bit floating point? 40-bit integers? 64-bit integers?
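The bit budget above is easy to check with a few lines of Python. This is only a worst-case estimate under the assumption that every output is a plain sum of `num_terms` full-scale samples. The function name `accumulator_bits` is made up for this sketch.

```python
def accumulator_bits(sample_bits, num_terms):
    """Worst-case width of a sum of num_terms values of sample_bits each.

    Each doubling of the number of terms adds one bit, so the growth is
    ceil(log2(num_terms)), computed here with integer arithmetic only.
    """
    growth = (num_terms - 1).bit_length()
    return sample_bits + growth

for sample_bits in (16, 24):
    total = accumulator_bits(sample_bits, 1024)
    verdict = "fits" if total <= 32 else "does not fit"
    print(f"{sample_bits}-bit samples, 1024 terms: {total} bits, {verdict} in 32 bits")
```

With 16-bit samples this reports 26 bits, and with 24-bit samples 34 bits, matching the figures in the text.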

Take the classical 8×8 DCT square now. The original 8-bit samples of a picture get fatter during each of the 3+3=6 passes, resulting in 14 bits for the results. The integer units have almost doubled the precision of the source data, and this considerably increases gate count, latency, power consumption...
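Here is a rough way to watch the samples "get fatter" pass after pass. Be warned that this is not a real DCT: the sketch only follows the sum path (the one that ends up in the DC coefficient) through 3 pairwise-addition passes on the rows and 3 on the columns, with an all-white 8×8 block as the assumed worst case. The function names are hypothetical.

```python
def pairwise_sum_pass(values):
    """One butterfly-like pass: add neighbours pairwise, halving the list."""
    return [values[i] + values[i + 1] for i in range(0, len(values), 2)]

def dc_bit_growth(block):
    """Record the widest value (in bits) before and after each summing pass."""
    rows = [list(row) for row in block]
    widths = [max(v.bit_length() for row in rows for v in row)]
    # Horizontal passes: each row shrinks 8 -> 4 -> 2 -> 1 values.
    while len(rows[0]) > 1:
        rows = [pairwise_sum_pass(row) for row in rows]
        widths.append(max(v.bit_length() for row in rows for v in row))
    # Vertical passes on the remaining column of row sums.
    column = [row[0] for row in rows]
    while len(column) > 1:
        column = pairwise_sum_pass(column)
        widths.append(max(v.bit_length() for v in column))
    return widths

block = [[255] * 8 for _ in range(8)]   # all-white 8-bit block
widths = dc_bit_growth(block)
print(widths)   # [8, 9, 10, 11, 12, 13, 14]
```

One bit per pass, six passes, and the 8-bit input has become a 14-bit result.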

Now you start to see where I'm going with this: the classical filters and time-frequency transforms have a fundamental problem of size.