Close

Design of the phase-out code

A project log for 3RGB image lossless compression format

Tiled CODEC for still and moving pictures based on the 3R compaction algorithm

yann-guidon-ygdesYann Guidon / YGDES 06/03/2017 at 19:020 Comments

The phase-in code (or truncated binary code) is significantly better than the plain binary code because it can remove one bit from a number in certain cases.

http://ygdes.com/3r/ has shown that a modified version, the phase-out code, is (in average) slightly better than phase-in code. It is not always the best because it depends on the provided data. However, we don't care much about reducing the size of numbers that are already small : the large numbers are those that must be reduced.

The early tests used a simple transform (n = lim - n) to reuse the existing phase-in code. The optimised version must use a leaner, more direct and less serial algorithm, which can then be implemented in logic gates. I had to come up with a suitable algorithm, which is:

Parameters:
k bits maximum,
mask = (1<<k) - 1;
mask>>1 ≤ lim ≤ mask;
n ≤ lim
if ( n > (((lim<<1) &mask) |1 )) { k--; n = n+(mask>>1) -lim; } send_bits (n, k);

This is quite similar to the code for phase-in presented by Wikipedia (edited for coherence):

u = (1 << k) - lim;
if (n < u) send_bits(n,   k-1); 
      else send_bits(n+u, k  );

There are some very interesting nuances however:

  1. The expression (((lim<<1) &mask) |1 ) amounts to a rotation of lim, it is very easy to design/wire as a digital circuit.
  2. The expression n+(mask>>1) -lim has various expressions that are easier to wire as digital circuits, such as n- (lim & mask>>1) +1, which amounts to a simple subtraction with one masked operand (and the mask's LSB is dropped). The +1 can be wired as a carry-in set to 1, which can be also optimised out in logic.
  3. Both expressions above can be computed in parallel because they are independent. As a digital circuit, the results n and k are selected with a MUX controlled by the comparison's result. As a program, in a superscalar processor, the various values are computed speculatively and affected with a predicate.

I am more concerned by the computational cost and latency in hardware than software, and by the decoding than the coding.


Speaking of decoding, there is the requirement to play nicely with the bitstream extraction circuit: call the function receive_bits() only once, but how do we know how many bits to get, since it's a variable length code ?

If the function is called with the shortest size in argument, it must be called later to complete the missing bit. This takes more time, but there is no risk to overflow the bitstream when it reaches the end.

So the fast solution is to over-request bits: if the received number uses fewer bits than requested, the bit pointer can be re-adjusted before the next cycle. This solves the question of speed but leaves us with a new probem: what happens when the stream ends on a word boundary with a short code ?

It is easy to know when the bitstream ends and almost as easy to not read one more word beyond the end. The stream reader could return any value because it will not be used. It can be zero, all ones, or the last value.

However the register must be shifted because the shifter can't shift by -1.

The file test-phase-out.html demonstrates the algorithm with JavaScript.

There's a few details that I had to iron out but now it works: exercise-phase-out.html

Apart from that, it works like a charm.


20170704

Some code optimisations give the simplified result for encoding:

  uint32_t k=0, l=lim, mask;

  if (lim) {
    // generate mask from lim
    if (l & ~255) { k =8; l=lim>>8; }
    if (l & ~ 15) { k+=4; l>>=4; }
    if (l & ~  3) { k+=2; l>>=2; }
    if (l & ~  1) { k+=1; l>>=1; }
    mask=(1<<k)-1;

    if ( (val>>1) > (lim & mask) )
      val+=mask-lim;
    else
      k++;
OTOH decoding is a bit more hairy...
  uint32_t
    val=0,   // default value if direct return
    val1,    // val >> 1
    k,       // number of bits
    l=lim,   // temporary limit
    mask,
    mask1;   // mask >> 1

  if (lim) {
    k=1;
    if (l & ~255) { k =9; l=lim>>8; }
    if (l & ~ 15) { k+=4; l>>=4; }
    if (l & ~  3) { k+=2; l>>=2; }
    if (l & ~  1) { k+=1; l>>=1; }
    Reg_Offset -= k;
    mask=(1<<k)-1;
    mask1 = mask>>1;
    val = (Reg_buffer >> Reg_Offset) & mask;
    val1 = val>>1;

    ....

    // adjust (phase-out)
    if (val1 > (lim & mask1)) {
      val = (val1+lim)-mask1;
      Reg_Offset++;
    }
A new version is available as phaseout.c along with some testbench code.

Discussions