The "tree walker" circuit is an essential piece of the compaction/compression system, but it must interact with another critical element: the circuit that reads the bitstream and expands each value before presenting it to the tree walker.
The "vanilla" 3R algo uses a simple bit-extraction system, where an integer number of bits is handled during each cycle. This involves a barrel shifter and some nice logic but it's considered a "classic circuit".
The "enhanced 3R" uses a more complex system called "truncated binary" or "economy codes" or "phase-in code", depending on the sources. This gets more complex because the extracted value must also be compared with a "threshold value" (which must also be computed) to determine if an extra bit is required.
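To make the threshold comparison concrete, here is a minimal sketch of the textbook truncated binary code for a value in `range(n)`. It is an illustration only: the function names are made up for this example, the extra bit is appended at the LSB end as in the usual textbook form, and the exact bit layout used by 3R may well differ.

```python
def truncated_binary_encode(x, n):
    """Encode x (0 <= x < n) as a list of bits, MSB first (textbook form)."""
    k = n.bit_length() - 1            # floor(log2(n)): length of a short code
    threshold = (1 << (k + 1)) - n    # how many values get a short, k-bit code
    if x < threshold:
        return [(x >> i) & 1 for i in range(k - 1, -1, -1)]
    y = x + threshold                 # long codes use k + 1 bits
    return [(y >> i) & 1 for i in range(k, -1, -1)]


def truncated_binary_decode(bits, n):
    """Decode one value in range(n) from an iterator of bits, MSB first."""
    k = n.bit_length() - 1
    threshold = (1 << (k + 1)) - n    # the "threshold value" to compare against
    value = 0
    for _ in range(k):
        value = (value << 1) | next(bits)
    if value < threshold:
        return value                  # short code: k bits were enough
    value = (value << 1) | next(bits) # long code: fetch one extra bit
    return value - threshold
```

With `n = 5`, the threshold is 3: the values 0, 1 and 2 take two bits, while 3 and 4 take three bits. The decoder cannot know the code length before comparing the first `k` bits with the threshold, which is what puts the comparison on the critical path.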
The original version (http://ygdes.com/ddj-3r/ddj-3r_compact.html) suffers from poor data organisation: the bits are in an awkward order in the bitstream. When a "long code" is found, one more data bit must be fetched close to the MSB. I'm trying to solve that but it's hard.
I hit a tough problem: I can't find an organisation of the data where fewer than two barrel shifters are required in the critical datapath. I'm not even counting the final MUX that performs the SHR(1).
Going back to the fundamental questions, let's first solve that "original sin" of the wrong bit order. Consecutive bits must be contiguous so a "Big Endian" order is chosen.
Let us now consider the case of a 16-bit "bit accumulator" that contains a 5-bit field. To extract those bits, there are three possible choices:
- The field is aligned/justified to the "left", the MSB of the register:
    MSB                                         LSB
    15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
     4  3  2  1  0  x  x  x  x  x  x  x  x  x  x  x
- The field is aligned/justified to the "right", the LSB of the register:
    MSB                                         LSB
    15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
     x  x  x  x  x  x  x  x  x  x  x  4  3  2  1  0
- The field is not aligned/justified so bits are anywhere in the register:
    MSB                                         LSB
    15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
     x  x  x  4  3  2  1  0  x  x  x  x  x  x  x  x
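As a software illustration of the three layouts, the following sketch extracts an `nbits`-wide field from a 16-bit accumulator in each case. The function names and the `lsb_pos` parameter are hypothetical, chosen for this example only, and the code ignores refilling the accumulator.

```python
WIDTH = 16  # width of the bit accumulator in the three layouts


def extract_left(acc, nbits):
    """Field justified to the MSB: shift it down by WIDTH - nbits."""
    return (acc & 0xFFFF) >> (WIDTH - nbits)


def extract_right(acc, nbits):
    """Field justified to the LSB: mask off the upper bits."""
    return acc & ((1 << nbits) - 1)


def extract_anywhere(acc, lsb_pos, nbits):
    """Field with its LSB at bit lsb_pos: shift down, then mask."""
    return (acc >> lsb_pos) & ((1 << nbits) - 1)
```

For example, with the 5-bit field `0b10110` and the other bits set to arbitrary values, `extract_left(0xB7FF, 5)`, `extract_right(0xFFF6, 5)` and `extract_anywhere(0xF6FF, 8, 5)` all return 22. The variable shift amounts are where a hardware implementation would need a barrel shifter.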
All of them amount to the same general complexity, number of units and latency.
For example, in all these cases, only 16 bits are shown but the "bit accumulator/shift register" must actually be 32 bits wide, so that a new 16-bit word can be added to the register while it is being read.
- In the "left justified" case, when more than 16 bits are consumed and shifted out, a new 16-bit word must be brought in and shifted (by 0 to 15 bits) before being stored. The justified output must also be shifted (again by 0 to 15 bits).
- The "right justified" case is the same as the "left justified" case (more or less).
- In the "not aligned" case, the input data is written to the shift register in a "double buffering" fashion, one half at a time (alternating between the MSB half and the LSB half). The complexity is relegated to the read side, where a 32-bit barrel rotator (though with a 16-bit output) is used instead of a barrel shifter.
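The double-buffered rotator arrangement can be modelled in software as follows. This is only a behavioural sketch under several assumptions that are not taken from the design itself: the class and method names are invented, bit positions are counted from the MSB (Big Endian order), each read takes 1 to 16 bits, and the rotation amount is chosen so that the field lands right-justified in the low 16 bits.

```python
class RotatorReader:
    """Behavioural model: 32-bit register, 16-bit halves refilled in turn."""

    def __init__(self, words):
        self.words = iter(words)
        self.reg = 0
        self.pos = 0                  # read pointer, 0 = MSB of the register
        self._load(high=True)         # preload both halves
        self._load(high=False)

    def _load(self, high):
        word = next(self.words, 0) & 0xFFFF   # zeros once the input is exhausted
        if high:
            self.reg = (self.reg & 0x0000FFFF) | (word << 16)
        else:
            self.reg = (self.reg & 0xFFFF0000) | word

    def read(self, nbits):
        """Return the next nbits (1..16) of the stream, right-justified."""
        amount = (self.pos + nbits) & 31      # single rotation, modulo 32
        rotated = ((self.reg << amount) | (self.reg >> (32 - amount))) & 0xFFFFFFFF
        value = rotated & 0xFFFF & ((1 << nbits) - 1)   # 16-bit output, masked
        old = self.pos
        self.pos = amount
        if (old ^ self.pos) & 16:             # pointer moved into the other half
            self._load(high=(old < 16))       # refill the half that was just left
        return value
```

Reading four bits at a time from the words `0xABCD, 0x1234` returns `0xA, 0xB, 0xC, 0xD, 0x1, 0x2, 0x3, 0x4`. In this model the write side never shifts anything: a new word is simply stored into whichever half has been fully consumed, and all the variable-distance work is done by the one rotation on the read side.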
So between one big rotator and two smaller shifters, the choice is not straightforward. A larger unit needs more fanout, while two smaller units can be placed and routed more easily, but more units mean more control logic too.
Continued at 5. Phase-out encoder/decoder