Close
0%
0%

Recursive Range Reduction (3R) HW&SW CODEC

A high speed circuit design in JS and VHDL for decoding 3R bitstreams, a "quasi-entropy" code that compacts series of correlated numbers.

Similar projects worth following
The 3R algorithm was first published in 2003 http://ygdes.com/ddj-3r/ddj-3r_compact.html and made it to Salomon's "Data Compression - The Complete Reference". It was then ported to JavaScript for an online demo http://ygdes.com/3r/

I develop a new approach that is useful for very fast, very compact and very simple decompressors. Memory requirement is only Log2(N), throughput is 3N/2 cycles and the decoder is only a bunch of counters, registers and MUXes.

The current application is Lossless coding of greyscale/RGB frames. Its very high speed makes it suitable for texture compression (GPU/CGI world) and video transmission over Ethernet. Compression ratio is not competitive with existing formats but it can be implemented in a tiny FPGA and the decoding source code is smaller than .GIF or .PNG

The Recursive Range Reduction algorithm is a pretty unusual algorithm that compacts (or packs) successive samples of a continuous signal (it is designed for anything DSP-related, not text). It is inherently adaptative, it can extract different features at various scales and it can work where other classic alphabet-oriented (LZx, BWT, RLE...) methods fail. As it uses a different metric of entropy (it does not look at repetitions or probabilities), it's a new tool in the data compression scientist's toolchest.


The applications for the 3R algorithm are pretty "niche" but a smaller, and faster implementation will make it more appealing for low-resources signal processing.

The purpose of this work is to design and implement a new, custom, resilient, high-speed video transmission protocol where a small bandwidth decrease is desirable (vs uncompressed data) as long as latency and complexity are kept very low. A "bypass" mode prevents data from exceeding the original size (which happens quite often).

Lossless sound compression is seriously considered. Other applications are possible, such as data logging of physical measurements (temperature, light, hygrometry, whatever). You'll have to provide your own filter/predictor to pre-crunch the numbers and make them palatable to 3R.

Currently, only the "tree" part of the decoder is publicly available in JS. Hopefully, this work will evolve to a fully portable format for image coding/decoding that is much lighter than a full PNG library.


Funnily, I only discover VESA's DSC (Display Stream Compression) in 2015/9 : www.vesa.org/wp-content/uploads/2014/04/VESA_DSC-ETP200.pdf The 3R algorithm is less efficient but it is really lossless (there is no quantization in the initial design, though there can be in specific cases), uses very little storage or electronic gates, and (unlike VESA's $350 fee) it's Free Software from the beginning ;-)


Logs:
1. First sample source code snippet
2. 3R encoding and hierarchical census data mapping
3. First successful decoding of 3R stream in VHDL
4. Synthesis/place/route results
5. Simplicity
6. Color spaces
7. Publication in Open Silicium
8. Another set of articles in OpenSilicium
9. Another article about the VHDL implementation
10. Project status
11. Design of the bit extraction circuit

test_dec3r.vhdl

This testbench exercises the 3R decoder. it also shows how to interface it with other blocks. You'll want to look at it when the rest compiles.

x-vhdl - 3.04 kB - 04/14/2016 at 05:41

Download

dec3R.vhdl

The decoder. Sorry for the comments in french.

x-vhdl - 5.88 kB - 04/14/2016 at 05:37

Download

bram.vhdl

A generic 1-port BlockRAM for storing the stack

x-vhdl - 1.13 kB - 04/14/2016 at 05:32

Download

  • Plane-filling curve

    Yann Guidon / YGDES04/16/2017 at 02:41 0 comments

    This shape reminds me of something...

    fractal curve

    Except this one has a different order of symmetry.

    Interesting !

  • Design of the bit extraction circuit

    Yann Guidon / YGDES06/02/2016 at 00:23 0 comments

      The "tree walker" circuit is an essential piece of the compaction/compression system but it must interact with another critical element : the circuit that reads the bitstream and expands each value to be presented to the tree walker.

      The "vanilla" 3R algo uses a simple bit-extraction system, where an integer number of bits is handled during each cycle. This involves a barrel shifter and some nice logic but it's considered a "classic circuit".

      The "enhanced 3R" uses a more complex system called "truncated binary" or "economy codes" or "phase-in code", depending on the sources. This gets more complex because the extracted value must also be compared with a "threshold value" (which must also be computed) to determine if an extra bit is required.

      The original version (http://ygdes.com/ddj-3r/ddj-3r_compact.html) suffers from a bad case of bad data organisation, the bits are in an awkward order in the bitstream. When a "long code" is found, one more data bit must be fetched close to the MSB. I'm trying to solve that but it's hard.

      I hit a tough problem: I can't find an organisation of the data where less than 2 barrel shifters are required in the critical datapath. I'm not even counting the final MUX to perform the SHR(1).


      Going back to the fundamental questions, let's first solve that "original sin" of the wrong bit order. Consecutive bits must be contiguous so a "Big Endian" order is chosen.

      Let us now consider the case of a 16-bits "bit accumulator" that contains a 5-bits field. To extract those bits, there are three possible choices :

      1. The field is aligned/justified to the "left", the MSB of the register:
        MSB                                         LSB
        15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
         4  3  2  1  0  x  x  x  x  x  x  x  x  x  x  x
      2. The field is aligned/justified to the "right", the LSB of the register:
        MSB                                         LSB
        15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
         x  x  x  x  x  x  x  x  x  x  x  4  3  2  1  0
      3. The field is not aligned/justified so bits are anywhere in the register:
        MSB                                         LSB
        15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
         x  x  x  4  3  2  1  0  x  x  x  x  x  x  x  x

      All of them amount to the same general complexity, number of units and latency.

      For example, in all these cases, only 16 bits are represented but the "bit accumulator/Shift register" must actually be 32 bits wide, allowing a new 16-bits word to be added to the register while it is being read at the same time.

      1. In the "left justified" case, when more than 16 bits are consumed and shifted out, a new 16-bits word must be brought in and shifted (from 0 to 15 bits) before being stored. The justified output must also be shifted (from 0 to 15 bits again) as well.
      2. Same as 1. (more or less)
      3. In this case, the input data is written to the shift register in a "double buffering" fashion, one half at a time (alternating MSB and LSB). The complexity is relegated to the read side, where it's not a barrel shifter that is used, but a 32-bits barrel rotator (though with a 16-bits output).

      So between one big rotator or two smaller shifters, the choice is not straight-forward. A larger unit needs more fanout and two smaller units can be more easily placed/routed but more units means more control logic too.

  • Project status

    Yann Guidon / YGDES04/17/2016 at 22:23 0 comments

    Currently, only one part of the decoder is developped and available. This is a critical part, the innovative algorithm (which linearizes the tree-walking from the recursive definition) which must be connected to the other blocks.

    The other blocks are the bit-shuffler/bitfield extractor, and the output filter (which depends on the application).

    Of course, the forward and reverse blocks must be designed so this doubles the efforts again !


    Let's just look at the decoder: it receives one block of data encoded as a bitstream (with a granularity of 8 bits so blocks are byte-aligned).

    The bitstream is examined and words are extracted from the bitstream: from x bits they are zero-extended (usually to 16 bits).

    Those words are fed to the tree-walker unit that select which words to bypass or subtract and store for later. In the end, the output words are smaller.

    These smaller words, either a difference or a litteral value, are sent to the reconstruction filter.


    The encoder is almost the reverse but needs one more step, so this is not a totally symmetric process, and it requires more buffering, one full block must be stored before being sent.


    So far the decoder has been tested with manually generated word streams. They can also be copy-pasted from dynamic results of the test page in JavaScript: http://ygdes.com/3r/


    The bitstream supports several formats to reduce the impact of incompressible or degenerate incomping data.
    In the following, we assume that the input data are N-bits positive numbers, packed in blocks of 2^M.

    • A leading zero byte indicates "bypass": just reassemble the words with no alteration. This is for total noise.
    • A 0 MSB means that the samples have been encoded in "truncated binary" / "economy code". This is practical if there is noise is amplitude-limited. The first word is an economy-encoded value that is the maximum for the whole block. This still bypasses the tree-walking block but saves some bits.
    • A 1 MSB means that tree-walking is used. The first word is the size of the first sum. From there on, the tree-walker will compute all the necessary differences and output proper-sized values. "Zerotree" allows blocks of 0s to use almost no room. In this mode, there is a constant feedback from the treewalker to provide the number of expected bits of the next value to extract (which is a little tricky to schedule).

  • Another article about the VHDL implementation

    Yann Guidon / YGDES04/14/2016 at 03:01 0 comments

    The design of the 3R decoder continues with this article that details all the twists and tricks that led me to write the VHDL code. It is now available in kiosks as OpenSilicium#18 !The milestones have been finally reached but a lot of work remains before the sound and image compressors finally go live in FPGA.

  • Another set of articles in OpenSilicium

    Yann Guidon / YGDES01/02/2016 at 12:27 0 comments

    Publication is progressing, with 2 articles about the coding style and the optimisation of the algorithm (using a linearisation of the recursive system).

    16 pages at the end of OpenSilicium n°17 !


    More to come, whenever I have time ;)

  • Publication in Open Silicium

    Yann Guidon / YGDES11/21/2015 at 01:31 0 comments

    A first article is already available in Open Silicium n°16 (in french). It is pretty general and the explanation of the algorithm's subtleties will appear in january 2016 in Open Silicium n°17.

    Hopefully, the VHDL will be published in OS n°18 (april 2016 ?) (done: Another article about the VHDL implementation)

  • Color spaces

    Yann Guidon / YGDES09/29/2015 at 20:47 0 comments

    Note for later : Test a variety of lossless colorspace conversions and see which performs best :-)

    • G, R-G, B-G

    This one is the simplest !

    ; encoding
    Y = G
    Cb = B - G
    Cr = R-G
    
    ; decoding
    G = Y
    B = Cb - Y
    R = Cr - Y
    Latency is 1 addition (1 cycle). 8-bits components wrap around on overflow.
    • Reversible Color Transform (RCT)

    https://en.wikipedia.org/wiki/JPEG_2000#Color_components_transformation

    ; encoding
    Y = ( R + 2G + B) >> 2
    Cb = B - G
    Cr = R - G
    ; decoding
    G = Y - ((Cb + Cr)>>4)
    R = Cr + G
    B = Cb + G
    That one is very nice, encoding takes 1 cycle and decoding 2 cycles. Is it more efficient than the previous one ?
    • YCoCg

    It's a bit different https://software.intel.com/en-us/node/503873

    ; coding
    Y = R/4 + G/2 + B/4
    Co = R/2 - B/2
    Cg = -R/4 + G/2 - B/4
    ; or better :
    t = (R + B) >> 1
    Co = (R - B) >> 1    ; I have to find an "add-sub" circuit, but A+B and A-B have the same LSB
    Y = (G + t ) >> 1
    Cg = (G - t) >> 1
    
    ; decoding
    R = Y + Co - Cg
    G = Y + Cg
    B = Y - Co - Cg
    
    
    Coding and decoding use 4 add or sub each but more bits are needed to preserve perfect reversibility.
    • YCoCG24

    I love this one. I'm not sure who designed it first, it uses a different lifting scheme that preserves reversibility

    http://stackoverflowproxy.eu/questions/10566668/lossless-rgb-to-ycbcr-transformation/12146329#12146329

    ; encoding
    Co = B - R
    R = R + Co>>1
    Cg = R - G
    Y = G + Cg>>1
    
    ; decoding
    G = Y - Cg>>1
    Cg = Cg + Y
    R = Cg - Co>>1
    Cg = Co + R
    

    It is perfectly reversible at the price of a longer critical datapath.

    Which of those transforms is the best in practice ?


  • Simplicity

    Yann Guidon / YGDES09/28/2015 at 16:28 0 comments

    Google is releasing their incredible image compression algorithms, currently called Brotli, and proposes it as an Internet standard : http://www.ietf.org/id/draft-alakuijala-brotli

    The 3R codec goes in the reverse direction : speed is important but compression ratio is not, I want the codec to be as small and simple as possible. No huge constant tables, hopefully a few hundreds of lines of source code, a thousand logic gates at most...

    But it is well known that simplicity is complicated ! It is not a sophisticated design and I am interested only by the "low hanging fruits", the easy and cheap tricks that compress the most data.

  • Synthesis/place/route results

    Yann Guidon / YGDES09/01/2015 at 13:23 0 comments

    Thanks to extensive simulations with GHDL, the VHDL source code synthesised "out of the box" and effortlessly reached 110MHz on the venerable ProASIC3 FPGA family. One SRAM block and about 450 tiles (LUT3) are used.

    The SmartFusion2 has a different structure and also provides smaller, faster SRAM blocks. Only 120 LUT4 are used but I thought I'd reach more than the announced 147MHz... I'm believe I might shave a nanosecond or two if I read the doc well.

    I still have to try with Xilinx (S3E, Artix) and Lattice (MachXO2 and ICE40) but I have no experience with them... That might be a good opportunity to try Antti's toys ;-)

    Here is a screenshot of the P&R's map for the SmartFusion2 target.

  • First successful decoding of 3R stream in VHDL

    Yann Guidon / YGDES08/29/2015 at 04:28 0 comments

    I got the core part of the system working a few minutes ago :-D

    More internal details will follow in the coming months.

View all 12 project logs

  • 1

    Here is the optimised decoding algorithm in JavaScript.

    function decode_3R(DEPTH, // Log2(sample number)
         stream) {  // pointer to the array of sample
    
      var A=DEPTH-1,    // stack pointer
          C=0,          // loop counter
          R,            // checksum register
          S=1<<DEPTH,   // Shift register
          STACK=[],     // no-init stack
          T=stream[0],  // Temp. register
          flux_index=1,
          V;            // input value
    
      R=T;  // init
      function out_value(val) {
        R -= val;
        // insert your filter code here
      }
    
      do {
        while (true) {
          V = stream[flux_index++];  // read one input value
          if (S & 2)   
            break;
    
          STACK[A--] = T-V;
          T = V;
          S = S >> 1;
        }
    
        out_value(V);
        out_value(T-V);
    
        A=1;  // new stack pointer
        /* branch-prediction-friendly version
        if (C & 2) {
          A=2;
          if (C & 4) {
            A=3;
            if (C & 8) {
              A=4;
              if (C & 16) {
                A=5;          */
                while ( C & (1<<A) )
                  A++;  /*
              }
            }
          }
        } */
    
        T = STACK[A--];
        S =
        C = C+2;
      } while (C < 1<<DEPTH);
      return R; // R=0 if the block is valid
    }

    This code is very optimised yet very easy to tinker with and to port to other languages.

    Other parts are missing, since they are application-dependent : the bitstream extraction and the filter/predictor must be carefully developed.

View all instructions

Enjoy this project?

Share

Discussions

Antti Lukats wrote 04/16/2016 at 12:23 point

if yI want to compress a BYTE stream, 8 bit in, 8 bit out, then I should set DEPTH=0 ??

but that doesnt seem to work, so how can I compress a byte wide stream?

  Are you sure? yes | no

Antti Lukats wrote 04/16/2016 at 11:52 point

VESA standard requires fee to obtain the standard, but the compression HDL code could be free and open source. And yes, I do not like standards that are not freely available.

  Are you sure? yes | no

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates