This is only a development version, cluttered with #ifdef blocks and sanity checks. The final versions will be much more compact and based on macros sprinkled over the relevant functions (not even inlined).
One interesting characteristic is the size of the shift register: I use 64 bits in this version. It could be 32 bits (as in the hardware implementation) but the wider register has one benefit: you can shove in up to 32 bits without having to check for an overflow, whatever the original position (because the last sequence has already shifted the high half out).
For the encoder or decoder, this means that you can process up to 4 bytes, or a whole pixel, without testing whether you need to read or write a word from/to memory. The savings in code size and speed are not negligible, and at least it takes advantage of recent processors.
Unfortunately, the A53 cores at the heart of the Raspberry Pi 3 are not running in 64-bit mode. We'll see if the NEON extension can be used...
But at least, encoding will not be dog-slow because of this.
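To make the 64-bit shift register idea concrete, here is a minimal sketch of the writing side. It is not the project's actual code: the `bitwriter` struct, its fields and `put_bits` are hypothetical names, and the MSB-first bit order and 32-bit output words are assumptions. The point is the invariant: fewer than 32 bits are pending between calls, so adding up to 32 more can never overflow the 64-bit register, and the "is a word ready?" test happens only once, after the insertion.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bit writer, for illustration only. */
typedef struct {
    uint64_t  acc;    /* shift register; pending bits are the low `count` bits */
    unsigned  count;  /* number of pending bits, always < 32 between calls */
    uint32_t *out;    /* destination buffer (assumed large enough) */
    size_t    pos;    /* index of the next 32-bit word to write */
} bitwriter;

/* Append the n low bits of value (0 <= n <= 32), MSB first (assumed order). */
static void put_bits(bitwriter *w, uint32_t value, unsigned n)
{
    uint64_t mask = (UINT64_C(1) << n) - 1;   /* n <= 32, so the shift is safe */

    /* count < 32 and n <= 32, hence count + n < 64: no overflow test needed.
       Stale bits above `count` simply move up and fall off the top. */
    w->acc = (w->acc << n) | ((uint64_t)value & mask);
    w->count += n;

    /* Single check, after the insertion: emit one word when 32 bits are ready. */
    if (w->count >= 32) {
        w->count -= 32;
        w->out[w->pos++] = (uint32_t)(w->acc >> w->count);
    }
}
```

With this layout, four successive `put_bits(&w, byte, 8)` calls (one pixel) produce exactly one memory write. A 32-bit register would instead need to test for a partial word before each insertion and split the value across two words.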
- create a merged version of send_phaseout and receive_phaseout that does not call the bit insertion/extraction routines (an inlined, streamlined version)
- split the functions into various macros.