There is this old question of the appropriate size of the tiles, usually this is a dilemma between 16×16 and 8×8.
The first thing to consider is that, since 3RGB doesn't actually model the signal, there is little need for large tiles because a stray "very high pixel" will affect 4× more values with 16×16, than 8×8. The "lone wolf" has more deleterious effect with a large tile.
A smaller tile makes the stream more resilient in case of transmission/storage error. The whole tile can be discarded with 4× less visual effect.
A smaller tile fits easily in "BlockSRAM" memories of FPGA. Or more tiles in a BlockSRAM. The latest MicroSemi FPGA (Igloo2, SmartFusion2 and PolarFire) even have perfectly sized 2R1W register blocks (4 entries, with 12 bits of PolarFire).
The major factor however concerns the design of the bit serialiser/shifter, the width depends on the largest value that needs to be represented. The worst case is with the luma blocks that use 3R:
- For 8×8: 64 values of max 765 = 48960 (fits well in 16 bits)
- For 16×16: 256 values of max 765 = 195840 (requires 18 bits)
The obvious problem is that handling 18-bits values with a 32-bits processor is more complicated than 16 bits, because 32/2<=18 and
- the shift register can process up to 1×16-bits word per cycle, whereas 18-bits words reduce the granularity to bytes,
- up to 3 cycles (1 per byte) might be required to make room for a new word in the shifter's barrel, but the 16-bit shifter can make room immediately or in parallel, the bandwidth is much better
So the granularity is settled to 8×8 pixels per tile...