After more work, here is the circuit diagram for the SHL unit!
I have tried to make the function a bit more apparent by using some colors.
One cool detail is that the last stage is just a "swap" layer, which only needs MUX2. However, it is more interesting to keep the '153 because it contains 2 AND gates (one enable input per MUX), so if a '157 were used instead, half a 74HC08 would be needed as well.
Here, some inputs are wasted but the circuit's structure remains very regular. The unused inputs might also be used to extend the functionality of the unit, with finer insert/extract operations.
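To make the "MUX4 used as a gated MUX2" idea more concrete, here is a small behavioral sketch of one half of a '153. The active-low enable and the 4 data inputs follow the standard 74HC153 behavior, but the wiring in `swap_mux2` (select S1 tied low, inputs I2/I3 tied to 0) and both function names are only assumptions for illustration, not the actual schematic.

```python
def mux4_153(inputs, s1, s0, enable_n):
    """One half of a 74HC153: 4-to-1 MUX with an active-low enable.

    When enable_n is 1 the output is forced to 0, which is why the
    enable pin behaves like an AND gate on the MUX output.
    """
    if enable_n:
        return 0
    return inputs[(s1 << 1) | s0] & 1


def swap_mux2(a, b, swap, enable_n=0):
    """Hypothetical swap cell: a MUX4 half wired as a gated MUX2.

    Assumed wiring (not taken from the schematic):
    S0 = swap, S1 = 0, I0 = a, I1 = b, I2 and I3 unused (tied to 0).
    """
    return mux4_153((a, b, 0, 0), 0, swap, enable_n)


if __name__ == "__main__":
    for swap in (0, 1):
        for en_n in (0, 1):
            print("swap", swap, "enable_n", en_n, "->", swap_mux2(1, 0, swap, en_n))
```

With a '157, the same cell would need its output (or inputs) gated by an external AND gate, which is where the extra '08 would come from. The two unused inputs in this sketch are the ones that could later carry extra insert/extract sources.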
The chip count is 32×74HC153 (dual MUX4 with enable), 4×74HC32 (quad OR gates), plus some more logic for the control. A few more '08 might be needed to mask some bytes of the DST operand.
The critical datapath goes through 5 chips in series, which is pretty good for such a versatile circuit. Indeed, it can also perform the operations of the IE (Insert/Extract) unit that aligns bytes going from/to memory (more control logic is required though).
Many AND inputs are tied to 0V. These could instead be tied to a global "enable" signal that reduces signal swings in the unit and saves some power when the unit is not in use.
I think that this unit is very cool, despite its significant size. Though it would be worse with transistors ;-)