The latest source code archive contains the enhanced decoder for the register set, including 3 strategies:
- straight (fast): drive all the control lines directly
- latching: update only meaningful control lines
- instruction-sensitive: update only meaningful control lines when the related field is used
I provide a pseudo-randomised test to compare these strategies and the outcome is great:
[yg@Host-001 R7]$ ./test.sh
Testing R7:
straight decoder:R7_tb_dec.vhdl:165:5:(report note): 100000 iterations, 702273 toggles
latching decoder:R7_tb_dec.vhdl:165:5:(report note): 100000 iterations, 301068 toggles
Instr-sensitive :R7_tb_dec.vhdl:165:5:(report note): 100000 iterations, 160231 toggles
R7: OK
The third result is roughly 4.4 times lower than the first (a ratio between 1/4 and 1/5), which I explain below:
- Given that the probability of any one bit being set is close to 1/2, it makes sense that the outputs of the first, "straight" decoder toggle every other time on average. There are 14 control lines to drive, so with a probability of 1/2, about 7 lines change per instruction.
- The next method gives a better result, which follows from similar logic: we get 3 toggles per instruction. There are 2 decoders, and each one updates only 3 of its 7 control lines because the other 4 give results that will not be used. Each of these lines changes with a probability of 1/2, so the 2 × 3 = 6 meaningful lines yield 3 toggles on average, as if only one decoder were switching. So far, so good, no surprise at all.
- The last method gives an average toggle rate of 1.6 per instruction. This is about one half of the previous result and, though it should be taken with a lot of caution, the benefit is clear. Some instructions (about 1/4) don't use the SND field, and the SRI field is not used when the Imm8 or Imm4 fields are used, giving a further significant reduction of toggles.
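The arithmetic behind these three estimates can be checked directly against the test output. The short Python sketch below is not part of the source archive: it only divides the reported toggle counts by the iteration count and puts them next to the rough predictions from the list above. All names in it are mine, chosen for illustration.

```python
# Toggle counts reported by test.sh (copied from the output above).
ITERATIONS = 100_000
TOGGLES = {
    "straight": 702_273,
    "latching": 301_068,
    "instr-sensitive": 160_231,
}

def toggles_per_instruction(total, iterations=ITERATIONS):
    """Average number of control lines that change per decoded instruction."""
    return total / iterations

rates = {name: toggles_per_instruction(t) for name, t in TOGGLES.items()}

# Back-of-the-envelope predictions taken from the text:
# straight: 14 control lines, each toggling with probability 1/2
predicted_straight = 14 * 0.5
# latching: 2 decoders, 3 meaningful lines each, probability 1/2
predicted_latching = 2 * 3 * 0.5

for name, rate in rates.items():
    print(f"{name:16s} {rate:.2f} toggles/instruction")
print(f"predicted straight: {predicted_straight}, latching: {predicted_latching}")
print(f"third/first ratio: {rates['instr-sensitive'] / rates['straight']:.3f}")
```

The measured averages land very close to the predicted 7 and 3, and the third strategy comes out at a bit more than half of the second.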
Of course, these numbers are NOT representative of real use cases. I used pretty uncorrelated bits as sources, while real workloads have some sort of pattern. The numbers will certainly increase or decrease, depending on each program.
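To get a feel for how much input correlation could matter, here is a toy Python model. It is purely my own assumption and not the VHDL testbench: it treats the 14 control lines as independent bits that each flip with a probability `p_flip` at every step. With `p_flip = 0.5` it mimics the uncorrelated test; smaller values stand in for a program whose successive instructions resemble each other.

```python
import random

def average_toggles(p_flip, lines=14, steps=20_000, seed=1):
    """Toy model (assumption): every control line flips independently
    with probability p_flip at each step. Returns the mean toggle count."""
    rng = random.Random(seed)
    total = 0
    for _ in range(steps):
        # Count how many of the lines flip during this step.
        total += sum(rng.random() < p_flip for _ in range(lines))
    return total / steps

for p in (0.5, 0.25, 0.1):
    print(f"p_flip={p}: {average_toggles(p):.2f} toggles per step")
```

In this simplified model the average is simply close to 14 × `p_flip`, so it says nothing about the relative merits of the three strategies. It only shows how strongly the absolute numbers depend on the statistics of the input.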
There is a compromise for each situation and the 3 methods are provided in the source code, so you can choose the best trade-off between latency and power consumption. The numbers are pretty good and I think I have reached the point of diminishing returns. Any further "enhancement" would increase the logic complexity for insignificant gains...