-
More statistics - top 5 instructions
12/25/2018 at 04:01 • 0 comments
I extended the emulator to report the top 5 most frequent instructions. These are the results for some RISC-V benchmarks:
dhrystone:
Five Most Frequent:
1) ADDI = 87830 (27.05%)
2) BEQ = 33399 (10.29%)
3) SW = 33037 (10.17%)
4) LW = 31050 (9.56%)
5) LBU = 27712 (8.53%)

median:
Five Most Frequent:
1) ADDI = 3758 (23.13%)
2) LW = 3519 (21.66%)
3) BNE = 1825 (11.23%)
4) BGE = 1240 (7.63%)
5) SW = 1141 (7.02%)

multiply:
Five Most Frequent:
1) ADDI = 9581 (19.29%)
2) BNE = 7309 (14.72%)
3) SLLI = 7052 (14.20%)
4) BEQ = 6691 (13.47%)
5) ANDI = 6540 (13.17%)

qsort:
Five Most Frequent:
1) ADDI = 77881 (32.97%)
2) LW = 56308 (23.84%)
3) BLT = 37593 (15.91%)
4) SW = 17257 (7.31%)
5) BLTU = 7834 (3.32%)

rsort:
Five Most Frequent:
1) LW = 76238 (20.37%)
2) ADDI = 54419 (14.54%)
3) SW = 53704 (14.35%)
4) ADD = 51461 (13.75%)
5) SLLI = 50106 (13.39%)

towers:
Five Most Frequent:
1) ADDI = 6397 (34.29%)
2) SW = 3716 (19.92%)
3) LW = 3682 (19.74%)
4) LI* = 944 (5.06%) <=== this is part of ADDI
5) BEQ = 615 (3.30%)

vvadd:
Five Most Frequent:
1) ADDI = 3809 (31.81%)
2) LW = 2135 (17.83%)
3) BNE = 1443 (12.05%)
4) SW = 945 (7.89%)
5) ADD = 745 (6.22%)
As you can see, the most frequent RISC-V instruction is ADDI (which is also used for the LI "load immediate" assembler pseudo-instruction and some others, such as NOP and MV). The only exception is the rsort benchmark, where ADDI is 2nd and LW (load word) is 1st. Note that I am counting LI separately (its count is also included in the ADDI count) just to have visibility into its usage.
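To illustrate how LI, NOP and MV all end up counted as ADDI, here is a minimal Python sketch that classifies a 32-bit instruction word using the standard RV32I I-type field layout. The function names `encode_addi` and `classify_addi` are hypothetical and not taken from the emulator, and the rule "rs1 == x0 means LI" is an assumption about how such a counter could work, not necessarily what the emulator does.

```python
from typing import Optional

OPCODE_OP_IMM = 0x13  # opcode shared by ADDI, SLTI, XORI, ...
FUNCT3_ADDI = 0x0


def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Build an ADDI instruction word (hypothetical helper, for testing)."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (FUNCT3_ADDI << 12) | (rd << 7) | OPCODE_OP_IMM


def classify_addi(insn: int) -> Optional[str]:
    """Return 'NOP', 'LI', 'MV' or 'ADDI', or None if insn is not an ADDI."""
    if (insn & 0x7F) != OPCODE_OP_IMM or ((insn >> 12) & 0x7) != FUNCT3_ADDI:
        return None
    rd = (insn >> 7) & 0x1F
    rs1 = (insn >> 15) & 0x1F
    imm = (insn >> 20) & 0xFFF
    if imm & 0x800:            # sign-extend the 12-bit immediate
        imm -= 0x1000
    if rd == 0 and rs1 == 0 and imm == 0:
        return "NOP"           # addi x0, x0, 0
    if rs1 == 0:
        return "LI"            # addi rd, x0, imm (assumed LI rule)
    if imm == 0:
        return "MV"            # addi rd, rs1, 0
    return "ADDI"              # a real add-immediate


if __name__ == "__main__":
    for word in (0x00000013, encode_addi(10, 0, 5), encode_addi(10, 11, 0), encode_addi(2, 2, -16)):
        print(f"{word:08x} -> {classify_addi(word)}")
```

A counter built this way would increment ADDI for every non-None result and additionally increment LI* when the result is "LI", which matches the "part of ADDI" note in the towers output.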
-
Some statistics from emulator
12/23/2018 at 04:15 • 3 comments
I added some statistics calculations to the RV32I[MA] emulator (originally created by Fabrice Bellard, then modified and shared on Hackaday by @Frank Buss) and collected stats from some RISC-V benchmark tests (see https://github.com/riscv/riscv-tests/tree/master/benchmarks). With the DEBUG_EXTRA option it collects the following info, shown here for the Dhrystone benchmark as an example:
Instructions Stat:
LUI = 892
AUIPC = 7716
JAL = 11212
JALR = 12850
BEQ = 33399
BNE = 11298
BLT = 1721
BGE = 3480
BLTU = 7017
BGEU = 2248
LW = 31050
LBU = 27712
LHU = 502
SB = 4968
SH = 502
SW = 33037
ADDI = 87830
SLTIU = 1500
XORI = 1
ORI = 1
ANDI = 6151
SLLI = 10647
SRLI = 9534
SRAI = 95
ADD = 11486
SUB = 2813
SLL = 402
SLTU = 1844
SRL = 353
OR = 2459
CSRRW = 1
CSRRS = 8
LI* = 20602
Five Most Frequent:
1) ADDI = 87830 (27.05%)
2) BEQ = 33399 (10.29%)
3) SW = 33037 (10.17%)
4) LW = 31050 (9.56%)
5) LBU = 27712 (8.53%)
Memory Reading Area 80000000...80007ae2
Memory Writing Area 80001000...80007b3f
>>> Execution time: 1425296449 ns
>>> Instruction count: 324730 (IPS=227833)
>>> Jumps: 50209 (15.46%) - 18074 forwards, 32135 backwards
>>> Branching T=26147 (44.19%) F=33016 (55.81%)
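The "Five Most Frequent" part can be derived from the per-instruction counters with a simple sort. The following Python sketch is only an illustration of that idea, not the emulator's actual code. The function names are hypothetical, and it assumes the percentages are taken relative to the total instruction count (passed in explicitly, because LI* is a subset of ADDI and summing the table would count it twice).

```python
def top_five(counts, total_instructions):
    """Return the five largest counters as (name, count, percent) tuples."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:5]
    return [(name, n, 100.0 * n / total_instructions) for name, n in ranked]


def format_top_five(rows):
    """Format rows in the same style as the emulator output shown above."""
    return [f"{i}) {name} = {n} ({pct:.2f}%)" for i, (name, n, pct) in enumerate(rows, start=1)]


# Dhrystone counters copied from the listing above.
DHRYSTONE = {
    "LUI": 892, "AUIPC": 7716, "JAL": 11212, "JALR": 12850,
    "BEQ": 33399, "BNE": 11298, "BLT": 1721, "BGE": 3480,
    "BLTU": 7017, "BGEU": 2248, "LW": 31050, "LBU": 27712,
    "LHU": 502, "SB": 4968, "SH": 502, "SW": 33037,
    "ADDI": 87830, "SLTIU": 1500, "XORI": 1, "ORI": 1,
    "ANDI": 6151, "SLLI": 10647, "SRLI": 9534, "SRAI": 95,
    "ADD": 11486, "SUB": 2813, "SLL": 402, "SLTU": 1844,
    "SRL": 353, "OR": 2459, "CSRRW": 1, "CSRRS": 8,
    "LI*": 20602,  # already included in ADDI
}
DHRYSTONE_TOTAL = 324730  # "Instruction count" from the same run

if __name__ == "__main__":
    print("Five Most Frequent:")
    for line in format_top_five(top_five(DHRYSTONE, DHRYSTONE_TOTAL)):
        print(line)
```

Because LI* stays in the table, it can show up in the ranking on its own, which is what happens in the towers benchmark.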
Without the DEBUG_EXTRA option (no instruction stats and no memory usage stats) and with the -O3 option (the most aggressive speed optimization), the emulator is capable of almost 13 million instructions per second on my relatively modern AMD64 computer running Debian Linux:
>>> Execution time: 25084843 ns
>>> Instruction count: 324730 (IPS=12945267)
>>> Jumps: 50209 (15.46%) - 18074 forwards, 32135 backwards
>>> Branching T=26147 (44.19%) F=33016 (55.81%)
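The IPS figure is consistent with the instruction count divided by the execution time. As a quick check, this small Python sketch (the function name is hypothetical, and the use of integer division is an assumption that happens to reproduce the printed values) recomputes it from the two Dhrystone runs:

```python
NS_PER_SECOND = 1_000_000_000


def instructions_per_second(instruction_count, elapsed_ns):
    """Instructions per second, truncated to an integer."""
    return instruction_count * NS_PER_SECOND // elapsed_ns


if __name__ == "__main__":
    with_debug = instructions_per_second(324730, 1425296449)  # DEBUG_EXTRA build
    optimized = instructions_per_second(324730, 25084843)     # -O3, no DEBUG_EXTRA
    print(with_debug, optimized)
    print(f"speed-up: {optimized / with_debug:.1f}x")
```

By this measure the optimized build without statistics runs Dhrystone roughly 57 times faster than the DEBUG_EXTRA build.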
Here you can see that about 15% of executed instructions are jumps (cases where the PC changes to something other than the usual PC+4), and most of the jumps were backwards. Also, branches were 44% true (taken, with a jump) and 56% false (not taken, no jump). Below you can see similar stats for some other benchmarks:
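The jump and branch counters described above could be collected along these lines. This is a hedged Python sketch of the bookkeeping only: the class `JumpStats` and its method names are hypothetical, and it assumes every instruction is 4 bytes long (no compressed instructions), as in RV32I[MA].

```python
from dataclasses import dataclass


@dataclass
class JumpStats:
    instructions: int = 0
    forward: int = 0
    backward: int = 0
    taken: int = 0
    not_taken: int = 0

    def record(self, pc, next_pc, is_branch):
        """Account for one executed instruction at pc that continued at next_pc."""
        self.instructions += 1
        jumped = next_pc != pc + 4      # anything but the usual PC+4 is a jump
        if jumped:
            if next_pc > pc:
                self.forward += 1
            else:
                self.backward += 1      # a jump to itself also counts as backward
        if is_branch:                   # conditional branches only (BEQ, BNE, ...)
            if jumped:
                self.taken += 1
            else:
                self.not_taken += 1

    @property
    def jumps(self):
        return self.forward + self.backward

    def summary(self):
        branches = self.taken + self.not_taken
        return (
            f">>> Jumps: {self.jumps} ({100.0 * self.jumps / self.instructions:.2f}%)"
            f" - {self.forward} forwards, {self.backward} backwards\n"
            f">>> Branching T={self.taken} ({100.0 * self.taken / branches:.2f}%)"
            f" F={self.not_taken} ({100.0 * self.not_taken / branches:.2f}%)"
        )


if __name__ == "__main__":
    stats = JumpStats()
    stats.record(0x80000000, 0x80000004, False)  # plain instruction
    stats.record(0x80000004, 0x80000010, True)   # branch taken, forward
    stats.record(0x80000010, 0x80000014, True)   # branch not taken
    stats.record(0x80000014, 0x80000000, False)  # unconditional jump, backward
    print(stats.summary())
```

The Dhrystone numbers fit this definition: 26147 taken branches plus 11212 JAL plus 12850 JALR give exactly the 50209 jumps reported, and T+F (59163) equals the sum of the six branch instruction counters.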
median:
>>> Execution time: 1391119 ns
>>> Instruction count: 16244 (IPS=11676930)
>>> Jumps: 3552 (21.87%) - 1254 forwards, 2298 backwards
>>> Branching T=2613 (53.36%) F=2284 (46.64%)

multiply:
>>> Execution time: 4743276 ns
>>> Instruction count: 49670 (IPS=10471665)
>>> Jumps: 13808 (27.80%) - 6310 forwards, 7498 backwards
>>> Branching T=12915 (86.46%) F=2022 (13.54%)

qsort:
>>> Execution time: 19821720 ns
>>> Instruction count: 236219 (IPS=11917179)
>>> Jumps: 45487 (19.26%) - 8141 forwards, 37346 backwards
>>> Branching T=37792 (59.71%) F=25503 (40.29%)

rsort:
>>> Execution time: 31545464 ns
>>> Instruction count: 374291 (IPS=11865129)
>>> Jumps: 15239 (4.07%) - 797 forwards, 14442 backwards
>>> Branching T=14653 (73.66%) F=5239 (26.34%)

towers:
>>> Execution time: 1474786 ns
>>> Instruction count: 18656 (IPS=12649970)
>>> Jumps: 2027 (10.87%) - 762 forwards, 1265 backwards
>>> Branching T=1037 (57.20%) F=776 (42.80%)

vvadd:
>>> Execution time: 1004666 ns
>>> Instruction count: 11974 (IPS=11918388)
>>> Jumps: 1830 (15.28%) - 492 forwards, 1338 backwards
>>> Branching T=1417 (62.18%) F=862 (37.82%)
As you can see, it is very important to pipeline jumps properly. It is not enough to simply waste cycles on every wrongly fetched path, as is usually done in simple RISC hardware designs (the branch penalty). There has to be branch prediction, or even speculative execution of both paths, with the wrong path discarded once the condition becomes known.
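To get a feeling for what a fixed jump penalty would cost, here is a rough back-of-the-envelope estimate in Python. It is only a sketch under loudly stated assumptions: a hypothetical pipeline that otherwise completes one instruction per cycle, always fetches PC+4, and loses `penalty_cycles` cycles on every jump. The 2-cycle penalty is an illustrative value, not a measurement of any real design. The jump and instruction counts are the ones reported above.

```python
# (instructions, jumps) per benchmark, copied from the statistics above.
BENCHMARKS = {
    "dhrystone": (324730, 50209),
    "median": (16244, 3552),
    "multiply": (49670, 13808),
    "qsort": (236219, 45487),
    "rsort": (374291, 15239),
    "towers": (18656, 2027),
    "vvadd": (11974, 1830),
}


def effective_cpi(instructions, jumps, penalty_cycles):
    """Cycles per instruction if every jump costs a fixed penalty (assumed model)."""
    return 1.0 + penalty_cycles * jumps / instructions


if __name__ == "__main__":
    PENALTY = 2  # hypothetical penalty in cycles
    for name, (instructions, jumps) in BENCHMARKS.items():
        cpi = effective_cpi(instructions, jumps, PENALTY)
        print(f"{name:10s} CPI = {cpi:.3f}")
```

Under these assumptions Dhrystone would run at about 1.31 cycles per instruction and multiply at about 1.56, while rsort, with only 4% jumps, would barely notice the penalty. Since most jumps are backwards and most branches are taken, even a simple predictor could recover a good part of that loss.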