Project | Super-V | Hackaday.io

More statistics - top 5 instructions

12/25/2018 at 04:01 • 0 comments

I extended the emulator to report the five most frequent instructions. Here are the results for some RISC-V benchmarks:

dhrystone:

Five Most Frequent:
1) ADDI   = 87830 (27.05%)
2) BEQ   = 33399 (10.29%)
3) SW   = 33037 (10.17%)
4) LW   = 31050 (9.56%)
5) LBU   = 27712 (8.53%)

median:

Five Most Frequent:
1) ADDI    = 3758 (23.13%)
2) LW    = 3519 (21.66%)
3) BNE    = 1825 (11.23%)
4) BGE    = 1240 (7.63%)
5) SW    = 1141 (7.02%)

multiply:

Five Most Frequent:
1) ADDI    = 9581 (19.29%)
2) BNE    = 7309 (14.72%)
3) SLLI    = 7052 (14.20%)
4) BEQ    = 6691 (13.47%)
5) ANDI    = 6540 (13.17%)

qsort:

Five Most Frequent:
1) ADDI    = 77881 (32.97%)
2) LW    = 56308 (23.84%)
3) BLT    = 37593 (15.91%)
4) SW    = 17257 (7.31%)
5) BLTU    = 7834 (3.32%)

rsort:

Five Most Frequent:
1) LW    = 76238 (20.37%)
2) ADDI    = 54419 (14.54%)
3) SW    = 53704 (14.35%)
4) ADD    = 51461 (13.75%)
5) SLLI    = 50106 (13.39%)

towers:

Five Most Frequent:
1) ADDI    = 6397 (34.29%)
2) SW    = 3716 (19.92%)
3) LW    = 3682 (19.74%)
4) LI*    = 944 (5.06%) <=== this is part of ADDI
5) BEQ    = 615 (3.30%)

vvadd:

Five Most Frequent:
1) ADDI    = 3809 (31.81%)
2) LW    = 2135 (17.83%)
3) BNE    = 1443 (12.05%)
4) SW    = 945 (7.89%)
5) ADD    = 745 (6.22%)

As you can see, the most frequent RISC-V instruction is ADDI (which is also used for the LI "load immediate" assembler pseudo-instruction and some others, such as NOP and MV). The only exception is the rsort benchmark, where ADDI is second and LW (load word) is first. I count LI separately (its count is also included in the ADDI count) just to have visibility into its usage.
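
To illustrate how LI, MV and NOP can be told apart from a "plain" ADDI, here is a minimal Python sketch of the decoding involved. It is not the emulator's code: the function names `classify_addi` and `count_addi` are made up for this example, and the rule "rs1 == x0 means LI" is an assumption about how the LI* counter might be defined. The field positions (opcode 0x13, funct3 0, rd in bits 11..7, rs1 in bits 19..15, immediate in bits 31..20) come from the RV32I I-type encoding.

```python
from collections import Counter

OPCODE_OP_IMM = 0x13  # major opcode shared by ADDI, SLTI, XORI, ...
FUNCT3_ADDI = 0x0     # funct3 value that selects ADDI within OP-IMM


def classify_addi(insn):
    """Return 'NOP', 'LI', 'MV' or 'ADDI'; None if insn is not an ADDI."""
    if insn & 0x7F != OPCODE_OP_IMM or (insn >> 12) & 0x7 != FUNCT3_ADDI:
        return None
    rd = (insn >> 7) & 0x1F
    rs1 = (insn >> 15) & 0x1F
    imm = (insn >> 20) & 0xFFF  # raw 12-bit field; only compared with zero
    if rd == 0 and rs1 == 0 and imm == 0:
        return "NOP"            # addi x0, x0, 0
    if rs1 == 0:
        return "LI"             # addi rd, x0, imm (assumed LI* rule)
    if imm == 0:
        return "MV"             # addi rd, rs1, 0
    return "ADDI"


def count_addi(words):
    """Count every ADDI, and count the LI form a second time as 'LI*'."""
    stats = Counter()
    for word in words:
        kind = classify_addi(word)
        if kind is None:
            continue
        stats["ADDI"] += 1
        if kind == "LI":
            stats["LI*"] += 1
    return stats
```

For example, `classify_addi(0x00500513)` (that is, `li a0, 5`) returns `"LI"`, while `classify_addi(0xFF010113)` (`addi sp, sp, -16`) returns `"ADDI"`. Both would add one to the ADDI total, which is why LI* is a subset of ADDI rather than a separate instruction.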

Some statistics from emulator

12/23/2018 at 04:15 • 3 comments

I added some statistics calculations to the RV32I[MA] emulator (originally created by Fabrice Bellard, then modified and shared on Hackaday by @Frank Buss) and collected stats from some RISC-V benchmark tests (see https://github.com/riscv/riscv-tests/tree/master/benchmarks). With the DEBUG_EXTRA option it collects this info, for example from the Dhrystone benchmark:

Instructions Stat:
LUI   = 892
AUIPC   = 7716
JAL   = 11212
JALR   = 12850
BEQ   = 33399
BNE   = 11298
BLT   = 1721
BGE   = 3480
BLTU   = 7017
BGEU   = 2248
LW   = 31050
LBU   = 27712
LHU   = 502
SB   = 4968
SH   = 502
SW   = 33037
ADDI   = 87830
SLTIU   = 1500
XORI   = 1
ORI   = 1
ANDI   = 6151
SLLI   = 10647
SRLI   = 9534
SRAI   = 95
ADD   = 11486
SUB   = 2813
SLL   = 402
SLTU   = 1844
SRL   = 353
OR   = 2459
CSRRW   = 1
CSRRS   = 8
LI*   = 20602

Five Most Frequent:
1) ADDI   = 87830 (27.05%)
2) BEQ   = 33399 (10.29%)
3) SW   = 33037 (10.17%)
4) LW   = 31050 (9.56%)
5) LBU   = 27712 (8.53%)

Memory Reading Area 80000000...80007ae2
Memory Writing Area 80001000...80007b3f

>>> Execution time: 1425296449 ns
>>> Instruction count: 324730 (IPS=227833)
>>> Jumps: 50209 (15.46%) - 18074 forwards, 32135 backwards
>>> Branching T=26147 (44.19%) F=33016 (55.81%)
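
The summary lines above can be cross-checked against the per-instruction counters. The Python sketch below does that for the Dhrystone run; the numbers are copied from the output above, while the function name `summarize` and the rule "jumps = taken branches + JAL + JALR" are assumptions made for this check, not something taken from the emulator source.

```python
# Counters copied from the Dhrystone DEBUG_EXTRA output above.
COUNTS = {
    "JAL": 11212, "JALR": 12850,
    "BEQ": 33399, "BNE": 11298, "BLT": 1721,
    "BGE": 3480, "BLTU": 7017, "BGEU": 2248,
}
BRANCH_OPS = ("BEQ", "BNE", "BLT", "BGE", "BLTU", "BGEU")
INSTRUCTIONS = 324730
TIME_NS = 1425296449
TAKEN, NOT_TAKEN = 26147, 33016


def summarize(counts, taken, not_taken, instructions, time_ns):
    """Recompute the '>>>' summary values from the raw counters."""
    branches = sum(counts[op] for op in BRANCH_OPS)
    # Assumption: every JAL/JALR changes the PC to something other than PC+4.
    jumps = taken + counts["JAL"] + counts["JALR"]
    return {
        "branches": branches,
        "branches_consistent": branches == taken + not_taken,
        "jumps": jumps,
        "jump_pct": round(100 * jumps / instructions, 2),
        "taken_pct": round(100 * taken / branches, 2),
        "ips": instructions * 10**9 // time_ns,  # whole instructions per second
    }


summary = summarize(COUNTS, TAKEN, NOT_TAKEN, INSTRUCTIONS, TIME_NS)
print(summary)
```

With these inputs the six conditional branch counters add up to 59163, which equals T + F, and taken branches plus JAL plus JALR give 50209, which matches the reported jump count.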

Without the DEBUG_EXTRA option (no instruction stats and no memory usage stats) and with the -O3 option (fastest optimization level), the emulator is capable of executing almost 13 million instructions per second on my relatively modern AMD64 computer running Debian Linux:

>>> Execution time: 25084843 ns
>>> Instruction count: 324730 (IPS=12945267)
>>> Jumps: 50209 (15.46%) - 18074 forwards, 32135 backwards
>>> Branching T=26147 (44.19%) F=33016 (55.81%)

Here you can see that 15% of executed instructions are jumps (the PC changes to something other than the usual PC+4) and that most jumps were backwards. Also, 44% of branches were taken (with a jump) and 56% were not taken (no jump). Below are similar stats for some other benchmarks:
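
The counting rule described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `JumpStats`, its `record` method and the `is_branch` flag are hypothetical, and treating a jump to the same address as "backwards" is a choice made for this example.

```python
from dataclasses import dataclass


@dataclass
class JumpStats:
    instructions: int = 0
    forwards: int = 0
    backwards: int = 0
    branch_taken: int = 0
    branch_not_taken: int = 0

    @property
    def jumps(self):
        return self.forwards + self.backwards

    def record(self, pc, next_pc, is_branch):
        """Account for one executed instruction.

        pc        -- address of the instruction
        next_pc   -- address of the instruction executed after it
        is_branch -- True for conditional branches (BEQ, BNE, BLT, ...)
        """
        self.instructions += 1
        jumped = next_pc != pc + 4  # anything but the usual PC+4 is a jump
        if jumped:
            if next_pc > pc:
                self.forwards += 1
            else:
                self.backwards += 1  # includes a jump to the same address
        if is_branch:
            if jumped:
                self.branch_taken += 1
            else:
                self.branch_not_taken += 1
```

Calling `record` once per executed instruction is enough to produce all four numbers in the summary: the jump count, its forwards/backwards split, and the taken/not-taken split for conditional branches.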

median: 

>>> Execution time: 1391119 ns
>>> Instruction count: 16244 (IPS=11676930)
>>> Jumps: 3552 (21.87%) - 1254 forwards, 2298 backwards
>>> Branching T=2613 (53.36%) F=2284 (46.64%)

multiply:

>>> Execution time: 4743276 ns
>>> Instruction count: 49670 (IPS=10471665)
>>> Jumps: 13808 (27.80%) - 6310 forwards, 7498 backwards
>>> Branching T=12915 (86.46%) F=2022 (13.54%)

qsort: 

>>> Execution time: 19821720 ns
>>> Instruction count: 236219 (IPS=11917179)
>>> Jumps: 45487 (19.26%) - 8141 forwards, 37346 backwards
>>> Branching T=37792 (59.71%) F=25503 (40.29%)

rsort: 

>>> Execution time: 31545464 ns
>>> Instruction count: 374291 (IPS=11865129)
>>> Jumps: 15239 (4.07%) - 797 forwards, 14442 backwards
>>> Branching T=14653 (73.66%) F=5239 (26.34%)

towers: 

>>> Execution time: 1474786 ns
>>> Instruction count: 18656 (IPS=12649970)
>>> Jumps: 2027 (10.87%) - 762 forwards, 1265 backwards
>>> Branching T=1037 (57.20%) F=776 (42.80%)

vvadd: 

>>> Execution time: 1004666 ns
>>> Instruction count: 11974 (IPS=11918388)
>>> Jumps: 1830 (15.28%) - 492 forwards, 1338 backwards
>>> Branching T=1417 (62.18%) F=862 (37.82%)

As you can see, it is very important to pipeline jumps properly. Simple RISC hardware designs usually just waste cycles on a wrongly fetched path (the branch penalty). A better design needs branch prediction, or even speculative execution of both paths, discarding the wrong one once the condition becomes known.
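
To give a feel for what branch prediction buys, here is a small Python model of a textbook 2-bit saturating-counter predictor. It is a generic scheme, not a design taken from this project; the names `TwoBitPredictor` and `count_mispredictions`, the table size of 64 entries and the initial counter value are all assumptions chosen for the example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by the branch address."""

    def __init__(self, entries=64):
        self.entries = entries
        self.counters = [1] * entries  # 0..1 predict not taken, 2..3 taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # instructions are 4 bytes apart

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, trace):
    """trace is an iterable of (pc, taken) pairs in execution order."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop-closing branch: taken 9 times, then falls through once.
loop = [(0x80000010, True)] * 9 + [(0x80000010, False)]
print(count_mispredictions(TwoBitPredictor(), loop))
```

For this trace the model mispredicts twice (the first iteration and the loop exit), whereas a design that always assumes "not taken" would pay the branch penalty on all nine taken iterations.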