Arduino Arithmetic Acceleration Attempts... AVR Aces' Advice Appealed!
alpha_ninja wrote 05/31/2015 at 19:53 • 3 points
Alliterations are awesome! Anyway, Arduino Arithmetic appears altered by the size of variables. This makes sense, but I can't understand why it varies just the way it does.
I was starting to look into the speed of operations on ATMega328 for my project, Charles Jr.
However, some peculiarities arose when I ran some tests...
(Each reading is the time in microseconds for 100k operations. I'm assigning the result to another variable, which explains the high base time for *2.)
Code link: https://gist.github.com/alpha-ninja/befa075e2ae81810100f
----------------- BYTE
add 2x: 81904
add 3x: 107012
add 4x: 88144
multiply x2: 81856
multiply x3: 107008
multiply x4: 88148
square: 100724
cube: 125872
hypercube: 125872
----------------- UINT
add 2x: 88304
add 3x: 119588
add 4x: 100728
multiply x2: 88148
multiply x3: 119584
multiply x4: 100720
square: 138448
cube: 201332
hypercube: 176176
----------------- ULONG
add 2x: 100892
add 3x: 415128
add 4x: 163604
multiply x2: 100728
multiply x3: 415120
multiply x4: 163604
square: 578608
cube: 1138240
hypercube: 1050216
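To compare these readings with instruction counts, it can help to convert them into clock cycles per loop pass. The sketch below is only an illustration: it assumes a 16 MHz clock (common for ATmega328-based Arduinos, but not stated in this post) and the 100k passes mentioned above, and the function name cyclesPerIteration is made up for this example.

```cpp
#include <cstdint>

// Convert a benchmark reading (microseconds for `iterations` loop passes)
// into CPU cycles per pass.
// ASSUMPTION: 16 MHz clock; the post does not state the clock speed.
constexpr double cyclesPerIteration(uint32_t micros,
                                    uint32_t iterations = 100000,
                                    double clockMHz = 16.0)
{
    return micros * clockMHz / iterations;
}
```

Under that assumption, the BYTE "multiply x2" reading (81856) works out to about 13.1 cycles per pass and "square" (100724) to about 16.1, so squaring costs roughly 3 extra cycles per pass.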
For BYTE:
add 2x, multiply x2, add 4x, and multiply x4 are all low because of shifting. Makes sense.
However: why is cube == hypercube??? x*x*x should be faster than x*x*x*x, right?
For UINT:
Why does shifting by 2 places take longer than shifting by one?
Why is hypercube faster than cube? This might have the same cause as the BYTE result...
For ULONG:
Again, why does shifting by 2 places take so much longer?
And: Why is add 3x / multiply x3 suddenly so much slower than shifting?
Again, why is hypercube faster than cube (albeit by less than before)?
Discussions
@Xark Thanks for looking at this!
Hmm, some interesting questions...
Regarding the BYTE cube vs. hypercube code, I moved your test into a small separate function, which makes the generated assembly much easier to understand (though I had to "fight" the optimizer, which wanted to inline it). You can see that it uses *almost* exactly the same code for cube and hypercube: both need just two MULs, and the only difference is whether the second MUL multiplies the low byte of the first product by the original value (cube) or by itself (hypercube). This explains why the times are identical.
byte cube(byte b)
{
    return b * b * b;
}
byte hypercube(byte b)
{
    return b * b * b * b;
}
000000be <_Z4cubeh>:
be: 88 9f mul r24, r24
c0: 90 2d mov r25, r0
c2: 11 24 eor r1, r1
c4: 98 9f mul r25, r24
c6: 80 2d mov r24, r0
c8: 11 24 eor r1, r1
ca: 08 95 ret
000000cc <_Z9hypercubeh>:
cc: 88 9f mul r24, r24
ce: 80 2d mov r24, r0
d0: 11 24 eor r1, r1
d2: 88 9f mul r24, r24
d4: 80 2d mov r24, r0
d6: 11 24 eor r1, r1
d8: 08 95 ret
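One way to read the two listings above is as plain C++ that keeps only the low byte after each multiply. This is a hand-written sketch for checking on a desktop compiler, not compiler output; the function names cubeLowByte and hypercubeLowByte are made up for this example.

```cpp
#include <cstdint>

// Mirrors the disassembly: each step keeps only the low byte of the product
// (the byte that MUL leaves in r0), then multiplies again.
uint8_t cubeLowByte(uint8_t b)
{
    uint8_t sq = static_cast<uint8_t>(b * b);   // mul r24, r24 ; mov r25, r0
    return static_cast<uint8_t>(sq * b);        // mul r25, r24 ; mov r24, r0
}

uint8_t hypercubeLowByte(uint8_t b)
{
    uint8_t sq = static_cast<uint8_t>(b * b);   // mul r24, r24 ; mov r24, r0
    return static_cast<uint8_t>(sq * sq);       // mul r24, r24 ; mov r24, r0
}
```

Because the result is a byte, only the low 8 bits matter, so squaring the square gives the same answer as b * b * b * b with one fewer multiply. Both functions end up with exactly two multiplies.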
Similarly, for the UINT *2 *3 and *4 cases, here is what the compiler generates:
000000da <_Z6times2j>:
da: 88 0f add r24, r24
dc: 99 1f adc r25, r25
de: 08 95 ret
000000e0 <_Z6times3j>:
e0: 9c 01 movw r18, r24
e2: 22 0f add r18, r18
e4: 33 1f adc r19, r19
e6: 82 0f add r24, r18
e8: 93 1f adc r25, r19
ea: 08 95 ret
000000ec <_Z6times4j>:
ec: 88 0f add r24, r24
ee: 99 1f adc r25, r25
f0: 88 0f add r24, r24
f2: 99 1f adc r25, r25
f4: 08 95 ret
So it makes sense that *4 takes twice as many add instructions as *2: *2 is one 16-bit add ("v + v"), while *4 doubles the value twice ("(v+v) + (v+v)"). For the *3 case the compiler does (v+v)+v, which needs a temporary register pair and an extra MOVW, so it is slower.
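The ADD/ADC pairs in these listings can be modelled on a desktop compiler to see how many 16-bit adds each case needs. This is a rough sketch, not AVR code: the Reg16 struct and the helper names are invented for this example, and it only models the arithmetic, not the timing.

```cpp
#include <cstdint>

// A 16-bit value held in two 8-bit registers, as on AVR (lo = r24, hi = r25).
struct Reg16 {
    uint8_t lo;
    uint8_t hi;
};

// One ADD/ADC pair: add the low bytes, then the high bytes plus the carry.
Reg16 addPair(Reg16 a, Reg16 b)
{
    unsigned lo = a.lo + b.lo;               // add
    unsigned hi = a.hi + b.hi + (lo >> 8);   // adc (carry out of the low byte)
    return { static_cast<uint8_t>(lo), static_cast<uint8_t>(hi) };
}

// times2: one pair. times4: two pairs, done in place.
Reg16 times2(Reg16 v) { return addPair(v, v); }
Reg16 times4(Reg16 v) { v = addPair(v, v); return addPair(v, v); }

// times3: two pairs plus a copy into a temporary register pair.
Reg16 times3(Reg16 v)
{
    Reg16 tmp = v;             // movw r18, r24
    tmp = addPair(tmp, tmp);   // tmp = 2v
    return addPair(v, tmp);    // v + 2v
}

// Helpers to convert between a plain uint16_t and the register pair.
uint16_t toU16(Reg16 r) { return static_cast<uint16_t>(r.lo | (r.hi << 8)); }
Reg16 fromU16(uint16_t x)
{
    return { static_cast<uint8_t>(x & 0xFF), static_cast<uint8_t>(x >> 8) };
}
```

For example, toU16(times3(fromU16(1000))) gives 3000, and results wrap modulo 65536 just as an unsigned int does on the ATmega328.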
I suspect the remaining questions have very similar answers. Benchmarks can be "subtle". :-)