Arduino Arithmetic Acceleration Attempts... AVR Aces' Advice Appealed!

alpha_ninja wrote 05/31/2015 at 19:53 • 3 points

Alliterations are awesome! Anyway, Arduino Arithmetic appears altered by the size of variables. This makes sense, but I can't understand why it varies just the way it does.

I was starting to look into the speed of operations on ATMega328 for my project, Charles Jr.

However, some peculiarities arose when I ran some tests...

(Each reading is the time in microseconds for 100k operations. I'm assigning the result to another variable, which explains the high base time for *2.)

Code link: https://gist.github.com/alpha-ninja/befa075e2ae81810100f

----------------- BYTE
add 2x:      81904
add 3x:      107012
add 4x:      88144
multiply x2: 81856
multiply x3: 107008
multiply x4: 88148
square:      100724
cube:        125872
hypercube:   125872
----------------- UINT
add 2x:      88304
add 3x:      119588
add 4x:      100728
multiply x2: 88148
multiply x3: 119584
multiply x4: 100720
square:      138448
cube:        201332
hypercube:   176176
----------------- ULONG
add 2x:      100892
add 3x:      415128
add 4x:      163604
multiply x2: 100728
multiply x3: 415120
multiply x4: 163604
square:      578608
cube:        1138240
hypercube:   1050216

For BYTE:

add 2x, multiply x2, add 4x, multiply x4: all are low because of shifting. Makes sense.

However: why is cube == hypercube??? x*x*x should be faster than x*x*x*x, right?

For UINT:

Why does shifting by 2 places take longer than shifting by one?

Why is hypercube faster than cube? This might have the same cause as the BYTE case...

For ULONG:

Again, why does shifting by 2 places take so much longer?

And: Why is add 3x / multiply x3 suddenly so much slower than shifting?

Again, why is hypercube faster (albeit less than before...) than cube?

Discussions

alpha_ninja wrote 06/01/2015 at 01:48

@Xark Thanks for looking at this!


Xark wrote 06/01/2015 at 01:42

Hmm, some interesting questions...

Regarding the BYTE cube vs hypercube code, I moved your test into a small separate function, which makes the generated assembly much easier to understand (but I had to "fight" the optimizer, which wanted to inline it). You can see that it uses *almost* exactly the same code for cube and hypercube (the only difference is what feeds the 2nd MUL: the low byte of the first product times the original value for cube, or times itself for hypercube). This explains why the time is identical.

byte cube(byte b)
{
    return b * b * b;
}

byte hypercube(byte b)
{
    return b * b * b * b;
}

000000be <_Z4cubeh>:
be: 88 9f mul r24, r24
c0: 90 2d mov r25, r0
c2: 11 24 eor r1, r1
c4: 98 9f mul r25, r24
c6: 80 2d mov r24, r0
c8: 11 24 eor r1, r1
ca: 08 95 ret

000000cc <_Z9hypercubeh>:
cc: 88 9f mul r24, r24
ce: 80 2d mov r24, r0
d0: 11 24 eor r1, r1
d2: 88 9f mul r24, r24
d4: 80 2d mov r24, r0
d6: 11 24 eor r1, r1
d8: 08 95 ret
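
To see why two MULs are enough in both cases, here is a rough host-side model of the two listings above (a sketch only, not AVR code). The helper names mul_low8, cube_two_muls, hypercube_two_muls and verify_all_bytes are made up for illustration. It assumes the result is truncated to 8 bits, as it is with a byte return type, and checks that keeping only the low byte after each multiply gives the same answer as multiplying at full width and truncating at the end.

```cpp
#include <cassert>
#include <cstdint>

// Reference: multiply at full width (unsigned, so no overflow issues),
// then keep only the low 8 bits, as the byte return type does.
uint8_t cube_wide(uint8_t b) {
    uint32_t w = b;
    return static_cast<uint8_t>(w * w * w);
}

uint8_t hypercube_wide(uint8_t b) {
    uint32_t w = b;
    return static_cast<uint8_t>(w * w * w * w);
}

// Model of one MUL where only r0 (the low byte of the product) is kept.
uint8_t mul_low8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(static_cast<unsigned>(a) * b);
}

// mul r24,r24 ; mov r25,r0 ; mul r25,r24 ; mov r24,r0
uint8_t cube_two_muls(uint8_t b) {
    uint8_t sq = mul_low8(b, b);
    return mul_low8(sq, b);
}

// mul r24,r24 ; mov r24,r0 ; mul r24,r24 ; mov r24,r0
uint8_t hypercube_two_muls(uint8_t b) {
    uint8_t sq = mul_low8(b, b);
    return mul_low8(sq, sq);   // (b*b)*(b*b): squaring the square
}

// Check every possible byte value against the full-width reference.
void verify_all_bytes() {
    for (unsigned v = 0; v < 256; ++v) {
        uint8_t b = static_cast<uint8_t>(v);
        assert(cube_two_muls(b) == cube_wide(b));
        assert(hypercube_two_muls(b) == hypercube_wide(b));
    }
}
```

Since b*b*b*b can be grouped as (b*b)*(b*b), the fourth power costs no more multiplies than the third, which matches the identical BYTE timings.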

Similarly, for the UINT *2 *3 and *4 cases, here is what the compiler generates:

000000da <_Z6times2j>:
da: 88 0f add r24, r24
dc: 99 1f adc r25, r25
de: 08 95 ret

000000e0 <_Z6times3j>:
e0: 9c 01 movw r18, r24
e2: 22 0f add r18, r18
e4: 33 1f adc r19, r19
e6: 82 0f add r24, r18
e8: 93 1f adc r25, r19
ea: 08 95 ret

000000ec <_Z6times4j>:
ec: 88 0f add r24, r24
ee: 99 1f adc r25, r25
f0: 88 0f add r24, r24
f2: 99 1f adc r25, r25
f4: 08 95 ret

So it kind of makes sense that *4 takes twice as many ops as *2: "v + v" versus "(v+v) + (v+v)". For the *3 case the compiler does (v+v)+v, which needs a copy into temp registers, so it is slower.
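
For anyone who wants to check the add-only sequences without reading assembly, here is a small host-side sketch of the three listings above. The helper add16 is a made-up stand-in for one add/adc pair on a 16-bit value, and verify_times is likewise just for illustration; none of this is what the Arduino sketch itself contains.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for one "add + adc" pair: a 16-bit add that wraps on overflow.
uint16_t add16(uint16_t a, uint16_t b) {
    return static_cast<uint16_t>(a + b);
}

// times2: one add pair.
uint16_t times2(uint16_t v) {
    return add16(v, v);
}

// times3: copy v (the movw), double the copy, then add it to the original.
uint16_t times3(uint16_t v) {
    uint16_t doubled = add16(v, v);
    return add16(v, doubled);
}

// times4: double, then double again (two add pairs, no copy needed).
uint16_t times4(uint16_t v) {
    uint16_t doubled = add16(v, v);
    return add16(doubled, doubled);
}

// Compare against ordinary multiplication for every 16-bit input.
void verify_times() {
    for (uint32_t i = 0; i <= 0xFFFF; ++i) {
        uint16_t v = static_cast<uint16_t>(i);
        assert(times2(v) == static_cast<uint16_t>(i * 2u));
        assert(times3(v) == static_cast<uint16_t>(i * 3u));
        assert(times4(v) == static_cast<uint16_t>(i * 4u));
    }
}
```

Both times3 and times4 use two add pairs here; the extra cost of times3 in the listing comes from the movw copy, because the original value has to survive the doubling.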

I suspect the remaining questions have very similar answers. Benchmarks can be "subtle". :-)
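
One piece of background that probably matters for the ULONG shifts: AVR shift and rotate instructions move one bit in one 8-bit register at a time, so a 32-bit shift has to be built from byte-sized steps. The sketch below is a host-side model of that idea, not the code avr-gcc actually emits (the thread does not show the ULONG assembly, so treat the details as an assumption). The names ShiftResult and shift_left_bytewise are invented for illustration, and byte_ops counts modeled steps, not cycles.

```cpp
#include <cassert>
#include <cstdint>

struct ShiftResult {
    uint32_t value;     // shifted result
    unsigned byte_ops;  // number of single-byte shift/rotate steps modeled
};

// Shift a 32-bit value left one bit at a time, one byte at a time,
// passing the carry from each byte into the next higher byte.
ShiftResult shift_left_bytewise(uint32_t x, unsigned places) {
    uint8_t bytes[4];
    for (int i = 0; i < 4; ++i) {
        bytes[i] = static_cast<uint8_t>(x >> (8 * i));  // bytes[0] is the low byte
    }

    unsigned ops = 0;
    for (unsigned p = 0; p < places; ++p) {
        uint8_t carry = 0;
        for (int i = 0; i < 4; ++i) {
            uint8_t next_carry = static_cast<uint8_t>(bytes[i] >> 7);
            bytes[i] = static_cast<uint8_t>((bytes[i] << 1) | carry);
            carry = next_carry;
            ++ops;
        }
    }

    uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        result |= static_cast<uint32_t>(bytes[i]) << (8 * i);
    }
    return ShiftResult{result, ops};
}
```

In this model a 1-place shift of a 32-bit value takes 4 byte steps and a 2-place shift takes 8, compared with 2 and 4 for a 16-bit value, so the gap between *2 and *4 should grow with the variable size. Whether that fully accounts for the measured ULONG numbers would need the actual disassembly.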

