
check your optimization-level!

A project log for limited-code hacks/ideas

When you're running out of code-space... sometimes yah's gots to hack. Here are some ideas.

Eric Hertz, 11/28/2016 at 18:04 (8 Comments)

If working with a microcontroller, your system may already be set up for the "-Os" optimization-level, so the information here might not save you any program-memory...

-----------------------

AS I UNDERSTAND (I am by no means an expert on any of this!):

-Os is "optimize for size"

-Os basically does as much computation (of your code) as possible during the compilation-process, and tries to look for the most code-size-efficient means to compile it, rather than leaving a bunch of that code up to your processor to handle in real-time.

Contrast that with "-O0" (no optimization), where the code will be compiled almost exactly as you wrote it.

E.G. a really simple example:

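With -O0 (default): "a = 1 + 2;" might very well write the value 1 to the register containing the variable a, then add 2 to it. At least two instructions to be executed in realtime on your processor. (In practice gcc folds an all-literal expression like this one even at -O0; the difference shows up once the constants pass through variables or pointers, as in the AVR example further down.)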

With -Os "a = 1 + 2;" most-likely will result in one instruction, writing the value 3 to the register containing variable a.
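Here's a minimal sketch of that difference, for experimenting. The function names are made up for illustration, and the -mmcu value in the comment is only an example part, so substitute your own.

```c
/* fold.c: a small file to compile at different optimization levels. */

/* All-literal expression: gcc folds this to 3 in its front end, even at -O0. */
int literal_sum(void)
{
    int a = 1 + 2;
    return a;
}

/* Here the constants pass through variables. At -O0 gcc typically stores
 * x and y, reloads them, and adds at run time; at -O1 or -Os it can
 * propagate the constants and simply return 3. */
int variable_sum(void)
{
    int x = 1;
    int y = 2;
    int a = x + y;
    return a;
}

/* To compare sizes, assuming avr-gcc and avr-size are installed:
 *   avr-gcc -mmcu=attiny85 -O0 -c fold.c -o fold_O0.o
 *   avr-gcc -mmcu=attiny85 -Os -c fold.c -o fold_Os.o
 *   avr-size fold_O0.o fold_Os.o
 */
```

Both functions return the same value at every level; only the generated instructions (and their size) should differ.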

Other optimization-levels (-O1, -O2...) aren't discussed here, but check out the comments at the bottom of the page, from @Karl S, and note that they might in fact result in *larger* code than with no optimization, as they might optimize for *speed*.

-------

The key is, the optimization-level may have a lot to do with the size of your compiled-project... And it's not just a matter of "levels", but different types entirely

(e.g. some optimization-"levels" may prefer execution-speed over *size*, etc. In gcc there are also "-f<options>" which allow you to fine-tune your optimizer's preferences, and there are also pragmas(?) to choose specific optimization-levels for specific parts of your code... These are a bit beyond me...)
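For the per-function part, here's a hedged sketch of the two GCC mechanisms I'm aware of: the optimize function attribute and the "#pragma GCC optimize" family. Both are GCC-specific (other compilers may ignore them with a warning), the function names are hypothetical, and as I recall GCC's own documentation cautions against leaning on the attribute in production code. Whether a restricted toolchain honors them is something I have not tested.

```c
/* Ask for size optimization on this one function, whatever -O level
 * the rest of the file is built with (GCC-specific attribute). */
__attribute__((optimize("Os")))
int small_sum(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

/* The pragma form applies to every function defined between
 * push_options and pop_options. */
#pragma GCC push_options
#pragma GCC optimize ("O0")
int unoptimized_sum(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}
#pragma GCC pop_options
```

The two functions compute the same thing; comparing their disassembly is one way to see what a given level does to a small loop.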

So you might want to do some reading-up on the matter, and/or experiment!

-------

Here's a [multitude of] "wow"-moments I've experienced with the matter, but first some overview:

I've a macro that turns a pin on a port into an output called "setoutPORT()".

(This is for an AVR...)

#define setoutPORT(pinNum, PORTx)   \
      setbit2(DDR_FROM_PORT(PORTx), pinNum)
#define DDRPORTOFFSET   1
#define DDR_FROM_PORT(PORTx) \
      ((_MMIO_BYTE(&(PORTx) - DDRPORTOFFSET)))
#define setbit2(variable, bitNum) \
         (variable = ((variable) | (1 << (bitNum))))

(The point is to use one definition for the port-name to use with all pin-related macros, regardless of which register they actually need to access)
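To make the pointer-arithmetic easier to follow, here's a sketch that runs on a PC. The fake_io array and the FAKE_ and _SKETCH names are stand-ins I made up; the PINx, DDRx, PORTx ordering is an assumption about the classic AVR I/O map, which is what makes an offset of 1 work.

```c
#include <stdint.h>

/* Stand-ins for one port's registers, in PINx, DDRx, PORTx order. */
static volatile uint8_t fake_io[3];
#define FAKE_PINB  fake_io[0]
#define FAKE_DDRB  fake_io[1]
#define FAKE_PORTB fake_io[2]

/* Same shape as avr-libc's _MMIO_BYTE: treat an address as a byte. */
#define MMIO_BYTE_SKETCH(addr) (*(volatile uint8_t *)(addr))

#define DDRPORTOFFSET 1
#define DDR_FROM_PORT(PORTx) \
      (MMIO_BYTE_SKETCH(&(PORTx) - DDRPORTOFFSET))
#define setbit2(variable, bitNum) \
      ((variable) = ((variable) | (1 << (bitNum))))
#define setoutPORT(pinNum, PORTx) \
      setbit2(DDR_FROM_PORT(PORTx), pinNum)

/* Names only the fake PORT register, yet the bit lands in the fake DDR. */
uint8_t demo_setout(void)
{
    FAKE_PINB = 0;
    FAKE_DDRB = 0;
    FAKE_PORTB = 0;
    setoutPORT(1, FAKE_PORTB);
    return FAKE_DDRB;
}
```

Every address here is known at compile time, which is why an optimizing build can reduce the whole chain to a single constant address.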

Here's a *really* simple program using it:

#define LED_PIN  1
#define LED_PORT PORTB

int main(void)
{
   //set PB1 as an output
   setoutPORT(LED_PIN, LED_PORT);
   while(1)
   {}
}

And here's how "main" compiles with my default optimization-level (-Os):

 int main(void)
 {
    setoutPORT(1, PORTB);
   38: b9 9a          sbi   0x17, 1  ; 23
   3a: ff cf          rjmp  .-2         ; 0x3a 
 0000003c <_exit>:
   3c: f8 94          cli
 0000003e <__stop_program>:
   3e: ff cf          rjmp  .-2         ; 0x3e <__stop_program>

rjmp .-2 is the while(1) loop, it jumps back *to itself* (Sometimes, with optimization, the disassembly-output doesn't show all the original source-code, in this case it forgot the while(1){})

Wow-Moment Number Zero:

I've been using this method (setoutPORT and all its dependencies) for *years* with AVRs and have known it to (and relied on it to) compile to a single sbi instruction...

But y'all likely haven't seen it yet, so take a moment to look at all the math involved in the setoutPORT macro... That's a *lot* of math, including pointer-arithmetic.

I guess I was mistaken, because I always thought the Preprocessor was responsible for handling constant math, like that. Or, at least, that the compiler looked for constant-math inherently as an early-stage in the compilation-process (I guess the preprocessor wouldn't know much about pointer-arithmetic).

I figured the -Os part of the optimizer was only handling the conversion of

(variable = ((variable) | (1 << (bitNum))))

into an sbi, which is pretty impressive in-and-of itself.

Today's Wow-Moment:

Here's the output with no optimization (-O0):

int main(void)
{
  38: cf 93          push  r28
  3a: df 93          push  r29
  3c: cd b7          in r28, 0x3d   ; 61
  3e: de b7          in r29, 0x3e   ; 62
   setoutPORT(1, PORTB);
  40: 87 e3          ldi   r24, 0x37   ; 55
  42: 90 e0          ldi   r25, 0x00   ; 0
  44: 27 e3          ldi   r18, 0x37   ; 55
  46: 30 e0          ldi   r19, 0x00   ; 0
  48: f9 01          movw  r30, r18
  4a: 20 81          ld r18, Z
  4c: 22 60          ori   r18, 0x02   ; 2
  4e: fc 01          movw  r30, r24
  50: 20 83          st Z, r18

   while(1)
   {}
  52: ff cf          rjmp  .-2         ; 0x52 <__SREG__+0x13>

00000054 <_exit>:
  54: f8 94          cli

00000056 <__stop_program>:
  56: ff cf          rjmp  .-2         ; 0x56 <__stop_program>

That's a bit much for me to parse, right now... but it would seem, for the most-part, that math is NOT being pre-calculated at compile-time. So my single sbi instruction has gone up to 13 instructions(?!). And, I think, some of those instructions are *two-cycle* instructions!

Furthermore, note that a lot of that math involves pointers and pointer-arithmetic... And, again, *all* of that is boiled down to *constants*.

(Interestingly, it looks like the (1<<1) must've been handled outside the optimizer, as the ori is fed a constant 2).

There's some weirdness in there, though... What's with r28? Looks like it's never used...? or does r28/29 make up "Z"? And why's it bother pushing 'em, when there's no return? I'll have to look up the ol' instruction-set before I can parse this.

Regardless... That's a *LOT* of instructions... not only to be stored in program-memory, but also to be executed in real-time.

-----------------

Here's the same macro handled on PIC32...

I've only the free version of xc32-gcc, so -Os is not available. Its highest optimization-level is -O1.

(I'm not sure which of the available optimization options I used when I compiled this)

9d002ec8:   00801021    move  v0,a0
9d002ecc:   a3a20000    sb v0,0(sp)
   setoutPORT(Tx0_pin, Tx0_PORT);
9d002ed0:   3c02bf88    lui   v0,0xbf88
9d002ed4:   24426420    addiu v0,v0,25632
9d002ed8:   24420034    addiu v0,v0,52
9d002edc:   24030008    li v1,8
9d002ee0:   ac430000    sw v1,0(v0)
9d002ee4:   3c02bf88    lui   v0,0xbf88
9d002ee8:   24426420    addiu v0,v0,25632
9d002eec:   2442fff4    addiu v0,v0,-12
9d002ef0:   24030008    li v1,8
9d002ef4:   ac430000    sw v1,0(v0)

That means every time I call setoutPORT it goes through *all that math*, and that goes for other things like "writePORT" and "setpinPORT", etc. Wow.

(Of course, these macros are written slightly differently than shown for the AVR, above, as the PIC32 registers are different, but the concept is identical).

So... I'd say "write your registers directly, if code-space is a concern!"

and

"Don't expect constant-math to always be optimized-out!"

-------

Though, there must be a way to handle this more-efficiently.

One method might be using preprocessor token-pasting (the ## operator)... E.G. PORT##B or DDR##B. I tried that *long* ago, but for some reason replaced it in favor of this. I think I wasn't sure how to have e.g. #define Tx0_PORT B, and have PORT##Tx0_PORT work out (pasted directly, it yields PORTTx0_PORT; the argument has to pass through one more macro level so it expands to B first)... but I think I understand that better, now. And I wrote about this method once before (I have no idea where) and someone (I'm sorry!) commented on using that method, as well... so it should work, right?
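Here's a rough sketch of the ## idea, written so it runs on a PC. DDRB and PORTB are fake variables here (on an AVR, <avr/io.h> would supply the real ones), and the PASTE, DDR_OF, PORT_OF and setoutLETTER names are hypothetical. The second macro level is there so that Tx0_PORT gets expanded to B before the pasting happens.

```c
#include <stdint.h>

/* Fake registers so the sketch runs on a PC. */
static volatile uint8_t DDRB, PORTB;

/* Two levels: the outer macro lets an argument such as Tx0_PORT expand
 * to B before the inner macro pastes the tokens together. */
#define PASTE_RAW(a, b)  a##b
#define PASTE(a, b)      PASTE_RAW(a, b)
#define DDR_OF(letter)   PASTE(DDR, letter)
#define PORT_OF(letter)  PASTE(PORT, letter)

#define Tx0_PORT B   /* just the port letter */
#define Tx0_PIN  1

#define setoutLETTER(pinNum, letter) \
      (DDR_OF(letter) = (uint8_t)(DDR_OF(letter) | (1 << (pinNum))))

uint8_t demo_paste(void)
{
    DDRB = 0;
    PORTB = 0;
    /* Expands to an assignment on DDRB directly, no address math. */
    setoutLETTER(Tx0_PIN, Tx0_PORT);
    return DDRB;
}
```

Since the register name is resolved by the preprocessor, the compiler sees a plain register access even with no optimization, though at -O0 it may still compile to a load, an or, and a store rather than a single sbi.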

Alternatively... Maybe things as oft-used as pin-setting/clearing can be handled *explicitly* in inline-assembly, via macros... as, without -Os, even DDRB |= (1<<1) would most-likely result in several instructions, likely even an unnecessary read-modify-write.

and, maybe even

"If you can't use -Os, then you might want inline-assembly!"

(Thank you @Elliot Williams for your help getting my pic32 disassembler working to show the original source-code! That's what got me to this "Wow-Moment.")

-----------

A less-brief-than-I-intended note on inline-assembly:

It's *confusing as heck* if you want to use C-variables within your assembly-listing. And there's even an equally-confusing means to pass it constants... BUT: If you're working with constants, anyhow, you *can* put macro-names in it... It's been "a minute" since I've done-so... but it's not particularly difficult. E.G. the AVR setoutPORT call was:

sbi   0x17, 1

so, e.g. I think it'd be *SOMETHING LIKE*:

#define DDRB_STRING "0x17"
#define PIN1_STRING "1"

asm("sbi " DDRB_STRING ", " PIN1_STRING ";");

or:

#define QUOTE_THIS(x) #x
#define QUOTE_THIS_VALUE(x) QUOTE_THIS(x)

#define setout(pinNum, DDR_ADDR) \
    asm("sbi " QUOTE_THIS_VALUE(DDR_ADDR) \
          ", " QUOTE_THIS_VALUE(pinNum) ";");

called as:

setout(1, 0x17);

or maybe even:

#define LED_PIN 1
#define LED_DDR_REG 0x17
setout(LED_PIN, LED_DDR_REG); 

But I don't think you can do: setout(1, DDRB) 
since DDRB is defined as _MMIO_BYTE(something) 
which doesn't boil down to a raw number, 
rather something involving pointers.
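A quick way to check what the quoting macros produce, without an AVR toolchain, is to build the instruction text as an ordinary string. This is only a sketch: SBI_TEXT and the demo functions are names I made up, and nothing here is passed to asm().

```c
#include <string.h>

#define QUOTE_THIS(x) #x
#define QUOTE_THIS_VALUE(x) QUOTE_THIS(x)

/* Builds the instruction text only; on an AVR this would go to asm(). */
#define SBI_TEXT(pinNum, DDR_ADDR) \
      "sbi " QUOTE_THIS_VALUE(DDR_ADDR) ", " QUOTE_THIS_VALUE(pinNum)

#define LED_PIN     1
#define LED_DDR_REG 0x17

/* Two levels of quoting: the macro names expand to their values first. */
const char *demo_sbi_text(void)
{
    return SBI_TEXT(LED_PIN, LED_DDR_REG);
}

/* One level only: the macro *name* gets quoted, not its value. */
const char *demo_one_level(void)
{
    return QUOTE_THIS(LED_DDR_REG);
}

/* Returns 1 when the two-level version yields the expected text. */
int sbi_text_ok(void)
{
    return strcmp(demo_sbi_text(), "sbi 0x17, 1") == 0;
}
```

The one-level case is why QUOTE_THIS_VALUE exists: without it the assembler would be handed the text LED_DDR_REG instead of 0x17.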

(weee!)

Discussions

Aguilera Dario wrote 01/06/2017 at 05:07

I've tried all the optimizations options (avr-c++ (GCC) 6.2.0) on my code (on a atmega328p), to read from a program memory array, some minor operation and then toggle output pins, and got this:

-O3 -mint8: 884 bytes

-O3: 908 bytes

-O2 -mint8: 926 bytes

-O2: 948 bytes

-Os -mint8: 910 bytes

-Os: 930 bytes

-O0: 2582 bytes

The -mint8 option tells the compiler to only use 8bit variables (read it from some atmel app note). So, I'm sticking with the first option :-)


Eric Hertz wrote 01/06/2017 at 07:19

Interesting results! And thanks for throwing them here, could be quite useful to consider.

vvvvvvvvvvvvv

-mint8 sounds like a great idea for shrinking code-size and increasing speed.

^^^^^^^^^^^^^^^

Personally, I almost never use 'int', in favor of 'int8_t', etc., as I like to know the maximum value that can be stored, so it probably wouldn't help me much :( OTOH, it's entirely plausible libraries I use might... in which case, it could be a nice surprise!

vvvvvvvvvvvvvvvv

Especially interesting is that -O3 is smaller than -Os.

^^^^^^^^^^^^^^^^^^

I have no idea why that'd be, but I'm certainly no expert on the matters. But all the more reason to experiment!

If you're into learning and tweaking optimization details, gcc has a whole bunch of "-f<options>" ('man gcc'/'info gcc') where you can explicitly choose (or disable) individual optimization methods. One example might be -funroll-loops, which causes e.g. "for(i=0;i<2;i++) { asm("nop"); }" to be compiled as 'asm("nop"); asm("nop");', removing the loop/counter altogether. In this case, unrolling loops would certainly be smaller than compiling it as a loop. However, all it takes is i<3, or maybe i<4, to cause the loop to be more code-space efficient (albeit slower).

And if you're *really* into it, I think gcc allows for "pragmas" to implement specific optimizations on specific functions.

Thanks for sharing those results here!


Aguilera Dario wrote 01/06/2017 at 20:24

I also use the uint8_t type, but in the libraries there are some variables declared int, or something similar, and that option changes them. So it's also common sense to check that no functionality is lost.

In my case I was testing a part of the code and needed to use the _delay_ms() function from avr-libc. It wouldn't work because the value was too high, and that option was active :-/ It might well turn out to be a double-edged sword.


Eric Hertz wrote 01/06/2017 at 23:40

@Aguilera Dario, interesting, as soon as I read your earlier comment about -mint8, I wondered whether the AVR includes would handle it nicely. Interesting to hear they don't (always?) use explicit int-sizes.

Since _delay_ms(), etc. are intended (I think) to work with constant argument-values, I can see that it'd give an error if you try to give a (constant) argument that's out-of-range. But, yeah, that could be risky for other functions/macros where an argument might be passed as a variable, and C might automatically assume you intended to cast to a lower integer-size. 

E.G. 

uint16_t a = 512;

uint8_t b = a;

I *believe* there's a warning option in GCC for that, and further, a means to turn that warning into an error. (was it "implicit casting"?)

So then you know to rewrite it:

uint8_t b = (uint8_t)a;

Though, handy, those sorts of warnings aren't always reliable... in some really complex cases, gcc can't catch everything.
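For reference, the gcc option for this kind of implicit narrowing is, as far as I know, -Wconversion, and -Werror=conversion turns it into an error. Here's a small sketch; the function names are hypothetical, and narrow_saturating is just one possible way to handle an out-of-range value.

```c
#include <stdint.h>

/* Implicit narrowing: compiles quietly by default; gcc reports it when
 * built with -Wconversion (and rejects it with -Werror=conversion). */
uint8_t narrow_implicit(uint16_t a)
{
    uint8_t b = a;
    return b;
}

/* The explicit cast quiets the warning, but still keeps only the
 * low 8 bits of the value. */
uint8_t narrow_cast(uint16_t a)
{
    return (uint8_t)a;
}

/* One alternative: clamp to the largest value a uint8_t can hold. */
uint8_t narrow_saturating(uint16_t a)
{
    return (a > UINT8_MAX) ? UINT8_MAX : (uint8_t)a;
}
```

Note that the cast only documents intent: 512 still becomes 0 either way, so the warning is mostly a prompt to decide what should happen to values that don't fit.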

I've turned on so many warnings/errors that I can't keep track of 'em all anymore. Thanks for making me think of it, I've thrown my list up at:

https://hackaday.io/project/3828-commoncode-not-exclusively-for-avrs/log/51503-extra-warningserrors


Karl S wrote 11/30/2016 at 22:47

A couple of slight nit-picks.

At the start of the post:

"Most-likely your system's already set up for the highest
optimization-level (-Os), so this most-likely won't save you any
space... But e.g. the free version of xc32-gcc (for PIC32) does not
allow for -Os...
-----------------------

-Os is "optimize for size"

This basically does as much computation (of your code) as
possible during the compilation-process, rather than leaving a bunch of
that code up to your processor to handle in real-time.

Other optimization-levels tend to leave your code alone,
compiling it almost identically to how it's written. (This is easier to
debug!)"

All optimisations (other than -O0, which is a lack of optimisation) will pre-calculate constants. The point of -Os is that it prioritizes size rather than speed by disabling optimisations that take up space and enabling ones that reduce size (which aren't enabled normally because they probably reduce performance slightly).

Optimisations other than -Os such as -O2 or -O3, while not necessarily relevant for a micro, can (and in many cases, will) significantly alter your code. They're optimising for performance instead of size, and as such are normally (for PC software anyway) regarded as a higher optimisation level. I'm not sure how they're handled on micros, but I know that on x86(_64) they can re-order stuff quite a bit.

------

With regards to compilers where -Os is not available, -O1 (if available) will be better than no optimisation. -O2 is also probably worth trying if it is available.


Eric Hertz wrote 12/01/2016 at 11:36

Good points, thanks. -O1 is as far as it goes with the free version of xc32-gcc... But you make a good point about *speed* vs. *size* optimization... There may also be cases where one of the higher levels (besides -Os) might in fact *increase* code-size, if it would run faster to do-so (e.g. unrolling short loops).

I'll have to experiment more... It was my understanding that my PIC32 output (-O1) is actually calculating the address of the direction-register from the PORT-register, in realtime. Since I can't do higher-optimization experiments with that, maybe I should do lower-optimization experiments with AVR.

And your note about reordering code... That's bit me in the butt a couple times... When needing e.g. exactly one instruction-cycle between writing a port-value and writing a new one, you might write "valBefore = <some math>; valAfter = <some other math inclusive of valBefore>; PORTA = valBefore; PORTA = valAfter;"

Intentionally writing it to do all the calculations ahead of time, but optimization rendered that out, thinking it was more efficient to move the valAfter calculation between the two PORTA writes. (size-wise, yeah... it was reusing the same register for valBefore and valAfter).

All the more reason to be sure to look at your assembly-output (and write inline-assembly) if timing is important!

And all the more reason to *experiment* rather than just assume a higher optimization-level is going to be "better".


Karl S wrote 12/01/2016 at 21:44

I missed that your PIC32 output was using -O1, otherwise I wouldn't have mentioned it. But I find it surprising that it is putting the calculation of constants in code – maybe it doesn't just limit you to -O1 as such, but disables a few optimisations used there as well. I also find it surprising that, seeing GCC is under the GPL, no one has released a version that does at least a bit more optimisation – then again, I'm not a compiler writer so I wouldn't know anything about how much work that'd be.


Eric Hertz wrote 12/02/2016 at 05:19

@Karl S

Well... I should probably put on a dunce-cap about now...

I honestly don't recall which old project I grabbed that example from, and it seems there are some newer projects which (still) didn't have their settings set to use -O1, when last-compiled.

(long story involving an xc32-gcc version where -O1 resulted in an int8 math-bug, but the latest version fixed that, so now all my projects use -O1, unless the old version is detected... longer story where I somewhat-blindly added all -f<options> thinking it might improve things).

It's entirely plausible the example I grabbed was one of those.

----   I am by no means an optimizer-expert... as can be seen here (surprises-discovered!). Hopefully anyone who's reading this is taking my explanations of what I [likely mis-] understand with that in mind! ---

You're right on about xc32-gcc... I think there are, actually, projects out there to compile it from source, and allegedly there are those that can bypass the optimizer limitations. Another alternative is to use mips-gcc which doesn't have those limitations. These methods are a bit beyond me, since last I looked into it I couldn't wrap my head around how to use those with the proper linker/header-files. I have been meaning to look into it further, but so far haven't run into any brick-walls regarding speed/code-size with my pic32 projects.

Thanks for your insight, I clearly need to look into this more and do more experimenting, and obviously have a lot to learn.
