-
Tiny Circular Buffer - back to linear
12/30/2016 at 04:11 • 0 comments
I need to add elements to the end of a buffer, and remove elements from the beginning... It can have from 0 to 4 elements loaded at a time.
The de-facto answer might be a circular-buffer.
But this buffer is only 4 elements long...
It would seem that implementing this as a simple array would be more efficient, in my case. Yes, it means that when I remove an element from the beginning, I have to shift the remaining data to the beginning... And, the de-facto answer might be a for() loop...
But, again, there's only four elements. So, unroll that loop, as well.
(Note that the optimizer can look at short fixed-count for() loops, and automatically "unroll" them... I'm pretty much certain that this case will be smaller unrolled, and I'm not entirely convinced my optimizer-settings will do-so, so I'll type it manually.)
-----------
Interestingly, doing this as a simple array, rather than a circular-buffer, also dramatically decreased the amount of code in nearly every other function, e.g. buffer_add(), buffer_countElements(), buffer_isFull(), buffer_clear()... In fact, many of these functions are now merely comparisons/assignments to a variable such as buffer_itemCount. So, now, whereas I had multi-line functions that *could*'ve been inlined to reduce code-space in a few cases, now it's *definitely* more code-space (and execution-time!) efficient to inline these functions in every case.
-
pointer idea...
12/30/2016 at 02:25 • 0 comments
I have to add several values from several pointers...
e.g.
```c
uint16_t *pA = NULL;
uint16_t *pB = NULL;
uint16_t *pC = NULL;

/* < a bunch of code that assigns pA, pB, and/or pC > */

uint16_t value = *pA + *pB + *pC;
```
BUT any and/or all of these pointers may be unassigned... in which case, they should not be considered in the sum for value.
Of course, using NULL pointers makes sense, to indicate whether they've been assigned. But, as I understand the C standard, you're not supposed to *access* a NULL address... You're only supposed to *test* whether a pointer is NULL.
(e.g. address-zero may be a reset-vector, which probably contains a "jump" instruction, which most-likely is NOT zero, in value).
So, again, if I understand correctly, the "right" way to handle these potential NULL pointers would be something like:
```c
uint16_t *pA = NULL;
uint16_t *pB = NULL;
uint16_t *pC = NULL;

/* < a bunch of code that assigns pA, pB, and/or pC > */

uint16_t value = 0;
if(pA) value  = *pA;
if(pB) value += *pB;
if(pC) value += *pC;
```
That's a lot of tests! Surely they'll add up in code-space... Instead, what about:
```c
uint16_t zeroWord = 0;
uint16_t *pA = &zeroWord;
uint16_t *pB = &zeroWord;
uint16_t *pC = &zeroWord;

/* < a bunch of code that assigns pA, pB, and/or pC > */

uint16_t value = *pA + *pB + *pC;
```
-
here's an idea... parsing
12/09/2016 at 17:34 • 0 comments
This is just a random-realization while working on my project... maybe it's obvious to everyone in-the-know.
Say you're parsing something, like commands from a terminal-window...
Say you've got a whole bunch of commands, but they mostly all follow the same handful of formats, like Ax and Ay, Bx and By, etc.
Lemme think of an example...
Say motor commands (terminated with '\n'):
SMn = stop motor number n (where n is one character, 0-9)
AMnx = advance motor N by x millimeters (where x is any number)
FMnx = move motorN forwards at x mm/sec
and so-forth.
One could, obviously, parse each character as it comes through, then do a whole bunch of nested if-then statements.
Another idea is to combine the first and second characters into a single 16-bit variable, then use a switch() statement. Maybe not ideal for *this* example, but I've found it useful at times.
So, that's one consideration, here's another:
Say this motor-system also has LED-commands:
BLnx = blink LED n x times per second
Obviously the n and the x, here, don't apply to a motor...
So, then, the whole nested-if statement thing makes sense, again, right?
But we're worried about *size* here, not speed... (baud-rate's way slower than your processor, right?)
So, then, maybe it makes sense to parse 'n' and 'x' and store them in argument-variables, and only *after that* handle the actual Command characters.
"But wait! 'SMn' doesn't have an x!"... Right... but here's the idea:
Say everything's stored in a string-buffer... and whatever sits after the '\n' may very well be leftover data from a previous command... But your numeric-parser for x terminates as soon as anything non-numeric comes through ('\n'; or just replace the '\n' and it'll be terminated with '\0')... so the variable for argument x will be filled with 0. But even that doesn't matter, because it's not being used, in this case...
Then why parse it if it's not even part of the command, and isn't even there in the first place?
So, here it is without...
```c
uint16_t command = string[0] | (string[1])<<8;
uint8_t deviceNum = string[2] - '0';
char *value = &(string[3]);

#define commandify(a,b) \
   ((uint16_t)a | ((uint16_t)b)<<8)

switch(command)
{
   case commandify('S','M'):
      motor_stop(deviceNum);
      break;
   case commandify('A','M'):
      mm = parseNumber(value);
      motor_advance(deviceNum, mm);
      break;
   ...
   case commandify('B','L'):
      rate = parseNumber(value);
      led_blink(deviceNum, rate);
      break;
   ...
}
```
So, now, for the example described, you've either got to call 'parseNumber()' three times for the three commands that use it, or explicitly handle the 'S' case separately from the switch, (makes sense, unless there are *several* such cases, in which case your test becomes quite large, maybe even a second switch-statement).
Or, just parse the number from the start, and don't use it if you don't need to.
And, let's make it even more interesting, what if there's another command:
PSs = Print string s
Could *still* call parseNumber, *and* fill deviceNum (both with garbage) and have a really simple switch-statement (maybe even a lookup-table, at this point):
```c
uint16_t command = string[0] | (string[1])<<8;
uint8_t deviceNum = string[2] - '0';

#define commandify(a,b) \
   ((uint16_t)a | ((uint16_t)b)<<8)

//### No way you're gonna get away with floats
//    in a 1K project, without an FPU ;)
float value = parseNum(&(string[2]));

switch(command)
{
   case commandify('S','M'):
      motor_stop(deviceNum);
      break;
   case commandify('A','M'):
      motor_advance(deviceNum, value);
      break;
   ...
   case commandify('B','L'):
      led_blink(deviceNum, value);
      break;
   ...
   case commandify('P','S'):
      printf("%s", &(string[2]));
      break;
   ...
}
```
-
code-size helpers (in the form of a makefile)
12/09/2016 at 16:13 • 0 comments
Here's a minimal makefile for tracking your code-size, etc...
(This doesn't yet create the hex-file for flashing!)
```make
default: build lss size

#Compile, optimize for size
build:
	avr-gcc -mmcu=atmega8515 -Os -o main.out main.c

#Create an assembly-listing (with C code alongside)
#Check out main.lss!
lss:
	avr-objdump --disassemble-zeroes -h -S main.out > main.lss

#Output the sizes of the various sections
# written to flash = .text + .data
size:
	avr-size main.out

clean:
	rm -f main.out main.lss
```
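Since the makefile above doesn't yet create the hex-file, here's a sketch of the usual rule for that, using avr-objcopy (standard flags, but untested in this exact setup -- check your own toolchain):

```make
#Create the hex-file for flashing
# (this is where .text + .data actually end up in flash)
hex:
	avr-objcopy -O ihex -R .eeprom main.out main.hex
```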
-
check your optimization-level!
11/28/2016 at 18:04 • 8 comments
If working with a microcontroller, your system may already be set up for the "-Os" optimization-level, so the information here might not save you any program-memory...
-----------------------AS I UNDERSTAND (I am by no means an expert on any of this!):
-Os is "optimize for size"
-Os basically does as much computation (of your code) as possible during the compilation-process, and tries to look for the most code-size-efficient means to compile it, rather than leaving a bunch of that code up to your processor to handle in real-time.
Contrast that with "-O0" (no optimization), where the code will be compiled almost exactly as you wrote it.
E.G. a really simple example:
With -O0 (default): "a = 1 + 2;" might very well write the value 1 to the register containing the variable a, then add 2 to it. At least two instructions to be executed in realtime on your processor.
With -Os "a = 1 + 2;" most-likely will result in one instruction, writing the value 3 to the register containing variable a.
Other optimization-levels (-O1, -O2...) aren't discussed here, but check out the comments at the bottom of the page from @Karl S, and note that they might in fact result in *larger* code than with no optimization, as they might optimize for *speed*.
-------
The key is, the optimization-level may have a lot to do with the size of your compiled-project... And it's not just a matter of "levels", but different types entirely
(e.g. some optimization-"levels" may prefer execution-speed over *size*, etc. In gcc there are also "-f<options>" which allow you to fine-tune your optimizer's preferences, and there are also pragmas(?) to choose specific optimization-levels for specific parts of your code... These are a bit beyond me...)
So you might want to do some reading-up on the matter, and/or experiment!
-------
Here's a [multitude of] "wow"-moments, I've experienced with the matter, but first some overview:
I've a macro, called "setoutPORT()", that turns a pin on a port into an output.
(This is for an AVR...)
```c
#define setoutPORT(pinNum, PORTx) \
   setbit2(DDR_FROM_PORT(PORTx), pinNum)

#define DDRPORTOFFSET 1
#define DDR_FROM_PORT(PORTx) \
   ((_MMIO_BYTE(&(PORTx) - DDRPORTOFFSET)))

#define setbit2(variable, bitNum) \
   (variable = ((variable) | (1 << (bitNum))))
```
(The point is to use one definition for the port-name to use with all pin-related macros, regardless of which register they actually need to access)
Here's a *really* simple program using it:
```c
#define LED_PIN  1
#define LED_PORT PORTB

int main(void)
{
   //set PB1 as an output
   setoutPORT(LED_PIN, LED_PORT);

   while(1)
   {}
}
```
And here's how "main" compiles with my default optimization-level (-Os):
```
int main(void)
{
   setoutPORT(1, PORTB);
  38: b9 9a       sbi  0x17, 1   ; 23
  3a: ff cf       rjmp .-2       ; 0x3a

0000003c <_exit>:
  3c: f8 94       cli

0000003e <__stop_program>:
  3e: ff cf       rjmp .-2       ; 0x3e <__stop_program>
```
rjmp .-2 is the while(1) loop; it jumps back *to itself*. (Sometimes, with optimization, the disassembly-output doesn't show all the original source-code; in this case it forgot the while(1){}.)
Wow-Moment Number Zero: I've been using this method (setoutPORT and all its dependencies) for *years* with AVRs and have known it to (and relied on it to) compile to a single sbi instruction...
But y'all likely haven't seen it yet, so take a moment to look at all the math involved in the setoutPORT macro... That's a *lot* of math, including pointer-arithmetic.
I guess I was mistaken, because I always thought the Preprocessor was responsible for handling constant math, like that. Or, at least, that the compiler looked for constant-math inherently as an early-stage in the compilation-process (I guess the preprocessor wouldn't know much about pointer-arithmetic).
I figured the -Os part of the optimizer was only handling the conversion of
(variable = ((variable) | (1 << (bitNum))))
into an sbi, which is pretty impressive in-and-of itself.
Today's Wow-Moment:
Here's the output with no optimization (-O0):
```
int main(void)
{
  38: cf 93       push r28
  3a: df 93       push r29
  3c: cd b7       in   r28, 0x3d  ; 61
  3e: de b7       in   r29, 0x3e  ; 62
   setoutPORT(1, PORTB);
  40: 87 e3       ldi  r24, 0x37  ; 55
  42: 90 e0       ldi  r25, 0x00  ; 0
  44: 27 e3       ldi  r18, 0x37  ; 55
  46: 30 e0       ldi  r19, 0x00  ; 0
  48: f9 01       movw r30, r18
  4a: 20 81       ld   r18, Z
  4c: 22 60       ori  r18, 0x02  ; 2
  4e: fc 01       movw r30, r24
  50: 20 83       st   Z, r18
   while(1)
   {}
  52: ff cf       rjmp .-2        ; 0x52 <__SREG__+0x13>

00000054 <_exit>:
  54: f8 94       cli

00000056 <__stop_program>:
  56: ff cf       rjmp .-2        ; 0x56 <__stop_program>
```
That's a bit much for me to parse, right now... but it would seem, for the most-part, that math is NOT being pre-calculated at compile-time. So my single sbi instruction has gone up to 13 instructions(?!). And, I think, some of those instructions are *two-cycle* instructions! Furthermore, note that a lot of that math involves pointers and pointer-arithmetic... And, again, with -Os, *all* of that is boiled down to *constants*.
(Interestingly, it looks like the (1<<1) must've been handled outside the optimizer, as the ori is fed a constant 2).
There's some weirdness in there, though... What's with r28? Looks like it's never used...? or does r28/29 make up "Z"? And why's it bother pushing 'em, when there's no return? I'll have to look up the ol' instruction-set before I can parse this.
Regardless... That's a *LOT* of instructions... not only to be stored in program-memory, but also to be executed in real-time.
-----------------
Here's the same macro handled on PIC32...
I've only the free version of xc32-gcc, so -Os is not available. Its highest optimization-level is -O1.
(I'm not sure which of the available optimization options I used when I compiled this)
```
9d002ec8: 00801021   move  v0,a0
9d002ecc: a3a20000   sb    v0,0(sp)
   setoutPORT(Tx0_pin, Tx0_PORT);
9d002ed0: 3c02bf88   lui   v0,0xbf88
9d002ed4: 24426420   addiu v0,v0,25632
9d002ed8: 24420034   addiu v0,v0,52
9d002edc: 24030008   li    v1,8
9d002ee0: ac430000   sw    v1,0(v0)
9d002ee4: 3c02bf88   lui   v0,0xbf88
9d002ee8: 24426420   addiu v0,v0,25632
9d002eec: 2442fff4   addiu v0,v0,-12
9d002ef0: 24030008   li    v1,8
9d002ef4: ac430000   sw    v1,0(v0)
```
That means every time I call setoutPORT it goes through *all that math*, and that goes for other things like "writePORT" and "setpinPORT", etc. Wow. (Of course, these macros are written slightly differently than shown for the AVR, above, as the PIC32 registers are different, but the concept is identical.)
So... I'd say "write your registers directly, if code-space is a concern!"
and
"Don't expect constant-math to always be optimized-out!"
-------
Though, there must be a way to handle this more-efficiently.
One method might be using preprocessor string-concatenation... E.G. PORT##B or DDR##B. I tried that *long* ago, but for some reason replaced it in favor of this. I think I wasn't sure how to have e.g. #define Tx0_PORT B, and have PORT##Tx0_PORT work out... but I think I understand that better, now. And I wrote about this method once before (I have no idea where) and someone (I'm sorry!) commented on using that method, as well... so it should work, right?
Alternatively... Maybe things as oft-used as pin-setting/clearing can be handled *explicitly* in inline-assembly, via macros... as, without -Os, even DDRB |= (1<<1) would most-likely result in several instructions, likely even an unnecessary read-modify-write.
and, maybe even
"If you can't use -Os, then you might want inline-assembly!"
(Thank you @Elliot Williams for your help getting my pic32 disassembler working to show the original source-code! That's what got me to this "Wow-Moment.")
-----------
A less-brief-than-I-intended note on inline-assembly:
It's *confusing as heck* if you want to use C-variables within your assembly-listing. And there's even an equally-confusing means to pass it constants... BUT: If you're working with constants, anyhow, you *can* put macro-names in it... It's been "a minute" since I've done-so... but it's not particularly difficult. E.G. the AVR setoutPORT call was:
```
sbi 0x17, 1
```
so, e.g. I think it'd be *SOMETHING LIKE*:
```c
#define DDRB_STRING "0x17"
#define PIN1_STRING "1"

asm("sbi " DDRB_STRING ", " PIN1_STRING ";");
```
or:
```c
#define QUOTE_THIS(x)       #x
#define QUOTE_THIS_VALUE(x) QUOTE_THIS(x)

#define setout(pinNum, DDR_ADDR) \
   asm("sbi " QUOTE_THIS_VALUE(DDR_ADDR) \
       ", " QUOTE_THIS_VALUE(pinNum) ";");
```
called as:
```c
setout(1, 0x17);
```
or maybe even:
```c
#define LED_PIN     1
#define LED_DDR_REG 0x17

setout(LED_PIN, LED_DDR_REG);
```
But I don't think you can do setout(1, DDRB) since DDRB is defined as _MMIO_BYTE(something) which doesn't boil down to a raw number, rather something involving pointers.
(weee!)
-
Put some code-space in your "savings account"!
11/28/2016 at 03:10 • 5 comments
This may seem a bit ridiculous, but believe me, it's turned out useful *quite-often* when expecting a project might eventually reach code-space limitations...
Throw something "big" in your project that doesn't do anything important... At the very start of the development-process. E.G.:
```c
char wastedSpace[80] = { [0 ... 78] = 'a' };
```
Hide it somewhere so you forget about it... Then when your project has gone from 512B to 960B, and suddenly in the next-revision it's gone from 960 to 1025... You'll go "oh sh**", then probably start looking at your code trying to figure out some ways to make it smaller... (maybe a good thing)... Then eventually you might step-back a bit frustrated and... eventually... remember that there's a sizable chunk you can take out with no consequences whatsoever, and continue your progress without having to change anything already-functional. Consider it a terrifying--and then relieving--warning.
This example works for both program-memory as well as RAM... But there are other ways to do similar, which may even be useful in the meantime. E.G. I usually throw in a fading "heartbeat" LED... That code can be removed, entirely, from my project by merely defining HEART_REMOVED, regaining a few hundred bytes on the spot, and rendering a project which would've stalled due to a few bytes into one which can [cautiously] continue development for quite some time thereafter.
(NOTE that SOME OPTIMIZERS might look at something like the above and recognize that it's never used, then "optimize it out". So, keep that in mind... Some other methods might be to e.g. throw something in PROGMEM. Might be a good idea to write an empty project, compile it, look at the code-size, then add your "savings account" and make sure that code-size increases as expected.)
Another thing I regularly do that "uses up space" is to throw project-info into the Flash-memory in a *human-readable* format... That way I can, years down the line, read-back the flash-memory from a chip and determine things like which project it is, which version-number, and what date it was compiled on... That info is definitely useful later down the road, but not *essential*, so potentially hundreds of bytes can be removed by removing that information. (That information is automatically-generated into "projinfo.h" by my 'makefile', and projinfo.h is then #included in main.c... so to remove it, just comment-out that #include.)
-
Multiplication/Division and Floating-point-removal
11/27/2016 at 14:01 • 0 comments
There may be a lot of cases where floating-point "just makes sense" in the intuitive-sense...
But Floating-Point math takes a *lot* of code-space on systems which don't have an FPU.
I'm not going to go into too much detail, here... but consider the following
```c
float slope = 4.0/3.0;
int pixelY = slope * pixelX + B;
```
This can be *dramatically* simplified in both execution-time and code-size by using:
```c
int slope_numerator = 4;
int slope_denominator = 3;
int pixelY = slope_numerator * pixelX / slope_denominator + B;
```
Note that the order matters!
Do the multiplication *first*, and only do the division at the *very end*.
(Otherwise, you'll lose precision!)
Note, also, that doing said-multiplication *first* might require you to bump up your integer-sizes...
You may know that pixelX and pixelY will fit in uint8_t's, but pixelX*slope_numerator may not.
So, I can never remember the integer-promotion rules, so I usually just cast as best I can recall:
```c
uint8_t B = 0;
uint8_t pixelX = 191;
uint8_t pixelY = (uint16_t)slope_numerator * (uint16_t)pixelX
                 / slope_denominator + B;
```
Don't forget all the caveats, here... You're doing 16-bit math, probably throughout the entire operation, but the result is being written into an 8-bit variable... The case above results in pixelY = 254, but what if B was 2? (254 + 2 = 256, which wraps around to 0 in a uint8_t!)
------
Regardless of the casting, and the additional variable, this is almost guaranteed to be *much* faster and *much* smaller than using floating-point.
----------
@Radomir Dopieralski strikes again!
I was planning on writing up about iterative-computations, next... but it's apparently already got a name and a decent write-up, so check out: https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm
(whew, so much energy saved by linking!)
THE JIST: The above examples (y=mx+b) should probably *NOT* be used for line-drawing!
They're just easy/recognizable math-examples for this writeup to show where floating-point can be removed.
On a system where you *have* to iterate through, anyhow (e.g. when drawing every pixel on a line, you have to iterate through every pixel on the line), you can turn the complicated math (e.g. y=mx+b) containing multiplications/divisions into math which *only* contains addition/subtraction.
(Think of it this way: how did you calculate 50/5 way back in the day? "How many times does 5 go into 50?" One way is to subtract 5 from 50, then 5 from 45, then 5 from 40, and so on, counting the number of times until you reach 0. Whelp, computers are great at that sorta thing, and even better at it when you have to do looping for other things as well.)
-
Volatiles!
11/26/2016 at 11:50 • 0 comments
You may be familiar with "volatile" variables...
If not--and your project works, reliably-enough--then ignore this, because you've got 1kB to fit your code into, in a short period of time, and you're not worried about your project threatening lives...
(If those "if"s and "because"s aren't true, then be sure to check out an explanation of volatile here: https://hackaday.io/project/5624/log/49037-interrupts-volatiles-and-atomics-and-maybe-threads )
------
The thing with volatile is that it's absolutely essential to understand how/when to use it, if you're doing *anything* where a person could be hurt.
The thing with fitting your code in 1kB to blink some LEDs or load an LCD-display is that you probably don't care, as long as it works most of the time.
I'm *in no way* suggesting you ignore this stuff habitually. You *definitely* need to be aware of it if you're ever going to do anything where others' safety is concerned, and, realistically, probably need to be aware of it even where *functionality* is concerned.
But, that-said... It's easy to get into the "habit" of believing that "volatile" is a safe-ish way to make sure you won't run into trouble... And that's not exactly the case.
AND, that-said... If you just use volatile, and the other techniques explained at that link, willy-nilly, then you might run into *excessive code-usage*.
So, all's I'mma say, here, is... if you're using them willy-nilly, and if you're in a tremendous space-crunch like this contest, consider those cases carefully... You might save yourself a few (numerous/countless) bytes if you *don't* use them where you *know* you don't *need* them.
-
Multiplication / Division...
11/26/2016 at 10:54 • 19 comments
These can be *huge* operations... Here are some thought-points on alternatives.
Again, these techniques won't save you much space (nor maybe *any*) if you use libraries which make use of them... So, when space is a concern, you're probably best-off not using others' libraries.
------
So, here's an alternative... Say you need to do a multiplication of an integer by 4...
A *really* fast way of doing-so is (a<<2), two instructions on an AVR.
If you need to do a division by 4? (a>>2), two instructions on an AVR.
(Beware that signed-integer operations may be a bit more difficult).
.....
Another alternative is... a bit less-self-explanatory, and likely quite a bit more messy...
In most cases, there will be *numerous* functions automatically-generated which handle multiplications/divisions between integers of different sizes. That's a lot of code generated which mightn't be necessary with some pre-planning.
(and don't even try to use floating-point... I'm not certain, but I'm guessing a floating-point division function alone is probably close to 1kB).
----------
ON THE OTHER HAND: Some architectures have inbuilt support for some of these things... E.G. whereas
```c
(a<<3)
```
might require three instructions on any AVR,
```c
(uint8_t)a * (uint8_t)8
```
may be only *one* instruction on a megaAVR containing a MULT instruction, but may be darn-near-countless instructions on a tinyAVR.
Read that again... On both architectures, using <<3 may result in exactly *three* instructions, whereas on one architecture (e.g. megaAVR), *8 may result in *one* instruction, whereas on another (e.g. tinyAVR) it may result in loading two registers, jumping to a function, and a return. AND, doing-so not only requires the instructions to *call* that function, but also the function itself, which may be *numerous* instructions...
---------
OTOH, again... Say you're using a TinyAVR, where a MULT instruction isn't already part of the architecture's instruction-set. If you're using other libraries which use the mult8() function, (e.g. by using a*b), mult8() *will* be included, regardless of whether you figure out other means using e.g. << throughout your own code.
There comes a point where even using << may result in *more* instructions than a call to the mult8() function which has already been included by other libraries.
(e.g. <<7 might be seven instructions, but if the mult8() function has already been included, then you only need to load two registers, and jump/call, which is only something like 3 instructions...)
There are lots of caveats, here... It will definitely take *longer* to execute mult8(), but it will take *fewer* (additional) instructions, in the program-memory to call it. Again, that is, assuming mult8() is compiled into your project, via another call from elsewhere.
-----------------------------------------------------------------------------------------------------------------------
TODO: This needs revision. Thank you @Radomir Dopieralski, for bringing it to my attention, in the comments below! As he pointed-out, the level of "micro-optimization" explained in this document can actually bite you in the butt if you're not careful. Optimizers generally know the most-efficient way to handle these sorts of things for the specific architecture, and often find ways that are way more efficient than we might think.
E.G. as explained earlier, (x*64) can be rewritten as (x<<6).
If your microcontroller has a MULT instruction, (x*64) may, in fact, require the fewest number of instructions.
If your microcontroller *doesn't* have MULT, then the optimizer (or you) might choose to replace it with (x<<6), which might result in six left-shift instructions. (or possibly a loop with one left-shift and a counter).
But there are many other cases us mere-mortals may never think of. E.G. some microcontrollers have a "nibble-swap" instruction, where, the high nibble and low-nibble quite literally swap places. So, the optimizer *might* see (x<<6) and instead replace it with, essentially, (nibbleSwap(x & 0x0f) << 2). That's four instructions, rather than six.
And then, as described earlier, there's the case where _mult8() is already in your code, and the optimizer (-Os for *size* not speed) might recognize that it only takes three instructions to call _mult8().
TODO: The point, which I completely forgot in writing this late "last night", wasn't to encourage you to replace your multiplications (e.g. x*64) with shift-operations (x<<6), but to be aware that code *can* be hand-tuned/optimized, when considered *carefully* (this takes a lot of experimentation, too!) and the results may not be ideal for all platforms/architectures or even for all devices in the same architecture! And, further, doing-so *may* bite you in the butt if done from the start... (e.g. you design around *not* using _mult8(), but then later down the road realize you *have to* for something else, now your code-size increases dramatically *and* your "micro-optimizations" are slightly less efficient than merely calling _mult8())
-------
E.G. consider (x*65)...Do you *need* that much precision? If not, you might be able to get away with thinking about how your architecture will handle the operation... If your architecture has a MULT instruction, then you probably don't need to worry about it, but if it *doesn't* x*65 may very well result in *quite a few operations* that you don't need... If x*64 is close-enough, then using that *might* be *significantly* smaller in code-size and execution-time.
Note that this is a bit *in-depth* in that if somewhere else in your code (or libraries you've used) a similar operation is performed, then your compiled code will have a function like _mult8(a,b) linked-in... Calling that may only result in 3 additional instructions ( load registers a and b, call _mult8() ) whereas, again, remember that (x<<6) might result in *six* instructions. BUT: If you *know* that _mult8() is *not* used anywhere else, and you *know* that you don't absolutely need it, then you'll save *dozens* of instructions by making sure it's *never* used (and therefore not linked-in).
Think of this like the floating-point libraries... If you use floating-point, your code-size will likely grow by SEVERAL KB. If you throw in usage of things like sin() or whatnot, that'll add significantly more. But if you *don't* use them, then they won't be added to your code-size. (This is similar to what happens with using global-variables which are initialized vs. those which aren't, described in a previous log.) These aren't functions that *you've* written; they're essentially libraries that are automatically added whenever deemed-necessary.
Oy this is way too in-depth.
And, really, it requires quite a bit of experimentation.
TODO: A note on optimizers... -Os will most-likely consider other options such as the nibble-swap example given earlier, but some other optimization-levels will take your code word-for-word. Think you can outsmart it? :)
-------------------
Realistically, these techniques may only be useful if you've got complete control over all your code, and they're *considered* along-the-way, but only implemented *at the end* to squeeze out a few extra bytes...
-
AVR project doing nada = 58Bytes, and some experiments/results.
11/26/2016 at 04:46 • 0 comments
First, note: I'm using avr-gcc, directly, rather than going through e.g. WinAVR, or Arduino...
And be sure to check that previous log! I am *not* using stdio, as that's *huge*, but it takes some effort to make certain it's not included.
--------------
Here I've created an AVR project with nothing but the following, code-wise...
```c
#include <avr/io.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
   while(1)
   {}
}
```
This project compiles with the following specs, output by 'avr-size'
```
_BUILD/minStartingPoint.elf :
section    size      addr
.text        58         0
.data         0   8388704
.stab      1200         0
.stabstr   2993         0
.comment     17         0
Total      4268
```
As I understand the contest's requirements, this qualifies as 58 Bytes toward our 1kB limit.
-------
Now, what happens when we add a global-variable?
```c
#include <avr/io.h>
#include <stdint.h>
#include <inttypes.h>

uint8_t globalVar; // = 0;

int main(void)
{
   while(1)
   {}
}
```
Now we get:
```
_BUILD/minStartingPoint.elf :
section    size      addr
.text        74         0
.data         0   8388704
.bss          1   8388704
.stab      1212         0
.stabstr   3010         0
.comment     17         0
Total      4314
```
Toward the contest-requirements, I believe this qualifies as 74 Bytes toward our 1kB limit.
Note that I did not initialize the global variable... If I'd've initialized it to 0, we'd have *exactly* the same results. (Uninitialized global/static variables are always initialized to 0, per the C standard. THIS DIFFERS from *non-global*/*non-static* local-variables, which are *not* presumed to be 0 by default.)
-----------
But what happens when we initialize it to some other value?
```c
#include <avr/io.h>
#include <stdint.h>
#include <inttypes.h>

uint8_t globalVar = 0x5a;

int main(void)
{
   while(1)
   {}
}
```
```
_BUILD/minStartingPoint.elf :
section    size      addr
.text        80         0
.data         2   8388704
.stab      1212         0
.stabstr   3010         0
.comment     17         0
Total      4321
```
NOW, note... our ".data" section has increased from 0 to 2 (and our .bss section has dropped from 1 to 0).
As I understand, the ".data" section counts toward both your RAM and ROM/Flash usage.
Why both? Because the global-variable is *initialized* to the value 0x5a. The variable itself sits in RAM, but flash-memory is necessary to store the initial-value so it can be written to the RAM at boot.
As I understand, there's a bit of code hidden from us that essentially iterates through a lookup-table writing these initial-values to sequential RAM locations, which will then become your memory-locations for your global/static variables.
Note, again, this didn't happen when the global-variable was uninitialized (or initialized to 0) because there's no need for a lookup-table to store a bunch of "0"s, sequentially. Instead, there's a separate piece of hidden-from-us code that loads '0' to each sequential RAM location used by "uninitialized" globals/statics.
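As a rough host-side sketch of what those two hidden routines do (the real AVR versions, __do_copy_data and __do_clear_bss, are hand-written assembly that reads linker-provided symbols and uses LPM to read flash; the array names and sizes below are stand-ins for illustration only):

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-ins for the linker-provided regions.  On a real AVR the
 * init-table lives in flash; the other two arrays represent RAM. */
const uint8_t flash_init_table[] = { 0x5a };  /* initial values for .data */
uint8_t ram_data[1];                          /* .data variables, in "RAM" */
uint8_t ram_bss[3];                           /* .bss variables, in "RAM"  */

/* Roughly what __do_copy_data does: copy initial values, flash -> RAM. */
void do_copy_data(void)
{
   for (size_t i = 0; i < sizeof(ram_data); i++)
      ram_data[i] = flash_init_table[i];
}

/* Roughly what __do_clear_bss does: zero the "uninitialized" globals. */
void do_clear_bss(void)
{
   for (size_t i = 0; i < sizeof(ram_bss); i++)
      ram_bss[i] = 0;
}
```

The zeroing loop needs no table at all, which is why .bss costs RAM but no flash.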
SO...
As I understand, per the contest-requirements, the above example counts as 80+2 = 82 Bytes toward the 1kB limit.
-------
I'm just guessing, here, but I imagine it went to *2* rather than *1* because they indicate the end of the initialization/"lookup-table" with a "null"-character = 0... So, most-likely, if you add a second initialized global-variable the .data section will be 3 Bytes.
Let's Check:
#include <avr/io.h>
#include <stdint.h>
#include <inttypes.h>

uint8_t globalVar  = 0x5a;
uint8_t globalVar2 = 0xa5;

int main(void)
{
   while(1) {}
}
Well, color-me-stupid...
section     size      addr
.text         80         0
.data          2   8388704
.stab       1224         0
.stabstr    3028         0
.comment      17         0
Total       4351
.... and three?
#include <avr/io.h>
#include <stdint.h>
#include <inttypes.h>

uint8_t globalVar  = 0x5a;
uint8_t globalVar2 = 0xa5;
uint8_t globalVar3 = 0xef;

int main(void)
{
   while(1) {}
}
section     size      addr
.text         80         0
.data          4   8388704
.stab       1236         0
.stabstr    3046         0
.comment      17         0
Total       4383
Uh Huh...! So, maybe the init-routine handles 16-bit words at a time... might make sense, since 'int' is 16-bits.
Anyways, that's probably irrelevant.
But, do note that the ".text" section hasn't grown at all.
--------
So, again, this last example would most-likely count toward 84 Bytes of the 1kB limit.
......
The key, here, is that when you "Flash" your chip, it will flash ".text" + ".data" bytes to the flash/program memory...
So, regardless of this contest, the end-result is that even if your .text section is less than your flash-memory space (say 8190 bytes), your project still might not "fit" in your flash-memory.
I remember this being *quite confusing* when I first ran into it... so maybe this'll help save you some trouble.
.......
As far as the other sections... the ones listed here by avr-size don't really count; they contain things like debugging information that's stored in your compiled binary-file, but never written to flash.
Oh, and if I wasn't clear, ".bss," it seems, tells you the amount of RAM used by *uninitialized* global/static variables... which doesn't count toward the amount of program-memory used.
.....
Side-Note: When working with projects with limited memory, it's probably wise to *not* use many large local variables... E.G. say you've a string...
void printHello(void)
{
   char string[80] = "Hello";
   char *charPtr = &(string[0]);

   while(*charPtr != '\0')
   {
      uart_putChar(*charPtr);
      charPtr++;
   }
}
If you have several such functions, it might make more sense to have *one* *global* string array which can be (cautiously) reused between these functions... Otherwise, your stack can fill up quite quickly, and stack-overflows are *really* confusing when they occur.
Similarly, it's wise not to use malloc(), etc.
And another plus-side of doing-so is that it shows up in your ".bss" section, so you have an idea of how much memory you're using, and how much stack is available.
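A sketch of that shared-buffer approach (uart_putChar isn't shown here, and the "build" function names are made up for this example; "cautiously" means no two users of the buffer may ever be active at once -- including via interrupts or nested calls):

```c
#include <string.h>

/* ONE shared scratch buffer, reused (cautiously!) by several functions.
 * As a global it lands in .bss, so avr-size shows its RAM cost up front,
 * instead of each function silently eating 80 bytes of stack at run-time. */
char sharedString[80];

/* Each routine builds its message in the shared buffer rather than
 * declaring its own 80-byte local array. */
const char *buildHello(void)
{
   strcpy(sharedString, "Hello");
   return sharedString;
}

const char *buildGoodbye(void)
{
   strcpy(sharedString, "Goodbye");
   return sharedString;
}
```

Note the trade-off: the RAM is permanently claimed, but the worst-case stack depth is much easier to reason about.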
------
Here's another interesting aside... I just noticed when rereading this:
Did you notice that the ".text" section increased by only 6 bytes when we changed our uninitialized global variable to an initialized one? Seems fishy... I highly doubt they can fit looping through a lookup-table in only six bytes' worth of instructions...
I wonder if they only include each of the two different initialization-routines *when needed*...
#include <avr/io.h>
#include <stdint.h>
#include <inttypes.h>

uint8_t globalVar  = 0x5a;
uint8_t globalVar2 = 0xa5;
uint8_t globalVar3 = 0xef;
uint8_t uninitializedGlobalVar;

int main(void)
{
   while(1) {}
}
section     size      addr
.text         96         0
.data          4   8388704
.bss           1   8388708
.stab       1248         0
.stabstr    3076         0
.comment      17         0
Total       4442
Ah hah! The only change was the addition of another "uninitialized" == (initialized-to-zero) global-variable, and now the ".text" size has jumped from 80 Bytes to 96 Bytes.
So it would seem that the "zeroing" routine for "uninitialized" global/static variables occupies 16 Bytes, and the "lookup-table"-initialization routine occupies 22 Bytes.
Hey, wanna save a few bytes? Can you get away with converting all your global/static variables to either initialized or "uninitialized"? Might be something in there...
----------
Note: The above tests were performed on an ATmega8515...
The last-experiment shown was since run on an ATtiny861...
If anyone wonders about the differences between different systems of the same architecture, take a look here. Here's the result of the above test on a different AVR processor:
section     size      addr
.text        100         0
.data          4   8388704
.bss           1   8388708
.stab       1248         0
.stabstr    3093         0
.comment      17         0
Total       4463
Check that... the Tiny861 requires 4 more Bytes of code-space to do the exact same thing.
Is there a lesson, here? Nah... just remember that the instruction-set may have something to do with code-space-usage. (And, maybe, if you've designed something on a Tiny AVR that *just* exceeds 1024 Bytes, you might be able to recompile it for a Mega AVR and save a few bytes...)