• ### Two Additions In One Operation CTD -- Two Variables one register

A continuation of the last log:

Say you've got two 4bit variables:

One stores a state, the other stores a signed count from -7 to +7.

{Thumbtyping is hard...}


```
uint4_t state;
int4_t count;

switch(state)
{
   case A:
      count++;
      break;
   case B:
      count--;
      ...
}
```

Of course, int4_t is rare, if existent.

But merging them into one 8bit byte in register/RAM would mean a lot of boolean logic and shifts, right?

Maybe not!

In fact, on an AVR [ATmega8515=old], it may be *fewer* operations than using separate 8bit variables, at least in some cases:

```
//state in low nibble,
// SIGNED count in high nibble
uint8_t stateAndCount;

switch(stateAndCount & 0x0f)
{
   case A:
      stateAndCount+=0x10;
      break;
   case B:
      stateAndCount-=0x10;
      ...
}
```

My big concern was incrementing and extracting the signed count.

[LOL, I'd forgotten that, in the situation I'm considering replacing with this, I was already masking the state-only variable with 0x0f in the switch statement... this thing is just falling together!]

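Incrementing is a typical add-immediate, which I think is the same number of cycles as increment. [The AVR doesn't actually have an add-immediate instruction, so the compiler uses subi with the negated constant (subtracting -0x10). That's still a single 16bit instruction, same as inc/dec, with the 8bit immediate value encoded inside it. Most AVR instructions are 16bits; only a few, like lds/sts, take 32.]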

That leaves extracting/using the signed count

```
int32_t TotalCount = 0;

TotalCount += ((int8_t)stateAndCount)>>4;
```

A Few concerns:

shift-right for signed integers, in C... Is it guaranteed 'arithmetic' rather than 'logical'?

[And what about negatives and rounding toward negative infinity?]

Does the AVR have a single-instruction arithmetic-shift-right?
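
On those concerns: in C, right-shifting a negative signed value is implementation-defined, though gcc documents it as an arithmetic (sign-extending) shift. The AVR does have `asr`, but it shifts one bit per instruction. Rounding toward negative infinity is harmless here, since the low nibble only ever holds a non-negative state. Here's a minimal sketch of both a shift-based extraction and one that avoids the implementation-defined shift; the helper names (`packed_count`, `packed_count_portable`, `packed_state`) are made up for illustration:

```c
#include <stdint.h>

/* state in the low nibble, SIGNED count (-8..+7) in the high nibble */

/* Relies on >> being an arithmetic shift for negative values, and on the
   uint8_t -> int8_t cast wrapping (both implementation-defined in C;
   gcc does both the way we want). */
static inline int8_t packed_count(uint8_t stateAndCount)
{
    return (int8_t)(((int8_t)stateAndCount) >> 4);
}

/* Alternative: take the nibble unsigned, then sign-extend by hand. */
static inline int8_t packed_count_portable(uint8_t stateAndCount)
{
    uint8_t nibble = stateAndCount >> 4;      /* 0..15 */
    return (int8_t)((nibble ^ 0x08) - 0x08);  /* maps 8..15 to -8..-1 */
}

static inline uint8_t packed_state(uint8_t stateAndCount)
{
    return stateAndCount & 0x0f;
}
```

So starting from state 5 and count 0 (0x05), one `stateAndCount -= 0x10` gives 0xF5, which both helpers read back as a count of -1 with the state still 5.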

...

So. Thumbtyping is exhausting, I'll leave it to you to look up details, for now. But the short answer looks to be that there may even be cases where this takes *Fewer* instructions this way, than to have to load/store the two 4bit (8bit-stored) state/count variables in RAM.

...

I wonder what-all we can do that we aren't with 64 bits?!

• ### Two Additions in one operation!

Maybe everyone knows this already, but I just thought of it for my first time...

Say you've got a 64bit processor, but you're working with two 32bit numbers... if they're stored in a single 64bit variable, and you're sure they won't overflow, you can do two additions simultaneously! Or 4 16bit additions, or 8 8bit!

So, what could this be used for? How often would that really be useful?

I dunno... a screen is significantly smaller than 65536x65536 pixels...
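
A minimal sketch of the two-32bit-numbers-in-one-64bit-variable idea, assuming plain unsigned values. The helper names (`pack2x32`, `lane_hi`, `lane_lo`, `add2x32`) are invented for this example:

```c
#include <stdint.h>

/* Pack two 32-bit lanes into one 64-bit word:
   hi in bits 63..32, lo in bits 31..0. */
static inline uint64_t pack2x32(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

static inline uint32_t lane_hi(uint64_t packed) { return (uint32_t)(packed >> 32); }
static inline uint32_t lane_lo(uint64_t packed) { return (uint32_t)packed; }

/* One 64-bit addition adds both lanes at once. Only valid while the low
   lane does not overflow: a carry out of bit 31 would leak into the
   high lane. */
static inline uint64_t add2x32(uint64_t a, uint64_t b)
{
    return a + b;
}
```

E.g. adding the pairs (1000, 20) and (234, 22) yields 1234 in the high lane and 42 in the low lane, from a single addition.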

....

Presently I'm using an 8bitter, I have a function that I would like to return two TRUE/FALSE values. I want to keep running sums of those two values from each time it's called.

```
#include <stdio.h>
#include <stdint.h>

//Returns:
//0x01 if button A pressed
//0x10 if button B pressed
//0x11 if both are pressed
uint8_t getButtons(void);

int main(void)
{
   uint8_t countsCombined=0;

   //At most 15 calls, so neither 4bit count can overflow
   for(uint8_t i=0; i<15; i++)
      countsCombined+=getButtons();

   printf("A presses = %d\n"
          "B presses = %d\n",
          countsCombined&0x0f,
          countsCombined>>4);
}
```

But, holy moly, this seems a little cheesy on an 8bitter, but just think what could be done on a 16bitter, or 64bitter!

Maybe you're designing PONG on a 16bit computer in a 256x256 window, the ball moves two pixels up, one left:

```
//Upper byte is X, lower byte is Y

uint16_t ballPosition = 0x0000;

uint16_t ballStep = 0x0102;

while(wallNotHit)
{
ballPosition += ballStep;
}
```
• ### Tiny Circular Buffer - back to linear

I need to add elements to the end of a buffer, and remove elements from the beginning... It can have from 0 to 4 elements loaded at a time.

The de-facto answer might be a circular-buffer.

But this buffer is only 4 elements long...

It would seem that implementing this as a simple array would be more efficient, in my case. Yes, it means that when I remove an element from the beginning, I have to shift the remaining data to the beginning...

And, the de-facto answer might be a for() loop...

But, again, there's only four elements. So, unroll that loop, as well.

(Note that the optimizer can look at short fixed-count for() loops, and automatically "unroll" them... I'm pretty much certain that this case will be smaller unrolled, and I'm not entirely convinced my optimizer-settings will do-so, so I'll type it manually.)

-----------

Interestingly, doing this as a simple array, rather than a circular-buffer, also dramatically decreased the amount of code in nearly every other function, e.g. buffer_add(), buffer_countElements(), buffer_isFull(), buffer_clear()... In fact, many of these functions are now merely comparisons/assignments to a variable such as buffer_itemCount. So, now, whereas I had multi-line functions that *could*'ve been inlined to reduce code-space in a few cases, now it's *definitely* more code-space (and execution-time!) efficient to inline these functions in every case.
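
Roughly what that looks like, as a sketch rather than my actual code: the element type (`uint8_t`), `BUFFER_SIZE`, `buffer_data`, `buffer_isEmpty()` and `buffer_remove()` are assumptions made up for this example; the other names are the ones mentioned above.

```c
#include <stdint.h>

#define BUFFER_SIZE 4

static uint8_t buffer_data[BUFFER_SIZE];
static uint8_t buffer_itemCount = 0;

/* Most of the API collapses into a comparison or an assignment. */
static inline uint8_t buffer_countElements(void) { return buffer_itemCount; }
static inline uint8_t buffer_isFull(void)  { return buffer_itemCount == BUFFER_SIZE; }
static inline uint8_t buffer_isEmpty(void) { return buffer_itemCount == 0; }
static inline void    buffer_clear(void)   { buffer_itemCount = 0; }

/* Append to the end; the caller checks buffer_isFull() first. */
static inline void buffer_add(uint8_t item)
{
    buffer_data[buffer_itemCount++] = item;
}

/* Remove from the beginning; the caller checks buffer_isEmpty() first.
   The shift-down is unrolled by hand, since there are only four slots. */
static inline uint8_t buffer_remove(void)
{
    uint8_t first = buffer_data[0];
    buffer_data[0] = buffer_data[1];
    buffer_data[1] = buffer_data[2];
    buffer_data[2] = buffer_data[3];
    buffer_itemCount--;
    return first;
}
```

There are no head/tail indices and no wrap-around math anywhere, which is where the savings in the other functions come from.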

• ### pointer idea...

I have to add several values from several pointers...

e.g.

```
uint16_t *pA = NULL;
uint16_t *pB = NULL;
uint16_t *pC = NULL;

/* < a bunch of code that assigns pA, pB, and/or pC > */

uint16_t value = *pA + *pB + *pC;
```

BUT any and/or all of these pointers may be unassigned... in which case, they should not be considered in the sum for value.

Of course, using NULL pointers makes sense, to indicate whether they've been assigned. But, as I understand the C standard, you're not supposed to *access* a NULL address... You're only supposed to *test* whether a pointer is NULL.

(e.g. address-zero may be a reset-vector, which probably contains a "jump" instruction, which most-likely is NOT zero, in value).

So, again, if I understand correctly, the "right" way to handle these potential NULL pointers would be something like:

```
uint16_t *pA = NULL;
uint16_t *pB = NULL;
uint16_t *pC = NULL;

/* < a bunch of code that assigns pA, pB, and/or pC > */

uint16_t value = 0;
if(pA)
   value = *pA;
if(pB)
   value += *pB;
if(pC)
   value += *pC;
```
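That's a lot of tests! Surely they'll add up in code-space...

So here's the idea: instead of NULL, point every not-yet-assigned pointer at a dummy variable that holds zero. Then every pointer is always safe to dereference, and an unassigned one just adds 0 to the sum: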

```
uint16_t zeroWord = 0;
uint16_t *pA = &zeroWord;
uint16_t *pB = &zeroWord;
uint16_t *pC = &zeroWord;

/* < a bunch of code that assigns pA, pB, and/or pC > */

uint16_t value = *pA + *pB + *pC;
```

• ### here's an idea... parsing

This is just a random-realization while working on my project... maybe it's obvious to everyone in-the-know.

Say you're parsing something, like commands from a terminal-window...

Say you've got a whole bunch of commands, but they mostly all follow the same handful of formats, like Ax and Ay, Bx and By, etc.

Lemme think of an example...

Say motor commands (terminated with '\n'):

SMn = stop motor number n (where n is one character, 0-9)

AMnx = advance motor n by x millimeters (where x is any number)

FMnx = move motor n forwards at x mm/sec

and so-forth.

One could, obviously, parse each character as it comes through, then do a whole bunch of nested if-then statements.

Another idea is to combine the first and second characters into a single 16-bit variable, then use a switch() statement. Maybe not ideal for *this* example, but I've found it useful at times.

So, that's one consideration, here's another:

Say this motor-system also has LED-commands:

BLnx = blink LED n x times per second

Obviously the n and the x, here, don't apply to a motor...

So, then, the whole nested-if statement thing makes sense, again, right?

But we're worried about *size* here, not speed... (baud-rate's way slower than your processor, right?)

So, then, maybe it makes sense to parse 'n' and 'x' and store them in argument-variables, and only *after that* handle the actual Command characters.

"But wait! 'SMn' doesn't have an x!"... Right... but here's the idea:

Say everything's stored in a string-buffer... and whatever sits in that buffer after the '\n' may very well be leftover data from a previous command... But, your numeric-parser for x terminates as soon as anything non-numeric comes through (the '\n' itself, or, if you replace the '\n' with '\0', the terminator)... So for 'SMn' the variable for argument x will be filled with 0, but even that doesn't matter, because it's not being used, in this case...

Then why parse it if it's not even part of the command, and isn't even there in the first place?

So, here it is without...

```
uint16_t command = string[0] | (string[1])<<8;

uint8_t deviceNum = string[2] - '0';
char *value = &(string[3]);

#define commandify(a,b) \
   ((uint16_t)(a) | ((uint16_t)(b))<<8)

switch(command)
{
   case commandify('S','M'):
      motor_stop(deviceNum);
      break;
   case commandify('A','M'):
      mm = parseNumber(value);
      break;
   ...
   case commandify('B','L'):
      rate = parseNumber(value);
      break;
   ...
}
```

So, now, for the example described, you've either got to call 'parseNumber()' three times for the three commands that use it, or explicitly handle the 'S' case separately from the switch, (makes sense, unless there are *several* such cases, in which case your test becomes quite large, maybe even a second switch-statement).

Or, just parse the number from the start, and don't use it if you don't need to.
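
For that to work, parseNumber() has to be happy with "no number at all." A minimal sketch of such a parser follows; this is an assumption about how parseNumber() might look (unsigned integers only, no sign, no overflow check), not the one from my project:

```c
#include <stdint.h>

/* Accumulates decimal digits and stops at the first non-digit
   ('\n', '\0', or anything else). Returns 0 if there are no digits,
   which is exactly the "don't care" value for commands without an x. */
static uint16_t parseNumber(const char *s)
{
    uint16_t value = 0;

    while (*s >= '0' && *s <= '9') {
        value = (uint16_t)(value * 10 + (uint16_t)(*s - '0'));
        s++;
    }
    return value;
}
```

So "125\n" parses as 125, and the leftover "\n" after "SMn" parses as 0 without any special-casing.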

And, let's make it even more interesting, what if there's another command:

PSs = Print string s

Could *still* call parseNumber, *and* fill deviceNum (both with garbage) and have a really simple switch-statement (maybe even a lookup-table, at this point):

[Note 2023: WHOOPS! This is glitchy!]

```
uint16_t command = string[0] | (string[1])<<8;

uint8_t deviceNum = string[2] - '0';

#define commandify(a,b) \
   ((uint16_t)(a) | ((uint16_t)(b))<<8)

//### No way you're gonna get away with floats
// in a 1K project, without an FPU ;)
//x starts at string[3], after the device-number
float value = parseNumber(&(string[3]));

switch(command)
{
   case commandify('S','M'):
      motor_stop(deviceNum);
      break;
   case commandify('A','M'):
      break;
   ...
   case commandify('B','L'):
      break;
   ...
   case commandify('P','S'):
      printf("%s", &(string[2]));
      break;
   ...
}
```
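
As for the lookup-table idea, here's a rough sketch of what it might look like for the commands that share the "n, then x" shape. Everything here other than `commandify` is hypothetical (`commandEntry`, `commandTable`, `dispatch`, `parseDigits`, and the stub `handle_XX` functions, which only record what they were called with). It also uses an integer value rather than a float, and doesn't cover 'PS', which wants the raw string:

```c
#include <stdint.h>
#include <stddef.h>

#define commandify(a,b) ((uint16_t)(a) | ((uint16_t)(b) << 8))

/* Every handler gets both arguments, whether it needs them or not. */
typedef void (*commandHandler)(uint8_t deviceNum, uint16_t value);

typedef struct {
    uint16_t       command;
    commandHandler handler;
} commandEntry;

/* Stub handlers: they only record the call, for illustration. */
static char     lastCommand;
static uint8_t  lastDevice;
static uint16_t lastValue;

static void handle_SM(uint8_t n, uint16_t x) { lastCommand = 'S'; lastDevice = n; lastValue = x; }
static void handle_AM(uint8_t n, uint16_t x) { lastCommand = 'A'; lastDevice = n; lastValue = x; }
static void handle_BL(uint8_t n, uint16_t x) { lastCommand = 'B'; lastDevice = n; lastValue = x; }

static const commandEntry commandTable[] = {
    { commandify('S','M'), handle_SM },
    { commandify('A','M'), handle_AM },
    { commandify('B','L'), handle_BL },
};

/* Stops at the first non-digit, so a missing x simply parses as 0. */
static uint16_t parseDigits(const char *s)
{
    uint16_t value = 0;
    while (*s >= '0' && *s <= '9')
        value = (uint16_t)(value * 10 + (uint16_t)(*s++ - '0'));
    return value;
}

/* Assumes string holds at least 4 bytes (e.g. a fixed-size receive
   buffer). Returns 1 if a handler was found, 0 otherwise. */
static uint8_t dispatch(const char *string)
{
    uint16_t command   = (uint16_t)commandify(string[0], string[1]);
    uint8_t  deviceNum = (uint8_t)(string[2] - '0');
    uint16_t value     = parseDigits(&string[3]);

    for (size_t i = 0; i < sizeof(commandTable) / sizeof(commandTable[0]); i++) {
        if (commandTable[i].command == command) {
            commandTable[i].handler(deviceNum, value);
            return 1;
        }
    }
    return 0;
}
```

Whether the table beats the switch() in code-size likely depends on the number of commands and on what the compiler does with the switch, so it's worth checking both with avr-size.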
• ### code-size helpers (in the form of a makefile)

Here's a minimal makefile for tracking your code-size, etc...

(This doesn't yet create the hex-file for flashing!)

```
default: build lss size

#Compile, optimize for size
build:
	avr-gcc -mmcu=atmega8515 -Os -o main.out main.c

#Create an assembly-listing (with C code alongside)
#Check out main.lss!
lss:
	avr-objdump --disassemble-zeroes -h -S main.out > main.lss

#Output the sizes of the various sections
# written to flash = .text + .data
size:
	avr-size main.out

clean:
	rm -f main.out main.lss
```

If working with a microcontroller, your system may already be set up for the "-Os" optimization-level, so the information here might not save you any program-memory...

-----------------------

AS I UNDERSTAND (I am by no means an expert on any of this!):

-Os is "optimize for size"

-Os basically does as much computation (of your code) as possible during the compilation-process, and tries to look for the most code-size-efficient means to compile it, rather than leaving a bunch of that code up to your processor to handle in real-time. (Per the gcc docs, -Os turns on the -O2 optimizations except those that tend to increase code-size.)

Contrast that with "-O0" (no optimization), where the code will be compiled almost exactly as you wrote it.

E.G. a really simple example:

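With -O0 (default): "a = 1 + 2;" might very well write the value 1 to the register containing the variable a, then add 2 to it. At least two instructions to be executed in realtime on your processor. (In practice gcc tends to fold a literal 1 + 2 into 3 even at -O0; the difference really shows once variables, pointers, and macros get involved, as in the example further down.)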

With -Os "a = 1 + 2;" most-likely will result in one instruction, writing the value 3 to the register containing variable a.

Other optimization-levels (-O1, -O2...) aren't discussed here, but check out the comments at the bottom of the page, from @Karl S, and note that they might in fact result in *larger* code than with no optimization, as they might optimize for *speed*.

-------

The key is, the optimization-level may have a lot to do with the size of your compiled-project... And it's not just a matter of "levels", but different types entirely

(e.g. some optimization-"levels" may prefer execution-speed over *size*, etc. In gcc there are also "-f<options>" which allow you to fine-tune your optimizer's preferences, and there are also pragmas(?) to choose specific optimization-levels for specific parts of your code... These are a bit beyond me...)

So you might want to do some reading-up on the matter, and/or experiment!

-------

Here's a [multitude of] "wow"-moments, I've experienced with the matter, but first some overview:

I've a macro that turns a pin on a port into an output called "setoutPORT()".

(This is for an AVR...)

```
#define setoutPORT(pinNum, PORTx)   \
   setbit2(DDR_FROM_PORT(PORTx), pinNum)
#define DDRPORTOFFSET   1
#define DDR_FROM_PORT(PORTx) \
   ((_MMIO_BYTE(&(PORTx) - DDRPORTOFFSET)))
#define setbit2(variable, bitNum) \
   (variable = ((variable) | (1 << (bitNum))))
```

(The point is to use one definition for the port-name to use with all pin-related macros, regardless of which register they actually need to access)

Here's a *really* simple program using it:

```
#define LED_PIN  1
#define LED_PORT PORTB

int main(void)
{
//set PB1 as an output
setoutPORT(LED_PIN, LED_PORT);
while(1)
{}
}
```

And here's how "main" compiles with my default optimization-level (-Os):

```
int main(void)
{
setoutPORT(1, PORTB);
38: b9 9a          sbi   0x17, 1  ; 23
3a: ff cf          rjmp  .-2         ; 0x3a
0000003c <_exit>:
3c: f8 94          cli
0000003e <__stop_program>:
3e: ff cf          rjmp  .-2         ; 0x3e <__stop_program>

```

rjmp .-2 is the while(1) loop, it jumps back *to itself* (Sometimes, with optimization, the disassembly-output doesn't show all the original source-code, in this case it forgot the while(1){})

Wow-Moment Number Zero:

I've been using this method (setoutPORT and all its dependencies) for *years* with AVRs and have known it to (and relied on it to) compile to a single sbi instruction...

But y'all likely haven't seen it yet, so take a moment to look at all the math involved in the setoutPORT macro... That's a *lot* of math, including pointer-arithmetic.

I guess I was mistaken, because I always thought the Preprocessor was responsible for handling constant math, like that. Or, at least, that the compiler looked for constant-math inherently as an early-stage in the compilation-process (I guess the preprocessor wouldn't know much about pointer-arithmetic).

I figured the -Os part of the optimizer was only handling the conversion of

```
(variable = ((variable) | (1 << (bitNum))))
```

into an sbi, which is pretty impressive in-and-of itself.

Today's Wow-Moment:

Here's the output with no optimization (-O0):

```
int main(void)
{
38: cf 93          push  r28
3a: df 93          push  r29
3c: cd b7          in r28, 0x3d   ; 61
3e: de b7          in r29, 0x3e   ; 62
setoutPORT(1, PORTB);
40: 87 e3          ldi   r24, 0x37   ; 55
42: 90 e0          ldi   r25, 0x00   ; 0
44: 27 e3          ldi   r18, 0x37   ; 55
46: 30 e0          ldi   r19, 0x00   ; 0
48: f9 01          movw  r30, r18
4a: 20 81          ld r18, Z
4c: 22 60          ori   r18, 0x02   ; 2
4e: fc 01          movw  r30, r24
50: 20 83          st Z, r18

while(1)
{}
52: ff cf          rjmp  .-2         ; 0x52 <__SREG__+0x13>

00000054 <_exit>:
54: f8 94          cli

00000056 <__stop_program>:
56: ff cf          rjmp  .-2         ; 0x56 <__stop_program>
```
That's a bit much for me to parse, right now... but it would seem, for the most-part, that math is NOT being pre-calculated at compile-time. So my single sbi instruction has gone up to 13 instructions(?!). And, I think, some of those instructions are *two-cycle* instructions!

Furthermore, note that a lot of that math involves pointers and pointer-arithmetic... And, again, *all* of that is boiled down to *constants*.

(Interestingly, it looks like the (1<<1) must've been handled outside the optimizer, as the ori is fed a constant 2).

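There's some weirdness in there, though... What's with r28? Looks like it's never used...? or does r28/29 make up "Z"? (For reference: r28/r29 make up the "Y" pointer, which avr-gcc uses as its frame-pointer, and the two in instructions copy the stack-pointer into it; "Z" is r30/r31.) And why's it bother pushing 'em, when there's no return? I'll have to look up the ol' instruction-set before I can parse the rest.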

Regardless... That's a *LOT* of instructions... not only to be stored in program-memory, but also to be executed in real-time.

-----------------

Here's the same macro handled on PIC32...

I've only the free version of xc32-gcc, so -Os is not available. Its highest optimization-level is -O1.

(I'm not sure which of the available optimization options I used when I compiled this)

```
9d002ec8:   00801021    move  v0,a0
9d002ecc:   a3a20000    sb v0,0(sp)
setoutPORT(Tx0_pin, Tx0_PORT);
9d002ed0:   3c02bf88    lui   v0,0xbf88
9d002edc:   24030008    li v1,8
9d002ee0:   ac430000    sw v1,0(v0)
9d002ee4:   3c02bf88    lui   v0,0xbf88
9d002ef0:   24030008    li v1,8
9d002ef4:   ac430000    sw v1,0(v0)
```
That means every time I call setoutPORT it goes through *all that math*, and that goes for other things like "writePORT" and "setpinPORT", etc. Wow.

(Of course, these macros are written slightly differently than shown for the AVR, above, as the PIC32 registers are different, but the concept is identical).

So... I'd say "write your registers directly, if code-space is a concern!"

and

"Don't expect constant-math to always be optimized-out!"

-------

Though, there must be a way to handle this more-efficiently.

One method might be using preprocessor string-concatenation... E.G. PORT##B or DDR##B. I tried that *long* ago, but for some reason replaced it in favor of this. I think I wasn't sure how to have e.g. #define Tx0_PORT B, and have PORT##Tx0_PORT work out... but I think I understand that better, now. And I wrote about this method once before (I have no idea where) and someone (I'm sorry!) commented on using that method, as well... so it should work, right?
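
For what it's worth, here's a sketch of how the token-pasting version might go. The trick to making `#define Tx0_PORT B` work is a second macro level, so Tx0_PORT gets expanded to B *before* the ## happens. The macro names (`PASTE_RAW`, `PASTE`, `setoutLetter`, `setpinLetter`), the pin number, and the stand-in register variables are all made up for this example; on a real AVR, DDRB/PORTB would come from <avr/io.h> instead:

```c
#include <stdint.h>

/* Stand-ins for the real AVR registers so this sketch compiles anywhere. */
static volatile uint8_t PORTB, DDRB;

/* Two levels are needed: the outer macro expands its arguments first
   (Tx0_PORT -> B), then the inner one pastes the tokens
   (DDR ## B -> DDRB). */
#define PASTE_RAW(a, b)  a##b
#define PASTE(a, b)      PASTE_RAW(a, b)

#define Tx0_PORT  B      /* just the port letter */
#define Tx0_pin   3      /* arbitrary pin for the example */

#define setoutLetter(pinNum, letter) \
    (PASTE(DDR, letter) |= (uint8_t)(1 << (pinNum)))
#define setpinLetter(pinNum, letter) \
    (PASTE(PORT, letter) |= (uint8_t)(1 << (pinNum)))

static void tx0_init(void)
{
    setoutLetter(Tx0_pin, Tx0_PORT);   /* expands to DDRB  |= (1 << 3) */
    setpinLetter(Tx0_pin, Tx0_PORT);   /* expands to PORTB |= (1 << 3) */
}
```

Since the register name is resolved entirely by the preprocessor, there's no pointer-arithmetic left for the optimizer to fold. Whether the |= itself becomes a single instruction is still up to the compiler and its optimization-level.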

Alternatively... Maybe things as oft-used as pin-setting/clearing can be handled *explicitly* in inline-assembly, via macros... as, without -Os, even DDRB |= (1<<1) would most-likely result in several instructions, likely even an unnecessary read-modify-write.

and, maybe even

"If you can't use -Os, then you might want inline-assembly!"

(Thank you @Elliot Williams for your help getting my pic32 disassembler working to show the original source-code! That's what got me to this "Wow-Moment.")

-----------

A less-brief-than-I-intended note on inline-assembly:

It's *confusing as heck* if you want to use C-variables within your assembly-listing. And there's even an equally-confusing means to pass it constants... BUT: If you're working with constants, anyhow, you *can* put macro-names in it... It's been "a minute" since I've done-so... but it's not particularly difficult. E.G. the AVR setoutPORT call was:

`sbi   0x17, 1`

so, e.g. I think it'd be *SOMETHING LIKE*:

```
#define DDRB_STRING "0x17"
#define PIN1_STRING "1"

asm("sbi " DDRB_STRING ", " PIN1_STRING ";");

// or:

#define QUOTE_THIS(x) #x
#define QUOTE_THIS_VALUE(x) QUOTE_THIS(x)

#define setout(pinNum, ddrReg) \
   asm("sbi " QUOTE_THIS_VALUE(ddrReg) \
       ", " QUOTE_THIS_VALUE(pinNum) ";")

// called as:

setout(1, 0x17);

// or maybe even:

#define LED_PIN 1
#define LED_DDR_REG 0x17
setout(LED_PIN, LED_DDR_REG);

// But I don't think you can do: setout(1, DDRB)
// since DDRB is defined as _MMIO_BYTE(something)
// which doesn't boil down to a raw number,
// rather something involving pointers.
```

(weee!)

• ### Put some code-space in your "savings account"!

This may seem a bit ridiculous, but believe me, it's turned out useful *quite-often* when expecting a project might eventually reach code-space limitations...

Throw something "big" in your project that doesn't do anything important... At the very start of the development-process. E.G.:

`char wastedSpace[80] = { [0 ... 78] = 'a' };`

Hide it somewhere so you forget about it... Then when your project has gone from 512B to 960B, and suddenly in the next-revision it's gone from 960 to 1025... You'll go "oh sh**", then probably start looking at your code trying to figure out some ways to make it smaller... (maybe a good thing)... Then eventually you might step-back a bit frustrated and... eventually... remember that there's a sizable chunk you can take out with no consequences whatsoever, and continue your progress without having to change anything already-functional. Consider it a terrifying--and then relieving--warning.

This example works for both program-memory as well as RAM... But there are other ways to do similar, and may even be useful in the meantime. E.G. I usually throw in a fading "heartbeat" LED... That code can be removed, entirely, from my project by merely defining HEART_REMOVED, regaining a few hundred bytes on the spot, and rendering a project which would've stalled due to a few bytes to one which can [cautiously] continue development for quite some time thereafter.

(NOTE that SOME OPTIMIZERS might look at something like the above and recognize that it's never used, then "optimize it out". So, keep that in mind... Some other methods might be to e.g. throw something in PROGMEM. Might be a good idea to write an empty project, compile it, look at the code-size, then add your "savings account" and make sure that code-size increases as expected).

Another thing I regularly do that "uses up space" is to throw project-info into the Flash-memory in a *human-readable* format... That way I can, years down the line, read-back the flash-memory from a chip and determine things like which project it is, which version-number, and what date it was compiled on... That info is definitely useful later down the road, but not *essential*, so potentially hundreds of bytes can be removed by removing that information. (That information is automatically-generated into "projinfo.h" by my 'makefile', and projinfo.h is then #included in main.c... so to remove it, just comment-out that #include.)

• ### Multiplication/Division and Floating-point-removal

There may be a lot of cases where floating-point "just makes sense" in the intuitive-sense...

But Floating-Point math takes a *lot* of code-space on systems which don't have an FPU.

I'm not going to go into too much detail, here... but consider the following

```
float slope = 4.0/3.0;

int pixelY = slope * pixelX + B;
```

This can be *dramatically* simplified in both execution-time and code-size by using:

```
int slope_numerator = 4;
int slope_denominator = 3;

int pixelY = slope_numerator * pixelX / slope_denominator + B;
```

Note that the order matters!

Do the multiplication *first*, and only do the division at the *very end*.

(Otherwise, you'll lose precision!)

Note, also, that doing said-multiplication *first* might require you to bump up your integer-sizes...

You may know that pixelX and pixelY will fit in uint8_t's, but pixelX*slope_numerator may not.

I can never remember the integer-promotion rules, so I usually just cast as best I can recall:

```
uint8_t B = 0;
uint8_t pixelX = 191;
uint8_t pixelY = (uint16_t)slope_numerator * (uint16_t)pixelX
                 / slope_denominator + B;
```

Don't forget all the caveats, here... You're doing 16-bit math, probably throughout the entire operation, but the result is being written into an 8-bit variable... The case above results in pixelY = 254, but what if B was 2?
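
To make those caveats concrete, here's a small sketch that does the math in 16 bits and refuses to write a result that won't fit back into 8 bits. The function `scaleY()` and its return-convention are made up for this example:

```c
#include <stdint.h>

/* y = (num * x) / den + b, multiplication first, in 16-bit math.
   Returns 1 and writes *y if the result fits in a uint8_t,
   otherwise returns 0 and leaves *y alone. den must not be 0. */
static uint8_t scaleY(uint8_t x, uint8_t num, uint8_t den,
                      uint8_t b, uint8_t *y)
{
    /* worst case: 255 * 255 / 1 + 255 = 65280, still fits in 16 bits */
    uint16_t wide = (uint16_t)((uint16_t)num * x / den + b);

    if (wide > UINT8_MAX)
        return 0;

    *y = (uint8_t)wide;
    return 1;
}
```

With x = 191, 4/3 and B = 0 this gives 764 / 3 = 254. With B = 2 the true result is 256, which would wrap to 0 in a uint8_t, so the function reports failure instead. And the order really does matter: dividing first gives 191 / 3 * 4 = 63 * 4 = 252.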

------

Regardless of the casting, and the additional variable, this is almost guaranteed to be *much* faster and *much* smaller than using floating-point.

----------

I was planning on writing up about iterative-computations, next... but it's apparently already got a name and a decent write-up, so check out: https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm

(whew, so much energy saved by linking!)

THE GIST: The above examples (y=mx+b) should probably *NOT* be used for line-drawing!

They're just easy/recognizable math-examples for this writeup to show where floating-point can be removed.

On a system where you *have* to iterate through, anyhow (e.g. when drawing every pixel on a line, you have to iterate through every pixel on the line), then you can turn the complicated math (e.g. y=mx+b) containing multiplications/divisions into math which *only* contains addition/subtraction.

(Think of it this way, how do you calculate 50/5 way back in the day...? "How many times does 5 go into 50?" One way is to subtract 5 from 50, then 5 from 45, then 5 from 40, and so-on, and count the number of times until you reach 0. Whelp, computers are great at that sorta thing, and even better at it when you have to do looping for other things as well.)
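
Here's a sketch of that idea applied to the slope example: walking x one step at a time and keeping a running remainder, so y = num * x / den comes out using only additions, subtractions, and comparisons. This is in the spirit of Bresenham, not the algorithm from the link itself, and `lineYs()` is a made-up name:

```c
#include <stdint.h>

/* Fills yOut[x] with floor(num * x / den) for x = 0 .. count-1,
   without any multiplication or division. den must not be 0, and
   the results are assumed to fit in a uint8_t. */
static void lineYs(uint8_t num, uint8_t den, uint8_t count, uint8_t *yOut)
{
    uint8_t  y = 0;
    uint16_t remainder = 0;          /* running (num * x) mod den */

    for (uint8_t x = 0; x < count; x++) {
        yOut[x] = y;

        remainder += num;            /* advance one step in x */
        while (remainder >= den) {   /* "how many times does den go into it?" */
            remainder -= den;
            y++;
        }
    }
}
```

For a slope of 4/3, the first five y values come out as 0, 1, 2, 4, 5, the same as 4 * x / 3 with integer division.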

• ### Volatiles!

You may be familiar with "volatile" variables...

If not--and your project works, reliably-enough--then ignore this, because you've got 1kB to fit your code within in a short period of time, and you're not worried about your project threatening lives...

(If those "if"s and "because"s aren't true, then be sure to check out an explanation of volatile here: https://hackaday.io/project/5624/log/49037-interrupts-volatiles-and-atomics-and-maybe-threads )

------

The thing with volatile is that it's absolutely essential to understand how/when to use it, if you're doing *anything* where a person could be hurt.

The thing with fitting your code in 1kB to blink some LEDs or load an LCD-display is that you probably don't care, as long as it works most of the time.

I'm *in no way* suggesting you ignore this stuff habitually. You *definitely* need to be aware of it if you're ever going to do anything where others' safety is concerned, and, realistically, probably need to be aware of it even where *functionality* is concerned.

But, that-said... It's easy to get into the "habit" of believing that "volatile" is a safe-ish way to make sure you won't run into trouble... And that's not exactly the case.

AND, that-said... If you just use volatile, and the other techniques explained at that link, willy-nilly, then you might run into *excessive code-usage*.

So, all's I'mma say, here, is... if you're using them willy-nilly, and if you're in a tremendous space-crunch like this contest, consider those cases carefully... You might save yourself a few (numerous/countless) bytes if you *don't* use them where you *know* you don't *need* them.