Close

Multiplication / Division...

A project log for limited-code hacks/ideas

When you're running out of code-space... sometimes yah's gots to hack. Here are some ideas.

eric-hertzEric Hertz 11/26/2016 at 10:5419 Comments

These can be *huge* operations... Here are some thought-points on alternatives.

Again, these techniques won't save you much space (nor maybe *any*) if you use libraries which make use of them... So, when space is a concern, you're probably best-off not using others' libraries.

------

So, here's an alternative... Say you need to do a multiplication of an integer by 4...

A *really* fast way of doing-so is (a<<2), two instructions on an AVR.

If you need to do a division by 4? (a>>2), two instructions on an AVR.

(Beware that signed-integer operations may be a bit more difficult).

.....

Another alternative is... a bit less-self-explanatory, and likely quite a bit more messy...

In most cases, there will be *numerous* functions automatically-generated which handle multiplications/divisions between integers of different sizes. That's a lot of code generated which mightn't be necessary with some pre-planning.

(and don't even try to use floating-point... I'm not certain, but I'm guessing a floating-point division function alone is probably close to 1kB).

----------

ON THE OTHER HAND: Some architectures have inbuilt support for some of these things... E.G. whereas

(a<<3)
might require three instructions on any AVR,
(uint8_t)a * (uint8_t)8
may be only *one* instruction on a megaAVR containing a MULT instruction, but may be darn-near-countless instructions on a tinyAVR.

Read that again... On both architectures, using <<3 may result in exactly *three* instructions, whereas in one architecture (e.g. megaAVR), *8 may result in *one* instruction, whereas in another (e.g. tinyAVR) it may result in loading two registers, jumping to a function, and a return. AND, doing-so not only requires the instructions to *call* that function, but also the function itself, which may be *numerous* instructions...

---------

OTOH, again... Say you're using a TinyAVR, where a MULT instruction isn't already part of the architecture's instruction-set. If you're using other libraries which use the mult8() function, (e.g. by using a*b), mult8() *will* be included, regardless of whether you figure out other means using e.g. << throughout your own code.

There comes a point where even using << may result in *more* instructions than a call to the mult8() function which has already been included by other libraries.

(e.g. <<7 might be seven instructions, but if the mult8() function has already been included, then you only need to load two registers, and jump/call, which is only something like 3 instructions...)

There are lots of caveats, here... It will definitely take *longer* to execute mult8(), but it will take *fewer* (additional) instructions, in the program-memory to call it. Again, that is, assuming mult8() is compiled into your project, via another call from elsewhere.

-----------------------------------------------------------------------------------------------------------------------

TODO: This needs revision. Thank you @Radomir Dopieralski, for bringing it to my attention, in the comments below! As he pointed-out, the level of "micro-optimization" explained in this document can actually bite you in the butt if you're not careful. Optimizers generally know the most-efficient way to handle these sorts of things for the specific architecture, and often find ways that are way more efficient than we might think.

E.G. as explained earlier, (x*64) can be rewritten as (x<<6).

If your microcontroller has a MULT instruction, (x*64) may, in fact, require the fewest number of instructions.

If your microcontroller *doesn't* have MULT, then the optimizer (or you) might choose to replace it with (x<<6), which might result in six left-shift instructions. (or possibly a loop with one left-shift and a counter).

But there are many other cases us mere-mortals may never think of. E.G. some microcontrollers have a "nibble-swap" instruction, where, the high nibble and low-nibble quite literally swap places. So, the optimizer *might* see (x<<6) and instead replace it with, essentially, (nibbleSwap(x & 0x0f) << 2). That's four instructions, rather than six.

And then, as described earlier, there's the case where _mult8() is already in your code, and the optimizer (-Os for *size* not speed) might recognize that it only takes three instructions to call _mult8().

TODO: The point, which I completely forgot in writing this late "last night", wasn't to encourage you to replace your multiplications (e.g. x*64) with shift-operations (x<<6), but to be aware that code *can* be hand-tuned/optimized, when considered *carefully* (this takes a lot of experimentation, too!) and the results may not be ideal for all platforms/architectures or even for all devices in the same architecture! And, further, doing-so *may* bite you in the butt if done from the start... (e.g. you design around *not* using _mult8(), but then later down the road realize you *have to* for something else, now your code-size increases dramatically *and* your "micro-optimizations" are slightly less efficient than merely calling _mult8())

-------

E.G. consider (x*65)...

Do you *need* that much precision? If not, you might be able to get away with thinking about how your architecture will handle the operation... If your architecture has a MULT instruction, then you probably don't need to worry about it, but if it *doesn't* x*65 may very well result in *quite a few operations* that you don't need... If x*64 is close-enough, then using that *might* be *significantly* smaller in code-size and execution-time.

Note that this is a bit *in-depth* in that if somewhere else in your code (or libraries you've used) a similar operation is performed, then your compiled code will have a function like _mult8(a,b) linked-in... Calling that may only result in 3 additional instructions ( load registers a and b, call _mult8() ) whereas, again, remember that (1<<6) might result in *six* instructions. BUT: If you *know* that _mult8() is *not* used anywhere else, and you *know* that you don't absolutely need it, then you'll save *dozens* of instructions by making sure it's *never* used (and therefore not linked-in).

Think of this like the floating-point libraries... If you use floating-point, your code-size will likely grow by SEVERAL KB. If you throw usage of things like sin() or whatnot, that'll add significantly more. But if you *don't* use them, then they won't be added to your code-size. (This is similar to what happens with using global-variables which are initialized vs. those which aren't, described in a previous log). These aren't functions that *you've* written, they're essentially libraries that are automatically added whenever deemed-necessary.

Oy this is way too in-depth.

And, really, it requires quite a bit of experimentation.

TODO: A note on optimizers... -Os will most-likely consider other options such as the nibble-swap example given earlier, but some other optimization-levels will take your code word-for-word. Think you can outsmart it? :)

-------------------

Realistically, these techniques may only be useful if you've got complete control over all your code, and they're *considered* along-the-way, but only implemented *at the end* to squeeze out a few extra bytes...

Discussions

deʃhipu wrote 11/27/2016 at 16:59 point

Another technique that often lets you avoid complex calculations and costly operations is to calculate things iteratively -- similar to how the Bresenham algorithms work.

  Are you sure? yes | no

Yann Guidon / YGDES wrote 11/27/2016 at 19:02 point

use the Newton-Raphson algorithm to compute the reciprocal :-)

  Are you sure? yes | no

Eric Hertz wrote 11/28/2016 at 00:25 point

Oh you crazy Computery-guys with your brains for named-algorithms... 

I suppose you remember what 1/s means, and can look at one of these without going cross-eyed:

Y'all are more than welcome to join this "project" and contribute!

(@Radomir Dopieralski, you too, since you probably won't get an email 'bout this comment, unless I name-drop)

  Are you sure? yes | no

Yann Guidon / YGDES wrote 11/28/2016 at 00:29 point

Is that some sort of PID ?

  Are you sure? yes | no

Eric Hertz wrote 11/28/2016 at 01:17 point

<--- cross-eyed. 

I have no idea what it is, it's just an example feedback-system from wikipedia.

  Are you sure? yes | no

rubypanther wrote 11/27/2016 at 06:02 point

This is why I was asking about op-amps. If your math doesn't have to be too accurate, maybe analog calculus saves some bytes.

I'm dubious of the value of trying to squish a lot into 1k. Even the traditional 2k or 4k "demo" doesn't usually do much. I mean, lets say you squeeze really hard and you have 100k more than the next guy, what does that buy you? An extra pin to toggle?

Also, you can multiply two 3 bit numbers in only 48 logic gates if you have some discrete 7400-series sitting around. You can always de-solder later, right?

I think the winning strategy is to think of the most useful thing you can do with just some GPIO. Or, how to not be a fancy blinky.c!

  Are you sure? yes | no

Eric Hertz wrote 11/28/2016 at 02:16 point

Wow, this turned out long-winded, apologies if it seems a bit ridiculous...

I think you're amongst good folk, here, yo! You've checked out Yann's projects, right? He's got some excellent ideas for how to make things we typically do with microcontrollers in different ways, TTL-computers, even a relay-computer! And Radomir's contest-submissions have some great logs regarding how he's managed to shrink his code-size dramatically.

I dig the *concept* of analog circuits, but for some reason I can seldom get them to work the way I intended. I always seem to forget about something like CMRR, or something that can be neglected when working within the normal operating-conditions, but somehow I always seem to work outside them (and forget that I am). That's just been my own lot in life, so you won't see me using analog too often.

But, obviously, op-amp circuits work quite well, otherwise they wouldn't be as ubiquitous as they are, and I *totally* dig the concept... Analog systems existed once (and still do, to some extent) that did amazing things most of us think only computers can do, everything from tracking the stars in the sky for telescopes, to positioning-systems for satellites, and more.

And, "the real-world is analog," yahknow, so dang-near every sensor that exists is analog-in-nature... Accelerometers generally output analog voltages... If you periodically sample the raw data from (some?) accelerometers, your sample-data can be skewed by the sample-rate/resolution of your ADC causing it to miss subtle but important variances on the output, and that error can add up! One way to fix it? Throw in some analog circuitry to keep that "error" in the circuit. Digital-cameras' CCD elements are analog, capacitive-touch-pads sense analog effects... And high-speed digital electronics begin to look analog, too.

So, then, why *not* use more analog circuitry in our otherwise digital systems? I dig it.

Personally, not particularly good at the math, but I dig it.

There's a Cypress product called PSoC I think, that if I recall correctly is basically like an FPGA but with *analog* circuitry inside, op-amps, etc. configurable resistance-values, etc. Pretty durn cool concept.

Man I really harped on that analog thing.

------

Discrete logic is one of my favorite things... 7400-series is pretty durn amazing... I used to love just browsing the catalog from the late 70's (already 20 years old at the time I acquired it) just to see what sorts of "blocks" they had available, and come up with ideas for what to do with 'em. Nevermind they had schematics for the internals, could learn quite a bit from that. Most of the *really* sophisticated ones are no longer made in the newer speed-grade families, a bit of a shame, but I guess things like ALUs, etc. are so regularly-handled by processors or FPGAs that there's not much demand for discrete ones. But a lot can be done with what is still available. (Heck, *anything* digital can be done with enough 7400 NAND-gates :)

I dig your line-of-thought...

Oh, and if you're soldering-up 7400-series that you might later want to reuse for another project, consider using IC-sockets, instead ;)

------

As far as this contest goes... I can understand folks trying to "demo-scene" it, trying to squeeze every byte out of their microcontrollers, and I think it's pretty interesting to see what they're capable of accomplishing (and how). OTOH, I can see others building amazing things with nary a microcontroller to be found, and I think that's a pretty durn cool thing, too. And everything inbetween and outside, which is pretty awesome. Contest-wise: As long as it fits within the requirements, and either does something cool or is implemented in a cool way, eh?

As far as my write-ups with this "project," I'm just throwing out some ideas that I've found useful when running out of code-space, not trying to encourage demo-scening, just some things people might find useful if their 960Byte project suddenly becomes 1025Bytes, that it's not a show-stopper.

I've often thought of using old microcontrollers (the sort which rely on external memory) as nothing more than 16-bit counters (load NOP on the data-bus, and let it iterate through addresses). And there's a cool thing someone did once by allowing the data-bus to "float" causing random jumps and such to cause random-ish output on the address-bus, creating interesting LED-patterns (he made a "moody-face") with nary a byte of program-memory. And then there's folks creating program-memory with diodes and resistors... Weee!

(Can you tell I've drank some coffee for the first time in months?)

Where's your project, yo?!

  Are you sure? yes | no

Yann Guidon / YGDES wrote 11/28/2016 at 02:24 point

Stop coffeeing :-P

  Are you sure? yes | no

Eric Hertz wrote 11/28/2016 at 05:06 point

(Can you tell I drank *a lot* of coffee, for the first time touching it, in months?) OY.

  Are you sure? yes | no

Eric Hertz wrote 11/28/2016 at 18:06 point

Well well, look at that impeccable timing... almost exactly two years ago to the day:

https://hackaday.com/2014/11/26/anthropomorphizing-microprocessors/

  Are you sure? yes | no

rubypanther wrote 11/29/2016 at 08:02 point

I didn't want to sound like I'm anti-demo or anything. I was just thinking the ones with impressive output usually have the best colors, rather than the innovative fractal generator or whatever.

In the boxes at the makerspace are included some 80s programmable logic chips that have, not op-amps, but at least comparators. The Universal  Programmer (looks like 2 or 3 Apple ][ computers melded together) is there too, but there is no way I'm going to test that out anytime soon.

I doubt I'll even use op-amps, though it is quite possible. I'm not into all that math either, or the very narrow operating conditions. I mean, I do use them to do fancy math in circuits, but it is the amp doing the math, not me.

I did want to generate some sine waves with the correct phase offset though, to drive a recycled hard drive motor. I experimented with some transistors and realized, wow, that could work but it would be years of pain. So now I'm using a shift register and inverter as a Johnson counter. Much better. I'm not afraid of mixed circuits, but logic is great. Analog is difficult to convert to logic other than as an input or output. But if somebody did build an analog ALU, I would totally want it to win the Grand Prize. I may not really be generating a real sine wave, but it is in-phase and I'm OK with just that much. I'm not real engineer, I just own a lot of solderless breadboards.

As far as sockets go, the only bag of sockets at the makerspace are 8 pin, non-chainable, and the only 8 pin chips in there are some 555s and optical isolators. In my home stash I have a bunch of 14 pin sockets, chainable, but I paid 12 cents for them and that is already what a 7400 series is worth! The only thing I use them for is 2 at a time for atmega328 chips. Because sometimes I play around with 32k crystals, and my programmer doesn't work at that speed. So I brick a few. Also, sometimes it easier to program them outside the circuit.

I don't have an old catalog, but I do have junk boxes to rummage. The thing is, it pains me to have access to all this junk. Because, it is also a bit of a treasure! But at the current rate of use, bags of hundreds of transistors, diodes, ICs, etc. will eventually end up in some sort of trash pile. I'd much rather build them into circuits. That's what I meant about memory storage; right now this logic is stored in bags. If it was stored in-circuit, even if it was useless and went back in the parts box it would at least have some slight chance of archaeological significance some day. Or maybe even, it could win a prize and I'd be stuck with a useless doodad! Yay!

I do have a project in progress, not yet uploaded, that qualifies. It is just a bike light controller, with blinker and emergency flasher support. It is only about 700k, and I was actually trying to write maintainable code instead of binary golfing, so I can squeeze it down a lot and hopefully fit in PWM control for a buck controller.

I might also try to do some sort of PID controller for the HD motor.

If I can get I2C "working" (single client, single host) with 7400 series I'm going to see if I can squeeze an ssd1306 display into one of these projects! Full I2C with the hardware lib is just over 400 bytes, and I don't need all that sanity checking and acknowledgment, etc. A font is already 5 bytes per character. I'd rather just bang those bytes out to a shift register. So if I get that working, I'll get it up early so others can use it too. The display really only needs about 25 bytes of commands for setup. Obviously the Adafruit 5 wire version with hardware reset would need even less code, but I use the cheapo ones mailed direct from China. 4 pins, undocumented software reset, lovin'it. My favorite part, the datasheet swaps the words "hardware" and "software" when explaining that. It doesn't actually give the needed codes, but it does warn that they're needed.

I bought a sheet metal nibbling cutter so that I can do some Manhattan architecture for the build. Maybe with an acrylic greenhouse dome.

Me, I drink coffee every day. ;)

  Are you sure? yes | no

Eric Hertz wrote 12/01/2016 at 11:06 point

Hah, I way overdid the coffee thing... I was sick that evening and into the next day. It'll probably be another few months 'till I try again, in moderation.

I totally understand wanting to make use of parts, rather than see 'em get wasted. OTOH, I suppose with hacker-spaces and maker-spaces, there's more-likely a reasonable home for them than the trash-bin.

Sounds like you've some interesting projects/ideas. They don't have to be complete in any way, nor even documented at all, to have project-pages up here! Or you could throw a few short blurbs in "Things I've Built" on your profile-page.

  Are you sure? yes | no

deʃhipu wrote 11/26/2016 at 23:46 point

Don't do such micro-optimizations yourself, your compiler does them for you much better. As long as the value by which you are multiplying (or dividing) is known, the compiler will replace "a * 4" with "a << 2" as needed.

  Are you sure? yes | no

Eric Hertz wrote 11/27/2016 at 00:38 point

Very good point. Thank you for bringing it up, as I managed to slip on where I was going with this!

  Are you sure? yes | no

Elliot Williams wrote 11/27/2016 at 12:59 point

^^ This, in spades!  The AVR GCC optimizes multiplies and divides very well for you already -- have a look at the disassembled versions with avr-objdump -d -S.  You will most likely not be able to beat these.

_BUT_! The point is that if you know they're there, and can re-work your code to take advantage of them (in particular powers of two for divisors) it can really make a difference in the code's speed.  

So if you're scraping for every last cycle, you still need to know the optimizations, or at least what's out there, because you'll want to code for it.  But you don't need to write "x>>2", instead just "x/4".   

(Save the bit-shifts for when shifting bits is actually your intent, IMO.)

  Are you sure? yes | no

deʃhipu wrote 11/27/2016 at 13:16 point

That's true, I remember my surprise at the speedup when I experimented with different buffer sizes for a circular buffer, and tried 16 bytes, and suddenly all the modulo operations became super-fast.

  Are you sure? yes | no

Eric Hertz wrote 11/28/2016 at 02:39 point

"NO DISASSEMBLE!"

Nah, actually, disassembly is a *great* suggestion, I might think about throwing up a "log" about that. 

(Any thoughts on general-purpose means to assure that the disassembly is commented with the code...? I don't really understand how it works, since the objdump command doesn't have an argument for directing it to the original code... and for some reason the same arguments work with my AVR projects, but for my PIC32 projects all I get is the raw disassembly but no code alongside. Maybe a gcc/linker-option or output-file-format I'm forgetting?)

Very concise response... You should write a book! (haha). 

Rather, I should look into your book. http://littlehacks.org/avr-programming

  Are you sure? yes | no

Elliot Williams wrote 11/28/2016 at 14:04 point

@esot.eric  If you compile with -g for debugging symbols, and then run the objdump command, it'll interleave your C code with the assembler, as well as it can.  Which is pretty well, but never perfect because of optimizations.

If you pull up the list of AVR assembly commands from one of the datasheets, you can actually teach yourself assembly pretty well this way.  Write a for loop, compile, see what it turns into in assembly, etc.  It's the bees knees for optimizing.

  Are you sure? yes | no

Eric Hertz wrote 11/28/2016 at 15:37 point

Awesome, thank you for that!

Indeed, I now see C code interleaved with the disassembly output for both architectures! Excellent, because as you've pointed-out, this is a great way to learn assembly. 

I've done a little with AVR, but hadn't yet been able to with the PIC32. And... WHOA... The optimizer (-Os) for avr-gcc is *way* better than the maximum optimization-level in the free version of xc32-gcc. I knew I relied on the optimizer to do a lot, but I didn't realize just how much!

New "log" coming up...

  Are you sure? yes | no