
Faster speed: how to optimize math and process tasks

How to make math and other processes faster for specific tasks on any processor, with logs of examples of math performance.

This is a set of logs about math techniques and other enhancements for CPU processes. Always weigh your time against CPU cost. Here are some of the fastest, least memory-hungry methods to get things done quickly. I will add to it from time to time; some entries will be math tables, others will be methods I use.

When it comes to optimization there are several things to consider. It is not always profitable to make things run as fast as possible; after all, development time is a consideration. But if you want to make friends, it is also best not to require a $200 CPU for an otherwise $10 project. Prices always come down, and if you make something efficient it will usually work better on newer hardware too, leaving more headroom for other cool tasks.

I'll start here with some basics. Simple things like the following (a combined sketch follows the list):

*Always multiply instead of divide.

If possible, precompute the reciprocal 1/x and multiply by it: for example, x / 4 becomes x * 0.25.

*Store variables outside of the loop to reduce variable load and unload on the stack.

This can speed up deep loops by over 100% (it does increase RAM usage, however).

instead of:

void function() {
    float math1;
    float math2;
    // ...
}

do this:

float math1;
float math2;
// ...

void function() {
    // ...
}

*Table values when possible, so the math can be done before compile time.

*If you must divide and multiply, try powers of 2: use uint32_t or uint16_t and shift right (>>) to divide by 2 or shift left (<<) to multiply by 2.

*Never declare plain int; declare int16_t or int32_t explicitly, otherwise 8-bit and 32-bit targets will not be good friends.

There are several differences between how 16-bit math and 32-bit math are done; this is especially important in multiplication and in bit shifting. Also, int defaults to int32_t on a 32-bit processor and int16_t on an 8-bit one.
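
As a combined illustration of the bullets above, here is a minimal sketch (the function and variable names are just made up for this example):

#include <stdint.h>

// hoisted outside the function so they are not pushed and popped on every call
static const float quarter = 0.25;   // precomputed 1/4, so we multiply instead of divide
static uint16_t accumulator = 0;     // explicit width: behaves the same on 8-bit and 32-bit targets

float process(uint16_t raw) {
    uint16_t halved  = raw >> 1;     // divide by 2 with a shift (unsigned, power of 2)
    uint16_t doubled = raw << 1;     // multiply by 2 with a shift
    accumulator += halved;
    return (float)doubled * quarter; // doubled / 4, written as a multiply
}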

  • faster sqrt but less accurate

    jamesdanielv 09/17/2019 at 15:25

    I found this somewhere on the internet:

    https://community.arm.com/developer/tools-software/tools/f/keil-forum/20313/speed-up-square-root-computation-approximation

    And indeed it is faster. I had to correct the 1L to 127L, as explained in the linked thread, where some users had issues with the original. It has an error rate of up to 8%, but for whole numbers it is under 6%.

    float fastsqrt(float val) {
        long tmp = *(long *)&val;   /* reinterpret the float bits as a 32-bit integer */
        tmp -= 127L<<23; /* remove the IEEE exponent bias */
        /* tmp is now an approximation to log base 2 of val */
        tmp = tmp >> 1;  /* divide by 2 (square root halves the exponent) */
        tmp += 127L<<23; /* restore the IEEE exponent bias */
        return *(float *)&tmp;
    }

    This version brings the error rate for whole numbers under 3%:

    float fastsqrt(float val) {
        float invertDivide = 0.66666666666; // ~1/1.5, rounded down to single float precision
        long tmp = *(long *)&val;           // bits of the original value
        val *= 0.22474487139;               // scaled copy, used to build a correction offset
        long tmp2 = *(long *)&val;
        tmp  -= 127L<<23; /* remove IEEE bias from exponent */ // 1065353216
        tmp2 -= 127L<<23; /* remove IEEE bias from exponent */
        tmp  = tmp  >> 1; /* divide by 2 */
        tmp2 = tmp2 >> 1; /* divide by 2 */
        // note: whenever tmp is negative (inputs below 1.0, e.g. tmp = -2, -1, 0) the error rate is higher
        // if (tmp < 0) tmp = tmp + 1L<<22; // invert         -10066330
        tmp  += 1065353216; /* restore the IEEE bias from the exponent (127L<<23) */
        tmp2 += 1065353216; /* restore the IEEE bias from the exponent (127L<<23) */
        float offset = *(float *)&tmp2;
        val = *(float *)&tmp;
        return (val + offset) * invertDivide;
    }
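
    For reference, a small harness like this can compare an approximation against the standard library (a sketch only, assuming a desktop test build with printf and sqrtf; the percentage here is error relative to the true square root, which may differ slightly from how the figures below were measured):

    #include <stdio.h>
    #include <math.h>

    float fastsqrt(float val); // one of the versions above, compiled alongside this file

    int main(void) {
        float worst = 0.0f;
        for (float x = 1.0f; x <= 100.0f; x += 1.0f) {
            float exact  = sqrtf(x);
            float approx = fastsqrt(x);
            float err    = fabsf(approx - exact) / exact * 100.0f; // % error vs true sqrt
            if (err > worst) worst = err;
            printf("%g  %f  approx=%f  error %% = %f\n", x, exact, approx, err);
        }
        printf("worst error over the range: %f %%\n", worst);
        return 0;
    }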

    First version: <4.3% error over the number range 0 to 100.

    input 2: true sqrt = 1.414214, approximation = 1.5, variation % = 4.289323

    Version with the offset: <2.7% error over the number range 0 to 100.

    input 2: true sqrt = 1.414214, approximation = 1.466326, variation % = 2.605647

    But it gets tricky for numbers below 1.

    The original has an error rate of >24% for the value below, while the new version is at about 11.5%:

    input 0.01: true sqrt = 0.100000, approximation = 0.101153, variation % = 11.530593

    So some work needs to be done to make it adjust to the precision of the float.

    I'll play around with it more. It seems useful for some people, but I would like precision closer to 0.5% to 1%, or even better.

    I'll update if I get something faster or better.

  • speed up i2c, faster than it can normally go, by caching the data you use

    jamesdanielv 09/10/2019 at 13:32

    When reading i2c data there are performance penalties from reading only a few bytes at a time, and also penalties from reading too much at a time. So the best method is probably to read only the range you need and nothing more. From my data it seems that 32-64 words at a time is the ideal cache size.

    Some CPU designs have a hardware buffer for i2c, and some have DMA access. Loading more i2c data than needed will in many cases cost extra CPU cycles, so you want the optimum amount of cache.

    https://i2c.info/i2c-bus-specification

    All I have is a data comparison for reading the MLX90640 sensor. In my case the sensor requires reading 768 words of data, so 768 x 2 = 1536 bytes. With an Uno or Nano processor that has only 2K of RAM this is not practical, and there is a lot of overhead: the processor cannot do anything else while fetching that much data at once.

    I wanted to determine the optimum i2c cache size, where the benefit of reading data in bulk is maximized.

    Keep in mind that start and stop conditions take about a byte-time as well (sometimes it is only 2 bits of clocks).

    For example, i2c in my case can go up to 1 MHz. The protocol for a read is the following:

    address of the device to talk to, at the start: up to 2 bytes (this allows up to a 10-bit address)

    address of the start of memory to read: 1 to 2 bytes (this also carries a bit that says write or read)

    address of the last byte of memory to read: 2 bytes

    There are also start and stop conditions and the clocking of the serial data; we treat those as bytes as well.

    So each byte of data sent back costs 1 byte-time, and we request a minimum of 2 at a time for word data.

    So for our purposes we can assume that reading 2 bytes of data (the minimum for memory read values on the MLX90640 sensor) looks like this:

    READING 1 uint16_t (word) == 9 byte-times to clock through serially:

    start bit (1 byte) + device ID (1 byte) + start address (2 bytes) + end address (2 bytes) + DATA (2 bytes) + end bit (1 byte)

    So reading only 1 word of data from the i2c device in our case takes 9 bytes.

    READING 2 uint16_t (words) == 11 byte-times to clock through serially:

    start bit (1 byte) + device ID (1 byte) + start address (2 bytes) + end address (2 bytes) + DATA (2 bytes) + DATA (2 bytes) + end bit (1 byte)

    So reading 2 words of data takes 11 bytes.

    The overhead of 7 bytes (everything except the data in the breakdown above) is in every transfer, even if we read several words at a time.

    Here is the theoretical performance increase based on the number of words read per transfer. Efficiency is data bytes divided by total bytes on the wire, roughly 2n / (2n + 7) for n words; 1.0 would be 100%.

    1 word is 22% efficient

    16 words is about 80% efficient, or 3.6 times faster than reading 1 word at a time

    32 words is 90% efficient, 4.1 times faster than reading 1 word at a time

    64 words is 94% efficient, 4.3 times faster than reading 1 word at a time

    128 words is 97% efficient, 4.4 times faster than reading 1 word at a time

    The table that shows the efficiency is at the end of this log.
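
    A quick way to approximately reproduce these percentages from the byte breakdown above (a small sketch; it assumes the 7 byte-times of per-transfer overhead counted earlier):

    #include <stdio.h>

    // efficiency of an i2c burst read: data bytes versus total byte-times on the wire
    static float i2c_efficiency(unsigned words) {
        float data_bytes  = 2.0f * words;       // each word is 2 bytes
        float total_bytes = data_bytes + 7.0f;  // start + device ID + start/end address + end
        return data_bytes / total_bytes * 100.0f;
    }

    int main(void) {
        unsigned sizes[] = {1, 2, 16, 32, 64, 128};
        for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
            printf("%3u words: %5.1f%% efficient\n", sizes[i], i2c_efficiency(sizes[i]));
        return 0;
    }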

    Here is the i2c code I use. I just read 32 or 64 words at a time, and only when a cache miss occurs do I reload. I make sure that my reads of the data go as linearly as possible; this prevents overuse of the caching.

    The cache can be enabled and disabled, and it works without any other coding consideration beyond trying to read data within a 32-word window, in order. Even if you don't, it will still work.

    You can see why I would be using a 64-word buffer, and why I am considering going down to a 32-word buffer so I can use the RAM saved to cache other information (serial data cached to burst to the screen).

    In my case 16 or 32 words of data would be appropriate; I have other issues causing more overhead than the speed of the i2c data transfer. But you can see from below what the efficiency would be for each buffer size.

    this example uses i2c data to read 1 word to...
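
    As a rough sketch of the cached-read idea described above (an illustration only, not the code referred to here; it assumes the Arduino Wire library, and the device address, register layout, and window size are placeholders):

    #include <Wire.h>

    const uint8_t  DEV_ADDR    = 0x33;  // placeholder i2c device address
    const uint8_t  CACHE_WORDS = 16;    // cached window; the stock AVR Wire buffer is 32 bytes,
                                        // so 32 or 64 word windows need a larger Wire buffer
    uint16_t cacheBuf[CACHE_WORDS];
    uint16_t cacheStart = 0xFFFF;       // first word address currently cached (0xFFFF = empty)

    // read one 16-bit word; refill the cached window only on a cache miss
    uint16_t readWordCached(uint16_t wordAddr) {
        if (cacheStart == 0xFFFF || wordAddr < cacheStart || wordAddr >= cacheStart + CACHE_WORDS) {
            cacheStart = wordAddr;                     // miss: refill starting at this address
            Wire.beginTransmission(DEV_ADDR);
            Wire.write((uint8_t)(cacheStart >> 8));    // register address, high byte
            Wire.write((uint8_t)(cacheStart & 0xFF));  // register address, low byte
            Wire.endTransmission(false);               // repeated start, no stop
            Wire.requestFrom(DEV_ADDR, (uint8_t)(CACHE_WORDS * 2));
            for (uint8_t i = 0; i < CACHE_WORDS; i++) {
                uint16_t hi = Wire.read();
                cacheBuf[i] = (hi << 8) | Wire.read(); // big-endian word
            }
        }
        return cacheBuf[wordAddr - cacheStart];        // hit: no bus traffic at all
    }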


  • create a fast lookup table for power-of-2 values if you know the range

    jamesdanielv 09/08/2019 at 15:20

    I recently had some math optimizations that required improving the performance of pow() operations.

    These numbers were only powers of 2, and the exponent only went up to 48.

    This is currently about 100x the speed of the normal method; it would be 1000x or more if the table were in RAM.

    So I created a table, and decided the cheapest way to store the values without using much memory was to put the table in PROGMEM, which on Arduino stores it in flash to be read back later.

    This may be modified further with a predictive cache or a compressed table, since getting values back from flash is slower than RAM, and RAM is valuable on Arduino. For processors with RAM to spare this table should be located in RAM.

    Here is the function that pulls values to be used in equations. This table handles 2^x with x from 0 to 63.

    #include <avr/pgmspace.h> // for PROGMEM / pgm_read_float_near

    extern const float power_of2table[] PROGMEM; // the table, defined below

    float SimplePowFast2s(uint8_t x) { // returns 2^x, with x from 0 to 63
        return pgm_read_float_near(power_of2table + x);
    }

    here is the table

    const float power_of2table[] PROGMEM = { 
    1,
    2,
    4,
    8,
    16,
    32,
    64,
    128,
    256,
    512,
    1024,
    2048,
    4096,
    8192,
    16384,
    32768,
    65536,
    131072,
    262144,
    524288,
    1048576,
    2097152,
    4194304,
    8388608,
    16777216,
    33554432,
    67108864,
    134217728,
    268435456,
    536870912,
    1073741824,
    2147483648,
    4294967296,
    8589934592,
    17179869184,
    34359738368,
    68719476736,
    137438953472,
    274877906944,
    549755813888,
    1099511627776,
    2199023255552,
    4398046511104,
    8796093022208,
    17592186044416,
    35184372088832,
    70368744177664,
    140737488355328,
    281474976710656,
    562949953421312,
    1125899906842624,
    2251799813685248,
    4503599627370496,
    9007199254740992,
    18014398509481984,
    36028797018963970,
    72057594037927940,
    144115188075855870,
    288230376151711740,
    576460752303423500,
    1152921504606847000,
    2305843009213694000,
    4611686018427388000,
    9223372036854776000,
    };
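
    A quick usage example (the variable names here are just for illustration), replacing a pow(2, x) call:

    uint8_t exponent = 10;
    // instead of: float scale = pow(2, exponent);  // slow floating-point pow()
    float scale = SimplePowFast2s(exponent);        // reads 1024.0 straight from the flash table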

  • if you can cache a problem

    jamesdanielv 09/08/2019 at 13:00

    This method of boosting performance is universal, so it is probably one of the most effective ways to increase performance across designs: it simply reduces the amount of redundant work. It does not rely on special syntax or on a particular mode or feature of a CPU design. All it uses is a couple of variables outside of the loop.

    There are times when a problem is solved more than once; in some cases it is solved several times with the same result. There are also tables of data in which some values do not change. Here is an example of a single function that can be called several times. The two cached values can be replaced with an array indexed by the cell being measured. Below is a basic view of how math caching works.

    Here is an example of a function that caches its result. This is a simple version: it only reuses the cached result when the input is the same as last time.

    In order to cache data we need two values stored outside of the loop.

    We need the ResultCache value; this is just the number we return without doing any further work.

    And we need the reference Compare Cached value; this is the value we use as a reference to verify that the problem has not changed.

    Here is an example function, DataconvertDegCTodegF, that takes a temperature in deg C and converts it to deg F.

    We get several requests for this per second, but we never know how often the input actually changes. Many values are like this, especially with sensors and data sets; they are the same most of the time.

    This is a simple example of math caching:

    // cached values stored outside the function
    float reference_Compare_Cached_value = NAN; // start as NAN so the first call always does the work
    float ResultCache_value = 0;

    float DataconvertDegCTodegF(float temperature_Deg_C) { // this value is read several times, with an unknown interval of change
        // we output a conversion from deg C to deg F
        // (0°C × 9/5) + 32 = 32°F
        if (reference_Compare_Cached_value != temperature_Deg_C) {
            // the reference value has changed, so we do the work
            ResultCache_value = (temperature_Deg_C * 1.8) + 32;  // we get deg F
            reference_Compare_Cached_value = temperature_Deg_C;  // we update the reference
        }
        // we have completed the work if the input changed; if not we send back what we have cached
        return ResultCache_value;
    }
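
    A quick usage sketch (the loop and the fixed reading are just for illustration): repeated calls with an unchanged input skip the multiply and add entirely.

    void demo() {
        float sensorDegC = 21.5;                            // pretend sensor reading
        for (int i = 0; i < 100; i++) {
            float degF = DataconvertDegCTodegF(sensorDegC); // only the first call does the math
            (void)degF;                                     // ... use degF here ...
        }
    }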

    This function is not complex, but if it is run several times a second it can have an impact on performance. Some of the code I did for the AMG8833 sensor used this kind of caching.
