If you'll remember, I want to sample voltage and current at 3840 Hz. With a 2 MHz instruction clock, that gives me 520 instructions between samples. At face value, that should be enough to capture 6 ADC values, multiply 16-bit numbers 9 times, and add 16- and 32-bit values 9 times. That's all I need to do, and without doing any calculations I naively figured it would certainly take fewer than 500 instructions. Well, when I actually simulated it, it took... wait for it... 4398 instructions! What! Even at the max oscillator frequency of 48 MHz (a 12 MHz instruction clock, since the PIC18 runs one instruction cycle per four oscillator cycles), that only allows 3125 instructions per 3840 Hz sample period. My fallback plan was shot.
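To make the budget arithmetic concrete, here is a minimal sketch in C. The function name `cycles_per_sample` is made up for illustration, and the 8 MHz oscillator figure is an assumption derived from the 2 MHz instruction clock via the PIC18's Fosc/4 rule.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Instruction cycles available between two samples.
 * On PIC18 the instruction clock is the oscillator divided by 4,
 * so a 2 MHz instruction clock corresponds to an 8 MHz oscillator
 * (assumed here) and 48 MHz gives 12 MHz.
 * Hypothetical helper, not part of the project code.
 */
static uint32_t cycles_per_sample(uint32_t fosc_hz, uint32_t sample_hz)
{
    assert(sample_hz != 0u);            /* guard against divide by zero */
    return (fosc_hz / 4u) / sample_hz;  /* integer division rounds down */
}
```

With these inputs, `cycles_per_sample(8000000, 3840)` gives 520 and `cycles_per_sample(48000000, 3840)` gives 3125, so a routine that needs 4398 cycles fits in neither budget.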
So I took a look at the assembly the XC8 compiler generated, and it's largely garbage: lots of NOPs, useless branches, and values moved around that are never used. After looking around the web for a bit, I found this is normal for the free version. I can understand not optimizing the assembly for the free version, but come on, inserting garbage instructions is not cool! I really don't want to write any assembly for this project if I don't have to, so I looked at how I could change the C code around to produce fewer assembly instructions.
Starting off with the
tempV[i] = tempV[i] >= midPtV ? tempV[i]-midPtV : midPtV-tempV[i];
line, I tried changing it to
if (tempV[i] >= midPtV) tempV[i] -= midPtV;
else tempV[i] = midPtV-tempV[i];
That saved 9 instructions total for the entire function. Since there is another line that's basically the same thing except with tempA, that's 18 total saved (it actually saves 26 with both of them... go figure). This is going to take a while... UNTIL I stepped through the code to this part:
dp[i].instVolts += (uint32_t)tempV[i] * tempV[i];
It takes 370-390 cycles to execute one of those lines, and with the loop there are 9 of them in total, which is insane. But that's not the interesting part. I found that if you step into this line, it branches to a file called Umul32.c in the sources\common directory of XC8. This file has multiple methods for doing a 32x32-bit multiplication with a 32-bit result. One of these methods has an
#if ... && defined(__OPTIMIZE_SPEED__)
directive preceding it. That was not the method the compiler chose for me, of course. I made a couple of failed attempts at defining that value, then tried commenting out the #if lines and the slower methods (the debugger still stepped through the commented-out code!). In the end I just copied the faster code into a new function and used that instead.
I'm not sure about the legality of posting the code, but the 32x32 algorithm is probably a standard one (again, not sure), and I believe I've given you enough to go on to replicate it. The results were pretty fantastic. The cycle count for the GatherData() function went from 4362 (after the first optimization I tried) to 2190 with the new 32-bit multiply code. Awesome. Now I can run the PIC at 48 MHz and have it complete the data gathering with cycles to spare. Mission accomplished!
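For anyone trying to replicate this, the following is a rough sketch of the textbook approach: building a 32x32 multiply with a 32-bit result out of 8x8 partial products, which is the only multiply the PIC18 hardware does natively. This is not the XC8 source, and the names `mul8x8`, `mul32_sketch` and `square16_sketch` are made up for illustration. The second function assumes the value being squared is an unsigned 16-bit number, in which case only three 8x8 products are needed.

```c
#include <assert.h>
#include <stdint.h>

/* 8x8 -> 16-bit unsigned product, the operation the PIC18's
 * hardware multiplier provides. Hypothetical helper name. */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    return (uint16_t)((uint16_t)a * (uint16_t)b);
}

/* Schoolbook 32x32 multiply keeping only the low 32 bits.
 * Only byte pairs with i + j < 4 can affect the low 32 bits,
 * so 10 of the 16 possible partial products are computed.
 * Written as loops for clarity; a fast version would be unrolled. */
uint32_t mul32_sketch(uint32_t a, uint32_t b)
{
    uint8_t x[4], y[4];
    uint32_t result = 0;
    uint8_t i, j;

    for (i = 0; i < 4; i++) {
        x[i] = (uint8_t)(a >> (8 * i));
        y[i] = (uint8_t)(b >> (8 * i));
    }
    for (i = 0; i < 4; i++) {
        for (j = 0; i + j < 4; j++) {
            /* Bits shifted past bit 31 are discarded, as intended. */
            result += (uint32_t)mul8x8(x[i], y[j]) << (8 * (i + j));
        }
    }
    return result;
}

/* Square of an unsigned 16-bit value as a 32-bit result:
 * v*v = lo*lo + 2*(lo*hi)*2^8 + hi*hi*2^16, three 8x8 products. */
uint32_t square16_sketch(uint16_t v)
{
    uint8_t lo = (uint8_t)v;
    uint8_t hi = (uint8_t)(v >> 8);
    uint32_t cross = mul8x8(lo, hi);

    return (uint32_t)mul8x8(lo, lo)
         + (cross << 9)                        /* 2 * cross * 2^8 */
         + ((uint32_t)mul8x8(hi, hi) << 16);
}
```

If tempV[i] really is an unsigned 16-bit value, the accumulation line could be written as `dp[i].instVolts += square16_sketch(tempV[i]);`. I have not measured cycle counts for this sketch on XC8, so treat it as a starting point and check it in the simulator.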
This can of course be improved upon further, but I don't want to take the time to do that. If I really wanted to make things easy to change, I would go with a 32-bit micro (PIC32), which can natively handle these calculations without having to convert between types and has a 32-bit hardware multiplier instead of the 8-bit one in the PIC18. Knowing what I know now, I would have gone with the PIC32 from the start. I found out its compiler is gcc based, and you can find instructions on the web on how to enable the paid optimizations for free on it. XC8 is a different compiler entirely and is hobbled in the free version. In other words, if this project gets as far as a Rev 2.0 and power consumption is an issue, look for it using a PIC32 at a lower clock speed. It would be ironic, after all, for a power meter to be power inefficient, right?