Dimensional analysis fails me again

A project log for Vintage Z80 palmtop compy hackery

It even has a keyboard!

esot.eric 09/26/2021 at 21:04

I never seem to get it right. It makes /so much/ sense, and yet completely eludes me whenever I try to apply it in even the tiniest way outside the norm.

In fact, now I'm at a complete loss for how ft-lb or A-hr work.

So, I can't really explain how this works, but I'm near certain it does...

I measure 49 clock pulses, coming in at 41.667k/s in 263 loops... then want to find out how many delay loops I need to create a UART bitrate at 10.4kbps.

One loop is 21 T-states (CPU clock cycles), though I've found that to be irrelevant, because I managed to make both the measurement loop and the delay loop exactly the same length.

Now, 41.667/10.4 happens to be almost exactly 4. But, later I might like to reuse this with 9600baud, or maybe a different clock. 41667/9600 is... 4. Unless you're using floating-point. And I'm definitely not about to handcode floating-point. So, I can't just multiply the measured loop-count by 4...

So, back to our measurements:

To convert an arbitrary measurement of loops and clocks to a number of loops per an arbitrary bit-rate:

( A[loops] / B[clocks] ) * ( C[clocks/sec] / D[bits/sec] ) = ((A * C) / (B * D)) [loops/bit]

Sounds about right... our C/D = 4.006, so 4 should be fine, but not if I want 9600bps.

So I multiply C and D for higher precision...

I'd prefer to use 16bit math, since that's de facto in the z80. So, turns out 125[clk/sec] / 31[b/sec] = 4.03 is close enough, anyhow, and 125/29 is pretty good for 9600.
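To put numbers on how close those small-integer ratios land, here's a quick sanity check in Python (a sketch on a desktop, not the z80 code; the helper name `ratio_error_pct` is made up for illustration):

```python
from fractions import Fraction

def ratio_error_pct(approx_num, approx_den, clock_hz, baud):
    """Percent error of approx_num/approx_den versus the true clock_hz/baud ratio."""
    exact = Fraction(clock_hz, baud)
    approx = Fraction(approx_num, approx_den)
    return float((approx - exact) / exact * 100)

# 41.667k clocks/sec against 10.4kbps and 9600bps
print(round(ratio_error_pct(125, 31, 41667, 10400), 2))  # 125/31 stand-in
print(round(ratio_error_pct(125, 29, 41667, 9600), 2))   # 125/29 stand-in
print(round(ratio_error_pct(4, 1, 41667, 9600), 2))      # plain 4 at 9600
```

Both 125-based ratios should come out within about 0.7% of the real thing, whereas plain 4 is off by nearly 8% at 9600.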

(Why didn't I go with 128? Some random proof-of-concept math got me there, I might actually change it to 256, and stop sampling at 256 samples, so the top number (A*C) is always 65536 which would make later math/comparisons easier)

So, now, it starts getting confusing...

Each loop takes 21 T-States. Why is this important, here? It's the same for sampling... forget T-states. Erm...

OK, so, the problem is, I'm trying to relate this to /units/, and I can't.

So each loop is essentially one subtraction in the process of division.

The denominator (B*D) gets added each loop, until it reaches the numerator. When it does, we've delayed the right amount to toggle the next UART bit.

So, lemme give a simple example: we know that it should quit delaying after four loops, right? No, because it's 263loops/49clks. Oy, let's round... 250/50->25/5->5/1... five loops per clock, say. And 4 clocks per bit... so 20 loops per bit.


Now, 1/20=0 in the first loop, 2/20=0 in the second loop... 19/20=0 in the 19th loop, then finally 20/20=1, so the 20th is the final loop...

Division, but using addition, instead.

OK, so if I'da done that with real numbers,

 ( A[loops] / B[clocks] ) * ( C[clocks/sec] / D[bits/sec] ) = ((A * C) / (B * D)) [loops/bit]

A=263 loops per...

B=49 clocks.

C=125 clocks per...

D=31 bits.

A*C=32875 [loop*clocks] (?!)

B*D=1519 [clock*bits] (?!)

And if I do the math, then for a few loops:

Loop1: 1519 < 32875, so continue

Loop2: 3038 < 32875

Loop3: 4557...

Loop21: 31899 < 32875, so continue

Loop22: 33418 > 32875, so stop

So, it overshot a bit, (no more than one loop, or 21 T-States, or roughly 4.5us at 4.7MHz) but that's nowhere near one bit duration of 96us, and I can account for the overshoot in the next bit's delay, such that overall the error doesn't add up.
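Here's the same count-up as a small Python model, just to check the arithmetic. The function name `delay_loops` and its `start` parameter are my own invention for this sketch; `start` is where the previous delay's overshoot would get fed back in.

```python
def delay_loops(target, step, start=0):
    """Add `step` once per loop until the running total reaches `target`.

    Returns (loops executed, overshoot past target).
    """
    total = start
    loops = 0
    while total < target:
        total += step
        loops += 1
    return loops, total - target

# A*C = 263*125 = 32875, B*D = 49*31 = 1519
loops, overshoot = delay_loops(263 * 125, 49 * 31)
print(loops, overshoot)                                    # 22 543

# Feed the overshoot into the next bit's delay
print(delay_loops(263 * 125, 49 * 31, start=overshoot))    # (22, 1086)
```

The overshoot is always smaller than one step (1519), so it never amounts to more than one loop's worth of time.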

So, I think you can see that if I changed my scale-factor (clks/bits) such that the numerator is 65536, then I don't have to compare two 16bit numbers, and can instead just keep adding to the 16bit numerator until it overflows from 64000ish to 0-1000ish, which, during that last addition, causes the "carry" flag to be set, telling it to stop looping/delaying, /and/ prepares the next delay to compensate for this one's overshoot.
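A rough Python model of that carry trick, assuming a 16-bit accumulator that wraps at 65536. The step value of 3000 is purely hypothetical, picked only to show the wrap-around, and `delay_until_carry` is a made-up name.

```python
def delay_until_carry(acc, step):
    """Model a 16-bit accumulator: add `step` per loop until the add carries out.

    Returns (loops executed, wrapped accumulator). The wrapped value is the
    overshoot, already sitting there as the starting point for the next delay.
    """
    loops = 0
    while True:
        acc += step
        loops += 1
        if acc > 0xFFFF:              # this is where the carry flag would be set
            return loops, acc & 0xFFFF

loops, acc = delay_until_carry(0, 3000)    # hypothetical step size
print(loops, acc)                          # 22 464
print(delay_until_carry(acc, 3000))        # (22, 928)
```

No 16-bit compare anywhere: the only test is whether the last addition carried.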

OK, but now, what the heck are we counting, here? I mean, units-wise...

( A[loops] / B[clocks] ) * ( C[clocks/sec] / D[bits/sec] ) = ((A * C) / (B * D)) [loops/bit]

A=263 loops per...

B=49 clocks.

C=125 clocks per...

D=31 bits.

A*C=32875 [loop*clocks] (?!)

B*D=1519 [clock*bits] (?!)

What the heck does it mean "how many clock-bits are there in a loop-clock?"

I have no friggin clue. What the heck is a clock*bit? Or a loop*clock? Or, for that matter, a foot-pound or an Amp-hour? And, frankly, now I'm at a near complete loss. OK, Amp-hours in a battery are per battery recharge. So, realistically, the units of "Amp-hours" are equal to "Recharges?" And foot-pounds... well, now it's starting to seem a bit like the x*y units equate to some sort of unit meaning something like "available".

4Ah/1Available=1, then 4Ah=1Available, then 4A=1Available/hr... ? I dunno, I'm lost again. And, here, I'd been planning to use dimensional analysis to figure out /whether/ my algorithm is sound, /and/ to figure out the next hurdle, which is:

There are /other/ things going-on between my delay-loops... I've got to extract the next bit from the byte, then output that to the port. I've got to /call/ the delay function, and return from it... So, that means, I need to subtract those calculations'/instructions' times from my delay... but how do I convert from T-States to clock-bits?!

Best I can figure is that I know X clock-bits is one 21T-State loop. So, if there are, say, 190 T-States of calculations inbetween my last delay loop and this one, then I need to divide 190/21 then multiply that by X clock-bits, again, whatever the heck those are.

I'm pretty certain this'll work, and I did come up with it, but I just can't quite wrap my head around it.


Holy crud, I'd been working on this for days... I just about got it, except for having to calculate the number of Loops from the number of intermediate T-states, then converting it to clk-bits... which would take well into the hundreds of T-states with a multiplication and division.

So then the thought to just require X t-states between calls (throw in some NOPS) so that could be precalculated, but no, it can't, the whole point is that our T-states are of a /measured/ rate, thus needs to be calculated based on measurements... then the thought of having a delay_setup() function that does that calc once before actually calling the delays.... (Why'd I stop there, it woulda worked!)

But I stepped back, /finally/...

( 263 [Loops] * 21 [T-States/Loop] * 41.67 [kclks/sec] ) / ( 49 [clks] * 10.4 [kbits/sec] ) = [T-States/bit]

And everything except 263 and 49 are constant....

21*41.67/10.4 = 84.14

9.6Kbps? 91.15

Rounding gives 84 and 91, each with way under 1% error.

Now it's simply: 263 * 84 / 49.

If I'm /really/ smart I might limit the samples to 256, then it's all 8-bit math. Wait, no... that aint right. Anyhow, I already wrote my uint8 * uint16 function. And the clock bursts are too short to make loops*84 overflow a uint16... (right?) So I'm all set.

The answer gives T-States per UART bit... with these numbers, 450 and 488 for 10.4k and 9.6k (489 if the unrounded 91.15 is used).
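Checking that math in Python, including the "does loops*84 fit in a uint16?" question. This is a sketch under the numbers above; the function names are made up, and the overflow check is my own addition.

```python
LOOP_TSTATES = 21      # T-states per measurement/delay loop
CLOCK_KHZ = 41.667     # measured clock, in thousands of pulses per second

def tstates_constant(kbps):
    """The precalculated constant: 21 * 41.667 / bitrate, rounded to an integer."""
    return round(LOOP_TSTATES * CLOCK_KHZ / kbps)

def tstates_per_bit(loops, clocks, k):
    """Integer T-states per UART bit from `loops` counted over `clocks` pulses."""
    product = loops * k
    if product > 0xFFFF:
        raise OverflowError("loops * k does not fit in a uint16")
    return product // clocks

print(tstates_constant(10.4), tstates_constant(9.6))   # 84 91
print(tstates_per_bit(263, 49, 84))                    # 450
print(tstates_per_bit(263, 49, 91))                    # 488 (the unrounded 91.15 would give 489)
print(0xFFFF // 91)                                    # 720: largest loop count that is safe for both constants
```

So the uint16 holds as long as a burst never measures more than 720 loops.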

I already wrote my T-State delay function. DAYS AGO.





Using the earlier logic, I CAN do this /without/ an explicit division.

Count up to 263*84=22092

In steps of 49

Instead of counting to 450 T-States in steps of 21 T-States per loop.



LOL multiple fail, again.

First-off: 263 * 84 / 49 gives T-States/bit, 450 of them.

Which is great. But if I do the divide-by-49 in the delay loop [repeatedly adding 49 until it reaches 263*84] then I get 450 /loops/, each of which is 21 T-States.

So, no, I need to repeatedly add 49*21, which gives 21 loops. OK, no problem. And, that happens to make my math WAY easier when calculating how many loops to subtract due to intermediate T-States used for setup calculations, call/return, and toggling the Tx line... say the call/setup overhead is 190 T-States, then I just multiply 190*49 and start the loops from there, successively adding 21*49 to that, thereafter. WAY easier, I thought I'd have to divide 190/49!
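A quick Python check of that version, counting to 263*84 in steps of 49*21 and pre-loading the count with the overhead times 49. The name `loops_for_bit` is hypothetical and the 190 T-State overhead is just the example figure from above.

```python
LOOP_TSTATES = 21

def loops_for_bit(loops_measured, clocks_measured, k, overhead_tstates=0):
    """Return (delay loops to run, overshoot) for one UART bit."""
    target = loops_measured * k                  # 263 * 84 = 22092
    step = clocks_measured * LOOP_TSTATES        # 49 * 21  = 1029
    total = overhead_tstates * clocks_measured   # 190 * 49 = 9310
    n = 0
    while total < target:
        total += step
        n += 1
    return n, total - target

print(loops_for_bit(263, 49, 84))        # (22, 546)
print(loops_for_bit(263, 49, 84, 190))   # (13, 595)
```

With the overhead included that's 190 + 13*21 = 463 T-States against an ideal 450.857, and the difference (12.14) times 49 is the 595 overshoot. Seen that way, everything here is counted in units of 1/49th of a T-State.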

Still, I need an 8-bit times 16-bit multiplier, and I need it to take fewer than 450-190 T-States, and I need it to take a constant number of T-States, regardless of the input values. Presently, I think mine's around 400 T-States. Heh.

So, 21*49 is pretty doable; first-off, it only has to be done once, maybe even before beginning UART stuff at all. Secondly, 21 is constant, so it could be as simple as (A<<4) + (A<<2) + A.
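The shift-and-add identity is easy to verify in Python (a sketch only; `times21` is a made-up name):

```python
def times21(a):
    """a * 21 as 16a + 4a + a."""
    return (a << 4) + (a << 2) + a

print(times21(49))    # 1029
print(times21(190))   # 3990
```

One gotcha when writing this out in C or Python: `+` binds tighter than `<<`, so the parentheses around each shift are required to get 16a + 4a + a.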

The call/setup overhead is also constant, 190T, presently, which could be done similarly, but also similarly could be calculated once before even doing the first delay.

That leaves the intermediate T-States (which also have to be removed from the number of delay loops) used for setting up the data on the wire, etc. This varies... E.G. grabbing the byte from the buffer, after the UART Start-Bit, then merely shifting it between each bit-delay thereafter. So, when I was convinced there was going to be a division involved, I decided maybe to just allocate a fixed number of inbetween-call T-states, and expect the caller to pad with NOPs as necessary. Then, those'd just be grouped with the 190 overhead t-states.

Now, it seems, it's not a division, but a multiplication... hmmm... which can be much faster... hmm...

Another trick I thought of was to just require the caller to use an integer-multiple of 21T-states in the intermediary between delay calls. Though, now that number still needs to be multiplied by a variable [49, this time] in the overhead between each call. 

Oh, and, BTW, shift-left in z80 assembly only occurs one shift at a time... don't let C's "A<<4" operator fool you into believing that's a single shift-instruction, heh! In fact, since shifts don't exist for 16-bit registers, I read the trick is simply to add hl,hl to do a single left-shift. Which is kinda backstepping for me, having learned so many tricks (like shifting to multiply) that I'd almost forgot multiplication is repeated addition, heh!
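For illustration, here's the multiply-by-21 built from nothing but self-additions, the way `add hl,hl` would do it one doubling at a time. This is a Python model, not z80 code; the helper names are made up, and any carry out of 16 bits is simply ignored here.

```python
def add_hl_hl(hl):
    """Model `add hl,hl`: double a 16-bit value, returning (result, carry)."""
    doubled = hl + hl
    return doubled & 0xFFFF, doubled > 0xFFFF

def times21_by_doubling(a):
    """a * 21 using only 16-bit additions."""
    two_a, _ = add_hl_hl(a)
    four_a, _ = add_hl_hl(two_a)
    eight_a, _ = add_hl_hl(four_a)
    sixteen_a, _ = add_hl_hl(eight_a)
    return (sixteen_a + four_a + a) & 0xFFFF

print(times21_by_doubling(49))   # 1029
```

Four doublings plus two more adds, so "A<<4" alone is four instructions, not one.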

Anyhow, where was I?

Oh, so that big moment where I realized I could simply add 49 until it reached 263*84 (forgetting, of course, that I needed to multiply 49*21 first), I had completely forgotten how much overhead doing-so would add to the setup/call/intermediate T-State calculations... which is pretty stupid, considering that's /exactly/ what I'd been wracking my brain over for days prior, and even just 20min prior. Heh.

OK, BUT, this time I think I've got it... subtract, from 263*84, the intermediate T-States * 49. Intermediate T-states could be forced into some amount of conformity with NOPs, the math could be done in one delay_setup() call before any/all delays, and could even be /really/ slow.

Am I missing anything?


I did something like this on my AVR-based raster-scanned-LCD controller... a delay used to display "row-segments" of varying lengths, with a minimum length due to calculation overhead, but which could ultimately be delayed precisely down to a specific CPU clock cycle... HAH, imagine trying to do that with a machine whose shortest instruction is 4 CPU-Cycles, and most take far more... [BUT they're /not/ integer-multiples of 4, so I think it could be done!] Challenge begrudgingly not accepted. Though, it'll probably be hard to take off the stove.

Jitter, due to the 21T-State loop-length, should be more than acceptable, here, as long as it doesn't add-up... Which this system is designed to account for (subtract the last delay's overshoot from this delay).

So... do I have it?

Count to 263*84 in steps of 49*21

But, first, start the count at the last count's overshoot (conveniently already in whatever units 49[clks]*21[Tstates/loop] necessitate/mean)

Then add the overhead/intermediate T-States, times 49[clks], for some only vaguely-understood reason...

And only /Then/ start the counting-up, from there, to 263*84
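Those steps, strung together over several bits in a Python model to see whether the error really stays put. This is a sketch under the example numbers (263, 49, 84, and a fixed 190 T-State overhead per bit); `bit_delays` is a hypothetical name, and the real routine would of course be z80 assembly.

```python
LOOP_TSTATES = 21

def bit_delays(loops_measured, clocks_measured, k, overhead_tstates, bits):
    """Return the cumulative T-states elapsed at the end of each bit."""
    target = loops_measured * k                  # 263 * 84
    step = clocks_measured * LOOP_TSTATES        # 49 * 21
    overhead = overhead_tstates * clocks_measured
    carry = 0                                    # last bit's overshoot
    total_t = 0
    elapsed = []
    for _ in range(bits):
        count = carry + overhead                 # start from overshoot + overhead
        loops = 0
        while count < target:                    # the delay itself
            count += step
            loops += 1
        carry = count - target
        total_t += overhead_tstates + loops * LOOP_TSTATES
        elapsed.append(total_t)
    return elapsed

elapsed = bit_delays(263, 49, 84, 190, 10)
ideal = 263 * 84 / 49
for n, t in enumerate(elapsed, start=1):
    print(n, t, round(t - n * ideal, 2))         # error stays under one 21T loop
```

Each bit still lands up to one loop late, but the lateness doesn't pile up from bit to bit.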


There's a reason I'm doing all this, rather than merely delaying a specific number of T-States.

The error would add up/compound, otherwise.

The resolution would be quite small.

Floating point would be necessary.

Division is a /slow/ process. Especially if the algorithm has to handle any input values.

Division is /exactly/ what we /want/ to do, here, in its simplest form; keep subtracting until you can't anymore... that /is/ our delay.


Doing it with 49*21 per 21T-State loop allows for a higher resolution over the long-term...

If I'd've done the math in terms of /loops/, instead: well, 263*84=22092, /49=450.857

But, that's T-States, and we're not using floating-point, so now we've lost 0.8 T-states per delay. That, would be, if we counted each loop in steps of 21 (T-States) to 450.

Now, what if we counted in loops, instead?

450.857T/21[T/Loop]=21.469 loops...

Which'd be 21...

21 loops is 21*21T=441T, which is already more than 2% error, /per bit/. And, again, that error would compound.

That means, after 50 bits, or five bytes, we've managed to stuff in an extra bit. Somewhere that data is corrupt.

Say we did it with T-States, 450, rather'n 450.857, tiny error, right? But, 0.2% means we stuff an extra bit every 500bits, which is 50 bytes.
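Running those two cases through Python, as a back-of-envelope check (the helper name is made up, and "slip" here just means the accumulated error reaching one whole bit-time):

```python
def bits_until_slip(actual_tstates, ideal_tstates):
    """Bits that pass before the accumulated timing error equals one full bit."""
    error_per_bit = abs(ideal_tstates - actual_tstates)
    return ideal_tstates / error_per_bit

IDEAL = 263 * 84 / 49    # 450.857 T-states per bit

print(round(bits_until_slip(441, IDEAL)))   # whole loops only (21 * 21T): 46
print(round(bits_until_slip(450, IDEAL)))   # whole T-states: 526
```

So roughly 46 bits and 526 bits, in line with the 50 and 500 ballparks.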

This system doesn't deal with large bursts, as far as I'm aware... I think I heard 13 bytes, maximum, so counting T-states /should/ be fine... but, there's plenty more to consider. What error was introduced in my measurement of the clock? 263loops/49clks sounds pretty good, right? But what if it was 263.999loops/49clks?

That's a 0.4% error which combined with our 0.2 above starts to add up... potentially near 1% again.

What if it was 263/49.999? That's a 2% error, right there, combined with our earlier seemingly-negligible 0.2 could add up quite a bit. I honestly don't recall how to do that math, but every bit can add up.
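One way to bound it in Python, assuming the measurement error comes only from each count possibly being short by up to one (that assumption is mine, as is the helper name):

```python
def tstates_range(loops, clocks, k):
    """(low, nominal, high) T-states per bit if either count is short by up to 1."""
    nominal = loops * k / clocks
    low = loops * k / (clocks + 1)       # really almost one more clock
    high = (loops + 1) * k / clocks      # really almost one more loop
    return low, nominal, high

low, nominal, high = tstates_range(263, 49, 84)
print(round(low, 2), round(nominal, 2), round(high, 2))
print(f"{(low / nominal - 1) * 100:+.2f}%  {(high / nominal - 1) * 100:+.2f}%")   # -2.00%  +0.38%
```

Under that assumption the two errors pull in opposite directions: a missed fraction of a clock makes the computed delay too long, a missed fraction of a loop makes it too short, so the worst case is the 2% from the clock count rather than the sum of both.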

2% measurement error is quite a bit, could interfere with every 5 bytes. But there's really nothing I can do about that... but I can try to minimize the error's being compounded. And, who knows, maybe I'll figure out a way to increase the measurement accuracy. (The clocks come in bursts, of apparently 49, but I could maybe sample numerous such bursts and average...)

Alternatively, maybe, I can sample each bit twice, and if it's not measured the same then reset the delay overshoot to zero or something.

(I hadn't run those percentages, yet... 2%! Oof.)

/Transmitting/ isn't as much a problem, the transmit line of a UART /should/ be OK with an idle of any duration longer than a stop-bit. I could add two, or more. The receiver should be OK with that.

But, /receiving/... There's no guarantee the transmitter will add /any/ idle between bytes. Just a stop-bit... which could mean that 130 bits come through back-to-back.

Now, ideally, I could reset my delays with the edge of each start-bit... 2% should only cause a problem after 50 bits, but if I reset every 10 bits then it should be OK. Will there be enough time, though, in the 450.857 T-states, between the last bit in one byte and the start-bit in the next, to store the received byte AND do all that setup for the next?

Then there's the error introduced by the system watching for the start-bit edge... I dunno if I can keep that as low as 21T/loop, which means there's a 4us window where the start-bit may arrive... 4us in 96us/bit is well over 4% error... no biggy if everything's otherwise synced, that 4% doesn't compound, but still, could add up with the others.
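The window-versus-bit arithmetic, for reference, taking the CPU clock as the roughly 4.7MHz mentioned earlier (an approximate figure, so treat the output as ballpark):

```python
CPU_HZ = 4_700_000     # approximate CPU clock

def loop_window_us(tstates=21, cpu_hz=CPU_HZ):
    """Duration of one polling loop in microseconds."""
    return tstates / cpu_hz * 1e6

def bit_time_us(baud):
    """Duration of one UART bit in microseconds."""
    return 1e6 / baud

window = loop_window_us()
for baud in (10400, 9600):
    print(baud, round(window, 2), round(window / bit_time_us(baud) * 100, 1))
```

That's a bit under 4.5us per loop, which works out to something like 4.6% of a bit at 10.4k and a little less at 9600.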

So, ideally, I'll introduce as little extra error as possible.

And one way to do that is by counting my 21T-State loops in a different set of units, with a larger, more-precise, step-size, and counting up to a larger, more precise, value in whatever weird units I came up with. And keeping track of the overshoot, which may be as large as 4us (4% error!) for each bit, but in keeping track of that overshoot, subtracting it for the next bit.

I think it works.


*SIGH* WHY am I doing this, again? I already wrote a t-state delay routine that handles the jitter issue, etc.

Oh yeah... to calculate the number of T-States /to/ delay, I need division...


Because, I'm a bit stuck, again, trying to convert the intermediate T-States into these weird units.

16x8 multiplication, looks to be about 400T-States, which doesn't leave enough for the delay overhead. Right.

So, I've got to do some setup /before/ starting the delays, anyhow... but, my division, to calculate T-States... 263*84/49... nah, 16/8 (bits)... OK...

And the overall additive error...? Surely I can setup a new delay sequence with each start-bit... so the additive error would have to be nearly 10% per bit to really be a problem... right?


I guess the main thing was trying to do the /49 in realtime, rather'n precalculating. I dunno. I definitely don't feel like writing a division routine.

T-State-delayer, pre-calculated division, go!