
M1 wait-state? Revisited

A project log for Vintage Z80 palmtop compy hackery (TI-86)

It even has a keyboard!

Eric Hertz, 01/17/2022 at 20:53, 3 Comments

I've several past logs regarding the possibility the T6A43 Z80-alike makes use of a wait-state on M1 machine cycles.

These are my latest findings:

First, the setup:

I have a 41.667kb/s serial signal, pulse-width-modulated (a 1 is 16us high then 8us low; a 0 is 8us high then 16us low), and a 4800bps UART serial signal to compare it to.

41.6K is too fast to process in realtime, so I first take 1024 samples then post-process. A packet usually consists of 49 pwm-bits, and I generally measure around 272 samples containing a frame. I use the INIR instruction to grab 256 samples at a time, and pad with a nop between INIRs. Thus, each sample should be 21T-States (unless there are added wait states).

Yes, I disable LCD DMA and interrupts.

272samples/49pbits×21T/sample×41667pbits/sec gives 4.857Million T-States per second.
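The arithmetic above can be reproduced with a short Python sketch. It is only a sanity check of the figures quoted in this log; the constant and function names are made up for illustration, and the 21T default assumes a repeating INIR with no wait states.

```python
# Clock estimate from the PWM capture, using the figures quoted above.
SAMPLES_PER_FRAME = 272    # measured samples spanning one packet
PWM_BITS_PER_FRAME = 49    # pwm-bits in a packet
PWM_BIT_RATE = 41667       # pwm-bits per second (24us per bit)


def pwm_clock_estimate(t_per_sample=21):
    """Estimated T-states per second, given the T-states spent per sample."""
    samples_per_bit = SAMPLES_PER_FRAME / PWM_BITS_PER_FRAME
    return samples_per_bit * t_per_sample * PWM_BIT_RATE


print(f"{pwm_clock_estimate() / 1e6:.3f} million T-states/s")
```

Passing a different `t_per_sample` (e.g. 23) shows how the estimate shifts if each sample actually takes longer than assumed.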

Then, in the same program I "autobaud" by measuring the number of T-States during a "?" received by the UART at 4800b/s. The use of "?" for autobaud makes it easy to see a distinct change between the start bit and the first data bit, and the last data-bit and the stop bit.

I measure the number of loops between detection of the start of the first bit and detection of the end of the last bit. Each loop is 35T-States long, assuming no wait-states. Though, the loop looking for the first bit is 29T-States.

The counting loops look like:

loop:
    inc hl        ;  6T
    in a,(7)      ; 11T
    and b         ;  4T (rxMask)
    cp d          ;  4T (rxOne)
    jp z, loop    ; 10T

If M1 waits were implemented everywhere, this loop would be 40T-States instead of 35, and the INIRs would be 23T instead of 21.
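To make that T-state accounting explicit, here is a minimal Python sketch. It assumes the standard documented Z80 timings, one opcode fetch (M1) per unprefixed instruction, and two for the ED-prefixed INIR; the table layout and the `t_states` helper are hypothetical names used only for illustration.

```python
# (instruction, documented T-states, number of M1 opcode fetches)
EDGE_LOOP = [
    ("inc hl",     6, 1),
    ("in a,(7)",  11, 1),
    ("and b",      4, 1),
    ("cp d",       4, 1),
    ("jp z,loop", 10, 1),
]
INIR_REPEAT = [("inir", 21, 2)]  # ED-prefixed, so two opcode fetches


def t_states(instructions, m1_wait=0):
    """Total T-states when every M1 fetch is stretched by m1_wait T-states."""
    return sum(t + m1 * m1_wait for _, t, m1 in instructions)


for name, seq in (("edge loop", EDGE_LOOP), ("INIR", INIR_REPEAT)):
    print(name, t_states(seq), "T without waits,", t_states(seq, 1), "T with M1 waits")
```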

The autobaud measurement determines 993T/bit at 4800b/s, which amounts to 4.766Million T-States per second vs. the other measurement's 4.857. That's an error of about 2% which is enough to begin being concerned about UART timing.

So, I started looking into potential error sources and whether they could account for the difference. First and foremost, if an unaccounted-for M1 wait-state were the cause, accounting for it should bring those numbers *closer*; instead they veer even further apart: 5.45MHz (autobaud) vs 5.32MHz (pwm), a ratio of 102.4%, compared to 4.77MHz vs 4.86MHz, a ratio of 101.8%, without M1 waits. A minor difference, I suppose... I could've sworn it was further off than that.
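A rough Python sketch of that comparison follows. It reuses only numbers from this log (272 samples per 49 pwm-bits, 993T/bit measured with a 35T loop); the function names are illustrative, and the loop count is derived from the 993T figure rather than measured separately.

```python
PWM_RATE, UART_RATE = 41667, 4800
SAMPLES, PWM_BITS = 272, 49
T_PER_BIT_AT_35T = 993   # autobaud result, computed assuming 35T per loop


def pwm_mhz(t_per_sample):
    return SAMPLES / PWM_BITS * t_per_sample * PWM_RATE / 1e6


def uart_mhz(t_per_loop):
    # Rescale the measured T/bit to a different assumed loop length.
    return T_PER_BIT_AT_35T * t_per_loop / 35 * UART_RATE / 1e6


for label, t_sample, t_loop in (("no M1 wait", 21, 35), ("M1 wait", 23, 40)):
    p, u = pwm_mhz(t_sample), uart_mhz(t_loop)
    print(f"{label:10s} pwm {p:.3f} MHz, uart {u:.3f} MHz, pwm/uart {100 * p / u:.1f}%")
```

Note that the ratio lands on opposite sides of 100% in the two cases, which is worth keeping in mind when comparing the percentages.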

So then I tried to account for the error elsewhere. E.g., what if the edge of a UART bit occurred immediately after sampling? Then there'd be nearly 35T of error. And if the edge of the last bit occurred immediately before sampling, then there could be an additional nearly 35T of error.

But... at 993T/bit, or 7945T/8-bit frame, that measurement error is nowhere near enough to account for 2%.
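As a quick bound on that, here is a small Python sketch. It takes the pessimistic case where the two edge errors add up to two full loops; the variable names are illustrative only.

```python
T_PER_LOOP = 35
T_PER_FRAME = 7945          # 8 bits at roughly 993 T/bit

# Pessimistic case: nearly one full loop of error at each of the two edges.
worst_case_t = 2 * T_PER_LOOP
worst_case_pct = 100 * worst_case_t / T_PER_FRAME

print(f"worst-case edge error: {worst_case_t}T = {worst_case_pct:.2f}% of the frame")
```

Even this upper bound comes out under 1%, so edge quantization alone cannot cover a 2% gap.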

This is important, again, to my project because ultimately I won't have autobaud as an option and will have to generate my UART timing based on the 41.6kHz signal. I looked at it with a scope and measured darn-near exactly 41667Hz. So, it's plausible my 4800bps UART connected to my computer (via USB) is off by some UART-acceptable percentage. And/or it's possible M1-waits are a thing...

But I ran some numbers and found another plausibly-reasonable explanation...

Note, I went through all this because: If I autobaud accounting for M1-waits, and also account for M1-waits in my UART code, it sends/receives garbage.

The other plausible explanation I came up with for the 2% error between the 41.6k signal and the 4.8k signal is that I/O may have wait-states. If there is no M1 wait-state, but there is one wait-state for port reads, then the numbers drop from 2% error to 1%. If, plausibly, INIR has two wait-states (why?) then the measurements from the two sources align almost perfectly at 

... DAGNABBIT!

I did these calcs yesterday, and they did add-up. Why The Heck aren't they now?!

I thumb-typed this whole thing for no reason?!

...

GAAAAHHHHH!!!

This Is Really Frustrating.

I spent *weeks* making all the code switchable between M1-waits and no M1-waits, had to recount every friggin T-State in every friggin function *several* times, because at first I was *certain* M1-waits were the culprit, despite the fact autobaud and my UART code worked together without accounting for it. So, at first, I just deleted all the T-State counts that *didn't* include M1-WAITS... Then recalculated everything based on M1-waits, then it stopped working. So then I spent several days /again/ poring over all that code, again, to put Non-M1 counts and calcs *back*... Then switching between them was too much a pain, so I rewrote the code with .equs that allowed for quickly switching... And, STILL, it only worked with *not* accounting for M1-waits... I mean, I've looked that code over *Numerous* times.

Last night I *Finally* came up with an explanation... Got that error between autobaud and the 41.6k signal down to a fraction of a percent by a reasonable explanation (a wait-state on I/O transactions)... And now those numbers don't add up.

Heh.

Problem is: 2% error isn't a whole lot, but it can be additive in the case of UART bitbanging. At the end of 10 bits, it could be off by half a bit or more, which could be too much. But, more likely problematic is that that 2% will be added to the error in my measurements and in my bitbanging delays... Which I designed to be as error-free as possible, expecting maybe 2% worst-case. They might cancel out, or they might sum-up, now to 4%. Heh. I could just try it, as is, forgetting the error, and it may or may not work. It may work most of the time, but fail seemingly randomly.

THIS part *should* be pretty easy to "get right"... the pwm and autobaud sampling loops are *tiny* in comparison to the bitbanging loops. Surely I can account for a difference when dealing with 21T and 35T loops on the same source.

Maybe I need to run the sampling/post-processing on the same autobaud source and see how they align. SHEESH. This Just Keeps Dragging Out!

...

oh, but worse... Because the UART code does NOT work when counting M1-WAITS... Because, even though the error is about the same for the two methods, around 2%, regardless of M1... The error is *reversed*. In one case pwm-sampling measures the higher clock frequency, in the other case autobaud does. What The Heck.

Same source... I need to measure the same source. *Sigh*

...

Hours later, autobaud now runs both ways, from the same source ('?' at 4800bps); sample-first, then count samples, and count loops between highs and lows in realtime.

Dang-near /exact/ same results.

4.82-4.84 million T-States when sampling, 4.74-4.76 when watching for edges in realtime. 

I've checked the realtime edge-detection code several times since designing it, and again, now. It's a tiny bit confusing because the loops *start* with the increment, and exit after sampling. So, technically, if you think of each loop as a single unit, then the counting is off by one... so it's a bit confusing and could lead to an extra or missing count if not handled right... I can't count how many times (and ways) I've drawn a timing diagram to make sure... But, even still... even *Numerous* missing loop-counts don't account for 100kHz difference.

Is it plausible the RC oscillator varies /that/ much, consistently, (and so quickly!) based on the types of instructions executing?! Maybe due to voltage sagging with power-usage or something?!

This sh** be crazy!

...

So, if today's math is right, there's no confirmation either way regarding M1-waits. It's about the same percentage-error either way (just swapped).

And if today's experiment is right, that the clock frequency may vary somewhat dramatically depending on what instructions are executing, then even my /original/ autobaud/UART results (that the UART code produces/receives garbage when accounting for M1 waits) may be non-indicative, as well....

Again, the difference between the sampling and edge-detection methods is 21-23T states vs 35-40T... just a handful of instructions. But the UART bitbanging code is *huge* in comparison... About 200T-States to handle each bit, calculate the remaining delay, and then the delay loop itself. So, being mostly small instructions, an additional T-State on each results in a much greater difference in the end.

If this thing *does* use M1-waits, my accounting for them in the code may very well be overrun by the newly-discovered plausibility of inter-instruction clock-variance! 

Despite ALL those weeks of careful calculations, it may've been just LUCK that my two systems worked together based on a bunk assumption, and *don't* based on the reality of the system!

GGGGGAAAAAAAAHHHHHHHH!

Discussions

ziggurat29 wrote 01/20/2022 at 02:30

I suspect if you really wanted to empirically get to the bottom of the M1 wait state mystery, you could:
* disable LCD (i.e. stop the DMA from happening)
* DI
* HALT
this (I think) should put the processor in a perpetual M1 cycle.  Then you can use a 2-channel scope to monitor the oscillator and the /CS of wherever you're running code.  If it's 4 clock cycles, then there is no wait state.  If it's more, then the excess is the wait states.  You'll need to power cycle to get out of that loop.

It strikes me odd that they would have the implicit wait state committed in silicon this young but maybe the creators were planning on very cost-sensitive applications and using really slow rom.  It would be interesting to know if this only applies to things running on CS1, since we already know that bank is treated somewhat specially (e.g. the need to use CS3 to support writes to flash).


Eric Hertz wrote 01/20/2022 at 18:07

...yes, attaching an oscilloscope or logic analyzer might be a worthwhile endeavor. Currently that means investing in a 1,000ft extension-cord, or waiting for a sunny day and setting up equipment in a parking lot. Or spending a couple/few hundred bucks on portable/battery-powered versions of test equipment I already have, or investing in a quality true-sinewave inverter... Nevermind, of course, figuring out how to attach probes to the circuitry on the back and access buttons/screen on the front. Heh. Checking the 41.6k signal was comparatively easy ;)

While I'd like to know about the reality of the M1-situation, it's rather irrelevant if different instructions take different amounts of time due, e.g., to power-rail loading affecting the RC clock :/

If something like that is the case, then it's more a matter of measurement and calibration rather than T-State counting. Unless, I suppose, the RC variance could be accounted-for for each instruction. Hah! Still, though, even that might be rather difficult since other factors are at play, affecting power rails.

/CS0 (and /CS3) are for the ROM, but my code is executing from RAM (/CS1)... It would seem strange if they used M1-waits on one but not the other, having developed the /chip/, implying the ROM/RAM, being external, might change... Maybe one or two of the undocumented "port" bits enable/disable waits (and then, not necessarily limited to M1). heh.

...

Rereading this, though, it occurs to me maybe I couldn't get the math to work out the next day because I was considering No M1-waits while considering Port[7] waits. Did I originally do that math while accounting for M1-waits /AND/ additional Port7 waits? hmm...


Eric Hertz wrote 01/20/2022 at 20:54

Using Sampling: 384 samples for 8 bits at 4800bps.

Using Edge-Detection loops: 227 loops for 8 bits at 4800bps.

Without M1, that's 21T/sample or 4.84MHz and 35T/loop or 4.77MHz.

With M1, 23T/sample, 40T/loop, 5.30MHz and 5.45MHz.

With M1-wait AND an additional wait-state for port-reads, 24T/Sample and 41T/Loop, gives 5.53MHz and 5.58MHz. 
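Those three cases can be tabulated with a short Python sketch. The sample and loop counts are the ones measured above; the dictionary of hypotheses and the `mhz` helper are illustrative names, and the per-sample/per-loop T-state figures are the assumptions being compared, not confirmed timings.

```python
UART_RATE = 4800
BITS = 8
SAMPLES = 384   # INIR samples spanning 8 bits
LOOPS = 227     # edge-detection loops spanning 8 bits

# hypothesis -> (T per INIR sample, T per edge-detection loop)
HYPOTHESES = {
    "no waits":       (21, 35),
    "M1 wait":        (23, 40),
    "M1 + port wait": (24, 41),
}


def mhz(count, t_each):
    """Clock estimate in MHz from a count of fixed-length units over 8 bits."""
    return count * t_each * UART_RATE / BITS / 1e6


for name, (t_sample, t_loop) in HYPOTHESES.items():
    s, l = mhz(SAMPLES, t_sample), mhz(LOOPS, t_loop)
    print(f"{name:15s} sampling {s:.2f} MHz, loops {l:.2f} MHz, gap {100 * (s - l) / l:+.1f}%")
```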

Now I think we're well within the area where being off by a loop may account for the difference. In Fact, I pretty consistently measured 0x180=384 Samples, BUT got 0x3E1=993T/bit (*8/35=227loops) about half the time and 0x3DC=988T/bit (226loops) about half the time. Which makes some sense, the lower resolution of longer loops means the count may depend more on where within a bit it starts detecting.

Oddly, though, fewer loops (226 vs 227 used in my calcs) ... oh, no, that fits: 226*4800/8*35T makes the NON-M1-wait case even further off (35T->4.746MHz). But WITH M1: 5.424MHz (vs INIR at 5.30). Now also with a port-wait: 5.56MHz... And, again, that same case with INIR would be 5.53MHz. 

HMM...

Now, 8/10 runs gave 384 samples and 2 runs gave 383, and exactly 5/10 gave 227 loops while the other half gave 226. So, I'm pretty sure that means our 4800bps 8-bit frame is just shy of 384 times the length of an INIR... Lotsa speculation to happen here, but if that same frame is about half-way between two edge-detection loops in length, then the INIR results are more accurate...

I suppose I could speculate about the *actual* duration of a frame... OK, so, 80% * 384 + 20% * 383? 383.8?

No M1 (21T/sample):

8059T-States/frame, 4.836MHz

8059/35(T/loop) would give 230.3 loops (not 226-227)

M1:

8827T-states/frame

/40=220.7 loops (even further from 226/7)

M1+port-wait:

9211T

/41=224.7 loops

... And now we're entering territory where the position of the edge-detection sampling matters...
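The same speculation can be written as a small Python sketch: take the weighted 383.8-sample frame length, convert it to T-states under each hypothesis, and see how many edge-detection loops that would predict. The names here are illustrative, and the 383.8 figure is itself only the guess made above.

```python
MEAN_SAMPLES = 0.8 * 384 + 0.2 * 383   # about 383.8, weighted by how often each count appeared
OBSERVED_LOOPS = (226, 227)            # each seen in about half the runs

# hypothesis -> (T per INIR sample, T per edge-detection loop)
HYPOTHESES = {
    "no waits":       (21, 35),
    "M1 wait":        (23, 40),
    "M1 + port wait": (24, 41),
}


def predicted_loops(t_per_sample, t_per_loop):
    frame_t = MEAN_SAMPLES * t_per_sample   # frame length in T-states
    return frame_t / t_per_loop


for name, (t_s, t_l) in HYPOTHESES.items():
    print(f"{name:15s} {MEAN_SAMPLES * t_s:7.1f} T/frame -> {predicted_loops(t_s, t_l):6.1f} loops")
print("observed:", OBSERVED_LOOPS)
```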

For shits-n-giggles: since, e.g., the description of INIR *almost* makes it sound like it tries to do two operations at once (allowing the data bus to hold the port-read while addressing a port-write on the memory-bus (how?!)), maybe that bus-trickery wasn't really reliable in a system like this, so an extra T-state was added to read-in the port data, then latch it back out... 25T/INIR would be 9595T/8bits@4800bps

and at still 41T/loop that'd be 

... WTH... 234loops is nowhere near 226/7... But, dagnabbit, didn't I *just* do that same math the other way?!

... OK... so, two wait-states per port read (and M1-waits) in both cases, gives 228.45 loops... which is even closer than before...

Or I could get it almost spot-on if I could somehow account for 26T in INIR and 44T in my loop...

226.8 loops

INIR is 2T longer, using M1-waits, that's 23T, then 3 extra T... But my loop is 5 instructions, each with one M1, so 44T can't be accounted-for by even multi-T M1-waits alone... Nor by 1 M1-wait and Port waits.

But Wait!

An M1-wait is typically 1 /full/ T-State... right? but that's two clock edges. What if the M1-wait is half a clock? 22T for INIR, 37.5 for my loop. 383.8 samples at 22T is 225.2 loops at 37.5T.
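A hedged Python sketch of that half-clock idea: it treats the M1 wait as a free parameter, assuming two opcode fetches per repeating INIR and five per pass of my loop. The function names are made up for illustration, and a fractional wait state is pure speculation, as noted.

```python
MEAN_SAMPLES = 0.8 * 384 + 0.2 * 383   # about 383.8, the weighted average from above


def with_m1_wait(base_t, m1_fetches, wait_t):
    """T-states for a sequence when each M1 fetch is stretched by wait_t."""
    return base_t + m1_fetches * wait_t


def predicted_loops(wait_t):
    inir_t = with_m1_wait(21, 2, wait_t)   # repeating INIR: two opcode fetches
    loop_t = with_m1_wait(35, 5, wait_t)   # five single-fetch instructions
    return MEAN_SAMPLES * inir_t / loop_t


for wait_t in (0.0, 0.5, 1.0):
    print(f"M1 wait {wait_t:.1f}T -> {predicted_loops(wait_t):.1f} loops")
```

Sweeping `wait_t` makes it easy to see which wait length would land the prediction between the observed 226 and 227 loops.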

Huh.

If you haven't noticed, this is all WILD speculation.

Oh, and I haven't considered regular wait-states for regular memory-accesses... heh!
