08/23/2020 at 02:42 •
The log 120. TAP v.2 introduced the basic feature of the new circuit : the same counter is used for both phases (in and out) of the communication cycle. This means that it can't use a "normal" counter with its RESET pin and the circuit has been redesigned with partially asynchronous features. The result is this circuit :
But in practice, this creates some issues in simulation, both with VHDL and circuitjs. The culprit is the infamous double-NOR2 latch that creates oscillations at the start of the sim when both inputs are cleared.
With circuitjs the oscillation stops after "a while", and I have no idea why :
But with GHDL the sim stops before it begins because it can't reach a stable initial state.
And now look at this version:
This one has no inversion so no unstable state and no oscillation. This circuit is less favoured because the AND and OR are "2nd order gates" and each requires an additional inverter. But in CMOS there is an extra trick !
The gate OAN21 (or OA1 in Actel parlance) uses 8 transistors : 6 for the function itself and 2 for the final inverter. One inverter stage is saved thanks to the nested topology :
Meanwhile, each NOR2 (or NAND2) already takes 4 transistors, for a total of 8 transistors for the flip-flop as well, but with possible oscillations. The choice seems obvious in this case.
There is another difference though : the CLR signal's polarity is negated. Usually it is not a problem because the signal can be be negated in a way or another upstream. In the above example, I simply tied the XOR to the inverted output of the DFF but other solutions exist, such as a XNOR.
And there is also a more direct solution, with the OA21NB gate (or OA1B for Actel) where a little swap trick saves another inverter :
So the Gray6s unit can be updated.
This raises a big question though. R/S flip-flops are used in other places (such as the Selector) and this "bare metal" construct can't be analysed by my library tools because logical loops are explicitly considered as "zombies". They must be explicitly abstracted and this means I can't just replace the 2×NOR2 with one OA1B and call it a day.
Let's look at the A3P vocabulary : there are no RS FF but it contains these T-latch gates
"DLI0 ", "DLI1 ", "DLN0 ", "DLN1 ", "DLI0C0 ", "DLI1C0 ", "DLN0C0 ", "DLN1C0 ", "DLI0C1 ", "DLI1C1 ", "DLN0C1 ", "DLN1C1 ", "DLI0P0 ", "DLI1P0 ", "DLN0P0 ", "DLN1P0 ", "DLI0P1 ", "DLI1P1 ", "DLN0P1 ", "DLN1P1 ", "DLI0P1C1", "DLI1P1C1", "DLN0P1C1", "DLN1P1C1"
This looks confusing but there is some kind of logic in this madness. C means Clear and P means Preset, the following digit gives the active level.
- We can already forget the first line because these gates have only a data and clock input.
- Other gates with either C0/C1 or P0/P1 are more useful when a fixed datum is input and the extra pin forces the negative state. There are 16 cases and many degenerate ones...
- The last line with both C and P is interesting because it gives both active-1 Clear and Preset as well as an inverted output (if needed).
There are many ways to build a Set/Reset or Reset/Set gate from these macros but we're not there yet. So let's now enumerate the desirable cases and build a table of the required conditions !
Mapped to Set 0 0 S0R0 AO1A Set 0 1 S0R1 AO1C Set 1 0 S1R0 AO1, AON21 Set 1 1 S1R1 AO1B, DLI1P1C1, AON21B Reset 0 0 R0S0 OA1A Reset 0 1 R1S0 OA1C Reset 1 0 R0S1 OA1, OAN21 Reset 1 1 R1S1 OA1B, DLN1P1C1, OAN21B
The precedence is given by the gate that drives the output (AND for Reset, OR for Set). Conveniently the wsclib013 library also provides the reciprocal gates AON21 and AON21B so 4 combinations are directly available in ASIC (if the B or a2 input is fed from the output).
Other cases could be built from the inverting versions (AOIxx, OAIxx, exercise left to the needy user).
This is cool but this is still not the end of the story because I still can't analyse these constructs. One way is to reuse the existing sequential gates and coerce them into behaving as RS FF.
Looking at the A3P latches, let's focus on DLN1P1C1 and forget about the D and G inputs (let's pretend they're tied to 0).
CLR has precedence over PRE and this is equivalent to the newly defined R1S1 macro.
DLN0P1C1 and DLI0P1C1 are irrelevant at this level because this does only affect the polarity of the G input.
Then you have DLI1P1C1 which is the same, but with negated output : the functions of PRE and CLR are swapped so the renamed/equivalent macro gives precedence to PRE, equal to the S1R1 macro. Look at the truth table and see by yourself :
However, due to how the A3P fabric works, DLI1P1C1 is the exact same macro as DLN1P1C1 but the inputs of all the drains of Q are inverted, instead of changing the active level of the Q output. The gate is defined anyway and it's good to know it. But the fundamental problem is not solved.
So what is the real problem anyway ?
The #VHDL library for gate-level verification is meant to help with DFT and in particular ATPG (Automatic Test Pattern Generation). This is possible when the ATPG system can establish the role of all the gates and nets but so far, only boolean gates are covered. Sequential gates are still absent because there is no sufficient theoretical framework to handle that, and that's where the "real problem" arises. In the end, we want to know
- the logic depth between an input port and an output port
- if/how one net can be affected by the input ports and affect the output ports (is a given net observable and controllable).
So far, I consider treating the DFF gates like 2 sides of a "pseudo-port" such that the remaining logic is treated as a single layer. This is more or less accurate but sufficiently so to be useful and functioning. However the world of logical design is not always a dichotomy between boolean and sequential gates... It is more like a continuum but I have identified 4 major levels :
- pure boolean gates : the typical gates that don't hold any state.
- interlocked gates : like the R/S flip flop, some state is held BUT there is a strong boolean component.
- Transparent latch : Can be a pass-through. Or not. So the data can still be considered as boolean, but it may not be so. The latching control input is not a boolean signal however, a new type appears !
- The DFF can not pass through so it could be considered as a non-boolean gate. The clock is a special type of signal (just like the "enable" for conditional storing) but the optional Clear and Preset signals are pass-through and could be considered boolean !
This little panorama gives a rough idea of the complexity of the whole situation. Usually, digital design tools only care for cases 1 and 4 but with cross-clock-domain designs (such as the TAP), this is far from enough.
Going through the "sequential spectrum" allows one to find distinctive features that require special handling. Signals can be classed depending on the ((almost) direct) traversal of the gate (and holding a value is less significant).
- The input signal directly affects the output : it's boolean. This applies to the T-laches (G, D) and also the CLR and PRE inputs of other gates.
- The input signal does not affect the output at this clock cycle : it's purely sequential. CLK and D inputs of DFF apply and can be treated as output ports, while the Q output is treated as an input port.
So each gate can have various attributes and signal types, it's not a gate-only thing (for example DNF1 only has CLK, D and Q ports and is purely sequential, while DFN1P1C1 is mixed because PRE and CLR are pass-through).
The above dichotomy would work in many cases but does not help with the interlocked gates. For the RS flip-flops, the loopback net(s) create a timing paradox, an endless loop that should be avoided, right ?
These backwards nets need to be tested, of course, and can't be ignored. However their immediate effect should not break the design: they are detected and the associated gates are flagged as "zombies". For the DUT to work, these backwards nets must be labelled, which should not be automatic (you can't trust a computer to take the right decisions, do you ?)
So some nets must be explicitly flagged, but VHDL can't allow this (at least in a practical way because the backwards-going net might also be used downstream). The other solution is to flag the input port(s) of the gate(s) that receive(s) backwards data, to prevent the gate from being flagged as a "zombie" and warn the ATPG that some data storage (or maybe oscillation ?) is going on and it's not an error.
It could be handled by adding another "generic" property to each gate (off by default) as is already the case to prevent the test of certain "impossible input combinations" (the generic(exclude: std_logic_vector); that is already used with other gates).
Another more explicit system would be a "virtual time-travelling component" that would synthesise as a wire but handle the loopback in a graceful way, separating the "forward going" net from the "backwards" one. Depth checking would be disabled for the output net of this "loop" component (or its depth would be -1 ?)
So far the jury is still out, the idea is fresh and there could be more options. But soon, more meta-macro-gates will appear, that will help the design of the YGREC8 :-)
After a night sleeping on the subject, the appropriate solution seems to be the "time traveller" component, which should be discussed on the appropriate page :-) And the new RS FF macros are being integrated into the relevant TAP units.
08/22/2020 at 23:03 •
Let's now talk about the physical interface to the TAP :-)
As already mentioned, this TAP is designed to
- interface directly to a SPI master interface
- require as few pins as possible
- be as simple as possible so the host SW can do all the smart stuff
The typical application uses something like the Raspberry Pi and a half-duplex link with 3 signal wires :
- /WR input selects the data direction (0: write, 1 read from the core)
- Dat : multiplexed serial data in and out signal
- CLK input
The debug system could also access the /RESET pin but this should not be necessary.
/WR, CLK and Dat should have a weak pull-down resistor to prevent any spurious activity when the debugger is not connected.
However the state at these pins could be sampled by the Y8 during Power-On Reset (P.O.R.) to control the state when /RESET is released.
- /WR should not be high on its own because any unsuspecting device connected to the Dat pin could also be a driver and result in a "driver conflict". Thus /WR high would mean a debugger is actively connected and requesting to take over the pins.
- CLK is also usually low but the counter has "a condition" where the synchroniser oscillates during initialisation. Oscillations are removed when CLK is high at startup and it has no noticeable effect on the initialisation sequence.
- Dat should be weakly pulled to 0 because you never know what state would be output when /WR goes high, and a string of 0s is/will be a command for internal TAP reset.
The TAP interface could be implemented in a familiar 4-wires way, as a traditional SPI slave :
But since Din and Dout are not active simultaneously, a little tristate buffer circuit can save one pin on the TAP's side :
Ideally, with a microcontroller for example, the /WR signal is driven by a GPIO pin. However the Raspberry Pi has "some latency" (around one microsecond in some cases) which could be saved with another trick : use both SPI slave select pins, yet use only one CE pin. This also helps when dealing with the proper polarity, since /WR is active on both levels but Slave Select is usually active low.
Multiple TAPs could be accessed with the same port because the interface requires both /WR and CLK to work. If one of them, or both, is disabled (with a demultiplexer, 74HC238 or 239 as in the example below), commands can not be interpreted.
If /WR can go high for one TAP only, then the other TAPs will float their Dat pin, which implements the desired multiplexing. If CLK is not demultiplexed as in the above example, make sure to start all the new communications with a blank byte, which ensures that all the other TAPs will not interpret the data as valid commands.
08/17/2020 at 16:33 •
You might remember the Selector unit from the log 114. The TAP selector: 19 Flip-Flops ! Of course these gates were used and shared but that looks a bit excessive... Using a "prefix approach" lets me reuse the gates so one set of 5 DFF is enough to store the group select prefix and the unit-specific command.
Once the "Group Select Prefix" is received, it must be decoded. As with the first version, only codes 001 to 110 are decoded so there is an expansion rate of 2 : going from 3 input bits to 6 output bits. It would then make sense to latch the 3 encoded bits but there are other considerations : if only 2 or 3 units are implemented, it doesn't make a difference. For now I favour the approach where the decoded signal is latched, so it could be located close to, or inside the addressed unit.
But logically, the new Selector unit has inputs from the pins and the Gray6s counter, and outputs a latched "Group Select Prefix" vector with a desired width. I tried to test the circuit with Falstad's circuitjs and saw that I couldn't go far without having the whole Gray6s circuit where I could tap in some already existing signals. The test circuit is quite large (and I must use minified links now) but only because of the Gray6s unit :-)
I succeeded to turn the selected output ON but some questions remain, as seen as the red-encircled areas.
- On the left : the original circuit expected the clock to go low before /WR changes.
- On the right : can this circuit be reduced to a simpler Set/Reset circuit ? (There are some spurious signals that seem to prevent it)
Anyway it is looking smaller than the first Selector and the decoding gates can be easily reused by the addressed units :-)
Well the problem on the left is solved with a MUX2 / T-latch :-)
There is no need to connect the clock gate to the above Set/Reset latch.
However : this works ONLY if the clock level after and before the /WR change is the same. This still creates a spurious spike when
- CLK = 0 when /WR goes up and
- CLK = 1 when /WR goes down
But this should not happen, right ?
The new circuit is here :-)
This helps to define the modifications required for the Gray6s unit :
- Less4 is now NOR3 that counts /WR as input.
- nobyte and SelB3 can be done externally, in the Selector
- SelB3 needs OV(1)
The output latch can indeed be reduced to a Set/Reset cell, here made with a OA1 but a pair of NOR2 could work as well. The rationale for this choice :
- it takes much less silicon
- The spurious pulse comes after the first normal pulse due to an interaction with the clock. If the latch is already set, there is no harm to set it again one cycle later (it doesn't change the state).
The state change as soon as the 3rd clock pulse is received, which could create issues in the other units. I'm testing if/how I can activate it on the falling edge of CLK.
And here is the Selector with 5 decoded outputs:
I know I should have factored the decoder's inverters but it's only an illustration that will be refined.
The duplicated NOR3s with SelB3 and CLK should also get a special treatment to reduce the logic and wiring... So there it is !
The predecoders are also shared with the rest of the TAP to reduce efforts&complexity. These predecoders are also provided for the other bits of the shift register :-)
Speaking of predecoders : another signal that the other circuits will need is when the first byte is shifted in. I modified the existing circuit by adding 2 gates :
- "firstbyte" is much like nobyte, but instead of checking 3 Gray bits for '0' (which gives a granularity of 4 counts), the DFFa(2) signal is traded for the CLK input to remove the glitch further downstream (and the granularity is 8 counts).
- CMD_en goes high at the end of the 7th clock pulse, ready for being latched in the 8th pulse. It's a bit like FB (Full Byte) but restricted to the first byte only.
preFB is glitchy because it uses logic results from the counter, but it seems that gating the signal with /CLK solves the problem, the DFF for FB could even disappear one day.
At this point, the circuit looks complete and just waits to be implemented in VHDL.
The selector is now provided in the latest archive, starting with YGREC8_VHDL.20200821.tgz. The unit uses a Generic to enable/disable the implementation of individual selected outputs. By default, all the outputs are enabled :
entity Selector is generic( implemented : SLV6 := "111111" );
If you set one of the bits to '0' then the corresponding output will be tied to '0' and no latch is implemented.
This should save a few gates... but no such elimination is provided for the 8 predecoder outputs, the dangling outputs and gates will be detected/handled by the synthesiser.
08/16/2020 at 01:58 •
As I rebuild the tests for the latest archive YGREC8_VHDL.20200815.tgz, I find a weird behaviour : the last test of the TAP takes what feels like ages to complete.
The usual procedure, after you decompress the archive, is to run the script ./test_units.sh which will ensure everything is OK (tools, files, regression tests and so on). Self-tests are run, one after another, and I don't use a makefile because it tends to obfuscate things. If I wanted to use all the CPU I could parallelise some parts but for now, I don't run ALL the possible tests, often because they are pointless. There are various versions of the same unit and some are not relevant.
For the TAP, I have the "behavioural" version of the Gray6s counter, and the "tiles" version, both work as intended. The test.sh script works nicely, with behavioural completing in no time, while the Tiles takes 1/2s. But it appears that the Tiles version has some unexpected side-effects when used along with the MUX64 and I can't find why. test_readback takes more than 7s to complete ! After tweaking the order of the initialisation lines, that time is halved but the 3.5s is still too much for doing nothing (I tried to find where/how that time was wasted but it still eludes me). I want the whole script to run fast and smoothly, I'm not running an exhaustive test but thorough enough that one can be confident to explore the rest of the options with a working system.
For now the solution is to run the piso_order and test_readback programs with the behavioural counter. I suspect is could be an obscure VHDL thing with the "discrete" R/S latch of the resynch circuit but I can't be sure yet. Anyway the purpose is to run decently fast simulations and the Tiles version is not required at that point, as long as the behaviour matches with the higher-level description.
Apparently the gates library adds some significant overhead when the number of gates increases. The TAP simulates faster when the "simple" build is chosen, without the fancy stats stuff. However regeneration of the library would take almost as much time as running the system with its about 100 gates. A tiles-less version is chosen for the TAP's first tests, but other options are possible (commented out in the scripts). Behavioural simulation is fast enough anyway in the first checks. More detailed simulations can be performed later...
The archive is re-uploaded and the ./test_units.sh script completes in < 11s.
Oh, it just dawns on me that I could compile the "simple" gates library in a different directory, with the same name, and select the version by the directory path... This is now explained in Abnormal initialisation time and workaround. and the archive is updated with the new "dual style" system.
08/13/2020 at 22:16 •
The MUX64 is unchanged, the Gray7s circuit is redesigned, the Instruction Slice remains quite the same... The new general diagram shows that we're left with the redesign of the Execute module :
There is one gotcha though, compared to the behaviour of the previous versions :
If you read the MUX beyond the 64 bits, the Gray counter will loop and wrap around in the reverse direction, before going in the forward direction again, and so on...
I might have to reduce the size again because it seeeeeems I might not use the whole counter range. See below.
The timing diagrams have not changed much because the TAP is still driven by the same type of devices (microcontrollers or SBC with a hardwired byte-wide SPI master). So sending a single byte will look like this :
There is a required "little dance" on /WR and CLK after powerup to ensure that the internal state is determined:
Strobe CLK up-down (at least once, could be 8 times), then /WR up-down, and you can read and write as expected (just make sure the serial pin is not driven when /WR is up). Other registers will need to be initialised as well, as there is no real "reset" pin.
There is a significant change on the horizon though. I'd like to use a prefix (byte ?) to also simplify the Selector (it has disappeared on the main diagram). The first byte could be a "short command" to the FSM for example, but for longer messages it also steers the data to the right shift register. Some clock resynch is required for these sub-clocks to prevent a spurious rising edge but I think I found a solution tonight.
Furthermore I have split the first byte into the prefix and the command. This saves some gates because the prefix is latched first, then the command can reuse the same DFF gates. With 3 bits of prefix and 5 bits of command, only 5 bits need to be stored by the new "selector". Furthermore, since decoding logic is shared between the prefix and the command, some further gates could potentially be saved. Then, only 6 units can be selected because prefixes 000 and 111 are reserved (it can reuse the same decoding logic as the previous selector). To save further on gates, each unit will contain its own selector latch and logic.
I like it because that is one internal state fewer to init and/or keep in mind when programming. The command is the state and the system needs fewer cycles to get into a nominal functioning state (and it uses fewer gates). It is less resilient though and the commands must be carefully documented because there is no ASCII mnemonic.
The other benefit is with the timing : the serial interface generates the pulses that were hard to come up with the first version. The /WR pulse can be used to strobe transparent latches, for example.
As a result : most of the commands that were defined so far in 116. TAP summary & protocol will be either 1 or 3 bytes long...
08/10/2020 at 10:42 •
20200814 : back to 6 bits because it looks unlikely I'll use 16 byte-long messages. Look at this circuit :-)
With the new TAP v.2, I reconsider the detailed design of the whole circuit and merge the two counters into one. This means that I must remove the /RESET input of the DFFs, which in fact are not desired because basic ASIC gates don't have one anyway. I must also increase the size of the counter a bit and add a SAT output (plus some pre-decoded bits such as FB or NULL). With these enhancements, the same counter can drive both the MUX tree for Dout and the other decoders for Din.
The log 109. Gray counter explains all the details of the construction of a modular/cascaded Gray counter, check it out if you haven't seen already !
From there the first step is to expand the counter to 7 bits and add a saturation bit :
Then the DFF with RESET must be substituted with a DFF and a AND2 gate.
Let's start with the MSB : SAT and B6
The funny thing is that the AND and XOR gates can be understood as a half adder, because B6 is toggled every time OV is on, and SAT is enabled when both OV and DIR are on. This could help simplify a bit if a "half adder" gate is available in the ASIC PDK but H2 seems best merged with the following OR2.
The DFFs have no RESET input as expected. The SAT output could even drop the DFF but it would be ON during cycle 127 and not 128, thus reducing the usefulness of the whole circuit. The DFF delays the flag by one cycle and allows the use of the full 16 counts.
The middle module(s) have nothing specific to be said about...
I simply added the AND2 at the output of the DFF and removed the RESET pin.
Same for the the LSB : it's a simple adaptation.
.The modules are gathered in this link so they can be reused and adapted later for other eventual purposes.
I hope it will be useful to others ;-)
The whole counter is there : it looks like such a mess that I'm glad it's modular ;-)
(of course this is not the typical way to use it but it works anyway)
Now, writing it in VHDL is another story.
Stay tuned !
Oh I almost forgot ! The earlier Counter unit has a FB output that is needed in most of the other circuits. It turns out it's quite easy to generate but not as I originally thought : just AND the DIR and OV signals from the LSB module. The circuitjs diagram shows that the result of the AND is a bit glitchy on the the 4th (mod 8) cycle but the DFF resynchs the signal. The AND result is provided as a partially decoded signal preFB, in case it's needed in other places...
The trick is to ensure that the /WR toggles work as intended, there is a AND after the DFF, and there is no need to AND before because the OV and DIR signals are already ANDed anyway.
It is also useful to provide pre-decoded flags for when the byte count is low. I added the Less4 output signal that is a NOR3 of SAT, b6 and b5, such that it is 1 when the count is less than 4 bytes.
As the circuit has grown beyond the linking capacity of the site, I saved the description as Gray7s-fb-l4.cjs in the archive.
The whole thing is pretty large, now...
VHDL implementation was not difficult, thanks to the previous version and all the planning that is logged on these pages. It compiled (almost) right away and thanks to rigorous checks during the writing, only one small numbering mistake remained and was easily spotted.
Total count : 46 gates (incl. resynch, sat, FB), while the earlier Counter was 31 and Gray6 was 21 (and with special DFF with Reset). So the net gain is 6 gates but there are 16 DFF that have been replaced with a smaller version without RESET.
This new version will greatly ease the design of the other modules !
08/07/2020 at 22:09 •
I tried to run my new code through Synplify (in the Libero SOC suite) and got some interesting results.
I finally understand how to create and use external libraries, in particular the SLV lib worked right out of the box, after I searched for the right method. It's some of those dumb painful GUI clickodrome that looks nice during a presentation but is not possible to automate... Anyway, SLV_utils.vhdl was added smoothly.
I forgot an important "detail" about how Synplify wants its external entities : "old style"... So I had to adapt/modify a lot of lines. Nothing changed except the syntax. It's more verbose, you have to add a declaration for each block you use... But now it works.
I could check, verify and compare the behaviour of the synthesiser with various versions of one unit.
In particular I verified that the "balanced control tree" approach is beneficial compared to the dumb/usual approach. Log 25. MUX trees gets a graphical update :-)
Oh and I found how to manually place & lock gates, so here is one test with INC8 :-)
The system was not able to optimise this unit more so I guess I'm not far from a great design.
Now, I just have to find how to generate these coordinates with a program and send them to the tool...
All the modified and/or tested VHDL files have been re-integrated into the code tree with the following line :
-- SYNTHESIS OK
So it's easy to check/list all the final files with grep, and separate them from the simulation-only files :-)
I have gained more insight, refreshed my skills and proved that my method works.
08/04/2020 at 22:20 •
As I am near completion the design of the TAP system, I realise I have harder and harder timing problems to solve... And I could even save some gates !
The Counter has 32 gates. The Gray counter has 17. Both count on CLK's rising edge and mayyyybe... they could be merged ?
It's not difficult to adapt the Gray counter to provide the additional signals FB (Full Byte : just a NOR3) and SAT (an added DFF and a couple of gates), as well as individual decoded size output signals for 1 to 8 bytes (though so far only 2 and 4 are used). Overall, the Gray counter would expand to maybe 30 gates, which would overall save maybe 20 gates compared to the split design, and the DFF would not have a RESET input, which might be absent in ASIC gates libraries (and use more silicon)...
The harder part though is the transition between the 2 phases and the reset of the counters. I think I have an idea but it will force me to deconstruct the Gray counter into a more traditional logic+DFF system, because... it will become a classic digital sequential circuit, with the current state and the expected new value that may (or may not) be latched.
Some more thoughts gave this result :
This easily solves the question of using the counter with BOTH phases, at the cost of 1×DFF, 1×XNOR and 1×AND2 per counter DFF.
Here is how it shoul look with wavedrom :
This does not solve however the case where /WR is toggled up and down without CLK activity. The DIFF internal signal should be "sticky" and go back down when CLK has a rising edge...
There, it should work now :-) (ok it doesn't because of a race condition with the clock)
Each time /WR changes, the added DFF is RESET, and later set again by a positive edge on CLK. The question now is how to emulate this DFF with individual gates. The following circuit seems to work well and is adapted to ASIC implementation :
The two NOR2 use little surface on a die. The output inverter works as a buffer. There is no oscillation condition, as proved in the above trace : the SET has precedence over CLR, which avoids the race condition found with the initial idea using a DFF. The initial value of the latch is determined by toggling the /WR and CLK pins and a short initialisation sequence brings the circuit to a know state:
- bring /WR low
- pulse CLK once (at least, could be more) => first DFF state is known
- set /WR high => changes XOR => CLR the SR latch
- pulse CLK once (at least, could be more) => SET the SR latch
The output data can be ignored, shift 0s in to make NOPs (just in case). So this could be summed up as : shift a NUL byte in, then shift a dummy byte out.
Here is the new version :
Note : this works ideally when the CLK input is LOW when /WR changes. However : to create a rising edge, CLK must go down before going up again, this half-clock phase (when low) will be the "clear state". Ensure this period is long enough and that the CLK state is appropriate (check on the 'scope to be sure !!! I'm looking at you, Raspberry Pi...). This is not critical while shifting bits in, however it is a delicate thing to ensure when reading from the TAP (in particular, the first "volatile bit").
This is solved by changing the precedence of the RS latch to RESET/DIFF/WR, as in this updated circuit :
And the extended chronogram :
Now CLK can go low before or after /WR changes.
Just by changing where a wire is connected.
08/03/2020 at 17:52 •
The eXecute module of the TAP connects one domain with no RESET but 2 clocks, to another with one RESET and one clock. This makes it more complex than the others, as hinted by the end of the previous log The TAP's eXecute module.
- On the TAP side : CLK and /WR are two sources of clocking. CLK goes to the counter and the shift registers, /WR goes to the decode logic that takes the control at the end of each message.
CLK would not go very fast : 10MHz is reasonable (wires and other external effects would probably disturb the signal) and leaves 100ns between consecutive rising clock edges.
There MUST be a reasonable margin (100ns ?) between the last rising clock edge and the rising edge of /WR.
There is no RESET for several reasons :
- The TAP must be able to work while the rest is in /RESET
- Adding a TAP-specific RESET pin would increase the external footprint and wiring
- The TAP can control /RESET from the inside
- Routing another /RESET could burden the rest of the chip
- By design, the TAP will work with the proper init sequence. (JTAG can also work without /RESET pin)
- The core has a free-running clock, as well as a HW /RESET external signal.
The clock could be as slow or fast as one wants, or even weird...
The /RESET can also be overtaken by the TAP.
For the communication with the FSM, the signal goes through two DFF as shown below :
If the FSM clock is fast enough, the OR can be removed but... you're never too sure ! For example going from RESET to START triggers the reload of the instruction memory, which can take 4K cycles at least.
The first DFF triggers on /WR going up, which is the necessary condition to detect the end of the message, or else the "valid" address could be trigered by enough random data flowing through the shift register. The asynchronous RESET allows the crossing of clock domains, and the clearing always trails the setting by at least one FSM clock cycle, as delayed by the next DFF.
The DFF on the right also re-synchronises the input data so it is valid at the start of each FSM clock cycle. Otherwise the data could arrive late in the cycle and create race conditions and invalid boolean calculations.
- On the TAP side : CLK and /WR are two sources of clocking. CLK goes to the counter and the shift registers, /WR goes to the decode logic that takes the control at the end of each message.
08/01/2020 at 12:43 •
The previous modules are quite simple, easy, self-contained, while the X command (described earlier) subtly touches more things at once.
Talking to the Instruction slice is not very hard, but requires some decoding first, and some of it would be best shared with the Selector. The "addresses" 'S' and 'X' are very close and and this would save some gates.
I think it's the perfect time to talk about how I mapped the S-decoder to gates :-)
It started easy enough for the 'S' condition :
valid <= '1' when SRi( 7 downto 0)="01010011" -- signature and SRi(15 downto 11)="00110" -- command : MSB select ASCII chars '0'-'7' and SAT='0' and W='0' and J1='1' and J0='0' -- else '0';
Then another simple step is to sort the '0' and '1' to put them in two separate equations, one with AND for the '1's and the '0's are gathered with a big NOR :
norx <= not (SAT or W or J0 or SRi(15) or SRi(14) or SRi(11) or SRi(7) or SRi(5) or SRi(3) or SRi(2)); valid <= SRi(13) and SRi(12) and SRi(6) and SRi(4) and SRi(1) and SRi(0) and J1 and norx;
Then it's easy to group the ORs and ANDs together into 3-inputs gates. And when there are not enough inputs for the AND gates, they can be used to input the result of the NORs :-)
Finally, bubble-pushing can transform two consecutive ANDs into a NAND followed by a NOR.
So let's do this all over again but this time the X condition is also decoded so some gates are common.
S <= '1' when SRi(7 downto 0)="01010011" and SRi(15 downto 11)="00110" and SAT='0' and W='0' and J1='1' and J0='0' else '0'; X <= '1' when SRi(7 downto 0)="01011000" and SAT='0' and W='0' and J2='1' and J1='0' else '0';
The common terms are
COM <= '1' when SRi(7 downto 4)="0101" and SRi(2)='0' and SAT='0' and W='0' else '0';
and S and X can be taken separatey :
S <= '1' when SRi(3)='0' and SRi(1 downto 0)="11" and SRi(15 downto 11)="00110" and J1='1' and J0='0' else '0'; X <= '1' when SRi(3)='1' and SRi(1 downto 0)="00" and J2='1' and J1='0' else '0';
Now these 3 can be checked in parallel, let's separate their bits according to their value.
X <= SRi(3) and J2 and not ( SRi(1) or SRi(0) or J1); -- nice fit for this one ! S <= SRi(1) and SRi(0) and SRi(13) and SRi(12) and J1 and not (SRi(3) or J0 or SRi(15) or SRi(14) or SRi(11)); COM <= SRi(6) and SRi(4) and not (SAT or W or SRi(2) or SRi(7) or SRi(5));
From there the gates are easy to cluster and bubble-push.
The result is 11 gates, the speed is not striking but 4 or 5 gates of latency shouldn't be limiting for this slow circuit and it is only 2 more gates than the previous circuit.
sa: entity OR3 port map(A=>SRi(15), B=>SRi(14), C=>SRi(11), Y=>tSo ); sb: entity NOR3 port map(A=>J1 , B=>SRi( 3), C=>tSo , Y=>tSn ); sc: entity AND3 port map(A=>SRi(13), B=>SRi(12), C=>tSn , Y=>S2 ); sd: entity AND3 port map(A=>SRi( 1), B=>SRi( 0), C=>J0 , Y=>S1 ); c1: entity OR3 port map(A=>SRi(2) , B=>SRi( 7), C=>SRi( 5), Y=>Co1 ); c2: entity NOR3 port map(A=>SAT , B=>W , C=>Co1 , Y=>Co2 ); co: entity AND3 port map(A=>SRi(6) , B=>SRi( 4), C=>Co2 , Y=>COM ); x1: entity NOR3 port map(A=>SRi(1) , B=>SRi( 0), C=>J2 , Y=>tXo ); x2: entity AND3 port map(A=>tXo , B=>SRi( 3), C=>J1 , Y=>tX ); vx: entity AND3 port map(A=>tX , B=>COM , C=>FB , Y=>X ); vld: entity AND3 port map(A=>S1 , B=>COM , C=>S2 , Y=>valid);
(one thing I dislike about VHDL is the requirement to label ALL the instantiated entities, it really gets nasty fast).
- Now, the Selector decodes the execute address with only 2 gates of overhead.
- The clock to the slice is only gated by /WR, already done by the Selector.
- The data to the slice shift register comes directly from the Selector as well (the MSB of the Command bus)
But the slice requires more than these signals and the FSM is an even tougher beast... Let's just focus on the control of the slice :
- Imux : the source of the instruction is selected by the current command (STEPX, NOPX ?) which requires some decoding.
- TrapEn : only active with the START instruction with Trap flag active. This will need extra care to prevent weird conditions when more units are added !!! For now it's only gated by /WR.
- MaskEn : only active with the WrMask command, gated by /WR. as well (beware of timing and levels).
- And there are the START/STEP/STOP/RESET signals to send to the FSM...
But this is getting hard because these signals cross clock domain boundaries.
On top of that, I can't use the other timing tricks because one can't ensure the length of the binary stream (since it's always shifted, no gating) and I can't use a latch because /WR must be the clock => a transparent latch would output a strobe/signal before the end of the message, which could be longer and be an invalid/spurious signal...
For most of the signals, I have chosen to use a DFF, some are simple (output the result of the decoding logic), others (the FSM strobes) are "set" by the decoding logic, then the FSM itself will send an ACK/Clear signal that will asynchronously reset the signal. The DFF's output will then be resynchronised by another DFF inside the FSM.
And then, you need to reset these DFF because they are in unknown state on power-up : one of the Selector addresses could be used for this.
But two other signals create really big timing problems.
- Mask latch : this is a strobe, needs a DFF, but the DFF will keep the value even after the /WR strobe is back to 0. A AND must be added to the output, or maybe /WR could be tied to a /RESET input ? (the timing and logic would be very dirty and unreliabe/unportable)
- the Write Instr Mem command depends on an externally clocked SRAM array so a handshake is required too...
continued in The TAP crosses 3 clock domains !with some diagrams...