
YGREC8

A byte-wide stripped-down version of the YGREC16 architecture

YGREC can stand for many things, such as "YG's Relay Electric Computer", "Yann's Germanium and Relay Equipped Computers" or "YG's Ridiculous Electronic Contraption". You decide !

#YGREC16 is getting pretty large and moving away from the original #AMBAP inspiration, making it less likely to be implemented within my lifetime. So here is a "back to minimalism" version with
* 256 bytes of Data RAM (plus parity ?)
* 8 registers, 8 bits each (including PC)
* fewer relays/gates than the YGREC16
This core is so simple that I focus now on other issues, such as the debug/test access port, the register set's structure, I/O, power reduction...
Like the others, it's suitable for implementation with relays, transistors, SSI TTL, FPGA, ASIC, you name it (as long as it uses boolean logic)!

After the explorations with #YGREC-РЭС15-bis, I reached several limits and I decided to scale it down as much as possible. And this one will be implemented both with relays and VHDL, since the YGREC8 is a great replacement for Microchip's PICs.

A significant reduction of the register set's size is required so I/O must be managed differently, through specific instructions. The register map is now:

  • D1  <= for NOP
  • A1
  • D2
  • A2
  • R1
  • R2
  • R3
  • PC  <= for INV

The instruction word is shrunk down to 16 bits. It is still reminiscent of the YGREC16 older brother but I had to make clear cuts... The YGREC8 is a 1R1W machine (like x86) instead of the RISCy YGREC16, to remove one field. Speed should be decent, with a pretty short critical datapath, and all the instructions execute in one clock cycle (except the LDCx instructions and computed writes to PC).

The fields have evolved with time (I have tried various locations and sizes). For example:

20171116: The latest evolution of the instruction format has added a 9-bit immediate address field for the I/O instructions.
20180112: Imm9 is now removed again...
20181024: changed the names of some fields
20181101: modified the conditions to change Imm3 into Imm4
20180112: Imm9 back again !

There are 18 useful opcodes (plus INV, and the pseudo-opcodes HLT and NOP), and most share two instruction forms : either an IMM8 field, or a source & condition field. The source field can be a register or a short immediate field (4 bits only but essential for conditional short jumps or increments/decrements).

The main opcode field has 4 bits and the following values:

Logic group :

  • OR
  • XOR
  • AND
  • ANDN

Arithmetic group:

  • CMPU
  • CMPS
  • SUB
  • ADD

Beware : There is no point in adding 0, so ADD with a short immediate (Imm4) skips the value 0 and the range is now from -8 to -1 and +1 to +8. (see 17. Basic assembly programming idioms)
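A minimal sketch of how such a zero-skipping immediate could be decoded. The actual bit encoding is not given on this page (log 17 is the reference), so the mapping below (a two's-complement nibble whose non-negative values are shifted up by one) is only an assumption, and `decode_add_imm4` is a hypothetical name.

```python
def decode_add_imm4(nibble: int) -> int:
    """Map a 4-bit field to -8..-1, +1..+8 (assumed encoding, not the official one)."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("Imm4 must fit in 4 bits")
    # Interpret the nibble as two's complement: -8..+7
    signed = nibble - 16 if nibble & 0x8 else nibble
    # Skip 0: the non-negative half 0..7 becomes +1..+8
    return signed + 1 if signed >= 0 else signed


values = sorted(decode_add_imm4(n) for n in range(16))
print(values)
```

Whatever the real encoding is, the point stays the same: 16 codes cover 16 useful values because the useless 0 is never encoded.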

Shift group (optional)

  • SH/SA : direction is sign of shift, I/R (bit 9) is Logic/Arithmetic flag.
  • RO/RC : direction is sign of shift, I/R (bit 9) allows carry to be rotated.

Control group:

The COND field has 3 bits (for Imm4) or 4 bits, more than YGREC16, so we can add more direct binary input signals. CALL is moved to the opcodes so one more code is available. All conditions can be negated so we have :

  • Always
  • Z (Zero, all bits cleared)
  • C (Carry)
  • S (Sign, MSB)
  • B0, B1, B2, B3 (for register-register form, we can select 4 bits to test from user-defined sources)

Instruction code 0000h should map to NOP, and the NEVER condition, hence ALWAYS is coded as 1.

Instruction code FFFFh should map to INV, which traps or reboots the CPU (through the overlay mechanism): condition is implicitly ALWAYS because it's an IMM8 format.

Overall, it's still orthogonal and very simple to decode, despite the added complexity of dealing with 1R1W code.


This project is more than an ISA or one implementation : the goal is to become a platform. See log 82. Project organisation

Logs:
1. Honey, I forgot the MOV
2. Small progress
3. Breakpoints !
4. The YGREC debug system
5. YGREC in VHDL, ALU redesign
6. ALU in VHDL, day 2
7. Programming the YGREC8
8. And a shifter, and a register set...
9. I/O registers
10. Timer(s)
11. Structure update
12. Instruction cycle counter
13. First synthesis
14. Coloration syntaxique pour Nano (syntax highlighting for Nano)
15. Assembly language and syntax
16. Inspect and control the core
17. Basic assembly programming idioms
18. Constant tables in program space
19. Trap/Interrupt vector table
20. Automated upload of overlays into program memory
21. Making room for another instruction
22. Opcode map
23. Sequencing the core
24. Synchronous Serial Debugging
25. MUX trees
26. Flags, PC and IO ports
27. Binary translation (updated)
28. Even better register set
29. A better relay-based MUX64
30. Register set again
31. Rename that...


x-compressed-tar - 278.49 kB - 08/13/2020 at 03:24


YGREC8_VHDL.20200812.tgz

MUX64 & Gray7s behav.

x-compressed-tar - 278.29 kB - 08/12/2020 at 15:24


YGREC8_VHDL.20200811.tgz

TAP reboot, "SYNTHESIS OK" on some files

x-compressed-tar - 259.30 kB - 08/11/2020 at 06:26


YGREC8_VHDL.20200801.tgz

TAP/Slice in the works...

x-compressed-tar - 358.84 kB - 08/02/2020 at 02:00


YGREC8_VHDL.20200730.tgz

Instruction debug slice added.

x-compressed-tar - 334.00 kB - 07/30/2020 at 21:30


View all 48 files

  • TAP v.2 : where it's going

    Yann Guidon / YGDES - a day ago

    The MUX64 is unchanged, the Gray7s circuit is redesigned, the Instruction Slice remains quite the same... The new general diagram shows that we're left with the redesign of the Execute module :

    There is one gotcha though, compared to the behaviour of the previous versions :

    If you read the MUX beyond the 64 bits, the Gray counter will loop and wrap around in the reverse direction, before going in the forward direction again, and so on...

    I might have to reduce the size again because it seeeeeems I might not use the whole counter range. See below.


    The timing diagrams have not changed much because the TAP is still driven by the same type of devices (microcontrollers or SBC with a hardwired byte-wide SPI master). So sending a single byte will look like this :

    There is a required "little dance" on /WR and CLK after powerup to ensure that the internal state is determined:

    Strobe CLK up-down (at least once, could be 8 times), then /WR up-down, and you can read and write as expected (just make sure the serial pin is not driven when /WR is up). Other registers will need to be initialised as well, as there is no real "reset" pin.


    There is a significant change on the horizon though. I'd like to use a prefix (byte ?) to also simplify the Selector (it has disappeared on the main diagram). The first byte could be a "short command" to the FSM for example, but for longer messages it also steers the data to the right shift register. Some clock resynch is required for these sub-clocks to prevent a spurious rising edge but I think I found a solution tonight.

    Furthermore I have split the first byte into the prefix and the command. This saves some gates because the prefix is latched first, then the command can reuse the same DFF gates. With 3 bits of prefix and 5 bits of command, only 5 bits need to be stored by the new "selector". Furthermore, since decoding logic is shared between the prefix and the command, some further gates could potentially be saved. Then, only 6 units can be selected because prefixes 000 and 111 are reserved (it can reuse the same decoding logic as the previous selector). To save further on gates, each unit will contain its own selector latch and logic.
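Here is a small sketch of that split, as I read it. It assumes the prefix sits in the 3 most significant bits (it is latched first and the bytes are sent MSB first) and the command in the 5 remaining bits; the function names are made up for the illustration.

```python
RESERVED_PREFIXES = {0b000, 0b111}  # reserved according to the log


def split_first_byte(byte: int) -> tuple:
    """Return (prefix, command), assuming prefix = bits 7..5 and command = bits 4..0."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    prefix = byte >> 5
    command = byte & 0x1F
    return prefix, command


def selected_unit(byte: int):
    """Return the prefix that selects a unit, or None when the prefix is reserved."""
    prefix, _ = split_first_byte(byte)
    return None if prefix in RESERVED_PREFIXES else prefix


usable = sorted({selected_unit(b) for b in range(256)} - {None})
print(usable)
```

The sweep over all 256 byte values leaves 6 usable prefixes, which matches the "only 6 units" count.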

    I like it because that is one internal state fewer to init and/or keep in mind when programming. The command is the state and the system needs fewer cycles to get into a nominal functioning state (and it uses fewer gates). It is less resilient though and the commands must be carefully documented because there is no ASCII mnemonic.

    The other benefit is with the timing : the serial interface generates the pulses that were hard to come up with in the first version. The /WR pulse can be used to strobe transparent latches, for example.

    As a result : most of the commands that were defined so far in 116. TAP summary & protocol will be either 1 or 3 bytes long...

  • Updated Gray Counter

    Yann Guidon / YGDES - 5 days ago

    With the new TAP v.2, I reconsider the detailed design of the whole circuit and merge the two counters into one. This means that I must remove the /RESET input of the DFFs, which in fact are not desired because basic ASIC gates don't have one anyway. I must also increase the size of the counter a bit and add a SAT output (plus some pre-decoded bits such as FB or NULL). With these enhancements, the same counter can drive both the MUX tree for Dout and the other decoders for Din.

    The log 109. Gray counter explains all the details of the construction of a modular/cascaded Gray counter, check it out if you haven't seen already !

    From there the first step is to expand the counter to 7 bits and add a saturation bit :

    Then the DFF with RESET must be substituted with a DFF and an AND2 gate.

    Let's start with the MSB : SAT and B6

    The funny thing is that the AND and XOR gates can be understood as a half adder, because B6 is toggled every time OV is on, and SAT is enabled when both OV and DIR are on. This could help simplify a bit if a "half adder" gate is available in the ASIC PDK but H2 seems best merged with the following OR2.

    The DFFs have no RESET input as expected. The SAT output could even drop the DFF but it would be ON during cycle 127 and not 128, thus reducing the usefulness of the whole circuit. The DFF delays the flag by one cycle and allows the use of the full 16 bytes (128 counts).

    The middle module(s) need no specific comment...

    I simply added the AND2 at the output of the DFF and removed the RESET pin.

    Same for the LSB : it's a simple adaptation.

    The modules are gathered in this link so they can be reused and adapted later for other possible purposes.

    I hope it will be useful to others ;-)

    The whole counter is there : it looks like such a mess that I'm glad it's modular ;-)

    And it works nicely when driven by the circuit described in 120. TAP v.2 :

    (of course this is not the typical way to use it but it works anyway)

    Now, writing it in VHDL is another story.

    Stay tuned !


    Oh I almost forgot ! The earlier Counter unit has an FB output that is needed in most of the other circuits. It turns out it's quite easy to generate but not as I originally thought : just AND the DIR and OV signals from the LSB module. The circuitjs diagram shows that the result of the AND is a bit glitchy on the 4th (mod 8) cycle but the DFF resynchs the signal. The AND result is provided as a partially decoded signal preFB, in case it's needed in other places...

    The trick is to ensure that the /WR toggles work as intended : there is an AND after the DFF, and there is no need to AND before it because the OV and DIR signals are already ANDed anyway.

    It is also useful to provide pre-decoded flags for when the byte count is low. I added the Less4 output signal that is a NOR3 of SAT, b6 and b5, such that it is 1 when the count is less than 4 bytes.
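The Less4 claim is easy to check with a few lines of code. This sketch assumes the standard reflected binary Gray code (`n ^ (n >> 1)`) for the 7 counter bits; the cascaded counter of this project may differ in its details, so take it as a plausibility check and not as a model of the real circuit.

```python
def gray7(n: int) -> int:
    """Standard reflected binary Gray code of a 7-bit count (assumed coding)."""
    return (n ^ (n >> 1)) & 0x7F


def less4(sat: int, gray: int) -> int:
    """NOR3 of SAT, b6 and b5, as described in the log."""
    b6 = (gray >> 6) & 1
    b5 = (gray >> 5) & 1
    return int(not (sat or b6 or b5))


# With SAT low, Less4 should be high exactly for the first 32 bits (4 bytes).
mismatches = [n for n in range(128) if less4(0, gray7(n)) != int(n < 32)]
print(mismatches)
```

It works because the two upper Gray bits are both 0 only during the first quarter of the sequence, exactly like the two upper bits of a plain binary count.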

    As the circuit has grown beyond the linking capacity of the site, I saved the description as Gray7s-fb-l4.cjs in the archive.

    The whole thing is pretty large, now...


    VHDL implementation was not difficult, thanks to the previous version and all the planning that is logged on these pages. It compiled (almost) right away and thanks to rigorous checks during the writing, only one small numbering mistake remained and was easily spotted.

    Total count : 46 gates (incl. resynch, sat, FB), while the earlier Counter was 31 and Gray6 was 21 (and with special DFF with Reset). So the net gain is 6 gates but there are 16 DFF that have been replaced with a smaller version without RESET.

    This new version will greatly ease the design of the other modules !

  • Synthesis checks

    Yann Guidon / YGDES - 08/07/2020 at 22:09

    I tried to run my new code through Synplify (in the Libero SOC suite) and got some interesting results.

    First :

    I finally understand how to create and use external libraries, in particular the SLV lib worked right out of the box, after I searched for the right method. It's one of those dumb painful GUI clickodromes that look nice during a presentation but cannot be automated... Anyway, SLV_utils.vhdl was added smoothly.

    Second :

    I forgot an important "detail" about how Synplify wants its external entities : "old style"... So I had to adapt/modify a lot of lines. Nothing changed except the syntax. It's more verbose, you have to add a declaration for each block you use... But now it works.

    Third :

    I could check, verify and compare the behaviour of the synthesiser with various versions of one unit.

    In particular I verified that the "balanced control tree" approach is beneficial compared to the dumb/usual approach. Log 25. MUX trees gets a graphical update :-)

    Oh and I found how to manually place & lock gates, so here is one test with INC8 :-)

    The system was not able to optimise this unit more so I guess I'm not far from a great design.

    Now, I just have to find how to generate these coordinates with a program and send them to the tool...

    Finally :

    All the modified and/or tested VHDL files have been re-integrated into the code tree with the following line :

    -- SYNTHESIS OK

    So it's easy to check/list all the final files with grep, and separate them from the simulation-only files :-)

    I have gained more insight, refreshed my skills and proved that my method works.



  • TAP v.2

    Yann Guidon / YGDES - 08/04/2020 at 22:20

    As I near completion of the design of the TAP system, I realise I have harder and harder timing problems to solve... And I could even save some gates !

    The Counter has 32 gates. The Gray counter has 17. Both count on CLK's rising edge and mayyyybe... they could be merged ?

    It's not difficult to adapt the Gray counter to provide the additional signals FB (Full Byte : just a NOR3) and SAT (an added DFF and a couple of gates), as well as individual decoded size output signals for 1 to 8 bytes (though so far only 2 and 4 are used). Overall, the Gray counter would expand to maybe 30 gates, which would overall save maybe 20 gates compared to the split design, and the DFF would not have a RESET input, which might be absent in ASIC gates libraries (and use more silicon)...

    The harder part though is the transition between the 2 phases and the reset of the counters. I think I have an idea but it will force me to deconstruct the Gray counter into a more traditional logic+DFF system, because... it will become a classic digital sequential circuit, with the current state and the expected new value that may (or may not) be latched.


    20200808 :

    Some more thoughts gave this result :

    This easily solves the question of using the counter with BOTH phases, at the cost of 1×DFF, 1×XNOR and 1×AND2 per counter DFF.

    Here is how it should look with wavedrom :

    This does not solve however the case where /WR is toggled up and down without CLK activity. The DIFF internal signal should be "sticky" and go back down when CLK has a rising edge...

    There, it should work now :-) (ok it doesn't because of a race condition with the clock)

    Each time /WR changes, the added DFF is RESET, and later set again by a positive edge on CLK. The question now is how to emulate this DFF with individual gates. The following circuit seems to work well and is adapted to ASIC implementation :

    The two NOR2 use little surface on a die. The output inverter works as a buffer. There is no oscillation condition, as proved in the above trace : the SET has precedence over CLR, which avoids the race condition found with the initial idea using a DFF. The initial value of the latch is determined by toggling the /WR and CLK pins and a short initialisation sequence brings the circuit to a known state:

    • bring /WR low
    • pulse CLK once (at least, could be more) => first DFF state is known
    • set /WR high => changes XOR => CLR the SR latch
    • pulse CLK once (at least, could be more) => SET the SR latch

    The output data can be ignored, shift 0s in to make NOPs (just in case). So this could be summed up as : shift a NUL byte in, then shift a dummy byte out.
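The init sequence above can be sketched from the host's side. The `RecordingPins` class is only a stand-in that logs what would be driven (it is not a real GPIO or SPI API), and CLK is assumed to idle low as in SPI mode 0.

```python
class RecordingPins:
    """Stand-in for real pin access: it only records what would be driven."""

    def __init__(self):
        self.log = []

    def set_wr(self, level: int):
        self.log.append(("WR", level))

    def set_din(self, level: int):
        self.log.append(("DIN", level))

    def pulse_clk(self):
        self.log.append(("CLK", 1))
        self.log.append(("CLK", 0))


def tap_init(pins, pulses: int = 8):
    """Shift a NUL byte in, then a dummy byte out, to reach a known state."""
    pins.set_wr(0)            # bring /WR low
    pins.set_din(0)           # shift 0s in so the bits act as NOPs
    for _ in range(pulses):   # one pulse is the minimum, 8 make a full byte
        pins.pulse_clk()
    pins.set_wr(1)            # /WR high : clears the SR latch
    for _ in range(pulses):   # the next rising edge sets the latch again
        pins.pulse_clk()      # the data shifted out here can be ignored


pins = RecordingPins()
tap_init(pins)
print(len(pins.log), pins.log[0], pins.log[-1])
```

With a byte-wide SPI master, the two loops simply become one byte written and one byte read.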
    Here is the new version :

    Note : this works ideally when the CLK input is LOW when /WR changes. However : to create a rising edge, CLK must go down before going up again, this half-clock phase (when low) will be the "clear state". Ensure this period is long enough and that the CLK state is appropriate (check on the 'scope to be sure !!! I'm looking at you, Raspberry Pi...). This is not critical while shifting bits in, however it is a delicate thing to ensure when reading from the TAP (in particular, the first "volatile bit").

    This is solved by changing the precedence of the RS latch to RESET/DIFF/WR, as in this updated circuit :

    And the extended chronogram :

    Now CLK can go low before or after /WR changes.

    Just by changing where a wire is connected.

  • The TAP crosses 3 clock domains !

    Yann Guidon / YGDES - 08/03/2020 at 17:52

    The eXecute module of the TAP connects one domain with no RESET but 2 clocks, to another with one RESET and one clock. This makes it more complex than the others, as hinted by the end of the previous log The TAP's eXecute module.

    • On the TAP side : CLK and /WR are two sources of clocking. CLK goes to the counter and the shift registers, /WR goes to the decode logic that takes the control at the end of each message.
      CLK would not go very fast : 10MHz is reasonable (wires and other external effects would probably disturb the signal) and leaves 100ns between consecutive rising clock edges.
      There MUST be a reasonable margin (100ns ?) between the last rising clock edge and the rising edge of /WR.
      There is no RESET for several reasons :
      • The TAP must be able to work while the rest is in /RESET
      • Adding a TAP-specific RESET pin would increase the external footprint and wiring
      • The TAP can control /RESET from the inside
      • Routing another /RESET could burden the rest of the chip
      • By design, the TAP will work with the proper init sequence. (JTAG can also work without /RESET pin)
    • The core has a free-running clock, as well as a HW /RESET external signal.
      The clock could be as slow or fast as one wants, or even weird...
      The /RESET can also be overtaken by the TAP.


    For the communication with the FSM, the signal goes through two DFF as shown below :

    If the FSM clock is fast enough, the OR can be removed but... you're never too sure ! For example going from RESET to START triggers the reload of the instruction memory, which can take 4K cycles at least.

    The first DFF triggers on /WR going up, which is the necessary condition to detect the end of the message, or else the "valid" address could be triggered by enough random data flowing through the shift register. The asynchronous RESET allows the crossing of clock domains, and the clearing always trails the setting by at least one FSM clock cycle, as delayed by the next DFF.

    The DFF on the right also re-synchronises the input data so it is valid at the start of each FSM clock cycle. Otherwise the data could arrive late in the cycle and create race conditions and invalid boolean calculations.


  • The TAP's eXecute module

    Yann Guidon / YGDES - 08/01/2020 at 12:43

    The previous modules are quite simple, easy, self-contained, while the X command (described earlier) subtly touches more things at once.

    Talking to the Instruction slice is not very hard, but requires some decoding first, and some of it would be best shared with the Selector. The "addresses" 'S' and 'X' are very close and this would save some gates.

    I think it's the perfect time to talk about how I mapped the S-decoder to gates :-)

    It started easy enough for the 'S' condition :

    valid <= '1' when  SRi( 7 downto 0)="01010011" -- signature
            and SRi(15 downto 11)="00110"   -- command : MSB select ASCII chars '0'-'7'
            and SAT='0' and W='0' and J1='1' and J0='0'
       else '0';

    Then another simple step is to sort the '0' and '1' to put them in two separate equations, one with AND for the '1's and the '0's are gathered with a big NOR :

    norx <= not (SAT or W or J0 or SRi(15) or SRi(14) or SRi(11)
                 or SRi(7)  or SRi(5)  or SRi(3)  or SRi(2));
    valid <=  SRi(13) and SRi(12) and SRi(6) and SRi(4)
                  and SRi(1) and SRi(0) and J1 and norx;
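The sorting step can be reproduced mechanically. This little sketch (the helper name `split_pattern` is mine) takes the two bit-string literals of the 'S' condition and lists which SRi bits go to the AND and which go to the big NOR; the SAT, W, J0 and J1 terms are left out because they are single signals.

```python
def split_pattern(msb: int, pattern: str):
    """Return (ones, zeros): bit indices that must be 1 and 0 for a match.

    `pattern` is written MSB first, like a VHDL bit-string literal, and
    `msb` is the index of its leftmost bit.
    """
    ones, zeros = [], []
    for offset, char in enumerate(pattern):
        (ones if char == "1" else zeros).append(msb - offset)
    return ones, zeros


sig_ones, sig_zeros = split_pattern(7, "01010011")   # signature 'S'
cmd_ones, cmd_zeros = split_pattern(15, "00110")     # command field
and_inputs = sorted(cmd_ones + sig_ones, reverse=True)
nor_inputs = sorted(cmd_zeros + sig_zeros, reverse=True)
print("AND:", and_inputs)
print("NOR:", nor_inputs)
```

The two lists match the SRi terms of the `valid` and `norx` equations above.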
    

     Then it's easy to group the ORs and ANDs together into 3-inputs gates. And when there are not enough inputs for the AND gates, they can be used to input the result of the NORs :-)

    Finally, bubble-pushing can transform two consecutive ANDs into a NAND followed by a NOR.

    So let's do this all over again but this time the X condition is also decoded so some gates are common.

    S <= '1' when SRi(7 downto 0)="01010011" and SRi(15 downto 11)="00110"
                        and SAT='0' and W='0' and J1='1' and J0='0'
       else '0';
    X <= '1' when SRi(7 downto 0)="01011000"
                        and SAT='0' and W='0' and J2='1' and J1='0'
       else '0';

    The common terms are

    COM <= '1' when SRi(7 downto 4)="0101" and SRi(2)='0' and SAT='0' and W='0'
      else '0';

    and S and X can be taken separately :

    S <= '1' when SRi(3)='0' and SRi(1 downto 0)="11" and SRi(15 downto 11)="00110"
                        and J1='1' and J0='0'
       else '0';
    X <= '1' when SRi(3)='1' and SRi(1 downto 0)="00" and J2='1' and J1='0'
       else '0';

    Now these 3 can be checked in parallel, let's separate their bits according to their value.

    X <= SRi(3) and J2 and
           not ( SRi(1) or SRi(0) or J1); -- nice fit for this one !
    S <= SRi(1) and SRi(0) and SRi(13) and SRi(12) and J1 and
           not (SRi(3) or J0 or SRi(15) or SRi(14) or SRi(11));
    COM <= SRi(6) and SRi(4) and
           not (SAT or W or SRi(2) or SRi(7) or SRi(5));

    From there the gates are easy to cluster and bubble-push.
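Before clustering the gates, it is worth checking that the factored equations still mean the same thing as the original `when ... else` conditions. The sketch below is a plain behavioural comparison in Python (not the VHDL of the archive): it assumes the final S and X are the partial terms ANDed with COM, and it sweeps every decoded bit.

```python
from itertools import product


def bit(word: int, index: int) -> int:
    return (word >> index) & 1


def reference(sri, sat, w, j0, j1, j2):
    """S and X as written in the `when ... else` form."""
    common = sat == 0 and w == 0
    s = ((sri & 0xFF) == 0b01010011 and (sri >> 11) == 0b00110
         and common and j1 == 1 and j0 == 0)
    x = (sri & 0xFF) == 0b01011000 and common and j2 == 1 and j1 == 0
    return int(s), int(x)


def factored(sri, sat, w, j0, j1, j2):
    """S and X rebuilt from the COM term and the two partial terms."""
    def b(i):
        return bit(sri, i)
    com = b(6) and b(4) and not (sat or w or b(2) or b(7) or b(5))
    s = (b(1) and b(0) and b(13) and b(12) and j1
         and not (b(3) or j0 or b(15) or b(14) or b(11)))
    x = b(3) and j2 and not (b(1) or b(0) or j1)
    return int(bool(s and com)), int(bool(x and com))


mismatches = 0
# Bits 10..8 of SRi are not decoded, so they stay at 0 to keep the sweep short.
for high, low, flags in product(range(32), range(256), product((0, 1), repeat=5)):
    sri = (high << 11) | low
    if reference(sri, *flags) != factored(sri, *flags):
        mismatches += 1
print("mismatches:", mismatches)
```

The flags are swept in the order (SAT, W, J0, J1, J2). The same kind of sweep can be pointed at the gate-level netlist once it is written.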

    The result is 11 gates, the speed is not striking but 4 or 5 gates of latency shouldn't be limiting for this slow circuit and it is only 2 more gates than the previous circuit.

       sa: entity  OR3 port map(A=>SRi(15), B=>SRi(14), C=>SRi(11), Y=>tSo  );
       sb: entity NOR3 port map(A=>J1     , B=>SRi( 3), C=>tSo    , Y=>tSn  );
       sc: entity AND3 port map(A=>SRi(13), B=>SRi(12), C=>tSn    , Y=>S2   );
       sd: entity AND3 port map(A=>SRi( 1), B=>SRi( 0), C=>J0     , Y=>S1   );
    
       c1: entity  OR3 port map(A=>SRi(2) , B=>SRi( 7), C=>SRi( 5), Y=>Co1  );
       c2: entity NOR3 port map(A=>SAT    , B=>W      , C=>Co1    , Y=>Co2  );
       co: entity AND3 port map(A=>SRi(6) , B=>SRi( 4), C=>Co2    , Y=>COM  );
    
       x1: entity NOR3 port map(A=>SRi(1) , B=>SRi( 0), C=>J2     , Y=>tXo  );
       x2: entity AND3 port map(A=>tXo    , B=>SRi( 3), C=>J1     , Y=>tX   );
    
      vx:  entity AND3 port map(A=>tX     , B=>COM    , C=>FB     , Y=>X    );
      vld: entity AND3 port map(A=>S1     , B=>COM    , C=>S2     , Y=>valid);
    

    (one thing I dislike about VHDL is the requirement to label ALL the instantiated entities, it really gets nasty fast).


    OK !

    • Now, the Selector decodes the execute address with only 2 gates of overhead.
    • The clock to the slice is only gated by /WR, already done by the Selector.
    • The data to the slice shift register comes directly from the Selector as well (the MSB of the Command bus)

    But the slice requires more than these signals and the FSM is an even tougher beast... Let's just focus on the control of the slice :

    • Imux : the source of the instruction is selected by the current command (STEPX, NOPX ?) which requires some decoding....

  • Trap on instruction

    Yann Guidon / YGDES - 07/30/2020 at 06:15

    From the very beginning, the Y8 core is designed to allow extensive debugging features. Look at the early logs 3. Breakpoints ! and 4. The YGREC debug system to see the approach.

    As the TAP system is being defined and implemented, more details emerge and here I describe one sub-sub-part of the debug system : the slice inserted between the instruction memory and the instruction decoder.

    This DFF+MUX2 is pretty easy to design & layout, and the insertion delay is short enough, so why not add more features ?

    The early drafts promised a trap on a given instruction. This can be refined by masking some of the bits to compare and we get two registers (CoMPare and Match). Since a latch uses half the size of a DFF and we have a DFF very close, 2 latches are chosen.

    This is very helpful during debugging because you don't have to focus on a particular instruction.

    • Want to know how many times a given opcode is executed ?
    • Want to know which instructions write to a given register ?
    • Want to know why a given I/O register or range is overwritten ?

    Just set the mask to select the desired field (opcode, register, immediate...) and select the behaviour (trap or count) and you're done.
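The masked comparison itself is a one-liner. In this sketch the opcode is assumed to sit in the top 4 bits of the instruction word, which is only a guess for the example (the field positions are not given in this log), and the "count" behaviour is imitated with a simple sum.

```python
def instruction_matches(instr: int, compare: int, mask: int) -> bool:
    """True when the instruction equals `compare` on every bit set in `mask`."""
    return ((instr ^ compare) & mask & 0xFFFF) == 0


# Hypothetical layout for illustration only: opcode in the top 4 bits.
OPCODE_MASK = 0xF000
program = [0x1234, 0x1FFF, 0x2234, 0x1000]
hits = sum(instruction_matches(word, 0x1000, OPCODE_MASK) for word in program)
print(hits)
```

In hardware this is one XOR and one AND per bit, followed by a wide NOR.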


    Some more considerations and compromises...

    I dumped one latch to save space. That's the difference between 80 and 96 gates, in a core that is already quite small.

    This means that the mask latch must be loaded first then another command loads the instruction chain again, and can't change it at all, so a specific command must also assert Trap_en only when /WR is high while sending a "START" command to the FSM.

    The control logic is slightly more complex but the compactness matters. Fewer gates means fewer sources of errors, delay or power sinks.


    Just added YGREC8_VHDL.20200730.tgz that includes the TAP/Slice circuit shown above.

  • TAP summary & protocol

    Yann Guidon / YGDES - 07/29/2020 at 15:34

    First, here are the logs that describe the design of the Test Access Port :

    4. The YGREC debug system contains the first high-level description, the principle applies equally for any technology/implementation.
    16. Inspect and control the core
    24. Synchronous Serial Debugging
    25. MUX trees
    109. Gray counter (reboot of the low-level design)
    110. The art of large MUXes
    111. The first half of the TAP
    112. Design of a TAP : the SIPO Controller
    113. The TAP's bits counter
    114. The TAP selector
    115. The TAP is coming together
    118. The TAP's eXecute module
    119. The TAP crosses 3 clock domains !


    This log summarises the high-level view from the debugger's perspective. The TAP is "just" a low-level port, a few pins that serialise data in and out of the core, and could be implemented in whatever way (the current TAP is serial but could be made in byte-parallel for the relay version for example).

    This TAP is obviously byte-oriented and designed for SPI mode 0 : this eases programming a lot because most CPUs have a byte-oriented SPI controller. Using variable sized framing would operate slower on platforms such as the Raspberry Pi for example. JTAG often handles sequences of bits in groups other than 8...


    Timing

    The diagram below shows the typical timing with only one transmitted byte shown:

    The TAP works as 2 phases in half-duplex, so Din and Dout may share a tristate pin for example. The /WR pin controls the phase and things happen during these transitions.

    • Going from high to low starts transfer on Din into the TAP, "full bytes" at a time (the number of bits is always a multiple of 8, MSB first to follow the common SPI standard). Each bit is sampled on the rising edge of the clock. The delay from a to b , as well as c to d, is typically one clock cycle to give enough settling time to the internal counter.
    • Going from low to high starts the shifting out of the data from the TAP to the host controller. The 64 bits are serialised with the MSB first, followed by bits from shuffled positions. Bit 63 is presented very soon after the transition (see d->e) so it can be polled without having to shift data or trigger a SPI byte shift. If more than 64 clock pulses are sent, the internal counter wraps around and serialises the same sequence of bits (though their values might have changed since).

    The MUX

    For practical reasons, the Y8 has a selection of 64 bits to provide a (partial but sufficient) snapshot of the core's state. Instead of reading all the registers, only 4 byte values are available (SND, SRI, Result & PC), which already amounts to 32 bits. The remaining bits are further halved by providing the current instruction (16 bits). The rest is shared by the Status Flags (C, S, Z : 3 bits), the FSM status and a free byte (possibly multiplexed with the scan chain for a loopback test).

    In a sense, the order matters little because the bits are scrambled anyway. With the serial TAP, the user must stream 64 bits every time to get everything (this is not the case though for the scan chains and this saves some time). However it's "good" that they fit with the structure of the tree, so it helps with place&route. "Just in case" I placed the fields in increasing order of granularity and relevance to the debugger (in case a byte-wide, or non-scrambled, interface is developed).

    TAP/MUX64 allocation of the inputs :
    8 bits : Status (Flags & FSM)
    8 bits : (undefined, variable, switchable, maybe the selector address ?)
    16 bits : Current Instruction being decoded
    8 bits : PC
    8 bits : RES
    8 bits : SRI
    8 bits : SND
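From the debugger's side, the allocation above boils down to unpacking a 64-bit word. This sketch assumes the bits have already been descrambled and that the first entry of the list (Status) lands in the most significant byte; both points are assumptions of the example, as are the field names.

```python
# (name, width) pairs in the order of the allocation list; placing the
# first entry in the most significant bits is an assumption of this sketch.
FIELDS = [
    ("status", 8),
    ("spare", 8),
    ("instruction", 16),
    ("pc", 8),
    ("res", 8),
    ("sri", 8),
    ("snd", 8),
]


def unpack_snapshot(word: int) -> dict:
    """Split a descrambled 64-bit TAP snapshot into named fields."""
    if not 0 <= word < 1 << 64:
        raise ValueError("snapshot must fit in 64 bits")
    fields = {}
    shift = 64
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


snapshot = unpack_snapshot(0x05_00_FFFF_12_34_56_78)
print(snapshot)
```

A different bit order only changes the `FIELDS` table, the rest of the debugger sees the same dictionary.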

    Notes:

    • If a byte-parallel interface is defined, it gets the status immediately, without having to scan past the other bytes that might be unnecessary in a given context
    • This map is defined to be valid after a Null command, where the Selector is reset. Other registers (such as the breakpoints) could be selected by the Selector
    • The debugger gets these 64 bits, regardless of the actual implementation : that is the "view" for the GUI...

  • The TAP is coming together

    Yann Guidon / YGDES - 07/28/2020 at 14:36

    After about a week of intense work on the sub-parts, they are coming together as a TAP core module that lets us configure any structure at will.

    The 4 sub-parts are combined to let me add chains of any length, either "transient" or latched. These chains can be very simple or with multiple checks, answer to arbitrary signatures or start at any position after the Selector's 16 bits.

    Now comes the time to think about how to use it.

    First possible example is to stream the program into the instruction SRAM :

    • set /WR low (or send a NULL command if unsure)
    • select the appropriate chain/function with the command '1S'
    • toggle /WR high and low
    • stream the 512 bytes of instructions
    • send the signature byte (TBD) to validate the operation
    • set /WR high
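The steps above can be written down as an ordered list of host actions. This is only a sketch: the tuples are a made-up notation, `select_cmd` stands for the '1S' command (taken here as two ASCII bytes, which is my reading) and the signature byte is still TBD, so both are passed in as parameters with placeholder values.

```python
def upload_sequence(program: bytes, select_cmd: bytes, signature: int) -> list:
    """Return the ordered host actions for streaming a program into the SRAM.

    `select_cmd` stands for the '1S' selection command and `signature` for the
    validation byte, which the log leaves undefined (TBD); both are parameters
    here rather than real values.
    """
    if len(program) != 512:
        raise ValueError("expected exactly 512 bytes of instructions")
    steps = [("wr", 0)]                          # set /WR low
    steps.append(("send", bytes(select_cmd)))    # select the chain/function
    steps += [("wr", 1), ("wr", 0)]              # toggle /WR high and low
    steps.append(("send", bytes(program)))       # stream the instructions
    steps.append(("send", bytes([signature])))   # validate the operation
    steps.append(("wr", 1))                      # set /WR high
    return steps


demo = upload_sequence(bytes(512), b"1S", 0x00)  # placeholder arguments
print([kind for kind, _ in demo])
```

A real driver would replace each tuple with a pin write or an SPI transfer.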

    Reading back is something else, one has to go through the "normal path" : set PC to the address and read the instruction buffer. So how do you do that ?

    Well first you have to stop the core, which means you also need to start or even step it : the Start/Step/Stop trinity is one of the messages that are sent to control the internal FSM with a command register. The FSM state is read back to confirm and acknowledge. This could go to the register address "F" for example.

    But there is no "read the register" command. There is something even better : the tree reads the values of the SND, SRI, Result and PC buses. All there is to do is inject an instruction at the decoder's input and not let the core record the result.

    So far this chain looks like this :

    • Instruction : 16 bits (MSB first)
    • suffix1 : FSM state/message/command byte
    • suffix2 : 'X' = 01011000
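
As a rough illustration of this chain layout, the 32 bits can be assembled as follows. The function name `pack_chain` is hypothetical, and shifting the two suffix bytes MSB first is an assumption (the log only states the bit order for the instruction).

```python
def pack_chain(instruction, command):
    """Return the 32 chain bits as a string of '0'/'1', in shift order."""
    if not 0 <= instruction <= 0xFFFF:
        raise ValueError("instruction must fit in 16 bits")
    if not 0 <= command <= 0xFF:
        raise ValueError("command must fit in 8 bits")
    bits = format(instruction, "016b")  # instruction, MSB first
    bits += format(command, "08b")      # suffix1: FSM state/message/command
    bits += format(ord("X"), "08b")     # suffix2: 'X' = 01011000
    return bits
```

For example, `pack_chain(0x8001, 0x00)` yields the 16 instruction bits followed by eight zeros and the `01011000` signature.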

    The Suffix1 byte can reuse the command register in the Selector unit. The instruction shift register can then be freely routed close to the decoder, and a MUX2 selects if the instruction comes from the TAP or the instruction memory. 

    The MUX2 is controlled by a single latch bit from the command byte : the whole shifted word is half-transient because the instruction doesn't need to be latched, but the command does.

  • The TAP selector

    Yann Guidon / YGDES, 07/26/2020 at 16:34

    As shown in the log "111. Design of a TAP : the SIPO Controller", the first module is the "selector", used by the other modules to enable one sub-chain or another, or none (when the "null" command is given).

    A preliminary version is simulated in Falstad :

    This module provides both an early signature decoder, as well as the SIPO chain for the first 16 bits, available to other modules.

    The latching mechanism is also specific, unlike the latches of the other modules : if the FB (Full Byte) signal is off, then the selector register is cleared when /WR goes high. This catches most of the wrong sizes, including NULL, to prevent unwanted spurious behaviours.

    The 3 cells can be replicated as needed if more outputs are required but 8 is already enough for a small circuit like the Y8. The codes 000 and 111 are avoided to further prevent spurious operations. The cell structure is unusual : an AND is inserted between the loopback MUX and the DFF, which has no /RESET input (just send an invalid command to clear). This system is fully synchronous, using 2 non-overlapping clocks (CLK and /WR must be kept separate by the host).

    Another subtlety : this module has a "permanent" output that must remain valid after more than one command, and it must be cleared by invalid commands so a DFF is used, instead of a latch. /WR is not used by the decoding logic, but other simpler modules will use latches and /WR must be decoded.

    The TAP is looking better each day...


    20200728 : I changed the command prefix to match the ASCII 0-7 characters :-)
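
The selector behaviour described in this log can be modelled in a few lines. This is a behavioural sketch only, not the Falstad circuit: the function names are hypothetical, and rejecting the codes 000 and 111 at decode time is an assumption (the log only says these codes are avoided).

```python
def selector_code(command_char):
    """Extract the 3-bit selector code from an ASCII '0'..'7' command."""
    value = ord(command_char)
    if value >> 3 != 0b00110:       # upper 5 bits shared by ASCII '0'..'7'
        return None                 # wrong prefix: not a selector command
    code = value & 0b111            # lower 3 bits select the sub-chain
    if code in (0b000, 0b111):      # avoided codes (assumed rejected here)
        return None
    return code


def cell_next(q, d, load, keep):
    """Next state of one cell: loopback MUX, then AND, then the DFF."""
    mux = d if load else q          # loopback MUX: new bit or current state
    return mux & keep               # AND clears the bit, replacing /RESET
```

With `keep` at 0 the cell is cleared on the next clock whatever the MUX selects, which is how an invalid command can clear a DFF that has no /RESET input.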


Discussions

salec wrote 10/09/2019 at 09:18

YGREC can stand for so many things, but since my wife has been learning French on Duolingo I can't avoid noticing that it is also a wordplay on the French spelling of "Y".

:-)

Yann Guidon / YGDES wrote 10/09/2019 at 10:03

oh, of course, yes, too ;-)

salec wrote 10/09/2019 at 12:04

always have an opening joke/tease for audience :D

Yann Guidon / YGDES wrote 10/09/2019 at 12:46

@salec  always !

[this comment has been deleted]

Yann Guidon / YGDES wrote 04/14/2019 at 08:56

That "purposeful sense" may look drowned in the proliferation of projects, angles and ideas, but it is still clear to me since it has been my main hobby since 1998 at least :-D

I'm glad you enjoy !

Yann Guidon / YGDES wrote 11/04/2018 at 07:11

Another note for later :
writing to A1 or A2 starts a fetch from RAM. In theory the latency is the same as instruction memory and one wait state would be introduced. However the processor can also write directly so the wait state would be only on read to the paired data register...

Yann Guidon / YGDES wrote 11/04/2018 at 06:55

Note for later : don't forget the transparent latch on the destination register address field, for the (rare) case of LDCx, because the 2nd cycle doesn't preserve the opcode etc.

Yann Guidon / YGDES wrote 11/04/2018 at 07:18

OK, not a transparent latch, but a DFF and a mux, plus some logic to control it.

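-- DFFs, updated on every clock cycle ("clk" is an assumed clock name) :
process(clk)
begin
  if rising_edge(clk) then
    SND_latched <= SND_field;
    -- VHDL-2008 conditional assignment : high only during the 2nd cycle of LDCx
    LDCx_flag <= '1' when (LDCx_flag='0' and opcode=opc_LDC and writeBack_enabled='1') else '0';
  end if;
end process;

-- MUX2 :
WriteAddress <= SND_latched when LDCx_flag = '1' else SND_field;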
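
To check the intent of this DFF + MUX2 arrangement, here is a small cycle-level model. It is a sketch under assumptions: the class name, the `clock` method and the string used for `OPC_LDC` are hypothetical placeholders, not the real opcode encoding.

```python
OPC_LDC = "LDC"  # placeholder opcode value, for illustration only


class WriteAddressSelect:
    """Cycle-level model of the DFFs and MUX2 described above."""

    def __init__(self):
        self.snd_latched = 0   # DFF copy of the SND field
        self.ldcx_flag = 0     # 1 during the 2nd cycle of LDCx

    def write_address(self, snd_field):
        # MUX2 (combinational): use the latched address on the 2nd cycle
        return self.snd_latched if self.ldcx_flag else snd_field

    def clock(self, snd_field, opcode, writeback_enabled):
        # Clock edge: both DFFs update from the values seen before the edge
        set_flag = (self.ldcx_flag == 0 and opcode == OPC_LDC
                    and writeback_enabled)
        self.snd_latched = snd_field
        self.ldcx_flag = 1 if set_flag else 0
```

During the first LDCx cycle the write address follows the SND field; after the clock edge the flag is set, so the second cycle reuses the latched address even though the instruction fields are no longer valid.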

______

Note : LDCx into PC must work without wait state because it's connected directly to SRI, as an IMM8, and no extra delay is required. PC wait state is required for ADD/ROP2/SHL and IN.

Frank Buss wrote 10/27/2018 at 12:51

Do you really plan 8 byte-wide registers? This would require thousands of relays :-)

Yann Guidon / YGDES wrote 10/27/2018 at 14:26

no :-)

8 registers, 8 bits each = 64 storage bits.
1 relay per bit => 64 relays

The trick is to use the hysteretic mode of the relays :-)

Frank Buss wrote 10/27/2018 at 16:17

Ok, makes sense. Maybe change the project description, someone might think you are planning a 64 bit architecture.
BTW, could this be parametrized for the address and data size? If you implement it in VHDL, you could use generics for this; it would be no additional work to use the generic names instead of hard-coded numbers, except maybe some work for extending the instruction opcodes.

Yann Guidon / YGDES wrote 10/27/2018 at 17:16

Frank : DAMNIT you're right !

I updated the description...

Yann Guidon / YGDES wrote 10/27/2018 at 17:19

For the parameterization : it doesn't make sense at this scale. Every fraction of bit counts and must be wisely allocated.

Larger architectures such as #YASEP Yet Another Small Embedded Processor and #F-CPU have much more headroom for this.

Bartosz wrote 11/08/2017 at 16:40

Will this work on Epiphany, oHm or another cheap machine?

Yann Guidon / YGDES wrote 11/08/2017 at 18:07

I'm preparing a version that would hopefully use less than half of an A3P060 FPGA, which is already the smallest of that family that can reasonably implement a microcontroller.

But it's a lot less fun than making one with hundreds of SPDT relays !

Bartosz wrote 11/14/2017 at 14:13

The question is the price and the possibility to buy it.

Yann Guidon / YGDES wrote 11/14/2017 at 16:08

@Bartosz : what do you want to buy ?

If you can simulate and/or synthesise VHDL, the source code is being developed and available for free, though I can't support all FPGA vendors.

If you want a ready-made FPGA board, that could be made too.

If you want relays, it's a bit more tricky ;-)

I have just enough RES15 to make my project and it might take a long while to succeed. There will be many PCBs and other stuff.

However if, in the end, I see strong interest from potential buyers, I might make a cost-reduced version with easily-found minirelays. I don't remember well but the Chinese models I found cost around $0.50 apiece. Factor in PCB and other costs and you get a very rough price estimate... It's not cheap, it's not power efficient, it's slow and won't compute useful stuff... But it certainly can make a crazy nice interactive display, when coupled with flip dots :-D

So the answer is : "it depends" :-D
