
YGREC8

A byte-wide stripped-down version of the YGREC16 architecture

#YGREC16 is getting pretty large and moving away from the original #AMBAP inspiration, making it less likely to be implemented within my lifetime. So here is a "back to minimalism" version with
* 256 bytes of DRAM (plus one parity)
* 8 byte-wide registers
* fewer relays than the YGREC16
This core is so simple that I focus now on the debug/test access port.
Like the others, it's suitable for implementation with relays, transistors, SSI TTL, FPGA and ASIC.

I give up on the idea of playing the Game of Life (the forte of #YGREC-РЭС15-bis) but I design a VHDL version because @llo sees the YGREC8 as a perfect replacement for PICs for his #SteamBot Willie !


A significant reduction of the register set's size is required so I/O must be managed differently, through specific instructions. The register map is expected to be:

  • D1  <= for NOP
  • A1
  • D2
  • A2
  • R1
  • R2
  • R3
  • PC  <= for INV

I shrunk the instruction word down to 16 bits. It is still reminiscent of the YGREC16 older brother but I had to make clear cuts... The YGREC8 is a 1R1W machine (like x86) instead of the RISCy YGREC16, to remove one field.

I have swapped the condition field and the ALU code field, which is now a more classical opcode.

20171116: The latest evolution of the instruction format has added a 9-bit immediate address field for the I/O instructions.
20180112: Imm9 is now removed again...

There are two classical instruction forms : either an IMM8 field, or a source & condition field, combined with the destination field and a small opcode. The source field can also become a short immediate field (3 bits only but essential for conditional short jumps or increments/decrements).

The opcode field has 4 bits and the following values:

Logic group :

  • OR  => Reg OR Reg does not change Reg
  • XOR
  • AND
  • ANDN

Arithmetic group:

  • CMPU
  • CMPS
  • SUB
  • ADD

Beware : There is no point in adding 0, so ADD with a short immediate (Imm3) skips the value 0 and the range is from -4 to -1 and +1 to +4. (see 17. Basic assembly programming idioms)
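A minimal sketch of how such a zero-skipping Imm3 could be decoded. The bit encoding is an assumption (the field is read as a 3-bit two's complement number and the non-negative half is shifted up by one); the source only gives the resulting range, and the name `decode_add_imm3` is hypothetical.

```python
def decode_add_imm3(field: int) -> int:
    """Map a 3-bit field to an ADD operand in -4..-1, +1..+4 (0 is skipped).

    Assumed encoding, not taken from the YGREC8 sources.
    """
    if not 0 <= field <= 7:
        raise ValueError("Imm3 is a 3-bit field")
    signed = field - 8 if field & 0b100 else field   # two's complement: -4..+3
    return signed + 1 if signed >= 0 else signed     # shift 0..3 up to 1..4

values = sorted(decode_add_imm3(f) for f in range(8))
print(values)  # [-4, -3, -2, -1, 1, 2, 3, 4]
```

With this mapping all 8 codes are useful, and both short increments and short decrements reach 4.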

Shift group (optional)

  • SHR
  • SHL
  • SAR
  • ROL

Control group:

The COND field has 4 bits, more than YGREC16, so we can add more direct binary input signals. CALL is moved to the opcodes so one more code is available.  All conditions can be negated so we have :

  • Always
  • Z (Zero, all bits cleared)
  • C (Carry)
  • S  (Sign, MSB)
  • B0, B1, B2, B3 (input signals)

Instruction code 0000h should map to NOP, and the NEVER condition. (???)
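To illustrate how 8 base conditions plus a negation bit fill the 4-bit COND field, here is a small sketch. The encoding is hypothetical: bits 3..1 select the base condition and bit 0 is the polarity, chosen so that code 0 is the negated ALWAYS, that is NEVER. The real opcode map may differ.

```python
BASE_CONDITIONS = ["ALWAYS", "Z", "C", "S", "B0", "B1", "B2", "B3"]

def condition_met(cond: int, flags: dict) -> bool:
    """Evaluate a 4-bit COND code (hypothetical encoding).

    bits 3..1 select one of the 8 base conditions,
    bit 0 is the polarity: 1 = as-is, 0 = negated.
    """
    if not 0 <= cond <= 15:
        raise ValueError("COND is a 4-bit field")
    name = BASE_CONDITIONS[cond >> 1]
    value = True if name == "ALWAYS" else bool(flags[name])
    return value if cond & 1 else not value

flags = {"Z": 1, "C": 0, "S": 0, "B0": 0, "B1": 1, "B2": 0, "B3": 0}
print(condition_met(0b0000, flags))  # NEVER  -> False
print(condition_met(0b0001, flags))  # ALWAYS -> True
print(condition_met(0b0011, flags))  # Z      -> True
print(condition_met(0b0100, flags))  # not C  -> True
```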

Instruction code FFFFh should map to INV, which traps or reboots the CPU (through the overlay mechanism) : condition is implicitly ALWAYS because it's an IMM8 format : CALL PC FFh (thus rebooting/alerting with some code placed there, if any, otherwise keep instruction at FFh equal to INV to make an endless loop)

Overall, it's still orthogonal and very simple to decode, despite the added complexity of dealing with 1R1W code.


Logs:
1. Honey, I forgot the MOV
2. Small progress
3. Breakpoints !
4. The YGREC debug system
5. YGREC in VHDL, ALU redesign
6. ALU in VHDL, day 2
7. Programming the YGREC8
8. And a shifter, and a register set...
9. I/O registers
10. Timer(s)
11. Structure update
12. Instruction cycle counter
13. First synthesis
14. Syntax highlighting for Nano
15. Assembly language and syntax
16. Inspect and control the core
17. Basic assembly programming idioms
18. Constant tables in program space
19. Trap/Interrupt vector table
20. Automated upload of overlays into program memory
21. Making room for another instruction
22. Opcode map
23. Sequencing the core
24. Synchronous Serial Debugging
25. MUX trees
26. Flags, PC and IO ports
27. Binary translation
28. Even better register set
29. A better relay-based MUX64
30. Register set again
31. Rename that opcode !
32. Register set again again
.  

ygrec8_20180116_yg.svg

Core diagram in SVG, added LDCx MUXes

svg+xml - 17.96 kB - 01/17/2018 at 17:38


svg+xml - 6.99 kB - 01/12/2018 at 18:57


YGREC8_VHDL.20171209.tgz

Added: license, readme, mustfail...

x-compressed-tar - 36.61 kB - 12/08/2017 at 23:21


ygrec8.nanorc

Syntax highlighting for the Nano text editor

nanorc - 1.16 kB - 12/08/2017 at 14:43


ygrec_debug.svg

How the YGREC8 is split and controlled for debug, development and test

svg+xml - 8.55 kB - 12/03/2017 at 16:26



  • Register set again again

    Yann Guidon / YGDES03/18/2018 at 10:09 0 comments

    I know the title is lousy but the previous log 30. Register set again was missing illustrations so here they are.

    The most basic unit is a set of 2 bits of storage (DFF or TL) and 2 MUX2.

    Nothing fancy here but the 2×2 tile is copied/mirrored and 2 more MUX2 are added :

    YGREC8 has 8 registers so the 5×2 tile is copied/mirrored once again :

    It might look messy so let's not forget that many wires are shared, here are some colors to better visualise the wires' functions :

    It looks almost like an ASIC pre-layout and indeed routing is quite easy, some gates simply need to be moved around.

    The above 11x2 tile is a "slice" of one bit, and 3 are tied together to make a group. The MUX2s are in 3 groups of 7 each but I'm not sure which organisation is best. In the pictures below, each color represents one address bit.

    a

    b

    c

    The b) version seems to have a small advantage because red and green are a bit less wide, but the blue still spans the whole width. Maybe the best approach is the one that requires the fewest wire crossings for the overall set.

    .

  • Rename that opcode !

    Yann Guidon / YGDES03/11/2018 at 10:23 5 comments

    I remember when I first tried to understand a microprocessor. I had a book in French that explored the 6809 and I was young and impressed. But I couldn't wrap my head around the concept of the MOV opcode. Does it displace data ? And what happens to the original data ?

    I have since acquired the habit of using MOV, mostly from my heavy use of x86 asm. But looking back at that early confusion, and despite the almost universal use of this mnemonic, I believe it's time to do the "right thing" : rename it to CP.


    PS : somewhat related to 1. Honey, I forgot the MOV

  • Register set again

    Yann Guidon / YGDES03/07/2018 at 08:52 0 comments

    The pursuit of the Ultimate Register Set Structure progresses. I'm trying to make it more hierarchical and practical for a wider range of technologies (ASIC, FPGA, TTL, transistors).

    I decided to use a parity bit for the register set and the memory. This increases reliability and the 9th bit is already provided by the A3P FPGA anyway. I'm also settling on a 512-byte address space, whenever I can, to prevent aliasing issues (but the mapping can be controlled by some bits in the IO space at address 0)

    The redesign of the register set uses bit slices again. 3 slices are grouped and 3 groups make the 9-bit-wide register set. This is near perfect from the fanout point of view and the structure is very easy to place and route.

    Parity is in bit #4 to reduce wire lengths in FPGA and ASIC.


    Each slice has 8 bits of addressable storage and two MUX8.

    The two MUX8 can be either balanced (fan-in={1,3,3}) or not (the classical {1,2,4}), it doesn't make a difference. There will be a fan-in of 7 in each group of 3 slices for all 8 address wires, when using circular permutation.
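As a rough illustration of the fan-in argument, the sketch below builds one MUX8 as a tree of 7 MUX2 (the classical {4,2,1} loading from the first layer to the last) and counts the load on each address wire when the wire order is rotated from slice to slice. The function names and the rotation scheme are assumptions for illustration, and only one MUX8 per slice is counted.

```python
def mux2(sel, a, b):
    """2-input multiplexer: returns b when sel is 1, else a."""
    return b if sel else a

def mux8(addr_bits, inputs):
    """8-input multiplexer built as a tree of 7 MUX2.

    addr_bits[0] drives the 4 MUX2 of the first layer,
    addr_bits[1] the 2 of the second, addr_bits[2] the last one.
    """
    layer = list(inputs)
    for sel in addr_bits:
        layer = [mux2(sel, layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

LOADS_PER_LAYER = (4, 2, 1)  # number of MUX2 driven in each tree layer

def group_fanin(slices=3):
    """Load on each address wire when the wire order rotates from slice to slice."""
    load = [0, 0, 0]
    for s in range(slices):
        for layer, n in enumerate(LOADS_PER_LAYER):
            load[(layer + s) % 3] += n
    return load

regs = [10, 11, 12, 13, 14, 15, 16, 17]
print(mux8([1, 0, 1], regs))   # address 5 (LSB first) -> 15
print(group_fanin())           # [7, 7, 7]
```

Each address wire ends up driving 4+2+1 = 7 MUX2 across the group of 3 slices, which is the balanced load the circular permutation is after.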

    The storage part has more variations and options, depending on the technology.

    For FPGA the bits are made of DFF with enable. The clock must feed all 72 bits and the enable signal is split into 8 lanes, one for each register. No reset signal is required (despite complaints from the synthesiser). It's possible to go further by removing the Enable signal : the clock signal is split into 8 lanes, so yes, that's "clock gating"...

    Even further : a DFF is made from a couple of latches clocked on opposite signals. The first latch of each bit in a lane can be "factored" to reduce parts count in a discrete system. Instead of 16 latches to store 8 bits, only 9 remain (we saved almost one half of the parts !) which is good for TTL, transistors, ASIC... but clock sequencing is more complex. This approach is a bit slower but also saves power because the clock gating reduces the activity on the clock network by an 8:1 ratio.
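The latch count can be checked with a few lines of arithmetic. This sketch assumes that the shared first latch is common to the 8 registers of one bit slice (1 master plus 8 slaves), which is one reading of the description above; `latch_count` is a hypothetical helper.

```python
def latch_count(registers=8, width=9, shared_master=False):
    """Count latches for a register set built from master/slave latch pairs.

    Assumes one shared master latch per bit slice when shared_master is True.
    """
    per_slice = (1 + registers) if shared_master else (2 * registers)
    return per_slice * width

print(latch_count(width=1), latch_count(width=1, shared_master=True))  # 16 9
print(latch_count(), latch_count(shared_master=True))                  # 144 81
```

Under that assumption the full 9-bit register set would drop from 144 to 81 latches.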


    3 slices make a group where the control lines get a circular permutation to balance the load on the control lines. However, the 8 "enable" lanes would become all shuffled (and prove hard to route) if all the MUX8 are shuffled, so each of the slices must be routed correctly from the MUX8s to keep the right order of the latches.

    The groups have a fan-in of 1 for each signal (except data input if there is a direct connection to the DFF). The 2×3 MUX8 driving lines get amplified by one buffer each.

    On A3P, each group has a XOR3 at the data input to generate parity.


    Then at the higher level, 3 groups are assembled to create a 9-bit register set. The fan-in of the MUX8 is only 3. For other technologies, the 8 data input bits are parity-ed with a tree of XOR2 and the result is placed in the middle slice. The 8 latch enable lanes should be "straight" and easy to route.

    Two other parity checks should be implemented at the output ports.
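Here is a small sketch of the parity scheme, assuming even parity and assuming that the 8 data bits are simply split around bit #4 (low nibble in bits 0-3, high nibble in bits 5-8). Neither detail is confirmed by the log, and the function names are made up for the example.

```python
from functools import reduce
from operator import xor

def add_parity(byte: int) -> int:
    """Return a 9-bit word: data in bits 0-3 and 5-8, even parity in bit 4."""
    bits = [(byte >> i) & 1 for i in range(8)]
    parity = reduce(xor, bits)          # same result as a tree of XOR2
    low, high = byte & 0x0F, byte >> 4
    return (high << 5) | (parity << 4) | low

def check_parity(word9: int) -> bool:
    """True when the 9 stored bits XOR to 0 (no single-bit error)."""
    return reduce(xor, ((word9 >> i) & 1 for i in range(9))) == 0

w = add_parity(0xA7)
print(f"{w:09b}", check_parity(w), check_parity(w ^ 0b000100000))
# 101010111 True False
```

The same `check_parity` function stands in for the checkers at the two output ports: any single flipped bit makes it return False.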

  • A better relay-based MUX64

    Yann Guidon / YGDES03/02/2018 at 08:56 0 comments

    I came up with a different system for the MUX64 (required by the memory system) that doesn't use the CCPBRL system :

    It uses full on/off switching instead of constant biasing so it might be less sensitive to individual drift in characteristics. This means less binning. However, there could be one side that is more ON than the other and heats up more.

    There is a big trick as well : the capacitor replaces freewheeling diodes to "precharge" the opposite branch when the relay switches to the other side. The question of the capacitance is important because I doubt that 100nF will be enough and the 100µF capacitors are polarised, they would be destroyed...

    I have to evaluate the pros & cons of this method versus the CCPBRL one. For example, CCPBRL has only static/medium current and homogeneous/distributed heat but requires another higher-voltage power supply rail and requires very precise power supply regulation.

  • Even better register set

    Yann Guidon / YGDES02/27/2018 at 11:41 0 comments

    I think I cracked it :-)

    The MUX8 are all identical and a circular permutation controls 7 bits. The last bit has a different permutation to reach the ideal fanout of the gates. Hopefully this will let me make a better register set, both with relays (easier construction) and with VHDL (shorter, more generic code).


    Better :

    I'm just trying to reduce the length of the wires and the long crossings :-)

    Oh, that's even better :

    The sequence of permutations is :
    ABC
    ABC
    BAC
    BCA
    BCA
    CAB
    CAB
    CAB

    I now have to rewrite my register set VHDL code...

  • Binary translation

    Yann Guidon / YGDES02/20/2018 at 01:46 0 comments

    One thing I've been thinking about : since the YGREC8 is a sort of subset of the YASEP ISA, wouldn't it be nice and easy to emulate the YGREC8 on the YASEP with a pipeline stage that performs binary translation of the YGREC8 instructions ?

  • Flags, PC, IO ports and interrupts

    Yann Guidon / YGDES02/15/2018 at 04:12 0 comments

    Interrupt handling should be seriously considered because we'll need them one day or another. This means that the complete state of the core must be saved and restored by suitable hardware and software.

    • A first issue is how to save the flags (C, S and Z). Attempting to save them by conditional instructions will destroy their values... Upon an IRQ signal, they would be saved to 3 backup bits which can be read and written by an IO port (port 0 ?). Exit from IRQ (and restoration of the IRQ mask) would occur when writing the value back to the port... or something like that.
      Cost : 3 DFF with enable, 3 MUX for the feedback to the flags, 3 MUX to select where the DFF input comes from, and some glue logic.
    • A second issue is to save the PC. Actually, it's PC+1 that must be saved (after LDC has completed). Again : the value can be saved to an IO port (port #1 ?) where it can be read and written (with the proper MUXes).
    • A 3rd issue is that some scratch space is required to save a couple of registers (such as the address registers) to allow memory to be used to save the other registers. At least, ONE backup is required, probably A1, it can be automatic (like PC and Flags) but it's not required, the Interrupt Service Routine can start with OUT A1 2 for example (2 being the scratch register's address). If more scratch registers are provided (let's say 2 or 3) then very short ISRs can be written, with no need to touch memory. However, memory is the main channel of communication between threads so a compromise is 2 scratch registers.

    Overall, this means that the very first IO port addresses are reserved for core functions. There are 4 registers that can be written from the core's internal state, as provided by the entity's ports (PC+1, A1, A2 and flags are available outside of the datapath because they are required for the debug system). So far the map is :

    • 00h : Flags (C, S, Z) and Interrupt control backup register. Values come from IO write port or core. Setting bit 0 triggers restoration of the previous states (sort of "return from IRQ"). Bit 1 would enable/disable the IRQ mask.
      20180307: Two other bits control the mapping of memory banks for A1 and A2. These 8 bits are almost fully used.
    • 01h : PC backup : value comes from IO port or core (this allows IRQ re-entrance)
    • 02h : ScratchRegister1 : copied from A1 or from IO write port (result bus), used only by the ISR.
    • 03h : ScratchRegister2 : copied from A2 or from IO write port (result bus), used only by the ISR.

    Saving A1 and A2 directly with dedicated hardware saves one or two cycles of latency (and some precious bytes) when servicing IRQs but can also make the core harder to route... So they might be simple registers (which saves a MUX as well as the required wires). Or they can be "shadow" registers, written every time the corresponding A register is being written (but the value goes through the RESULT bus, while the OUT bus is connected to DST, so it's awkward and would increase the overall electrical activity of the circuit, which is less good for power draw).

    One nice side-effect is : this avoids creating an opcode for RTI (ReTurn from Interrupt) because it is detected by the following conditions : OPCODE=OUT (5 bits), IMM8=0 (8 bits), and DST[0]=1 (1 bit). 14 bits are easy to check in the pipeline.

    The other nice aspect is that this mechanism is entirely optional : it can be disabled/removed if IRQs are not supported by the core.
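The reserved IO map and the implicit "return from IRQ" detection can be sketched as follows. The check works on already-decoded fields because the bit layout of the instruction word is not given here; `OPCODE_OUT` is a placeholder value and `dst_bit0` is a hypothetical name for the DST[0] bit mentioned above.

```python
OPCODE_OUT = 0b10101   # placeholder 5-bit value, NOT the real YGREC8 encoding

IO_MAP = {
    0x00: "Flags (C, S, Z) and interrupt control backup",
    0x01: "PC backup",
    0x02: "ScratchRegister1 (copied from A1 or IO write port)",
    0x03: "ScratchRegister2 (copied from A2 or IO write port)",
}

def is_rti(opcode: int, imm8: int, dst_bit0: int) -> bool:
    """Detect the implicit RTI: OUT to port 00h with DST[0] set.

    5 opcode bits + 8 address bits + 1 DST bit = 14 bits to compare.
    """
    return opcode == OPCODE_OUT and imm8 == 0x00 and dst_bit0 == 1

print(is_rti(OPCODE_OUT, 0x00, 1))  # True: restores the saved state
print(is_rti(OPCODE_OUT, 0x01, 1))  # False: plain write to the PC backup
print(is_rti(OPCODE_OUT, 0x00, 0))  # False: DST[0] is not set
```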

  • MUX trees

    Yann Guidon / YGDES02/05/2018 at 15:47 0 comments

    At this moment I work on a more formal code for the MUX parts. In other words I'm digging again in a pet topology project. This makes the VHDL code better, because I realise I use MUX8 in various places yet I don't get the best out of them. For example, even though I built the Register Set out of balanced control trees, I didn't use this technique for the conditions. So I started writing MUX8 components in VHDL... I haven't uploaded the new code archive but when I do, look at MUX8.vhdl. I should also rewrite the REG8 module by using these enhanced MUX8.

    The next step is the large MUX64 used by the serial debug system (see 24. Synchronous Serial Debugging). I'd like to design it algorithmically but I haven't cracked the algorithm yet. Is there a simple one ?


    20180227 : algorithm cracking in progress. Meanwhile, I already have one topology/solution for MUX64 :

    It's going to be fun to write this in VHDL...

  • Synchronous Serial Debugging

    Yann Guidon / YGDES01/29/2018 at 04:08 2 comments

    A previous log, 16. Inspect and control the core, has shown the high level view of the debugging system. Here we see how it is implemented for a SPI interface, such as a Raspberry Pi.

    The necessary signals are

    • MOSI : data sent to the YGREC8 core
    • MISO : data received from the YGREC8 core
    • SCLK : synchronous clock
    • Select : control which chain is accessed.

    The protocol is half-duplex to prevent incoherency. This means that MOSI and MISO could share a bidirectional pin:

    • Select=low : YGREC8 core receives data
    • Select=high : YGREC8 sends data

    So the physical interface could use 3 or 4 pins, depending on the requirements. The interface is easy to bit-bang with a microcontroller or something else.

    The transitions of Select reset the appropriate chains. For example, when Select=low, the sending circuits are in RESET mode. The received information will only become active and registered when Select goes high (so Select is also a clock input to the DFF of certain internal registers). The Select pin can be left with a weak pull-down, as well as the data pin and the clock pin. Any number of clock cycles has no effect, as long as Select remains low (the debug controller can then flush the shift register by sending 0s, then its own 32 bits).

    Note : the design is aimed at simplicity and compactness, using the least possible gates.


    When Select=low, the core is in receive mode.

    There is a 32-bit shift register that shifts bits in. Any number of clock cycles can be sent, only the most recent bits are considered. The bits are described in the log 16. but here they are again :

    • 16 bits for the CONTROL register, with the bits RESET, BYPASS, UPDATE, START, STOP, STEP, as well as bank select to access the breakpoint & profiling registers.
    • 16 bits of INSTRUCTION, which can be sent to the core, or eventually later to the breakpoint & profiling registers.

    When the Select pin goes high, the shift register is transferred to the appropriate registers, depending on the state of the bits of the control register. For example, if the "Instruction Bypass" bit is set, the core will execute the 16 bits of instruction that are provided in the currently written word.

    Physically, the circuit is just a string of 32 DFF, with a common clock that is gated by the Select signal. No reset is needed (except a few sensitive signals). The Select signal will latch the appropriate registers and/or update the FSM (after a resynchronisation to the local/internal clock). There are only 32×DFF to drive with the SCLK signal, and fewer for the MOSI signal.
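A behavioural sketch of the receive chain, from the debug host's point of view. The bit order (MSB first) and the placement of CONTROL in the upper half of the 32-bit word are assumptions; the class and function names are invented for the example.

```python
class DebugReceiver:
    """32-bit receive chain of the debug port (behavioural sketch)."""

    def __init__(self):
        self.shift = 0
        self.control = 0
        self.instruction = 0

    def clock_bit(self, mosi: int) -> None:
        """One SCLK pulse while Select is low: shift one bit in."""
        self.shift = ((self.shift << 1) | (mosi & 1)) & 0xFFFFFFFF

    def select_rises(self) -> None:
        """Select going high latches the 32 most recent bits."""
        self.control = self.shift >> 16          # assumed: upper half
        self.instruction = self.shift & 0xFFFF   # assumed: lower half

def send_word(rx: DebugReceiver, word: int, nbits: int = 32) -> None:
    """Bit-bang a word into the receiver, MSB first (assumption)."""
    for i in reversed(range(nbits)):
        rx.clock_bit((word >> i) & 1)

rx = DebugReceiver()
send_word(rx, 0, 8)              # extra leading clocks are harmless
send_word(rx, 0x0004_BEEF)
rx.select_rises()
print(hex(rx.control), hex(rx.instruction))  # 0x4 0xbeef
```

Only the last 32 bits before the rising edge of Select matter, which is why the 8 flush bits sent first have no effect.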


    When Select=high, the core is in send mode.

    To read meaningful data, the core must have been set to "STOP" or "STEP" state by a previous command.

    The circuit serialises 8 bytes, or 64 bits, using a 6-bit counter, which is held in RESET state while Select=low.

    Contrary to most SPI interfaces, the data is not serialised with a parallel-in shift register, because of size/cost and timing reasons. Each bit would require a full DFF and a large clock fanout (which is a precious routing resource) as well as a MUX2 (to select between the data and the previous DFF). Timing gets complicated as well.

    This is solved by using only (about) one MUX2 per signal, controlled by a Gray code counter. This serialises the 64 bits to the MISO pin, before the counter wraps around (if more than 64 clock pulses are given).  Using a large (balanced) MUX also solves the resource problem, as the dedicated clock network is free for the core's use. The Gray code counter prevents glitches, as well as registering the output with a DFF buffer...
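The sketch below shows why a Gray code counter suits the MUX64 address: exactly one select line changes per clock, including at the wrap-around. It uses the standard binary-reflected Gray code; whether the inputs are wired in Gray order or reordered by the host is not stated in the log, so `serialise` simply emits them in the order the counter visits them.

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def serialise(bits64):
    """Yield the 64 inputs in the order a Gray-coded MUX64 address visits them."""
    for step in range(64):
        yield bits64[gray(step)]

codes = [gray(i) for i in range(64)]
# consecutive codes (including the wrap-around) differ by exactly one bit
one_bit_steps = all(
    bin(codes[i] ^ codes[(i + 1) % 64]).count("1") == 1 for i in range(64)
)
print(one_bit_steps, sorted(codes) == list(range(64)))  # True True
```

Every input is visited exactly once per 64 clocks, so no bit of the 8 bytes is lost or repeated before the counter wraps.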


    20180202:

    I hope that it's clearer now :-)

    I'm currently working on the MUX64 and Gray code counter.

    I must emphasize a few things. First, the priority is to reduce the cell count, and their actual surface, to the minimum, for an ASIC target. This means that only some of the shift...


  • Sequencing the core

    Yann Guidon / YGDES01/17/2018 at 18:00 0 comments

    2018 has seen a first significant change happen in the YGREC8 architecture, with the new instruction set map (see 22. Opcode map). This follows the discussions in the logs 18. Constant tables in program space, 20. Automated upload of overlays into program memory and 21. Making room for another instruction. The new core diagram shows the modifications with two added MUX at the bottom:

    The non-glorious control&decoding signals are not shown here. They are rather simple but the new LDCx instructions increase the complexity, and this is what this log is about.

    Here's a quote from a private conversation :

    Well, it IS a kludge.

    I wish I could come up with something better but I have examined other alternatives. The constraints are :
    * information density : we got 16 instruction bits and it'd be a shame to waste one half because we only got 256 instructions to address and so many switches or transistors...
    * minimal gate count : the mechanism should barely increase the number of gates/transistors, so it's necessary to time-multiplex the access because adding another read port is prohibitive
    * Ease of programming : it must be easy to use and code density should not be reduced (hence no access through the IO registers)

    It's not a problem if it takes 2 cycles because LDC is rarely time-critical and the core is already pretty fast. It's just annoying that I break the clean, smooth, lean single-cycle machinery. But at least it's not part of the initial design.

    .

    A previous log 18. Constant tables in program space also explains that reading the program memory requires temporal multiplexing because a 2nd read port (in the instruction memory) would be prohibitive. This implies that LDCx instructions must use 2 cycles:

    1. First cycle (green) brings the address from the SRC field (normally, a register, because an immediate would not make sense) to the program memory address bus. This is why the left-hand MUX is added. It is tied to the RESULT bus on the picture for convenience but the output of the registers MUX8 should be used instead. Conditions should be checked and if OK, then update of PC is inhibited, and instead, a new bit (LDCstate or something) is set.
    2. Second cycle (red) starts with the Instruction word MUXed to select the high or low byte, depending on the previous value of the R/I8 flag of the instruction. The value then goes through the datapath (and not directly to the RESULT bus to avoid adding another MUX in the critical datapath). The new MUX's latency is a bit lower than the MUX8's latency so no time is wasted. The RESULT value is written to the designated DST register.
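The two cycles can be summarised with a small behavioural model. It ignores the condition check and the LDCstate bit, and the `high_byte` argument merely stands in for the R/I8 flag that selects the half of the word; the function name and register dictionary are hypothetical.

```python
def ldc(program, regs, src, dst, high_byte):
    """Two-cycle LDCx sketch: load one byte of a 16-bit program word.

    program: list of 16-bit instruction words
    regs: dict of 8-bit register values
    high_byte: stands in for the R/I8 flag that picks the half
    """
    # Cycle 1: the SRC register drives the program memory address bus;
    # PC is held and the addressed word is fetched instead of the next instruction.
    word = program[regs[src]]
    # Cycle 2: pick the high or low byte and write it to DST.
    regs[dst] = (word >> 8) & 0xFF if high_byte else word & 0xFF
    return regs[dst]

program = [0x0000] * 256
program[0x20] = 0x12AB
regs = {"A1": 0x20, "R1": 0, "R2": 0}
print(hex(ldc(program, regs, "A1", "R1", True)))   # 0x12
print(hex(ldc(program, regs, "A1", "R2", False)))  # 0xab
```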

    .

    But it's more complicated than that...

    The first cycle is almost like others. But it must prepare the state of the 2nd cycle and save data from the instruction word because it will be wiped during the 2nd cycle. Note that this design applies to the FPGA version, so the SRAM address is latched at the end of the cycle and the output changes some ns after the start of the new cycle.

    What is not shown on the diagram is the necessary latches on the opcode and the DST address. Fortunately, the critical datapath goes to the register set and the 4 layers of MUX and one gate layer can be added on the DST write decoder.

    The normal and good way to deal with that is to save the value of the DST address in a DFF on the first cycle, then MUX the DST and delayed DST to feed the register address decoder. But transistor-wise it's not very efficient. A transparent latch uses fewer transistors and has potentially the same gate delay as a MUX. The delicate part is to drive it properly, with the right timing...

    Concerning the opcode, there is nothing to "remember" from the first cycle. The opcode can simply be forced, using only a few logic gates, to emulate a MOV instruction.

    So here is a summary of the modifications to the code :

    • Create a new FF called LDCstate....



Discussions

Bartosz wrote 11/08/2017 at 16:40 point

Will this work on Epiphany or oHm or another cheap machine?


Yann Guidon / YGDES wrote 11/08/2017 at 18:07 point

I'm preparing a version that would hopefully use less than half of an A3P060 FPGA, which is already the smallest of that family that can reasonably implement a microcontroller.

But it's a lot less fun than making one with hundreds of SPDT relays !


Bartosz wrote 11/14/2017 at 14:13 point

The question is the price and the possibility to buy


Yann Guidon / YGDES wrote 11/14/2017 at 16:08 point

@Bartosz : what do you want to buy ?

If you can simulate and/or synthesise VHDL, the source code is being developed and available for free, though I can't support all FPGA vendors.

If you want a ready-made FPGA board, that could be made too.

If you want relays, it's a bit more tricky ;-)

I have just enough RES15 to make my project and it might take a long while to succeed. There will be many PCBs and other stuff.

However if, in the end, I see strong interest from potential buyers, I might make a cost-reduced version with easily-found minirelays. I don't remember well but the Chinese models I found cost around 1/2$ a piece. Factor in PCB and other costs and you get a very rough price estimate... It's not cheap, it's not power efficient, it's slow and won't compute useful stuff... But it certainly can make a crazy nice interactive display, when coupled with flip dots :-D

So the answer is : "it depends" :-D

