A byte-wide stripped-down version of the YGREC16 architecture

#YGREC16 is getting pretty large and moving away from the original #AMBAP inspiration, making it less likely to be implemented within my lifetime. So here is a "back to minimalism" version with
* 256 bytes of Data RAM (plus parity)
* 8 registers, 8 bits each
* fewer relays/gates than the YGREC16
This core is so simple that I focus now on the debug/test access port and the register set's structure.
Like the others, it's suitable for implementation with relays, transistors, SSI TTL, FPGA and ASIC.

I give up on the idea of playing the Game of Life (the forte of #YGREC-РЭС15-bis) but I design a VHDL version because @llo sees the YGREC8 as a perfect replacement for PICs for his #SteamBot Willie !

A significant reduction of the register set's size is required so I/O must be managed differently, through specific instructions. The register map is expected to be:

  • D1  <= for NOP
  • A1
  • D2
  • A2
  • R1
  • R2
  • R3
  • PC  <= for INV

I shrunk the instruction word down to 16 bits. It is still reminiscent of its older brother, the YGREC16, but I had to make clear cuts... The YGREC8 is a 1R1W machine (like x86) instead of the RISCy YGREC16, to remove one field. Speed should be great, with a very short critical datapath, and all instructions execute in one clock cycle (except the LDCx instructions and computed writes to PC).

I have swapped the condition field and the ALU code field, which is now a more classical opcode.

20171116: The latest evolution of the instruction format has added a 9-bit immediate address field for the I/O instructions.
20180112: Imm9 is now removed again...
20181024: changed the names of some fields
20181101: modified the conditions to change Imm3 into Imm4

There are two classical instruction forms : either an IMM8 field, or a source & condition field, combined with the destination field and a small opcode. The source field can also become a short immediate field (4 bits only but essential for conditional short jumps or increments/decrements).

The opcode field has 4 bits and the following values:

Logic group :

  • XOR
  • OR
  • AND
  • ANDN

Arithmetic group:

  • CMPU
  • CMPS
  • SUB
  • ADD

Beware : There is no point to ADD 0, so ADD with short immediate (Imm4) will skip the value 0 and the range is now from -8 to -1 and +1 to +8. (see 17. Basic assembly programming idioms)
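
A minimal Python sketch of how a zero-skipping 4-bit immediate could be decoded. The bit encoding used here (two's complement, with the non-negative codes shifted up by one) is an assumption for illustration only; the text above only states the resulting range, and `decode_add_imm4` is a hypothetical helper name.

```python
def decode_add_imm4(field: int) -> int:
    """Decode a hypothetical 4-bit ADD immediate that skips the value 0.

    Assumption (not from the source): the field is read as a 4-bit
    two's-complement number and non-negative codes are incremented by one.
    """
    if not 0 <= field <= 0xF:
        raise ValueError("Imm4 field must fit in 4 bits")
    signed = field - 16 if field & 0x8 else field   # -8..+7
    return signed + 1 if signed >= 0 else signed    # -8..-1 and +1..+8

# Enumerate all 16 codes: 0 never appears, +8 becomes reachable.
values = sorted(decode_add_imm4(f) for f in range(16))
print(values)
```

Whatever the real encoding is, the point is the same: the 16 codes cover -8 to -1 and +1 to +8 instead of wasting one code on a useless ADD 0.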

Shift group (optional)

  • SHR
  • SHL
  • SAR
  • ROL

Control group:

The COND field has 3 bits (for Imm4) or 4 bits, more than YGREC16, so we can add more direct binary input signals. CALL is moved to the opcodes so one more code is available. All conditions can be negated so we have :

  • Always
  • Z (Zero, all bits cleared)
  • C (Carry)
  • S (Sign, MSB)
  • B0, B1, B2, B3 (input signals, for register-register form)

Instruction code 0000h should map to NOP, and the NEVER condition, hence ALWAYS is coded as 1.

Instruction code FFFFh should map to INV, which traps or reboots the CPU (through the overlay mechanism): the condition is implicitly ALWAYS because it's an IMM8 format : CALL PC FFh (thus rebooting/alerting with some code placed there, if any; otherwise keep the instruction at FFh equal to INV to make an endless loop)

Overall, it's still orthogonal and very simple to decode, despite the added complexity of dealing with 1R1W code.

1. Honey, I forgot the MOV
2. Small progress
3. Breakpoints !
4. The YGREC debug system
5. YGREC in VHDL, ALU redesign
6. ALU in VHDL, day 2
7. Programming the YGREC8
8. And a shifter, and a register set...
9. I/O registers
10. Timer(s)
11. Structure update
12. Instruction cycle counter
13. First synthesis
14. Syntax highlighting for Nano
15. Assembly language and syntax
16. Inspect and control the core
17. Basic assembly programming idioms
18. Constant tables in program space
19. Trap/Interrupt vector table
20. Automated upload of overlays into program memory
21. Making room for another instruction
22. Opcode map
23. Sequencing the core
24. Synchronous Serial Debugging
25. MUX trees
26. Flags, PC and IO ports
27. Binary translation (updated)
28. Even better register set
29. A better relay-based MUX64
30. Register set again
31. Rename that opcode !
32. Register set again again
33. Yet Another Fork
34. What can it run ?
35. More register set layout
36. More VHDL and more gates
37. R7 P&R
38. Program Counter and other considerations
39. Bus names (SRC-SRI, DST/SND)
40. Now faster without the "PC-swap" MUX
41. A diode-less balanced...


Added the proasic3 VHDL library for rough gate-level simulations, many incoherent or obsolete files though.

Zip Archive - 109.48 kB - 11/01/2018 at 16:04



V5 with Imm4 field

svg+xml - 10.89 kB - 11/01/2018 at 15:59



Added the ProASIC3 "tiles" library

x-compressed-tar - 60.71 kB - 10/17/2018 at 04:14



Core diagram in SVG, added LDCx MUXes

svg+xml - 17.96 kB - 01/17/2018 at 17:38


svg+xml - 6.99 kB - 01/12/2018 at 18:57



  • Data retention times of hysteretic relay latches

    Yann Guidon / YGDES - 11/06/2018 at 04:58 - 1 comment

    So far I have not actually measured how long a hysteresis-based relay latch could hold a state. So I'm doing it now.

    I have set up a little circuit with a RES15 relay (36 Ohms), a matching series resistor (39 ohms), a capacitor to set the state, and a LED to show the state.

    The circuit is powered by a digitally controlled PSU set at 2.8V (mid-way between the 2.1V release voltage and the 3.62V latching voltage). After a while, the circuit draws 33mA (some heating occurs in the coil).

    I have no idea how long the circuit can stay latched so it's not possible to use my multimeter (it would go into power-saving mode after some minutes instead of beeping). I have no timer either, so I connected an LED that would light up when the relay is released... and I count time manually :-D

    Unless the register set is held in standby state during a debug session, I can imagine that a register would be toggled at the very least once a minute and the experiment is running for 3 hours now. The possible cause of perturbation in this test would be the poor wiring quality on the solderless breadboard, so I stay away from it to prevent any minute wiggle.

    Retention of 1 bit requires 92mW so 64 bits (the YGREC8's register set) would draw 5.9W alone... This register set needs a separate power supply that is very stable and well filtered : 2.8V (5%) at 2.1A. Early experiments (with the YGREC16) have shown that the system is not stable if the latches' supply is shared with other circuits, which create a lot of switching noise.
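
    The power budget above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic, using the 2.8V and 33mA figures quoted in this log; the variable names are made up for the example.

```python
SUPPLY_V = 2.8    # holding voltage used in the experiment, volts
HOLD_I = 0.033    # current drawn by one latched relay, amperes
BITS = 64         # 8 registers x 8 bits

power_per_bit_w = SUPPLY_V * HOLD_I      # power to retain one bit
total_current_a = HOLD_I * BITS          # current for the whole register set
total_power_w = power_per_bit_w * BITS   # power for the whole register set

print(f"{power_per_bit_w * 1000:.1f} mW per bit")
print(f"{total_current_a:.2f} A and {total_power_w:.2f} W for {BITS} bits")
```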

    I don't want to use a modern PSU so I'll go for the good old way : an AC transformer, a diode bridge and a large filtering capacitor. A very big capacitor is not difficult to find (10000µF at least, more is better to keep the ripple as low as possible), the diode bridge is possible (there were selenium rectifiers in the 1930s, each plate withstands about 20V) but a low voltage transformer is a different story today. I'm not sure I can find a transformer that can provide 3VAC at 2A or more. At least, I can finely adjust the input AC with an auto-transformer.

    "Back in the days" when tubes/valves were kings, radio sets would provide a low voltage, "high current" output to power the heater(s). This is something to explore but what I have seen so far is 6V or 12V output, not 3V and only 2 wires.

    Another alley to explore is partition : there are 8 bits that can be written at any moment (plus PC) so 8 subcircuits (one per bitslice) are possible, each with their own power source. Partition would be along the bitslice, not per register, because each write would create noise on 8 bits simultaneously and the strategy is to spread those spikes evenly. So each bitslice would have at least a local filter (capacitor + inductor) that can provide a clean power at 270mA. A local diode can also drop the current if needed.

    If each bitslice has a local power input, the transformer can be partitioned into multiple smaller transformers. The bridge rectifiers become smaller too.

    If I can find suitable 3V AC transformers, then adding the drop of a silicon diode bridge gives the right voltage : 3V RMS × 1.4 ≈ 4.2V peak, and 4.2V-(2×0.7V)=2.8V
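
    The same estimate as a short Python sketch. It assumes an ideal sine wave, a 0.7V drop per silicon diode and no load or ripple, so it only gives the no-load peak value; the names are illustrative.

```python
import math

V_AC_RMS = 3.0      # transformer secondary, volts RMS
DIODE_DROP = 0.7    # assumed forward drop of one silicon diode, volts

v_peak = V_AC_RMS * math.sqrt(2)   # peak of the sine wave
v_dc = v_peak - 2 * DIODE_DROP     # two diodes conduct at any time in a bridge

print(f"peak {v_peak:.2f} V, after the bridge {v_dc:.2f} V")
```

    Under load the filtered voltage sags below this peak, which is one more reason to keep the input adjustable.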

    I remember some old 1A bricks with a selectable output voltage : 3V, 4.5V, 6V, 9V, 12V. Inside, a PCB holds four diodes and a 1000µF capacitor, and the secondary windings were probably multiple 1.5V or 3V in series. I could rewire that in parallel to provide a stronger 3V output ...

    test interrupted after 3h46m due to human error...

    Test is restarted : 6 hours and no sign of weakness. Shall I call this experiment a success ?


  • Imm4

    Yann Guidon / YGDES - 11/01/2018 at 06:19 - 0 comments

    Let's break orthogonality again !

    Imm3 is pretty lousy and can make only very short loops, 4 instructions maximum. Where and how can I get more bits ?

    The condition code is a good candidate: there is one "negate" flag, and 3 source bits. 4 sources are external arbitrary, configurable signals, but are they required ?

    Let's drop those extra conditions in the "short immediate" format so Imm3 becomes Imm4. The extra conditions are still available with the register form, because there are only 8 source registers. And I don't want to get rid of the extra conditions because they will be very handy later, when used as a microcontroller: that thing is meant to deal with I/Os and it's a nice feature to inherit from the CDP1802 ("The 1802 has a single bit, programmable and testable output port (Q), and four input pins which are directly tested by branch instructions (EF1-EF4).").

    The diagrams must be updated or redrawn and the assembler must be modified...

  • A diode-less balanced relay amplifier

    Yann Guidon / YGDES - 11/01/2018 at 05:23 - 0 comments

    As you might already know, if you have followed the #YGREC16 - YG's 16bits Relay Electric Computer 's saga, one of the challenges with relays such as those I use is : fanout. These relays have a relatively poor current amplification capability : the coil is rated at about 60mA and they can switch maybe 2× that value. So when I have to distribute a signal to many other relays, I have to:

    • connect the relays in series, so the voltage increases instead of the current
    • be smart (that's how I came up with the balanced control binary trees).
    • use hysteretic mode (CC-PBRL) to reduce both the voltage and current.

    The hysteretic mode is a pretty smart circuit that is more energy-efficient than plain on-off switching. However it suffers from a very practical effect : as the circuit operates, the coil's resistance dissipates power and heats up. This heat in turn changes the resistance of the coil, as well as the current, and the working point of the system. Some adjustable resistors are required to finely tune the operating point and the system might behave differently depending on the environment and temperature.

    The circuit I explore here is an interesting alternative because it avoids some problems from the CC-PBRL circuit :

    • no more alternate supply rails (caused by the polarised capacitor)
    • no more temperature sensitivity
    • no freewheeling diode
    • no large non-polarized capacitor
    • current draw is 1/2 of full power and does not vary significantly (it's roughly as efficient as CC-PBRL)

    The last part is important : power consumption is a problem that directly affects the power supply's design, and I'll use "basic" circuits with transformer/diodes/filtering capacitors. Their values determine the weight, cost, size etc. of the system.

    The first trick (again borrowed from ECL circuits) is to always have one half of the relays powered : this effectively limits the maximum power requirement. The control relay can switch from one half to another and each contact has an equal wear. The power is interrupted only during the short time when the relay's contact is not connected to either branch. This is only a few ms at most, which is easy to filter on the power supply. Slave relays can be connected in series, as many as required.

    But switching noise is annoying, particularly when countless coils are charged and released, their 30000µH can store quite a lot of energy. So the surges must be avoided when one branch is disconnected. Usually, diodes are the default choice but it's not satisfying. I want the excess energy to be recycled and start to energise the other branch...

    The solution is a simple capacitor across the control relay's pins. The above picture shows a simulation with Falstad where I try to simulate the behaviour using realistic values. Here is the code if you want to try at home:

    $ 1 0.000005 0.10425469051899915 50 5 43
    c 352 256 320 256 0 1e-7 -3.3621449028580765
    v 240 400 240 128 0 0 40 3 0 0 0.5
    l 288 192 288 256 0 0.03 -0.0006352952130544926
    r 288 192 288 128 0 60
    S 336 336 336 256 0 1 false 0 2
    r 384 192 384 128 0 60
    l 384 192 384 256 0 0.03 0.04897750082379863
    w 288 256 320 256 0
    w 352 256 384 256 0
    w 240 400 336 400 0
    w 240 128 288 128 0
    w 288 128 384 128 0
    s 336 336 336 400 0 0 false
    o 7 1 0 4098 6.0193733601080295 0.0001 0 2 7 3
    o 8 1 0 4098 6.00000000171388 0.0001 0 2 8 3
    o 0 1 0 4098 6.000000000002793 0.0001 0 2 0 3
    o 10 1 0 4609 5 0.1 0 2 10 3
    38 0 0 0.000001 0.000101 Capacitance

    With this simulation, I found that the size of the capacitor was not as I expected, and a classic ceramic 100nF is enough : the current in the coils varies slowly because of the very high inductance. And there is no high voltage spike to filter out ! The capacitor must be unpolarised but can be a reasonably low value (2×PSU provides a great margin).
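
    A rough Python sketch of the orders of magnitude involved. The values are read from the netlist above (30mH coils, 60 ohm resistors) and the 3V source is an interpretation of the `v` line, so treat it as an assumption; this is not a replacement for the simulation.

```python
L_COIL = 0.03      # henries (the 30000 µH coil)
R_BRANCH = 60.0    # ohms in series with each coil in the netlist
V_SUPPLY = 3.0     # volts, assumed DC source value from the netlist

tau_s = L_COIL / R_BRANCH                   # L/R time constant of one branch
i_steady_a = V_SUPPLY / R_BRANCH            # steady-state current of the energised branch
energy_j = 0.5 * L_COIL * i_steady_a ** 2   # energy stored in an energised coil

print(f"tau = {tau_s * 1e3:.2f} ms")
print(f"I = {i_steady_a * 1e3:.0f} mA, E = {energy_j * 1e6:.1f} uJ")
```

    The L/R time constant is a fraction of a millisecond, which is consistent with a coil current that cannot change abruptly during the contact's flight time.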

    I also simulated the current during switching :

    The supplied current (yellow trace at the bottom) drops immediately (this must be filtered on the power supply) but the current increases slowly (approximately...


  • Now faster without the "PC-swap" MUX

    Yann Guidon / YGDES - 10/27/2018 at 00:16 - 7 comments

    In a previous log, I sketched a different approach to handle the CALL instruction.

    The current datapath draws inspiration from the YASEP and it looks like this on the main YGREC diagram (only a close-up):

    One pair of MUX2 (controlled by almost the same signal) swap the RESULT bus with the value of PC+1.

    • The usual ALU operations get the operands from the register set, go through the ALU, and the result is written back to the register set.
    • The NextPC (or PC+1) value is generated from PC and goes back to the PC (and to the memory address bus).
    • When CALL is executed, the inputs of the register set and PC are swapped with a pair of MUX.

    This is a structure I introduced in the YASEP and carried over to the YGREC, with confidence because it seemed to work well. But this structure sits like a Sphinx at the end of the critical datapath and costs a precious MUX2 delay that I want to optimise out. The previous related log is the first dent in the old design.

    The latest design removes this bottleneck, by actually moving it away from the critical datapath and narrowing the selected values to those that make sense.

    • PC can get written with PC+1 or the end of the SRI selector, because this is the field that indicates the destination of a jump. This leaves more time to fetch the next instruction, compared to going through the whole datapath and the final MUX. Slower memories can be used.
    • The value of PC+1 gets inserted in the datapath earlier, in a source selection MUX where there is less timing pressure. The destination register is given by the SND field of the instruction so the next PC can be inserted near the ROP2 unit for example, with a sort of "pass" function. This puts more pressure on the other MUX2s which require a more complex bypass signal though...

    The above diagram shows the flow of data during the CALL instruction. The red path shows the updated PC being written to the register set while the SRI bus is written to PC. The final MUX tree will be adjusted when the latency of each unit is well characterised.

    No more MUX2 ! In the ProASIC3 this saves maybe 1ns in the instruction cycle time...

    PS: The only thing to avoid is a conditional CALL or write to PC because if it does not get executed, the write to PC is inhibited, which will stall the program because the next instruction is not fetched ! So if a "not taken" CALL is executed, a second cycle must be issued. Remember : otherwise, only the LDCx instructions can inhibit the write to PC !

    Or better: the PC MUX is reversed back to PC+1 by the condition. The pseudocode would read something like:

    NPC <= SRI when (condition=true) and (opcode=CALL or (opcode=SET and AddrSND=PC))  else PC1;

    Since the value of SRI takes about 5 logic gates to compute, there is enough "time" to check the condition (MUX4+XOR: 3 gates deep) and the opcode (2 gates). The core can run fast with a semi-parallel, overlapping fetch of the next instruction memory and this covers the SET and CALL opcodes, with Imm8, Imm3 or register with condition.
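
    The NPC selection can be transcribed into a small Python model to check the corner cases. This is only a behavioural sketch: the 8-bit wrap-around of PC and the `next_pc` signature (opcode as a string, `snd_is_pc` as a flag) are assumptions made for the example.

```python
def next_pc(pc: int, sri: int, opcode: str, snd_is_pc: bool, condition: bool) -> int:
    """Behavioural model of the NPC pseudocode (8-bit PC assumed)."""
    pc1 = (pc + 1) & 0xFF   # output of the incrementer
    jump = condition and (opcode == "CALL" or (opcode == "SET" and snd_is_pc))
    return (sri & 0xFF) if jump else pc1

# Taken CALL: PC gets the SRI value.
print(hex(next_pc(0x10, 0x80, "CALL", False, True)))
# Not-taken CALL: the MUX falls back to PC+1, so the fetch is not stalled.
print(hex(next_pc(0x10, 0x80, "CALL", False, False)))
# ADD with PC as destination is not covered by this fast path.
print(hex(next_pc(0x10, 0x80, "ADD", True, True)))
```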

    Unfortunately the other jump methods (like IN, or computed jumps) are not possible anymore and the ISA orthogonality is broken :-( everything must go through a temporary register, which consumes another instruction and cycle. Short loops can't be done anymore :

    R1 = 42 ; block size
    ;; block copy :
    SET D2 D1 ; copy data
    ADD A1 1  ; update the pointers
    ADD A2 1
    ADD R1 -1  ; decrement the counter
    ADD PC -4 IFZ  ; conditional loop back

    Without the last instruction, a temporary register is required to hold the target address and the number of available registers is already very small...

    Note that this problem appears mostly on semiconductor-based implementations where each MUX2 adds latency.  The ALU/SHL/ROP2...


  • Bus names (SRC-SRI, DST/SND)

    Yann Guidon / YGDES - 10/23/2018 at 22:30 - 0 comments

    Today I try to make the bus names more coherent and bring them closer to the YASEP conventions.

    The Y8 has 2 register address fields where one can also be an immediate value and the other becomes the destination. The names can be a bit confusing and the early YASEP had something like SRC1 and SRC2, which didn't really help because each bus could only do certain operations.


    SND : this is the bus that brings a Source register, can perform Negation (this is where the XOR applies) and can become a Destination. For the ROP2 operations, this is the operand that gets negated (complemented) in ANDN and for SUB, this is the register that gets subtracted from the other operand.


    SRI : this Source operand can be either a Register or an Immediate value (3 or 8 bits). This field gives the positive value in a subtraction or comparison, and can be a literal value instead of a register number.

    It will take a while until I've updated all the diagrams...

  • Program Counter and other considerations

    Yann Guidon / YGDES - 10/23/2018 at 04:06 - 0 comments

    I made some mistakes with the previous diagrams and I might have uncovered a new concept...

    Let's start with the register set : the last register is PC, which is the address of the current instruction. It is precious for several reasons : for "loop entry", for "call" and return, for (conditional) jumps, for calculated/indexed or indirect jumps...

    Jumping is easy : just write the desired address to PC. It can be an immediate value, a register (even PC), a value coming from memory (through a register), even data from an input port. It's THAT flexible. This justifies having the PC in the register set so the ISA is highly orthogonal and it saves a lot of opcode space.

    However reading the PC register is not as simple. Physically reading the current value is easy but most of the time, this is not what we want. A computed/indexed jump simply adds a value to PC and any offset is easily adjusted by an additional instruction. However Call and "loop entry" need PC+1 !

    "Loop entry" is emulated with PC+1=>Reg but this is valid for 3-address instructions, and Y8 has only 2 address fields, one source and one source/destination. Getting PC+1 directly is important for relocatable code : an 8-bit immediate is still possible but you want to be able to move code blocks around (even though this uses a lot of registers, but you could spill on the stack).

    Call absolutely requires the value of PC+1 : this value gets written to the destination register (might be an address or data register as well) while the source (register or immediate) goes to PC (and the Program Address bus). But instruction fetch takes one cycle time and things get a bit messy here.

    It is NOT reasonably possible to read PC+1 in the register set : the incrementer adds some inherent latency to the already tight critical datapath (read operands, calculate, select the result from the various sources, setup&hold...). The value on the bus MUST come from the PC register itself.

    Yet we need PC+1 for important features, which would take one more instruction otherwise, and waste time&program space. And the incrementer doesn't have much latency, the CDP is a few gates at most, which leaves ample time to route the result to the Program Memory and any other MUX.

    This is where I realise something very interesting... The instructions that need PC+1 keep data movement in the register set section. The ALU is short-circuited. The other instructions use the ALU but don't need PC+1. Their CDP don't need to be added !

    ALU operations (ADD/SUB, ROP2, SHL) take 2 operands and create a result by going through a lot of circuits. Let's call that a "grand tour" :-)

    OTOH the other instructions (the "control group" : SET, CALL, IN/OUT, LDCx) stay close to, or inside, the register set. Let's call that the "petit tour" :-D

    SET and CALL need PC+1 and don't use the ALU so we can directly tap the desired value from the incrementer's output, for these instructions. This is possible because the value of PC+1 doesn't go much further than the register set (maybe to the data RAM). LDCx and IN/OUT don't make much sense, however, because PC+1 would have to go through the main multiplexer and this would slow everything down.

    Thus, SET and CALL have a special access to PC+1 because they do a "little tour". They bypass the ALU (which can get rid of its "bypass" flag) and get their value directly from the incrementer.

  • R7 P&R

    Yann Guidon / YGDES - 10/18/2018 at 01:34 - 0 comments

    Yet another log about the register set...

    Besides the bit latches, the MUX make a significant part of the circuit. One basic block is the MUX4 that clusters 4 bits, 3 MUX2 are required :

    (only one bit shown, this circuit is doubled for the 2 read ports)

    Nothing fancy here... but it becomes "interesting" when two of these structures are joined:

    I have applied a "little optimisation"  to balance the fanout of the sel1 and sel2 signals (it's the subject of several previous logs and 2 published articles).

    I have chosen to not apply further optimisations because it would make the register set even more complex with only negligible benefits.

    For 8 bits, the sel3 has a fanout of 8, sel1 and sel2 each have 24 (instead of 16 and 32). Another method would use some sort of pre-decoding of the columns, with one control signal per MUX2, the lateral MUX2 would be driven by signals that are active if B, D, F and H (respectively) are selected. This is an interesting technique for ASIC but not for ProASIC3.

    Because of the permutation of sel1 and sel2, the inputs must be re-shuffled. Fortunately the following drawing shows that only two inputs need to be swapped:

    Codes 110 and 101 are numbers 5 and 6, mapped to R2 and R3, which are general-purpose storage and relabelling them is perfectly harmless. In the end, there are 2 MUX signals per bitslice to route over the register array.
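
    A small Python model helps to see both effects: the balanced fanout and the fact that only two inputs move. It is a sketch under assumptions: sel1 is taken as the least significant bit of the register code, and the permutation is applied to the upper MUX4 branch (codes 4 to 7); the function names are invented for the example.

```python
def mux2(sel, a, b):
    """2-to-1 multiplexer: returns b when sel is 1, else a."""
    return b if sel else a

def mux4(s_inner, s_outer, inputs):
    """MUX4 built from three MUX2: s_inner drives two of them, s_outer one."""
    lo = mux2(s_inner, inputs[0], inputs[1])
    hi = mux2(s_inner, inputs[2], inputs[3])
    return mux2(s_outer, lo, hi)

def mux8_balanced(code, inputs):
    """Two MUX4 joined by a MUX2, with sel1/sel2 permuted on the upper branch."""
    sel1, sel2, sel3 = code & 1, (code >> 1) & 1, (code >> 2) & 1
    low = mux4(sel1, sel2, inputs[0:4])    # normal wiring
    high = mux4(sel2, sel1, inputs[4:8])   # sel1 and sel2 swapped
    return mux2(sel3, low, high)

# Which physical input does each 3-bit code reach?
selected = [mux8_balanced(code, list(range(8))) for code in range(8)]
print(selected)

# Fanout of each select line over an 8-bit register set (one read port).
BITS = 8
fanout_plain = {"sel1": 4 * BITS, "sel2": 2 * BITS, "sel3": 1 * BITS}
fanout_balanced = {"sel1": (2 + 1) * BITS, "sel2": (1 + 2) * BITS, "sel3": 1 * BITS}
print(fanout_plain, fanout_balanced)
```

    In this model only codes 5 and 6 reach a different input, and both sel1 and sel2 drive 24 MUX2 instead of 32 and 16.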

    If the registers were homogeneous, it would be possible to further swap signals across the array and further reduce the fanout of sel1 and sel2 (sel3 would increase). However this forces more swapping all over the place and routing would be uselessly complicated.

    But wait... the codes on the left are not convenient because A1 and A2 should both be on the right of the MUX4 (so they constitute a nice cluster of 4 DFF). Instead of moving the DFF (relative to the first prelayout of the previous log) it's easier to just perform the swap the other way :-) The routing diagram is very simple in the end:

    Sel1 and Sel2 are swapped on the left branch, which swaps D1 and A0, without any effort. Neat :-)

    This structure is applied to the more general bitslice:

    place&route of one bitslice of the register set (click to enlarge)


    Damnit, I've already found problems with the above diagram :-(

  • More VHDL and more gates

    Yann Guidon / YGDES - 10/17/2018 at 04:29 - 0 comments

    I decided to re-test the incrementer with a version that is mapped to ProASIC3. I added my custom library to the latest archive YGREC8_VHDL.20181017.tgz

    The INC8 unit now looks like this :

    -- YGREC8/INC8.vhdl
    Library ieee;
        use ieee.std_logic_1164.all;
    Library work;
        use work.all;
        use work.ygrec8_def.all;
    Library proasic3;
        use proasic3.all;
    entity INC8 is
      port (
        A : in  SLV8;
        Y : out SLV8;
        V : out SL);
    end INC8;
    architecture tiles of INC8 is
      Signal A012, A34, A345, A3456 : SL;
    begin
      -- Row 0
      e_R0B: entity INV    port map(A=> A(0),                    Y=>Y(0)); -- Y(0) <=                not A(0) ;
      -- Row 1
      e_R1B: entity XOR2   port map(A=> A(0), B=>A(1),           Y=>Y(1)); -- Y(1)           <= A(1) xor A(0) ;
      -- Row 2
      e_R2B: entity AX1    port map(A=> A(0), B=>A(1), C=>A(2),  Y=>Y(2)); -- Y(2) <= A(2) xor (A(1) and A(0));
      -- Row 3
      e_R3A: entity AND3   port map(A=> A(0), B=>A(1), C=>A(2),  Y=>A012); -- A012 <= A(2) and  A(1) and A(0) ; -- FO7
      e_R3B: entity XOR2   port map(A=> A(3), B=>A012,           Y=>Y(3)); -- Y(3) <= A(3) xor A012;
      -- Row 4
      e_R4A: entity AND2   port map(A=> A(3), B=>A(4),           Y=> A34); -- A34  <= A(3) and A(4);          -- FO2
      e_R4B: entity AX1    port map(A=> A012, B=>A(3), C=>A(4),  Y=>Y(4)); -- Y(4) <= A(4) xor (A(3) and A012);
      -- Row 5
      e_R5A: entity AND3   port map(A=> A(3), B=>A(4), C=>A(5),  Y=>A345); -- A345 <= A(3) and A(4) and A(5); -- FO1
      e_R5B: entity AX1    port map(A=> A012, B=>A34,  C=>A(5),  Y=>Y(5)); --   Y(5) <= A(5) xor (A012 and A34);
      -- Row 6
      e_R6A: entity AND3   port map(A=>  A34, B=>A(5), C=>A(6), Y=>A3456); --  A3456 <= A34  and A(5) and A(6); -- FO2
      e_R6B: entity AX1    port map(A=> A012, B=>A345, C=>A(6),  Y=>Y(6)); --    Y(6) <= A(6) xor (A012 and A345);
      -- Row 7
      e_R7A: entity AND3   port map(A=>A012, B=>A3456, C=>A(7),  Y=>   V); --    V    <= A(7) and  A012 and A3456;
      e_R7B: entity AX1    port map(A=>A012, B=>A3456, C=>A(7),  Y=>Y(7)); --   Y(7) <= A(7) xor (A012 and A3456);
    end tiles;
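
    The boolean equations given in the comments of this listing can be checked exhaustively with a bit-level Python model. This sketch follows the comments, not the actual ProASIC3 tile definitions, and `inc8` is a name chosen for the example.

```python
def inc8(a: int):
    """Bit-level model of the INC8 equations (from the VHDL comments)."""
    A = [(a >> i) & 1 for i in range(8)]
    # Shared partial products, as in the VHDL signals
    a012 = A[0] & A[1] & A[2]
    a34 = A[3] & A[4]
    a345 = A[3] & A[4] & A[5]
    a3456 = a34 & A[5] & A[6]
    Y = [
        A[0] ^ 1,               # row 0: inverter
        A[1] ^ A[0],            # row 1
        A[2] ^ (A[1] & A[0]),   # row 2
        A[3] ^ a012,            # row 3
        A[4] ^ (A[3] & a012),   # row 4
        A[5] ^ (a012 & a34),    # row 5
        A[6] ^ (a012 & a345),   # row 6
        A[7] ^ (a012 & a3456),  # row 7
    ]
    v = A[7] & a012 & a3456     # overflow: all input bits set
    return sum(bit << i for i, bit in enumerate(Y)), v

# Exhaustive check against an ordinary 8-bit increment.
for a in range(256):
    y, v = inc8(a)
    assert y == (a + 1) & 0xFF
    assert v == (1 if a == 0xFF else 0)
print("INC8 equations match a+1 for all 256 inputs")
```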

    The gates are organised in a single column, on the right of the register set block:

    Several intermediate versions are also available of course, for other platforms. But I stick to ProASIC3 because it is the best way to design for ASIC later : I can see which gates are used, tune the layout, estimate the surface...

    It seems to work well and this leads to the design of more bitslices. I also have to test with Libero and I must explore the explicit cell placement directives...

  • More register set layout

    Yann Guidon / YGDES - 10/14/2018 at 20:59 - 4 comments

    The register set is really the central, critical part of the core, it's the nexus and a physical representation of many logical structures. This explains why I focus so much on this apparently innocuous unit...

    Previous posts have examined the register set's low-level structure and here I'm going further by taking the ISA into account.

    The register map is : D1 A1 D2 A2 R1 R2 R3 PC

    • R1, R2 and R3 are "normal registers", implemented the good old way. Nothing to add here.
    • PC is "a bit different", since the input has a multiplexer and the output has a direct bypass path to the memory. Oh and it is not an actual latch but an incrementer. Some special cases must be considered...
    • A1 and A2 are almost "normal" : they have an extra "read" port/path that goes to the memory's address bus, possibly through a MUX.
    • D1 and D2 are... quite a mess. Their input is multiplexed because the value can be written from RAM. And it also depends on the configuration of the RAM array : dual port or single port ?

    I will now consider the specific case of the VHDL implementation that targets the A3P FPGA family. Reasons include the ample availability of the chips and the ease of VHDL simulation, so I would get working results faster.

    The interesting part is the RAM blocks that feature some interesting output latch modes. Only the address ports need to be externally latched and this saves some gates for the data registers. The following drawing summarizes the whole idea :

    The wires are usually understood as bytes but here we'll think of them as individual bits because we'll design a first bitslice and replicate it (9 times, including parity).

    The bitslice for this part has the following interface signals.

    Inputs :

    • of course : SRC and SND addresses (3 bits each)
    • Result
    • INC output / PC input
    • D1
    • D2
    • SwapSelect (2 signals)
    • Write enable for A1, A2, R1, R2, R3, PC

    Outputs :

    • of course : SND and SRC
    • A1
    • A2
    • PC input
    • PC output
    • WriteData (post-swap)

    It becomes apparent that the register set's structure contains more than latches and MUXes : some signal conditioning is performed in place, in particular the "swap" of PC. The address MUXes could even be performed in place. The corresponding VHDL code is easy to write from there.

    An FPGA has different constraints than other technologies and it's a good first step toward full ASIC implementation, particularly with the ProASIC3 family (Actel/Microsemi's A3P, now more than 10 years old but still pretty good for many purposes). So I have tried to make a preliminary layout of one bitslice of the register set (called R7):

    The circuit is dominated by MUX2s. There are only 6 DFF but more than 20 MUX2, and soon even more because it's easy to extend the datapath from here. There are also a lot of vertical control lines.

    The INC unit has already been designed, and must be routed/laid to form a vertical column of minimal width.

    The lower layer (not shown) will contain many decoders to drive the columns : MUX2 controls, DFF enable signals...

    The program memory is at the bottom and the dual-ported Data RAM at the left. ALU and SHL at the right, and IN/OUT ports at the top:

    For an FPGA, parity is not necessary at first. However a big missing piece is the debug system, mainly sitting between the program memory and the decoder. A line of MUX might also be sitting between the decoder and R7 to catch the Result, SND and SI8 busses.

    Luckily, the PC's incrementer uses quite few gates and can fit in a column of only 1 gate wide :

    The schematic shows the whole circuit with only 13 3-input gates. They can be paired straightforwardly. I'll have to update the VHDL code.

    Another full column of MUX2 selects the result bus' source : ALU, SHL or PORT_IN.

    Yet another column of MUX2 selects the source between : register, R/I3 or R/I8.

    The whole set is almost square : 15 tiles wide and 16 tiles high, or 240 tiles....


  • What can it run ?

    Yann Guidon / YGDES, 08/16/2018 at 12:34

    #YGREC-РЭС15-bis has 16-bit wide registers that make it suitable for quite a few things, including running Tetris and Game of Life. However, the #YGREC8 is only 8 bits wide and this limits the range of programs even more. I'll focus only on "toy games" because they are the most attractive applications, while I also consider other uses such as PLC or monitoring.

    Tetris is still somewhat possible but it would be an impractical stretch because the 10 columns exceed the 8 bits of the registers, and the processor is too slow to animate that smoothly : the display would be sheared.

    Tic-tac-toe is a contender.

    Battleship is another good candidate : it's not a hard real-time game and animations are not critical. However I would have to build 2 units and make them communicate somehow... So it would be good to develop communication protocols, later.

    Another good challenge is the SNAKE game. It doesn't require too much computing power and could run fast enough to be enjoyable. The problem is to memorise a linked list of coordinates, which could exceed the DRAM capacity... But there is a not-too-hard solution :-)

    It requires 4 bitplanes :

    • one bitplane is a boolean that says "food"
    • one bitplane is a boolean that says "snake" (can be mapped to the flip dots display)
    • one bitplane says "up/down"
    • one bitplane says "left/right"

    So overall, there are 4 bits per pixel. With a 16×16 pixel array (256 pixels), that's 128 bytes of DRAM, or half the addressing space of one address register.
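The memory budget can be checked with a short Python sketch. The plane-major layout (each bitplane stored as 32 consecutive bytes) and the plane order are assumptions made for illustration; the log does not say how the planes are arranged in Data RAM, and `locate` is a hypothetical helper.

```python
# Sketch of the 4-bitplane memory budget for a 16x16 playfield.
# Assumptions: plane-major layout, plane order as listed, LSB = leftmost pixel.

WIDTH, HEIGHT = 16, 16
PLANES = ("food", "snake", "up_down", "left_right")

BYTES_PER_PLANE = WIDTH * HEIGHT // 8          # 256 bits per plane = 32 bytes
TOTAL_BYTES = BYTES_PER_PLANE * len(PLANES)    # 4 planes = 128 bytes


def locate(plane: str, x: int, y: int):
    """Return (byte_address, bit_mask) of pixel (x, y) in the given plane."""
    base = PLANES.index(plane) * BYTES_PER_PLANE
    offset = y * (WIDTH // 8) + x // 8         # 2 bytes per row
    return base + offset, 1 << (x % 8)
```

With this layout the last pixel of the last plane lands at address 127, so the whole playfield fits in the lower half of the 256-byte Data RAM.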

    • Food is pretty simple : it's set by a random generator from time to time in places where there is no "snake" bit set.
    • Snake is used as a collision condition, it's set by the "head" code and cleared by the "tail" code.
    • The "head code" has a coordinate and a direction : the direction is changed by the button inputs.
      - At each game step, the buttons are scanned, the direction updated, the direction increments/decrements the coordinate and the "snake" bit is set at the new coordinate.
      - If a "food" bit is also present, the food is swallowed (cleared) and a new food is created pseudo-randomly. The tail code is skipped for one cycle, so the snake gets longer.
      - If a wall is touched or the snake bit is already set, game over.
    • The trick is with the tail code :-) Each "head" leaves a sort of "trail" on the "left/right" and "up/down" bitplanes so the tail can follow it, without requiring the storage of a long list of coordinates. It uses quite a lot of room but much less than fully-decoded coordinates. So the "tail code" remembers the coordinates of the tail but instead of reading the buttons, it reads the "trail" left by the head to follow the body.
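The head/tail scheme described above can be sketched as a small Python model. Several things here are assumptions: the two trail bitplanes are treated as one 2-bit direction code per cell (the log names the planes but not their encoding), the planes are kept as plain lists rather than packed bytes, and all names (`Snake`, `DIRS`, `place_food`) are hypothetical.

```python
import random

W = H = 16
# Assumed 2-bit direction code held by the two trail bitplanes.
DIRS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right


class Snake:
    def __init__(self, x=8, y=8, direction=3, seed=0):
        n = W * H
        self.food = [False] * n       # "food" bitplane
        self.snake = [False] * n      # "snake" bitplane
        self.trail = [0] * n          # "up/down" + "left/right" bitplanes
        self.head = self.tail = (x, y)
        self.direction = direction
        self.alive = True
        self.rng = random.Random(seed)
        self.snake[self.idx(x, y)] = True

    @staticmethod
    def idx(x, y):
        return y * W + x

    def place_food(self):
        """Drop food on a pseudo-random cell that holds no snake and no food."""
        free = [i for i in range(W * H) if not self.snake[i] and not self.food[i]]
        if free:
            self.food[self.rng.choice(free)] = True

    def step(self, button=None):
        """One game step: head code first, then tail code unless food was eaten."""
        if button is not None:
            self.direction = button
        hx, hy = self.head
        # The head leaves its direction as a trail in the cell it departs from.
        self.trail[self.idx(hx, hy)] = self.direction
        dx, dy = DIRS[self.direction]
        nx, ny = hx + dx, hy + dy
        if not (0 <= nx < W and 0 <= ny < H) or self.snake[self.idx(nx, ny)]:
            self.alive = False        # wall or body hit: game over
            return
        i = self.idx(nx, ny)
        self.snake[i] = True
        self.head = (nx, ny)
        if self.food[i]:
            self.food[i] = False      # swallow the food
            self.place_food()
            return                    # tail code skipped: the snake grows
        # Tail code: clear the tail cell, then follow the trail stored there.
        tx, ty = self.tail
        t = self.idx(tx, ty)
        self.snake[t] = False
        dx, dy = DIRS[self.trail[t]]
        self.tail = (tx + dx, ty + dy)
```

Only two coordinates (head and tail) are kept in variables; the body is reconstructed from the trail, which is the point of the trick. In this sketch a 180° turn with a body longer than one cell counts as a collision, which is a design choice and not something the log specifies.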

    It shouldn't be too hard, right ?...

    I have a 16×24 flip dots array that leaves 16×8 pixels to display the stats of the game.

Yann Guidon / YGDES wrote 11/04/2018 at 07:11 point

Another note for later :
Writing to A1 or A2 starts a fetch from RAM. In theory the latency is the same as for the instruction memory, so one wait state would be introduced. However the processor can also write directly, so the wait state would only apply to a read of the paired data register...

Yann Guidon / YGDES wrote 11/04/2018 at 06:55 point

Note for later : don't forget the transparent latch on the destination register address field, for the (rare) case of LDCx, because the 2nd cycle doesn't preserve the opcode etc.

Yann Guidon / YGDES wrote 11/04/2018 at 07:18 point

OK, not a transparent latch, but a DFF and a mux, plus some logic to control it.

-- DFF, every cycle :

SND_latched <= SND_field;

LDCx_flag <= '1' when (LDCx_flag='0' and opcode=opc_LDC and writeBack_enabled='1')   else '0';

-- MUX2 :

WriteAddress <= SND_latched when LDCx_flag = '1' else SND_field;


Note : LDCx into PC must work without wait state because it's connected directly to SRI, as an IMM8, and no extra delay is required. PC wait state is required for ADD/ROP2/SHL and IN.
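As a cross-check of the DFF-plus-MUX logic in this comment, here is a cycle-level Python sketch. It assumes the first three assignments are registered at each clock edge and that the MUX2 is combinational, as the comment's annotations suggest; the class name, method names and the `OPC_LDC` placeholder value are hypothetical.

```python
OPC_LDC = "LDC"  # placeholder for the real opcode value (assumption)


class WriteAddressLatch:
    def __init__(self):
        self.snd_latched = 0   # DFF holding the previous SND field
        self.ldcx_flag = 0     # DFF set only during the 2nd cycle of LDCx

    def write_address(self, snd_field):
        """MUX2: combinational, reads the current state of the DFFs."""
        return self.snd_latched if self.ldcx_flag else snd_field

    def clock(self, snd_field, opcode, writeback_enabled):
        """Clock edge: update both DFFs from the current instruction fields."""
        next_flag = int(
            self.ldcx_flag == 0 and opcode == OPC_LDC and bool(writeback_enabled)
        )
        self.snd_latched = snd_field
        self.ldcx_flag = next_flag
```

In the first cycle of an LDCx the flag is still 0, so the write address comes straight from the instruction. In the second cycle the flag is 1 and the latched address is used, whatever the instruction word contains by then; the flag then clears itself because its own value is part of the set condition.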

Frank Buss wrote 10/27/2018 at 12:51 point

Do you really plan 8 byte-wide registers? This would require thousands of relays :-)

Yann Guidon / YGDES wrote 10/27/2018 at 14:26 point

no :-)

8 registers, 8 bits each = 64 storage bits.
1 relay per bit => 64 relays

The trick is to use the hysteretic mode of the relays :-)

Frank Buss wrote 10/27/2018 at 16:17 point

Ok, makes sense. Maybe change the project description, someone might think you are planning a 64 bit architecture.
BTW, could this be parametrized for the address and data size? If you implement it in VHDL, you could use generics for this, would be no additional work to use just the generic names instead of hard coded numbers. Except maybe some work for extending the instruction opcodes.

Yann Guidon / YGDES wrote 10/27/2018 at 17:16 point

Frank : DAMNIT you're right !

I updated the description...

Yann Guidon / YGDES wrote 10/27/2018 at 17:19 point

For the parameterization : it doesn't make sense at this scale. Every fraction of bit counts and must be wisely allocated.

Larger architectures such as #YASEP Yet Another Small Embedded Processor and #F-CPU have much more headroom for this.

Bartosz wrote 11/08/2017 at 16:40 point

Will this work on epiphany or oHm or another cheap machine?

Yann Guidon / YGDES wrote 11/08/2017 at 18:07 point

I'm preparing a version that would hopefully use less than half of an A3P060 FPGA, which is already the smallest of that family that can reasonably implement a microcontroller.

But it's a lot less fun than making one with hundreds of SPDT relays !

Bartosz wrote 11/14/2017 at 14:13 point

The question is the price and the possibility to buy.

Yann Guidon / YGDES wrote 11/14/2017 at 16:08 point

@Bartosz : what do you want to buy ?

If you can simulate and/or synthesise VHDL, the source code is being developed and available for free, though I can't support all FPGA vendors.

If you want a ready-made FPGA board, that could be made too.

If you want relays, it's a bit more tricky ;-)

I have just enough RES15 to make my project and it might take a long while to succeed. There will be many PCB and other stuff.

However if, in the end, I see strong interest from potential buyers, I might make a cost-reduced version with easily-found minirelays. I don't remember well but the Chinese models I found cost around 1/2$ a piece. Factor in PCB and other costs and you get a very rough price estimate... It's not cheap, it's not power efficient, it's slow and won't compute useful stuff... But it certainly can make a crazy nice interactive display, when coupled with flip dots :-D

So the answer is : "it depends" :-D
