Close

Proof of concept - CDP1802 compatible CPU

A project log for Microcoding for FPGAs

A microcode compiler developed to fit into FPGA toolchain and validated to develop CDP1805-like CPU and text-based video controller

zpekiczpekic 05/30/2020 at 18:001 Comment

Before digging into the implementation which can be found here, why 1802?

For better understanding of the 1802 CPU from the "black box" perspective (and especially to understand its states during each instruction execution) it is useful to look at the data sheets as a refresher: 

http://www.cosmacelf.com/publications/data-sheets/cdp1802.pdf

http://datasheets.chipdb.org/Intersil/1805-1806.pdf

Going inside the box, here is the great reverse engineering description:

http://visual6502.org/wiki/index.php?title=RCA_1802E

SAMPLE INSTRUCTION EXECUTION

One way to explain how microcode-driven CPU works is to follow the execution of a single instruction. for example SDB:

SUBTRACT D WITH BORROW SDB 75 M(R(X)) - D - (NOT DF) → DF, D

Note that it executes in machine 2 states ( == 16 clock cycles):

S0 FETCH MRP → I, N; RP + 1 → RP MRP RP 0 1 0

S1 7 5 SDB MRX - D - DFN → DF, D MRX RX 0 1 0

(1) Execution starts with fetch microinstruction:

//    Read memory into instruction register
//    ---------------------------------------------------------------------------
fetch:    fetch_memread, sel_reg = p, reg_in <= alu_y, y_bus, reg_inc;

fetch_memread ... this is an alias to set the bus_state = fetch_memread; fetch_memread is nothing more that an symbolic name for a location in a look-up table:

signal state_rom: rom16x8 := (
        --                         SC1            SC0            RD    WR    OE    NE    S1S2        S1S2S3
"01000011",     --        exec_nop,    //    0        1        0    0    0    0    1        1
"01100011",    --        exec_memread,    //    0        1        1    0    0    0    1        1
"01011011",    --        exec_memwrite,    //    0        1        0    1    1    0    1        1
"01010111",    --        exec_ioread,    //    0        1        0    1    0    1    1        1
"01100111",    --        exec_iowrite,    //    0        1        1    0    0    1    1        1
"10100011",    --        dma_memread,    //    1        0        1    0    0    0    1        1
"10010011",    --        dma_memwrite,    //    1        0        0    1    0    0    1        1
"11000001",    --        int_nop,    //    1        1        0    0    0    0    0        1
"00100000",    --        fetch_memread,    //    0        0        1    0    0    0    0        0
"00000000",                
"00000000",                
"00000000",                
"00000000",                
"00000000",                
"00000000",                
"00000000"                
);

As expected, this will drive the S1, S0, nRD, nWR, N CPU signals to the right levels / values. Note that OE ("output enable") of D bus is 0 meaning it will be in hi-Z state, therefore input. 

 sel_reg = p ... value of P register will be presented as address to the 16*16 register stack:

-- Register array data path
with cpu_sel_reg select
        sel_reg <=     X"0" when sel_reg_zero,
                X"1" when sel_reg_one,
                X"2" when sel_reg_two,
                reg_x when sel_reg_x,
                reg_n when sel_reg_n,
                reg_p when sel_reg_p,
                sel_reg when others;

reg_y <= reg_r(to_integer(unsigned(sel_reg)));

 reg_y signal (16 bits) will show the value of the P (program counter). The simple beauty of 1802 is that this will go directly to the A outputs, no loading of separate MAR (memory address register) is needed, as such register doesn't even exist.

reg_inc ... this is the alias (== shortcut) for: 

reg_inc: .alias reg_r <= r_plus_one;

Important is to notice the <= notation - that means there will be a register updated at the end of the cycle, in this case R(P) (value of reg_y in snippet above) will be added 1:

update_r: process(UCLK, cpu_reg_r, reg_r, sel_reg, reg_b, reg_t, alu_y, alu16_y)
begin
    if (rising_edge(UCLK)) then
        case cpu_reg_r is
            when reg_r_zero =>
                reg_r(to_integer(unsigned(sel_reg))) <= X"0000";
            when reg_r_r_plus_one =>
                reg_r(to_integer(unsigned(sel_reg))) <= std_logic_vector(unsigned(reg_y) + 1);
            when reg_r_r_minus_one =>
                reg_r(to_integer(unsigned(sel_reg))) <= std_logic_vector(unsigned(reg_y) - 1);
            when reg_r_yhi_rlo =>
                reg_r(to_integer(unsigned(sel_reg))) <= alu_y & reg_lo;
            when reg_r_rhi_ylo =>
                reg_r(to_integer(unsigned(sel_reg))) <= reg_hi & alu_y;
            when reg_r_b_t =>
                reg_r(to_integer(unsigned(sel_reg))) <= reg_b & reg_t;
            when others =>
                null;
        end case;
    end if;
end process;

Finally, the IN (instruction register must be loaded):

reg_in <= alu_y, y_bus,

<= again indicates state change ("load") at the end of the cycle - IN will be loaded from the output of ALU:

-- update IN (instruction) register
update_in: process(UCLK, cpu_reg_in, alu_y)
begin
    if (rising_edge(UCLK)) then
        case cpu_reg_in is
            when reg_in_alu_y =>
                reg_in <= alu_y;
            when others =>
                null;
        end case;
    end if;
end process;

and the ALU will pass through the D (data bus) value (see ALU description below for more details):

y_bus:        .alias alu_f = pass_s, alu_s = bus; 

In summary, the fetch microinstruction engaged multiple elements of the CPU to orchestrate a read cycle to memory, addressed by the register designated as program counter, and at the end of cycle update the instruction register with the value loaded from memory, and incrementing the program counter.

(2) Next is to load the memory addressed by X

This cycle is very similar to fetch, except:

- register is selected by X not P

- destination register is B, not IN

- machine state is "execute" not fetch:

//    Given that instruction register is loaded late, execute 1 more cycle before forking. It is useful to load B <= M(X)
// ----------------------------------------------------------------------------
load_b:        exec_memread, sel_reg = x, reg_b <= alu_y, y_bus,
        if traceEnabled then traceState else fork;    

The interesting part is what happens next - in regular CPU execution would proceed with the routine implementing SDB, but with microcoded design it is very convenient to code the trace / debug routine in the same microcode - the "traceEnabled" is simply an external pin which when pulled high will cause the tracer routine to dump the state of all the registers (see here) . If low, the "fork" will be activated.

What does "fork" do? It simply loads the address of the implementation routine into the microprogram counter. The MCC compiler will automatically generate the lookup table ("mapper memory") for all the instructions given by their instruction format pattern. In this case, the location 0x75 (code of SDB) in mapper memory will contain value of 0x90 (144) which is the entry point to execute SDB:

-- L0412@0090.SDB: reg_df <= alu_cout, reg_d <= alu_y, alu_f = r_plus_ns, alu_r = b, alu_s = d, alu_cin = df,if continue then fetch else dma_or_int
--  bus_state = 0000, if (0110) then 00000100 else 00010000, reg_d <= 001, reg_df <= 11, reg_t <= 00, reg_b <= 00, reg_x <= 00, reg_p <= 00, reg_in <= 0, reg_q <= 00, reg_mie <= 00, reg_trace <= 00, reg_extend <= 00, sel_reg = 000, reg_r <= 000, alu_r = 10, alu_s = 01, alu_f = 101, alu_cin = 1, ct_oper = 0000;
144 => X"0" & X"6" & X"04" & X"10" & O"1" & "11" & "00" & "00" & "00" & "00" & '0' & "00" & "00" & "00" & "00" & O"0" & O"0" & "10" & "01" & O"5" & '1' & X"0",

 This pattern can be easily seen if inspecting the contents of generated mapper file - it can be easily seen how all PLO, PHI, LDN, GLO, GHI all map to same locations, but for example IDL is unique etc.

Latencies in the circuits and esp. analysing the maximum latency part are crucial to make any of them work. This comes to play in big way in the "fork" too, as the latency is:

mapper memory address (== IN contents) to data propagation

+

control unit then/else MUX propagation

control unit next address MUX propagation

Given that the IN register became valid at the end of previous cycle, there is not enough time for the next cycle to have the proper fork address. That's why regardless of the instruction the B <= M(X) cycle is inserted. This way the fork propagation goes on in parallel with the useful memory read. This presents a problem though as it causes this particular design to be NOT CYCLE ACCURATE - all instructions will read M(X) but only a subset such will use it. This causes a certain loss of perfomance (which is offset by less cycles executing some other instructions though).

In most CPUs this is not a problem, 1 clock cycle delay is sufficient for the propagation, but 1802 has rigid 8 to 1 clock/machine cycle timing so the simplest solution was to insert the B <= M(X) at this point.

(3) Finally, execute:

With memory data in B register, executing is simple:

        .map 0b0_0111_0101;
SDB:        reg_df <= alu_cout, reg_d <= alu_y, y_b_minus_d, alu_cin = df, 
        if continue then fetch else dma_or_int;

Two registers will be updates, as indicated by "<=":

DF ... should get its value from ALU carry-out:

-- update DF (data flag == carry) register
update_df: process(UCLK, cpu_reg_df, alu_cout, reg_d)
begin
    if (rising_edge(UCLK)) then
        case cpu_reg_df is
            when reg_df_d_msb =>
                reg_df <= reg_d(7);
            when reg_df_d_lsb =>
                reg_df <= reg_d(0);
            when reg_df_alu_cout =>
                reg_df <= alu_cout;
            when others =>
                null;
        end case;
    end if;
end process;

D ... should get its value obviously from the ALU output:

-- update D (data == accumulator) register
update_d: process(UCLK, cpu_reg_d, alu_y, reg_df)
begin
    if (rising_edge(UCLK)) then
        case cpu_reg_d is
            when reg_d_alu_y =>
                reg_d <= alu_y;
            when reg_d_shift_dn_df =>
                reg_d <= reg_df & reg_d(7 downto 1);
            when reg_d_shift_dn_0 =>
                reg_d <= '0' & reg_d(7 downto 1);
            when others =>
                null;
        end case;
    end if;
end process;

The ALU operation is described below, but for it to work, proper values need to be presented to ALU inputs R and S (D and B registers). This  is accomplished by a pair of MUXs driven by following microcode fields:

// 8-bit ALU for arithmetic and logical operations
// the binary / decimal mode comes directly from instruction bit 8 (reg_extend) as 68XX arithmetic instructions are all decimal
alu_r:        .valfield 2 values t, d, b, reg_hi default t;
alu_s:        .valfield 2 values bus,     d, const, reg_lo default bus;    // const comes from "else" value

MCC will generate the code to help pull it together:

--
-- L0076.alu_r: 2 values t, d, b, reg_hi default t
--
alias cpu_alu_r:     std_logic_vector(1 downto 0) is cpu_uinstruction(11 downto 10);
constant alu_r_t:     std_logic_vector(1 downto 0) := "00";
constant alu_r_d:     std_logic_vector(1 downto 0) := "01";
constant alu_r_b:     std_logic_vector(1 downto 0) := "10";
constant alu_r_reg_hi:     std_logic_vector(1 downto 0) := "11";

--
-- L0077.alu_s: 2 values bus,  d, const, reg_lo default bus
--
alias cpu_alu_s:     std_logic_vector(1 downto 0) is cpu_uinstruction(9 downto 8);
constant alu_s_bus:     std_logic_vector(1 downto 0) := "00";
constant alu_s_d:     std_logic_vector(1 downto 0) := "01";
constant alu_s_const:     std_logic_vector(1 downto 0) := "10";
constant alu_s_reg_lo:     std_logic_vector(1 downto 0) := "11";

The only thing remaining for the developer to do is simply to use the values generated and implement the 2 MUXs - the whole design is pretty much on "rails" which minimized the chances of bugs:

-- ALU data path
with cpu_alu_r select
        r <=     reg_t     when alu_r_t,
            reg_d     when alu_r_d,
            reg_b     when alu_r_b,
            reg_hi     when alu_r_reg_hi;        -- R(sel_reg).1    TODO

with cpu_alu_s select
        s <=     data_in    when alu_s_bus,            -- data bus input
            reg_d     when alu_s_d,
            cpu_seq_else when alu_s_const,    -- "constant" is reused else field
            reg_lo     when alu_s_reg_lo;        -- R(sel_reg).0    TODO
            
with cpu_alu_cin select
        alu_cin <= cpu_alu_f(1) or cpu_alu_f(0)    when alu_cin_f1_or_f0, -- this will be 0 for add (no carry) and 1 for substract (no borrow)
               reg_df                          when alu_cin_df;
            

(4) What to do after execution?

There are two possibilities:

if continue then fetch else dma_or_int;

If "continue" conditional input is true, then loop back to execute next fetch / execute (remember, R(P) now points to next instruction already)

If "continue" conditional input is low, then service DMA or INT request. 

 At specific moments of machine cycles, the state of DMAOUT, DMAIN and INT are captured, and in addition INT is conditioned on the state of MIE (master interrupt enable flag):

-- capture state of interrupt
capture_int: process(nINTERRUPT, cycle_di, state_s1s2)
begin
    if (falling_edge(cycle_di) and (state_s1s2 = '1')) then
        reg_int <= reg_mie and (not nINTERRUPT);
    end if;
end process;

-- capture state of dma requests
capture_dma: process(nDMAIN, nDMAOUT, cycle_di, state_s1s2s3)
begin
    if (falling_edge(cycle_di) and (state_s1s2s3 = '1')) then
        reg_dma <= (not nDMAIN) & (not nDMAOUT);
    end if;
end process;

continue <= not (reg_int or reg_dma(1) or reg_dma(0));    -- no external signal received 

So at the end of instruction, a decision can be made if to proceed to service these async requests. 


HANDLING DMA  AND INT

1802 establishes the following priority of requests:

(1) DMA IN

(2) DMA OUT

(3) INTERRUPT

It would be simple to respond to these as consecutive "if request then ... else ..." microcode instructions, but that would mean delay to start servicing INTERRUPT even if there is no DMA IN/OUT request. Clearly, what is needed is a "switch" - like statement that checks for 8 possible states of these 3 requests and in 1 machine cycle starts executing the request.

This is accomplished by "injecting" a special branch destination to the "else" branch if the condition is "seq_cond_continue_sw":

-- "switch statement" for 8 possible combinations of DMA and INT states
seq_else <= ("0001" & reg_dma & reg_int & '1') when (to_integer(unsigned(cpu_seq_cond)) = seq_cond_continue_sw) else cpu_seq_else;

The branch destinations will be:

0x11 - when there is no request

...

0x1F -when all requests come in simultaneously

The "switch value" is injected on positions 3 downto 0, so that there is place for 2 microinstructions for each case. Now the only thing code needs to do is to respond to the highest received request, ignore the others (first 4 cases below):

//	Respond to DMA or INT requests (using simple switch mechanism)
//	These requests are honored at the end of each instruction execution
//	--------------------------------------------------------------------------
		.org 0b0001_0000;
dma_or_int:     if continue_sw then fetch else dma_or_int;	// else mux is else(7 downto 3) & dma_in & dma_out & int & '1' 

		.org 0b0001_0001;	// no special cycle needed, start a new fetch
fetch1:		fetch_memread, sel_reg = p, reg_in <= alu_y, y_bus, reg_inc,
		goto load_b;

		.org 0b0001_0011;	// INT
int_ack:	int_nop, 
		y_const, reg_t <= xp, reg_x <= alu_yhi, reg_p <= alu_ylo, // T <= XP, X <= 2, P <= 1
		reg_mie <= disable, 
		if true then fetch else 0x21;

		.org 0b0001_0101;	// DMA_OUT
dma_out:	dma_memread, sel_reg = zero, reg_inc,	//	DEVICE <= M(R(0)), R(0) <= R(0) + 1
		if continue then fetch else dma_or_int;

		.org 0b0001_0111;	// DMA_OUT, INT ignored
		dma_memread, sel_reg = zero, reg_inc,	//	DEVICE <= M(R(0)), R(0) <= R(0) + 1
		if continue then fetch else dma_or_int;

Note that after INT is acknowleged, at least 1 more instruction will be executed ("if true then fetch"), this is a bug as according to state diagram, from INT it is possible to get directly to DMA IN/OUT without executing any instruction. The fix will be to replace with "if continue then fetch else dma_or_int".


EXECUTING EXTENDED INSTRUCTIONS

Many successful CPUs had successors implementing enhanced instructions / capabilities. The canonical example is Z80 which enhanced 8080 with new registers and instructions following the "escape codes" 0xCB, 0xDD, 0xED, 0xFD. 1802 is no exception, the successors 1804, 05, 06 introduced new instructions with escape code 0x68 which was a NOP in original 1802.

For CPUs implemented as state machines, extending the instruction set effectively means re-writing the state machine. This is a huge advantage of microcoded designs, where this task becomes a matter of another "if" microinstruction. Here is the implementation of 0x68 op-code:

		.map 0b0_0110_1000;	// override for INP 0 is the linking opcode for extended instructions			
EXTEND:		if mode_1805 then next else NOP;

		fetch_memread, sel_reg = p, reg_in <= alu_y, y_bus, reg_extend <= one,
		reg_inc,
		goto load_b;

2 microinstructions can be seen:

(1) if not in 1805 mode (the control pin is down) then interpret as a NOP, otherwise continue

(2) flip the reg_extend bit to 1 and execute fetch of the next op-code

The reg_extend bit is treated as bit 8 (9th bit) of the instruction register, which makes sense as the 1805 op-code matrix is 2*256 entries. That is the reason why the mapper memory has 512 entries (9 address bits depth):

.code 8, 64, cdp180x_code.mif, cdp180x_code.cgf, cdp180x_code.coe, cpu:cdp180x_code.vhd, cdp180x_code.hex, cdp180x_code.bin, 8;
.mapper 9, 8, cdp180x_map.mif, cdp180x_map.cgf, cdp180x_map.coe, cpu:cdp180x_map.vhd, cdp180x_map.hex, cdp180x_map.bin, 1;
.controller cpu_control_unit.vhd, 8;

With 9th bit flipped to 1, it is easy to map extended instructions, for example DBNZ (0x68, 0x2X):

	.map 0b1_0010_????;
DBNZ:	reg_extend <= zero, sel_reg = n, reg_dec,
	if alu16_zero then skip2 else next;	// zero detection is connected to the output of the incrementer/decrementer

	exec_memread, sel_reg = p, reg_b <= alu_y, y_bus, reg_inc;	// B <= M(R(P)), R(P) <= R(P) + 1
			
	exec_memread, sel_reg = p, reg_r <= rhi_ylo, y_bus;	// R(P).0 <= M(R(P))

	sel_reg = p, reg_r <= yhi_rlo, y_b,							// R(P).1 <= B 
	if continue then fetch else dma_or_int;

It is obviously important to flip the reg_extended bit back to '0' before next fetch otherwise instruction interpretation would go terribly awry. 

Interestingly, in 1805 BCD instructions are mapped to exact same opcodes as their binary counterparts. No doubt, internally in the original CPU this was done for a reason to simplify the control unit. In case of microcode, this means 2 op-codes mapped to same execution (e.g ADI and DADI):

		.map 0b0_1111_1100;
		.map 0b1_1111_1100; // "bcd mode" will be 1 because reg_extend == 1 therefore BCD add will be executed
ADI:		exec_memread, sel_reg = p, reg_b <= alu_y, y_bus, reg_inc,
		goto ADD;

Side note about the instruction above: recall that each instructions starts with M(R(X)) loaded in B. For immediate value instructions that is not useful, so B <= M(R(P)) must be executed in first microinstruction, with R(P) incremented at the end of the same microinstruction to point to the next opcode. After that, there is no longer a difference between ADI and ADD (or SMI and SM etc.) so simply jump and proceed there.

As an optimization, 1 machine cycle could be saved by avoiding "goto ADD" and simply duplicating that microinstruction instead.

DEFINING MACHINE CYCLES

A unique characteristic of 1802 CPUs is the relationship between machine and clock cycles:

1 machine cycle = 8 clock cycles

As an additional complication, signal changes do not occur always on the same transition during those 8 cycles - sometimes they are on L->H and sometimes on H->L. I solved this by combining the 8 counter values (in 1 machine cycles) with CLK state to achive 16 states, which drive a simple lookup table that hard-codes the values of bus control signals:

type rom16x8 is array(0 to 15) of std_logic_vector(7 downto 0);
signal cycle_rom: rom16x8 := (
"00100011",     -- 00				MA_HIGH				EF	UC
"00100011",	-- 01				MA_HIGH				EF	UC
"10100011",	-- 10	TPA 			MA_HIGH				EF	UC	
"10110001",	-- 11	TPA			MA_HIGH	RD				UC
"00110001",	-- 20				MA_HIGH	RD				UC
"00010001",	-- 21					RD				UC
"00010001",	-- 30					RD				UC
"00010001",	-- 31					RD				UC
"00010001",	-- 40					RD				UC
"00010001",	-- 41					RD				UC
"00011001",	-- 50					RD	WR			UC
"00011001",	-- 51					RD	WR			UC
"00011101",	-- 60					RD	WR     DI		UC
"01011101",	-- 61			TPB		RD	WR     DI		UC
"01011001",	-- 70			TPB		RD	WR			UC
"00000000"	-- 71												
);

 One can read the timing of signals (for example TPA == bit 7) from top to bottom as the machine cycle unfolds.

The microcode doesn't know or care about it. It advances on UC (bit 0) which pulses low in state "71".

However, each microinstruction brings its own "machine cycle state", which in effect from state 00 to 71. For example INP:

		.map 0b0_0110_1???;
INP:		exec_ioread, sel_reg = x, y_bus, reg_d <= alu_y, 
		if continue then fetch else dma_or_int;	

 exec_ioread is an index into another lookup table ("state_rom") which has RD enabled and WR disabled (and obviously N MUX selecting bits 3..0 from instruction register, instead of 000). This will be combined with the cycle table to produce the exact timing of the pulses:

-- machine cycle (8 clocks, fixed)
cnt16 <= cnt8 & CLOCK;
cycle <= cycle_rom(to_integer(unsigned(cnt16)));

-- CPU state (driven by microcode)
state <= state_rom(to_integer(unsigned(cpu_bus_state)));

-- driving output control signals, which are a combination of cycle and state
TPA <= cycle_tpa;
TPB <= cycle_tpb;
Q <= reg_q;
N <= reg_n(2 downto 0) when (state_ne = '1') else "000"; -- note that 60 and 68 will still generate N=000
SC <= state_sc;

-- ADDRESS BUS - always reflects currently selected R(?), R(?).1 in first 5 periods, R(?).0 in remaining 11
MA <= reg_hi when (cycle_ahi = '1') else reg_lo;	
A <= reg_y;

-- READ/WRITE at specific timing in the cycle, if enabled by CPU state
nMRD <= not (state_rd and cycle_rd);
nMWR <= not (state_wr and cycle_wr);

-- DATA BUS - drive and capture at specific moments in the cycle
DBUS <= alu_y when ((state_oe and cycle_wr) = '1') else "ZZZZZZZZ";

ALU

Arithmetic - logic units are obviously important part of any CPU, and they can vary a lot in capabilities and complexities. The one in 1802 would be one of the simplest, using straightforward 2's complement binary adder with few binary operations thrown in. However 1805 adds a complications - 8-bit BCD add/substract, with carry / borrow obviously.

Inspecting the 1802/05 instruction set, the following operations are needed, and described in the microcode:

alu_f:        .valfield 3 values xor, and, ior, pass_r, r_plus_s, r_plus_ns, nr_plus_s, pass_s default xor;    // f2 selects logic/add, f1 and f0 flip r/s

 This translates to autogenerated VHD:

--
-- L0087.alu_f: 3 values xor, and, ior, pass_r, r_plus_s, r_plus_ns, nr_plus_s, pass_s default xor
--
alias cpu_alu_f:             std_logic_vector(2 downto 0) is cpu_uinstruction(7 downto 5);
constant alu_f_xor:          std_logic_vector(2 downto 0) := "000";
constant alu_f_and:          std_logic_vector(2 downto 0) := "001";
constant alu_f_ior:          std_logic_vector(2 downto 0) := "010";
constant alu_f_pass_r:       std_logic_vector(2 downto 0) := "011";
constant alu_f_r_plus_s:     std_logic_vector(2 downto 0) := "100";
constant alu_f_r_plus_ns:    std_logic_vector(2 downto 0) := "101";
constant alu_f_nr_plus_s:    std_logic_vector(2 downto 0) := "110";
constant alu_f_pass_s:       std_logic_vector(2 downto 0) := "111";

 (read: 3 control bits names "cpu_alu_f" spanning bit positions 7 to 5 of the current microinstruction determine the ALU operation)

The implementation is pretty apparent - it boils down to a 8-bit wide 8-to-1 MUX:

with cpu_alu_f select
        alu_y <= r xor s     when alu_f_xor,
        r and s  when alu_f_and,
        r or s   when alu_f_ior,
        r        when alu_f_pass_r,
        add_y    when alu_f_r_plus_s,
        add_y    when alu_f_r_plus_ns,
        add_y    when alu_f_nr_plus_s,
        s        when alu_f_pass_s;

 The interesting part is that 3 out of 8 combinations above pass the result of adder. cpu_alu_f(1 downto 0) also control the "complement" of the adder to implement subtract operation. This is handy to implement

SD (hint: check what y_b_minus_d alias resolves into)

    .map 0b0_1111_0101;
SD:        reg_df <= alu_cout, reg_d <= alu_y, y_b_minus_d, alu_cin = f1_or_f0, // cin = 1
           if continue then fetch else dma_or_int;

vs. 

SM (see y_d_minus_b alias, it is obviously flip of the one used for SD)

        .map 0b0_1111_0111;
        .map 0b1_1111_0111; // "bcd mode" will be 1 because reg_extend == 1 therefore BCD add will be executed
SM:     reg_extend <= zero, reg_df <= alu_cout, reg_d <= alu_y, y_d_minus_b, alu_cin = f1_or_f0, // cin = 1  
        if continue then fetch else dma_or_int;

It is not useful to have both inputs complemented ("Y = -R - S") so that combination is discarded, and instead a convenient "Y = S" bypass is implemented (alu_f_pass_s).

For purely 1802, simplest adder with 2 8-bit XORs inverting the inputs would suffice, but because 1805 has BCD, that would not work because BCD needs to be corrected within 4 bits and the so-called "half carry" passed between BCD digits.

Many processors implement this using a special "DAA" (decimal adjust accumulator) step which is either an extra instruction (8080, 8085, Z80), or executed as extra step for special BCD operations (which is the approach 1805 uses, the machine cycle listing for example for DADD (0x68F4) instruction and others like that makes it clear that last cycle is a "DAA".)

That cycle can be saved if inspiration can be taken from 6502 processor, which has a "D" flag - when set, all ADD/SUB become BCD operations instead of binary, as described in the following classic patent #3991307 by Peddle et al. The difference is that I used lookup tables (clean way for FPGAs) instead of logic to implement BCD add and substract (as explained below)

Looking at 1805 instruction set, it becomes obvious that all ADD/SUB operations are BCD in the extended instruction set, and binary in the regular 1802 set. Which means that the extra instruction bit (reg_extend of the instruction register) can be used as the "D" flag bit:

adder_lo: nibbleadder Port map ( 
                cin => alu_cin,
                a => r(3 downto 0),
                b => s(3 downto 0),
                na => cpu_alu_f(1),
                nb =>    cpu_alu_f(0),
                bcd => reg_extend,    -- all 68XX add/sub is in BCD mode
                y => add_y(3 downto 0),
                cout => cout
            );

adder_hi: nibbleadder Port map ( 
                cin => cout,
                a => r(7 downto 4),
                b => s(7 downto 4),
                na => cpu_alu_f(1),
                nb =>    cpu_alu_f(0),
                bcd => reg_extend,
                y => add_y(7 downto 4),
                cout => alu_cout
            );

The "cout" (carry out) from lower nibble is passed as carry in to upper nibble as expected - but the trick is that this half carry will depend on the mode:

07

07

----

0E -- binary, there was no half carry

07

07

---

14 - decimal, half carry was generated in the lower nibble

How is the "nibbleadder" implemented?

First, each input (S and R) can be in passed in:

- unchanged (binary and BCD add)

- 1's complemented (binary subtract - done using simple XOR with "1111")

- 9's complemented (BCD substract - this is done using a lookup table)

with sel_r select
    r <=    a         when "00",    -- binary add
        a xor X"F"     when "01",    -- binary sub
        a         when "10",    -- bcd add
        a_compl9(to_integer(unsigned(a))) when "11";    -- bcd sub

Then, the 2 6-bit numbers (why 6? this is the usual FPGA trick - on the LSB side the operand side is extended to capture the carry in, and on the MSB side with "0" so that the resulting sum reflects the generated carry-out in MSB of the result) when operands are added.

sum_bin <= std_logic_vector(unsigned('0' & r & '1') + unsigned('0' & s & cin));

The sum and carry is also used to lookup in BCD correction table (so for example E with carry out of 0 becomes 4 with carry out of 1, and the maximum possible BCD value of 9 + 9 + 1 = 3 carry out 1, becomes 9 with cary out of 1)

sum_bcd <= adcbcd(to_integer(unsigned(sum_bin(5 downto 1))));

Finally, if the ALU is in the BCD mode, the corrected result (sum and carry out) is MUXed to the output, otherwise the binary addition result is.

y <= sum_bin(4 downto 1) when (bcd = '0') else sum_bcd(3 downto 0);
cout <= sum_bin(5) when (bcd = '0') else sum_bcd(4);

Similar to 6502 (but not to 8080 and its descendants), the 1802 DF (= carry) flag has the "natural" state of the 2's complement borrow bit preserved. This means that for binary and BCD substract operations DF = 1 means no borrow has occured, and DF = 0 means there is a borrow. That also means that DF should be set to "1" for operations that start the subtract chain (SD), and to "0" which start the add chain (ADD). Therefore following carry in options are needed:

- 0 for ADD, ADI

- 1 for SD,SM, SMI, SBI

- DF for ADC, ADCI (DF means "carry")

- DF for SDB, SMB, SDBI, SMBI (DF means "borrow")

With some "cleverness" this boils down to just 2 options - either DF directly, or the OR of the low 2 bits of the function because if any of those is "1" means it is a subtract therefore 1 must be passed in as carry in, and if both are 0, that is an "add" so by default carry in should be 0:

alu_cin:    .valfield 1 values f1_or_f0, df default f1_or_f0;    // f1_or_f1 will generate 0 for add, and 1 for subtract

Discussions

zpekic wrote 06/09/2020 at 05:41 point

A very bad video, but shows the basic working of the system https://www.youtube.com/watch?v=CQuC0hO-Ga4

  Are you sure? yes | no