04/22/2018 at 10:17 •
We made the leap from 16 bits to 32 bits! At the last update, we had our RISCy 16-bit OPC6 running on various FPGAs and also in emulation on Raspberry Pi. It's a word-addressed machine, with instructions of one or two words, such that a register, or an operand, is big enough for an address. Limited to 64k words, but still with 3 HLLs to program it with: C, BCPL, PLASMA. We also had a single-stepping monitor program for it, and we'd run it both standalone and as a second processor to Acorn's 6502-based BBC Micro.
To compute ever larger numbers of digits of Pi, or more generally just to have more available memory, and to make arithmetic just a bit easier on large values, we cooked up a 32-bit version - welcome to the OPC7!
With 32 bits of instruction, we use 3 for predication and 5 for the opcode, leaving 4+4 for the registers and 16 bits for an immediate. Because the immediate is no longer word-sized, we added a Byte Permute instruction and a Move Top. We also added four instructions with a 20 bit immediate, reusing the source register field. We added a Software Interrupt mechanism for OS calls and the like. As before, short branches and relative jumps are done by adding to the PC, and the subroutine call uses a link register - there are now no push and pop instructions.
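As a rough illustration of that 32-bit layout, here is a minimal Python sketch that packs and unpacks the fields. The widths (3 + 5 + 4 + 4 + 16) come from the description above; the field order, the bit positions and the names `encode` and `decode` are assumptions made for illustration, not taken from the OPC7 spec.

```python
# Field widths are from the text; the ordering (predicate in the top bits,
# immediate in the bottom 16 bits) is an assumption made for this sketch.
PRED_SHIFT, OPCODE_SHIFT, DST_SHIFT, SRC_SHIFT = 29, 24, 20, 16

def encode(pred, opcode, dst, src, imm):
    """Pack the five fields into one 32-bit instruction word."""
    assert 0 <= pred < 8 and 0 <= opcode < 32
    assert 0 <= dst < 16 and 0 <= src < 16
    return ((pred << PRED_SHIFT) | (opcode << OPCODE_SHIFT) |
            (dst << DST_SHIFT) | (src << SRC_SHIFT) | (imm & 0xFFFF))

def decode(word):
    """Split a 32-bit instruction word back into its fields."""
    return {
        "pred": (word >> PRED_SHIFT) & 0x7,
        "opcode": (word >> OPCODE_SHIFT) & 0x1F,
        "dst": (word >> DST_SHIFT) & 0xF,
        "src": (word >> SRC_SHIFT) & 0xF,
        "imm": word & 0xFFFF,
    }
```

Under the same assumed layout, the four instructions with a 20 bit immediate would treat the source register field and the 16-bit immediate as one combined field.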
08/24/2017 at 17:19 •
Another great leap forward: OPC5/6 has been seen running on several Xilinx chips: the Spartan 3 on an OHO GOP board, and the LX9 on a cheapo Starter Board and on an Avnet micro board. But now, we have it working on the BlackIce dev board: a nice little board with a Lattice FPGA, an ARM microcontroller, a USB-to-serial interface chip, and a fast 16bit wide SRAM. Look:
At first, we used only the on-chip RAM, but in the course of an afternoon and evening we got the CPU running at 40MHz with the external SRAM and talking to a host over a serial port. A little later, with the help of a firmware update, we were able to talk on serial over USB. (You'll notice two USB connectors: one is for the ARM and for downloading design data while the other is free for communication with the FPGA.)
As an extra bonus, there's a fully open source toolchain for this Lattice chip.
08/13/2017 at 17:37 •
Last time, we noted that Rob Finch had embarked on a port of his C-superset "C64" compiler to target the OPC6 - that's going pretty well, and in one afternoon last week we got from hello world to a pi-computing spigot.
But we have other big news: Steve F has embarked on a port of PLASMA, the well-known VM and high level language for 6502 systems, originally for Apple and then for Acorn's Beeb, and hopefully before long for the OPC6 too.
To help test and bring up Rob's compiler, Dave was running the OPC6 emulation on a Pi Zero connected to a Beeb: the Pi is not only emulating our CPU but also able to run a command line debugger, so we can set breakpoints and watchpoints, single step and disassemble. Here's the setup:
08/02/2017 at 12:38 •
07/28/2017 at 18:39 •
Revaldinho's written up the story of our performance and code density evolution, using Bruce Clark's pi spigot program as a benchmark.
You'll notice that
- the OPC5 has evolved via two intermediate forms into the OPC6
- our code size is still about 1.5x the 6502 but much better than it was
- our cycle count is around 3x better than the 6502 when we write our best code
- it's not too hard to just translate 6502 code into OPC5 or OPC6, but it won't be great code
- our clocks-per-instruction with the best code and best machine is very respectable
In the course of writing these Pi programs, and other programs, we've identified some common errors we make - some of which are because the single-page assembler is a bit feature-limited:
- missing r0 in the source (the minimal assembler emits bad code)
- duplicate label
- failing to stack r13 (our link register)
- failing to stack scratch registers
- using ld where we intend mov
- assembling for wrong start address
- failing to account for I/O hole in memory map and placing code there by mistake
07/27/2017 at 09:05 •
How to show off and use our OPC-5 CPU? We need I/O! A filesystem... Acorn's BBC Micro is powered by a 2MHz 6502 and has 32k RAM and 32k ROM - but there's much more to it than that. There's a simple bus extension called The Tube and some facility in the OS to allow the Beeb to act as a front-end processor to almost any other kind of CPU - especially useful for a faster CPU with more RAM.
With a small OS handler on the 'parasite' machine it can use the I/O facilities of the Beeb as a 'host' machine - which means we get keyboard, screen, filesystem, serial I/O as well as a handy 5V power supply. Here's the OPC-5 running as such a parasite on a small single layered PCB by hoglet:
We had two splendid software updates from hoglet: first, a port of the small OS needed by a second processor, from 6502 assembly to OPC-5 assembly, which enabled the above setup. Second, an emulator of the OPC-5 in C - not quite single-page density, but readily compressible into 88 lines - which allows anyone with a Pi-based coprocessor to have a go with the OPC-5, even if they don't have an FPGA. (The Pi-based coprocessor offers a host of different CPU models, including a 274MHz 6502 with a megabyte of RAM!)
07/16/2017 at 18:48 •
Yes, we can compute digits of pi! See below. First, a few enhancements:
We've had three predicate bits for a while now, and we've used them to offer predication on two processor status bits, the carry bit and the zero bit, with predication available both on a bit being set and on it being clear. We've decided to rejig that: add a sign bit too, and remove the 'combination' predication. So now, any instruction can be predicated on any one of the three bits, in either sense, or be unconditional, or not execute.
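Three flags in two senses, plus "always" and "never", makes exactly eight cases, which is what three predicate bits can express. A minimal Python sketch of that decision follows; the numeric codes and the name `should_execute` are hypothetical, chosen only for illustration, and the real encoding may differ.

```python
# Hypothetical numbering of the eight predicates; only the set of cases
# (three flags, either sense, always, never) comes from the text.
ALWAYS, NEVER, Z_SET, Z_CLEAR, C_SET, C_CLEAR, S_SET, S_CLEAR = range(8)

def should_execute(pred, zero, carry, sign):
    """Return True if an instruction with this predicate should run."""
    decisions = {
        ALWAYS: True,
        NEVER: False,
        Z_SET: zero,
        Z_CLEAR: not zero,
        C_SET: carry,
        C_CLEAR: not carry,
        S_SET: sign,
        S_CLEAR: not sign,
    }
    return bool(decisions[pred])
```

An instruction whose destination is the PC then behaves as a conditional branch with no dedicated branch opcode.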
After some to-and-fro to decide exactly how it should work, we've also added maskable interrupts and software interrupts, and a return-from-interrupt instruction. We keep discussing the possibilities of shadow registers, register windows, or multiple register banks, but at present we still just have the one logical file of 16 registers. (In the implementation, neither r0 nor r15 is ever read: r0 because it must always read back as zero, and r15 because it must read as the PC, which needs to be a register of its own to allow for increment.) Oh, and we do have shadows for the PC and PSR, so an interrupt routine can return to the main code without ever using complex push or pop operations, which this machine doesn't do.
Now, for pi, we started with a 65Org16 program which was itself a quick port of the 65816 pi spigot written by Bruce Clark. We ported it to OPC5ls more or less mechanically - with the main difference that we could use registers instead of memory, and didn't need to juggle values. Here's a typical code fragment:
div: # uses y as loop counter
mov r10, r1 # sta r
mov r3, r0, 16 # ldy #16
mov r1, r0, 0 # lda #0
add r11, r11 # asl q
d1: adc r1, r1 # rol
cmp r1, r10 # cmp r
nc.mov pc, r0, d2 # bcc d2
sbc r1, r10 # sbc r
d2: adc r11, r11 # rol
mov r3, r3, -1 # dey, don't affect the carry flag
nz.mov pc, r0, d1 # bne d1
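The fragment above is a shift-and-subtract division loop. As a reading aid, here is a minimal Python model of the same idea, assuming the dividend starts in q (r11), the divisor is in r10 and the remainder builds up in r1. The function name `div16` is hypothetical, and the model does not reproduce flag behaviour or the 16-bit truncation of the remainder register.

```python
def div16(dividend, divisor):
    """Unsigned 16-bit restoring division: returns (quotient, remainder)."""
    q = dividend & 0xFFFF   # plays the role of r11: dividend in, quotient out
    r = 0                   # plays the role of r1: running remainder
    for _ in range(16):     # r3 counts the 16 iterations in the assembly
        # Shift the top bit of q into the bottom of the remainder.
        r = (r << 1) | (q >> 15)
        q = (q << 1) & 0xFFFF
        # If the divisor fits, subtract it and record a 1 quotient bit.
        if r >= divisor:
            r -= divisor
            q |= 1
    return q, r
```

In the assembly the quotient bit is not set explicitly: it is the carry left by the compare or subtract, rotated into q by the `adc r11, r11` at `d2`.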
Interestingly, where the 6502 takes 72000 cycles to compute 6 digits with this approach, our transliterated approach takes 85000 - but recoding into more idiomatic code for our OPC5ls machine gets it down to just 40000 cycles. That's nice! In fact even with two-cycle memory, because there are cycles which don't access memory, it only takes 65000 cycles, so that's a hopeful indicator should we fit an 8-bit memory system to this 16-bit CPU.
07/15/2017 at 13:32 •
Up to this point, we have a few snippets of code to exercise our CPU ideas. But looking at a simple program to compute the Fibonacci numbers (up to a 16 bit limit) we're able to compare our various efforts against the 6502:
Core                         Code size   Cycles
OPC-1 (8 bit CPLD sized)     172         5040
6502 (8 bit custom)          84          1710
OPC-3 (16 bit OPC-1)         216         2550
OPC-5 (16 bit 16 register)   70          921
We don't think that's too bad!
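For reference, the job the benchmark does is small. A minimal Python sketch of "Fibonacci numbers up to a 16 bit limit" is below; starting the sequence at 0, 1 and the name `fib_up_to` are assumptions, and the assembly programs may differ in detail.

```python
def fib_up_to(limit=0xFFFF):
    """Return the Fibonacci numbers that fit within the given limit."""
    values = []
    a, b = 0, 1
    while a <= limit:
        values.append(a)
        a, b = b, a + b
    return values
```

With these assumptions the list ends at 46368, the largest Fibonacci number that fits in 16 bits.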
However, we noticed that with 16 registers we much more often operate on registers than on memory. So, we can rejig the machine to separate load and store operations and make those the only ones which operate on memory, and free up one bit for instruction encoding. (We did have another thought: if we drop to 8 registers we free up two bits...)
Here's the updated spec showing we now have 16 opcodes: we've added sub, sbc, cmp and cmpc, also not, byte swap, and access to the processor status register - which means an interrupt routine can now save and restore the machine state much more readily.
We're hoping this will improve both performance and code density. To figure that out, we've written some arithmetic routines: multiply, divide and square root.
So, we're still within 128 slices, easily, and generally a bit faster than 100MHz, which keeps us competitive with an FPGA version of 6502, although we are using a 16 bit wide memory, which would (in the day) have made for a much more expensive system. We're confident we could make a shim to connect to an 8 bit wide memory, but that would surely cost us performance.
One further improvement: we reworked our state machine to use fewer cycles, which brought the Fibonacci benchmark down from 921 cycles to 709. Across all our microbenchmarks, we got a 30% performance increase. A little bit of pipelining goes a long way - as the 6502 designers also knew.
Just one more thing: we coded up a monitor program, by translating Bruce Clark's Compact Monitor, so we can more easily load and test code over a serial connection.
07/13/2017 at 20:20 •
Our first cut of OPC-5, the one-page CPU for FPGA, had a fixed two-word instruction format. But often the operand word will be zero, so we've used one additional bit of the instruction word to cover that case, and now we have a variable-length instruction machine.
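A minimal Python sketch of how such a fetch step might look is below. The bit position `OPERAND_PRESENT` and the function name `fetch` are hypothetical; only the idea (one instruction bit says whether an operand word follows, otherwise the operand is taken as zero) comes from the text.

```python
OPERAND_PRESENT = 1 << 12  # hypothetical position of the length bit

def fetch(mem, pc):
    """Fetch one instruction; return (instruction, operand, next_pc)."""
    instr = mem[pc]
    if instr & OPERAND_PRESENT:
        # Two-word form: the operand is the following word.
        return instr, mem[pc + 1], pc + 2
    # One-word form: the operand is implicitly zero.
    return instr, 0, pc + 1
```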
We've also doubled the instruction count: instead of four basic instructions, with a choice of absolute or indirect addressing (load, store, add, nand) we've moved up to eight: load, store, add, sub, and, or, xor and ror, still with the addressing choice. That should be more comfortable, even though the earlier set was enough for any program. We can improve code density and performance by offering more power in the instruction set.
In fact, after a little consideration, we've removed SUB, and instead split out an ADD and an ADC. We also changed the ROR to be a 17-bit rotate including the carry - this should help with multi-word arithmetic.
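To see why a rotate through carry helps with multi-word arithmetic, here is a minimal Python sketch that shifts a 32-bit value, held as two 16-bit words, right by one bit. The function names are hypothetical and the sketch models only the data path, not the real instruction's operands or other flags.

```python
def ror_through_carry(value, carry_in):
    """17-bit rotate right: carry goes into bit 15, bit 0 goes out to carry."""
    result = ((carry_in & 1) << 15) | ((value & 0xFFFF) >> 1)
    carry_out = value & 1
    return result, carry_out

def shift_right_32(hi, lo):
    """Shift a 32-bit value (hi:lo) right by one using two rotates."""
    hi, carry = ror_through_carry(hi, 0)      # top word first, carry cleared
    lo, carry = ror_through_carry(lo, carry)  # carry links the two words
    return hi, lo, carry
```

The bit that falls out of the high word arrives in the top of the low word via the carry, so no masking or merging is needed between the two steps.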
And finally (for now) we expanded our predicate idea - we still have one bit to spare - instead of just two bits for predication on zero and carry, we added an invert bit so we could make each instruction conditional on the flags being clear or set. In the case of instructions which modify r15, the program counter, that gives us a family of conditional branches (absolute or relative.)
06/25/2017 at 15:48 •
We were last seen with the OPC-3 - a one-page computer with 16 bit data, 16 bit addresses, and 16 valid opcodes, as a hastily-inflated version of OPC-1, our CPU for CPLD. It felt good to have 16 bit words - lots of room in the instructions, and fitting a full pointer into any register or location should be a relaxing change from 8 bit computing. And if it keeps the machine simple, so much more chance of doing something good in just 66 lines of source. But now we want to think bigger: up to 128 slices in an FPGA.
What to do with a 16 bit instruction word? Well, it will never be big enough for a full-size operand, so with simplicity in mind, our first take is to have every instruction word be followed by an operand word. We have room in the instruction for a 6 bit opcode field, two predicate bits, and two 4-bit fields for the registers: source and destination. A 16-entry register file should feel very roomy, and will be very compact in FPGA too (each LUT can be a logic gate or a 64x1 RAM!) We'll put the PC in register 15, and then predicates (on zero and carry) give us conditional branches. We'll have r0 be zero, and a dummy destination.
We decided on an 'effective value' idea to make use of our opcode - we always add the operand value to the source register before proceeding with the two-operand operation. It turns out this single addressing mode gives us some near equivalents to several conventional addressing modes, and it's simple to describe and to implement.
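A minimal Python sketch of the idea follows, showing how one rule covers several familiar addressing modes. The function name `effective_value`, the `indirect` flag and the use of a dict for memory are assumptions for illustration; r0 is assumed to read as zero, as described above.

```python
MASK = 0xFFFF  # 16-bit machine

def effective_value(regs, mem, src, operand, indirect):
    """Add the operand to the source register; optionally read memory there."""
    ev = (regs[src] + operand) & MASK
    return mem[ev] if indirect else ev

regs = [0] * 16          # regs[0] stays zero in this sketch
regs[2] = 0x0100
mem = {0x0007: 11, 0x0100: 22, 0x0104: 33}

immediate = effective_value(regs, mem, 0, 7, False)      # r0 + 7: a constant
register = effective_value(regs, mem, 2, 0, False)       # r2 + 0: a register
absolute = effective_value(regs, mem, 0, 7, True)        # mem[7]
reg_indirect = effective_value(regs, mem, 2, 0, True)    # mem[r2]
indexed = effective_value(regs, mem, 2, 4, True)         # mem[r2 + 4]
```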
Although we have 6 bits for our operations, our first cut only has 8 opcodes: load, store, add, nand, each in two flavours - direct addressing and indirect addressing.
Here's the first OPC5 spec. Remarkably, all this fitted happily into 66 lines of Verilog, and 66 lines of Python for an emulator.
(What happened to OPC4? It's reserved, in case we want to flesh out OPC3, which was quick and dirty.)