I could have sworn that I'd posted an update already, but looking at my logs feed, I clearly have not.
Topics covered below include:
- Serial Interface Adapter Core Completed
- Initial Program Adapter Core
- KCP53010: Successor to KCP53000 CPU
Serial Interface Adapter Core Completed
Not much more to say than that. It's done. It's not as small as I'd like, but on the other hand, it's also more flexible than your typical UART design. It allows you to send and receive serial data streams (LSB first only), with or without start bits, stop bits, etc. Frame checking is up to the software using it. It supports configurable FIFO depths and widths (up to 16 bits wide), allowing you to tune the core for your needs. Those who have programmed the Commodore-Amiga's internal UART will be right at home with how this adapter works. A nice, wide divisor allows for data rates from as low as hundreds of bits per second to as high as tens of megabits per second.
Data is sent over a pair of wires, TXD and TXC, carrying data and a forwarded clock, respectively. Data is received on the corresponding pair, RXD and RXC. The receiver can synchronize on RXD, RXC, or both. For lower-speed applications, RXD is sufficient. For higher speeds, you probably want to ignore RXD edges and synchronize on RXC alone. The choice is yours.
This core provides a 16-bit Wishbone B.4 Pipelined Mode slave interface; it should be easily usable with 8-bit devices as well.
New Initial Program Adapter Core
The Kestrel-3 code-base now includes a new core, currently with the name "IPA". This core has one mission: to facilitate loading the initial bootstrap code into RAM on a ROM-less computer design. From the processor's perspective, it looks exactly like ROM memory, and sits where ROM normally would; however, on the back-end, it parasitically feeds off the RXD and RXC pins of the SIA core. The idea is simple: when the processor reads a half-word from anywhere in ROM's address space, it blocks until the IPA receives two bytes. The bytes must be sent in PC-standard 8N1 serial format. The IPA is synchronized on the RXC input, so you'll need either a proper USART or a microcontroller to drive it. Since I have two Arduinos and an ESP8266 microcontroller at my disposal, this is not a blocking drawback.
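For reference, here is a minimal sketch of what 8N1 framing looks like on the wire for one 16-bit half-word: one start bit (0), eight data bits sent LSB first, and one stop bit (1) per byte. The function names are hypothetical, and sending the low byte first is an assumption; the IPA's actual byte order is not stated here.

```python
def frame_8n1(byte):
    """Return the line bits for one byte in 8N1 format, in transmission order."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte out of range")
    bits = [0]                                    # start bit
    bits += [(byte >> i) & 1 for i in range(8)]   # data bits, LSB first
    bits.append(1)                                # stop bit
    return bits


def frame_halfword(value):
    """Frame a 16-bit half-word as two 8N1 bytes.

    ASSUMPTION: the low byte goes first (little-endian order).
    """
    low, high = value & 0xFF, (value >> 8) & 0xFF
    return frame_8n1(low) + frame_8n1(high)
```

Under this framing, every half-word the processor reads costs 20 bit times on the serial link.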
The idea is you spoon-feed the computer an instruction stream designed to explicitly store data into memory, like so:
    ; X1 = pointer into RAM
    ; X2 = value to store (byte)
    ADDI X1,X0,0
    ADDI X2,X0,$03
    SB X2,0(X1)
    ADDI X2,X0,$7F
    SB X2,1(X1)
    ; ...etc...

and so on until you have loaded 1KB to 2KB worth of code into RAM. If you need more than this, you'll need to manually reset X1 somehow, and continue loading your data. This approach is slow, of course; however, it saves me the hassle of needing to implement a DMAC just for the serial port. LUTs are precious in these smaller FPGAs, so this is a pretty big win for me. Besides, this only has to happen exactly once upon system reset, and the bootstrapper doesn't need to be terribly large (4KB seems like an awfully large bootstrapper to me).
When the initial program is loaded, you kick it off by sending a JALR X0, 0(X0) instruction, which jumps to address 0, where the load began.
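A host-side generator for such an instruction stream might look like the sketch below. It uses the standard RISC-V I-type and S-type encodings for ADDI, SB, and JALR; the function names are hypothetical, not part of any Kestrel tool. Two assumptions are baked in: the stream is sent little-endian (low byte first), and the final jump is encoded as JALR X0, 0(X0), since the `0(X0)` operand form is JALR syntax.

```python
import struct


def encode_addi(rd, rs1, imm):
    """I-type: imm[11:0] | rs1 | funct3=000 | rd | opcode=0x13."""
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate does not fit in 12 bits")
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13


def encode_sb(rs2, rs1, imm):
    """S-type: imm[11:5] | rs2 | rs1 | funct3=000 | imm[4:0] | opcode=0x23."""
    if not -2048 <= imm <= 2047:
        raise ValueError("offset does not fit in 12 bits")
    imm &= 0xFFF
    return ((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15) | ((imm & 0x1F) << 7) | 0x23


def encode_jalr(rd, rs1, imm):
    """I-type: imm[11:0] | rs1 | funct3=000 | rd | opcode=0x67."""
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate does not fit in 12 bits")
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x67


def build_loader_stream(payload):
    """Return the bytes to spoon-feed the IPA so that `payload` lands at RAM address 0.

    ASSUMPTIONS: little-endian byte order on the wire; payload fits within the
    non-negative SB offset range (0..2047), so X1 never needs to be reset.
    """
    if len(payload) > 2048:
        raise ValueError("payload exceeds the reach of a single base pointer")
    words = [encode_addi(1, 0, 0)]                 # ADDI X1,X0,0
    for offset, byte in enumerate(payload):
        words.append(encode_addi(2, 0, byte))      # ADDI X2,X0,byte
        words.append(encode_sb(2, 1, offset))      # SB X2,offset(X1)
    words.append(encode_jalr(0, 0, 0))             # JALR X0,0(X0)
    return b"".join(struct.pack("<I", w) for w in words)
```

For example, `build_loader_stream(bytes([0x03, 0x7F]))` reproduces the listing above followed by the jump. Each payload byte costs two 32-bit instructions, or eight bytes on the serial link, which is where the slowness comes from.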
The IPA exposes a Wishbone B.4 Pipelined Slave interface, and only supports 16-bit half-words. Attempting to read or write bytes from this space will fail in unpredictable ways. Don't do it. Thankfully, when the CPU fetches instructions, it fetches them 16 bits at a time.
This is not the first ROM-less Kestrel computer I've made. Indeed, my very first, the W65C816-based proof of concept Kestrel-1, only connected to SRAM and a single VIA chip for I/O. The architecture of the Kestrel-1 and the iCE40-targeting Kestrel-3 designs share much in common.
|Feature||Kestrel-1||Kestrel-3|
|CPU||W65C816P-14, 4MHz||KCP530x0, 25MHz|
|Performance||2 MIPS max.||6 MIPS max. (KCP53000), 12 MIPS est. max. (KCP53010)|
|RAM||32KB max.||256KB min., 512KB typ., 2^60 B max.|
|I/O||1 VIA with 16-bit parallel I/O||1 SIA, V.4 compatible serial, 110bps to 12.5Mbps possible.|
|IPL Mechanism||Bus mastering IPL Port, driven from PC Parallel port||ROM emulation via IPA core, shared with SIA core.|
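To put the serial figures in the table into perspective, here is a rough calculation of the divisor needed at each end of the 110bps to 12.5Mbps range with a 25MHz clock. It assumes the simplest possible relationship, bit rate = clock / divisor; that formula and the function name are assumptions for illustration, and the SIA's actual divisor register may be defined differently (for example, off by one).

```python
def divisor_for(clock_hz, bit_rate):
    """Nearest integer divisor, ASSUMING bit_rate = clock_hz / divisor."""
    if clock_hz <= 0 or bit_rate <= 0:
        raise ValueError("clock and bit rate must be positive")
    return max(1, (clock_hz + bit_rate // 2) // bit_rate)


CLOCK_HZ = 25_000_000  # Kestrel-3 clock from the table above

fastest = divisor_for(CLOCK_HZ, 12_500_000)   # top of the quoted range
slowest = divisor_for(CLOCK_HZ, 110)          # bottom of the quoted range

# Number of bits a divisor register would need to hold the largest value.
bits_needed = slowest.bit_length()
```

Under that assumption, the slow end of the range is what drives the need for a wide divisor; a 16-bit one would not reach 110bps from a 25MHz clock.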
New KCP53010 CPU Design Coming
You may have noticed that I'm adopting the Wishbone B.4 Pipelined Mode interface going forward in most everything I do these days. The KCP53000, however, uses a Wishbone B.3 master interface. It also requires bus arbiters and 64b/16b bridges to talk to FPGA-accessible resources. Coupling B.3 to B.4 peripherals will require yet more logic to be added to the design. This is unsustainable, and requires that I rework the CPU's memory data paths. I'm also dropping Furcula bus support. Not that I don't think it's a good idea, but it didn't pull its weight like I expected it to.
Since B.4 supports pipelining, I'm once again attempting to work towards a 5-stage pipelined, RISC-V processor architecture. My initial attempt ended in flames, melting flesh, pestilence, and was probably partly responsible for Batman Vs. Superman. OK, mostly hyperbole, but no one who knew me or what I was experiencing at the time would argue that it didn't end up a categorical failure in every sense of the word. I learned a lot of what not to do, but not a whole lot of what to do.
The KCP53010 design aims to implement a text-book, five-stage pipeline: instruction fetch, decode and register-fetch, execute, memory access, and write-back. (I may have to split the decode and register-fetch steps because of how the block RAM works on iCE40 FPGAs; I'll cross that bridge when I get there.) This time, instead of working top-down, I'm working bottom-up. Or, more precisely, inside-out, and back to front.
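As a reminder of how the textbook arrangement overlaps work, this small sketch lists which instruction occupies which stage on each clock cycle, assuming an ideal pipeline with no stalls, hazards, or multi-cycle memory accesses. The stage abbreviations and the function name are illustrative only and say nothing about how the KCP53010 is actually structured.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # fetch, decode/register-fetch, execute, memory, write-back


def occupancy(num_instructions):
    """Return one dict per cycle mapping stage name -> instruction index.

    ASSUMPTION: ideal pipeline, one instruction issued per cycle, no stalls.
    """
    total_cycles = num_instructions + len(STAGES) - 1 if num_instructions else 0
    chart = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            index = cycle - depth          # instruction that entered IF `depth` cycles ago
            if 0 <= index < num_instructions:
                row[stage] = index
        chart.append(row)
    return chart
```

Working back to front, as described above, corresponds to building the right-hand columns of this chart (WB, then MEM) before the left-hand ones.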
This means I implement the register write-back stage first. This stage appears to work currently, although I know it's far from finished. That's OK though; it's always safe to revisit the design later as I learn new things, provided tests are maintained in synchrony. Keeping the tests updated is the key.
I'm currently working on the memory access stage as I type this. Experience gained working on this stage has already taught me several things:
- Wishbone B.4 Pipelined Mode is surprisingly simple to implement when you have the right implementation strategy, and is insanely complicated if you don't. Thankfully, I've settled on a design which makes quite a bit of sense to me; in fact, I'm now thinking it's even simpler to implement than Wishbone B.3's quasi-asynchronous handshaking protocol, since I can think of commanding the bus and accepting responses in isolation of each other, instead of having to conflate the two. This leads to better testability and superior isolation of concerns in the Verilog source code.
- Some aspects of the data flow through a RISC-V processor which I thought belonged in the ALU more rightly belong at the head-end of the writeback stage (e.g., zero- and sign-extension). This should reduce the ALU complexity compared to the KCP53000's ALU, which I believe will let me hit higher clock speeds, even if only slightly. The ALU is *the* limiting factor of the KCP53000 design, so this is a welcome insight.
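To illustrate the second point, the zero- and sign-extension that a load needs can be expressed as a small, self-contained step applied to the raw data just before it is written to the register file. This is a behavioral sketch only; the function name is hypothetical and the 64-bit register width is an assumption.

```python
def extend_load(raw, width_bits, signed, xlen=64):
    """Zero- or sign-extend the low `width_bits` of `raw` to an xlen-bit register value.

    ASSUMPTION: xlen defaults to 64; adjust for a different register width.
    """
    if not 0 < width_bits <= xlen:
        raise ValueError("width must be between 1 and xlen")
    value = raw & ((1 << width_bits) - 1)            # keep only the loaded bits
    if signed and value & (1 << (width_bits - 1)):   # top loaded bit set?
        value -= 1 << width_bits                     # interpret as negative
    return value & ((1 << xlen) - 1)                 # wrap into the register width
```

For instance, a signed byte load of 0x80 yields 0xFFFFFFFFFFFFFF80, while the unsigned form of the same load yields 0x80. None of this logic needs to see the ALU's operands, which is why it can live outside the ALU.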
As I type this, Yosys/Arachne-PnR report that the load/store circuitry consumes no more than 310 LUTs. Even if I use two of these (one for instruction fetch and one for data access), this results in a significantly smaller memory interface than the KCP53000+arbiters+bridges approach I've been using (estimated to be half the size!), with a whole lot less combinatorial logic in the hot path. I anticipate this will contribute towards an improvement in top clock speed supported. In fact, this is small enough that I can probably use three of these units, two for memory, and one for CSRs, greatly reducing the burden of making CSR access as fast as possible, and still have a design smaller than the memory access hierarchy that the KCP53000 required!
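The idea of commanding the bus and accepting responses in isolation of each other can be modeled in a few lines. The sketch below is a behavioral Python model, not the Verilog: the command side looks only at STB and STALL, the response side looks only at ACK, and the two meet only through a count of outstanding requests. The class, the fixed-latency slave, and the stall pattern are all hypothetical test scaffolding.

```python
from collections import deque


class PipelinedMaster:
    """Hypothetical model of a Wishbone B.4 pipelined read master."""

    def __init__(self, addresses):
        self.pending = deque(addresses)  # commands not yet accepted by the slave
        self.in_flight = 0               # commands accepted, ACK still owed
        self.responses = []              # data captured on each ACK

    @property
    def stb(self):
        return bool(self.pending)

    @property
    def cyc(self):
        return bool(self.pending) or self.in_flight > 0

    @property
    def adr(self):
        return self.pending[0] if self.pending else None

    def clock(self, stall, ack, dat_i):
        # Command side: a request is accepted when STB is high and STALL is low.
        if self.stb and not stall:
            self.pending.popleft()
            self.in_flight += 1
        # Response side: every ACK retires one outstanding request.
        if ack:
            self.responses.append(dat_i)
            self.in_flight -= 1


def simulate(master, memory, latency=2, stall_pattern=()):
    """Drive the master against a fixed-latency slave; return cycles used."""
    delay_line = deque([None] * latency)   # accepted addresses awaiting their ACK
    cycle = 0
    while master.cyc:
        stall = cycle < len(stall_pattern) and bool(stall_pattern[cycle])
        due = delay_line.popleft()
        ack = due is not None
        dat = memory[due] if ack else 0
        accepted = master.adr if (master.stb and not stall) else None
        delay_line.append(accepted)
        master.clock(stall, ack, dat)
        cycle += 1
    return cycle
```

With three back-to-back reads and no stalls, the requests go out on consecutive cycles while earlier responses are still arriving, which is the overlap a B.3-style handshake cannot express.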
I expect to have a blog article written up on my current design soon. I'll post a link here when I finish it. I'll also be giving a talk on the design at the June SVFIG meeting.
My immediate goal is to create a dumb pipeline that takes pre-decoded RISC-V instructions and executes them. This forces the design to remain testable throughout the whole design and implementation steps. Control circuitry and required state machines won't go into the circuit until much, much later; I can probably reuse a lot of this from the KCP53000, honestly. Consuming pre-decoded instructions lets me focus on the essentials of the data-flow and error reporting circuits, again making things that much easier to implement/test.
That's it for now; I'll have more as time progresses of course. Progress is very slow due to work obligations, but I try to do what I can when I can. Until next time...