Bexkat1 CPU

A custom 32-bit CPU core with GCC toolchain

This is a synthesizable 32-bit CPU core written in Verilog. I've also ported GCC, binutils, and newlib to produce machine code for this system. In addition to the CPU core, the project has a pretty wide selection of peripheral cores that I've developed or adapted from other open designs. The current project is configured for the Terasic DE2i-150 board and MAX10-lite (in progress), but should be synthesizable for many of the smaller Cyclone boards with appropriate adjustments.

Some core features:

32-bit data and address buses
16 general-purpose 32-bit registers
Absolute, direct, and relative addressing modes
Single-precision floating point support in hardware
Interrupt/exception handling
Wishbone compatibility
Supervisor/protected mode
(Incomplete) Port to MAX10-lite board
  • An annual update

    Matt Stock, 12/30/2019 at 01:38 (2 comments)

    I know my updates on the project have been slow, but I really am working on it.  It's been a bit of a challenge given a bunch of headwinds, but I'm excited to report that I've made some significant progress after a lot of refactoring work.  I have unified the code for a bunch of the different dev boards I've been using, and so now in theory the same code will run on at least the DE10-standard and DE2i-150.  The MAX10-lite hasn't been tested yet, but will probably work as well, even with the limited onboard memory resources.

    In addition to the SoC code itself, I also spent a lot of time bringing the toolchain up to date.  So now I have branches off of the master branch of gcc, binutils, and newlib which are current as of a week or so ago and appear to generate proper code.

    There's also a lot of work that's been done with Verilog and the unit tests for both the "microcoded" and the pipelined version of the Bexkat CPU.  They both run the same tests, and get the same results - with one exception.... exceptions.  :-)  The issue is that my original ISA pushed both the CCR and the PC onto the stack before jumping to the ISR, and for a pipeline model that's not ideal.  I'm thinking about a redesign that will require the ISR to push and pop the CCR itself, but I haven't implemented it yet.  Until then, technically the microcoded CPU is the one that works correctly, since the other just ignores the CCR.

    I'll be doing another push of the code to the public GitHub repos in the next week or so, which should give a picture of what's been done.

  • More Pipelining

    Matt Stock, 04/13/2018 at 14:13 (2 comments)

    It's been a while since I made an update, but I am making progress in fits and starts.  I ran into some roadblocks with the pipelining when introducing exceptions and some of the other vagaries of a real design, and so went back and thought through some of my assumptions.  It turns out that I had a major error in how I understood the Wishbone bus specification.

    In short, I had struggled with how to deal with latency in a pipelined operation when coupled with multiple masters/arbitration.  If you allow bus preemption, it seems like you can lose data or have to replay requests, which doesn't make sense.


  • Pipelined

    Matt Stock, 11/27/2017 at 04:40 (1 comment)

    During the holiday break, I was able to make a significant amount of progress on the pipeline logic.  At this point, I have everything working with the exception of subroutines and...  exceptions.  Subroutines (push old PC to memory stack, update the PC) shouldn't cause too much trouble, and while I'm not sure if there are going to be surprises in the exception handling, I'm expecting it will be similar to the existing branch code.


  • Pipelining and Simulation

    Matt Stock, 11/15/2017 at 18:05 (0 comments)

    I'm returning to this project and have made a few interesting improvements recently.  The first is that I cleaned up the Verilog for the CPU core so that it could be built in Verilator, a pretty slick tool that takes a Verilog module and defines it as a C++ object.  You can then attach it to a test harness of your choosing to validate your work, check for regressions, etc.  Until now, I've been relying on tests on the FPGA systems themselves, and leaning heavily on the logic analyzer functions that Quartus provides to debug.  That works and is very powerful, but it's also quite slow and has limited flexibility.  This change, coupled with a new initialized RAM module, allows me to compile and run arbitrary code pretty easily.

    The main reason I went down this road is that I was planning to do a redesign of the CPU to support pipelining.  I've made some progress here as well, building a 5-stage pipeline that at least seems to move the proper data and signals around.

    My challenge with pipelining in general is that most of the textbooks I've seen handwave over one of the most fundamental structural hazards - what to do when the instruction and data memory are on a common bus.  I decided to "solve" this problem by building the CPU core with two logical busses (data and instruction), and to marry them to a dual port RAM module.  Since the instruction bus will never do a write, this works well and will be sufficient to test out the pipelining.

    I don't know how other designers solve this problem in the real world, but my plan is to link the CPU to an L1 cache, and have the cache layer deal with the vagaries of the "outside" bus.  This should also reduce the number of clock cycles required in each pipeline phase.  Right now my bus access logic requires two clock cycles minimum, but I think I could reduce this to one without too much effort.  I'm kind of working on the pipeline stuff one issue at a time, since I don't really have a good reference to crib from.  If anyone has any suggestions on something that's not crazy complex and would help give me some direction, leave them in the comments. 

  • CPU Architecture Video

    Matt Stock, 02/13/2017 at 02:20 (0 comments)

    Here's the next video, which goes into more detail about the CPU design as well as walking through the state transitions for a simple add operation:

  • System Overview

    Matt Stock, 02/13/2017 at 00:36 (0 comments)

    I'm trying to get more documentation in place, in the form of some YouTube videos. This one will give you a sense of the overall system architecture, and how the CPU interacts with other devices. Let me know if you have any questions or comments.

  • Supervisor mode

    Matt Stock, 01/02/2017 at 00:09 (0 comments)

    I've been working on fleshing out a supervisor mode with a goal towards being able to do multiprocessing in the Unix way. The basic work is complete (protected opcodes, hardware and software interrupts that execute in supervisor mode, etc), but I'm working on the nuance now. In particular, I'm testing different ways to pass information from user space into kernel space. Since my current method of parameter passing is solely via the stack, and the stack pointer is swapped out as part of the move to supervisor mode (supervisor stack pointer), this is mostly an exercise in C semantics now. My exception handler pushes the original stack pointer onto the supervisor stack before jumping to the exception handler, and so now I'm just working through the most sane way to reference that element (which isn't an argument to the interrupt handler!), and then use it as an index to pull out the other info on the user stack I care about.

  • ISA Rework

    Matt Stock, 01/03/2016 at 20:36 (1 comment)

    My first cut at an ISA was focused on getting the functions right, and leaving room to add more options later. Now that I've got most of the functionality I want, I can go back and look at ways to reduce the complexity, with a goal of improving performance.


  • Testing Part 2

    Matt Stock, 01/03/2016 at 20:25 (0 comments)

    As I mentioned earlier, I've been looking at pushing to the next round of project improvements, and that meant a better testing process. I tried using a "control" CPU, whose output would be compared to that of the CPU under test. However, that assumes that the number of clock cycles required for each operation won't change. While useful in a few cases, a lot of the changes I'm interested in involve timing, and so that wouldn't work.

    I decided instead to make two ROM modes. The default one runs the monitor code, which allows for basic memory interaction as well as parsing of ELF binaries on the microSD to bootstrap other programs. The new ROM module is a set of POST routines written to progressively test the CPU as well as IO functions to check for functional regressions. This method has already paid for itself, since I found a small bug in a couple of the floating point opcodes.

    The method of test is fairly simple. I need to assume that some basic operations work, otherwise it won't even run the POST, which means immediate load of a register, immediate add, integer compare, and branch if not equal. The first tests evaluate register operations, the ALU and FPU. Then we test stack operations, branch tests, and all of the load and store operations. For the math and branch operations, we can compute the expected result and store them in the code, and generate an error when the result isn't as expected.
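The table-driven check described above can be sketched in C. This is only an illustration under assumptions: the real POST routines live in ROM and their source is not shown in this log, and the names `post_case`, `post_run`, `op_add`, and `op_add_broken` are hypothetical stand-ins rather than anything from the Bexkat1 code base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One precomputed test vector: two operands and the expected result. */
struct post_case {
    uint32_t a, b, expected;
};

/* Expected values are worked out ahead of time and stored with the test. */
static const struct post_case add_cases[] = {
    { 1u,          2u, 3u          },
    { 0xFFFFFFFFu, 1u, 0u          },  /* 32-bit wraparound */
    { 0x7FFFFFFFu, 1u, 0x80000000u },  /* carry into bit 31 */
};

/* Run every case through `op`.
   Returns -1 if all cases pass, else the index of the first failure. */
static int post_run(uint32_t (*op)(uint32_t, uint32_t),
                    const struct post_case *cases, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (op(cases[i].a, cases[i].b) != cases[i].expected)
            return (int)i;
    }
    return -1;
}

/* Hypothetical stand-in for a correct add opcode. */
static uint32_t op_add(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Hypothetical stand-in for a faulty add that drops bit 31. */
static uint32_t op_add_broken(uint32_t a, uint32_t b)
{
    return (a + b) & 0x7FFFFFFFu;
}

/* Small self-check showing how the pieces fit together. */
static void post_selfcheck(void)
{
    size_t n = sizeof add_cases / sizeof add_cases[0];
    assert(post_run(op_add, add_cases, n) == -1);
    assert(post_run(op_add_broken, add_cases, n) == 2);
}
```

Returning the index of the first failing case, rather than a bare pass/fail flag, mirrors the idea of generating an error that points at the operation that regressed.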

    In addition to the basic CPU tests, I'm also implementing a set of memory tests. This will allow me to better test the cache module, which I'll describe in the Doom project update.

  • Regression Testing

    Matt Stock, 12/12/2015 at 15:01 (1 comment)

    So far in these projects, I've been able to build iteratively and not run into too many nasty bugs. There are many layers of abstraction though (libraries, compiler, assembler, machine, CPU), and so when a bug does crop up, it can be really challenging to find.

    Most recently, I found that I had misunderstood some subtleties of transferring data between registers. The fix was simple - an opcode that zero fills the upper bits when you make a copy of an object smaller than the register size. But how this manifested itself was that sometimes printf() printed out the wrong character when printing a number. Eventually, I was able to isolate this to 33 % 10 resulting in 9 (not 3), which meant I didn't have to debug libc. After further narrowing the issue down to making a very small test case, I was able to see why the CPU was generating the incorrect value. That probably took me 4 days to debug.
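To make the class of bug concrete, here is a minimal C model of a byte-sized register copy with and without zero fill. It is a sketch under assumptions: the function names are hypothetical, and the stale bit pattern (`0x100`) is chosen purely for illustration. The post does not say which bits were actually left over in the failing case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a byte copy that writes only the low 8 bits
   and leaves whatever was in the upper 24 bits of the destination. */
static uint32_t copy_byte_keep_upper(uint32_t dest, uint8_t src)
{
    return (dest & 0xFFFFFF00u) | src;
}

/* The same copy with the upper bits zero filled. */
static uint32_t copy_byte_zero_fill(uint32_t dest, uint8_t src)
{
    (void)dest;              /* old contents are discarded entirely */
    return (uint32_t)src;
}

/* Small self-check: stale upper bits change the result of a modulus. */
static void copy_selfcheck(void)
{
    uint32_t stale = 0x00000100u;   /* illustrative leftover value */

    /* 0x100 | 33 == 289, and 289 % 10 == 9 */
    assert(copy_byte_keep_upper(stale, 33) % 10 == 9);

    /* With zero fill the register holds exactly 33, so 33 % 10 == 3 */
    assert(copy_byte_zero_fill(stale, 33) % 10 == 3);
}
```

The point of the model is only that a narrow copy without zero fill can corrupt later full-width arithmetic, which is why the symptom showed up far away from the cause.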

    As I plan on making some radical changes that could break things, I need to consider how best to avoid introducing more of these kinds of issues, and if it happens, how to quickly determine the issue.



  • 1
    Step 1

    Clone the three source repos.

  • 2
    Step 2
    mkdir bexkat1
    cd bexkat1
    mkdir gcc binutils newlib
  • 3
    Step 3
    cd binutils
    $(BINUTILSREPOPATH)/configure --target=bexkat1-elf
    sudo make install





Andre Powell wrote 10/08/2017 at 07:44

Hi Matt,

I watched the CPU architecture video,  impressive :).
From the video you mention your intention to pipeline your design. How many stages are you thinking of going for?
Are you going to put the hazard avoidance in the toolchain or the hazard logic into the design? Toolchains are NOT my strong point, so I went for the latter.

I don't know how complex your design is in Verilog, but I would suggest you write each pipeline stage in a separate module. If your design flow allows it, using System Verilog will be very helpful.

As your design becomes more and more complex I would highly recommend looking into simulation. You have several options these days, including the 'free' Modelsim simulator from Altera. I understand the free version is now mixed language.

Anyway congratulations on getting your architecture to work :).



Matt Stock wrote 10/08/2017 at 12:24


Thank you!  I'm at a bit of a crossroads on the project actually.  I'm still interested in CPU cores (including pipelining, branch prediction, etc), however I feel like the current implementation is too complex right now.  Given that it evolved over time and has no simulation or regression testing framework, making changes is harder than it needs to be.  I've relied on the logic analyzer tools within Quartus quite extensively.  It works, but not being able to test each component in isolation is...  not good.  :-)

I'm thinking about setting aside my current design for now, and aiming for something similar and more modular.  Simplify the ISA and go 16/8 or maybe 24/16 bit.  I could still work on some advanced features but the overall complexity and size would drop.  I'd lose some of the toolchain work I've already done, but rather than use gcc I'd probably build something with lcc and gas that would be perfectly sufficient for the kind of toy projects I'd want to work on.  If I wanted to re-port DOOM, I could dive into gcc again.

As for language, right now most of it is in System Verilog, but most of the extensions I use are syntactic sugar (e.g. for loops to initialize the register file) plus some typing.  I don't know much about simulation, but if I started over I'd spend more time there.

Given your experience, I'm interested in what you suggest.


Andre Powell wrote 10/08/2017 at 18:16

Hi Matt,

I can understand how you feel. I think possibly a third option that may be attractive is to keep your present ISA but start to look at rewriting the core in the way that you wish.

You could start by analysing how you would like to pipeline your present ISA; there may also be elements you could almost copy and paste.

This approach has the other advantage that you can use your present design as a 'Golden Reference'. You could have your new pipelined machine and your present design running in parallel, fire the same instruction into both, and see if the end result is the same. You would of course need to take into account the number of clocks in your pipeline for the comparison.
This would give you an initial level of confidence.
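One way to account for the differing clock counts is to compare the two designs by the sequence of register writes they retire rather than cycle by cycle. The C sketch below illustrates that idea only. The `retire` record, its fields, and `compare_traces` are hypothetical names, and how such a trace would be captured from either core is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One retired register write, as a testbench might log it. */
struct retire {
    unsigned cycle;   /* clock on which the write landed (not compared) */
    unsigned reg;     /* destination register number */
    uint32_t value;   /* value written */
};

/* Compare a reference trace against a device-under-test trace.
   Cycle numbers are ignored, so the two designs may take different
   numbers of clocks per instruction.
   Returns -1 if the traces match, else the index of the first divergence. */
static int compare_traces(const struct retire *ref, size_t nref,
                          const struct retire *dut, size_t ndut)
{
    size_t n = nref < ndut ? nref : ndut;

    for (size_t i = 0; i < n; i++) {
        if (ref[i].reg != dut[i].reg || ref[i].value != dut[i].value)
            return (int)i;
    }
    /* A shorter trace means one side retired fewer writes. */
    return nref == ndut ? -1 : (int)n;
}

/* Small self-check with made-up traces. */
static void trace_selfcheck(void)
{
    struct retire ref[] = { { 4, 1, 5u }, { 8, 2, 7u } };
    struct retire dut[] = { { 5, 1, 5u }, { 6, 2, 7u } };

    assert(compare_traces(ref, 2, dut, 2) == -1);  /* same writes, other clocks */
    dut[1].value = 8u;
    assert(compare_traces(ref, 2, dut, 2) == 1);   /* second write diverges */
}
```

Comparing on retired results keeps the reference useful even when a change deliberately alters timing, at the cost of not catching timing regressions themselves.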
Your initial pipeline could be developed as a freefall pipeline with no hazard checking that only does one instruction at a time. That would be easier to check initially. I would highly advise using structs to bundle signals together, and don't be afraid of using typedef to create new types; typedef enum will help you immensely to understand what is going on.

Then add Hazard checking, continue the single instruction comparison.

Then start to have more than one instruction in the pipeline and go from there.

This would give you confidence not only in your pipelined machine but also in your testbench skills.

Now as to simulators, you have at least the freebie Modelsim simulator that can be downloaded from Altera. I think this can do System Verilog now and is a mixed-language sim! There is also Icarus Verilog, which I hear good things about. I don't know if it does System Verilog, but that can be a research exercise.

Note: if you are comparing two values for equality in a testbench, make sure you use ===. It will catch any 'X' values rather than just letting them through; this was a mistake I made.
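The difference can be modelled in C for a single 4-state bit. In Verilog, `!=` yields X when either operand is X or Z, and an `if` only takes its branch when the condition is 1, so a mismatch check written with `!=` silently skips unknown values. `!==` compares X and Z literally and always yields 0 or 1. The type and function names below are hypothetical, and this is a model of the semantics rather than simulator code.

```c
#include <assert.h>

/* A single 4-state logic value: 0, 1, unknown (X), high impedance (Z). */
typedef enum { L0, L1, LX, LZ } logic_t;

/* Model of Verilog `a != b`: unknown if either operand is X or Z. */
static logic_t ne_logical(logic_t a, logic_t b)
{
    if (a >= LX || b >= LX)
        return LX;
    return (a != b) ? L1 : L0;
}

/* Model of Verilog `a !== b`: X and Z are compared literally. */
static logic_t ne_case(logic_t a, logic_t b)
{
    return (a != b) ? L1 : L0;
}

/* Model of `if (cond)`: the branch is taken only when cond is 1. */
static int if_taken(logic_t cond)
{
    return cond == L1;
}

/* Small self-check: an X on the output versus an expected 1. */
static void eq_selfcheck(void)
{
    /* `if (out != expected) report_error;` never fires on X. */
    assert(!if_taken(ne_logical(LX, L1)));

    /* `if (out !== expected) report_error;` does fire on X. */
    assert(if_taken(ne_case(LX, L1)));
}
```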

I've just realised you dived into GCC and got it to generate your own instruction stream !


I had a look at doing that and decided it was toooo scary !

You sir are a Steely Eyed Missile Man !

Whatever you do just enjoy it, explore the options. Don't be afraid to do something different, if you feel you want to have a look at a different ISA then go for it, have a look at different architectures such as SIMD stuff.

One thing I would also advise is this: should you have a partner, tell them what you are doing and how much it means to you to explore your hobby.

You could of course come over to the Dark Side and do stuff in VHDL.

(Andre now ducks and runs away !!!)

Keep in touch !


edmund.humenberger wrote 07/06/2016 at 09:39

Would be interesting to know if your Verilog project would run on  FPGA hardware.  Has 8 kLUT and max of 1 MBYTE of SRAM  supported by open source FPGA toolchain Yosys and Arachne PnR.  Toolchain even running on RaspberryPi.


Matt Stock wrote 07/06/2016 at 13:23

I'll take a look. The open toolchain has some appeal. The main concern I would have is the use of hardware multipliers in the current logic. I'd need to see if the Lattice unit has something similar and with sufficient quantity.

My current plan is to possibly replace GCC with LLVM/clang, and to see what it would take to create an 8-bit variant of the CPU.


Yann Guidon / YGDES wrote 12/09/2015 at 02:31

Please write and publish more documentation :-)


Matt Stock wrote 12/09/2015 at 03:10

Yes, I'll be adding more over the next few days.  I also have a companion project I'll be adding to demonstrate the Doom port I made to this architecture.


Yann Guidon / YGDES wrote 12/09/2015 at 03:16

Yay ! Welcome to the DIY CPU club :-)

