16-bit 'modern' homebrew CPU

DME is 16-bit custom CPU complete with a full software stack and a multi tasking operating system.

Similar projects worth following
DME CPU is a 16-bit processor realised in an FPGA. It is fairly complete with user and supervisor modes, paging and interrupts. It currently has access to 512kb of total memory (amount of SRAM on my FPGA board). Each process has up to 64kb of addressing, broken down into 32 pages (2K).

The project consists of:
- the CPU realised in Verilog
- a micro-code assembler
- an assembler
- a retargetted c compiler (LCC)
- a gate-level(ish) simulator
- a functional simulator
- a pre-emptive multi tasking OS (based on XV6)
- a Quartus II bdf file that ties my implementation to my FPGA board

I run the CPU on a Terasic Cyclone V GX Starter Kit. It's not the best board for this project (when I ordered it, I had no idea what to look for) . It gives DME access to a serial port, an SD Card reader that serves as its 'HD' and lots of lights and switches.

Computers have always fascinated me. I programmed in various languages when I was younger for fun. At work, people come to me with computer related problems and issues (alas). It was sometime last year, that one of my colleagues asked me in a casual way 'what is a kernel anyway'. I did my best to give him a (partly made up) explanation. The fact was, I didn't fully know the answer myself (I think he saw through it).

So when I got home, I started reading up on kernels. Still being interested in programming, soon after I decided I would program one, just for the heck of it. I started on one (learning c in the process), using the BOCHS simulater and learned about paging and context switching. But this site is not about that project.

As I googled terms such as stack frame and trap handler I hit Bill Buzbees website ( This guy didn't write an OS, he designed a whole CPU! This was one (or several!) levels deeper down the rabbit hole. I found it very interesting, but couldn't conceive of doing anything similar. I kept reading up on CPU designs and the history of computers.

This led me to the book 'The elements of computing systems (Nisan & Schocken)'. As soon as I read what the book was about I ordered it online. Then, because I couldn't wait, found a pdf version and got started right away (I still received the 'legal' copy several days later). This brilliant book guided me to building, and more importantly, understanding a very simple CPU. I could now see how simple 1's and 0's got combined into more and more complex building blocks until you had a working CPU. As I worked through the chapters a voice in my mind started saying: 'now you could build that cpu, a real one'.

The final piece of the puzzle fell in place when I revisited Bill Buzbees website and read that he had plans to build another CPU. But instead of building it out of actual hardware components (coolest way by far), this time he would use an FPGA. I had never heard of FPGAs. But once I read up, I realized that if I used an FPGA, building a CPU might just be within the realm of the possible for me.

Thus started DME CPU.


DME's Instruction Set Architecture

Adobe Portable Document Format - 29.64 kB - 10/03/2017 at 19:59


  • DME does paging

    ErwinM04/29/2019 at 14:34 2 comments

    The theory

    Any modern CPU implements paging (the now defunct alternative being 'segmentation') and DME does so as well. In modern CPUs paging is used to give every process its own, clean, address space: a 32-bit CPU can address 4 GB of memory and with paging enabled, each process can address and access all of it (in practice the OS will limit each process size). For DME, paging provides another functionality: DME has 512kb of physical memory. However, being a 16bit architecture, DME can only address 64kb. Paging allows DME to use all of its 512kb of physical memory, while only being able to address 64kb of it simultanously. This way DME can keep multiple address spaces in memory and swap to the right address space just before it starts or resumes a process.

    Having explained the theory: paging is complex! Not so much the high level concept, or even the implementation, but just thinking about it is complex: what is suppose to happen?, what is actually happening?, does it work? For this reason, adding paging to DME was 80% translating the high level concept into a detailed concept - all with pen and paper. Once I had that figured out, the other 20%, the implementation in Verilog, took just a few lines of code.

    DME's paging concept
    To make paging a reality we need to implement the following:
    1. when turned on, logical addresses are translated to physical addresses
    2. this translation is dynamic: the (OS) programmer can manipulate how addresses are translated
    3. the implementation should recognize the concept of an address space: a bunch of translations that belong together
    4. paging functionality can be turned on and off (otherwise you go mad trying to program it)

    Conceptually, my implementation divides 512kb of physcial memory into 256 pages of 2kb each. Furthermore, it also divides 512kb into 8 address spaces of 64kb each (= 32 pages). This last point is a major simplification which I will discuss under point 3.

    1. Translate logical addresses to physical addresses
    The paging logic should take a 16-bit logical address and translate this in a 19-bit physical address (512kb = 2^19). It consists of:

    Page Table: actually not a table, but just a list consisting of 256 16-bit addresses, that hold a reference to physical page locations, so called Page Table Entries (PTE). To translate a logical address an index is calculated and this is used as an index into the Page Table (e.g. if ptidx equals 47 the 47th entry of the Page Table is used).

    Page Table Base (PTB): used to determine which of the 8 available address spaces is visible to the CPU. The PTB functions as an offset to the base of an address space in the Page Table. For example, the second address space would be indicated by a PTB value of 32, meaning the first page of the second address space is page 32 in the Page Table.

    Page Table Index (ptidx): used to determine which PTE should be used from the Page Table. In pseudo code it is calculated as: address space offset + page number. The address space offset is simply the value of PTB. The page number can be gotten from the logical address (held in MAR): the bottom 11 bits give the location within a page (2^11 = 2048), the highest 5 bits when shifted right 11 places give a number between 0 and 31 - the page number.

    The physical address (pageaddr) is then concatenated from PTE (highest 8 bytes) and the MAR (lowest 11 bytes) resulting in a 19-bit address.

    2. User editable Page Table
    Page Table Entries (PTEs) can be loaded into the Page Table by the command WPTE (Write PTE). It takes 2 arguments: the ptidx of the entry (0..255) and the physical page it should be mapped to shifted left 8 bit. In hindsight I could have shifted during the concatination of pageaddr, this would make the PTE entries easier to interpret. Perhaps I'll change it.

    3. Recognize address spaces
    The PTB register can be loaded with the WPTB (Write PTB) command. The implementation of the translation above...

    Read more »

  • This is DME CPU's ISA

    ErwinM10/03/2017 at 19:38 0 comments

    This is DME's ISA:

    It's heavily inspired (copied?) from Bill Buzbee who designed an ISA for his planned Magic1.6 project. When I started my project I had no idea what to look for in an ISA, so I took his concept as a starting point. I took the load/store and the arithmetic instructions pretty much as I understood them. The 'architecture specific'-instructions (dealing with page tables and interrupts) I added myself. I also deviated from the pure RISC pattern by adding PUSH and POP instructions. As I was doing a design with a stack, these were just too good not to have and not too hard to implement.

    When I started my decoder in Verilog the plan was to not use microcode. At the time, I only had about half the instructions finalized and was implementing them in the decoder with 'case' and 'if-then' clauses in Verilog. Two reasons made me switch to using a microcode approach:

    1. It was too 'programmy' for my liking

    I started this project because I was intrigued by how simple logic blocks turn into a full blown CPU. For this reason I have tried to keep my Verilog as simple as possible. Going for a "I should be able to implement it in hardware"-feeling. The complex decoding logic I was typing into Verilog felt too much like programming and not like I was designing anything to do with hardware. If I'd have to do this in hardware I would have to move to a microcode approach.

    2. Microcode is a lot easier to change

    In the early days of my CPU my ISA changed a lot. As I wrote the assembler and later the compiler I constantly ran into instructions that couldn't target a certain register or instructions I thought would be handy, but which the compiler in practice would never use. Being able to change the microcode just be editing a text file has been such a good solution to deal with all these changes.

    So I ended up with a microcoded decoder. The decoder uses the OPCODE number as an index into a table of 48bit microcode. The bits in the microcode drive the control lines in the CPU. The microword looks like this:

    It uses 46 out of 48 bits. I could get that down by quite some bits if required as a lot of duplication has crept in over the time. But there is no real need to do so: the FPGA has plenty of room and it's not really interesting to me at this time.

    Looking at the ISA today I could see how, with some sorting and clever grouping it could probably be realised in a less complex set of Verilog statements without needing any microcode. Still, I think the microcode-approach has worked out well. It was something I could definitely see myself implementing in hardware, it is a breeze to change and, this being an FPGA, I have a suspicion that at the LUT level both approaches come down to pretty much the same thing.

  • DME is flawed

    ErwinM09/24/2017 at 13:53 0 comments

    The trap logic 

    When I started DME, I decided I'd do a RISC kind of architecture (actually Bill Buzbee decided it for me, as I took his Magic1.6 ISA as a starting point). So for the trap mechanism, I opted for a RISC like approach as well. DME has two sets of registers: USER and IRQ. User programs run in the user registers until a trap happens, the cpu then immediately switches register banks and continuous operating using the IRQ registers. The upside of this, is that during most traps the OS does not need to spend time saving user registers as they are preserved and can just be switched back to once the OS returns to user space.

    The flaw

    DME does not service IRQs when it is in its IRQ mode. Initially, I thought that would be no problem: it would finish servicing the current IRQ and then after it returned to user space it would immediately trap again and service the missed IRQ (it does get saved). This thinking is still mostly valid except for: syscalls. As I started fleshing out the OS it occurred to me that often a sycall, for example 'read', would trigger DME into IRQ mode only to kick-off some operation, f.e. read from the SD Card, and then go to sleep (in IRQ-mode!) waiting for the operation to finish. This 'finish' is signaled by an IRQ! Now, if every process is in IRQ mode waiting for an IRQ, the system will deadlock as the IRQ will never get serviced.

    Possible workarounds

    1. Don't sleep waiting for IRQ, wait(poll) for finish
    I could simply make a process wait for whatever operation needs to happen (keyboard input, reading a disk block) instead of putting it to sleep. In practice, since DME will never have multiple users working on it at the same time, the user experience of both approaches would probably be much the same. However, it means I don't get to play with IRQs as much in my OS. Just as important, this does not work for the timer IRQ.

    2. I could tell my scheduler that 'if there aren't any processes to run' (all are sleeping), it should run each irq handler. This works, and is actually what my OS does now (simple to implement). However, again this does not work for the timer_irq: i cannot run that each time there is nothing to run (it would accomplish nothing, as there is nothing to run!). It means timer_irq is only triggered when a process returns to user space.

    3. I could have a special user space process that always operates in user space and never does a syscall and thus is always runnable to catch irqs. Although this works, its an expensive solution. The process would eat CPU slices just to catch irqs. No good.

    A better design

    The real solution is of course to add another bank of registers: user -> system -> irq. User would always trap to system. When operating in system we could then still catch (real, hardware) irqs which would push us into irq mode. Only when in irq mode are all irqs disabled.

    DME is defined in verilog and the addition of a third register bank would not be too complex I think (ha!). However, I've decided against tackling that task atm as I want to keep the momentum going. I may decide to do it later, but lets first see how big of a problem this turns out to be. Meanwhile, i'm going with workaround (2).

  • Dhrystone scores are in!

    ErwinM09/24/2017 at 13:38 1 comment

    Yesterday and today I have been trying to get the Dhrystone benchmark program running on DME. It has turned out to be a typical homebrew cpu/os experience that makes this project so interesting to me.

    After I downloaded the c code i tried to compile it. I got an error about missing malloc (which I have not implemented), but after replacing these with two static defines the code compiled, assembled and linked without errors. I tried to run it directly (without the OS): no go, the program gets stuck. Since I am running it without the OS i cannot write to the screen so I decide that debugging this way is a no go.

    Next, I try to run it through my (work in progress) OS. It loads the code, runs it and....gets stuck (even at this point in the project, my first reaction is still 'yeah, well can't expect random c code to work like that). After some googling I tracked down a newer version of Dhrystone (version 2.1, was working with 1.1 before). This version is much nicer for two reasons: it has more verbose output telling the user what SHOULD be happening. And secondly, it's source is annotated to explain even more.

    I add some debugging printf statements and run the code again. It is only then I noticed that *nothing* is actually getting printed to the terminal (this is not always immediatley obvious as my OS is still very noisy with lots of debugging info being printed). I poke at the problem a bit but soon give up and call it a day.

    When I got back today I decided there should be no reason printf should not be working, it has been working for the last few weeks. I put in a halt() just after the first printf, run it and...nothing. The Dhrystone code looks like an ancient dialect of c to me and has fairly long and complex strings in the printf statements. I decide to comment out most of them. Run the code and now the single printf statement is being printed to the screen. Aha! so now I only need to find the offending string and then figure out how to fix it! As I am commenting in 5 or so statements at a time and testing the code, another realisation dawns on me: each added statement makes the code longer, is all the code getting loaded in by the os? A quick look with the simulator reveals that the answer is no it's not. The last part of the program (where the strings are), does not get loaded.

    My OS uses a unix-like inode scheme for a FS (based of xv6) and it uses direct blocks and indirect blocks to load files. Indirect blocks are used for larger files. Turns out the Dhrystone program is the first program that utilised my indirect block code and it didn't work. After some probing at the code, I realised a 'greater than' should be replaced with a 'greater than or equal'. Solved! the Dhrystone program is now being loaded in its entirety and each printf statement is working as it should!

    Except that the program still gets stuck. Lucky for me, it gets stuck in a VERY basic routine (Proc_5), which only assigns a character to a pre-declared memory location. And this is not working. I isolate the code and sure enough it does not work: it seems as my asm/linker is not correctly assigning the addresses for the pre-defined char location. So I dive back into the assembler. Now, this thing has been working for the last couple of months, so that code is only a little familiar to me. I put in some breakpoints (yay for debugging scripting languages) and find out that no memory is being reserved at all for statements with a size of 1 byte! The assembler essentially skips them. Few lines of code later and the assembler now handles odd byte sizes (any odd number of bytes would have tanked) correctly in its bss section.

    I recompile Dhrystone and presto, it no longer gets stuck. Operation succes. Dhrystone 1.1 runs fine as well with the assembler fix. So in trying to get a piece of c code to run, I had to fix a bug in my OS, fix a bug in my assembler/linker and had to implement a missing instruction (unsigned greater than or equal) in my compiler, that I...

    Read more »

View all 4 project logs

Enjoy this project?



A Cern wrote 04/27/2019 at 22:31 point

How did you implement paging? I found what I think is the code that does it in your cpu.v file (lines 296 to 334)? Could you give an overview of how the block of code accomplishes paging? Thanks for your time.

  Are you sure? yes | no

ErwinM wrote 04/29/2019 at 14:37 point

My answer was getting longer and longer as I typed, so I decided to write up a log on it. Let me know if you have any more questions. You can find the log here:

  Are you sure? yes | no

A Cern wrote 05/28/2019 at 21:26 point

Wow, thanks a lot! I really appreciate the write up! You won't believe how many times I kept checking my feed and private messages hoping to get a notification of your possible reply (I think notifications fail in this regard). Today, I decided to come back here and just try harder to glean what I could. Imagine my excitement to see the writeup. Thanks a bunch!

  Are you sure? yes | no

A Cern wrote 04/27/2019 at 19:51 point

Would you consider making a video of the system in action?

  Are you sure? yes | no

f4hdk wrote 10/01/2017 at 04:55 point

I like the project. Retargeting a compiler to a new CPU architecture is not an easy task, at least for me. I would really like to know how to do it.

And the OS seems also quite complete. Congrats!

Have you seen my FPGA computer project? It is much less complex and performant than yours, but it has a VGA graphic output.

  Are you sure? yes | no

ErwinM wrote 10/03/2017 at 16:57 point

retargetting LCC is hard at first, then suddenly you are done. At least that is how it was for me. Although I must admit when I started reading the book I hardly understood anything at first. 

I really like your project! It involves actual wires which impresses me immediately. Also, I would also like to have played with VGA, but my board didn't come with a connector. 

You did all the design yourself. Did you let yourself at least be inspired by existing designs? 

  Are you sure? yes | no

f4hdk wrote 10/03/2017 at 18:55 point

Thanks for the “like”. For my A2Z CPU, my goal was to invent all the design from scratch, my own way. The design mainly comes from my own thoughts. I took the project like a high-school exam/problem: “you have to build a computer from scratch, you have 2 years, it starts now”… So I have implemented my own ideas, and I had rarely a look at existing architectures. The result is of course not optimal, but it works fine!
My project started in my mind exactly like yours: after the discovery of the 2 TTL-CPU projects Magic-1 and Big-Mess-O-wires, and the webring. I was immediately fascinated.

About retargeting a compiler, maybe I will try that later. Sure, I will read the book that you mention. But my list of future projects is already full for many years. 

About VGA with your FPGA-board, this is clearly feasible! You just need to add a few components on an additional board (veroboard) connected to the GPIO port. If you know how to solder, I can explain how to do it with simple schematics. Or you can buy a VGA-board like this and connect it with jumper wires.

  Are you sure? yes | no

Dr. Cockroach wrote 09/28/2017 at 11:02 point

Wow :-)

  Are you sure? yes | no

Ed S wrote 09/27/2017 at 17:11 point

You ported LCC - excellent step up for homebrew CPU.  About "Elements of computer science" - do you mean the 1973 book (Estes and Ellis) or the 1996 one (Jha, P. K. Mahanti, Sahoo)?

  Are you sure? yes | no

ErwinM wrote 09/28/2017 at 07:49 point

Turns out, I was actually talking about "The Elements of Computing Systems" by Noam Nisan and Shimon Schocken. It's a teach-yourself-by-doing book. I highly recommend it.

  Are you sure? yes | no

Ed S wrote 09/28/2017 at 09:04 point

Ah yes - that's the one associated with Nand2Tetris, always seemed like a great idea although I haven't read it.

  Are you sure? yes | no

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates