PDP - Processor Design Principles

Distilling my experience and wisdom about the architecture, organisation and design choices of my CPUs

Here I try to gather and organise key elements that guide my choices when I design a processor. This is more fundamental and thorough than #AMBAP, since the AMBAP principles follow from the present principles, which I hadn't explicitly examined yet.

This might as well serve as an introductory course in CPU architecture, for curious readers or those who consider designing their own. I try to keep the perspectives as wide as possible, and not follow what classic texts (such as P&H) perpetuate. Now, whenever I encounter a design decision, I write a log here. With time, it could serve as a guideline, or a "cookbook" for one's own CPU, whether they follow my advice or not. This is a recollection of my experiences as well as a sort of checklist of fundamental aspects to keep in mind when choosing a feature or characteristic.

Feel free to add suggestions, subjects or details, there is so much to cover!

I must emphasize: this is not a "general introduction" to CPUs or their design but a general analysis of how I design MY processors.

Note: these are "general purpose" cores, not DSP or special-purpose processors, so other "hints" and "principles" would apply in special cases.

I list these "principles" because:

  • I encourage people to follow, or at least examine, this advice and these heuristics and tricks. I have learned from many people and their designs, and here I contribute back.
  • They help people understand why I chose to do something in a certain way and not another in some of my designs. I have developed my own "style" and I think my designs are distinctive (if not the best because they are mine, hahaha.)
  • It's a good "cookbook" from which to pull some tricks, examine them and maybe enhance them. It's not a Bible that is fixed forever. Of course this also means that it can't be exhaustive. I document and explain how and why this, and not that, along the way...
  • I started to scratch the surface with #AMBAP: A Modest Bitslice Architecture Proposal but it was lacking some depth and background. AMBAP is an evolution of all those basic principles.
  • The 80s-era "canonical RISC" structure needs a long-awaited refresh!
  • This can form the foundations for lectures, lessons, all kinds of workshops and educational activities in general.

Hope you like it :-)

1. Use binary, 2s complement numbers
2. Use registers.
3. Use bytes and powers-of-two
4. Your registers must be able to hold a whole pointer.
5. Use byte-granularity for pointers.
6. Partial register writes are tricky.
7. PC should belong to the register set
8. CALL is just a MOV with swap
9. Status registers...
10. Harvard vs Von Neumann
11. Instruction width
12. Reserved opcode values
13. Register #0 = 0
14. Visible states and atomicity
15. Program Position Independence
16. Program re-entrancy
17. Interrupt vector table
18. Start execution at address 0
19. Reset values
20. Memory granularity
21. Conditional execution and instructions
22. Sources of conditions
23. Get your operands as fast as possible.
24. Microcode is evil.
25. Clocking
26. Input-Output architecture
27. Tagged registers
28. Endians
29. Register windows
30. Interrupt speed
31. Delay slots

  • Reset values (bis)

    Yann Guidon / YGDES, a day ago, 0 comments

    The following was suggested to me:

    "When initializing the registers after a hardware reset, try to have the CPU hardware revision number in one of the registers. This way, if some instructions are missing or buggy in the CPU you have a fair chance of getting around this by software."

    I much prefer to store the core's type, revision, capabilities etc. in ROM, in the Special Registers (or IO) space, because you don't have to reset the CPU to get the information :-)

  • Delay slots

    Yann Guidon / YGDES, 3 days ago, 0 comments

    Delay slots, or delayed jumps, are one of those neat tricks that work well for a fixed architecture and become a nightmare when the architecture changes. That's why they helped MIPS processors take off, while the Alpha avoided them (which was wise).

    So yes, it's a pretty cool trick when you have a canonical RISC pipeline with single-issue. Otherwise, stay away.
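
To make the trick concrete, here is a minimal sketch of a toy single-issue machine with one branch delay slot. The instruction tuples, the `run` function and the register names are all hypothetical, invented for this illustration only. With the delay slot enabled, the instruction that follows a jump always executes before the jump takes effect.

```python
def run(program, delay_slot=True, max_steps=100):
    """Run a toy program; return (registers, list of executed addresses)."""
    regs = {"r1": 0, "r2": 0}
    pc = 0
    pending = None  # jump target waiting for the delay slot to execute
    trace = []
    steps = 0
    while pc < len(program) and steps < max_steps:
        op, *args = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending is not None:
            # The current instruction is the delay slot: the jump lands after it.
            next_pc = pending
            pending = None
        if op == "addi":
            reg, imm = args
            regs[reg] += imm
        elif op == "jmp":
            (target,) = args
            if delay_slot:
                pending = target      # takes effect one instruction later
            else:
                next_pc = target      # takes effect immediately
        elif op == "halt":
            break
        pc = next_pc
        steps += 1
    return regs, trace


program = [
    ("jmp", 3),          # 0: jump to the halt
    ("addi", "r1", 1),   # 1: sits in the delay slot
    ("addi", "r2", 1),   # 2: always skipped
    ("halt",),           # 3
]

regs_with, trace_with = run(program, delay_slot=True)
regs_without, trace_without = run(program, delay_slot=False)
```

The same binary gives different results on the two machines: instruction 1 runs only when the delay slot exists. This is why code written for one pipeline breaks when the pipeline changes.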

  • Interrupt speed

    Yann Guidon / YGDES, 3 days ago, 0 comments

    Swapping the register set has always been a concern, for various reasons... but yes, mainly because of speed.

    Some architectures have windowed registers (SPARC has one kind, TMS9900 has a different one). Each kind creates its own issues.

    Some have two or more banks (DSPs often have two sets for almost instant IRQ handling).

    Some just prefer the slow way, or even microcoded operations.

    Some just don't bother and let the tedious IRQ work be done by smaller, nimbler and better adapted companion processors: the recent ARM "big.LITTLE" approach, or simply the PP (Peripheral Processors) of the CDC6600/CDC7600. The main CPU delegates the tedious tasks and concentrates on the hard work (which also simplifies it, by the way).

    F-CPU's FC0 introduced the SRB system : the "Smooth Register Backup" spies on the register set and performs the transition in the background. But it's still not ideal.

    Traps are annoying as well, but context switches also occur when sending data to a different process : this is actually the real speedbump if you listen to the microkernel people. Then, different mitigation systems are required...

    But overall, don't focus too much on this because CPUs waste so much time on so many other things!

  • Register windows

    Yann Guidon / YGDES, 3 days ago, 0 comments

    SPARC uses register windows to provide a bunch of fresh registers across function calls. It was touted as a very RISC thing and history has shown that it was not the best idea, overall. So yeah, forget about it, as is, because it only moves the actual problems to where KISS doesn't work.

    Instead, why not just map more than one data register to memory for each address register ? (see Memory-mapped registers in the F-CPU project)

  • Endians

    Yann Guidon / YGDES, 3 days ago, 0 comments

    Little Endian has won.

    Yet, be ready to swap bytes...
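
As a small illustration of what "swapping bytes" means, here is a sketch for 32-bit words. The helper name `bswap32` is an assumption made for this example; `struct` is Python's standard library module for packing binary data.

```python
import struct


def bswap32(x):
    """Reverse the byte order of a 32-bit word."""
    x &= 0xFFFFFFFF
    return (
        ((x & 0x000000FF) << 24)
        | ((x & 0x0000FF00) << 8)
        | ((x & 0x00FF0000) >> 8)
        | ((x & 0xFF000000) >> 24)
    )


word = 0x11223344
little = struct.pack("<I", word)   # least significant byte first
big = struct.pack(">I", word)      # most significant byte first

# Reading big-endian bytes on a little-endian machine yields the swapped word.
misread = struct.unpack("<I", big)[0]
```

Here `misread` equals `bswap32(word)`, so one more swap recovers the original value.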

  • Tagged registers

    Yann Guidon / YGDES, 3 days ago, 0 comments

    One of the tricks I included in F-CPU FC0 was flags associated with each register, holding hidden (and restorable) states about the contents.

    One of these flags is the ZERO flag, calculated each time the register is written. This works like a distributed status register.

    Another flag is a "valid" flag : the SRB (smooth register backup) steals cycles to save or restore the monolithic register set across thread switches or during IRQ.

    Also very interesting is an "address valid" flag, meaning that the register contains a pointer that has already been checked against the TLB. The tag should also contain the access rights, for example to prevent a store if the page is read-only. More information can be added, such as the cache set or other architecture-specific details, which accelerates the execution of a load/store instruction.

    Similarly a flag can indicate whether a register contains a valid instruction pointer, for example for looping or function return. Not only can it say that the TLB should not be checked again, but also indicate the cache line number.


    As long as you can recover this information, you can cache it. It might be erased during a context switch, a TLB invalidation, whatever... Restoring the state will add a few cycles of penalty but it will function just as well.
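
The idea can be modelled in a few lines. This is only a behavioural sketch, not the FC0 implementation: the class `TaggedRegisterFile`, its method names and the `tlb_lookups` counter (a stand-in for a real TLB access) are all assumptions made for illustration.

```python
class TaggedRegisterFile:
    """Register file where each register carries cached, recoverable tags."""

    def __init__(self, count=16, width=32):
        self.mask = (1 << width) - 1
        self.values = [0] * count
        self.zero = [True] * count            # ZERO tag, updated on every write
        self.addr_checked = [False] * count   # pointer already checked by the TLB
        self.tlb_lookups = 0                  # counts simulated TLB accesses

    def write(self, r, value):
        value &= self.mask
        self.values[r] = value
        self.zero[r] = (value == 0)
        self.addr_checked[r] = False  # new contents: the old TLB result is stale

    def is_zero(self, r):
        # A conditional instruction reads the tag, no comparison is needed.
        return self.zero[r]

    def use_as_pointer(self, r):
        if not self.addr_checked[r]:
            self.tlb_lookups += 1     # slow path: consult the TLB once
            self.addr_checked[r] = True
        return self.values[r]

    def invalidate_tags(self):
        """Context switch or TLB flush: drop the tags, they can be recomputed."""
        self.addr_checked = [False] * len(self.values)


rf = TaggedRegisterFile()
rf.write(1, 0x1000)
rf.use_as_pointer(1)
rf.use_as_pointer(1)          # second use hits the cached tag
lookups_before_flush = rf.tlb_lookups
rf.invalidate_tags()
rf.use_as_pointer(1)          # tag was erased, so the TLB is consulted again
lookups_after_flush = rf.tlb_lookups
```

Losing the tags only costs one extra lookup. The register contents themselves are never affected, which is what makes this state safe to cache.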

  • Input-Output architecture

    Yann Guidon / YGDES, 3 days ago, 0 comments

    In the wild, you will find two approaches, best illustrated by the Motorola vs Intel debate.

    • Motorola and their ilk map peripherals in the memory. Typically you end up with a single address decoding logic ("glue" chips) with a pretty wide variety of granularities.
    • The Intel school has a dedicated IO space that uses a few dedicated instructions.

    In the 70s/80s, separate IO spaces would ease decoding at the cost of more IO pins on the CPU.

    In the 90s/2000s, well, memory became a black art, then PCI arrived, so the mess is much worse.

    I design a dedicated "space" to separate differing resources because they have different requirements : latency, speed, bandwidth, granularity, protection/safety, ordering, restartability...

    • Memory can be weakly ordered and optimised for bandwidth, it uses various cache levels and has a coarse granularity for protection. Usually, there is one main area of memory, maybe split among several homogeneous banks. You usually move cache-line-wide chunks of data in interleaved transactions...
    • IO can have many uses, from controlling internal CPU resources such as TLB, protection settings, essential peripherals... to exchanging data with other (more or less dependent) units such as other CPUs or coprocessors... You need a clear and clean execution where access rights are immediately evaluated, with maybe some latency, but no speculative execution or risk of re-executing the instruction after a trap (for example) because this would mess with the environment.

    I use IN and OUT instructions to access anything that is not related to data storage. This is more or less equivalent to Intel's MSR introduced with the Pentium, 25 years ago. Semaphores, synchronisation, interrupt management, debug, profiling... can only work with word-wide accesses and fine-grained rights. This allows capability-based (or whitelist, or object-based) rights management: for example, each peripheral could be accessed only by a given thread ID. Of course this also greatly simplifies the memory system because it no longer has to provide those properties, which are relegated to a dedicated channel.
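
A whitelist per port can be sketched as follows. This is a behavioural model under assumptions: the class `IOSpace`, the exception `AccessDenied`, the port number and the thread IDs are hypothetical and do not describe any existing core.

```python
class AccessDenied(Exception):
    """Raised when a thread touches a port it has no right to."""


class IOSpace:
    """Word-wide IO space where each port lists the thread IDs allowed to use it."""

    def __init__(self):
        self.ports = {}    # port number -> current 32-bit word
        self.owners = {}   # port number -> set of allowed thread IDs

    def grant(self, port, thread_id):
        self.owners.setdefault(port, set()).add(thread_id)
        self.ports.setdefault(port, 0)

    def _check(self, port, thread_id):
        # Rights are evaluated before anything happens: no side effect on failure.
        if thread_id not in self.owners.get(port, ()):
            raise AccessDenied(f"thread {thread_id} cannot access port {port:#x}")

    def out(self, thread_id, port, value):
        self._check(port, thread_id)
        self.ports[port] = value & 0xFFFFFFFF

    def inp(self, thread_id, port):
        self._check(port, thread_id)
        return self.ports[port]


io = IOSpace()
io.grant(0x10, thread_id=1)
io.out(1, 0x10, 0xABCD)

denied = False
try:
    io.out(2, 0x10, 0)      # thread 2 was never granted this port
except AccessDenied:
    denied = True
```

The rejected access leaves the port untouched, which matches the requirement that an IO instruction must either complete exactly once or have no effect at all.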

  • Clocking

    Yann Guidon / YGDES, 3 days ago, 0 comments

    How should your core be clocked ?

    • DFFs (D flip-flops) are practical and very clean but take some silicon space.
    • Transparent latches use half the transistors but twice the routing resources because you now need two clock networks that MUST not have jitter or phase noise (ask Seymour Cray about this, when designing the Cray 2 or the Cray 3).

    So yeah, it depends. My approach is to design with classic DFFs and a 4-stage pipeline to allow an easy transformation to 4-phase clocking, which has some advantages when you can control your technology very tightly.

  • Microcode is evil.

    Yann Guidon / YGDES, 4 days ago, 3 comments

    Forget about microcode.

    Microcode is CISC.

    Microcode is like a computer in a computer: it increases the complexity of the whole system, slows everything down, makes testing miserable, and commits many other "sins" that RISC addresses.

    A direct mapping of the instruction word to the datapath is the best way to have a simple and efficient ISA.

  • Get your operands as fast as possible.

    Yann Guidon / YGDES, 7 days ago, 0 comments

    Looking back at the 6809's manual, I am now appalled by the elaborate addressing modes. No wonder CISC almost died !

    In my designs, there is another rule: provide the operands as fast as possible to the execution units. Keep the path as short as possible, and uninterrupted, from the instruction decoder to the ALU. This means that only two types of data are encoded in the instructions:

    • Literal data, sometimes shifted and/or sign-extended. The latency is a few gates and some fanout.
    • Register values : just read the register set (some gates of latency, a bit more than literal data but not so much).

    Two other sources of data are flags/condition codes, and In/Out port values but they are treated separately.

    Anyway, when done right, your units get data to process one cycle after you got your instruction word.

    Don't waste time with memory, it's a mess. Careful coding with a register-mapped memory system should hide some of the latency. Indexed or indirect addressing modes, and other crazy systems, slow everything down. Traps become a trainwreck. Main memory is the enemy. KISS!

    Of course, reading a data memory register might stall. There is no absolute and perfect way around slow memory. But keep as much data as possible in your registers so they can be addressed almost immediately. The decoder can speculatively decode the register numbers and read corresponding words of data before you have checked a pointer's validity.
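
The two operand sources can be sketched with a made-up 16-bit instruction format. This format is an assumption for illustration only, not one of my actual ISAs: bit 11 selects a literal, bits 10..8 name the destination register, and bits 7..0 hold either a signed 8-bit literal or, in bits 2..0, a source register number.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def fetch_operands(instr, regs):
    """Return (destination number, first operand, second operand)."""
    dst = (instr >> 8) & 0x7
    is_literal = (instr >> 11) & 1
    # Both candidates are prepared in parallel, straight from the instruction bits.
    literal = sign_extend(instr & 0xFF, 8)
    reg_value = regs[instr & 0x7]        # speculative register read
    # A single multiplexer then picks the one the instruction asked for.
    src = literal if is_literal else reg_value
    return dst, regs[dst], src


regs = [0, 10, 20, 30, 40, 50, 60, 70]

literal_form = (1 << 11) | (2 << 8) | 0xFE   # destination r2, literal -2
register_form = (3 << 8) | 5                 # destination r3, source r5

ops_literal = fetch_operands(literal_form, regs)
ops_register = fetch_operands(register_form, regs)
```

No memory access, no address computation and no sequencing is involved: each operand is a few bit-field extractions away from the instruction word.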

