PDP - Processor Design Principles

Distilling my experience and wisdom about the architecture, organisation and design choices of my CPUs

Here I try to gather and organise key elements that guide my choices when I design a processor. This is more fundamental and thorough than #AMBAP, since the AMBAP principles follow from the present principles, which I hadn't explicitly examined yet.

This might as well serve as an introductory course in CPU architecture, for curious readers or those who consider designing their own. I try to keep the perspectives as wide as possible, and not follow what classic texts (such as P&H) perpetuate. Now, whenever I encounter a design decision, I write a log here. With time, it could serve as a guideline, or a "cookbook" for one's own CPU, whether they follow my advice or not. This is a recollection of my experiences as well as a sort of checklist of fundamental aspects to keep in mind when choosing a feature or characteristic.

Feel free to add suggestions, subjects or details, there is so much to cover!

Thanks to @Morning.Star for drawing the project's avatar :-) Of course the current source is missing, but it's on purpose. It's a reminder that a nice looking idea can be badly engineered and you have to try it to see how or why it doesn't work. This picture is the simplest expression of this principle :-)

I must emphasize: this is not a "general introduction" to CPUs or their design but a general analysis of how I design MY processors.

Note: these are "general purpose" cores, not DSP or special-purpose processors, so other "hints" and "principles" would apply in special cases.

I list these "principles" because :

  • I encourage people to follow, or at least examine and comment on, this advice and these heuristics and tricks. I have learned from many people and their designs, and here I contribute back.
  • They help people understand why I chose to do something in a certain way and not another in some of my designs. I have developed my own "style" and I think my designs are distinctive (if not the best because they are mine, hahaha.)
  • It's a good "cookbook" from which to pull some tricks, examine them and maybe enhance them. It's not a Bible that is fixed for ever. Of course this also means that it can't be exhaustive. I document and explain how and why this, and not that, along the way...
  • I started to scratch the surface with #AMBAP: A Modest Bitslice Architecture Proposal but it was lacking some depth and background. AMBAP is an evolution of all those basic principles.
  • The 80s-era "canonical RISC" structure needs a long-awaited refresh !
  • This can form the foundations for lectures, lessons, all kinds of workshops and generally educative activities.

Hope you like it :-)

1. Use binary, 2s complement numbers
2. Use registers.
3. Use bytes and powers-of-two
4. Your registers must be able to hold a whole pointer.
5. Use byte-granularity for pointers.
6. Partial register writes are tricky.
7. PC should belong to the register set
8. CALL is just a MOV with swap
9. Status registers...
10. Harvard vs Von Neumann
11. Instruction width
12. Reserved opcode values
13. Register #0 = 0
14. Visible states and atomicity
15. Program Position Independence
16. Program re-entrancy
17. Interrupt vector table
18. Start execution at address 0
19. Reset values
20. Memory granularity
21. Conditional execution and instructions
22. Sources of conditions
23. Get your operands as fast as possible.
24. Microcode is evil.
25. Clocking
26. Input-Output architecture
27. Tagged registers
28. Endians
29. Register windows
30. Interrupt speed
31. Delay slots
32. Reset values (bis)
33. TTA - Transfer-Triggered Architectures
34. Divide (your code) and conquer
35. How to design a (better) ALU
36. I have a DWIM...
37. Coroutines
38. More condition codes
(some drafts are pending completion)


The design of the 6809

Adobe Portable Document Format - 543.33 kB - 02/26/2018 at 23:43


  • How to name opcodes ?

    Yann Guidon / YGDES, 12/29/2018 at 20:03

    Oh, what a wide ranging subject...

  • More condition codes

    Yann Guidon / YGDES, 12/27/2018 at 09:36

    Condition codes are notoriously bad. Log 21 ("Conditional execution and instructions") doesn't even scratch the surface of the subject, but my latest "trick" is worth a whole log.

    During private discussions with @Drass about memory granularity, I realised that unaligned access was probably handled in the YASEP with the wrong perspective. When inserting or extracting a 16-bit word in a 32-bit architecture, the instruction would set the Carry flag to indicate when the extracted sub-word overlaps the natural memory word boundary. This way the program can detect if/when more code must be executed to fetch the remaining byte from the next word.

    It's a harmless little kludge and it works in theory. Unfortunately it is not scalable. It wouldn't work with #F-CPU, for example.

    Enter 2018 and a new insight.

    My CPUs typically have the following condition codes or flags :

    • Carry (for add/sub/comparison)
    • Zero (the result, or the register, is cleared)
    • Sign (the MSBit of the register is the sign, useful for shifts and giggles)
    • Parity (odd/even is the LSBit of the register, quite useful as well)

    Today's unkludging is an extension of the parity bit: more flags are created that provide information about alignment before the fact, unlike the insert/extract (IE) method. It also simplifies the logic for carry generation (I don't know why I chose to modify the Carry bit in particular, and not the Parity bit).

    In summary:

    • For your 16-bit CPU, you just need the P flag.
    • For 32 bits, you also need an extended parity that is an OR of the 2 LSBits, so you would have flags P1 and P2.
    • For 64 bits: same again with the 3 LSBits, so you would have P1, P2 and P3, which you can check before any access to a Data register.

    This is "good" because:

    • Usually, to perform the same function, you have to issue an AND instruction with a bitmask (1, 3 or 7 for example), which adds some latency to the program and requires a temporary register to hold the result.
    • The results of the ORing of the LSBits can be stored in "shadow" bits that can be placed closer to the branch decision logic.
    • You can test the pointer before using it, instead of after the fact.
    • The instruction decoder could possibly trap before emitting the unaligned instruction.

    The less nice part is the increased coding space required in the instructions, to hold 1 or 2 more codes.
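As a rough illustration of the extended parity flags described above, here is a minimal Python sketch. The function names `alignment_flags` and `is_aligned` are hypothetical, and defining P3 as the OR of the 3 LSBits is an assumption extrapolated from the 32-bit case. It only models the flag values, not the hardware.

```python
def alignment_flags(value: int) -> dict:
    """Model the parity-style flags of a register value.

    P1 is the plain parity flag (the LSBit).
    P2 is the OR of the 2 LSBits (as described for 32-bit cores).
    P3 is ASSUMED to be the OR of the 3 LSBits (64-bit cores).
    """
    return {
        "P1": value & 1,
        "P2": int((value & 0b11) != 0),
        "P3": int((value & 0b111) != 0),
    }


def is_aligned(pointer: int, size_bytes: int) -> bool:
    """Check a pointer BEFORE the access, using only the flags.

    A cleared flag means the pointer is aligned for that access size.
    """
    flag_for_size = {1: None, 2: "P1", 4: "P2", 8: "P3"}[size_bytes]
    if flag_for_size is None:
        return True  # byte accesses are always aligned
    return alignment_flags(pointer)[flag_for_size] == 0


if __name__ == "__main__":
    pointer = 0x1004
    print(alignment_flags(pointer))
    print(is_aligned(pointer, 4), is_aligned(pointer, 8))
```

Without such flags, the same test costs an explicit AND with 1, 3 or 7 plus a temporary register, which is the latency the list above refers to.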

  • Coroutines

    Yann Guidon / YGDES, 08/28/2018 at 04:23

    A discussion with an old-timer reminded me of this coding technique for cooperative threads.

    Coroutines are not used anymore because C and other languages don't allow this structure, so few people bother today... Yet they are still very interesting.

    (more about them later)


  • I have a DWIM...

    Yann Guidon / YGDES, 08/06/2018 at 20:10

    There is no such thing as a DWIM.

    DWIM = Do What I Mean

    It's the classic "joke instruction" which illustrates why computers are so "unnatural" to mere humans : they can't do telepathy and they require clear, unambiguous and effective code sequences, using instructions that they already have, operating on data they can reasonably manage...

    When creating an instruction set architecture, simplicity is the rule. Unless you design an application-specific processor (such as a DSP), stick to the very basics. Don't include an instruction that requires a logic diagram you can't easily draw on a napkin, and find the lowest common denominator to prevent duplication.

    If you're familiar with the RISC methodology, this sounds obvious, but most beginners (including me...) want to include "their instruction" because they don't know how to use the existing methods. Have a look at HAKMEM and similar "programming tricks" :-)

  • How to design a (better) ALU

    Yann Guidon / YGDES, 04/12/2018 at 23:34

    As @Martian created his new architecture #Homebrew 16-bit CPU and Computer, he credited this PDP project for ideas and guidance. In return I take the opportunity to comment on his ALU, which contains typical "misunderstandings" that I have seen in many other "amateur" designs. These are harmless in a SW simulation but their hardware implementation would be significantly slower, larger and less efficient than what can be done. So this is not at all a criticism of Martian but a little lesson in logic design which, by the way, explains HOW and WHY my own CPUs are organised in their precise ways.

    Note : ALU design is already well covered there : This is a MUST READ but it might focus on particular technologies and here I'll cover some required basics.

    At least, Martian's ALU is very easy to understand: there is a big MUX that selects between many results of computations, with one input for every opcode. It's large, the many inputs imply a very large fan-in and there is quite a lot of redundancy.

    [Figure: Martian's MUX-based ALU]

    [Figure: another version]

    So the name of this game is to factor operations.

    I'll simply start with the elephant in the room : addition and subtraction.

    In typical code (think C/Verilog/VHDL), you would write something like :

    if opcode = ADD
        then result = A + B
        else result = A - B

    Which synthesises into one ADD block, one SUB block and a MUX2. This would be OK if the opcode signal came late but in this case, the opcode is the first thing that we know in the pipeline. The cost is pretty high, because ADD and SUB are actually very very similar circuits so only one is actually required.

    The key is to remember that in 2s-complement :

    C = A - B = A + (-B) = A + (~B) + 1

    so if you want to compute C, you need to

    • invert all the bits of B (a row of XOR does the trick)
    • increment the result

    Oh wait, that's another addition unit added to our critical datapath... But sweat not. Since it's only 1, there is another trick : set the CARRY IN bit.

    Tadah, you have it : a unit that contains only one carry chain and has the same critical datapath as a MUX-based ADD/SUB, but with

    • almost half the parts count / surface / power requirement / etc.
    • half the fan-in (because only one ADD is fed)
    • a bit more speed

    And this is just the first of many tricks in the book...

    Another unit in the diagram is a "comparison" block. Which is... well, usually, a subtractor. The carry-out and/or the sign bit of the result will indicate the relative magnitude. But we already have a subtractor, right ? Just execute a SUB instruction and don't write the result back. That's it.
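The single-adder ADD/SUB and the "comparison is a SUB without writeback" idea can be sketched in a few lines of Python. This is only a behavioural model under assumptions: the 16-bit width and the names `add_sub` and `compare_unsigned` are illustrative choices, not taken from any of the designs mentioned here.

```python
WIDTH = 16                      # assumed datapath width
MASK = (1 << WIDTH) - 1


def add_sub(a: int, b: int, subtract: bool):
    """One adder for both ADD and SUB.

    For SUB, a row of XORs inverts B and the carry-in is set,
    so the adder computes A + (~B) + 1 = A - B.
    Returns (result, carry_out).
    """
    invert = MASK if subtract else 0     # control input of the XOR row
    b_x = (b ^ invert) & MASK            # B or ~B
    carry_in = 1 if subtract else 0      # the "+1" costs no extra adder
    total = (a & MASK) + b_x + carry_in  # the single carry chain
    return total & MASK, total >> WIDTH


def compare_unsigned(a: int, b: int) -> bool:
    """Return True when a >= b (unsigned).

    This is a SUB whose result is thrown away: only the carry out
    is kept (carry set means no borrow occurred).
    """
    _, carry_out = add_sub(a, b, subtract=True)
    return carry_out == 1


if __name__ == "__main__":
    print(add_sub(5, 3, subtract=False))
    print(add_sub(5, 3, subtract=True))
    print(compare_unsigned(3, 5))
```

Note that `subtract` drives both the XOR row and the carry-in, so the opcode bit is used directly as a control signal instead of selecting between two full units.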

    OK so now there is this row of XORs in the critical datapath. Is it a curse or a blessing ?

    Actually it is welcome because it helps to factor other operations than the ADD/SUB. In particular, some boolean operations require one operand to be inverted : ANDN, XORN, ORN

    So you only need to implement OR XOR and AND and you save even more units (and you reduce the fanout).
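To make the factoring concrete, here is a small Python sketch of a boolean unit that reuses the same XOR row as the adder. The 16-bit width and the name `boolean_unit` are assumptions for illustration only.

```python
MASK = 0xFFFF  # assumed 16-bit datapath


def boolean_unit(a: int, b: int, op: str, invert_b: bool) -> int:
    """Only AND, OR and XOR are implemented.

    ANDN, ORN and XORN come for free by enabling the XOR row
    that already sits in front of the B operand.
    """
    b_x = (b ^ (MASK if invert_b else 0)) & MASK  # shared XOR row
    if op == "AND":
        return a & b_x
    if op == "OR":
        return (a | b_x) & MASK
    if op == "XOR":
        return (a ^ b_x) & MASK
    raise ValueError(f"unknown operation: {op}")


if __name__ == "__main__":
    a, b = 0b1100, 0b1010
    print(bin(boolean_unit(a, b, "AND", invert_b=False)))  # plain AND
    print(bin(boolean_unit(a, b, "AND", invert_b=True)))   # ANDN
```

Six operations are covered by three gate rows plus one control bit, which is where the saving in units comes from.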

    But wait, it doesn't stop there ! There is even more factoring to do if you are brave enough to implement your own adding circuit. And I discovered this trick in one of the historical FPGA CPUs more than 20 years ago.

    A typical adder uses a "carry-propagate" circuit where two input bits of equal weight are combined to create a "generate carry" signal and a "propagate carry" signal. Both are the result of a very simple boolean operation :

    • Generate is created by AND
    • Propagate is created by XOR

    See what I see ?

    These operations can be factored and shared between the adder unit and the boolean unit. The only gate to implement separately is OR.
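The sharing of Generate and Propagate between the adder and the boolean unit can be modelled as follows. This Python sketch uses a plain ripple carry for clarity and an assumed 8-bit width; the function name `shared_alu` is hypothetical and this is not the gate-level YGREC8 implementation.

```python
WIDTH = 8  # assumed width, kept small for the example


def shared_alu(a: int, b: int, carry_in: int = 0) -> dict:
    """Compute AND, XOR, OR and SUM from shared intermediate signals."""
    mask = (1 << WIDTH) - 1
    generate = a & b & mask      # doubles as the AND result
    propagate = (a ^ b) & mask   # doubles as the XOR result
    or_result = (a | b) & mask   # the only row built separately

    # Ripple-carry chain that consumes the shared G and P signals.
    carry = carry_in
    total = 0
    for i in range(WIDTH):
        g = (generate >> i) & 1
        p = (propagate >> i) & 1
        total |= (p ^ carry) << i   # sum bit
        carry = g | (p & carry)     # carry into the next bit
    return {"AND": generate, "XOR": propagate, "OR": or_result,
            "SUM": total, "CARRY": carry}


if __name__ == "__main__":
    print(shared_alu(0x0F, 0x01))
```

A real design would likely use a faster carry structure than this loop, but the point stands: the first level of the adder already produces two of the three boolean results.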

    I have implemented these tricks in the ALU of #YGREC8 at the gate level and it works very well. Look at the VHDL source code and the logs...


  • Divide (your code) and conquer

    Yann Guidon / YGDES, 02/26/2018 at 23:00

    This log is about a particular microkernel approach.

    One important result of my explorations is to keep code small and well defined and turn the knob of the GNU HURD all the way up to 11 : "everything is a server". This has been studied with the #YASEP Yet Another Small Embedded Processor and resulted in the creation of the IPC/IPR/IPE trio of opcodes.

    Ideally, any program would be split into chunks of about 64KB of instructions. The only way to communicate is to call the IPC instruction (InterProcess Call) with the ID of the called process, and an offset into the code. This call is validated by the IPE (InterProcess Entry) instruction, on the called code, which saves the caller's information in registers so the called code can accept or deny execution of the following code. IPR (InterProcess Return) is called to return to the caller's context.
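The IPC/IPE/IPR handshake can be pictured with a toy Python model. Everything here is a hypothetical sketch: the `Process` and `Machine` classes, the `allowed_callers` policy and the use of `PermissionError` as a stand-in for a trap are assumptions made for illustration, not the actual YASEP semantics.

```python
class Process:
    def __init__(self, pid, entry_points, allowed_callers):
        self.pid = pid
        self.entry_points = set(entry_points)        # offsets holding an IPE
        self.allowed_callers = set(allowed_callers)  # callee's own policy


class Machine:
    def __init__(self, processes):
        self.processes = {p.pid: p for p in processes}
        self.return_stack = []  # caller contexts, invisible to user code

    def ipc(self, caller_pid, target_pid, offset):
        """InterProcess Call: jump to (process ID, offset)."""
        target = self.processes[target_pid]
        # Assumption: the target offset must hold an IPE instruction,
        # otherwise the call traps (no jumping into the middle of code).
        if offset not in target.entry_points:
            raise PermissionError("target is not an IPE entry point")
        # IPE: the callee sees the caller's ID and accepts or denies.
        if caller_pid not in target.allowed_callers:
            raise PermissionError("callee denied the call")
        self.return_stack.append(caller_pid)
        return target_pid  # now executing in the callee's context

    def ipr(self):
        """InterProcess Return: go back to the caller's context."""
        return self.return_stack.pop()


if __name__ == "__main__":
    server = Process(pid=2, entry_points=[0x100], allowed_callers=[1])
    client = Process(pid=1, entry_points=[], allowed_callers=[])
    m = Machine([client, server])
    print(m.ipc(1, 2, 0x100), m.ipr())
```

Since the only legal landing sites are the declared entry points, arbitrary jumps into another domain's code are rejected in this model, which is the property the ROP argument relies on.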

    This is a clear separation between code domains. This should prevent both the Spectre and ROP (Return Oriented Programming) types of attacks. This might look a bit slower than the "direct call" approach (as used in Linux "because it's faaaaast") but when well implemented, it shouldn't be much slower than a CALL within a process. It just exposes the costs of jumping from some code to another, without microcode or stuff like that.

    So I advocate the following : split your code into small bits, for security and safety. Make them communicate with each other with protected calls to prevent mischief. This also promotes maintainability, modularity and code reuse, in an age of bloatware and executable programs that weigh many megabugs, I mean, megabytes...

    Of course, this requires special instructions and a total rewrite of the operating system. Which means that current applications must be reviewed and there is no direct path to port GNU/Linux stuff. OTOH, the POSIX model offers little intrinsic security...

    Another aspect to this is : separate processes with private data spaces need a SAFE, efficient, simple and flexible way to exchange blocks of data. This is a subject for a future log...

  • TTA - Transfer-Triggered Architectures

    Yann Guidon / YGDES, 02/25/2018 at 12:10

    We can find some TTA processors, such as #TD4 CPU, so the idea has a low barrier of entry. It was also once considered for the early F-CPU project. Some commercial processors, such as Maxim's MAXQ, exist.

    One advantage I see is : along with a VLIW approach, it could potentially handle OOO inherently, with low overhead. However, code density might not be good enough and people are not used to TTA idioms so on-the-fly binary translation from more classic RISC instructions sounds like a nice compromise. However this raises the complexity and increases latency, for an unknown gain in ILP, so a simple RISC (even superscalar) would work better in the beginning.

    I also think that the system presented in is pretty under-efficient... I object to the very large crossbar bus, which adds significant load (and surface) and slows down the whole design (been there with FC0). There is also some redundancy, even in a simple case where you want to perform an addition: the 3 instructions contain the meaning of addition 3 times, in the 3 addresses of the source and destination registers. OTA (operation-triggered architecture) mentions addition only once, in the opcode...
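The redundancy argument can be seen in a toy Python model that performs the same addition both ways. The port names (`add.operand`, `add.trigger`, `add.result`) and the functions `run_tta` and `run_ota` are invented for this sketch and do not describe any particular TTA machine.

```python
def run_ota(regs: dict) -> dict:
    """Operation-triggered: ADD r3, r1, r2 names the operation once."""
    regs["r3"] = regs["r1"] + regs["r2"]
    return regs


def run_tta(moves: list, regs: dict) -> dict:
    """Transfer-triggered: only MOVEs exist.

    Writing to the hypothetical 'add.trigger' port starts the addition,
    so the adder unit is named in every one of the three moves.
    """
    ports = {"add.operand": 0, "add.result": 0}
    for src, dst in moves:
        value = ports[src] if src in ports else regs[src]
        if dst == "add.trigger":
            ports["add.result"] = ports["add.operand"] + value
        elif dst in ports:
            ports[dst] = value
        else:
            regs[dst] = value
    return regs


if __name__ == "__main__":
    tta_program = [
        ("r1", "add.operand"),   # 1st mention of the adder
        ("r2", "add.trigger"),   # 2nd mention, triggers the operation
        ("add.result", "r3"),    # 3rd mention
    ]
    print(run_tta(tta_program, {"r1": 2, "r2": 3, "r3": 0}))
    print(run_ota({"r1": 2, "r2": 3, "r3": 0}))
```

Both versions leave the same value in `r3`, but the TTA program spends three instructions and three addresses that all point at the adder, where the OTA form encodes "add" once.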

    TTA makes sense in a VLIW system, with some clever partitioning (to reduce bus complexity and fanout/fanin), which is already inherent in OTA architectures. This is why YASEP and YGREC are somehow a mix of OTA (for basic operations) and TTA (for memory and instruction control), they use the best approach for each case where it makes sense.

    Oh, and how do you handle faults, context swaps, and register backup/restore in general ?

  • Reset values (bis)

    Yann Guidon / YGDES, 02/23/2018 at 12:29

    The following was suggested to me:

    "When initializing the registers after a hardware reset, try to have the CPU hardware revision number in one of the registers. This way, if some instructions are missing or buggy in the CPU you have a fair chance of getting around this by software."

    I much prefer the core's type, revision, capabilities etc. stored in ROM in the Special Registers (or IO) space because you don't have to reset the CPU to get the information :-)

  • Delay slots

    Yann Guidon / YGDES, 02/22/2018 at 05:48

    Delay slots, or delayed jumps, are one of these neat tricks you can use for a fixed architecture, and it becomes a nightmare when the architecture changes. That's why it helped MIPS processors take off, but the ALPHA avoided it (which was wise).

    So yes, it's a pretty cool trick when you have a canonical RISC pipeline with single-issue. Otherwise, stay away.

  • Interrupt speed

    Yann Guidon / YGDES, 02/22/2018 at 05:44

    Swapping the register set has always been a concern, for various reasons... but yes, mainly because speed.

    Some architectures have windowed registers (SPARC has one kind, TMS9900 has a different one). This creates some kinds of issues or others.

    Some have two or more banks (DSPs often have two sets for almost instant IRQ handling).

    Some just prefer the slow way, or even microcoded operations.

    Some just don't bother and let the tedious IRQ work be done by smaller, nimbler but better adapted companion processors : the recent ARM "big/little" and "little/big" approach, or simply the CDC6600/CDC7600 PP (Peripheral Processors) delegate the tedious tasks and concentrate on the hard work (thus simplifying the main CPU btw)

    F-CPU's FC0 introduced the SRB system : the "Smooth Register Backup" spies on the register set and performs the transition in the background. But it's still not ideal.

    Traps are annoying as well, but context switches also occur when sending data to a different process : this is actually the real speedbump if you listen to the microkernel people. Then, different mitigation systems are required...

    But overall, don't focus too much on this because CPUs waste so much time on so many different things !!


