
PDP - Processor Design Principles

Distilling my experience and wisdom about the architecture, organisation and design choices of my CPUs

Here I try to gather and organise the key elements that guide my choices when I design a processor. This is more fundamental and thorough than #AMBAP, since the AMBAP principles follow from the present ones, which I hadn't explicitly examined until now.

This might as well serve as an introductory course in CPU architecture, for curious readers or those who consider designing their own. I try to keep the perspective as wide as possible, and not simply follow what classic texts (such as P&H) perpetuate. Whenever I encounter a design decision, I write a log here. With time, it could serve as a guideline, or a "cookbook" for one's own CPU, whether you follow my advice or not. It is a recollection of my experiences as well as a checklist of fundamental aspects to keep in mind when choosing a feature or characteristic.

Feel free to add suggestions, subjects or details; there is so much to cover!

I must emphasize: this is not a "general introduction" to CPUs or their design, but an analysis of how I design MY processors:

Note: these are "general purpose" cores, not DSP or special-purpose processors, so other "hints" and "principles" would apply in special cases.

I list these "principles" because:

  • I encourage people to follow, or at least examine and comment on, this advice and these heuristics and tricks. I have learned from many people and their designs, and here I contribute back.
  • They help people understand why, in some of my designs, I chose to do something one way and not another. I have developed my own "style" and I think my designs are distinctive (if not the best, because they are mine, hahaha.)
  • It's a good "cookbook" from which to pull some tricks, examine them and maybe enhance them. It's not a Bible that is fixed forever, and of course this also means that it can't be exhaustive. I document and explain how and why to do this, and not that, along the way...
  • I started to scratch the surface with #AMBAP: A Modest Bitslice Architecture Proposal but it lacked some depth and background. AMBAP is an evolution of all these basic principles.
  • The 80s-era "canonical RISC" structure needs a long-awaited refresh!
  • This can form the foundation for lectures, lessons and all kinds of workshops and educational activities.

Hope you like it :-)


Logs:
1. Use binary, 2s complement numbers
2. Use registers.
3. Use bytes and powers-of-two
4. Your registers must be able to hold a whole pointer.
5. Use byte-granularity for pointers.
6. Partial register writes are tricky.
7. PC should belong to the register set
8. CALL is just a MOV with swap
9. Status registers...
10. Harvard vs Von Neumann
11. Instruction width
12. Reserved opcode values
13. Register #0 = 0
14. Visible states and atomicity
15. Program Position Independence
16. Program re-entrancy
17. Interrupt vector table
18. Start execution at address 0
19. Reset values
20. Memory granularity
21. Conditional execution and instructions
22. Sources of conditions
23. Get your operands as fast as possible.
24. Microcode is evil.
25. Clocking
26. Input-Output architecture
27. Tagged registers
28. Endians
29. Register windows
30. Interrupt speed
31. Delay slots
32. Reset values (bis)
33. TTA - Transfer-Triggered Architectures
34. Divide (your code) and conquer
35. How to design a (better) ALU
.
.
(some drafts are pending completion)

byte_6809_articlesx3.pdf

The design of the 6809

Adobe Portable Document Format - 543.33 kB - 02/26/2018 at 23:43


  • How to design a (better) ALU

    Yann Guidon / YGDES, 04/12/2018 at 23:34

    As @Martian created his new architecture #Homebrew 16-bit CPU and Computer, he credited this PDP project for ideas and guidance. In return, I take the opportunity to comment on his ALU, which contains typical "misunderstandings" that I have seen in many other "amateur" designs. They are harmless in a software simulation, but their hardware implementation would be significantly slower, larger and less efficient than what can be done. So this is not at all a criticism of Martian, but a little lesson in logic design which, by the way, explains HOW and WHY my own CPUs are organised in their precise ways.

    Note: ALU design is already well covered at http://6502.org/users/dieter/ which is a MUST READ, but it tends to focus on particular technologies, so here I'll cover some required basics.

    At least, Martian's ALU is very easy to understand: there is one big MUX that selects between the results of many computations, with one input for every opcode. It's large, the many inputs imply a very large fan-in, and there is quite a lot of redundancy. See https://hackaday.io/project/131983/log/143820/


    So the name of this game is to factor operations.


    I'll simply start with the elephant in the room : addition and subtraction.

    In typical code (think C/Verilog/VHDL), you would write something like :

    if opcode = ADD then
        result <= A + B;
    else
        result <= A - B;
    end if;

    Which synthesises into one ADD block, one SUB block and a 2-input MUX. This would be OK if the opcode signal arrived late, but in this case the opcode is the first thing we know in the pipeline. The cost is pretty high, because ADD and SUB are actually very similar circuits, so only one is really required.

    The key is to remember that in 2s-complement :

    C = A - B = A + (-B) = A + (~B) + 1


    so if you want to compute C with a single adder, you need to

    • invert all the bits of B (a row of XORs does the trick)
    • increment the result (that is the "+ 1")

    Oh wait, that's another adder added to our critical datapath... But sweat not: since the increment is only by 1, there is another trick: set the CARRY IN bit of the adder.

    Tadah, you have it : a unit that contains only one carry chain and has the same critical datapath as a MUX-based ADD/SUB, but with

    • almost half the parts count / surface / power requirement / etc.
    • half the fan-in (because only one ADD is fed)
    • a bit more speed

    And this is just the first of many tricks in the book...
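    To make this concrete, here is a minimal VHDL sketch of such a shared ADD/SUB datapath (an 8-bit toy example; the entity and signal names are mine, for illustration, and not taken from any of my cores):

        library ieee;
        use ieee.std_logic_1164.all;
        use ieee.numeric_std.all;

        -- 8-bit ADD/SUB with a single carry chain:
        -- A - B is computed as A + (not B) + 1, the "+1" coming from the carry-in.
        entity addsub8 is
          port (
            a, b : in  unsigned(7 downto 0);
            sub  : in  std_logic;            -- '0': r = a + b, '1': r = a - b
            r    : out unsigned(7 downto 0)
          );
        end entity addsub8;

        architecture rtl of addsub8 is
          signal b_x : unsigned(7 downto 0);
          signal cin : unsigned(0 downto 0);
        begin
          -- the row of XORs: invert B only when subtracting
          xor_row : for i in 7 downto 0 generate
            b_x(i) <= b(i) xor sub;
          end generate xor_row;

          cin(0) <= sub;           -- the carry-in provides the "+1" of the 2s complement
          r      <= a + b_x + cin; -- only one carry chain in the datapath
        end architecture rtl;

    The synthesiser now sees a single carry chain; choosing between ADD and SUB costs one XOR per bit plus the carry-in.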


    Another unit in the diagram is a "comparison" block. Which is... well, usually, a subtractor. The carry-out and/or the sign bit of the result will indicate the relative magnitude. But we already have a subtractor, right ? Just execute a SUB instruction and don't write the result back. That's it.


    OK so now there is this row of XORs in the critical datapath. Is it a curse or a blessing ?

    Actually it is welcome, because it helps to factor operations other than ADD/SUB. In particular, some boolean operations require one operand to be inverted: ANDN, XORN, ORN.

    So you only need to implement OR, XOR and AND, and you save even more units (and you reduce the fan-out).


    But wait, it doesn't stop there ! There is even more factoring to do if you are brave enough to implement your own adding circuit. And I discovered this trick in one of the historical FPGA CPUs more than 20 years ago.

    A typical adder uses a "carry-propagate" circuit where two input bits of equal weight are combined to create a "generate carry" signal and a "propagate carry" signal. Both are the result of a very simple boolean operation :

    • Generate is created by AND
    • Propagate is created by XOR

    See what I see ?

    These operations can be factored and shared between the adder unit and the boolean unit. The only gate to implement separately is OR.
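    Here is a bit-slice sketch of that sharing, again in VHDL, with a hypothetical 2-bit opcode encoding of my own (not YGREC8's actual one):

        library ieee;
        use ieee.std_logic_1164.all;

        -- One ALU bit-slice: the AND and XOR gates that produce the carry-generate
        -- and carry-propagate signals double as the boolean unit.
        -- Only the OR gate is added for the boolean operations.
        entity alu_slice is
          port (
            a, b   : in  std_logic;
            cin    : in  std_logic;
            op     : in  std_logic_vector(1 downto 0); -- "00"=AND "01"=XOR "10"=OR "11"=ADD
            result : out std_logic;
            cout   : out std_logic
          );
        end entity alu_slice;

        architecture rtl of alu_slice is
          signal g, p, o, s : std_logic;
        begin
          g <= a and b;             -- carry generate, reused as the AND result
          p <= a xor b;             -- carry propagate, reused as the XOR result
          o <= a or b;              -- the only gate dedicated to the boolean unit
          s <= p xor cin;           -- sum bit of the adder
          cout <= g or (p and cin); -- ripple carry out

          with op select result <=
            g when "00",
            p when "01",
            o when "10",
            s when others;
        end architecture rtl;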


    I have implemented these tricks in the ALU of #YGREC8  at the gate level and it works very well. Look at the VHDL source code and the logs...


  • Divide (your code) and conquer

    Yann Guidon / YGDES, 02/26/2018 at 23:00

    This log is about a particular microkernel approach.

    One important result of my explorations is to keep code small and well defined, and to turn the GNU HURD knob all the way up to 11: "everything is a server". This has been studied with the #YASEP Yet Another Small Embedded Processor and resulted in the creation of the IPC/IPR/IPE trio of opcodes.

    Ideally, any program would be split into chunks of about 64KB of instructions. The only way to communicate is to call the IPC instruction (InterProcess Call) with the ID of the called process and an offset into its code. This call is validated by the IPE (InterProcess Entry) instruction on the called side, which saves the caller's information in registers so the called code can accept or deny execution of the following code. IPR (InterProcess Return) is called to return to the caller's context.

    This is a clear separation between code domains, which should help prevent both the Spectre and ROP (Return Oriented Programming) families of attacks. It might look a bit slower than the "direct call" approach (as used in Linux "because it's faaaaast"), but when well implemented it shouldn't be much slower than a CALL within a process. It just exposes the cost of jumping from one piece of code to another, without microcode or stuff like that.

    So I advocate the following: split your code into small bits, for security and safety. Make them communicate with each other through protected calls to prevent mischief. This also promotes maintainability, modularity and code reuse, in an age of bloatware and executable programs that weigh many megabugs, I mean, megabytes...

    Of course, this requires special instructions and a total rewrite of the operating system. Which means that current applications must be reviewed and there is no direct path to port GNU/Linux stuff. OTOH, the POSIX model offers little intrinsic security...

    Another aspect of this: separate processes with private data spaces need a SAFE, efficient, simple and flexible way to exchange blocks of data. This is a subject for a future log...

  • TTA - Transfer-Triggered Architectures

    Yann Guidon / YGDES, 02/25/2018 at 12:10

    We can find some TTA processors on Hackaday.io, such as the #TD4 CPU, so the idea has a low barrier to entry. It was also once considered for the early F-CPU project. Some commercial processors exist as well, such as Maxim's MAXQ.

    One advantage I see: along with a VLIW approach, it could potentially handle out-of-order execution inherently, with low overhead. However, code density might not be good enough, and people are not used to TTA idioms, so on-the-fly binary translation from more classic RISC instructions sounds like a nice compromise. But this raises complexity and increases latency, for an unknown gain in ILP, so a simple RISC (even superscalar) would work better in the beginning.

    I also think that the system presented in https://web.archive.org/web/20071013182106/http://byte.com/art/9502/sec13/art1.htm is pretty inefficient... I object to the very large crossbar bus, which adds significant load (and surface), slowing the whole thing down (been there with FC0). There is also some redundancy, even in a simple case where you want to perform an addition: the three instructions carry the meaning of "addition" three times, in the three addresses of the source and destination registers. An OTA (operation-triggered architecture) mentions addition only once, in the opcode...

    TTA makes sense in a VLIW system, with some clever partitioning (to reduce bus complexity and fan-out/fan-in), which is already inherent in OTA architectures. This is why YASEP and YGREC are somehow a mix of OTA (for basic operations) and TTA (for memory and instruction control): they use the best approach in each case where it makes sense.

    Oh, and how do you handle faults, context swaps, and register backup/restore in general ?

  • Reset values (bis)

    Yann Guidon / YGDES, 02/23/2018 at 12:29

    Someone suggested the following to me:

    "When initializing the registers after a hardware reset, try to have the CPU hardware revision number in one of the registers. This way, if some instructions are missing or buggy in the CPU you have a fair chance of getting around this by software."

    I much prefer to have the core's type, revision, capabilities, etc. stored in ROM in the Special Registers (or IO) space, because then you don't have to reset the CPU to get the information :-)

  • Delay slots

    Yann Guidon / YGDES, 02/22/2018 at 05:48

    Delay slots, or delayed jumps, are one of those neat tricks you can use for a fixed architecture, and they become a nightmare when the architecture changes. That's why they helped MIPS processors take off, while the Alpha avoided them (which was wise).

    So yes, it's a pretty cool trick when you have a canonical RISC pipeline with single-issue. Otherwise, stay away.

  • Interrupt speed

    Yann Guidon / YGDES, 02/22/2018 at 05:44

    Swapping the register set has always been a concern, for various reasons... but yes, mainly because of speed.

    Some architectures have windowed registers (SPARC has one kind, the TMS9900 has a different one). Each creates its own kind of issues.

    Some have two or more banks (DSPs often have two sets for almost instant IRQ handling).

    Some just prefer the slow way, or even microcoded operations.

    Some just don't bother and let the tedious IRQ work be done by smaller, nimbler, better adapted companion processors: the recent ARM "big/little" and "little/big" approaches, or simply the CDC6600/CDC7600's PPs (Peripheral Processors), delegate the tedious tasks so the main CPU can concentrate on the hard work (which also simplifies it, by the way).

    F-CPU's FC0 introduced the SRB system : the "Smooth Register Backup" spies on the register set and performs the transition in the background. But it's still not ideal.

    Traps are annoying as well, but context switches also occur when sending data to a different process : this is actually the real speedbump if you listen to the microkernel people. Then, different mitigation systems are required...

    But overall, don't focus too much on this, because CPUs waste so much time on so many different things!

  • Register windows

    Yann Guidon / YGDES, 02/22/2018 at 05:17

    SPARC uses register windows to provide a bunch of fresh registers across function calls. They were touted as a very RISC thing, and history has shown that they were not the best idea overall. So yeah, forget about them as-is, because they only move the actual problems to a place where KISS no longer works.

    Instead, why not just map more than one data register to memory for each address register ? (see Memory-mapped registers in the F-CPU project)

  • Endians

    Yann Guidon / YGDES, 02/22/2018 at 04:54

    Little Endian has won.

    Yet, be ready to swap bytes...
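    A byte-swap unit is fortunately just wiring; here is a hypothetical 32-bit sketch in VHDL:

        library ieee;
        use ieee.std_logic_1164.all;

        -- Reverse the byte order of a 32-bit word (little-endian <-> big-endian).
        entity bswap32 is
          port (
            d : in  std_logic_vector(31 downto 0);
            q : out std_logic_vector(31 downto 0)
          );
        end entity bswap32;

        architecture rtl of bswap32 is
        begin
          q(31 downto 24) <= d(7 downto 0);
          q(23 downto 16) <= d(15 downto 8);
          q(15 downto 8)  <= d(23 downto 16);
          q(7 downto 0)   <= d(31 downto 24);
        end architecture rtl;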

  • Tagged registers

    Yann Guidon / YGDES, 02/22/2018 at 04:54

    One of the tricks I included in F-CPU's FC0 was a set of flags associated with each register, holding hidden (and restorable) state about its contents.

    One of these flags is the ZERO flag, calculated each time the register is written. This works like a distributed status register.

    Another flag is a "valid" flag, used by the SRB (Smooth Register Backup), which steals cycles to save or restore the monolithic register set across thread switches or during IRQs.

    Also very interesting is an "address valid" flag, meaning that the register contains a pointer that has already been cleared by the TLB. The tag should also contain the access rights, for example to prevent a store if the page is read-only. More information can be added, such as the cache set or other implementation-specific details, which accelerates the execution of load/store instructions.

    Similarly, a flag can indicate whether a register contains a valid instruction pointer, for example for loops or function returns. Not only can it say that the TLB does not need to be checked again, it can also indicate the cache line number.

    .

    As long as you can recover this information, you can cache it. It might be erased during a context switch, a TLB invalidation, whatever... Restoring the state will add a few cycles of penalty, but everything will work just as well.
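    As an illustration, here is a toy VHDL sketch of a register file that carries only the ZERO tag (the sizes, port names and behaviour are simplified assumptions of mine, not FC0's actual design):

        library ieee;
        use ieee.std_logic_1164.all;
        use ieee.numeric_std.all;

        -- 8 x 8-bit register file with a per-register "zero" tag,
        -- recomputed on every write: a distributed status register.
        entity tagged_regfile is
          port (
            clk   : in  std_logic;
            we    : in  std_logic;
            waddr : in  unsigned(2 downto 0);
            wdata : in  unsigned(7 downto 0);
            raddr : in  unsigned(2 downto 0);
            rdata : out unsigned(7 downto 0);
            rzero : out std_logic            -- cached Z flag of the selected register
          );
        end entity tagged_regfile;

        architecture rtl of tagged_regfile is
          type reg_array is array (0 to 7) of unsigned(7 downto 0);
          signal regs  : reg_array := (others => (others => '0'));
          signal zeros : std_logic_vector(0 to 7) := (others => '1');
        begin
          process (clk)
          begin
            if rising_edge(clk) then
              if we = '1' then
                regs(to_integer(waddr)) <= wdata;
                -- the tag is computed once, at write time, not when a branch needs it
                if wdata = 0 then
                  zeros(to_integer(waddr)) <= '1';
                else
                  zeros(to_integer(waddr)) <= '0';
                end if;
              end if;
            end if;
          end process;

          rdata <= regs(to_integer(raddr));
          rzero <= zeros(to_integer(raddr));
        end architecture rtl;

    The branch logic can then test the cached bit directly, instead of recomputing a comparison against zero.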

  • Input-Output architecture

    Yann Guidon / YGDES, 02/22/2018 at 04:15

    In the wild, you will find two approaches, best illustrated by the Motorola vs Intel debate.

    • Motorola and their ilk map peripherals into the memory space. Typically you end up with a single address-decoding logic ("glue" chips) and a pretty wide variety of granularities.
    • The Intel school has a dedicated IO space, accessed with a few dedicated instructions.

    In the 70s and 80s, a separate IO space would ease decoding, at the cost of more IO pins on the CPU.

    In the 90s and 2000s, well, memory became a black art, then PCI arrived, so the mess got much worse.

    I design a dedicated "space" to separate differing resources because they have different requirements : latency, speed, bandwidth, granularity, protection/safety, ordering, restartability...

    • Memory can be weakly ordered and optimised for bandwidth; it uses various cache levels and has a coarse granularity for protection. Usually there is one main area of memory, maybe split among several homogeneous banks. You usually move cache-line-wide chunks of data in interleaved transactions...
    • IO can have many uses, from controlling internal CPU resources (such as the TLB, protection settings or essential peripherals) to exchanging data with other, more or less dependent, units such as other CPUs or coprocessors... You need clear and clean execution where access rights are immediately evaluated, with maybe some latency, but no speculative execution and no risk of re-executing the instruction after a trap (for example), because this would mess with the environment.

    I use IN and OUT instructions to access anything that is not related to data storage. This is more or less equivalent to Intel's MSRs, introduced with the Pentium 25 years ago. Semaphores, synchronisation, interrupt management, debug, profiling... can only work with word-wide accesses and fine-grained rights. This allows capability-based (or whitelist, or object-based) rights management; for example, each peripheral could be accessed only by a given thread ID. Of course this also greatly simplifies the memory system, because you don't rely on certain properties there: they are relegated to a dedicated channel.
