
PDP - Processor Design Principles

Distilling my experience and wisdom about the architecture, organisation and design choices of my CPUs

Here I try to gather and organise the key elements that guide my choices when I design a processor. This is more fundamental and thorough than #AMBAP, since the AMBAP principles follow from the present ones, which I hadn't explicitly examined until now.

This might as well serve as an introductory course in CPU architecture, for curious readers or those who consider designing their own. I try to keep the perspective as wide as possible, and not simply follow what classic texts (such as P&H) perpetuate. Whenever I encounter a design decision, I write a log here. With time, it could serve as a guideline, or a "cookbook" for one's own CPU, whether readers follow my advice or not. This is a recollection of my experiences as well as a sort of checklist of fundamental aspects to keep in mind when choosing a feature or characteristic.

Feel free to add suggestions, subjects or details, there is so much to cover!

Thanks to @Morning.Star for drawing the project's avatar :-) Of course the current source is missing, but it's on purpose. It's a reminder that a nice looking idea can be badly engineered and you have to try it to see how or why it doesn't work. This picture is the simplest expression of this principle :-)


See also #POSEVEN (ex "NPM - New Programming Model")


I must emphasize: this is not a "general introduction" to CPUs or their design, but a general analysis of how I design MY processors:

Note: these are "general purpose" cores, not DSP or special-purpose processors, so other "hints" and "principles" would apply in special cases.

I list these "principles" because :

  • I encourage people to follow, or at least examine and comment on, this advice, these heuristics and tricks. I have learned from many people and their designs, and here I contribute back.
  • They help people understand why, in some of my designs, I chose to do something one way and not another. I have developed my own "style" and I think my designs are distinctive (if not the best, because they are mine, hahaha).
  • It's a good "cookbook" from which to pull some tricks, examine them and maybe enhance them. It's not a Bible that is fixed forever. Of course this also means that it can't be exhaustive. I document and explain how and why this, and not that, along the way...
  • I started to scratch the surface with #AMBAP: A Modest Bitslice Architecture Proposal but it was lacking some depth and background. AMBAP is an evolution of all those basic principles.
  • The 80s-era "canonical RISC" structure needs a long-awaited refresh !
  • This can form the foundations for lectures, lessons, all kinds of workshops and generally educational activities.

Hope you like it :-)


Logs:
1. Use binary, 2s complement numbers
2. Use registers.
3. Use bytes and powers-of-two
4. Your registers must be able to hold a whole pointer.
5. Use byte-granularity for pointers.
6. Partial register writes are tricky.
7. PC should belong to the register set
8. CALL is just a MOV with swap
9. Status registers...
10. Harvard vs Von Neumann
11. Instruction width
12. Reserved opcode values
13. Register #0 = 0
14. Visible states and atomicity
15. Program Position Independence
16. Program re-entrancy
17. Interrupt vector table
18. Start execution at address 0
19. Reset values
20. Memory granularity
21. Conditional execution and instructions
22. Sources of conditions
23. Get your operands as fast as possible.
24. Microcode is evil.
25. Clocking
26. Input-Output architecture
27. Tagged registers
28. Endians
29. Register windows
30. Interrupt speed
31. Delay slots
32. Split vs unified register set (draft).
33. Code density (draft)
34. Reset values (bis)
35. ILP - Instruction Level Parallelism (draft)
36. Stack processors (draft)
37. VLIW - Very Long Instruction Word architectures (draft)
38. TTA - Transfer-Triggered Architectures
39. Divide (your code) and conquer
40. How to design a (better) ALU
41. I have a DWIM...
42. Coroutines
43. More condition codes
44. How to name opcodes ? (draft)
45. Be ready to change your opcode map a lot, until you can't
46. The perfect ISA doesn't exist.
47. Learn and know when to stop
48. The 4 spaces
49. NAND or NOR ? A taxonomy of topologies
50. Choosing boolean instructions
51. Privileged instructions
52. Tagged registers: type
53. Carry and borrow
54. Use two separate stacks for data and control flow
55. Let the hardware do what it does best.

(some drafts are still pending completion)

mashey-on-RISC.lmth

HTML-formatted explanation of John Mashey's view of the difference between CISC and RISC.

lmth - 26.19 kB - 06/04/2021 at 22:55


byte_6809_articlesx3.pdf

The design of the 6809

Adobe Portable Document Format - 543.33 kB - 02/26/2018 at 23:43


  • Let the hardware do what it does best.

    Yann Guidon / YGDES • 12/20/2023 at 01:35 • 0 comments

    The title says it all, but a deeper explanation will shed more light on it.

    This is another point where I diverge from the Patterson & Hennessy canon, which emphasises the study of real-life code to see which opcodes are the most used. This was a major argument in the 80s when the RISC vs CISC debate raged, and we should not forget that at that time, most computers (microprocessor-based or not) were microprogrammed (let's exclude Cray's designs of course).

    Reducing the instruction set complexity also reduces the weight of the microprogram, which had almost vanished by Y2K (if you forget the x86, of course). So today, when you create your own ISA, you look at the "Quantitative Approach"'s figures and should ask yourself :

    • Does this apply to me ?
    • Is it relevant to my application or domain ?
    • What is the effect of the compiler ?
    • Haven't times changed since ?

    P&H's RISC works (and their predecessors) make a very good point against microcode, one that has been cemented over the last 40 years. But this argument has sometimes been taken to the extreme, creating weird feedback loops.

    Let's take C: it was designed on a PDP-7, then a PDP-11, and inherited some features of these platforms. In particular, the absence of a rotation operator; and there are only the AND/OR/XOR/NOT boolean operators. So what did the RISC people do ? They found that none of their benchmarks (written in "portable" C) included rotation or combined boolean operations. So SPARC and MIPS didn't get one.

    My point is : if you already have a barrel shifter, you can do rotation with little additional work. The counter-argument is "it will slow down the whole pipeline for an operation that is barely used and can be emulated with 3 instructions" (if not 4). Now, when the codepath reaches the point where rotation is important, these 3 opcodes slow the tight loop down, increase the time and energy spent moving data around (in and out of the register set, etc.) and raise the register pressure (to name a few effects).
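
    To make this concrete, here is a minimal sketch (my own illustration) of the portable-C workaround. Without a rotate opcode, the compiler expands it into the shift/shift/OR sequence discussed above, at every iteration of the loop:

        #include <stdint.h>

        /* Portable rotate-left: compiles to two shifts and an OR
           (plus masking) on ISAs without a ROT opcode. */
        static inline uint32_t rotl32(uint32_t x, unsigned n)
        {
            n &= 31;  /* keep both shift amounts defined for any n */
            return (x << n) | (x >> ((32 - n) & 31));
        }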

    OK, maybe that was not the perfect example, so let's look at the boolean operations: there is a whole world beyond AND, OR and XOR. Particularly if they are integrated in the ALU, where the adder needs one operand to be inverted to perform SUB. This inverter (already covered in other logs) can be reused to perform ANDN, ORN and XORN. A bit more fiddling gets you NAND and NOR, for almost no real penalty in clock frequency. And yet these are not implemented.

    Intel brought ANDN with the Pentium MMX and the full ROP3 set in the latest Core generations, so there is some merit to it, right ?

    But the common languages (with C at their roots) do not provide these operators, which must be inferred by the compiler, leading to the underuse of these opcodes.
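
    For instance, C has no ANDN operator, so the programmer writes the composition and hopes the compiler fuses it (which it can, e.g. into the ANDN opcode of x86's BMI1):

        #include <stdint.h>

        uint32_t clear_bits(uint32_t a, uint32_t b)
        {
            return a & ~b;   /* two C operators, one ANDN opcode when the ISA provides it */
        }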

    These are only 2 examples but more exist. Thus

    When an operation provides more options for marginal overhead, expose them at the opcode level. Languages and compilers may ignore them, but when you need them, you'll be happy. This is why I provide ANDN in the YGREC8 and the whole ROP2 set in the YASEP. The cost of these "extra features" is marginal, and the times they save your a$$ will be remembered less than the times you miss them.

    Because remember: it's a different case than the one against microcode. And don't let others tell you what you don't need.

  • Use two separate stacks for data and control flow

    Yann Guidon / YGDES • 01/11/2023 at 23:11 • 11 comments

    (to be expanded later)

    This is not just so it could run FORTH almost natively.

    This is a HUGE safety and security feature.

    It also helps performance by separating accesses to different memory regions, while restricting access to the control flow structure (reducing the attack surface).

    The call stack should be almost inaccessible to the lay program.

    More than one data stack is welcome but not necessary. Yet another type of stack could serve error handling.

    The data stacks should grow "up".
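
    As a minimal sketch (my own illustration, not any specific core), here is what the split looks like in a FORTH-style virtual machine; note that no data operation can ever touch a return address:

        #include <stdint.h>

        #define DSTACK_SIZE 64
        #define RSTACK_SIZE 64

        typedef struct {
            uint32_t dstack[DSTACK_SIZE]; int dsp;  /* data stack: operands only */
            uint32_t rstack[RSTACK_SIZE]; int rsp;  /* return stack: control flow only */
        } vm_t;

        /* The data stack grows "up", as advised above. */
        static void push(vm_t *vm, uint32_t v) { vm->dstack[vm->dsp++] = v; }
        static uint32_t pop(vm_t *vm)          { return vm->dstack[--vm->dsp]; }

        /* CALL and RETURN only ever touch the return stack: no sequence
           of data pushes and pops can redirect the control flow. */
        static void call(vm_t *vm, uint32_t ret_pc) { vm->rstack[vm->rsp++] = ret_pc; }
        static uint32_t ret(vm_t *vm)               { return vm->rstack[--vm->rsp]; }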

  • Carry and borrow

    Yann Guidon / YGDES • 11/04/2021 at 18:51 • 0 comments

    If there is anything that fueled the RISC vs CISC debates over the decades, it's the status flags, and among them, the carry flag is an even hotter topic.

    This flag has been a common feature of most architectures, yet MIPS slashed it. Considering that only assembly code would use it, this made sense, in a way, because implementing it would not have significantly improved overall performance. After all, it's a 32-bit architecture, so multi-precision is not required, right ?

    Looking at the shenanigans required to implement multi-precision additions on the YGREC8 (5 instructions and 2 branches) and the end-around carry in C for PEAC (or: "wrestling with GCC's intrinsics and compiler-specific macros"), the carry flag is not a settled matter. The POWER architecture implements a sophisticated system to get around the hardware limitations, but nobody seems to have caught on to it.

    There is a big problem as well: the high-level languages. This can't be easily fixed because this side is so deeply entrenched that no solution is in sight. We end up with hand-coded libraries for multi-precision numbers (gmp ?), non-portable macros, weird intrinsics, platform-specific code...
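
    Here is a hedged sketch of that cost in portable C: a multi-precision addition where the missing add-with-carry opcode is rebuilt with comparisons at every limb.

        #include <stdint.h>

        /* r = a + b over n 32-bit limbs, least significant limb first.
           The carry must be recomputed by comparison because portable C
           has no way to read the hardware carry flag. */
        void mp_add(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
        {
            uint32_t carry = 0;
            for (int i = 0; i < n; i++) {
                uint32_t s = a[i] + carry;
                uint32_t c = (s < carry);   /* wrap while adding the carry */
                r[i] = s + b[i];
                carry = c | (r[i] < s);     /* wrap while adding the limb  */
            }
        }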

    Platform-wise, there is no easy solution but it is not hopeless.

    Not all architectures need to avoid the carry flag, in particular microcontrollers or cores that will not grow to the point of using out-of-order execution.

    For the Y8, the opcode space is so tight that I had to ditch the add-with-carry opcode. It is replaced by a 5-opcode macro (10 bytes instead of 2) whenever needed. We'll have to deal with it.

    The YASEP has a larger opcode space and includes add-with-carry and sub-with-borrow. This is very classic here.

    The F-CPU however is a totally different beast. Not only can't it have a single carry flag (due to the split pipelines), but because it is an inherently SIMD architecture, several carries would be required. There, the idea is to use a normal register to hold the carry bit(s), using either a 2R1W form or a 2R2W form (ideally).

    • 2R2W is conceptually best, except for the need to write 2 values to the register set at the same time. Each partial carry gets its own place in the register in SIMD mode, and there is no bottleneck at the decoding stage for a single carry flag. The single opcode works in a single cycle. FC1 could write the extra data to another cluster.
      Alternatively, the opcode could be a two-cycle instruction, first delivering the sum, then the carry on the second cycle.
    • 2R1W splits the operation's results into 2 separate opcodes, while only one operation is performed. It is slower because 2 successive opcodes are required, but it is an easy addition to any existing ISA (see the sketch after this list). On the programming language side, an extra "carry" operator (probably @+ and @- ?) can do the trick.
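
    Modelled in C (a hypothetical pairing of my own, not the actual F-CPU encoding), the 2R1W route looks like this: one addition is performed, and two opcodes harvest the two halves of the result.

        #include <stdint.h>

        /* opcode 1: deliver the sum into the destination register */
        uint32_t add_lo(uint32_t a, uint32_t b) { return a + b; }

        /* opcode 2: same operands, deliver the carry into a plain
           register, so no architectural carry flag is needed */
        uint32_t add_hi(uint32_t a, uint32_t b) { return (uint32_t)((a + b) < a); }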

    There is no perfect solution. Each type of core requires a complex balance so as usual, you're the one who decides (and is to blame).

  • Tagged registers: type

    Yann Guidon / YGDES • 10/19/2021 at 02:36 • 0 comments

    Log 27. Tagged registers talked about "hidden"/"restorable"/"implicit"/caching tags.

    There is another type of register tag that is explicit and directly interacts with the ISA, the programs and their state. In this case, the goal is to move the ever-recurring size and SIMD flags of many instructions into the register set.

    This saves at least 3 bits in every instruction, so this is a great win in general-purpose processors like #F-CPU. It is less of an issue for the YASEP and YGREC8, which only deal with single-size values.

    However, the side effects easily counterbalance this easy win. Trap behaviour must be clearly defined, and consistency checks must be sprinkled all over the system.

    A simple instruction/opcode like ADD then has many behaviours, depending on hidden values : integer or FP ? scalar or SIMD ? what size ?

    The behaviour is not apparent from the opcode alone and we can't know which unit(s) to use at decoding time.

    OTOH tagging the registers could be a "cheap and dirty way" to extend an ISA while reusing opcodes with mixed data.
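
    As an illustration only (the field widths and encodings are my assumption, not a defined format), here is what such an explicit tag could look like, carrying the 3+ bits that no longer burden each opcode:

        #include <stdint.h>

        typedef struct {
            uint64_t value;
            unsigned size  : 2;   /* 0: 8 bits ... 3: 64 bits */
            unsigned is_fp : 1;   /* integer or floating-point */
            unsigned simd  : 1;   /* scalar or packed */
        } tagged_reg_t;

        /* The decoder sees only "ADD r1, r2, r3": the actual behaviour
           (unit, width, packing) comes from the tags, hence the trap
           rules and consistency checks mentioned above. */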

  • Privileged instructions

    Yann Guidon / YGDES • 10/19/2021 at 01:48 • 0 comments

    Don't have instructions that can run only in certain "modes".

    I know many CPUs do it, even RISC-V, but I can't find a decent reason to accept this.

    • It's a slippery slope : you'll want to do more and more.
    • It's a waste of instruction coding space, which also reduces the overall coding efficiency.
    • It increases the decoder's complexity.
    • What does an instruction do that the Special Registers can't ?

    Remember : the Special Registers are where all the configuration, and the protection, happen, so there should be only two instructions that enforce "capabilities" : GET and PUT. They, and only they, clear the pipeline, ensure serialisation, and trigger a trap/fault if the address falls in a forbidden range.

    The SR space can grow while keeping the very same management opcodes. You don't need to change the decoder when you add features; it's all in the SR circuits.
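
    A minimal sketch of that single enforcement point (the names and the limit are mine, purely hypothetical): one range check in the GET/PUT path replaces per-opcode privilege bits.

        #include <stdint.h>

        #define SR_USER_LIMIT 0x1000   /* assumed boundary of the unprivileged SR range */

        extern uint32_t sr_read(uint32_t addr);   /* hypothetical SR-space backend */
        extern void trap_privilege_fault(void);   /* hypothetical trap entry */

        uint32_t sr_get(uint32_t addr, int supervisor)
        {
            if (addr >= SR_USER_LIMIT && !supervisor)
                trap_privilege_fault();
            /* (the real opcode also flushes and serialises the pipeline here) */
            return sr_read(addr);
        }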

    KISS, guys.

  • Choosing boolean instructions

    Yann Guidon / YGDES • 12/17/2020 at 02:56 • 0 comments

    My choice follows the classic Intel SIMD instructions, when 2 bits are available in the opcode (as in the #YGREC8) :

    • AND
    • OR
    • XOR
    • ANDN

    The last one combines with AND and OR to create a MUX2 function, for example, as shown in the sketch below.
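
    Here is that MUX2, bit by bit: select a where sel is 1 and b where sel is 0; ANDN saves the separate NOT.

        #include <stdint.h>

        static inline uint32_t mux2(uint32_t sel, uint32_t a, uint32_t b)
        {
            return (a & sel) | (b & ~sel);   /* AND, ANDN, OR : 3 opcodes, no extra NOT */
        }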

    When 3 bits are available, as in the #YASEP, I add the remaining ROP2 functions:

    • NAND
    • NOR
    • ORN
    • XORN

    There are 16 combinations for 2-input gates; 8 of them are "degenerate" (the output is 0 or 1, or A or B or /A or /B, or a swap of the inputs for ANDN/ORN).

    With the full 8 functions, any significant boolean calculation becomes possible and efficient. This is particularly useful for "lateral computations" such as #ANGOLA - A New Game Of Life Algorithm, or even some particular implementations of crypto algorithms, where sequential LUT lookups are replaced by parallel boolean computations.
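
    For a taste of these "lateral computations", here is a bitsliced full adder: with plain boolean opcodes, one call performs 32 independent 1-bit additions in parallel.

        #include <stdint.h>

        void full_adder_32x(uint32_t a, uint32_t b, uint32_t cin,
                            uint32_t *sum, uint32_t *cout)
        {
            uint32_t t = a ^ b;
            *sum  = t ^ cin;              /* 32 sum bits at once   */
            *cout = (a & b) | (t & cin);  /* 32 carry bits at once */
        }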

  • NAND or NOR ? A taxonomy of topologies

    Yann Guidon / YGDES • 12/16/2020 at 21:47 • 10 comments

    When you start to design a computer, you usually start with rule 1. Use binary, 2s complement numbers (unless you're @SHAOS or @alice crush) but it's only the beginning. You have to choose a technology, which is a complex entangled web of compromises and constraints.

    • cost
    • speed
    • size
    • space
    • availability
    • power consumption
    • interfacing with other stuff
    • what's new/fun/enticing/trending

    Of course,

    • if you use vacuum tubes, integrated circuits or FPGA, the rest is mostly moot due to their inherent properties.
    • I also assume you're not going anywhere near #Diode Resistor Logic either, though you might find inspiration here.
    • Relays have a great range of "topological expressiveness" that are worth a whole article.

    In fact, from here on, I'll assume you use discrete (MOS)FETs or BJTs, and things go downhill from there.

    Many Hackaday projects, and in particular the #Hackaday TTLers, love to explore and play with various ways to interconnect discrete transistors. It's just a notch below the thrill of designing an integrated circuit but way above that in the BMOW wow-factor. You can touch it, probe it, fix and patch it, show it off and impress...

    Creating a logic gate is an art in itself (ask Harvey about it), which draws from both boolean arithmetic and small-signal analog design, and we can dig in the literature, in the past or in the present, to find inspiration. Where each "logic family" differs is what they can express and how they perform each function. The range of possible functions is a sort of expressiveness, and this also deeply influences your higher-level design, as well as design methodology.

    This page is important because it gives you a taste of how miserable you will feel once you get a few working gates. For example, if too few inputs are possible, the final circuit will be slower because of logic fan-in restrictions: for logic reduction, a 2-tree uses more stages than a 3-tree, and a 4-tree uses even fewer, but each gate might be slower, so you must find a balance somewhere. One typical example is how you design a XOR gate: it might cripple your critical datapath and speed if your logic family is too limited.

    The basic gates provide NOR and/or NAND, which I call "first order gates" because there is one level of inversion. Along with the MUX2, these are the "most basic gates" from which you can build all other circuits, but MUX2 is cheating because it contains both inversions, OR and AND. Building a computer with them is possible, but still a challenge ! You have to turn everything into a sequence of NOR or NANDs. This is why "expressiveness" is so important for architects and circuit designers : some families offer a wider range of gates that make life easier, critical datapaths shorter and circuits simpler.
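
    To see why this expressiveness matters, here is the classic XOR built from NAND alone (sketched in C, but it maps one-to-one to gates): four gates and three levels of logic for a single XOR.

        static inline unsigned nand(unsigned a, unsigned b) { return ~(a & b); }

        unsigned xor_from_nand(unsigned a, unsigned b)
        {
            unsigned t = nand(a, b);
            return nand(nand(a, t), nand(b, t));  /* 4 NANDs, 3 gate delays */
        }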

    Another choice or constraint is your ability or willingness to use complementary transistors. If all you have, or can get, is a single type, then you are more limited in the possible topologies and you are forced to perform more boolean trickery with the few possible gates. OTOH if you can source both P and N types, you can exploit both NOR and NAND, or combine them in creative ways that could reduce both complexity and power.

    But mostly you end up with NANDs and/or NORs... you have to choose carefully !

    • Chances are you'll want to try your hand with a proven and reputed family : TTL, the one and true Transistor-Transistor Logic. One transistor per input and one for the output sounds reasonable, and the basic gate is the NAND, such as the 74x00.
      You'll notice the weird multiple-emitter transistor, which briefly existed in discrete form half a century ago. You can substitute a pair of classical transistors (tripoles) for each of them, which is mostly irrelevant in a digital design (unless your parts are really badly binned).
      You can't go wrong with such a classic, but soon you'll be trying to get more performance with Baker clamps or speedup capacitors. Oh, and you can't find discrete transistors with multiple emitters. In the end, the...
    Read more »

  • The 4 spaces

    Yann Guidon / YGDES • 03/30/2020 at 21:32 • 0 comments

    I always keep 4 independent address spaces :

    1. Program memory (read only, high bandwidth)
    2. Data memory (read and write, as well as scrambled access orders)
    3. I/O and Special registers (serialising, one access at a time)
    4. Registers (can sustain multiple simultaneous reads & writes)

    Each is optimised for a specific purpose and things get nasty when they are mixed.

    • mix data and program memory, and you get a lot of security troubles.
    • mix data and special memory-mapped registers, and you bog down your system, which then must recognise I/O regions and addresses, as well as ordering and write-through/writeback/grouped/caching attributes
    • mix data RAM and registers and it becomes ridiculously slow (TI tried this last one with the TMS9900 and you can see what happened)
    • ... you get the idea.

  • Learn and know when to stop

    Yann Guidon / YGDES • 05/30/2019 at 14:46 • 0 comments

    Elegance is one of the most striking qualities of the classic Cray computers.

    It's not just the raw performance that counts if you can't make it right and useful.

    Seymour Cray was there at the dawn of the "Computer Big Bang" and pioneered many groundbreaking technologies but also used conservative approaches where it mattered.

    Today, something else counts more than before : he envisioned things and thought them through, thoroughly, in his big obsessed brain. Today's computers are so complex that it's impossible to hold the complete picture in one's head. But the fact that Cray's legendary machines could fit in one's head means that they are easier to use, and their raw power is more accessible and useful.

    Cray machines are notoriously not microcoded. Then came the big RISC race of the late 80s and early 90s, but that philosophy has been lost, and today it is almost impossible to reach the theoretical peak MIPS/MFLOPS.

    K.I.S.S. : Keep It Simple and Smart ;-)

  • The perfect ISA doesn't exist.

    Yann Guidon / YGDES • 04/08/2019 at 04:17 • 2 comments

    As noted in the previous log, an ISA is always a compromise and can only be judged for a given task or set of tasks. But since everybody has their own "requirements" and they change all the time, in the end, there is no way to get an absolute judgement. Synthetic benchmarks (SPEC etc.) help but they evaluate a complete system in a specific configuration, not the ISA.

    So when you must design an ISA from scratch, you're left with habits, conventions, heuristics, cool ideas you picked somewhere, sprinkled with bikeshedding and countless "what if I tried this or that ?" but you have to set the bar somewhere.

    If you want to do too many things, the architecture becomes good at nothing, or the price to pay (in complexity, cost, consumption...) will be very high (x86...).

    If you want to do one thing very well, you need a specialised processor, such as a DSP, GPU or even a special coprocessor (such as a compression or crypto accelerator).

    In between, you wish for a reasonable, simple and flexible ISA, but you'll always find cases where you miss a "magic instruction" that would boost your performance. For this discussion, I'll refer you to the arguments for RISC vs CISC in the famous Patterson & Hennessy books, where they crunch the numbers and prove that any added feature must considerably speed up the whole system to be worth it. Which is seldom the case.


    My personal approach is to let the HW do what it does well and let SW access it as easily as possible, which implies a lot of orthogonality. I start from what logic does easily (move bits around, change them), design the few vital units (boolean unit, add/sub, shuffler), then connect them efficiently to the register set. Extra instructions come from available features or options that require very little effort to implement.

    For example : the boolean unit shares its operands with the adder, which requires one input to be inverted. This is an opportunity to get the ANDN opcode for free, where X = A AND NOT(B). It just occupies one more slot in the opcode map and a few gates of decoding logic, with no modification to the main datapath. Clock speed and silicon area are not affected, but software that processes bits and bit vectors (blitting, masking, ...) can save some time in a critical loop.
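
    A behavioural sketch of that datapath (my own model, not an actual netlist): the operand inverter that SUB already needs also feeds the boolean unit, so a single extra decode bit yields ANDN, ORN and XORN.

        #include <stdint.h>

        uint32_t alu(uint32_t a, uint32_t b, int op, int invert_b)
        {
            uint32_t bi = invert_b ? ~b : b;     /* the shared operand inverter */
            switch (op) {
            case 0:  return a + bi + invert_b;   /* ADD, or SUB (inverted B, +1 as carry-in) */
            case 1:  return a & bi;              /* AND, or ANDN */
            case 2:  return a | bi;              /* OR,  or ORN  */
            default: return a ^ bi;              /* XOR, or XORN */
            }
        }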



Discussions

Yann Guidon / YGDES wrote 06/04/2022 at 20:11

https://player.vimeo.com/video/450406346 : a nice little talk about the Larrabee, KNI and AVX512 saga



Yann Guidon / YGDES wrote 06/26/2018 at 20:08

More discussions about major architecture features : https://danluu.com/new-cpu-features/

