-
Clocking
02/22/2018 at 03:47
How should your core be clocked?
- DFFs are practical and very clean but take some silicon space.
- Transparent latches use half the transistors but twice the clock-routing resources, because you now need two clock networks that MUST NOT have jitter or phase noise (ask Seymour Cray about this, from designing the Cray-2 and the Cray-3).
So yeah, it depends. My approach is to design with classic DFFs and a 4-stage pipeline, to allow an easy transformation to 4-phase clocking, which has some advantages when you can control your technology very tightly.
-
Microcode is evil.
02/21/2018 at 08:26
Forget about microcode.
Microcode is CISC.
Microcode is like a computer inside a computer: it increases the complexity of the whole system, slows everything down, makes testing miserable, and commits many other "sins" that RISC addresses.
A direct mapping of the instruction word to the datapath is the best way to have a simple and efficient ISA.
-
Get your operands as fast as possible.
02/18/2018 at 01:50
Looking back at the 6809's manual, I am now appalled by the elaborate addressing modes. No wonder CISC almost died!
In my designs, there is another rule: provide the operands as fast as possible to the execution units. Keep the path as short as possible, and uninterrupted, from the instruction decoder to the ALU. This means that only two types of data are encoded in the instructions:
- Literal data, sometimes shifted and/or sign-extended. The latency is a few gates and some fanout.
- Register values: just read the register set (a few gates of latency, a bit more than literal data but not much).
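To illustrate why the literal path is so short, here is a minimal C sketch of decoding an immediate field from an instruction word. The 16-bit word layout (imm8 in the low byte) is purely illustrative, not any actual encoding of mine: the point is that sign-extension is just wiring (replicate the field's MSB), so it costs almost no gate delay.

```c
#include <stdint.h>

/* Hypothetical 16-bit instruction word with imm8 in bits [7:0].
   In hardware, sign-extension is just the MSB of the field fanned
   out to the upper bits; in C it is a pair of casts. */
static inline int16_t sext8(uint16_t insn) {
    return (int16_t)(int8_t)(insn & 0xFF);
}

/* An optional shift of the literal (to build larger constants)
   is also just wiring, so it adds almost nothing to the latency. */
static inline int16_t sext8_shifted(uint16_t insn, unsigned shift) {
    return (int16_t)(sext8(insn) << shift);
}
```

A decoder built this way delivers the operand to the ALU within a few gate delays of the instruction word arriving, which is the whole point of the rule above.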
Two other sources of data are flags/condition codes, and In/Out port values but they are treated separately.
Anyway, when done right, your units get data to process one cycle after you get your instruction word.
Don't waste time with memory, it's a mess. Careful coding with a register-mapped memory system should hide some of the latency. Indexed or indirect addressing modes, or other crazy systems, slow everything down. Traps become a trainwreck. Main memory is the enemy. KISS!
Of course, reading a data memory register might stall. There is no absolute and perfect way around slow memory. But keep as much data as possible in your registers so they can be addressed almost immediately. The decoder can speculatively decode the register numbers and read corresponding words of data before you have checked a pointer's validity.
-
Sources of conditions
02/18/2018 at 00:29
Conditional execution is a cornerstone of computing. I've said it already in log 21, Conditional execution and instructions.
There is more to it than how the instructions work, though. Where does the data come from?
In CISC processors, the decision is often made from the flags in a status register. They hold values determined by the result of the preceding instruction(s):
- Carry
- Sign
- Zero
- Overflow
- Parity
- Decimal overflow, for BCD-capable processors
- and whatever some guy thought would be cool
Those status flags got a lot of bad press when CISC processors got more elaborate, complicated, pipelined, superscalar... POWER created a pretty elaborate system to overcome this.
MIPS OTOH decided to go the other way and avoid status flags. Just like the GOTO statements were considered harmful 15 years earlier, the RISC canon (borrowing significantly from Cray's designs) simply dropped them altogether. Sure, it simplified many things and speed could increase quite a lot. But the critical datapath from evaluation to decision would greatly increase too and the speed gain would plateau around Y2K or so.
A new middle ground must be found, and some ideas like the "Mill" pop up, but mostly the situation stagnates. And status flags are sufficient for middle-range, low-complexity, moderately slow processors that don't need to scale up.
The #YGREC8 has the minimal 3 status flags (C/S/Z) and the 4th condition is "True", to make a nice, round 2-bit condition field. A 3rd bit negates the condition, and since there was one free bit remaining, I added a further feature: select 4 arbitrary bits as condition sources, in a way that reminds of the RCA1802 (and Analog Devices DSPs), because this is very convenient for an embedded/microcontroller application (you need to make instant decisions from reading the pins). You could reconfigure the source of the "flags" for whatever purpose, like individual bits of a register or peripherals (like FIFO status).
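Such a condition field is cheap to decode. Here is a small C model of a 3-bit field (2 selector bits plus a negate bit); the bit assignment (00 = Always, 01 = Zero, 10 = Carry, 11 = Sign) is invented for this sketch and is not the actual YGREC8 encoding:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative condition decoder: cond[1:0] selects the source
   bit, cond[2] optionally negates it. In hardware this is one
   4:1 MUX followed by one XOR gate. */
static bool eval_cond(uint8_t cond, bool c, bool s, bool z) {
    bool sel;
    switch (cond & 3) {
        case 0:  sel = true; break;  /* Always */
        case 1:  sel = z;    break;  /* Zero   */
        case 2:  sel = c;    break;  /* Carry  */
        default: sel = s;    break;  /* Sign   */
    }
    return (cond & 4) ? !sel : sel;  /* negate bit */
}
```

One MUX and one XOR: the decision adds almost nothing to the critical path, which is why this scheme suits a small core.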
The #YASEP Yet Another Small Embedded Processor dedicates more bits to the conditions: 7 bits, including a register number. With that, you can test the Zero, Sign or LSB bit of a given register. It's pretty RISC because it just reuses the MUX of a pipeline. But there is more :-) The Always and Carry conditions can be read in the special case of PC. And you can select one bit out of 16, which can be mapped to a register, peripherals or GPIO.
In the end, conditions are quite simple: you have to reduce something down to a bit, then change execution depending on the bit's value. The CPSZ set is quite easy to get:
- C: Carry is a separate flag. Nice and simple, though this outlier can be quite annoying. And the C language doesn't let you use this flag directly.
- P: Parity. Unavailable in the #YGREC8. The way I do it is to take the LSB of the designated source register, which you can access in C with &1. The x86 guys thought it would be awesome to compute byte parity (the parity of the number of set bits in a byte) but I have never seen it used anywhere, even in communications projects. The C language doesn't let you access this "feature" either, so it dies alone.
- S: Sign. Again, pretty easy: just take the MSB of the designated register (or of the last result). Very useful to check for wrap-around in counters, for example, and easily coded as >=0 or <0.
- Z: Zero. This is easy: a big OR. But implementations have to take care of the delay:
- in the #YGREC8, the result bus is checked for 0.
- in the #YASEP Yet Another Small Embedded Processor, the whole register is read because you might have loaded a new word from memory.
- the F-CPU uses a "shadow" flag that is recomputed every time the register is written.
Other flags can have inherent latency because they become complex, but if you have these 4 flags, you can do anything. It's just a matter of moving data around.
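The CPSZ recipe above can be modeled in a few lines of C. This is a software sketch of the hardware for an 8-bit add (the datapath width is chosen to match the YGREC8's 8-bit registers; the function name is mine): C is the carry out of bit 7, S is the MSB of the result, Z is the wide NOR, and P here is simply the LSB of the result, the "designated bit" approach described above rather than the x86 popcount parity.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool c, p, s, z; } flags_t;

/* Flags after an 8-bit addition, modeled in software. */
static flags_t flags_after_add(uint8_t a, uint8_t b) {
    uint16_t wide = (uint16_t)a + b;   /* 9-bit result, keeps the carry */
    uint8_t  r    = (uint8_t)wide;
    flags_t f;
    f.c = (wide >> 8) & 1;   /* carry out of bit 7        */
    f.s = (r >> 7) & 1;      /* MSB of the result         */
    f.z = (r == 0);          /* the "big OR", inverted    */
    f.p = r & 1;             /* LSB, i.e. the &1 test     */
    return f;
}
```

Note how C and Z need information about the whole result (the carry chain, the OR tree) while S and P are single wires, which matches the latency remarks above.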
-
Conditional execution and instructions
02/17/2018 at 23:28
Conditional execution is a cornerstone of even basic computing. This is required to even get near Turing-completeness.
There are two main ways to perform an operation when a condition is met.
- Conditional jump was the favorite way in the 70s: Motorola and Intel had the classic "branch" instructions, for example the "Jxx" of the 8086, which jumps within a -128/+127 byte range
- Later, "predicated" instructions had more exposure (notably in the ARM world)
Actually there is a whole spectrum of solutions and approaches to this subject. This small angle could help characterize a processor, besides register width, register count, speed, etc.
This is also because there are quite a few ways to get it done, and I'm not sure I could list them all. Many systems have been tried: look at the PIC16F family's skip instructions, or the RCA1802... I have very little familiarity with the POWER architecture, for example, which is a very elaborate system, and I'll need to review it soon. All help is welcome, please submit your favorite weirdo :-)
For my part, starting with the #YASEP Yet Another Small Embedded Processor and later expanded in the #YGREC8/#YGREC16 - YG's 16bits Relay Electric Computer , I use a mixed method: some instructions that need long literals are "plain", while those with many register fields have a few bits (4 to 7) to inhibit the instruction's result writeback. This is enough to get things done (particularly when PC is a register like the others) but I am conscious it can only work with microcontrollers.
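The writeback-inhibition method can be summarized in one line of C. This is a minimal model, not the actual YASEP/YGREC8 logic: the ALU always computes, and the condition only decides whether the destination register keeps its old value, which turns a failed instruction into a NOP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Predication by writeback inhibition, modeled for an 8-bit
   register file. If PC is an ordinary register, writing a new
   value to it under a condition gives you a conditional jump
   with the exact same mechanism. */
static uint8_t writeback(uint8_t old_value, uint8_t result, bool cond_ok) {
    return cond_ok ? result : old_value;
}
```

The appeal for a small core is that the condition only gates one enable signal at the very end of the pipeline, instead of steering the fetch stage.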
The whole thing breaks apart for a superpipelined application processor... I am rethinking this subject a lot, pondering what would be the best way to implement the #F-CPU's FC1. For example, what's best for a tight pipeline: "early selection" or "late selection" of the instruction's condition? What should it affect?
-
Memory granularity
02/17/2018 at 17:46
You are now used to computers with byte-granular pointers. Users of CRAY computers, however, don't have this luxury and must deal with 64-bit words. Now, have fun with ASCII/UTF8 strings!
OK, a CRAY is not designed for word processing. And it can handle 8 bytes at a time. But still, it's miserable when you must deal with data that doesn't have the same granularity as your registers and pointer type.
"Canon" RISC machines such as MIPS and Alpha have byte pointer granularity and word-wide registers and memory granularity. It's a compromise. Trap when the pointer is not aligned and... OK it sucks too. But less because the pointers have a reasonable, matching granularity.
And then, there's the SHARC. Analog Devices' ADSP21k family is not the world's best-known 32-bit DSP but it's a really wonderful architecture. I learned so many tricks by reading its manual 20 years ago... One interesting "trick" is to split the addressing range into sections with different word granularities.
The core is 32-bits but depending on where you address data, it will come back as a byte, a half-word or a word. The size is not encoded in the instruction, no "special unit" (like YASEP's LSU) to perform post-load adjustment. It's in the pointer.
This idea is great... to a certain extent. The SHARC has no virtual memory, no problem with aliases, it's not a multi-user application CPU. It's a beast that performs large fancy FFT with carefully crafted code, written by careful people.
If you want such a heterogeneous granularity in an application processor, you'll face quite a few problems because data has to be shifted/masked at one point or another. Either it's implicit with the pointer and you'll have to include shifters in the datapath of the memory load/store unit (which will slow things down), or you have to do it explicitly. In all cases, format changes take space and time. Everything has a price...
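To make the "it's in the pointer" idea concrete, here is a toy C model of address-decoded granularity. It is inspired by, not copied from, the SHARC memory map: the region boundaries and the 2-bit region selector are invented for this sketch.

```c
#include <stdint.h>

/* Toy model: the top 2 address bits select a region, and the
   region alone determines the access width. The load/store
   instruction itself carries no size field at all. */
static unsigned access_width(uint32_t addr) {
    switch (addr >> 30) {
        case 0:  return 1;   /* byte-granular region      */
        case 1:  return 2;   /* half-word region          */
        default: return 4;   /* word-granular region(s)   */
    }
}
```

The decode is free (it's just address wires), but as noted above, the shifting and masking that byte accesses imply still has to happen somewhere in the load/store path.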
In the end, if a heterogeneous space works for a specialised processor, it creates more problems than it solves for a General Purpose Application Processor.
-
Reset values
02/15/2018 at 02:05
What should be initialised by hardware, and how much must be delegated to software?
The internal clock tree is already a significant resource: it dissipates quite a lot of power and uses real estate (including quite a bit of one metal layer). Adding another global signal that will be used infrequently can be a sad waste of resources.
One example is the Alpha 21064 (if I remember correctly) that has no /RESET signal for the register set and many other resources. The firmware had to initialise everything with carefully selected instruction sequences, which are very implementation-dependent.
So basically you just need to reset PC to the right place (the beginning of the bootstrap code, which is address 0) and you save a bunch of /RESET lines. The savings turn into more metal area and better routing for other useful signals, and eventually better speed. There are certainly some other signals that are really critical to the chip, such as IO pin directions and states, but far fewer than in the various high-speed arrays (register set, cache, LRU, TLB, branch predictor history, BTB...)
FPGAs usually force you to initialise your FFs: they have dedicated hardware and routing nets for this purpose. But you usually don't initialise all your SRAM blocks (though some FPGAs can do it), so if you aim for high performance in an ASIC, save the resources used by /RESET to get actual work done. Try to simulate/emulate your design with the flip-flops initialised to undetermined states like "U" or "X" before removing the /RESET lines. Use latches to save half of the surface of flip-flops, when you can. Don't get lazy, FPGAs are only for prototyping ;-)
-
Start execution at address 0
02/15/2018 at 01:53
Several processors start execution at more or less arbitrary addresses. There is actually a large variety of approaches and implementations. This was mostly driven in the early (golden age of) microprocessors by the need to have separate RAM and ROM address ranges. Some had a "fast page mode" that favors the first 256 bytes of memory so it was impossible to put ROM at the lower addresses.
The Intel 8088/8086 boots at FFFF0h (mapped to the BIOS EPROM) and locates the IRQ table at 00000h (in DRAM). But the reset vector is also sometimes one entry in the "interrupt vector table" (see the 6809). Should it be in ROM or RAM? If it's in ROM, you can boot, but you can't modify or relocate the vectors to user code. See the previous log, Interrupt vector table.
Anyway, many current processors boot from a serial EEPROM. Some circuits load the contents of the EEPROM into main RAM or in the cache, so the problem of the overlap of RAM and ROM is now a thing of the past.
Booting at address 0 is easy and future-safe (just look at the x86 and the extension of its address space...)
-
Interrupt vector table
02/15/2018 at 01:39
How should the IRQ be handled?
Early microprocessors had a fixed-size, fixed-address table of addresses that point to user code. It's a compromise that worked rather well back in the day, but today's architectures have evolved. Old problems must be solved anew because the relative latency of main memory has skyrocketed.
Today I advocate for 2 things:
- Store not a pointer, but instructions directly. This is inspired by the ADSP2106x family, which stores 4 instructions per entry in its Interrupt Vector Table. Instead of 4 (which was suitable for fast on-chip SRAM), I'd recommend one or two cache lines' worth of instructions. The idea is to start prefetching the next cache lines of the real interrupt handler while doing critical housekeeping (such as state backup) and early processing, if any.
- Set the start of the vector table in a register that is reset to 0 (pointing to startup ROM), then changed by the operating system to a table built in RAM. This solves the old dilemma of "should the IVT be in ROM or RAM?"
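The two points combine into a very simple entry-point computation. This C sketch uses an invented 32-byte line size and invented names; the real line size would depend on the cache design:

```c
#include <stdint.h>

/* Each vector holds one cache line (32 bytes here, illustrative)
   of directly executable instructions, and the table base lives
   in a register reset to 0 so the boot ROM vectors work from
   power-up. */
#define IVT_LINE_BYTES 32u

static uint32_t ivt_entry(uint32_t ivt_base, unsigned irq) {
    /* The CPU jumps here directly: no pointer load from memory,
       so the first handler instructions issue immediately while
       the rest of the handler is being prefetched. */
    return ivt_base + irq * IVT_LINE_BYTES;
}
```

Compared to the classic pointer table, this removes one dependent memory read from the interrupt latency path, which is exactly the read that slow main memory makes painful.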
-
Program re-entrancy
02/13/2018 at 00:27
Another constraint in the design of the 6809:
That quality of a program which allows a subroutine to be shared by several tasks concurrently, without destroying the return addresses by nesting routines.
So yes, a proper stack (Hello, COSMAC/1802 :-) )
But also: stack frame, local variables on the stack...
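A tiny C example shows what the stack buys. The function names and the toy update formula are mine; the point is the contrast: the first version keeps its state in fixed storage (like the fixed return-address slots of stack-less CPUs), so two concurrent callers would trample each other, while the second keeps all state in caller-owned storage, which is what stack frames and locals make cheap.

```c
/* NOT reentrant: shared fixed storage, like a subroutine that
   saves its return address at a fixed memory location. */
static int seed = 0;

static int next_bad(void) {
    seed = seed * 31 + 7;
    return seed;
}

/* Reentrant: all state lives in caller-owned storage (here a
   pointer argument; on a real stack machine, the frame), so any
   number of tasks can run it concurrently. */
static int next_good(int *state) {
    *state = *state * 31 + 7;
    return *state;
}
```

Each task owning its own `state` (its own stack frame) is precisely the "quality of a program" the 6809 quote describes.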