Init: By whygee on Saturday 20 November 2010, 09:24 - Architecture
(post version : 20110108)
(update : 20110515 : environment inheritance)
Recently (2010/11/20) I found the critical element that solves a crucial problem that the Hurd team submitted to me in ... 2002. It took time and many attempts, but I think that the YASEP is a great place to experiment with this idea and prove its worth.
The Hurd uses a lot of processes to separate functions, enforce security and modularize the operating system. It uses "Inter Process Communication" (IPC) such as message passing and this is snail slow on x86 and most other architectures [thank you microcode].
The YASEP uses hardware threads, a concept that is close, but not identical, to the processes of an operating system. And these last days I have found what was missing : the "execution context" ! So with the YASEP, a process is a hardware thread (a set of registers and special registers) associated with an execution context (the memory mapping, the access rights etc.)
Repeat after me : a process is a thread in a context.
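As a rough illustration of that definition, here is a minimal Python sketch. The class names, the register count and the contents of the context are all assumptions made for the example, not the YASEP's actual layout.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareThread:
    """A set of registers and special registers (the count is illustrative)."""
    tid: int
    registers: list = field(default_factory=lambda: [0] * 5)


@dataclass
class ExecutionContext:
    """Memory mapping and access rights, reduced here to a placeholder dict."""
    cid: int
    access_rights: dict = field(default_factory=dict)


@dataclass
class Process:
    """A process is a thread in a context."""
    thread: HardwareThread
    context: ExecutionContext

    def enter(self, new_context):
        # The thread and its registers are kept; only the context changes.
        return Process(self.thread, new_context)
```

The point of the sketch is that entering another context produces a new (thread, context) pair without touching the thread's registers.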
This distinction is necessary because threads are activated for handling interrupts, operating system functions, library function calls and communication between the programs. It's a major feature of the processor which should provide functionalities that go beyond a mere microcontroller...
So IPC is necessary to make a decent OS and it requires several hardware threads (threads can be interleaved at the hardware level to provide concurrency and better performance) and several contexts (for the operating system, device drivers, libraries, interrupt handlers...). The processor state can jump at will from one to the other with much less latency than on a usual CPU.
The antagonistic requirements are as follows :
- A process must be able to call code from another context FAST, as fast as possible.
- The mechanism must be totally SAFE and SECURE.
- The physical implementation must be SIMPLE.
Simple and fast go hand in hand (ask Seymour Cray. Oh, wait, too late...). In the YASEP, communication takes place with a restricted variant of the function call instruction. Function calls are difficult to "harden", so other architectures usually provide dedicated instructions for IPC or system calls. These are quite simple to implement in a CISC architecture like x86 because microcode can do whatever is required... But they are slow because several dependent memory fetches must be performed (read the access rights table, then find the address of the code to execute, whatever...)
The YASEP is a RISC-inspired architecture and requires a new approach. What I have found requires just 3 new opcodes :
- IPC : InterProcess Call
- IPE : InterProcess Call Entry
- IPR : InterProcess Call Return
Since the YASEP has a bank of several threads in the register set, the context switch is a matter of a few cycles only. One way to further reduce the execution time is to pre-calculate the destination address of the called code : no call table or things that require several chained/dependent memory accesses. In order to obtain the jump address, a thread must register itself in the called process and obtain the context number and the effective address. The calling thread can then modify its own code (update the constants) or variables to make the proper IPC later. Here is how simple it gets :
IPC R1, R2 ; call context number R2 at address R1
IPC 1234h, R2 ; call context number R2 at immediate address 1234h
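The registration step can be sketched in Python as follows. This is only a model of the idea (resolve the target once, then call with precomputed operands). The names Callee, Caller, register, bind and ipc_operands are hypothetical and do not come from the YASEP.

```python
class Callee:
    """Hypothetical callee that hands out its context number and entry address."""

    def __init__(self, context_number, entry_points):
        self.context_number = context_number
        self.entry_points = entry_points   # service name -> effective address
        self.registered = set()            # thread IDs that asked for access

    def register(self, caller_tid, service):
        # Done once, ahead of time: the slow lookup happens here, not per call.
        self.registered.add(caller_tid)
        return self.context_number, self.entry_points[service]


class Caller:
    def __init__(self, tid):
        self.tid = tid
        self.targets = {}                  # service name -> (context, address)

    def bind(self, callee, service):
        # Stands in for "modify its own code (update the constants) or variables".
        self.targets[service] = callee.register(self.tid, service)

    def ipc_operands(self, service):
        # Operands for "IPC address, context": no table walk at call time.
        context, address = self.targets[service]
        return address, context
```

After one bind, every later call reuses the stored pair, which is what removes the chained memory accesses from the call path.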
Security is a bigger beast and just changing the TID (Thread ID) value is not a good method. The first big problem is that any code can call any context at ANY address, so a security mechanism is required to block unwanted calls from succeeding. The policies could be arbitrarily complex (depending on the OS strategies) and don't belong in hardware (unlike x86) : a software-based authorisation system is preferred (like MIPS !). This is the role of the IPE instruction :
- IPE provides the Thread ID and Process ID of the calling thread (it's a kind of GET). From this, the callee can choose to accept or refuse the call, provide a specific service or even choose to not check at all. Any software can create its own policy, call by call !
- IPE is NECESSARY for the IPC instruction to complete. If IPC points to an instruction that is NOT IPE, an error is triggered. This prevents all applications from jumping anywhere in any code.
- Each thread can restrict the range of callable addresses so calls can't enter data sections. This is the role of additional registers.
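The three rules above can be modelled with a small Python simulation. It is a sketch under assumptions: the range registers are treated as inclusive bounds, code memory is a dict from address to mnemonic, and the names IPCFault, Context and ipc are invented for the example.

```python
class IPCFault(Exception):
    """Raised when the simulated hardware rejects an interprocess call."""


class Context:
    def __init__(self, cid, code, entry_low, entry_high, policy=None):
        self.cid = cid
        self.code = code                # address -> instruction mnemonic
        self.entry_low = entry_low      # the two range registers (assumed inclusive)
        self.entry_high = entry_high
        # Software policy run after IPE; by default the callee does not check.
        self.policy = policy or (lambda tid, pid: True)


def ipc(contexts, caller_tid, caller_pid, target_cid, address):
    target = contexts[target_cid]
    # Hardware check 1: the address must be inside the callable range.
    if not (target.entry_low <= address <= target.entry_high):
        raise IPCFault("address outside the callable range")
    # Hardware check 2: the target instruction must be IPE.
    if target.code.get(address) != "IPE":
        raise IPCFault("target instruction is not IPE")
    # IPE hands the caller's IDs to the callee, which applies its own policy.
    if not target.policy(caller_tid, caller_pid):
        return "refused"
    return "accepted"
```

Here the hardware only enforces the two cheap checks, while accepting or refusing a given caller stays a software decision made call by call.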
When the thread calls code from another context at the right address, the register set is preserved (not touched) so the transmission of parameters takes no effort. However several new issues appear.
For example, how can one thread in a different context access data from the previous context ? The proposed solution is to provide an attribute to each Address register : the context number. Upon call, the newly spawned process will modify the necessary attributes to access both the current and the calling process. This means that all the previous contexts must be kept in the processor (since interthread calls must be reentrant). Before the call, the calling process should mark the memory ranges it accepts to share with the called process (marking the range as "shareable"). This way, no data copy is necessary !
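A minimal Python sketch of this access rule follows, assuming an address register carries a (context number, address) pair and that shareable ranges are half-open intervals. SharedMemoryModel and its methods are hypothetical names.

```python
class SharedMemoryModel:
    """Tracks which ranges each context has marked as shareable."""

    def __init__(self):
        self.shareable = {}   # owner context -> list of (start, end) ranges

    def mark_shareable(self, owner_context, start, end):
        # Done by the caller before the call; end is exclusive (an assumption).
        self.shareable.setdefault(owner_context, []).append((start, end))

    def may_access(self, running_context, register_context, address):
        # An address register tagged with the running context is always usable.
        if register_context == running_context:
            return True
        # Otherwise the owner must have marked the range as shareable.
        ranges = self.shareable.get(register_context, [])
        return any(start <= address < end for start, end in ranges)
```

The callee reads the caller's buffer in place through a register tagged with the caller's context number, so nothing is copied.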
The return address and thread/process/context IDs must be managed by the CPU core itself to prevent tampering by the caller or callee. This is the last point that needs some big work and HW real estate... A classic stack, with a stack pointer, stack base and stack limit, is a necessary hardware resource to add.
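Such a protected stack could look like the following Python sketch. It assumes the stack grows upward from base to limit, that overflow is detected when pointer equals limit and underflow when pointer equals base, and that a frame holds (return address, thread ID, context ID). The class name IPCStack is invented for the example.

```python
class IPCStack:
    """Call stack owned by the core, out of reach of caller and callee."""

    def __init__(self, base, limit):
        self.base = base
        self.limit = limit
        self.pointer = base
        self.frames = {}          # slot index -> (return address, tid, cid)

    def push(self, return_address, tid, cid):
        # On call: pointer and limit are compared for equality.
        if self.pointer == self.limit:
            raise OverflowError("IPC stack full")
        self.frames[self.pointer] = (return_address, tid, cid)
        self.pointer += 1

    def pop(self):
        # On return: pointer and base are compared for equality.
        if self.pointer == self.base:
            raise IndexError("IPC return without a matching call")
        self.pointer -= 1
        return self.frames.pop(self.pointer)
```

Because only equality comparisons are needed, the checks stay cheap in hardware, which fits the "simple" requirement.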
So let's sum up the added hardware :
- Each context must be able to mark memory ranges as data-read and/or data-write by other threads. This can be indicated by flags for each page in the page table. How this can be restricted to certain threads (those that are in the call stack) is still uncertain : a token scheme should be created where a permission can be passed to (and inherited from) another thread.
- Each context has 2 registers that are compared to the called address to restrict unwanted calls.
- Each process has a set of 3 registers for the IPC stack (pointer, base, limit). Pointer and limit are compared for equality upon call and pointer and base are compared during return.
- There are also 5 new thread-private registers that determine the owner (thread number) of a pointer. They must be preserved in HW if the caller or callee are not trusting or trusted.
That makes about 10 new registers ! How this will be implemented is still uncertain. Maybe a hardcoded sequence of instructions will be streamed through the instruction decoder, unless everything is done in parallel in big enough chips. This reminds me that in the past, I wanted to add "attributes" to the address generators of the VSP, with base/index/limit/stride. Now there is the context number, which is some kind of "address space number" (ASN). We can finally merge these ideas and, in 16-bit code, we can use ASNs like segments in x86 : one for executable data, one for the stack, several for data, and no opcode prefix is needed.
Whatever the implementation, we're going here from a system initially designed for libraries and system calls, extended to the next level : a micro-kernel oriented architecture where processes can share memory they own so others can work on it, with little overhead. Will the Hurd people be finally happy now ?
The IPC instructions are a major enhancement of the YASEP. To this day, it's a very significant architectural feature that is still being explored. This work should continue in #NPM - New Programming Model !
The rest is ... more complicated and still debated. In particular, sharing data is a big headache but I am refining a new/better approach that does not involve "ASN" shadow registers, because it's just too heavy and not practical.
For communication between threads without forcing copies, there needs to be an instruction that "yields" a pointer (and the block size) to a given different thread.
ALL data pointers are protected by a TLB and each entry is associated to a thread ID so the YIELD instruction would simply mark the pointer as belonging to someone else.
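The ownership transfer can be sketched in Python like this. It is a model under assumptions: one TLB entry per block, a single owner thread per entry, and the names TLBEntry, TLB, can_access and yield_block are made up for the illustration.

```python
class TLBEntry:
    def __init__(self, base, size, owner_tid):
        self.base = base
        self.size = size
        self.owner_tid = owner_tid


class TLB:
    def __init__(self):
        self.entries = []

    def map(self, base, size, owner_tid):
        entry = TLBEntry(base, size, owner_tid)
        self.entries.append(entry)
        return entry

    def lookup(self, address):
        for entry in self.entries:
            if entry.base <= address < entry.base + entry.size:
                return entry
        return None

    def can_access(self, tid, address):
        entry = self.lookup(address)
        return entry is not None and entry.owner_tid == tid

    def yield_block(self, current_tid, pointer, new_tid):
        # Only the current owner may give the block away; no data is copied.
        entry = self.lookup(pointer)
        if entry is None or entry.owner_tid != current_tid:
            raise PermissionError("thread does not own this block")
        entry.owner_tid = new_tid
        return entry.base, entry.size
```

In this model the yielding thread loses access at the moment the owner field changes, so the block never has two writers.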
The addressable data space of each context could be split into a "general purpose range" (for the stack, heap and stuff) and an "intercom range" with blocks shared with other contexts. For example, the intercom range is indicated by the MSB of the pointer set to 1.
ALL processes could allocate some memory in the intercom range for communication purposes with a call to a process ("com alloc" ?), they request a given size and the alloc process returns a pointer that gets yielded (the owner is changed from the alloc process to the calling process). The TLB entry type is selected depending on the block size (the block could be split into sub-blocks to reduce fragmentation and/or unused space).
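Here is a hedged Python sketch of such a "com alloc" service, assuming 16-bit pointers, a simple bump allocator and no sub-block splitting. ComAlloc, is_intercom and the constants are hypothetical names chosen for the example.

```python
POINTER_BITS = 16                          # assumed pointer width
INTERCOM_FLAG = 1 << (POINTER_BITS - 1)    # MSB set marks the intercom range


def is_intercom(pointer):
    return bool(pointer & INTERCOM_FLAG)


class ComAlloc:
    """Toy allocator process that hands out intercom blocks."""

    def __init__(self, alloc_tid):
        self.alloc_tid = alloc_tid
        self.next_free = INTERCOM_FLAG     # first address of the intercom range
        self.owners = {}                   # block pointer -> (size, owner tid)

    def alloc(self, caller_tid, size):
        if self.next_free + size > (1 << POINTER_BITS):
            raise MemoryError("intercom range exhausted")
        pointer = self.next_free
        self.next_free += size
        # The block first belongs to the alloc process...
        self.owners[pointer] = (size, self.alloc_tid)
        # ...and is then yielded to the calling process.
        self.owners[pointer] = (size, caller_tid)
        return pointer
```

With 16-bit pointers this leaves only 32 KiB for all shared blocks, which illustrates why the scheme looks tight on the smaller versions.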
A LOT of work would occur in this space so it would probably be insufficient for the 32-bit version (for a desktop) and clearly inadequate for the 16-bit version, but ideal for F-CPU in 64 bits...
This is something to discuss in #NPM - New Programming Model!