Difference between revisions of "Protection"
Line 76: | Line 76: | ||
== Rationale == | == Rationale == | ||
− | On conventional architectures context switches are incredibly expensive. They can run into hundreds of cycles just to change the processor core state. And on top of that comes the cache threashing and all the memory accesses to switch the working data sets. Context switches used to be used to hide memory latencies on single core machines. But over time, with improved caching and reduced memory latencies, exploding internal processor state and increasing core counts context switches increasingly become the main cause of memory stalls on conventional architectures. | + | On conventional architectures context switches are incredibly expensive. They can run into hundreds of cycles just to change the processor core state. And on top of that comes the cache threashing and all the memory accesses to switch the working data sets. Context switches used to be used to hide memory latencies on single core machines. But over time, with improved caching and reduced memory latencies, exploding internal processor state and increasing core counts, context switches increasingly become the main cause of memory stalls on conventional architectures. |
− | As a result operating system architectures | + | As a result operating system architectures more and more revolve around avoiding context switches, even in places where they would actually be needed from a security and stability standpoint. This cost is the main reason why microkernels are impractical on mainstream processors. As a result it is still common for buggy device drivers to take down whole systems. And where the security features are absulutely unavoidable systems often spend a third and more of their cycles on context switches and related management like <abbr title="Translation Lookaside Buffer">TLB</abbr> and cache shuffling.<ref>[https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6 Linus Torvalds on the cost of page fault interrupts]</ref> |
To avoid all this, the goal should be to just pass the needed parameters back and forth between the protection domains with mutually protected code, but don't switch out all the surrounding context with it each time. This is of course an ideal and not fully attainable. But the closer you get to this ideal, the cheaper and cleaner the protection primitives become. And this is what drove the design. | To avoid all this, the goal should be to just pass the needed parameters back and forth between the protection domains with mutually protected code, but don't switch out all the surrounding context with it each time. This is of course an ideal and not fully attainable. But the closer you get to this ideal, the cheaper and cleaner the protection primitives become. And this is what drove the design. |
Revision as of 06:00, 2 August 2014
All security features on the Mill revolve around permissions on address space. There are no privileged instructions or protection rings. All processes are equal in the sense that they can share the permission they have, but only those they have, with whoever they choose, to the degree they choose.
Using the Mill security primitives in practice is not any different from calling normal a normal function in most cases, and also not much more expensive.
Contents
Protection Lookaside Buffer
In Architecture and Memory overview charts you can see that right above the highest cache lines there are the address protection units and the attached <abbre title="Protection Lookaside Buffer">PLB</abbr>s. There are two separate and specialized PLBs for data and instructions. And they all work on virtual addresses. Virtual address space is protected, independent of the translation to physical addresses.
This makes those buffers small and fast, and in fact protections can be checked in parallel with cache lookups, faulting when there are no permissions.
The reason address translation and protection were done together is mainly because historically the limited addresss space of 32bit was too small for all programs on the system, so each program got its own address space.
Regions and Turfs
What does a region look like? It's just a continuous stretch in the address space with a start and an end and a few attached attributes. They can overlap. A PLB region entry contains:
- lower and upper bounds of the region
- the rights, read and write for the data accesses, execute and portal for the instruction side
- a turf ID and a thread ID, both which may be wildcarded, when both are set, both must match
All regions that share the same turf ID together comprise a turf. That's all a turf is, a collection of regions. The turf, identfied by its ID, is the protection domain the Mill operates on, in and width.
There is nothing preventing several region entries to have the same bounds, as long as the IDs in the entry are different. The IDs are maintained by hardware and cannot be forged.
On boot the hardware provides a root turf containing one region with all rights to the first thread, the all-turf. From there on out the first thread can pass on any right and region subdivision as it sees fit to other threads, which those threads can do with what they got, and so it cascades down.
Threads
A thread is mostly quite conventional. It is a contained flow of execution and identifiable via its ID. A thread always executes within its protection domain, within its turf. It only ever has one turf at any one time, but the turf can change over time by crossing protection barriers. Many threads can share the same turf.
Well Known Regions
As fast as the PLB checks are in comparison to conventional combined translation/protection, they still take time and energy you would like to avoid if you can. And the vast majority of memory accesses for each program happen to the same memory regions, namely the code, data, const data and bss segments, as well as the stack as well as the thread local storage.
For this reason those regions are defined via Registers. Each turf has the code, data and const data well known regions, defined by the cp, ccp and dp registers respectively. The program loader fills those registers, and they are saved and restored by the hardware as needed on context switches.
Each thread has the stack and the thread local storage, defined by the sp and tp registers. The stack region grows and shrinks together with the stack pointer. You can't access the stack beyond the top of the stack, and of course also not below the base.
Protecting Stacks
So you can't access above the frame pointer. And newly allocated frames are automatically zero. This means you never can access any data on the stack that you shouldn't. And those are pure data stacks, too.
They contain no control flow information, in particular no return addresses. The control stacks are maintained entirely within the hardware, inaccessible from programs. They get saved and restored by the Spiller as needed. This makes several classes of common security exploits simply impossible.
Portals
The question now is, how do you change the turf? Of course you could do a normal task switch and change the thread. That changes the turf if the new thread has a different turf, but that's not ecessarily case. Many threads, all the worker threads in a simple web server for example, could have the same turf.
If you want to change into a different turf, to access an OS managed resource for example, you have to go through a portal. A portal defines a gateway into a specific turf or protection domain. All the information needed for that, all the information for a portal, is contained within one flow instruction cache line. It contains:
- an entry address to the code to run
- a turf ID
- the registers that define the well known regions of the turf
A program calls a pointer to a portal just the same as it calls a normal function pointer. Because of the portal rights attached to the region of the address, the hardware knows it's gotta do a portal call.
And that's what is does, by saving the turf ID and well known region registers of the current turf of the thread and then switching them out with the ones from the portal descriptor. This is followed with a normal call to the entry address given in the descriptor.
This is not a task switch. The result of a portal call is the same thread running in a new protection environment, in a new turf. All the working data sets are still in the caches, all the thread state, except for the well known regions, is still there. No bubbles in the pipeline etc. Prefetch works across portal calls just as it works across normal calls.
Services
There is a major difference in portal calls to normal function calls: They can hold state between invocations, and they do it for every turf/thread combination. In that sense they are more like coroutines, except every portal is an explicitly defined entry point that can be explicitly called. You could also interpret a collection of portals to a turf as a class interface, and once one portal is called from a thread that class has been instantiated in a thread. You could also shoehorn it in the closure category. It's also not a process because it doesn't have its own thread of execution.
In any case the differences are always significant enough that a new terminology for this kind of behavior is warranted, so the term we use is "service". An OS will provide services like an IO service with open and write and close portals that all alter service internal state.
And it should be remembered that all parties are equal towards each other. Just as much as the OS is a collection of services for the applications, the applications are services to the OS.
Interrupts
Interrupts are just unscheduled function calls, triggered by hardware. The interrupt handler can be normal function pointers, in case a program wants to handle overflows itself for example, and registers an own handler. It can be normal functions in OS defined tables. It can also can be portals in defined tables for handlers to execute in different turfs. Overall there are 4 levels of interrupt/fault/trap handlers, and only when they all fail to catch the interrupt, the system powers down.
The only additional costs for an interrupt in comparison to a normal function or portal call is that it can't be predicted and prefetched, pretty much by definition, and almost always at least leads to a top level instruction cache miss, and is also more likely to miss the I$1 and L$2.
Implementation
Stacklets
Granting and Revoking
Region Table
Rationale
On conventional architectures context switches are incredibly expensive. They can run into hundreds of cycles just to change the processor core state. And on top of that comes the cache threashing and all the memory accesses to switch the working data sets. Context switches used to be used to hide memory latencies on single core machines. But over time, with improved caching and reduced memory latencies, exploding internal processor state and increasing core counts, context switches increasingly become the main cause of memory stalls on conventional architectures.
As a result operating system architectures more and more revolve around avoiding context switches, even in places where they would actually be needed from a security and stability standpoint. This cost is the main reason why microkernels are impractical on mainstream processors. As a result it is still common for buggy device drivers to take down whole systems. And where the security features are absulutely unavoidable systems often spend a third and more of their cycles on context switches and related management like TLB and cache shuffling.[1]
To avoid all this, the goal should be to just pass the needed parameters back and forth between the protection domains with mutually protected code, but don't switch out all the surrounding context with it each time. This is of course an ideal and not fully attainable. But the closer you get to this ideal, the cheaper and cleaner the protection primitives become. And this is what drove the design.
All the context that needs to be switched to safely cross a protection barrier can be contained in one cache line. And in this clean and simple and cheap definition and execution of a security domain switch lies the basis of a flexible and reliable and efficient security framework. You define the gateways between the protection areas yourself and contain those definitions in small protected packages.
The protection domains, the regions and turfs, can be counted in the thousands or tens of thousands on a system. This is relative coarsely grained security. Having true capabilites, and protecting on the level of single objects and functions would be even better and cheaper from the security perspective, with even less or no context to switch and so forth. But implementing this requires non-commodity memory, and it also conflicts with the memory model of the most common languages like C, as they are used.
Media
Presentation on Security Features by Ivan Godard - Slides