Protection

From Mill Computing Wiki
Latest revision as of 17:16, 10 June 2015

All security features on the Mill revolve around permissions on address space. There are no privileged instructions or protection rings. All processes are equal in the sense that they can share the permission they have (but only those they have), with whoever they choose to the degree they choose.

Using the Mill security primitives in practice is in most cases no different from calling a normal function, and not much more expensive either.

Protection Lookaside Buffer

In the Architecture and Memory overview charts you can see that right above the highest cache lines sit the address protection units and their attached PLBs. There are two separate, specialized PLBs for data and instructions, and both work on virtual addresses. Virtual address space is protected independently of the translation to physical addresses.
This keeps those buffers small and fast. In fact, protection can be checked in parallel with cache lookups, faulting when there are no permissions.

Address translation and protection were historically done together because the limited 32-bit address space was too small for all programs on the system, so each program got its own address space.

Regions and Turfs

A region is just a continuous stretch in the address space with a start and an end and a few attached attributes. Regions can overlap. A PLB region entry contains:

  • lower and upper bounds of the region
  • the rights: read and write for the data accesses, execute and portal for the instruction side
  • a turf ID and a thread ID, both of which may be wildcarded. When both are set, both must match.

All regions that share the same turf ID together comprise a turf. That's all a turf is: a collection of regions. The turf, identified by its ID, is the protection domain the Mill operates on, in, and with.
There is nothing preventing several region entries from having the same bounds, as long as the IDs in the entries differ. The IDs are maintained by hardware and cannot be forged.

On boot the hardware provides a root turf containing one region with all rights to the first thread, the all-turf. From there on out the first thread can pass on any right and region subdivision as it sees fit to other threads, which those threads can also do to what they received. And so it cascades down.

Threads

A thread is mostly quite conventional. It is a contained flow of execution and identifiable via its ID. A thread always executes within its protection domain, within its turf. It only ever has one turf at any one time, but the turf can change over time by crossing protection barriers. Many threads can share the same turf.

Well Known Regions

As fast as the PLB checks are in comparison to conventional combined translation/protection, they still take time and energy that ideally would be avoided. And the vast majority of memory accesses for each program happen to the same memory regions, namely the code, data, const data, and bss segments, as well as the stack and the thread local storage.
For this reason those regions are defined via registers. Each turf has well-known regions for its code, const data, and data, defined by the cp, ccp, and dp registers respectively. The program loader fills those registers, and they are saved and restored by the hardware as needed on context switches.
Each thread has the stack and the thread local storage, defined by the sp and tp registers. The stack region grows and shrinks together with the stack pointer. The stack can't be accessed beyond the top of the stack or below the base.

Protecting Stacks

So anything above the frame pointer can't be accessed, and newly allocated frames are automatically zero. This means any data on the stack that shouldn't be accessed cannot be accessed. And these are pure data stacks, too.

They contain no control flow information. In particular, no return addresses. The control stacks are maintained entirely within the hardware, inaccessible from programs. They get saved and restored by the Spiller as needed. This makes several classes of common security exploits simply impossible.

Portals

A turf could of course be changed through a normal task switch to a different thread. This changes the turf if the new thread has a different one, but that's not necessarily the case. Many threads, all the worker threads in a simple web server for example, could share the same turf.

If the goal is to change into a different turf, for example to access an OS-managed resource, this is done by going through a portal. A portal defines a gateway into a specific turf, a specific protection domain. All the information for a portal fits in one flow instruction cache line. It contains:

  • an entry address to the code to run
  • a turf ID
  • the registers that define the well-known regions of the turf

A program calls a pointer to a portal just as it calls a normal function pointer. Because of the portal right attached to the region of the address, the hardware knows it has to perform a portal call.
It does so by saving the turf ID and well-known region registers of the current turf and thread, then switching them out for the ones from the portal descriptor. This is followed by a normal call to the entry address given in the descriptor.

This is not a task switch. The result of a portal call is the same thread running in a new protection environment, in a new turf. All the working data sets are still in the caches, all the thread state, except for the well-known regions, is still there. No bubbles in the pipeline etc. Prefetch works across portal calls just as it works across normal calls.

Services

There is a major difference between portal calls and normal function calls: they can hold state between invocations, and they do so for every turf/thread combination. In that sense they are more like coroutines, except that every portal is an explicitly defined entry point that can be explicitly called. A collection of portals into a turf could also be interpreted as a class interface, and once one portal is called from a thread, that class has been instantiated for that thread. You could also shoehorn it into the closure category. It's also not a process, because it doesn't have its own thread of execution.

In any case, the differences are always significant enough that a new terminology for this kind of behavior is warranted. The term we use is "service". An OS will provide services like an IO service with open and write and close portals that all alter service internal state.
And it should be remembered that all parties are equal towards each other. Just as much as the OS is a collection of services for the applications, the applications are services to the OS.

Interrupts

Interrupts are just unscheduled function calls, triggered by hardware. An interrupt handler can be a normal function pointer, in case a program wants to handle overflows itself, for example, and registers its own handler. It can be a normal function in an OS-defined table. It can also be a portal in a defined table, for handlers that execute in different turfs. Overall there are 4 levels of interrupt/fault/trap handlers, and only when all of them fail to catch the interrupt does the system power down.
The only additional cost of an interrupt, compared to a normal function or portal call, is that it can't be predicted and prefetched, pretty much by definition. It almost always leads to at least a top level instruction cache miss, and is also more likely to miss in the I$1 and L$2.

Implementation

Stacklets

Each turf and thread combination has its own stack memory block, called a stacklet. When a thread moves through portals into different turfs, the whole logical thread stack is a linked list of stacklets. This is a common implementation of process stacks on many systems. Each stacklet has its own well-known region protection, because each belongs to its own turf. That's the whole point: to isolate different turfs from each other. No extra PLB entries are needed.

Those stacklet blocks are automatically allocated by hardware. To make that possible, the upper part of the address space is reserved for stacklets, as a 2-dimensional array indexed by thread ID and turf ID. The hardware can trivially compute a stacklet's address. Due to implicit zero you can load from it, and you can store to it; it remains backless memory until the first evict happens, if the thread runs long enough.

How large stacklets are and how many turfs and threads are available and how much of the address space they hold overall is implementation specific. A system that allows 1M turfs and 1M threads and 8k stacklets reserves the upper 8PB of the address space.

Stacklet Info Block

To preserve the stack state of each thread in each turf, each stacklet needs an info block. Stacklet info blocks are allocated like the stacklets themselves, from a reserved address space indexed by turf ID and thread ID. Each is one cache line wide and contains the top of stack, the stack base, and the stack limit. Uninitialized stacklet info blocks are, like all uninitialized memory, zero, and as such encode the default state of an empty stacklet. When a portal call occurs, the info block is written before leaving.

Each portal call has to pull the portal descriptor and then the stacklet info block: two cache lines. Thus the overall cost of a portal call is a call plus two cache pulls. The write of the current info block is masked by the two pulls.

Granting and Revoking

Granting and revoking rights are hardware operations. They can be explicitly done to a specific turf or thread with grant and revoke, which puts a new region entry into the PLB or removes it, respectively.

It can also be done implicitly and temporarily with pass. Pass puts a temporary grant for the thread, with a wildcarded turf, into the PLB. It is grouped with a portal call: while in a different turf after the portal call, execution is still in the same thread, and as such the rights for the region passed with pass apply. When the called portal function returns, the passed rights are automatically removed again.
This is only needed when the required arguments can't be passed over the belt and must go through memory, like strings or buffers or more complex and larger data structures. Usually those grants are hidden in the OS libraries like libc, so applications don't need to worry.
There is also an operation called args, which ordinarily has nothing to do with granting rights. It reserves a portion of the data stack by setting the outp register to some value below sp, as needed. The call hardware sets inp to the given outp, and the callee now has access to this portion of the data stack. As it is part of the well-known stack region of the thread, this works across portal calls, too.

Region Table

Above, it was said that the PLB is fast and small. This is only true in comparison to the TLB. The TLB only needs to be fast relative to DRAM accesses, since translation must produce a physical address before physical memory can be accessed. The PLBs need to be as fast as the L1 caches, since PLB lookups run in parallel with cache lookups. So PLB lookups are still an energy cost best avoided; hence the well-known regions as optimizations for the most common cases.
Stacklets are another way to keep clutter out of the PLBs: otherwise the call frames on each portal call would have to be protected with a new region, to isolate the stacks of the services from each other. Instead, only one region exists per turf/thread combination, and it is a well-known region that doesn't appear in the PLBs.

Even though almost all access rights are managed within well-known regions, and almost all transient grants only ever appear in the PLBs, there are longer-standing grants. Also, the PLBs are of limited size and can run out of space.
This is why the PLBs are backed by memory, in a structure called the region table. It is set up by the OS.

As already stated, grants are entered into the PLB. New grants have a novel bit set in the PLB. When an explicit or implicit revoke happens and the novel bit is set, the entry is simply deleted.
If, on the other hand, a new grant evicts the oldest entry from the PLB, the evicted entry is written into the region table. When a PLB lookup misses, the hardware searches the region table, which is organized as an augmented interval tree with log N complexity for all accesses. When it finds an entry, it is hoisted up into the PLB again, this time without the novel bit set.
If a non-novel entry is to be deleted, it also has to be removed from the region table. Whether that is done lazily or eagerly, by the OS or by hardware, is implementation specific.

Rationale

On conventional architectures, context switches are incredibly expensive. They can run into hundreds of cycles just to change the processor core state. And on top of that comes the cache thrashing and all the memory accesses to switch the working data sets. Context switches used to hide memory latencies on single core machines. But over time, with improved caching, reduced memory latencies, exploding internal processor state, and increasing core counts, context switches have become one of the main causes of memory stalls.

As a result, operating system architectures more and more revolve around avoiding context switches, even in places where they would be really needed from a security and stability standpoint. This cost is the main reason that microkernels are impractical on mainstream processors. So it is still common for buggy device drivers to take down whole systems. And where the security features are absolutely unavoidable, systems often spend a third or more of their cycles on context switches and related management like TLB and cache shuffling.[1]

To avoid all this, the goal should be to just pass the needed parameters back and forth between the protection domains with mutually protected code, without switching out all the surrounding context each time. This is of course an ideal and not fully attainable. But the closer an architecture comes to this ideal, the cheaper and cleaner the protection primitives become. This is what drove the design.

All the context that needs to be switched to safely cross a protection barrier can be contained in two cache lines. And in this clean, simple, and cheap definition and execution of a security domain switch lies the basis of a flexible, reliable, and efficient security framework. The gateways between the protection domains are defined by the user and contained in small protected packages.

The protection domains, the regions and turfs, can be counted in the thousands or tens of thousands on a system. This is relatively coarsely grained security. Having true capabilities, and protecting on the level of single objects and functions would be even better and cheaper from the security perspective, with even less or no context to switch and so forth. But implementing this requires non-commodity memory, and conflicts with the memory model of the most common languages like C, as they are used.

Media

Presentation on Security Features by Ivan Godard - Slides

References

  1. Linus Torvalds on the cost of page fault interrupts