Security

staff

Talk given by Ivan Godard - 2014-03-21 at Google.

Slides: Powerpoint (.pptx)

Security and reliability on the Mill CPU:

Naughty, naughty; bad program, mustn’t do that!

Software bugs have always been a problem, but in recent years bugs have become an even more serious concern as they are exploited to breach system security for privacy violation, theft, and even terrorism or acts of war.

The Mill CPU architecture addresses software robustness in three basic ways: it makes impossible many errors and exploits; it detects and reports many errors and exploits that cannot be prevented; and it survives and recovers from many detected errors and exploits. None of these ways involve loss of performance.

The talk describes some of the Mill CPU features that defend against well-known error and exploit patterns. Examples include:

a call stack structure that cannot be overwritten to redirect execution on return

an instruction format that makes “return-oriented programming” exploits very difficult

an inter-process protection mechanism that lets applications, server code, and operating systems follow “least privilege” principles

These features will be discussed in the context of the overall Mill CPU security model, which defends not only against known errors and exploits, but also against unanticipated future failures.

squizzle

Another fascinating talk. I've got a few questions from it (you might notice a theme :)

Could a thread manipulate its own spReg to allow it to make calls without overflowing the stack, BUT overflowing the state in the spiller?

Could a thread do a grant in a loop, generating regions until something breaks? It sounds like the OS won't get a chance to stop it until PLB eviction occurs. Also, if it's on a family member which does it in hardware, that removes the OS's ability to regulate a rogue process.

It sounds like if a thread fills the region table with regions overlapping one address, the augmented interval tree will handle it happily (ignoring the previous question), however what if the PLB is (mostly) full of regions that all overlap an address, which a thread then accesses? Would it need to hit an excessive amount of regions in the cache to resolve the final permission set?

On the topic of overlapping regions, I assume it's an OR of the permissions, not an AND?

Does the iPLB ignore read permissions? (ie, can code be execute, non-read)

Keep up the good work

ivan

Could a thread manipulate its own spReg to allow it to make calls without overflowing the stack, BUT overflowing the state in the spiller?

Not unless it had rights to the MMIO register that shadows spReg, and if it has rights like that then it's part of the OS and had better know what it is doing. spReg (and nearly all the specRegs) are not user-assignable except via MMIO. Of course, they change all the time as a side effect of executed operations like call, but not directly.

Could a thread do a grant in a loop, generating regions until something breaks? It sounds like the OS won’t get a chance to stop it until PLB eviction occurs. Also, if it’s on a family member which does it in hardware, that removes the OS’s ability to regulate a rogue process.

The grant quantum, while set by policy, is managed by hardware and will throw a fault if exceeded. We didn't want to track it at evict time in case we have members that do evicts (with PLB entry creation) in hardware.
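As a rough illustration of that behavior (a toy model, not Mill code): the sketch below counts grants against a policy-set limit and faults at grant time rather than at eviction time. The names `GrantCounter` and `GrantQuotaFault` are hypothetical, and the unit the quantum applies to (thread, turf, or something else) is not stated above, so the model just uses one counter.

```python
class GrantQuotaFault(Exception):
    """Models the fault thrown when a grant would exceed the quantum."""


class GrantCounter:
    """Toy model: counts grants against a limit chosen by policy."""

    def __init__(self, quantum):
        self.quantum = quantum  # limit set by policy (value is an assumption)
        self.live = 0           # grants issued so far

    def grant(self):
        # The check happens when the grant is made, not when a PLB entry is evicted.
        if self.live >= self.quantum:
            raise GrantQuotaFault("grant quantum exceeded")
        self.live += 1


def grant_in_a_loop(counter, attempts):
    """Return how many grants succeeded before the fault fired."""
    done = 0
    try:
        for _ in range(attempts):
            counter.grant()
            done += 1
    except GrantQuotaFault:
        pass
    return done
```

With a quantum of 4, a loop that attempts 100 grants is stopped after the fourth, so the rogue loop in the question never gets as far as eviction.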

It sounds like if a thread fills the region table with regions overlapping one address, the augmented interval tree will handle it happily (ignoring the previous question), however what if the PLB is (mostly) full of regions that all overlap an address, which a thread then accesses? Would it need to hit an excessive amount of regions in the cache to resolve the final permission set?

The region table is searched until the first entry that would give permission is found. Entries are not concatenated during the search; permission for the whole access must be provided by a single entry. That keeps the search cost manageable; if you really need to have an access that is part permitted in different entries then you must use a service that will push a new grant covering the union of your existing entries.

On the topic of overlapping regions, I assume it’s an OR of the permissions, not an AND?

Any single entry is satisfied if all of the region, thread, turf and rights match.
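The search rule in the last two answers can be sketched as follows. This is a minimal model under stated assumptions: a plain list stands in for the augmented interval tree, ids are matched exactly, and the names `Entry`, `permits`, and `find_grant` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Entry:
    lo: int            # first byte covered by the region
    hi: int            # one past the last byte covered
    thread: int
    turf: int
    rights: frozenset  # subset of {"r", "w", "x", "p"}


def permits(e: Entry, addr: int, size: int, thread: int, turf: int, right: str) -> bool:
    """True only if this one entry covers the whole access and every field matches."""
    return (e.lo <= addr and addr + size <= e.hi
            and e.thread == thread and e.turf == turf
            and right in e.rights)


def find_grant(table: Sequence[Entry], addr: int, size: int,
               thread: int, turf: int, right: str) -> Optional[Entry]:
    """Return the first entry that grants the access, or None (a fault)."""
    for e in table:
        if permits(e, addr, size, thread, turf, right):
            return e  # stop at the first hit; entries are never combined
    return None
```

In this model an 8-byte read that straddles two adjacent read-only regions is refused even though each half is readable, because no single entry covers it. The search stops at the first entry that grants the access, so overlapping entries do not need to be merged.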

Does the iPLB ignore read permissions? (ie, can code be execute, non-read)

The iPLB is used only for execute access, and its entries need only distinguish eXecute from Portal. Similarly the dPLB is used only for load and store, and need only have Read and Write permission bits. Table entries have all the bits so as to reduce the number of entries.

David

Just a super-quick verification about things I presume, but weren't explicitly stated:

For the inp/outp data passing within a local function call (not a portal), the implicit arguments do not undergo any additional memory copying during the function call?

Implicit arguments are simply framePointer[-N], whereas local stack registers are framePointer[+N], and thus are the same cost to access?

I'm pondering the Lisp/Javascript/etc style of effectively always-varargs parameter passing, and it seems that this would be the mechanism employed.

vapats

Wow. Simply. Brilliant. I see why you folks are busy securing your patents!

Y'all must have been letting some of these ideas stew for the past several decades...

cheers, - vic

Will_Edwards

Yes, you can use the stack to pass arguments between frames.

You may also use the heap. And how you do it depends a lot on whether you make a distinction between objects and primitives, and how you do garbage collection.

Normal calls within a compilation unit are very conventional and all the normal approaches work; it's only if you want to take advantage of the Mill protection mechanism, or it's convenient for dynamic linking, that you use portals.

ivan

A bit more - while you can use your convention-of-choice to pass arguments in memory, passing them in the belt will be much faster, with a performance difference roughly equal to the difference between passing in registers vs. memory on a legacy architecture. So I'd expect performance-oriented Lisp/JS implementations to use the belt when they can, and restrict the always-varargs approach to blind calls when they can't optimize.

yorik.sar

Great talk!

I found myself thinking that what you're saying is mostly what I was thinking about when I was studying the Genode OS Framework. I understood that you cannot do things that way efficiently and securely on x86. What's missing is the small, fast portals that you have! Are you going to build your OS for Mill with Genode or something like it?

yorik.sar

I also have some questions that bothered me throughout the talk.

* How do you create a turf? Who can create it? Can it be abused?
* Can you create a VMM that doesn't have right to read/write/execute all the memory but can grant access to it? Can you decouple VMM from the rest of OS?

yorik.sar

Sorry for mixing things up in the comments. I can't post long posts here somehow.

yorik.sar

* Some services might require passing a list of callbacks. Application will have to essentially build a number of portals in its data space. Can any application grant portal permissions to some address space that it can write?

yorik.sar

If it does, how is that region verified? Should turf ids be the same as turf that running thread is in currently? Can application abuse this by saying "All TBs of my memory are portals. Go, verify"?
* How does that need for portals map to existing C APIs?

Thank you very much for what you do in both creating a really good architecture and teaching it.

ivan

We are working on a reference port of Linux/L4 (https://en.wikipedia.org/wiki/L4_microkernel). Other implementers will have their own approach; Windows on a Mill is an interesting idea :-)

Turfs, threads, and portals (other than the first, created by the hardware) are built by trusted services, principally the loader. The critical steps include allocating turf and thread ids and blessing a hunk of address space with the Portal permission. Family members vary in the hardware help for these steps, but all unsafe parts of the work are done via MMIO, so only those with rights to the relevant MMIO regions can do it.

The software that does these steps will presumably have some notion of quantum, to prevent "portal bombs" such as you suggest. However, that's all policy, and so not defined by the Mill architecture.

We have had no problem mapping Mill security to the C API. The general strategy is to use a user-space shim that is linked right into the app. The app calls the shim, and the shim does the portal call, any necessary passing of arguments, and so on. However, we are far from done with the work and may find a gotcha as we get further. If so we'll deal with it then.

Callbacks are handled much like linking of dynamic libraries (.so files). When you link to a dynlib on a conventional system, the app must itself export its callback entry points so the linker can fix up the calls in the library, just as it fixes up the calls in the app to point to the entry points in the library; this is a fairly sophisticated use of the linker, but well known and quite routine.

On a Mill you do the same thing, only the exported entries are made into portals by the loader, rather than rewriting the code addresses. However, if you are a JIT, so the entry address isn't known at load time, then the entry portals (in either direction) must point to trampoline code at fixed addresses, where the trampoline indirects to the movable entry point. It costs one more cycle in and out, so no big deal.
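The trampoline idea can be shown with a small software analogy (not Mill code). The portal keeps pointing at one fixed object, and only the target inside that object changes when the JIT moves or recompiles the function. The class name `Trampoline` and the sample functions are hypothetical.

```python
class Trampoline:
    """Stands in for trampoline code at a fixed address."""

    def __init__(self, target):
        self._target = target

    def retarget(self, new_target):
        # Called when the JIT moves or recompiles the real entry point.
        self._target = new_target

    def __call__(self, *args):
        # One extra hop on the way in (and, on real hardware, on the way out).
        return self._target(*args)


def compiled_v1(x):
    return x + 1


def compiled_v2(x):
    # Same function after the JIT has regenerated it elsewhere.
    return x + 1


tramp = Trampoline(compiled_v1)
portal = tramp               # the portal is bound once, at load time
before = portal(41)
tramp.retarget(compiled_v2)  # the entry point moved; the portal is untouched
after = portal(41)
```

The caller's view (`portal`) never changes, which is the property the fixed-address trampoline provides.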

There's currently no way to grant something unless you have the same access yourself. It's doable - another permission bit - but we have no use case that needs it.
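Put as a rule, a grant can only pass on rights the grantor already holds. A minimal sketch of that check, with a hypothetical `grant` function and rights modeled as sets of letters:

```python
def grant(grantor_rights, requested):
    """Return the rights passed on, or refuse if the grantor lacks any of them."""
    requested = frozenset(requested)
    missing = requested - frozenset(grantor_rights)
    if missing:
        raise PermissionError(f"cannot grant rights not held: {sorted(missing)}")
    return requested
```

A grantor holding read and write can pass on read alone, but an attempt to pass on execute is refused. A VMM that can grant memory it cannot itself touch would need the extra permission bit mentioned above.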

Ivan

rpjohnst

I like it!

Some questions about a few fixed-size things. Stacks will often need to be bigger than one segment/stacklet. There may also be more processes/threads than available turf/thread ids; Linux's pid_t is 32 bits, for example. How would these things be handled?

Stacklets could just be entry points that switch to the real stack I guess. How would that work, since call/return seems to be the only (user-available) way to mess with the stack pointer? This also goes with an earlier question I asked about continuations/coroutines.

ivan

Neither threads nor turfs are processes in the Unix sense, so pid_t can remain unchanged. A Unix process can be seen as an initial thread residing in a new turf, but you can have turfs without threads on the Mill, which is not possible on Unix because there is no way to have an isolated load module that is a pure service.

Stack segments can be arbitrarily sized, subject to OS-enforced quanta as usual. The info block describes the top segment of the segmented stack. A new service call (not a callback) will always use the reserved stacklet, but it initially has no data frame (the spiller has a frame, of course). Creating a frame is a distinct Mill operation, stackf, which carries a size in bytes. If that exceeds the current limit then the hardware traps to a handler (which would ordinarily be a portal to a service) that runs out and allocates enough space for another segment, fixes up the links and the frame and stack pointer registers, and updates the info block. The handler enforces whatever quantum policy is desired. The alloca operation can also trigger stack overflow, with similar handling.

Subsequent callbacks will use the new segment. An exit followed by a new call can be optimized if the handler keeps a previous, now empty, segment around to avoid allocate/deallocate overhead when call/returns are bouncing over a segment boundary.
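The overflow handling and the spare-segment optimization described above can be modeled roughly as below. This is a toy simulation under assumptions, not the hardware mechanism: sizes are plain integers, the quantum is a cap on segment count, and all class and method names except `stackf` are invented for the sketch.

```python
class StackQuotaFault(Exception):
    """Raised by the handler when policy forbids another segment."""


class Segment:
    def __init__(self, size):
        self.size = size
        self.used = 0


class SegmentedStack:
    def __init__(self, segment_size, max_segments):
        self.segment_size = segment_size
        self.max_segments = max_segments         # quantum policy (assumed form)
        self.segments = [Segment(segment_size)]  # starts with an empty stacklet
        self.spare = None                        # one cached, now-empty segment
        self.frames = []                         # (segment, size) per live frame
        self.allocations = 0                     # fresh segments allocated so far

    def stackf(self, size):
        """Create a data frame; call the handler if the top segment is too full."""
        top = self.segments[-1]
        if top.used + size > top.size:
            top = self._overflow_handler(size)
        top.used += size
        self.frames.append((top, size))

    def _overflow_handler(self, size):
        if len(self.segments) >= self.max_segments:
            raise StackQuotaFault("stack quantum exceeded")
        if self.spare is not None and self.spare.size >= size:
            seg, self.spare = self.spare, None   # reuse instead of allocating
        else:
            seg = Segment(max(self.segment_size, size))
            self.allocations += 1
        self.segments.append(seg)
        return seg

    def exit_frame(self):
        seg, size = self.frames.pop()
        seg.used -= size
        if seg.used == 0 and len(self.segments) > 1 and seg is self.segments[-1]:
            self.spare = self.segments.pop()     # keep it around for the next call
```

When calls and returns bounce over a segment boundary, only the first crossing allocates; later crossings reuse the cached segment, which is the saving the last paragraph describes.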

JonathanThompson

Hi rpjohnst,

For low-level hardware things I'll leave that to Ivan or someone else formally of Mill Computing.

However, for all hardware Linux runs on, memory management, pids/threads and the like are also an abstraction. Where there's more required than hardware provides, you'd never see a difference in code, as that's also abstracted: that sort of thing is handled the same way virtual memory is. That a pid is 32 bits is purely a housekeeping detail that is convenient for the software, and has no connection with the hardware it runs on or its limitations. All those tables can be swapped out as needed by the OS to handle as many as desired.

And, seriously: I'd be shocked if there's a single CPU with any number of cores that Linux runs on that you'll find 2^22 threads/processes in-use at any given time ;)

yorik.sar

Genode is an OS framework ("an offspring of the L4 community" they say on wiki). It uses whatever microkernel you give it (NOVA, Fiasco, L4) and provides everything you need to get proper OS on top of it. It can also run Linux or FreeBSD processes within it. And it heavily uses services that Mill supports in hardware.

So I wonder if it would be easier to port L4 with Genode instead of Linux on top of it to Mill to get proper OS faster and with less effort.

yorik.sar

On second thought, you don't even need a microkernel here. Genode has a "hw" platform that uses hardware features to provide everything that's usually done in the kernel. And Mill might do really good with it.

ivan

The Mill does good with a lot of things :-)

Seriously, while we know that we will do a reference microkernel, the decision of which extant system to build on must wait until the new compiler is up enough to take the load, which won't be for a while yet. It also depends on who we find as partners in the effort; if a grad student at University A gets fired up about porting system X to the Mill then we will arrange access to our tools and simulators and, eventually, hardware. We'll also support ports by B of Y, C of Z and so on; may the best and fastest win. The Mill is intended to be policy-agnostic, so the more different systems that get ported the more confident we are in the hardware model provided. As research ports come out, we will be working to make one or another be commercial grade, or driving a merger of the most successful ideas of several.

But all that waits on the tool chain for now.

rpjohnst

Ah, I like the hardware support for segmented stacks.

On thread/turf ids, how would the swapping out be handled with e.g. currently running threads needing their id revoked (as an analogy to swapping out an LRU page)? I'm asking out of pure "640k should be enough for anyone"-curiosity. :P