Security

ivan

We have not put hardware encryption into the base Mill, because we expect encryption methods to change over the life of the architecture and don't want to be saddled with outdated legacy requirements.

That said, the Mill has enough horsepower to do reasonable software crypto. For applications that want more, the Mill defines an extension interface to which things like crypto and other specialized engines can be hooked. The interface makes the engine look like an i/o device to the program using it.

We have also considered supporting a physical seed unit. Such units give only a few bits of entropy and so cannot themselves be used for crypto, but they provide an uncrackable seed for regular algorithms. The decision on that has been deferred until after the FPGA version.

ivan

Certainly. I was describing what I expected the common implementation to be. We will support the specialize API as standard system software; it's up to the JIT whether to use it or not.

ivan

As suggested by the comments above, the thread id is unchanged through a portal call - it's the same thread, just running in another protection environment. The current thread id is a readable specReg, so the service can know who it is working for. From that it can find more info in data structures it or other services maintain.

However, it also can keep client-specific info in Thread Local Storage. Each client/service combination has its own TLS just like it has its own stacklet, addressed by a hardware base register like the hardware frame pointer.

squizzle

I think not supporting an AES primitive is a mistake. One day AES is likely to be replaced, but that's certainly many years away, and unless there's a massive and complete break, it will be even longer before people actually stop using it (how many people still use SHA1? MD5?).

If it (or perhaps a generic crypto instruction with a parameter for algorithm) is in the machine independent instruction set, it can be emulated when it reaches the point of being removed from silicon. When we reach this point, the emulation will only need to be correct, not super hand optimized for every cycle, so the cost required to maintain the specialiser on a new family member would presumably be very low / none.

ivan

There are issues with AES, any other crypto, and any block functional unit of any purpose. Recall that the Mill is a statically-scheduled fully pipelined machine with exposed timing. Long latency operations don't play well, and the AES is hundreds to thousands of cycles depending on implementation.

Moreover, on a fully-pipelined machine like the Mill you must be able to issue a new AES every cycle, which means that you need hundreds to thousands of AES engines because the iterative nature of the algorithm doesn't pipeline.
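
To put rough numbers on that argument: if a unit cannot be pipelined internally and a new operation may issue every cycle, the number of independent engines needed to keep up is the latency divided by the issue interval. A minimal sketch of the arithmetic follows; the function name and the sample latencies are illustrative assumptions, not Mill figures.

```python
import math

def engines_needed(latency_cycles: int, issue_interval_cycles: int = 1) -> int:
    """Engines required to accept a new operation every issue interval
    when each engine stays busy for latency_cycles and cannot be pipelined."""
    if latency_cycles <= 0 or issue_interval_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return math.ceil(latency_cycles / issue_interval_cycles)

# Hypothetical latencies in the "hundreds to thousands of cycles" range.
for latency in (200, 1000):
    print(latency, "cycle latency ->", engines_needed(latency), "engines")
```

Relaxing the issue rate (a larger issue interval) shrinks the engine count proportionally, which is one way to see why an out-of-band unit needs only one engine.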

Next there are issues with data width. AES supports different data widths, 128-bit being typical. How would we feed it on Mills that do not support quad width?

There are similar issues with long-latency scheduling too. The compiler will find at most a handful of other operations that it can schedule in parallel with an AES operation, so the rest of the machine will be stalled for the great majority of the time of the AES. The stall would likely block interrupts as well.

I sympathize with your desire that AES should be supported (and there are quite a few plausible other block functions that you don't mention). However, I think you are confusing a desire for a primitive, which is a semantic notion, with an operation, which is an implementation notion. AES may make a very good primitive that the market will demand and we should support; it makes a very bad operation. Instead, it should be implemented as an out-of-band block functionality akin to an i/o device in its interface. That way it doesn't have to fit into the decode/pipeline/belt that the Mill uses, and you only need one of them rather than hundreds.
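
The device-like interface can be pictured with a toy software model: the program submits a block, keeps doing other work, and polls for the result later, while a single engine serves requests one at a time. Everything here is a hypothetical sketch (the `BlockEngine` class, its `submit`/`tick`/`poll` methods, and the XOR stand-in for a real cipher); it does not describe the actual Mill extension interface.

```python
from collections import deque

class BlockEngine:
    """Toy model of one out-of-band engine: queued requests, results polled later."""

    def __init__(self, transform, latency_cycles):
        self._transform = transform      # stand-in for the real block function
        self._latency = latency_cycles
        self._pending = deque()          # (ticket, data) waiting for the engine
        self._current = None             # request the engine is working on
        self._finish_at = 0
        self._results = {}
        self._next_ticket = 0
        self.cycle = 0

    def submit(self, data):
        """Queue a request and return immediately with a ticket."""
        ticket = self._next_ticket
        self._next_ticket += 1
        self._pending.append((ticket, data))
        return ticket

    def tick(self):
        """Advance simulated time by one cycle."""
        self.cycle += 1
        if self._current is None and self._pending:
            self._current = self._pending.popleft()
            self._finish_at = self.cycle + self._latency
        elif self._current is not None and self.cycle >= self._finish_at:
            ticket, data = self._current
            self._results[ticket] = self._transform(data)
            self._current = None

    def poll(self, ticket):
        """Return the result once, or None if it is not ready yet."""
        return self._results.pop(ticket, None)

# The XOR transform is a placeholder, not AES.
engine = BlockEngine(lambda block: bytes(b ^ 0x5A for b in block), latency_cycles=300)
ticket = engine.submit(b"sixteen byte blk")
other_work = 0
while (result := engine.poll(ticket)) is None:
    engine.tick()        # the core is free to run unrelated operations here
    other_work += 1
print(len(result), "bytes back after", other_work, "cycles of other work")
```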

It's easy to think that what appears primitive in software can be primitive in hardware. I wish it were that easy :-)

PeterH

Allowing that the OS can give threads with a shared security domain common local memory in the service turf, putting service state in thread local memory should work beautifully. A handle may be a pointer, and any thread that can access the appropriate memory space in the service turf can then use the handle. Nice and fast.

And since an attempted read of forbidden memory produces metadata state, the service can check if the handle is valid for very low cost.

imbecile

I got 3 small questions.

1. Considering stacklets are only 4kb in size, services probably can't use too large data structures on the stack. But considering argument passing happens on the belt and call stacks in the spiller, the stack pressure on the Mill is vastly reduced. I'm assuming normal applications can have larger stacks than 4kb though.

2. There are a few hardware functionalities that could be useful if exposed to the programmer, like the lookups in the PLB and TLB. I'm not sure how feasible and secure this is, but could those lookup algorithms implemented in hardware be accessible to the programmer through a service/portal call? Or are they too tightly tied to their dedicated data tables?

3. Are you aware of the MIT exokernels? I think the Mill architecture lends itself beautifully to it, and the service concept even makes some of the contortions they go through to avoid or secure task switches unnecessary, like dedicated languages that result in code passed to privileged kernel driver modules.

David

Speaking of operating systems, while it has gotten on in years I think AmigaOS would be a great fit. If I recall everything correctly, it already assumes a flat memory model, uses by-reference IPC data passing, and OS calls are normal function calls. Memory allocation requires protection descriptions as to how they'll be shared.

I don't know how much of this has changed in AmigaOS 4, but the assumptions made for simplicity and speed back then would have great alignment with how the Mill accelerates and secures those assumptions.

ivan

1) Bottom stacklets are fixed size. Stack overflow (which can happen on the very first frame if it's big enough) pushes a grant of the occupied part of the overflowing segment, allocates a bigger (policy) stack segment somewhere, and sets the various specRegs to describe it and thereby grant permission. Return unwinds. Unwind is lazy so you don't get ping-ponging segments if you are real close to a segment boundary.
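
The effect of lazy unwind can be seen in a small software model. This is only a sketch under stated assumptions: the class, its field names, and the segment sizes are invented for illustration, and the real mechanism (grants, specRegs) is not modeled at all. The point is that an emptied overflow segment is retained, so a call that straddles the boundary repeatedly does not allocate each time.

```python
class SegmentedStack:
    """Toy model: fixed bottom segment, overflow segments retained lazily."""

    def __init__(self, bottom_size=4096, growth_size=65536):
        self.growth_size = growth_size          # assumed policy size
        self.segments = [[bottom_size, 0]]      # [capacity, bytes used]
        self.top = 0                            # index of the active segment
        self.frames = []                        # (segment index, size) per live frame
        self.allocations = 1                    # segments allocated so far

    def push_frame(self, size):
        capacity, used = self.segments[self.top]
        if used + size > capacity:              # overflow of the active segment
            nxt = self.top + 1
            retained = nxt < len(self.segments) and self.segments[nxt][0] >= size
            if not retained:
                del self.segments[nxt:]         # drop unusable retained segments
                self.segments.append([max(self.growth_size, size), 0])
                self.allocations += 1
            self.top = nxt
        self.segments[self.top][1] += size
        self.frames.append((self.top, size))

    def pop_frame(self):
        seg, size = self.frames.pop()
        self.segments[seg][1] -= size
        if seg > 0 and self.segments[seg][1] == 0:
            self.top = seg - 1                  # lazy: keep the empty segment around

stack = SegmentedStack(bottom_size=100, growth_size=1000)
stack.push_frame(90)                            # sits just below the boundary
for _ in range(10):                             # each call overflows, then returns
    stack.push_frame(20)
    stack.pop_frame()
print("segments allocated:", stack.allocations)
```

With an eager unwind the loop would allocate and release a segment on every iteration; in this model the count stays at two.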

2) I doubt that the search hardware would be exposed. Too specialized, and will vary by member so not portable.

3) I looked at the exokernel work. So far as I can see "exokernel" is just marketese for "microkernel" plus libraries. The libs would be services on a Mill, but otherwise I haven't seen anything that novel compared to prior work in microkernels and capability architectures. Please point out anything I have missed.

ivan

Yep. The Amiga got a lot of things right; amazingly so considering the vintage. AmigaOS was actually one of the (mental) use-cases during Mill development. Back even further, Cedar/Mesa was also a base.

imbecile

Yes, the exo-kernels are pretty much microkernels. The difference to other microkernels like the L4 is that the API has even lower level abstractions. They don't even really have a concept of threads for example, they work by granting processor time slices to memory mappings.

David

To my understanding, exokernels expect hardware drivers, filesystems, and other abstractions to be linked directly into user-space programs, so there is no IPC or context switching in those layers. Application optimizations can therefore drill to any abstraction depth to skip and/or cache more levels of processing and decision making than normal abstraction layers. The kernel security is only permission to hit the hardware, or some portion thereof.

However, the organization is fairly similar to microkernels. One could consider exokernels to be a particular (and peculiar) optimization of microkernel architecture.

rpjohnst

Exokernels are actually pretty different from microkernels (although the difference is somewhat orthogonal, rather than mutually exclusive).

Microkernels implement things like drivers and file systems in servers, essentially taking a piece of a monolithic kernel and moving it into a process. Applications still need to go through this interface to get things done without violating protection, so applications still can't e.g. change around file system disk block allocation algorithms, or make their own decisions regarding swapping.

Exokernels, on the other hand, provide protection at a lower level- applications can access things like disk blocks directly rather than through a privileged server. This is typically done through libraries which run with exactly the same privileges as the application itself, but which can be bypassed for greater flexibility. For example, database indexes are sometimes faster to regenerate than to swap in from disk, so a database application could choose to discard indices rather than having its LRU page swapped out.

This is why exokernel research tends to show performance improvements over monolithic kernels, and microkernel research tends to be about minimizing performance losses compared to monolithic kernels. :P

As I mentioned, you could easily have a microkernel whose servers expose their interfaces at the exokernel level rather than at the monolithic level. This would work really well on the Mill, where separating what would ordinarily be pieces of a kernel into services has a much lower cost.

ivan

Thank you; the explanation cleared up a lot for me.

I would be very doubtful about directly exposing a device to applications, the premise of exokernels. Devices (the external thingy itself, not the driver) often have *very* fragile interfaces. Apps in quest of speed are notoriously buggy. There's no conflict if the device can be assigned to just one process; the process is then both app and driver, and can tinker to its heart's desire and harm no one but itself. But few devices are as simple and inherently single-user as a card reader these days.

For a shared device, such as a drive, the app may get speed if it does its own driving. However, this exposes all other uses of the device to screw-ups on the part of the app/library/driver. Apps will tend to "just tweak it a bit", and break everybody else.

There are also issues with global behavior not being the same as local behavior in shared devices. There is a reason for central control of drives: all requests from all apps need to be sorted into seek order instead of request-issue order, or all apps using the drive suffer.
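
A small example shows why the ordering has to be global. The sketch below assumes a simplified drive where seek cost is just the distance between block numbers, and uses a single upward sweep with wrap-around as the ordering policy; the function names and the request data are made up for illustration.

```python
def seek_order(requests, head=0):
    """Order (app, block) requests as one upward sweep from head, then wrap."""
    ahead = sorted((r for r in requests if r[1] >= head), key=lambda r: r[1])
    behind = sorted((r for r in requests if r[1] < head), key=lambda r: r[1])
    return ahead + behind

def total_seek(order, head=0):
    """Total head movement if requests are served in the given order."""
    distance = 0
    for _, block in order:
        distance += abs(block - head)
        head = block
    return distance

# Two apps, each locally sequential, interleaved in issue order.
issued = [("a", 900), ("b", 10), ("a", 890), ("b", 20)]
print("issue order:", total_seek(issued))
print("seek order: ", total_seek(seek_order(issued)))
```

Each app's own requests are already close together, yet serving them in issue order drags the head across the disk on every request. Only something that sees all requests can fix that.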

Now if all the library does is what a monolith driver would do, and the apps are trusted not to tinker with it, then the library is a service in the Mill sense, and on the Mill the app no longer must be trusted.

So again, I'm not really seeing much difference between the micro- and exo-kernel approaches, at least on a Mill. In the micro, a driver process becomes a fast, cheap, secure service on the Mill; in the exo a trusted library becomes a fast, cheap, secure service. Calling the service a micro or an exo seems mostly a matter of legacy terminology and marketing.

BTW, nothing on the MIT exo site is less than a decade old, so I guess they abandoned the idea. I'd really like to see a paper that tells why.

rpjohnst

I should've been more clear- exokernels expose the driver directly (or even a very thin layer on top), not the device hardware itself. That way they can control access more granularly (e.g. disk blocks rather than whole disks). MIT's exokernels had all their drivers in a monolithic kernel; they removed things like file systems and the network stack into libraries, with the kernel interface designed to allow securely cooperating implementations.

The premise is to allow different applications to all run at once even when doing crazy lower-level optimizations. One of their big motivating examples is a web server (closer to a proxy cache today- it just serves static files) that allocates particular disk blocks to put files from individual pages closer together, TCP packet merging because of their knowledge of how the response is laid out, etc. It ended up being 8x faster than Harvest (which became squid) and 4x faster than their port of Harvest to the exokernel.

Another example I liked because it was particularly simple and clean was a 'cp' implementation. It finds all the disk blocks in all the files it's copying and issues big, sorted, asynchronous reads (the disk driver merges this with all other schedules, including other cp's, etc). Then, while that's going on, it creates all the new files and allocates blocks for them. Finally, it issues big asynchronous writes straight out of the disk block cache. This ended up being 3x faster than their cp using a unix libOS.

Neither of these (or any of their other examples) require any extra permissions- the libraries don't even have to be in services because if they break it just takes down the application using it, instead of corrupting the disk or something. The exokernel already took care of only granting permissions to disk blocks that the application would access anyway.

There are still a few more recent exokernel projects floating around, but MIT has moved on. It probably didn't catch on because some of its systems are either too complex to be worth the effort (file systems required downloading small bits of code into the kernel to verify things about metadata generically) or didn't solve all the problems they had (packet filters still have some ambiguities that aren't easy to resolve well for everyone).

However, many of their ideas have made their way into existing systems- Linux DRI is a very exokernel-like interface to the video card. Many ideas could still work out if you were willing to break compatibility as well- for example, a new exokernel could pragmatically decide to understand file system structures, while still letting the applications do most of the legwork themselves (and thus in the way best suited to their domain).

ivan

The following was from James Babcock and sent to me directly; repeated here for wider comment:

In the security talk, you said that the Mill will not generally have
high-granularity entries in its PLB, for performance reasons, but I
don't think you said anything either way about the TLB. Will the Mill
support fine-slicing of address spaces in the TLB? If so, how much do
slices cost, and if not would it be feasible to add? I ask mainly
because a finely-sliced address space in the TLB, combined with some
memory-allocator tricks, could solve the use-after-free security
problem, which Mill has not yet proposed a solution for.

The essence of the fix is separating reuse-of-address-space from
reuse-of-memory, and not reusing the address space of freed objects
for as long as possible. If it were cheap to reclaim memory without
having to reuse associated address space, for objects sized and
aligned to >=32 bytes but smaller than a traditional 4kb page, then
the use-after-free problem would be pretty much solved.

ivan

The TLB supports scaled sizes, as many modern TLBs do, but is unusual in that the smallest size is one line rather than 4KB. However, one-line pages exist to permit on-the-fly allocation of backing DRAM when a dirty line is evicted from the last-level cache. The one-line pages are scavenged by a system process and consolidated into more conventionally-sized pages. If scavenging were not done then the number of one-line entries would grow to the point that the underlying page tables would occupy most of physical memory, which is rather pointless.

There's another difference between the PLB and the TLB structure: while the granularity of pages in the TLB varies, it is always a multiple of lines, whereas the PLB has byte granularity. Consequently you can protect a single byte (which is actually done, for example to protect an MMIO register for a device), but you can't page one to disk.

Moving on to your suggestion to use the TLB for checking for use-after-free. A crawling allocation only works if it has a crawling free, or you wind up with internal fragmentation and, again, a lot of tracking entries and poor performance. If you do have a crawling free, for example if you are using a copying garbage collector, then it becomes possible to protect the freed region. However, the TLB is probably not the right place to do it. That's because on the Mill there is not necessarily any memory backing the address space - at all. Backless memory lets data have an existence only in cache, and is a significant performance win for transient data. The problem is that the hardware does not know that the software considered the data to be free, so if it is dirty (which it necessarily will be), it will eventually age out of cache, be evicted, and have physical DRAM allocated for it, all pointlessly.

The hardware handles the case of transient data in the stack, because allocated frames are automatically freed on return, and the whole call/return protocol is built into the hardware so the hardware knows to discard the lines covering the exited frame. For random memory it doesn't know about free unless the software tells it.

The good news is that the software is able to tell it. There are several cache management operations in the Mill that let software convey useful information to the cache. One of those operations tells it that an address range is dead, and lines covering that range can be discarded, nominally dirty or not. However, these ops do not recover allocated backing DRAM if it exists, because tracking DRAM is too complex for hardware and the OS must get involved.

And here's where the TLB gets back in the act. Once the crawling free has freed a whole page, and discarded the lines in cache, it is possible to remap the page in the TLB from physical memory to an error trap. There's your use-after-free case. However, I don't think that it is quite what you were hoping for, because it doesn't protect free spaces in fragmented memory, and it only protects a page granularity. Hence there's a window of vulnerability while waiting for a partially free page to become wholly free.
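
The window of vulnerability can be made concrete with a toy model. This sketch assumes a 4KB page, a bump ("crawling") allocator that never reuses addresses, and a set standing in for "remapped to an error trap"; all names are hypothetical and no cache or backing DRAM is modeled.

```python
PAGE = 4096  # assumed page size for the sketch

class CrawlingHeap:
    """Toy bump allocator: a page traps only once it is passed and wholly free."""

    def __init__(self):
        self.next_addr = 0
        self.live = {}          # page number -> live bytes
        self.trapped = set()    # pages remapped to the error trap

    @staticmethod
    def _span(addr, size):
        """Yield (page, byte count) for each page the range touches."""
        end = addr + size
        while addr < end:
            page = addr // PAGE
            chunk = min(end, (page + 1) * PAGE) - addr
            yield page, chunk
            addr += chunk

    def alloc(self, size):
        addr = self.next_addr
        self.next_addr += size              # address space is never reused
        for page, n in self._span(addr, size):
            self.live[page] = self.live.get(page, 0) + n
        return addr

    def free(self, addr, size):
        for page, n in self._span(addr, size):
            self.live[page] -= n
            passed = (page + 1) * PAGE <= self.next_addr
            if self.live[page] == 0 and passed:
                self.trapped.add(page)      # whole page free: remap to trap

    def faults(self, addr):
        return addr // PAGE in self.trapped

heap = CrawlingHeap()
a = heap.alloc(3000)
b = heap.alloc(1096)    # a and b together fill page 0
c = heap.alloc(100)
heap.free(a, 3000)
print("stale access to a faults:", heap.faults(a))   # b still live on the page
heap.free(b, 1096)
print("stale access to a faults:", heap.faults(a))   # page now wholly free
```

Between the two frees a stray pointer into `a` goes undetected, which is the partially-free-page window described above.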

I'm not sure I have answered your question in this; feel free to expand or refine it here.

jimrandomh

That does indeed answer the question, although I feel like the problem is unsolved and there's room for innovation here.

It's not entirely critical that reading or writing in a freed block actually fault, as long as it's not receiving data from or overwriting pieces of some other object that was allocated into reused address space. If stray pointers lead into phantom backless memory instead of into an immediate error trap, that's not as good, but it's still a massive improvement.

With the Mill as it currently is, if I wanted to write a paranoid version of malloc and free, then free would first instruct the cache to discard or zero the object's lines, then check whether it was freeing the last entry in its page; if not, it'd either shrink or destroy that page table entry and replace it with two new ones on either side, causing a proliferation of page-table entries. This would be cheaper than every object having its own page table entry, but not by very much.
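
The proliferation is easy to see in a sketch that treats page-table entries as a sorted list of mapped ranges. The function name, the half-open range representation, and the 64-byte line size are assumptions for illustration only.

```python
LINE = 64  # assumed cache-line size in bytes

def unmap_range(entries, start, end):
    """Remove [start, end) from a sorted list of non-overlapping (lo, hi) ranges.

    An entry overlapping the hole is shrunk, destroyed, or split in two.
    """
    result = []
    for lo, hi in entries:
        if hi <= start or lo >= end:
            result.append((lo, hi))         # untouched entry
            continue
        if lo < start:
            result.append((lo, start))      # piece left of the hole
        if end < hi:
            result.append((end, hi))        # piece right of the hole
    return result

entries = [(0, 4096)]                       # one entry covering a 4KB page
for i in range(1, 64, 2):                   # free every other line-sized object
    entries = unmap_range(entries, i * LINE, (i + 1) * LINE)
print("entries after freeing alternate lines:", len(entries))
```

In this worst case a single 4KB entry turns into one entry per surviving line, which is the cost being weighed against giving every object its own entry.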

Cache-line granularity is fine, unless the cache lines are wider than I think.

I have two more ideas for dealing with this, which I'm going to continue in email (and maybe post here later) since it may bear on things that are NYF.

ivan

All of the finer-granularity solutions that we are aware of, where the granularity is fine enough to be used for red zones and to cover padding bytes in structures, have implementation and use costs that would restrict them to high-end machines that use (and pay for) non-standard DRAM configurations such as those needed for ECC. It could be considered if we enter the main-frame business.

Line granularity poisoning (as opposed to silent zeroing) is possible at an area, bandwidth and performance cost of a few percent. Line granularity is sufficient to detect use-after-free, but not for use as red zones.

All such schemes have a UI aspect too: the Mill is intended to run programs straight off the net without rewrite. When there is hardware that detects use-after-free for example, we have to be wary of the reputational damage that may happen when the faulted user howls "But it works on an x86!". We could easily be blamed for program bugs that other machines ignore. Sad, but human nature :-)

akohdr

Can I second the request for RNG opcode(s)? It occurs to me that having RNG in the belt loop could provide orders of magnitude improvements in Monte Carlo based simulation. Having an opcode would save the I/O required feeding preconditioned random values quite literally removing noise from the data bus.

I know you guys aren't going after HPC but having this facility would allow you to go after a number of other verticals that use the Monte Carlo approach.
