Security

ivan

Thread and turf ids last as long as the identified entity and are not swapped out and reused in the anticipated usage. A (typical) two million or so active ids should be sufficient for what a single chip can do. If you need more, then you'll have to implement pseudo-ids in software and write some kind of mapping that overlays them on the available hardware id-set and specRegs. Not trivial, but could be done based on the Mill primitive model. However, note that direct access to those specRegs is very much a kernel function, so the mapper would be part of the OS whatever it is.
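The software pseudo-id overlay mentioned above could look roughly like the sketch below. It is only an illustration: the `PseudoIdMapper` class, its method names, and the least-recently-used eviction policy are assumptions, not anything specified for the Mill, and a real mapper would live in the kernel and would have to save and restore the associated specRegs state when it rebinds a hardware id.

```python
from collections import OrderedDict


class PseudoIdMapper:
    """Hypothetical overlay of many software pseudo-ids on a fixed pool of hardware ids."""

    def __init__(self, hardware_ids):
        self.free = list(hardware_ids)   # hardware ids not currently bound
        self.bound = OrderedDict()       # pseudo-id -> hardware id, least recently used first

    def acquire(self, pseudo_id):
        """Return the hardware id backing pseudo_id, rebinding one if the pool is full."""
        if pseudo_id in self.bound:
            self.bound.move_to_end(pseudo_id)   # mark as most recently used
            return self.bound[pseudo_id]
        if self.free:
            hw = self.free.pop()
        else:
            # Pool exhausted: take over the least recently used binding.
            # A real mapper would first save that entity's hardware state.
            _, hw = self.bound.popitem(last=False)
        self.bound[pseudo_id] = hw
        return hw


mapper = PseudoIdMapper(hardware_ids=[0, 1])
a = mapper.acquire("thread-A")
b = mapper.acquire("thread-B")
c = mapper.acquire("thread-C")   # pool is full, so thread-A's hardware id is reused
```

With only two hardware ids in the pool, the third pseudo-id takes over the id that was bound to the first one, which is exactly the swapping that the anticipated usage avoids.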

BobC

The portal/service concept, along with much of the rest of the Mill security architecture, would appear to be useful as an enhancement/modification to other CPU architectures. It appears to be lighter weight and more powerful than a traditional MMU (well, Memory Protection Unit, since there's no hardware virtual memory on the Mill).

The Mill is a 64-bit machine with a massive 60-bit physical address space ("an exabyte should be enough for anybody"), so an MPU is more appropriate. But there are lots of other processor uses that would benefit from a flexible MPU managing non-virtual memory: Just about every embedded system (my specialty).

Do you think all or part of the Mill security model could be usefully ported to 32-bit embedded processors? If so, is MillComputing considering any of the following?

1. Port Mill innovations to existing architectures (with available IP: ARM, Nios, Mips, etc). (This could allow MillComputing to get fab experience and hardware revenue before the Mill itself is ready for roll-out.)

2. License Mill innovations to those wanting to incorporate them into existing architectures. (License revenue, like ARM.)

3. Some combination, where an experienced fabless partner (say, Qualcomm) would get limited Mill IP (for Snapdragon) in exchange for helping push the Mill toward silicon.

Yes, ARM and Nios already have MPUs available, but they seem primitive compared to the Mill security architecture.

JonathanThompson

The ramifications of the talk have sunk in, and they're funny in a brilliant way: whereas x86 has rings 0-3 (usually only using 2 of those unless virtualization is used in some form) for levels of memory protection and supervisor/user privileges, the Mill architecture has, by virtue of removing the concept of supervisor/user mode, created a fractal tree of up to 2^22 protection levels that are hardware-accelerated and stupidly easy and cheap to manage. All that, and the virtualization facilities haven't been revealed as of yet! Sure, in theory, you could lock out access in x86 or comparable architectures to not have any given task have access anywhere else, but it would have massive overhead both in software and hardware to do so.

As mentioned by another poster regarding embedded software, these ramifications are rather interesting: I've not seen any kind of mention in my knowledge/understanding of machine architectures where protection levels are so fine and easy to work with. I am curious about details of MMU functionality for each of the regions, if it has present/not present bits, to make it comparable in that aspect of things: I suspect it does. In a finite physical memory system, where code is, I'd expect it'd need to use jump tables or all relative code so it could be swapped out, due to physical addresses being the same as virtual addresses. For data, it means that either data needs to be in separate physical regions for all allocated data, or there needs to be a method provided for fixing up pointers for when regions are swapped in and out.

But one of the funniest and best ramifications of the region/turf setup is the ability to perfectly isolate all data accesses and code accesses so precisely that it'd make tracking down stray pointers in the most complex of code bases a dream: since you could make each and every subroutine a service that has explicitly isolated memory accesses both for code and data, no buggy code, even in the "kernel" (Mill greatly confuses what that means in practice as one of the ramifications!), can stomp on anything but its own dynamic WKR, thus making it easy to isolate such faults to either a very small part of the code base, or... hardware defects (pretending that won't happen is insane, as all know). Thus, if a service is known-good code, and something messes up, it's inherently traceable to a greater degree than probably any previously existing architecture that it was, indeed, a hardware error, even without ECC or equivalent, because if the only code that can access a small subset of RAM is known-good, then it can be demonstrated that the hardware did something wonky (perhaps DMA/bus mastering, or just plain failure).

This would make the Mill architecture an absolutely stunning processor for proving (as much as software can be proven) software code correct, especially kernels and their drivers for any operating systems, and then recompiling it for other architectures, if you felt a strange need to work with legacy hardware ;)

And that's the rub: it (the Mill architecture) needs to be adopted over other things for the long-term success it needs, but there's a huge amount of inertia in regards to not only rewriting code (it's not all portable, and often makes many assumptions about system/CPU architecture that may not be true on Mill) but also the chipsets. I would be so very unhappy if the Mill architecture is stopped not by something clearly superior for architecture, but merely because it didn't have a large enough quantum leap to supplant the existing base of higher-end processors along with chipsets. There are too many cases where the "good enough" is the enemy of the much better system, because the "much better system" had to overcome a rather sizable inertia to change by users, commercial and private.

Past attempts at emulating previous instruction sets (Crusoe with their recompiling on the fly, or pure emulation) have been less than ideal: the most practical thing is that code needs to be completely rebuilt for a native instruction set, and while that can be and has been done, that's a Super-man's leap of effort for many to accomplish. Recompiling portable source code is so much easier in many respects to get done right.

Perhaps the security aspects of the Mill may be, in combination with so many of the other things, that straw that healed the camel's back and brings it into widespread adoption in non-tiny spaces: that, and the fact that x86/ARM architecture with registers and complex instruction decoding seems to be hitting a wall for speed/power, regardless of how many gates you throw at it. At least, that's what I'm hoping for: so many code exploits are such a problem for people that costs everyone money and insecurity in regards to if your system and data is secure, and software is getting too complex/fast-developed to catch it all that the machine needs to be pro-active in architecture to make it impossible for it to be code-related, even with sub-par code.

ivan

Actually there is hardware virtual memory on the Mill, and paging and all the rest, and the virtual-to-physical translation, page traps and so on take place in the TLB in a quite ordinary way. It's just that the TLB is in an unusual place in the architecture, after rather than before the caches as seen from the core, and protection is divorced from paging and moved to a different unit.

Well, that's not quite true: there are some extras in the Mill MMU: virtual zero (described in the talk), and some NYF dealing with Unix fork(). But it's mostly true.

As for embedded work, the problem with a smaller address space footprint is aliasing. If an embedded application had sufficiently small address space needs that everything would fit in 32 bits with no memory mapping/aliasing then the Mill single-address-space model would work; it would be a full normal Mill with an implied (and not represented) zero in the high-order word of pointers. Note that you would still want a PLB. Whether the market for a 32-bit-no-MMU Mill is big enough to justify the work is unknown.

Which brings me to your market questions. The Mill innovations tend to interlock rather tightly, so it is difficult to pull just one out and apply it to some other architecture. For example, you could pull Mill split-stream encoding out and apply it to a VLIW to be able to decode 30 ops per cycle like the Mill does. But without things like the Belt in the execution engine you wouldn't be able to execute all the ops you decoded. And so on. We're not opposed to licensing, and would take NRE contracts to port some of the ideas to other machines, but we see the opportunities as being rather limited. We feel that more likely we will sell hard macros of full Mills into SOCs.

In contrast, we are actively interested in partners for the Mill itself. We know that large buyers will demand second-source availability, which means a license/partner. In addition there are specialized markets - rad-hardened, for example - where the existing vendors have expertise we will never have and a license seems the way to go. It's the usual business story though - nobody wants to be the first to stick their neck out about a new architecture, but as soon as one bites everybody will be at our door.

To which we will say: we are not an ARM with a license-based business model, so it's going to be first-come-first-served.

ivan

Please distinguish alias mapping from paging. 32-bit systems don't have enough address-space bits, so they have to reuse addresses; this is mapping. In addition, physical pages may be present or not. For efficiency, conventional systems combine the mapping task and the paging task in one MMU/TLB machine.

With a 60-bit space the Mill has no need for mapping; there's all the virtual space that everybody wants there for the taking. However, virtual space may still exceed physical DRAM, so there remains a need for paging, and the Mill does paging.

The paging is done in the TLB and its associated OS tables. Those are only looked at when we miss in cache and have to go to memory, and they tell where in physical memory a given virtual page is located, just as on a conventional machine. Or not located, if the page is swapped out to disk. So a Mill can still get page traps.

Mill code is position-independent (PIC), but jump tables and the like have full-width pointers with virtual addresses where the code has been mapped. The underlying DRAM may be swapped out and paged back in at a different physical address, but the virtual address remains the same and the pointers still work.
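The point about pointers surviving a swap can be illustrated with a toy page table. This is a minimal sketch under stated assumptions: the `Pager` class, the 4096-byte page size, and the use of `LookupError` to stand in for a page trap are all invented for the example and say nothing about the actual Mill TLB or OS tables.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class Pager:
    """Toy page table: virtual page number -> physical frame number, or None if swapped out."""

    def __init__(self):
        self.table = {}

    def map_page(self, vpn, frame):
        self.table[vpn] = frame

    def swap_out(self, vpn):
        self.table[vpn] = None

    def translate(self, vaddr):
        """Return the physical address for vaddr, or raise to model a page trap."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        frame = self.table.get(vpn)
        if frame is None:
            raise LookupError("page trap: virtual page %d is not resident" % vpn)
        return frame * PAGE_SIZE + offset


pager = Pager()
pager.map_page(5, 10)
ptr = 5 * PAGE_SIZE + 24          # a full-width virtual pointer, e.g. a jump table entry

first = pager.translate(ptr)      # resident in frame 10

pager.swap_out(5)
try:
    pager.translate(ptr)
    trapped = False
except LookupError:
    trapped = True                # the OS would now page the data back in

pager.map_page(5, 3)              # paged back in at a different physical frame
second = pager.translate(ptr)     # same pointer value, new physical location
```

The value of `ptr` never changes; only the table entry behind it does, so nothing holding the pointer needs to be fixed up.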

David

You said that the memory security model is intended to be very coarse grained. Many x86 garbage collected systems use page-sized protections in the MMU in order to inject read/write barriers based on page type, and to manage dirty flags in old generation memory pages. These security mappings can be modified on every trap, or at least on every GC cycle. Is this sort of thinking compatible with Mill memory security regions?

Systems like the JVM use memory reads into areas made unreadable as a safe-pointing device. To my understanding, the x86's speculative processing guarantees the trap is raised before any side effects from further instructions are committed. In the more logically asynchronous memory model of the Mill, does this guarantee still hold?

Not really security related: When JITting, do you need to generate member-specific code or can you write to the family-wide portable binary spec and use the loader to optimize and inject it into your memory space?

ivan

Re MMU use by garbage collectors:

One certainly could use Mill protection for this purpose, but there's a better way.

Mill protection has byte granularity, so the GC would need only one region descriptor for the whole of any one kind of space. In a generational GC for example, you might use one descriptor per generation (typical GCs use three generations). This would be an easy port, just replacing the small amount of code that manages the page tables with similar code that manages the region descriptors.
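As a rough picture of one byte-granular descriptor per generation, consider the sketch below. The `Region` type, the `allowed` check, the addresses, and the rule that a single descriptor must cover the whole access are assumptions made for the example; the real descriptor format and lookup rules are not given in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical region descriptor with byte-granular bounds."""
    lower: int          # first byte covered (inclusive)
    upper: int          # one past the last byte covered (exclusive)
    rights: frozenset   # subset of {"r", "w"}


def allowed(regions, addr, size, right):
    """True if a single region grants `right` over every byte of [addr, addr + size)."""
    return any(
        r.lower <= addr and addr + size <= r.upper and right in r.rights
        for r in regions
    )


# One descriptor per generation; note the upper bound is not page aligned.
regions = [
    Region(0x1000, 0x1800, frozenset("rw")),   # nursery: mutator may read and write
    Region(0x1800, 0x5003, frozenset("r")),    # old generation: writes must fault to the GC
]

nursery_write = allowed(regions, 0x1000, 8, "w")
old_write = allowed(regions, 0x1800, 8, "w")       # would fault, acting as a write barrier
old_read_tail = allowed(regions, 0x5000, 3, "r")   # last three bytes of the old generation
old_read_past = allowed(regions, 0x5000, 4, "r")   # one byte past the end
```

Changing a generation's protection during a GC cycle then means replacing one descriptor rather than walking a range of page table entries.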

However, there's a better way, one that uses the GC support "event bits" in the pointer format. With these, the GC can work at the granularity of single objects rather than pages or regions, and would be expected to have sharply reduced overheads. Porting a GC to use these would probably be a bit more work, because the model changes and that requires a bit of thought. The actual code changes should be near trivial though, mostly involving taking stuff out.

ivan

Re JITs and member-specific code:

JITs will generate member-independent code and call an API to get that code specialized for the actual target.

davidm

How in the heck were you able to keep these advances to yourselves for so long? I'm reminded of the experience of reading a particularly clever proof--that "oh of course that's how you do that!" feeling. So many "obvious" (after the fact) enhancements. Keeping all this quiet for so long, well that's willpower.

I have a question about threading. In the security talk you mention that threads are "...the same familiar thread you've sworn at so many times in your programming career." I thought I'd read on comp.arch that the Mill won't support preemptive multitasking, even though it would support having multiple threads each running on a separate core. Did I get that wrong or can you have multiple preemptively switched (i.e. not "green") threads each getting a time slice on the same core?

ivan

Pre-emptive multitasking is fully supported. It's a policy decision above the Mill pay grade whether all threads are first class or some are part of thread groups that are scheduled as a group in the usual way. Task switch on a Mill is very cheap because all the save/restore is done for you by the spiller, but there is still the trip through the ready queue and the priority scheduling etc. The point to portals and stacklets is to have the protection benefits of separate processes without the IPC costs when the IPC is in fact being done co-routine style and so can be replaced by calls, as is the great majority of cases.

davidm

Thanks for the answer. Based on the Security talk that's what I expected.

I can imagine all kinds of novel uses for the "protection environment switch via portal" (i.e. essentially a double function-call sans prediction). "Green" thread implementations will rock on the Mill, as will simple system calls. Simply elegant.

[deleted]

The Novel (= "dirty") bit means you have a write-back protection cache, nifty!

I presume clean (non-Novel) revoked PLB entries are marked invalid but kept in the PLB until there is need to access the memory protection structure, at which point one OS trap can do a batch of work.

There was a question asked at the end of the talk (video@1:22:10) that Ivan apparently didn't understand and thus gave a non-answer to.

A pass operation grants permission "to whoever I'm next calling" (video@1:01:29). However, if an interrupt (involuntary function call) occurs between the pass and the call, does the pass apply to it? If not, how does the Mill keep the grant pending across that function call, and all the nested (voluntary) function calls, to have it take effect on return? Even if it creates a PLB entry immediately (with turf=* thread=myself), there are still the mechanics of cleaning it up again.

Is this more spiller magic?

Another question is whether a region can have both execute and portal permission. Presumably the All has both so both can be granted, but how does the Mill handle a call to such a region?

Also, if portal calls almost always require special permission-passing code, why not make it a distinct opcode? As you say, effectively all applications will jump through a library wrapper anyway.

One thing that people might have missed in the Q&A is that a callback must be a portal, but it can either be globally visible or may be granted/passed for the duration of one function call so that unexpected callbacks are impossible.

(This case of callbacks might be the answer to the earlier question. A callback usually refers only to the original caller's buffers, so does not need to grant additional permissions, so perhaps C compatibility wins here.)

ivan

Very good! Yes, you can see the Novel bit as implementing a writeback cache, and deferred table update (described as "lazy" in the talk) works as you suggest.

As for the missed question (with my hearing I miss quite a few), the "next call" is for explicit calls, not for interrupts, traps, or faults. The pending grant(s) are state. We once had a way to push them in the PLB and fix them later, but there were issues if they got evicted before the call, so it's just spiller-food now.

A region descriptor cannot have both execute and portal permission, but you could create two overlapping descriptors. Which you got would be search happenstance. If you wound up looking at a portal block as code then you would not transit and would be due for an invalidInstruction fault Real Soon Now. If you wound up looking at code as a portal, and by accident you happened to pass the security check by satisfying the ids that the bits in the id fields implied, then you would transit to the turf implied by the bits in the turf field, and then try to jump to the address implied by the bits in the target field. That address would have to have execute permission, and be in fact the address of an EBB entry (or you are up for invalidInstruction again) and probably must be the address of the entry of a function with no arguments or you are up for invalidOperand because the belt contents wouldn't match what the code expects.

So, if the OS portal-bless service screws up and does overlap two descriptors, and the bitsies are just exactly right, then you can call a function in a random service. That's why portal-bless is in the kernel.

As for distinguishing portal from non-portal calls, the basic reason is uniformity. We wanted a single pointer representation, one you could pass on to code that did not know whether it's a portal or not. Consider a numeric integration package, which takes a data vector and a pointer to the function to integrate. The integrator should work the same whether the function pointer is to an application function, or a portal pointer to something in a math service.
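The integrator example can be sketched in a few lines. Everything here is illustrative: `PortalStub` is a hypothetical stand-in that only counts calls where a real portal would switch turf, and the midpoint rule is just a convenient integration method, not anything taken from the talk.

```python
import math


def integrate(f, a, b, steps=1000):
    """Midpoint-rule integration of f over [a, b]; f can be any callable."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def local_square(x):
    """An ordinary application function."""
    return x * x


class PortalStub:
    """Hypothetical stand-in for a portal pointer into a math service."""

    def __init__(self, service_fn):
        self._service_fn = service_fn
        self.crossings = 0

    def __call__(self, x):
        self.crossings += 1          # a real portal call would change turf here
        return self._service_fn(x)


portal_sin = PortalStub(math.sin)

area_local = integrate(local_square, 0.0, 3.0)     # close to 9
area_portal = integrate(portal_sin, 0.0, math.pi)  # close to 2
```

`integrate` has no branch for the portal case; it makes the same call either way, which is the uniformity that a separate portal-call opcode would give up.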

joseph.h.garvin

I really like the Mill's approach to stack safety and that in particular it prevents Return Oriented Programming.

Has making random number generation support built in been considered? Besides the stack, the other major bane of embedded security is random number generation. Bad random numbers weaken crypto, and there have been a ton of vulnerabilities relating to routers picking their random seed at boot up time before sufficient entropy has been collected by the OS, leading to predictable random numbers and allowing entry to attackers. Intel has RDRAND but that doesn't help most of the embedded world. Obviously there's no reason in principle why a hardware entropy source couldn't be integrated into the Mill, but it would be great for it to be part of the minimum configuration (Tin) so that it's hard to screw up and helps cement a reputation for the Mill as being more secure than other CPUs. This could also be a great opportunity to take advantage of multiple outputs on the belt and/or the Mill vector types; lots of simulations will use random matrices/vectors. If you wanted to be really fancy you could support different distributions (e.g. Zipfian vs. Gaussian).

PeterH

A JIT producing generic code and calling a service to specialize strikes me as a policy decision, not exactly forced by the architecture. But I can see doing it that way as strongly advisable in the general case, especially if the code is expected to run on other than a very narrow selection of Mills.

PeterH

First class hardware random number generators aren't difficult. The old Atari systems c. 1980 had them. But I don't see them as a core feature of CPU hardware. An opcode in the generic code representation wouldn't hurt, with an option for the generator as a specialty register.

PeterH

Can the handler called through a portal easily identify who called it? Suppose a service is being called by many threads in different turfs, such as a service reading files? You don't want just any thread to access just any of the managed resources, and the caller can't be trusted to identify itself by simple parameters passed.

joseph.h.garvin

PeterH, I am not super familiar with the Atari hardware so I may be looking at the wrong document, but what I found here suggests the Atari used an LFSR, which as I understand it is not 'first class' in the context of cryptography, where it's important that a determined adversary doing statistical analysis not be able to figure out the stream. You need a real entropy source for the seed and a cryptographically secure RNG algorithm.
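To make the LFSR concern concrete, here is a small sketch of why a bare LFSR is predictable. It uses a generic 16-bit Fibonacci LFSR with a commonly cited tap set (16, 14, 13, 11) and an arbitrary seed; it is not a model of the Atari circuit, whose details are not given here.

```python
def lfsr_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11); return (output_bit, new_state)."""
    out = state & 1
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return out, (state >> 1) | (feedback << 15)


def bits(state, n):
    """Return the next n output bits and the state reached after producing them."""
    result = []
    for _ in range(n):
        bit, state = lfsr_step(state)
        result.append(bit)
    return result, state


secret_seed = 0xACE1                       # arbitrary nonzero seed, unknown to the attacker
stream, _ = bits(secret_seed, 48)          # what the generator emits

# The attacker sees only the first 16 output bits. Because each step shifts the
# register right by one, output bit i is simply bit i of the starting state.
observed = stream[:16]
recovered = sum(bit << i for i, bit in enumerate(observed))

# From the recovered state the attacker can replay the generator and predict the rest.
predicted, _ = bits(recovered, 48)
```

Sixteen observed bits are enough to recover the whole state, after which every later bit is known in advance. That is why a real entropy source and a cryptographically secure generator are needed for seeds and keys.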

joseph.h.garvin

If the Mill doesn't have some other solution I think you could roll your own protocol using regions to allow the service to be sure who is calling the portal. Have the portal put a random value only readable by the desired calling thread into memory. When the calling thread wants to make the portal call, it passes the number provided by the service that only it knows, and the service then verifies that the caller's number is equal to its, and then sets its number to a new random value in anticipation of the next call. Critically when the check fails the service needs to sleep for a period or fault or cause the caller to fault somehow, and pick a new random value for the next challenge, otherwise the caller can brute force retry. Random numbers need to be used rather than just an incrementing counter, because otherwise a malicious thread can guess the current value from information like how long the system has been running or how many times the service has likely been invoked.

Edit: this scheme assumes the calling thread is not 'in cahoots' with a malicious thread and thus won't deliberately share the random value with it. But since the Mill protection model is that threads with a given permission can always grant other threads a subset of their permissions, I think this is OK.
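A minimal sketch of this challenge scheme, assuming Python's `secrets` module as the random source: the `FileService` class and its method names are hypothetical, `read_token` stands in for memory that only the intended calling thread can read, and raising `PermissionError` stands in for faulting the caller.

```python
import secrets


class FileService:
    """Hypothetical service that authenticates its caller with a one-time random token."""

    def __init__(self):
        self._token = secrets.token_bytes(8)

    def read_token(self):
        """Models a region readable only by the intended calling thread."""
        return self._token

    def call(self, presented, request):
        ok = secrets.compare_digest(presented, self._token)
        # Pick a fresh token on every call, pass or fail, so that neither a
        # replayed value nor a brute-force retry gets a second try at the same one.
        self._token = secrets.token_bytes(8)
        if not ok:
            raise PermissionError("caller failed the challenge")
        return "handled " + request


service = FileService()
token = service.read_token()
reply = service.call(token, "open log")

# Replaying the old token fails because the service has already moved on.
try:
    service.call(token, "open log")
    replay_rejected = False
except PermissionError:
    replay_rejected = True
```

A legitimate caller re-reads the token before each call. The sketch leaves out the sleep-on-failure step from the description above, which would slow down guessing further.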

David

Given that service calls are synchronous, I would presume that the current thread identifiers still reflect the caller. These should be read-only by both the caller and service, and shouldn't be spoofable by user-level malicious code. From there you should be able to get to the OS-specific or internal security descriptors.
