  • staff
    Keymaster
    Post count: 49

    Talk given by Ivan Godard – 2014-03-21 at Google.


    Slides: PowerPoint (.pptx)

    Security and reliability on the Mill CPU:

    Naughty, naughty; bad program, mustn’t do that!

    Software bugs have always been a problem, but in recent years bugs have become an even more serious concern as they are exploited to breach system security for privacy violation, theft, and even terrorism or acts of war.

    The Mill CPU architecture addresses software robustness in three basic ways: it makes impossible many errors and exploits; it detects and reports many errors and exploits that cannot be prevented; and it survives and recovers from many detected errors and exploits. None of these ways involve loss of performance.

    The talk describes some of the Mill CPU features that defend against well-known error and exploit patterns. Examples include:

    • a call stack structure that cannot be overwritten to redirect execution on return
    • an instruction format that makes “return-oriented programming” exploits very difficult
    • an inter-process protection mechanism that lets applications, server code, and operating systems follow “least privilege” principles

    These features will be discussed in the context of the overall Mill CPU security model, which defends not only against known errors and exploits, but also against unanticipated future failures.

  • squizzle
    Participant
    Post count: 5

    Another fascinating talk. I’ve got a few questions from it (you might notice a theme 🙂)

    Could a thread manipulate its own spReg to allow it to make calls without overflowing the stack, BUT overflowing the state in the spiller?

    Could a thread do a grant in a loop, generating regions until something breaks? It sounds like the OS won’t get a chance to stop it until PLB eviction occurs. Also, if it’s on a family member which does eviction in hardware, that removes the OS’s ability to regulate a rogue process.

    It sounds like if a thread fills the region table with regions overlapping one address, the augmented interval tree will handle it happily (ignoring the previous question). However, what if the PLB is (mostly) full of regions that all overlap an address, which a thread then accesses? Would it need to hit an excessive number of regions in the cache to resolve the final permission set?

    On the topic of overlapping regions, I assume it’s an OR of the permissions, not an AND?

    Does the iPLB ignore read permissions? (ie, can code be execute, non-read)

    Keep up the good work

    • Ivan Godard
      Keymaster
      Post count: 689

      Could a thread manipulate its own spReg to allow it to make calls without overflowing the stack, BUT overflowing the state in the spiller?

      Not unless it had rights to the MMIO register that shadows spReg, and if it has rights like that then it’s part of the OS and had better know what it is doing. spReg (and nearly all the specRegs) are not user-assignable except via MMIO. Of course, they change all the time as a side effect of executed operations like call, but not directly.

      Could a thread do a grant in a loop, generating regions until something breaks? It sounds like the OS won’t get a chance to stop it until PLB eviction occurs. Also, if it’s on a family member which does eviction in hardware, that removes the OS’s ability to regulate a rogue process.

      The grant quantum, while set by policy, is managed by hardware and will throw a fault if exceeded. We didn’t want to track it at evict time in case we have members that do evicts (with PLB entry creation) in hardware.

      It sounds like if a thread fills the region table with regions overlapping one address, the augmented interval tree will handle it happily (ignoring the previous question). However, what if the PLB is (mostly) full of regions that all overlap an address, which a thread then accesses? Would it need to hit an excessive number of regions in the cache to resolve the final permission set?

      The region table is searched until the first entry that would give permission is found. Entries are not concatenated during the search; permission for the whole access must be provided by a single entry. That keeps the search cost manageable; if you really need to have an access that is part permitted in different entries then you must use a service that will push a new grant covering the union of your existing entries.
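
      For concreteness, here is a toy model of that single-entry rule; the field names are invented and nothing here reflects the real hardware tables, but it shows why partial coverage by several entries never adds up:

      ```c
      /* Toy model of the single-entry search rule (invented names). */
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      typedef struct {
          uint64_t base, limit;   /* entry covers virtual addresses [base, limit) */
          unsigned rights;        /* permission bits for this entry */
      } Region;

      bool permitted(const Region *tab, size_t n,
                     uint64_t addr, uint64_t len, unsigned need)
      {
          for (size_t i = 0; i < n; i++) {
              /* the whole access must fit inside this one entry; partial
                 coverage by several entries is never combined */
              if (addr >= tab[i].base && addr + len <= tab[i].limit &&
                  (tab[i].rights & need) == need)
                  return true;    /* first sufficient entry ends the search */
          }
          return false;
      }
      ```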

      On the topic of overlapping regions, I assume it’s an OR of the permissions, not an AND?

      Any single entry is satisfied if all of the region, thread, turf and rights match.

      Does the iPLB ignore read permissions? (ie, can code be execute, non-read)

      The iPLB is used only for execute access, and its entries need only distinguish eXecute from Portal. Similarly the dPLB is used only for load and store, and need only have Read and Write permission bits. Table entries have all the bits so as to reduce the number of entries.
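
      As a sketch only, the permission split might be pictured as a set of bits like the following; the actual encodings are not public:

      ```c
      /* Illustrative permission bits; real encodings are not public. */
      enum {
          PERM_READ    = 1 << 0,   /* dPLB: load           */
          PERM_WRITE   = 1 << 1,   /* dPLB: store          */
          PERM_EXECUTE = 1 << 2,   /* iPLB: execute        */
          PERM_PORTAL  = 1 << 3    /* iPLB: portal transit */
      };
      /* dPLB entries carry only Read/Write and iPLB entries only
         eXecute/Portal, while region-table entries carry all four so a
         single table entry can back both PLBs. */
      ```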

  • David
    Participant
    Post count: 32

    Just a super-quick verification about things I presume, but weren’t explicitly stated:

    For the inp/outp data passing within a local function call (not a portal), the implicit arguments do not undergo any additional memory copying during the function call?

    Implicit arguments are simply framePointer[-N], whereas local stack variables are framePointer[+N], and thus are the same cost to access?

    I’m pondering the Lisp/Javascript/etc style of effectively always-varargs parameter passing, and it seems that this would be the mechanism employed.

    • Will_Edwards
      Moderator
      Post count: 98

      Yes, you can use the stack to pass arguments between frames.

      You may also use the heap. And how you do it depends a lot on whether you make a distinction between objects and primitives, and how you do garbage collection.

      Normal calls within a compilation unit are very conventional and all the normal approaches work; it’s only if you want to take advantage of the Mill protection mechanism, or it’s convenient for dynamic linking, that you use portals.

    • Ivan Godard
      Keymaster
      Post count: 689

      A bit more – while you can use your convention-of-choice to pass arguments in memory, passing them in the belt will be much faster, with a performance difference roughly equal to the difference between passing in registers vs. memory on a legacy architecture. So I’d expect performance-oriented Lisp/JS implementations to use the belt when they can, and restrict the always-varargs approach to blind calls when they can’t optimize.

  • vapats
    Participant
    Post count: 11

    Wow. Simply. Brilliant. I see why you folks are busy securing your patents!

    Y’all must have been letting some of these ideas stew for the past several decades…

    cheers, – vic

  • yorik.sar
    Participant
    Post count: 10

    Great talk!

    I found myself thinking that what you’re saying is mostly what I was thinking about when I was studying the Genode OS Framework. I realized that you cannot do things that way efficiently and securely on x86. What’s missing is the small and fast portals that you have! Are you going to build your OS for the Mill with Genode or something like it?

  • yorik.sar
    Participant
    Post count: 10

    I also have some questions that bothered me throughout the talk.

    * How do you create a turf? Who can create it? Can it be abused?
    * Can you create a VMM that doesn’t have right to read/write/execute all the memory but can grant access to it? Can you decouple VMM from the rest of OS?

  • yorik.sar
    Participant
    Post count: 10

    Sorry for mixing things up in the comments. I can’t post long posts here somehow.

  • yorik.sar
    Participant
    Post count: 10

    * Some services might require passing a list of callbacks. The application will essentially have to build a number of portals in its data space. Can an application grant portal permission on some address space that it can write?

  • yorik.sar
    Participant
    Post count: 10

    If it does, how is that region verified? Should the turf ids be the same as the turf the running thread is currently in? Can an application abuse this by saying “All TBs of my memory are portals. Go, verify”?
    * How does that need for portals map to existing C APIs?

    Thank you very much for what you do in both creating a really good architecture and teaching it.

    • Ivan Godard
      Keymaster
      Post count: 689

      We are working on a reference port of Linux/L4 (https://en.wikipedia.org/wiki/L4_microkernel). Other implementers will have their own approach; Windows on a Mill is an interesting idea 🙂

      Turfs, threads, and portals (other than the first, created by the hardware) are built by trusted services, principally the loader. The critical steps include allocating turf and thread ids and blessing a hunk of address space with the Portal permission. Family members vary in the hardware help for these steps, but all unsafe parts of the work are done via MMIO, so only those with rights to the relevant MMIO regions can do it.

      The software that does these steps will presumably have some notion of quantum, to prevent “portal bombs” such as you suggest. However, that’s all policy, and so not defined by the Mill architecture.

      We have had no problem mapping Mill security to the C API. The general strategy is to use a user-space shim that is linked right into the app. The app calls the shim, and the shim does the portal call, any necessary passing of arguments, and so on. However, we are far from done with the work and may find a gotcha as we get further. If so we’ll deal with it then.
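
      As a sketch of what such a shim might look like (pass_grant and portal_read are invented stand-ins; only the shape is the point, with the grant going “to whoever I call next” in the style of the pass operation from the talk):

      ```c
      /* Hypothetical user-space shim linked into the app; all names invented. */
      #include <stddef.h>
      #include <sys/types.h>

      extern void pass_grant(void *addr, size_t len, unsigned rights);
      extern ssize_t portal_read(int fd, void *buf, size_t n); /* portal pointer */

      ssize_t read(int fd, void *buf, size_t n)
      {
          pass_grant(buf, n, 3 /* read|write */);  /* transient grant to the callee */
          return portal_read(fd, buf, n);          /* transit into the service */
      }
      ```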

      Callbacks are handled much like the linking of dynamic libraries (.so files). When you link to a dynlib on a conventional system, the app must itself export its callback entry points so the linker can fix up the calls in the library just like it fixes up the calls in the app to point to the entry points in the library; this is a fairly sophisticated use of the linker, but well known and quite routine.

      On a Mill you do the same thing, only the exported entries are made into portals by the loader, rather than rewriting the code addresses. However, if you are a JIT, so the entry address isn’t known at load time, then the entry portals (in either direction) must point to trampoline code at fixed addresses, where the trampoline indirects to the moveable entry point. It costs one more cycle in and out, so no big deal.
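
      The trampoline itself can be pictured as nothing more than an indirect call through a slot the JIT owns (names invented for illustration):

      ```c
      /* Sketch: fixed-address stub that the portal targets; the JIT rewrites
         current_entry whenever it moves or recompiles the real code. */
      typedef long (*entry_fn)(long arg);

      static entry_fn current_entry;   /* updated by the JIT, not the loader */

      long trampoline(long arg)        /* lives at a fixed address */
      {
          return current_entry(arg);   /* one extra call in, one extra return out */
      }
      ```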

      There’s currently no way to grant something unless you have the same access yourself. It’s doable – another permission bit – but we have no use case that needs it.

      Ivan

  • Russell Johnston
    Participant
    Post count: 9

    I like it!

    Some questions about a few fixed-size things. Stacks will often need to be bigger than one segment/stacklet. There may also be more processes/threads than available turf/thread ids – Linux’s pid_t is 32 bits, for example. How would these things be handled?

    Stacklets could just be entry points that switch to the real stack I guess. How would that work, since call/return seems to be the only (user-available) way to mess with the stack pointer? This also goes with an earlier question I asked about continuations/coroutines.

    • Ivan Godard
      Keymaster
      Post count: 689

      Neither threads nor turfs are processes in the Unix sense, so pid_t can remain unchanged. A Unix process can be seen as an initial thread residing in a new turf, but you can have turfs without threads on the Mill, which is not possible on Unix because there is no way to have an isolated load module that is a pure service.

      Stack segments can be arbitrarily sized, subject to OS-enforced quanta as usual. The info block describes the top segment of the segmented stack. A new service call (not a callback) will always use the reserved stacklet, but it initially has no data frame (the spiller has a frame, of course). Creating a frame is a distinct Mill operation stackf, which carries a size in bytes. If that exceeds the current limit then the hardware traps to a handler (which would ordinarily be a portal to a service) that runs out and allocates enough space for another segment, fixes up the links and the frame and stack pointer registers, and updates the info block. The handler enforces whatever quantum policy is desired. The alloca operation can also trigger stack overflow, with similar handling.
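
      The handler’s bookkeeping might be sketched like this, with an invented info-block layout; the real structures and the quantum policy live in the OS service:

      ```c
      /* Sketch of segmented-stack growth on overflow (invented structures). */
      #include <stdlib.h>

      typedef struct StackSeg {
          struct StackSeg *prev;     /* link to the segment below */
          size_t size;
          char data[];               /* the segment's stack space */
      } StackSeg;

      typedef struct {               /* "info block": describes the top segment */
          StackSeg *top;
          char *sp;                  /* stack pointer within the top segment */
      } StackInfo;

      /* invoked when stackf (or alloca) exceeds the current segment's limit */
      int grow_stack(StackInfo *info, size_t frame_bytes, size_t quantum_left)
      {
          size_t sz = frame_bytes > 4096 ? frame_bytes : 4096;
          if (sz > quantum_left)
              return -1;             /* policy says fault rather than grow */
          StackSeg *seg = malloc(sizeof *seg + sz);
          if (!seg)
              return -1;
          seg->prev = info->top;     /* fix up the segment links */
          seg->size = sz;
          info->top = seg;
          info->sp  = seg->data + sz;  /* new frame allocates downward from here */
          return 0;
      }
      ```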

      Subsequent callbacks will use the new segment. An exit followed by a new call can be optimized if the handler keeps a previous, now empty, segment around to avoid allocate/deallocate overhead when call/returns are bouncing over a segment boundary.

      • Russell Johnston
        Participant
        Post count: 9

        Ah, I like the hardware support for segmented stacks.

        On thread/turf ids, how would the swapping out be handled with e.g. currently running threads needing their id revoked (as an analogy to swapping out an LRU page)? I’m asking out of pure “640k should be enough for anyone”-curiosity. 😛

        • Ivan Godard
          Keymaster
          Post count: 689

          Thread and turf ids last as long as the identified entity and are not swapped out and reused in the anticipated usage. A (typical) two million or so active ids should be sufficient for what a single chip can do. If you need more, then you’ll have to implement pseudo-ids in software and write some kind of mapping that overlays them on the available hardware id-set and specRegs. Not trivial, but could be done based on the Mill primitive model. However, note that direct access to those specRegs is very much a kernel function, so the mapper would be part of the OS whatever it is.

    • JonathanThompson
      Participant
      Post count: 11

      Hi rpjohnst,

      For low-level hardware things I’ll leave that to Ivan or someone else formally of Mill Computing.

      However, for all hardware Linux runs on, memory management, pids/threads and the like are also an abstraction. Where there’s more required than hardware provides, you’d never see a difference in code, as that’s also abstracted: that sort of thing is handled the same way virtual memory is. That a pid is 32 bits is purely a housekeeping detail that is convenient for the software, and has no connection with the hardware it runs on or its limitations. All those tables can be swapped out as needed by the OS to handle as many as desired.

      And, seriously: I’d be shocked if there’s a single CPU with any number of cores that Linux runs on that you’ll find 2^22 threads/processes in-use at any given time 😉

  • yorik.sar
    Participant
    Post count: 10

    Genode is an OS framework (“an offspring of the L4 community” they say on wiki). It uses whatever microkernel you give it (NOVA, Fiasco, L4) and provides everything you need to get proper OS on top of it. It can also run Linux or FreeBSD processes within it. And it heavily uses services that Mill supports in hardware.

    So I wonder if it would be easier to port L4 with Genode on top of it, instead of Linux, to the Mill, to get a proper OS faster and with less effort.

  • yorik.sar
    Participant
    Post count: 10

    On second thought, you don’t even need a microkernel here. Genode has a “hw” platform that uses hardware features to provide everything that’s usually done in the kernel. And the Mill might do really good with it.

    • Ivan Godard
      Keymaster
      Post count: 689

      The Mill does good with a lot of things 🙂

      Seriously, while we know that we will do a reference microkernel, the decision of which extant system to build on must wait until the new compiler is up enough to take the load, which won’t be for a while yet. It also depends on who we find as partners in the effort; if a grad student at University A gets fired up about porting system X to the Mill then we will arrange access to our tools and simulators and, eventually, hardware. We’ll also support ports by B of Y, C of Z and so on; may the best and fastest win. The Mill is intended to be policy-agnostic, so the more different systems that get ported the more confident we are in the hardware model provided. As research ports come out, we will be working to make one or another be commercial grade, or driving a merger of the most successful ideas of several.

      But all that waits on the tool chain for now.

  • BobC
    Moderator
    Post count: 10

    The portal/service concept, along with much of the rest of the Mill security architecture, would appear to be useful as an enhancement/modification to other CPU architectures. It appears to be lighter weight and more powerful than a traditional MMU (well, Memory Protection Unit, since there’s no hardware virtual memory on the Mill).

    The Mill is a 64-bit machine with a massive 60-bit physical address space (“an exabyte should be enough for anybody”), so an MPU is more appropriate. But there are lots of other processor uses that would benefit from a flexible MPU managing non-virtual memory: Just about every embedded system (my specialty).

    Do you think all or part of the Mill security model could be usefully ported to 32-bit embedded processors? If so, is MillComputing considering any of the following?

    1. Port Mill innovations to existing architectures (with available IP: ARM, Nios, Mips, etc). (This could allow MillComputing to get fab experience and hardware revenue before the Mill itself is ready for roll-out.)

    2. License Mill innovations to those wanting to incorporate them into existing architectures. (License revenue, like ARM.)

    3. Some combination, where an experienced fabless partner (say, Qualcomm) would get limited Mill IP (for Snapdragon) in exchange for helping push the Mill toward silicon.

    Yes, ARM and Nios already have MPUs available, but they seem primitive compared to the Mill security architecture.

    • Ivan Godard
      Keymaster
      Post count: 689

      Actually there is hardware virtual memory on the Mill, and paging and all the rest, and the virtual-to-physical translation, page traps and so on take place in the TLB in a quite ordinary way. It’s just that the TLB is in an unusual place in the architecture, after rather than before the caches as seen from the core, and protection is divorced from paging and moved to a different unit.

      Well, that’s not quite true: there are some extras in the Mill MMU: virtual zero (described in the talk), and some NYF dealing with Unix fork(). But it’s mostly true.

      As for embedded work, the problem with a smaller address space footprint is aliasing. If an embedded application had sufficiently small address space needs that everything would fit in 32 bits with no memory mapping/aliasing, then the Mill single-address-space model would work; it would be a full normal Mill with an implied (and not represented) zero in the high-order word of pointers. Note that you would still want a PLB. Whether the market for a 32-bit-no-MMU Mill is big enough to justify the work is unknown.

      Which brings me to your market questions. The Mill innovations tend to interlock rather tightly, so it is difficult to pull just one out and apply it to some other architecture. For example, you could pull Mill split-stream encoding out and apply it to a VLIW to be able to decode 30 ops per cycle like the Mill does. But without things like the Belt in the execution engine you wouldn’t be able to execute all the ops you decoded. And so on. We’re not opposed to licensing, and would take NRE contracts to port some of the ideas to other machines, but we see the opportunities as being rather limited. We feel that more likely we will sell hard macros of full Mills into SOCs.

      In contrast, we are actively interested in partners for the Mill itself. We know that large buyers will demand second-source availability, which means a license/partner. In addition there are specialized markets – rad-hardened, for example – where the existing vendors have expertise we will never have and a license seems the way to go. It’s the usual business story though – nobody wants to be the first to stick their neck out about a new architecture, but as soon as one bites everybody will be at our door.

      To which we will say: we are not an ARM with a license-based business model, so it’s going to be first-come-first-served.

  • JonathanThompson
    Participant
    Post count: 11

    The ramifications of the talk have sunk in, and they’re funny in a brilliant way: whereas x86 has rings 0-3 (usually only using 2 of those unless virtualization is used in some form) for levels of memory protection and supervisor/user privileges, the Mill architecture has, by virtue of removing the concept of supervisor/user mode, created a fractal tree of up to 2^22 protection levels that are hardware-accelerated and stupidly easy and cheap to manage. All that, and the virtualization facilities haven’t been revealed as of yet! Sure, in theory, you could lock out access in x86 or comparable architectures so that no given task has access anywhere else, but it would have massive overhead both in software and hardware to do so.

    As mentioned by another poster regarding embedded software, these ramifications are rather interesting: I’ve not seen any kind of mention in my knowledge/understanding of machine architectures where protection levels are so fine and easy to work with. I am curious about details of MMU functionality for each of the regions, if it has present/not present bits, to make it comparable in that aspect of things: I suspect it does. In a finite physical memory system, where code is, I’d expect it’d need to use jump tables or all relative code so it could be swapped out, due to physical addresses being the same as virtual addresses. For data, it means that either data needs to be in separate physical regions for all allocated data, or there needs to be a method provided for fixing up pointers for when regions are swapped in and out.

    But one of the funniest and best ramifications of the region/turf setup is the ability to perfectly isolate all data accesses and code accesses so precisely that it’d make tracking down stray pointers in the most complex of code bases a dream: since you could make each and every subroutine a service that has explicitly isolated memory accesses both for code and data, no buggy code, even in the “kernel” (the Mill greatly confuses what that means in practice, as one of the ramifications!), can stomp on anything but its own dynamic WKR, thus making it easy to isolate such faults to either a very small part of the code base, or… hardware defects (pretending those won’t happen is insane, as all know). Thus, if a service is known-good code, and something messes up, it’s inherently traceable to a greater degree than on probably any previously existing architecture that it was, indeed, a hardware error, even without ECC or equivalent: because if the only code that can access a small subset of RAM is known-good, then it can be demonstrated that the hardware did something wonky (perhaps DMA/bus mastering, or just plain failure).

    This would make the Mill architecture an absolutely stunning processor for proving (as much as software can be proven) software code correct, especially kernels and their drivers for any operating systems, and then recompiling it for other architectures, if you felt a strange need to work with legacy hardware 😉

    And that’s the rub: it (the Mill architecture) needs to be adopted over other things for the long-term success it needs, but there’s a huge amount of inertia created not only by rewriting code (it’s not all portable, and often makes many assumptions about system/CPU architecture that may not be true on the Mill) but also by the chipsets. I would be so very unhappy if the Mill architecture were stopped not by something clearly superior in architecture, but merely because it didn’t have a large enough quantum leap to supplant the existing base of higher-end processors along with chipsets. There are too many cases where the “good enough” is the enemy of the much better system, because the “much better system” had to overcome a rather sizable inertia to change by users, commercial and private.

    Past attempts at emulating previous instruction sets (Crusoe with their recompiling on the fly, or pure emulation) have been less than ideal: the most practical thing is that code needs to be completely rebuilt for a native instruction set, and while that can be and has been done, that’s a Super-man’s leap of effort for many to accomplish. Recompiling portable source code is so much easier in many respects to get done right.

    Perhaps the security aspects of the Mill may be, in combination with so many of the other things, the straw that healed the camel’s back and brings it into widespread adoption in non-tiny spaces: that, and the fact that the x86/ARM architectures with registers and complex instruction decoding seem to be hitting a wall for speed/power, regardless of how many gates you throw at them. At least, that’s what I’m hoping for: code exploits are such a problem for people, costing everyone money and confidence that their systems and data are secure, and software is getting too complex and developed too fast to catch it all, so the machine needs to be pro-active in its architecture to make such exploits impossible, even with sub-par code.

    • Ivan Godard
      Keymaster
      Post count: 689

      Please distinguish alias mapping from paging. 32-bit systems don’t have enough address-space bits, so they have to reuse addresses; this is mapping. In addition, physical pages may be present or not. For efficiency, conventional systems combine the mapping task and the paging task in one MMU/TLB machine.

      With a 60-bit space the Mill has no need for mapping; there’s all the virtual space that everybody wants there for the taking. However, virtual space may still exceed physical DRAM, so there remains a need for paging, and the Mill does paging.

      The paging is done in the TLB and its associated OS tables. Those are only looked at when we miss in cache and have to go to memory, and they tell where in physical memory a given virtual page is located, just as on a conventional machine. Or not located, if the page is swapped out to disk. So a Mill can still get page traps.

      Mill code is position-independent (PIC), but jump tables and the like have full-width pointers with virtual addresses where the code has been mapped. The underlying DRAM may be swapped out and paged back in at a different physical address, but the virtual address remains the same and the pointers still work.

  • David
    Participant
    Post count: 32

    You said that the memory security model is intended to be very coarse grained. Many x86 garbage-collected systems use page-sized protections in the MMU in order to inject read/write barriers based on page type, and to manage dirty flags in old-generation memory pages. These security mappings can be modified on every trap, or at least on every GC cycle. Is this sort of thinking compatible with Mill memory security regions?

    Systems like the JVM use memory reads into areas made unreadable as a safe-pointing device. To my understanding, the x86’s speculative processing guarantees the trap is raised before any side effects from further instructions are committed. In the more logically asynchronous memory model of the Mill, does this guarantee still hold?

    Not really security related: When JITting, do you need to generate member-specific code or can you write to the family-wide portable binary spec and use the loader to optimize and inject it into your memory space?

    • Ivan Godard
      Keymaster
      Post count: 689

      Re MMU use by garbage collectors:

      One certainly could use Mill protection for this purpose, but there’s a better way.

      Mill protection has byte granularity, so the GC would need only one region descriptor for the whole of any one kind of space. In a generational GC for example, you might use one descriptor per generation (typical GCs use three generations). This would be an easy port, just replacing the small amount of code that manages the page tables with similar code that manages the region descriptors.
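
      As a toy illustration of “one descriptor per generation” (all names invented; a real port would push these through the OS grant service rather than plain structs):

      ```c
      /* Three byte-granular region descriptors stand in for per-page bits. */
      #include <stddef.h>
      #include <stdint.h>

      enum { PERM_R = 1, PERM_W = 2 };

      typedef struct { uintptr_t base; size_t len; unsigned rights; } Region;

      enum { NURSERY, SURVIVOR, TENURED, NGENS };
      static Region gen[NGENS];        /* one descriptor per whole generation */

      /* protection-based write barrier: drop write rights on the tenured
         space so stores into it trap and the GC can log dirty objects, at
         byte rather than page granularity */
      static void protect_tenured(void)
      {
          gen[TENURED].rights &= ~(unsigned)PERM_W;  /* then re-push the grant */
      }
      ```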

      However, there’s a better way, one that uses the GC support “event bits” in the pointer format. With these, the GC can work at the granularity of single objects rather than pages or regions, and would be expected to have sharply reduced overheads. Porting a GC to use these would probably be a bit more work, because the model changes and that requires a bit of thought. The actual code changes should be near trivial though, mostly involving taking stuff out.

    • Ivan Godard
      Keymaster
      Post count: 689

      Re JITs and member-specific code:

      JITs will generate member-independent code and call an API to get that code specialized for the actual target.

      • PeterH
        Participant
        Post count: 41

        A JIT producing generic code and calling a service to specialize strikes me as a policy decision, not exactly forced by the architecture. But I can see doing it that way as strongly advisable in the general case, especially if the code is expected to run on other than a very narrow selection of Mills.

        • Ivan Godard
          Keymaster
          Post count: 689

          Certainly. I was describing what I expected the common implementation to be. We will support the specialize API as standard system software; it’s up to the JIT whether to use it or not.

  • David McCammond-Watts
    Participant
    Post count: 13

    How in the heck were you able to keep these advances to yourselves for so long? I’m reminded of the experience of reading a particularly clever proof–that “oh of course that’s how you do that!” feeling. So many “obvious” (after the fact) enhancements. Keeping all this quiet for so long, well that’s willpower.

    I have a question about threading. In the security talk you mention that threads are “…the same familiar thread you’ve sworn at so many times in your programming career.” I thought I’d read on comp.arch that the Mill won’t support preemptive multitasking, even though it would support having multiple threads each running on a separate core. Did I get that wrong or can you have multiple preemptively switched (i.e. not “green”) threads each getting a time slice on the same core?

  • Ivan Godard
    Keymaster
    Post count: 689

    Pre-emptive multitasking is fully supported. It’s a policy decision above the Mill pay grade whether all threads are first class or some are part of thread groups that are scheduled as a group in the usual way. Task switch on a Mill is very cheap because all the save/restore is done for you by the spiller, but there is still the trip through the ready queue and the priority scheduling etc. The point of portals and stacklets is to have the protection benefits of separate processes without the IPC costs when the IPC is in fact being done co-routine style and so can be replaced by calls, as is the great majority of cases.

    • David McCammond-Watts
      Participant
      Post count: 13

      Thanks for the answer. Based on the Security talk that’s what I expected.

      I can imagine all kinds of novel uses for the “protection environment switch via portal” (i.e. essentially a double function-call sans prediction). “Green” thread implementations will rock on the Mill, as will simple system calls. Simply elegant.

  • Anonymous
    Post count: 7

    The Novel (= “dirty”) bit means you have a write-back protection cache, nifty!

    I presume clean (non-Novel) revoked PLB entries are marked invalid but kept in the PLB until there is need to access the memory protection structure, at which point one OS trap can do a batch of work.

    There was a question asked at the end of the talk (video@1:22:10) that Ivan apparently didn’t understand and thus gave a non-answer to.

    A pass operation grants permission “to whoever I’m next calling” (video@1:01:29). However, if an interrupt (involuntary function call) occurs between the pass and the call, does the pass apply to it? If not, how does the Mill keep the grant pending across that function call, and all the nested (voluntary) function calls, to have it take effect on return? Even if it creates a PLB entry immediately (with turf=* thread=myself), there are still the mechanics of cleaning it up again.

    Is this more spiller magic?

    Another question is whether a region can have both execute and portal permission. Presumably the All has both so both can be granted, but how does the Mill handle a call to such a region?

    Also, if portal calls almost always require special permission-passing code, why not make it a distinct opcode? As you say, effectively all applications will jump through a library wrapper anyway.

    One thing that people might have missed in the Q&A is that a callback must be a portal, but it can either be globally visible or may be granted/passed for the duration of one function call so that unexpected callbacks are impossible.

    (This case of callbacks might be the answer to the earlier question. A callback usually refers only to the original caller’s buffers, so does not need to grant additional permissions, so perhaps C compatibility wins here.)

    • Ivan Godard
      Keymaster
      Post count: 689

      Very good! Yes, you can see the Novel bit as implementing a writeback cache, and deferred table update (described as “lazy” in the talk) works as you suggest.

      As for the missed question (with my hearing I miss quite a few), the “next call” is for explicit calls, not for interrupts, traps, or faults. The pending grant(s) are state. We once had a way to push them in the PLB and fix them later, but there were issues if they got evicted before the call, so it’s just spiller-food now.

      A region descriptor cannot have both execute and portal permission, but you could create two overlapping descriptors. Which you got would be search happenstance. If you wound up looking at a portal block as code then you would not transit and would be due for an invalidInstruction fault Real Soon Now. If you wound up looking at code as a portal, and by accident you happened to pass the security check by satisfying the ids that the bits in the id fields implied, then you would transit to the turf implied by the bits in the turf field, and then try to jump to the address implied by the bits in the target field. That address would have to have execute permission, and be in fact the address of an EBB entry (or you are up for invalidInstruction again) and probably must be the address of the entry of a function with no arguments or you are up for invalidOperand because the belt contents wouldn’t match what the code expects.

      So, if the OS portal-bless service screws up and does overlap two descriptors, and the bitsies are just exactly right, then you can call a function in a random service. That’s why portal-bless is in the kernel.

      As for distinguishing portal from non-portal calls, the basic reason is uniformity. We wanted a single pointer representation, one you could pass on to code that did not know whether it’s a portal or not. Consider a numeric integration package, which takes a data vector and a pointer to the function to integrate. The integrator should work the same whether the function pointer is to an application function, or a portal pointer to something in a math service.

  • joseph.h.garvin
    Participant
    Post count: 21

    I really like the Mill’s approach to stack safety and that in particular it prevents Return Oriented Programming.

    Has making random number generation support built in been considered? Besides the stack, the other major bane of embedded security is random number generation. Bad random numbers weaken crypto, and there have been a ton of vulnerabilities relating to routers picking their random seed at boot-up time before sufficient entropy has been collected by the OS, leading to predictable random numbers and allowing entry to attackers. Intel has RDRAND but that doesn’t help most of the embedded world. Obviously there’s no reason in principle why a hardware entropy source couldn’t be integrated into the Mill, but it would be great for it to be part of the minimum configuration (Tin), so that it’s hard to screw up and to help cement a reputation for the Mill as being more secure than other CPUs. This could also be a great opportunity to take advantage of multiple outputs on the belt and/or the Mill vector types; lots of simulations will use random matrices/vectors. If you wanted to be really fancy you could support different distributions (e.g. Zipfian vs. Gaussian).

    • PeterH
      Participant
      Post count: 41

      First-class hardware random number generators aren’t difficult. The old Atari systems c. 1980 had them. But I don’t see them as a core feature of CPU hardware. An opcode in the generic code representation wouldn’t hurt, with an option for the generator as a specialty register.

      • joseph.h.garvin
        Participant
        Post count: 21

        PeterH, I am not super familiar with the Atari hardware so I may be looking at the wrong document, but what I found here suggests the Atari used an LFSR, which as I understand it is not ‘first class’ in the context of cryptography, where it’s important that a determined adversary doing statistical analysis not be able to figure out the stream. You need a real entropy source for the seed and a cryptographically secure RNG algorithm.

      • akohdr
        Participant
        Post count: 2

        Can I second the request for RNG opcode(s)? It occurs to me that having RNG in the belt loop could provide orders of magnitude improvements in Monte Carlo based simulation. Having an opcode would save the I/O required feeding preconditioned random values quite literally removing noise from the data bus.

        I know you guys aren’t going after HPC but having this facility would allow you to go after a number of other verticals that use the Monte Carlo approach.

        • Ivan Godard
          Keymaster
          Post count: 689

          As with other very-long-latency operations, RNG does not fit well as an opcode. There is also the problem that RNG algorithms may be expected to improve with time. Consequently we see such a box to be optional per-member, and attached through a different interface that is not fixed-latency like regular ops.

          We do plan to provide a physical seed operation. This is not a RNG, but will provide a few bits of completely unpredictable entropy to start things off.

          • akohdr
            Participant
            Post count: 2

            I’m not sure I’d consider a RNG a ‘very-long-latency’ operation. Suitably pipelined, it becomes a generating function producing a value on each tick, at the expense of area. Conditioning from an entropy source is an algorithm though, so I see where you are coming from. If you have the budget I’d urge at least a PRNG; I think it would be of general value.

            In the case of Monte Carlo I see a potential improvement from the temporal addressing provided by the belt. For maximum impact the trial state space must fit on the belt and there would need to be a local source of randomness. If the inner loop reverts to memory access it’s no better than a streamed vector approach using a GPU/DSP.

          • Ivan Godard
            Keymaster
            Post count: 689

            The issue with a RNG and other hardware of similar character is not the bandwidth, it’s the latency: how long between first request and first result. That needs to be no more than 4-5 cycles, and uniformly consistent for every use. Otherwise the hardware can exist, but it needs to be accessed in a different way, or the whole machine suffers whether the RNG is used or not. So the right place for a RNG is as a side hunk of hardware, not reached by the usual operator interface. The timing, both latency and bandwidth, would be as good as if it were an operation, but without the penalty to machine performance of other code.

            Similarly, the belt cannot be indexed or have an address taken. Again, even the possibility of doing so would cut everybody’s performance 2X or more, whether they used the facility or not. So this is potentially another hunk of out-of-line hardware for users with special needs. We’ve identified several others of the sort: large look-up tables, Galois Field operations, etc. They would all use the same interface.

          • PeterH
            Participant
            Post count: 41

            A RNG need not be long latency, depending on the requirements. Requirements for cryptography vs. a Monte Carlo simulation are different. For cryptography you ideally want a generator that can’t be predicted, even knowing the last N numbers produced. If this takes 300 cycles to get a result, so be it; you aren’t asking for that many random numbers. A Monte Carlo simulation, on the other hand, can accept less random results but likes them fast.

            A hardware LFSR-based generator should be faster than an adder, but is completely unsuitable for cryptography, far too predictable. A software-based LFSR in the same code that uses the numbers I’d estimate at running 1 vector of results per cycle on the Mill.
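
            For reference, one step of a 64-bit Galois LFSR is just a shift, a mask, and an XOR (plain scalar C; vectorizing it across lanes is what the estimate above assumes):

            ```c
            #include <stdint.h>

            /* Minimal 64-bit Galois LFSR: adder-fast, and far too predictable
               for any cryptographic use. */
            static uint64_t state = 0xACE1ACE1ACE1ACE1u;    /* any nonzero seed */

            static uint64_t lfsr_next(void)
            {
                uint64_t lsb = state & 1u;
                state >>= 1;
                state ^= (0u - lsb) & 0xD800000000000000u;  /* taps 64,63,61,60 */
                return state;
            }
            ```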

            Reading from a bank of asynchronous oscillators is fairly fast and pretty good if the sampling is slow compared to the oscillator rate. But this takes power to run the hardware. So high power consumption, slow sampling, or weak randoms. If combined with another independent method you can get top grade randoms.

          • Ivan Godard
            Keymaster
            Post count: 689

            We believe that the width of the Mill (on the sorts of family members one would expect to use for sim) makes a software implementation practical for low-grade RNGs like a LFSR. As mentioned, we expect to use an i/o like interface for a crypto box on those members that have one. The purpose of the hardware random seed generator that I mentioned is not for use as a RNG – not enough entropy even for low grade – but to provide a physics-based starting point that cannot be predicted even if an attacker has all the code and the machine circuits to study.

            Whether the seed generator uses oscillators, back-biased diodes, or the howling of cats is up to the hardware guys and likely will vary by member. 🙂

            However, all this projection is very premature, and will no doubt be changed as we get closer to actual market.

          • indolering
            Participant
            Post count: 2

            Providing a seed really is good enough to solve most of the system level problems that we see with random number generators. DJB has actually argued against adding output from a HRNG to the entropy pool on Linux. Although I don’t fully agree with his conclusion (the threat model is pretty unrealistic) the theory behind his reasoning is sound.

    • Ivan Godard
      Keymaster
      Post count: 689

      We have not put hardware encryption into the base Mill, because we expect encryption methods to change over the life of the architecture and don’t want to be saddled with outdated legacy requirements.

      That said, the Mill has enough horsepower to do reasonable software crypto. For applications that want more, the Mill defines an extension interface to which things like crypto and other specialized engines can be hooked. The interface makes the engine look like an i/o device to the program using it.

      We have also considered supporting a physical seed unit. Such units give only a few bits of entropy and so cannot themselves be used for crypto, but they provide an uncrackable seed for regular algorithms. The decision on that has been deferred until after the FPGA version.

      • squizzle
        Participant
        Post count: 5

        I think not supporting an AES primitive is a mistake. One day AES is likely to be replaced, but that’s certainly many years away, and unless there’s a massive and complete break, it will be even longer before people actually stop using it (how many people still use SHA1? MD5?).

        If it (or perhaps a generic crypto instruction with a parameter for algorithm) is in the machine independent instruction set, it can be emulated when it reaches the point of being removed from silicon. When we reach this point, the emulation will only need to be correct, not super hand optimized for every cycle, so the cost required to maintain the specialiser on a new family member would presumably be very low / none.

        • Ivan Godard
          Keymaster
          Post count: 689

          There are issues with AES, any other crypto, and any block functional unit of any purpose. Recall that the Mill is a statically-scheduled fully pipelined machine with exposed timing. Long latency operations don’t play well, and the AES is hundreds to thousands of cycles depending on implementation.

          Moreover, on a fully-pipelined machine like the Mill you must be able to issue a new AES every cycle, which means that you need hundreds to thousands of AES engines because the iterative nature of the algorithm doesn’t pipeline.

          Next there are issues with data width. AES supports different data widths, 128-bit being typical. How would we feed it on Mills that do not support quad width?

          There are similar issues with long-latency scheduling too. The compiler will find at most a handful of other operations that it can schedule in parallel with an AES operation, so the rest of the machine will be stalled for the great majority of the time of the AES. The stall would likely block interrupts as well.

          I sympathize with your desire that AES should be supported (and there are quite a few plausible other block functions that you don’t mention). However, I think you are confusing a desire for a primitive, which is a semantic notion, with an operation, which is an implementation notion. AES may make a very good primitive that the market will demand and we should support; it makes a very bad operation. Instead, it should be implemented as an out-of-band block functionality akin to an i/o device in its interface. That way it doesn’t have to fit into the decode/pipeline/belt that the Mill uses, and you only need one of them rather than hundreds.
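
          Purely as an illustration of that i/o-style interface (the register layout is invented; nothing like this has been published), software might drive such a box the way it drives a memory-mapped device:

          ```c
          /* Hypothetical memory-mapped crypto box; layout invented. */
          #include <stdint.h>

          typedef volatile struct {
              uint64_t key[4];     /* write: key material                */
              uint64_t in[2];      /* write: one 128-bit plaintext block */
              uint64_t status;     /* read: bit 0 = done                 */
              uint64_t out[2];     /* read: ciphertext when done         */
          } AesBox;

          void aes_encrypt_block(AesBox *box, const uint64_t pt[2], uint64_t ct[2])
          {
              box->in[0] = pt[0];
              box->in[1] = pt[1];           /* writing the block starts the engine */
              while (!(box->status & 1u))
                  ;                         /* poll; a real driver would not spin */
              ct[0] = box->out[0];
              ct[1] = box->out[1];
          }
          ```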

          It’s easy to think that what appears primitive in software can be primitive in hardware. I wish it were that easy 🙂

  • PeterH
    Participant
    Post count: 41

    Can the handler called through a portal easily identify who called it? Suppose a service is being called by many threads in different turfs, such as a service reading files? You don’t want just any thread to access just any of the managed resources, and the caller can’t be trusted to identify itself by simple parameters passed.

    • joseph.h.garvin
      Participant
      Post count: 21

      If the Mill doesn’t have some other solution, I think you could roll your own protocol using regions to allow the service to be sure who is calling the portal. Have the portal put a random value only readable by the desired calling thread into memory. When the calling thread wants to make the portal call, it passes the number provided by the service that only it knows, and the service then verifies that the caller’s number is equal to its own, and then sets its number to a new random value in anticipation of the next call. Critically, when the check fails, the service needs to sleep for a period, or fault, or cause the caller to fault somehow, and pick a new random value for the next challenge; otherwise the caller can brute-force retry. Random numbers need to be used rather than just an incrementing counter, because otherwise a malicious thread can guess the current value from information like how long the system has been running or how many times the service has likely been invoked.

      Edit: this scheme assumes the calling thread is not ‘in cahoots’ with a malicious thread and thus won’t deliberately share the random value with it. But since the Mill protection model is that threads with a given permission can always grant other threads a subset of their permissions, I think this is OK.
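
      A minimal sketch of that challenge/response scheme, modeled as plain functions (random_u64() and the portal plumbing are assumed to exist):

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      extern uint64_t random_u64(void);  /* assumed: a well-seeded CSPRNG */

      /* placed where only the intended caller has read permission */
      static uint64_t challenge;

      void service_bind_caller(void) { challenge = random_u64(); }

      bool service_entry(uint64_t token)
      {
          bool ok = (token == challenge);
          challenge = random_u64();      /* fresh value either way, so a wrong
                                            guess cannot simply be retried */
          return ok;                     /* on failure: sleep/fault per policy */
      }
      ```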

      • David
        Participant
        Post count: 32

        Given that service calls are synchronous, I would presume that the current thread identifiers still reflect the caller. These should be read-only by both the caller and service, and shouldn’t be spoofable by user-level malicious code. From there you should be able to get to the OS-specific or internal security descriptors.

    • Ivan Godard
      Keymaster
      Post count: 689

      As suggested by the comments above, the thread id is unchanged through a portal call – it’s the same thread, just running in another protection environment. The current thread id is a readable specReg, so the service can know who it is working for. From that it can find more info in data structures it or other services maintain.

      However, it also can keep client-specific info in Thread Local Storage. Each client/service combination has its own TLS just like it has its own stacklet, addressed by a hardware base register like the hardware frame pointer.

      • PeterH
        Participant
        Post count: 41

        Allowing that the OS can give threads with a shared security domain common local memory in the service turf, putting service state in thread local memory should work beautifully. A handle may be a pointer, and any thread that can access the appropriate memory space in the service turf can then use the handle. Nice and fast.

        And since attempted read to forbidden memory produces metadata state, the service can check if the handle is valid for very low cost.

  • imbecile
    Participant
    Post count: 48

    I got 3 small questions.

    1. Considering stacklets are only 4KB in size, services probably can’t keep very large data structures on the stack. But considering argument passing happens on the belt and call stacks live in the spiller, the stack pressure on the Mill is vastly reduced. I’m assuming normal applications can have larger stacks than 4KB, though.

    2. There are a few hardware functionalities that could be useful if exposed to the programmer, like the lookups in the PLB and TLB. I’m not sure how feasible and secure this is, but could those lookup algorithms implemented in hardware be accessible to the programmer through a service/portal call? Or are they too tightly tied to their dedicated data tables?

    3. Are you aware of the MIT exokernels? I think the Mill architecture lends itself beautifully to them, and the service concept even makes some of the contortions they go through to avoid or secure task switches unnecessary, like the dedicated languages that result in code being passed to privileged kernel driver modules.

    • David
      Participant
      Post count: 32

      Speaking of operating systems, while it has gotten on in years I think AmigaOS would be a great fit. If I recall everything correctly, it already assumes a flat memory model, uses by-reference IPC data passing, and OS calls are normal function calls. Memory allocation requires protection descriptions as to how they’ll be shared.

      I don’t know how much of this has changed in AmigaOS 4, but the assumptions made for simplicity and speed back then would give great alignment with how the Mill accelerates and secures those assumptions.

      • Ivan Godard
        Keymaster
        Post count: 689

        Yep. The Amiga got a lot of things right; amazingly so considering the vintage. AmigaOS was actually one of the (mental) use-cases during Mill development. Back even further, Cedar/Mesa was also a base.

    • Ivan Godard
      Keymaster
      Post count: 689

      1) Bottom stacklets are fixed size. Stack overflow (which can happen on the very first frame if it’s big enough) pushes a grant of the occupied part of the overflowing segment, allocates a bigger (policy) stack segment somewhere, and sets the various specRegs to describe it and thereby grant permission. Return unwinds. Unwind is lazy so you don’t get ping-ponging segments if you are real close to a segment boundary.

      2) I doubt that the search hardware would be exposed. Too specialized, and it will vary by member, so not portable.

      3) I looked at the exokernel work. So far as I can see “exokernel” is just marketese for “microkernel” plus libraries. The libs would be services on a Mill, but otherwise I haven’t seen anything that novel compared to prior work in microkernels and capability architectures. Please point out anything I have missed.

      • imbecile
        Participant
        Post count: 48

        Yes, the exokernels are pretty much microkernels. The difference from other microkernels like L4 is that the API has even lower-level abstractions. They don’t even really have a concept of threads, for example; they work by granting processor time slices to memory mappings.

      • David
        Participant
        Post count: 32

        To my understanding, exokernels expect hardware drivers, filesystems, and other abstractions to be linked directly into user-space programs, so there is no IPC or context switching in those layers. Application optimizations can therefore drill to any abstraction depth to skip and/or cache more levels of processing and decision making than normal abstraction layers. The kernel security is only permission to hit the hardware, or some portion thereof.

        However, the organization is fairly similar to microkernels. One could consider exokernels to be a particular (and peculiar) optimization of microkernel architecture.

      • Russell Johnston
        Participant
        Post count: 9

        Exokernels are actually pretty different from microkernels (although the difference is somewhat orthogonal, rather than mutually exclusive).

        Microkernels implement things like drivers and file systems in servers, essentially taking a piece of a monolithic kernel and moving it into a process. Applications still need to go through this interface to get things done without violating protection, so applications still can’t e.g. change around file system disk block allocation algorithms, or make their own decisions regarding swapping.

        Exokernels, on the other hand, provide protection at a lower level: applications can access things like disk blocks directly rather than through a privileged server. This is typically done through libraries which run with exactly the same privileges as the application itself, but which can be bypassed for greater flexibility. For example, database indexes are sometimes faster to regenerate than to swap in from disk, so a database application could choose to discard indices rather than having its LRU page swapped out.

        This is why exokernel research tends to show performance improvements over monolithic kernels, and microkernel research tends to be about minimizing performance losses compared to monolithic kernels. 😛

        As I mentioned, you could easily have a microkernel whose servers expose their interfaces at the exokernel level rather than at the monolithic level. This would work really well on the Mill, where separating what would ordinarily be pieces of a kernel into services has a much lower cost.

        • Ivan Godard
          Keymaster
          Post count: 689

          Thank you; the explanation cleared up a lot for me.

          I would be very doubtful about directly exposing a device to applications, the premise of exokernels. Devices (the external thingy itself, not the driver) often have *very* fragile interfaces. Apps in quest of speed are notoriously buggy. There’s no conflict if the device can be assigned to just one process; the process is then both app and driver, and can tinker to its heart’s desire and harm no one but itself. But few devices are as simple and inherently single-user as a card reader these days.

          For a shared device, such as a drive, the app may get speed if it does its own driving. However, this exposes all other uses of the device to screw-ups on the part of the app/library/driver. Apps will tend to “just tweak it a bit”, and break everybody else.

          There are also issues with global behavior not being the same as local behavior in shared devices. There is a reason for central control of drives: all requests from all apps need to be sorted into seek order instead of request-issue order, or all apps using the drive suffer.

          Now if all the library does is what a monolithic driver would do, and the apps are trusted not to tinker with it, then the library is a service in the Mill sense, and on the Mill the app no longer must be trusted.

          So again, I’m not really seeing much difference between the micro- and exo-kernel approaches, at least on a Mill. In the micro, a driver process becomes a fast, cheap, secure service on the Mill; in the exo, a trusted library becomes a fast, cheap, secure service. Calling the service a micro or an exo seems mostly a matter of legacy terminology and marketing.

          BTW, nothing on the MIT exo site is less than a decade old, so I guess they abandoned the idea. I’d really like to see a paper that tells why.

          • Russell Johnston
            Participant
            Post count: 9

            I should’ve been clearer: exokernels expose the driver directly (or even a very thin layer on top), not the device hardware itself. That way they can control access more granularly (e.g. disk blocks rather than whole disks). MIT’s exokernels had all their drivers in a monolithic kernel; they removed things like file systems and the network stack into libraries, with the kernel interface designed to allow securely cooperating implementations.

            The premise is to allow different applications to all run at once even when doing crazy lower-level optimizations. One of their big motivating examples is a web server (closer to a proxy cache today, since it just serves static files) that allocates particular disk blocks to put files from individual pages closer together, merges TCP packets based on its knowledge of how the response is laid out, etc. It ended up being 8x faster than Harvest (which became Squid) and 4x faster than their port of Harvest to the exokernel.

            Another example I liked because it was particularly simple and clean was a ‘cp’ implementation. It finds all the disk blocks in all the files it’s copying and issues big, sorted, asynchronous reads (the disk driver merges these with all other schedules, including other cp’s, etc). Then, while that’s going on, it creates all the new files and allocates blocks for them. Finally, it issues big asynchronous writes straight out of the disk block cache. This ended up being 3x faster than their cp using a Unix libOS.
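
            For concreteness, a rough C sketch of that pattern (not MIT’s code; collect_blocks, async_read, and friends are invented stand-ins for the libOS calls):

                /* Batched, asynchronous cp: gather, sort, and issue all reads,
                   overlap the metadata work, then write straight from the block
                   cache. All types and helpers are invented stand-ins. */
                void exo_cp(file_t *src, file_t *dst, int nfiles) {
                    blocklist_t blocks = collect_blocks(src, nfiles); /* every source block */
                    sort_by_disk_order(&blocks);     /* the driver merges this schedule
                                                        with everyone else's */
                    async_read(&blocks);             /* one big sorted read */

                    for (int i = 0; i < nfiles; i++) /* overlap with the reads */
                        allocate_blocks(create_file(&dst[i]), file_size(&src[i]));

                    wait_for_reads(&blocks);
                    async_write_from_cache(&blocks, dst, nfiles);
                }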

            Neither of these (nor any of their other examples) requires any extra permissions: the libraries don’t even have to be in services, because if they break they just take down the application using them, instead of corrupting the disk or something. The exokernel already took care of only granting permissions to disk blocks that the application would access anyway.

            There are still a few more recent exokernel projects floating around, but MIT has moved on. It probably didn’t catch on because some of its systems are either too complex to be worth the effort (file systems required downloading small bits of code into the kernel to verify things about metadata generically) or didn’t solve all the problems they had (packet filters still have some ambiguities that aren’t easy to resolve well for everyone).

            However, many of their ideas have made their way into existing systems: Linux DRI is a very exokernel-like interface to the video card. Many ideas could still work out if you were willing to break compatibility as well; for example, a new exokernel could pragmatically decide to understand file system structures, while still letting the applications do most of the legwork themselves (and thus in the way best suited to their domain).

  • Ivan Godard
    Keymaster
    Post count: 689

    The following was from James Babcock and sent to me directly; repeated here for wider comment:

    In the security talk, you said that the Mill will not generally have high-granularity entries in its PLB, for performance reasons, but I don’t think you said anything either way about the TLB. Will the Mill support fine-slicing of address spaces in the TLB? If so, how much do slices cost, and if not, would it be feasible to add? I ask mainly because a finely-sliced address space in the TLB, combined with some memory-allocator tricks, could solve the use-after-free security problem, which the Mill has not yet proposed a solution for.

    The essence of the fix is separating reuse-of-address-space from reuse-of-memory, and not reusing the address space of freed objects for as long as possible. If it were cheap to reclaim memory without having to reuse the associated address space, for objects sized and aligned to >=32 bytes but smaller than a traditional 4KB page, then the use-after-free problem would be pretty much solved.
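
    For concreteness, a minimal sketch of the separation being proposed; HEAP_BASE, map_backing, and unmap_backing are invented stand-ins for whatever attaches and reclaims the physical backing:

        /* Reclaim memory, never address space: virtual addresses advance
           monotonically, so a stale pointer can never alias a live object. */
        static uintptr_t next_va = HEAP_BASE;     /* HEAP_BASE: assumed constant */

        void *alloc_no_reuse(size_t size) {
            size = (size + 31) & ~(size_t)31;     /* >=32-byte granularity, as above */
            void *p = map_backing(next_va, size); /* attach physical backing */
            next_va += size;                      /* this range is never reissued */
            return p;
        }

        void free_no_reuse(void *p, size_t size) {
            unmap_backing((uintptr_t)p, size);    /* the memory returns to the system;
                                                     the addresses do not */
        }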

    • Ivan Godard
      Keymaster
      Post count: 689

      The TLB supports scaled sizes, as many modern TLBs do, but is unusual in that the smallest size is one line rather than 4KB. However, one-line pages exist to permit on-the-fly allocation of backing DRAM when a dirty line is evicted from the last-level cache. The one-line pages are scavenged by a system process and consolidated into more conventionally-sized pages. If scavenging were not done then the number of one-line entries would grow to the point that the underlying page tables would occupy most of physical memory, which is rather pointless.

      There’s another difference between the PLB and the TLB structure: while the granularity of pages in the TLB varies, it is always a multiple of lines, whereas the PLB has byte granularity. Consequently you can protect a single byte (which is actually done, for example to protect an MMIO register for a device), but you can’t page one to disk.

      Moving on to your suggestion to use the TLB for checking for use-after-free. A crawling allocation only works if it has a crawling free, or you wind up with internal fragmentation and, again, a lot of tracking entries and poor performance. If you do have a crawling free, for example if you are using a copying garbage collector, then it becomes possible to protect the freed region. However, the TLB is probably not the right place to do it. That’s because on the Mill there is not necessarily any memory backing the address space – at all. Backless memory lets data have an existence only in cache, and is a significant performance win for transient data. The problem is that the hardware does not know that the software considered the data to be free, so if it is dirty (which it necessarily will be), it will eventually age out of cache, be evicted, and have physical DRAM allocated for it, all pointlessly.

      The hardware handles the case of transient data in the stack, because allocated frames are automatically freed on return, and the whole call/return protocol is built into the hardware so the hardware knows to discard the lines covering the exited frame. For random memory it doesn’t know about free unless the software tells it.

      The good news is that the software is able to tell it. There are several cache management operations in the Mill that let software convey useful information to the cache. One of those operations tells it that an address range is dead, and lines covering that range can be discarded whether nominally dirty or not. However, these ops do not recover allocated backing DRAM if it exists, because tracking DRAM is too complex for hardware and the OS must get involved.

      And here’s where the TLB gets back in the act. Once the crawling free has freed a whole page, and discarded the lines in cache, it is possible to remap the page in the TLB from physical memory to an error trap. There’s your use-after-free case. However, I don’t think that it is quite what you were hoping for, because it doesn’t protect free spaces in fragmented memory, and it only protects at page granularity. Hence there’s a window of vulnerability while waiting for a partially free page to become wholly free.
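
      In code, the flow might look like this sketch (discard_lines and remap_to_trap are invented names, not the real interface):

          /* Once a crawling free has emptied a whole page: */
          void retire_page(uintptr_t page_va) {
              discard_lines(page_va, PAGE_SIZE); /* cache op: the range is dead */
              remap_to_trap(page_va);            /* TLB: further accesses fault */
          }
          /* A partially free page cannot be retired, which is exactly the
             window of vulnerability described above. */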

      I’m not sure I have answered your question in this; feel free to expand or refine it here.

      • jimrandomh
        Participant
        Post count: 4

        That does indeed answer the question, although I feel like the problem is unsolved and there’s room for innovation here.

        It’s not entirely critical that reading or writing in a freed block actually fault, as long as it’s not receiving data from or overwriting pieces of some other object that was allocated into reused address space. If stray pointers lead into phantom backless memory instead of into an immediate error trap, that’s not as good, but it’s still a massive improvement.

        With the Mill as it currently is, if I wanted to write a paranoid version of malloc and free, then free would first instruct the cache to discard or zero the object’s lines, then check whether it was freeing the last entry in its page; if not, it’d either shrink that page table entry, or destroy it and replace it with two new ones on either side, causing a proliferation of page-table entries. This would be cheaper than every object having its own page table entry, but not by very much.
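
        Spelled out as a sketch (every helper name here is invented):

            void paranoid_free(void *p, size_t size) {
                cache_discard_or_zero(p, size);   /* drop the object's lines */
                pte_t *e = pte_covering(p);
                if (covers_exactly(e, p, size))
                    pte_destroy(e);               /* object was the whole entry */
                else
                    pte_split_around(e, p, size); /* shrink, or replace with two
                                                     entries either side of the
                                                     hole: the proliferation */
            }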

        Cache-line granularity is fine, unless the cache lines are wider than I think.

        I have two more ideas for dealing with this, which I’m going to continue in email (and maybe post here later) since it may bear on things that are NYF.

        • Ivan Godard
          Keymaster
          Post count: 689

          All of the finer-granularity solutions that we are aware of, where the granularity is fine enough to be used for red zones and to cover padding bytes in structures, have implementation and use costs that would restrict them to high-end machines that use (and pay for) non-standard DRAM configurations such as those needed for ECC. It could be considered if we enter the mainframe business.

          Line granularity poisoning (as opposed to silent zeroing) is possible at an area, bandwidth and performance cost of a few percent. Line granularity is sufficient to detect use-after-free, but not for use as red zones.

          All such schemes have a UI aspect too: the Mill is intended to run programs straight off the net without rewrite. When there is hardware that detects use-after-free, for example, we have to be wary of the reputational damage that may happen when the faulted user howls “But it works on an x86!”. We could easily be blamed for program bugs that other machines ignore. Sad, but human nature 🙂

  • squizzle
    Participant
    Post count: 5

    It may have been obvious to others, but it’s only just “clicked” for me what is meant by an “external interface”. My assumption was something like an external PCIe card performing some accelerating functionality.

    If my “click” is correct, then a better view would be like an ADC on an embedded microcontroller. It’s still part of the chip, but it’s not accessed with a “readadc” opcode; it’s accessed by setting the ADSC bit in the ADCSRA register, then reading the ADC register at some point in the future (Atmel AVR example; see the sketch below).
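
    Roughly this classic AVR C idiom (a from-memory sketch, untested):

        #include <avr/io.h>

        uint16_t read_adc(void) {
            ADCSRA |= (1 << ADSC);          /* start a conversion */
            while (ADCSRA & (1 << ADSC))    /* ADSC clears when it completes */
                ;
            return ADC;                     /* result register */
        }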

    So instead of an RdRand opcode like x64’s, you could have something like (paraphrasing, not certain of the syntax) …

    load(MM_rand,,)

    … and if the RNG takes more time than you expected to provide the data, then you stall, and that’s ok, because loads are designed to handle stalls. That doesn’t mean that you’ll have a 300 cycle latency to get a random number, but it does mean the RNG is allowed to be flexible on its timing.

    Is my understanding correct?

    • Ivan Godard
      Keymaster
      Post count: 689

      Your understanding is correct.

      There are further subtleties that make me say “like an i/o device”. For testing, programs often need to have repeatable random number sequences, so it is important that a program be able to glom onto an RNG and take all the numbers generated, without some other thread or process grabbing one in the middle of the sequence. That is, the RNG needs to be (able to be) single-user, exclusive use. Most of the other bulk-algorithm hardware has the same characteristic: you don’t want someone else getting the result of the MD4 that you started, and so on.

      This exclusive-use capability is characteristic of many i/o devices. Even those we often think of as having multiple concurrent users (a disk for example) actually have only one user (the driver) when you look closely at the hardware.

      In contrast, operations are usable by all, and the hardware does the save-restore that gives the illusion that everybody has their own belt. It is impractical for hardware to provide the illusion that everyone has their own RNG. So the Mill provides an interface that supports exclusive use (using MMIO and the standard memory protection system) and asynchronous (or semi-synchronous) access.
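
      As a sketch of how that exclusive use might be set up (turf_id_t, grant_region, RIGHT_READ, and MM_RAND_ADDR are assumptions, not the actual Mill API):

          /* One turf gets read rights over the RNG's MMIO register and
             nobody else does, so the number sequence has a single consumer. */
          void claim_rng(turf_id_t rng_user) {
              grant_region(rng_user, MM_RAND_ADDR, sizeof(uint64_t), RIGHT_READ);
          }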

  • jwm
    Participant
    Post count: 1

    I really like the protection domains and the PLB, I could really use this on my x86 right now. 🙂

    At about 1:03:40 there’s an explanation of passing a graph structure by allocating an arena and passing the arena. There’s a comment that the callee can “tromp on your arena structure”. Shouldn’t we be able to pass() the graph read-only to avoid the callee making any changes? Or does pass implicitly grant read and write access? It seemed that earlier, at about 1:01:30, we see a write system call using pass(ptr, len, r); I assume that “r” is read and the absence of “w” means it’s read-only.

    • Ivan Godard
      Keymaster
      Post count: 689

      You assume correctly; you can pass any subset of the rights you have, and the “r” is explicitly not passing “w”.

      About the arena: the concern is, if the arena is passed with “w” rights and contains both data and administration (which is the way many heaps are organized) then the recipient can tromp on the administration. However, if the administration is disjoint from the contained data (which is the way some heaps are organized) then only the data and not the administration is exposed. Of course, if the callee needs to add or remove nodes in the arena, the latter approach would require also giving the callee portals to the client routines that do the arena malloc and free, because the service would not be able to play with the administration directly itself.
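
      As a toy illustration of the disjoint layout (names invented):

          #include <stddef.h>

          /* The administration lives outside the data region, so passing
             a->data with “w” rights exposes the nodes but not the heap's
             bookkeeping. */
          struct arena {
              char  *data;  /* the only region granted to the service */
              size_t size;
              size_t used;  /* bump cursor: administration, never passed */
          };

          void *arena_alloc(struct arena *a, size_t n) {
              if (a->used + n > a->size) return 0;
              void *p = a->data + a->used;
              a->used += n;
              return p;
          }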

      It’s clear that the Mill primitives can be used in many different ways to support many different models, and we expect the users and the field as a whole to spend happy hours exploring the possibilities and discovering what policies and protocols provide the best of both function and performance. Have fun 🙂

  • LarryP
    Participant
    Post count: 78

    Non-forgeable turf/thread IDs: What can we assume re their size and properties?

    In the security talk (circa slide 19, around 21 minutes in), a turf ID is shown as one field in a region descriptor. Turf IDs are also said to be non-forgeable, though how that’s achieved isn’t stated. If region descriptors must fit into the PLBs, the PLBs must either support variable-sized descriptors (hard to do while keeping the H/W fast and cheap) or the descriptors must be fixed size.

    If region descriptors are of fixed size and are constrained to be a single cache line each (true, or have I over-assumed?), then turf IDs are de facto constrained in their length by the size of a cache line (on that Mill model), less the bits needed for the lower bound, upper bound, and rights. If so, what is the minimum width (#bits) guaranteed for the turf ID?

    In general, short keys/hashes are less secure against attack, and a key length sufficient to be non-forgeable today is unlikely to remain so for very long. Are Mill turfIDs/ThreadIDs made non-forgeable via hash functions (as are the payloads in NaRs), or is there a different mechanism used to make turf (and thread) IDs unforgeable? What may we assume about the (minimum guaranteed) length and security properties of turf and thread IDs?

    • Ivan Godard
      Keymaster
      Post count: 689

      There is considerable member-dependent variation possible in the formats of these things, and there are probably variations that we haven’t thought of yet that will be used some day. Consequently you should see this answer as representative rather than graven.

      ID sizes (turf, task) are member dependent. The most important constraint on their sizes is that the combination of a task id and a turf id is used as the implicit address to locate the root stacklet in a portal call, and hence both must fit in a virtual address (60 bits) together with a stacklet size and enough bits for non-stacklet memory as well. All these bit fields are per-member fixed, and the member designer can increase one only by shrinking another. A representative number might be 23 bits each for turf and task ids.
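
      Taking those representative numbers literally (purely an illustration; the real field layout is member-defined): 23 + 23 = 46 bits of id, leaving 14 of the 60 address bits to cover the stacklet offset and the distinction between stacklet and non-stacklet space. One such carve-up:

          #include <stdint.h>

          /* Illustration only: turf in bits 59..37, task in bits 36..14,
             stacklet offset (and any space-discrimination bits) in bits 13..0. */
          uint64_t stacklet_root(uint32_t turf, uint32_t task) {
              return ((uint64_t)turf << 37) | ((uint64_t)task << 14);
          }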

      We don’t seem to be constrained by the size of a portal data structure, although I suspect that when we implement that part of an OS we’ll find we want to put more, or more kinds of, data in the portal than we have now thought of. There is some need for the access to be power-of-two sized and aligned, and for hardware reasons it will probably be a multiple of the vector size even if we don’t need the bits.

      In general, such member-specific details are of interest only to the OS and will sit behind a member-independent API.

    • Will_Edwards
      Moderator
      Post count: 98

      The IDs are unforgeable not because they are unguessable but because they are managed by hardware.

You must be logged in to reply to this topic.