Forum Replies Created

Viewing 15 posts - 211 through 225 (of 674 total)
  • Ivan Godard
    Keymaster
    Post count: 689
      1) What is NaR + NaR? Whose metadata is used?

    It will be one of the inputs; which one is implementation dependent.
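    As a toy model of that rule (Python, with invented names; this is not the Mill's actual metadata encoding), NaR propagation through an ALU op might look like:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    """A belt operand: a payload plus NaR metadata (heavily simplified)."""
    payload: int
    is_nar: bool = False
    nar_kind: str = ""   # e.g. the fault kind that produced the NaR

def add(a: Value, b: Value) -> Value:
    """NaR in, NaR out. When both inputs are NaR, the result carries the
    metadata of one of them; which one is implementation dependent.
    This model arbitrarily picks the first operand."""
    if a.is_nar:
        return a
    if b.is_nar:
        return b
    return Value(a.payload + b.payload)
```

    So in this sketch `add(NaR-from-load-fault, NaR-from-div-zero)` just forwards the first NaR unchanged; a different implementation could equally well forward the second.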

      2) A process (thread 8 in turf 5) spawns and dispatches a new thread (thread 22475). That thread (22475) kills itself. What is the OS to do about the original thread (8)? It can’t dispatch it: the process has no threads parked in the kernel turf. Can the OS only kill that process? What happens to the resources used by the thread?

    OS (and RTS) policy. For example, an exception handler within the thread being unwound might dispatch out. There are some bottom turtle issues in thread death, just as there are for permission revoke, and we are not sure we have everything covered; there may be additional helper operations added to the ISA in the future as we (and the OS implementors) gain more experience.

      Stated alternatively: can I create a convoluted process by which a poorly written OS can leak machine resources?

    Of course, FSVO poor.

      3) In units of ALUs, how much hardware are you throwing at making a call a one-cycle operation?

    As compared to a 5-cycle call operation, say, very little. As compared to a call without the spiller, quite a bit, but hard to measure in those units. The spiller has internal SRAM for skid buffering, probably about as much as a top-level cache – how many ALUs is a cache? The rest of the spiller and the belt are roughly the same as the bypasses on a conventional machine. Essentially all the call cost is the spiller.

      4) Will all operations have the same latency across all Mill chips? Your examples have always had that multiply is a three cycle operation. Would a Mill ever be delivered with, for instance, a five cycle multiplier?

    Hardware latency is specified individually on a per-FU basis in our configuration tools. Not only can latency vary across family members, it can also vary per slot within a single member, so a chip could have a fast (expensive) multiplier and also a slow (cheap) one.
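    As a toy illustration of per-slot latency configuration (Python; the names and numbers are invented and bear no relation to the actual Mill specification tools):

```python
# Hypothetical member spec: latency is configured per FU, per slot.
# The same member can carry both a fast multiplier and a cheap slow one.
member_spec = {
    ("slot0", "mul"): 3,   # fast (expensive) multiplier
    ("slot3", "mul"): 5,   # slow (cheap) multiplier
    ("slot0", "add"): 1,
}

def latency(slot: str, op: str) -> int:
    """Look up the configured latency of an op in a given slot."""
    return member_spec[(slot, op)]
```

    The specializer would then schedule against whichever slot's latency applies, rather than assuming one family-wide number.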

      You haven’t done the talk on virtualization yet, so this may be answered later….
      Popek and Goldberg’s virtualization requirements ( https://en.wikipedia.org/wiki/Popek_and_Goldberg_virtualization_requirements ) assert that all sensitive instructions must be privileged for efficient virtualization. You have repeatedly asserted that the Mill does not have privileged instructions. That would imply that you have no sensitive instructions, or the Mill is not virtualizable in the sense of Popek and Goldberg.

    There are no sensitive operations. All control is via the address space and the protection mechanism.

      5) The idea behind virtualization, for many people, is the ability to run an unmodified OS (as from Microsoft or Apple) as an application within another OS. This is so that one can sell a single computer to multiple dupes under a “cloud” computing scam, or allow one to use Microsoft Office natively running on Windows which is magically running on Linux. Will the Mill be virtualizable in this sense?

    Not yet announced because NYF.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: 2017/2018 #3133

    Pretty good progress across the board, given the resource constraints. Things crash all the time of course, but that’s what happens at our stage of development. Some details:
    * post-linking implemented and is now a routine part of tool flow
    * specifications made constExpr, roughly doubling the speed of the tools that dynamically bind a spec.
    * hunks of the central parts of the C runtime ported
    * some work on the OS port but not much working yet (major 2018 target)
    * test suite significantly expanded. Some of the tests even work!
    * three public talks
    * a handful of new patents
    * compiled code quality improved after work throughout the tool chain
    * progress on hardware generators; starting to get feedback from hardware implementation into architecture/tools
    * some new people joined and ramping

  • Ivan Godard
    Keymaster
    Post count: 689

    The spiller holds ongoing program state as of a stall, mispredict, call, or return event. That state includes much more than just the then-current belt. Belt values, in their latches, are moved lazily into the spiller and its internal SRAM and eventually migrate to the spillets in DRAM. However, these are relatively simple to handle. The most difficult part of the spiller deals with in-flights, which do not yet exist at event time, but will be produced eventually and must then be captured for subsequent replay. That requires temporal ordering information that is not an address, but may be thought of as a stream or pipe.

    So there is a part of the spiller that does indeed hold full operands (possibly compressed at hardware option), but this is not addressable in the sense that DRAM or scratchpad is. Instead the operands (not necessarily contiguous) are organized for ordered replay. As the “address” changes continuously during replay and the operands will have random and varying other state intermixed, it does not seem practical to try to use spiller hardware for the functionality that is the present scratchpad.

  • Ivan Godard
    Keymaster
    Post count: 689

    The number of entries is a member config decision; ten would be small.

    Storing by entry number would require a mapping from entry number to position, or (trivial mapping) with all entries being maximal size. We pack them (saving space/increasing capacity) and reference by start byte number. The byte number needs more bits to encode than an entry number would, but the scratch ops are otherwise small and currently we just burn the entropy. The belt uses full sized entries and doesn’t try to pack because the actual belt is a set of latches/regs that must be full width anyway.
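    A minimal sketch of the packed, byte-addressed scheme described above (Python; names and the overflow behavior are invented for illustration, not the actual scratch ops):

```python
class Scratchpad:
    """Byte-addressed, packed scratchpad (toy model). Entries are packed
    back to back and referenced by start byte number, rather than by an
    entry number with every slot padded to maximal size."""
    def __init__(self, size: int):
        self.bytes = bytearray(size)   # configured, bounded capacity

    def spill(self, offset: int, value: bytes) -> None:
        if offset + len(value) > len(self.bytes):
            raise IndexError("scratchpad overflow")  # tool chain's problem
        self.bytes[offset:offset + len(value)] = value

    def fill(self, offset: int, width: int) -> bytes:
        return bytes(self.bytes[offset:offset + width])
```

    Packing a 4-byte entry at byte 0 and a 1-byte entry at byte 4 wastes no space; the price is that references carry a byte number rather than a smaller entry number.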

    The choice for scratch implementation is left to the hardware guys, and might be different in different members due to hardware/cost/power considerations.

  • Ivan Godard
    Keymaster
    Post count: 689

    Our impression is that the whole thing makes our heads hurt.

  • Ivan Godard
    Keymaster
    Post count: 689

    Actually there are problems with the predicated forms of load: with them you can hoist over a branch, as shown in the paper, but you can’t hoist over the computation of the predicate. Spectre has prompted much internal discussion about an alternative that doesn’t require the predicate until retire (easy) and is still safe from Spectre-like attacks (real hard). NYF for now.

  • Ivan Godard
    Keymaster
    Post count: 689

    You are right: we expect distros to be in genAsm and to be specialized to the actual target at install time. The chip will ship with a specializer in ROM with the BIOS. Nothing in the hardware stops the user from writing his own specializer, for whatever benefit or bugs that brings. For that matter, nothing in the hardware stops the user from writing his own genAsm. Subjects such as safe languages are matters of policy, above the architecture, and must be enforced by software. The Mill with its clean protection model offers an excellent platform for such things, but we as a hardware company do not expect to provide them ourselves.

  • Ivan Godard
    Keymaster
    Post count: 689

    We will publish a white paper with details on Monday 1/15/2018.

  • Ivan Godard
    Keymaster
    Post count: 689

    The Burroughs main frames (the A series – the first compiler I ever wrote was for the B6500) used this approach – the OS would only load code produced and signed by a known compiler. In the controlled environment typical of mainframes this worked, but in the free-for-all of PCs it would be too restrictive.

  • Ivan Godard
    Keymaster
    Post count: 689

    A portal causes a turf switch to a turf id contained in the portal structure. There are barriers to the vulnerability you suggest.

    If the attacker gave the victim a code pointer that falsely purports to be a portal and the victim called it then the victim would still be in his original turf, executing the code referenced by the passed pointer. However, the victim must have execute rights for any code, so the substitute code must be executable by the victim’s turf; it can’t be attacker code because the victim does not have execute rights to attacker code. And the attacker cannot blindly give such rights to the victim; there is a check so that a suspicious victim must accept a proposed grant before it takes effect.

    Thus the target address must be a valid entry point in the victim’s own code. Of course, getting the victim to call one of his own functions when he didn’t intend to is problematic too. There is a check, a bit more general than you suggest, that an untrusting program can use for this: it returns, for a given address, what permissions the caller has at that address. That check is necessary in a number of ways, but seems inelegant; we have been exploring alternatives, but with nothing entirely satisfactory yet.
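    The way an untrusting caller might use such a permissions-at-address check can be sketched like this (Python; the permission names, addresses, and API shape are all invented, not the Mill's actual check):

```python
# Toy model of "what rights do I have at this address?" used to vet a
# purported portal before calling it.
EXECUTE, PORTAL = "x", "p"

rights = {
    0x1000: {EXECUTE},           # victim's own ordinary code
    0x2000: {EXECUTE, PORTAL},   # a genuine portal entry
    0x3000: set(),               # attacker code: no rights in this turf
}

def perms_at(addr: int) -> set:
    """Return the caller's permissions at the given address."""
    return rights.get(addr, set())

def safe_portal_call(addr: int) -> None:
    """Refuse to call anything that is not actually a portal for us."""
    if PORTAL not in perms_at(addr):
        raise PermissionError("not a portal for this turf")
    # ... proceed with the real portal call ...
```

    With this check, a code pointer that falsely purports to be a portal (here, 0x1000 or 0x3000) is rejected before any transfer happens.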

    Second, the portal structure itself is set up by trusted code, which always sets the associated turf to that of the thread creating the portal. That is, you can create portals into yourself, but not into anyone else.

  • Ivan Godard
    Keymaster
    Post count: 689

    Any core including Mill is susceptible to Rowhammer because RH hits a fault in the DRAM, not in the core. The question is what exploits if any can be achieved by RH style memory mangling. A rights escalation would require causing a user to load a different (more encompassing) turf while staying in the same code and data. The only point at which the turf changes is at entry to or exit from a portal call. On entry the new turf is in the portal structure, which is not writeable by either the creating or entering turf. On exit the resuming turf is in the spiller stack, which is also not writeable by either the exiting or exited-to turf. So to escalate the turf you have to have write access to where the turf is kept, and to get that access you have to have write access to where the keeper is kept, and so on. It’s not clear that there is a bottom turtle to that regression.

    There is some code in the micro-kernel that has the rights to modify these saved turf ids and does so: the code that initializes a portal, and the code that does exception handling and thread teardown. So there may be an attack vector there if that code can be given a bogus value from DRAM that it then uses to overwrite a turf. But I don’t think I could do it even if I had the source code and the ability to arbitrarily change a value fetched from DRAM.

    An alternative approach would be to try to mess with the PLB tables without changing the turf. Changing the entry as it is created would face the same bottom turtle problem as trying to change the running turf id. However, changing the address range when an entry is loaded to the PLB might be more possible. It is not clear how an attacker would learn where the entry is located; the table is dynamically allocated, he doesn’t have access to the register that holds the base address, and getting access to that register is the bottom turtle again. Still, if the attacker can flush the caches and the PLB, then probe to a valid location not in a WKR, then the line address containing the table entry will be among the next few addresses presented to the pins. But if you have pin-level access to the system you don’t need Rowhammer to change the DRAM values read.

    So I won’t say that a Rowhammer crack of the Mill is impossible, but it does seem that it will be as hard as a pin crack, and those are blockable by encrypting pin traffic to DRAM. Fair warning though: this stuff is hard, and I may well be overlooking something.

  • Ivan Godard
    Keymaster
    Post count: 689

    I’ll try again. The belt uses temporal addressing; the scratchpad uses spatial addressing. There are two addresses involved in a spill/fill: the belt address of the value to be spilled somewhere, and the “somewhere” address needed to choose what value to fill. The present Mill uses temporal for the first, and like any reference the spill must execute before its target drops off the belt. If scratch were part of the spiller then fill would need an (arbitrarily large) address to look into the spiller history to find the value.

    You can’t use temporal addressing for long- or indefinite-lived values because the temporal address range is unbounded. Hardware doesn’t do unbounded. With spatial addressing the address range is bounded by the configured size of the scratchpad. Hardware does that, although the tool chain must deal with running out of the bounds.
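    The contrast can be modeled in a few lines (Python; a deliberately crude sketch with invented names, not how the hardware is built):

```python
from collections import deque

class TemporalBelt:
    """Temporal addressing: b0 is the newest drop, b1 the one before,
    and so on. Only the last `length` drops are reachable; a value's
    temporal distance grows without bound as execution continues."""
    def __init__(self, length: int = 8):
        self.history = deque(maxlen=length)

    def drop(self, v) -> None:
        self.history.appendleft(v)

    def read(self, b: int):
        if b >= len(self.history):
            raise IndexError("fell off the belt")
        return self.history[b]

class SpatialScratch:
    """Spatial addressing: offsets into a configured, bounded store.
    An offset stays valid for however long the value must live."""
    def __init__(self, size: int = 16):
        self.store = [None] * size   # hardware-friendly: bounded

    def spill(self, off: int, v) -> None:
        self.store[off] = v          # tool chain keeps off < size

    def fill(self, off: int):
        return self.store[off]
```

    The belt model fails once a value's temporal distance exceeds the belt length, which is exactly why long-lived values need the bounded, spatial store instead.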

    Perhaps you are thinking of a scheme whereby the spill op would push the value into a side stack and the fill would use a stack offset rather than a temporal reference to address it. That’s possible, but the stack management hardware is more than is needed for a simple regfile-like array of values. And, returning to the first question, one would need either maximal sized entries, or a map from entry number to packed byte offset, or make the stack byte addressable.

    I’m not saying that one couldn’t put the scratchpad in the belt space so that scratch entries could sit in the same latches as belt operands. But the addressing logic to get such a scratch entry back into the space where adds and such could address it is too expensive because it would push up the size of the crossbar. So we keep the address spaces separate.

  • Ivan Godard
    Keymaster
    Post count: 689

    It’s a lifetime issue. The scratchpad is not a simple extension of the belt, it’s a repository for values with long or indeterminate lifetimes. The spill op copies a value from the belt to scratch, and that same value may move into the spiller if there were a call while it’s still live. But a value computed before a loop and used after the loop (and maybe in the loop) has an unknown lifetime, so we need to save it for the duration of the loop. Mill execution makes values with great abandon, and we can’t save them all as if there were an infinite belt. So we need a way for the compiler to tell the hardware that a particular value is of continuing interest, and be able to request it again later. That’s the spill and fill ops.

    In contrast the spiller saves everything in-flight and on the belt, but that’s a tiny population compared to everything that has ever been on the belt, which is the potential population for the scratchpad. Different expected lifetimes, different reference patterns, different latency, complexity, and power constraints -> different mechanisms.
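    The spill-before-loop / fill-after-loop pattern described above can be walked through concretely (Python; a toy belt of length 4 and a dictionary standing in for scratch, all invented for illustration):

```python
BELT = []        # newest first; toy belt holding only the last 4 drops
SCRATCH = {}     # stands in for the scratchpad

def drop(v) -> None:
    BELT.insert(0, v)
    del BELT[4:]              # older operands fall off the belt

# Value computed before the loop, needed after it.
x = 42
drop(x)
SCRATCH["x"] = BELT[0]        # spill: compiler knows x outlives the loop

for i in range(10):           # loop body makes values with great abandon
    drop(i * i)

# After the loop, x has long since fallen off the belt...
assert 42 not in BELT
x_again = SCRATCH["x"]        # ...but fill recovers it from scratch
```

    Without the spill, the value 42 would simply be gone once ten newer drops had pushed it off; the compiler's spill/fill pair is what marks it as being of continuing interest.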

  • Ivan Godard
    Keymaster
    Post count: 689

    We can’t leave scratchpad-usage data in the spiller because the data is both spatially and temporally random access, while the spiller is at heart just a glorified stack. Items can be left in the scratchpad for arbitrarily long times without increasing the latency of later access, whereas items in the spiller eventually migrate to memory and get memory latency.

    Instead we want the scratchpad to have uniform latency and simple random access, without the expensive mux crossbar needed for spiller access even to limited depth. So scratch really acts like, and is mostly implemented like, a register file in a conventional machine. The differences include the metadata, the self-defining data widths, and the packing at byte granularity.

  • Ivan Godard
    Keymaster
    Post count: 689

    Indeed you can overlap them, but there’s no need for any special hardware; normal instruction scheduling in the specializer will minimize the overall latency.
