Inter-process Communication

  • staff
    Keymaster
    Post count: 49
    #2995 |

    Talk by Ivan Godard – October 4, 2017, at the
    Silicon Valley Linux Users Group

    Slides: 2017-10-04-IPC.4 (.pptx)

    This was the twelfth topic publicly presented related to the Mill general-purpose CPU architecture. It covers Inter-Process Communication for the Mill CPU architecture family. The talk assumes a familiarity with aspects of CPU architecture in general and C++ programming in particular.

    The Mill is a new general-purpose architectural family, with an emphasis on secure and inexpensive communication across protection boundaries. The large (page) granularity of protection on conventional architectures makes such communication difficult compared to communication within a protection boundary, such as a function call. As a result, the large granularity has forced communication protocols on conventional architectures into two models: pass-by-sharing (using shared pages), and pass-by-copy (using the OS kernel for files/message passing). Both have drawbacks: sharing requires difficult-to-get-right synchronization, while copy involves kernel transitions as well as the costs of the copy itself.

    The Mill supports both these protocols, for use by legacy code. However, the Mill hardware also supports inter-process communication using the same program protocols as for intra-process communication and function call: pass-by-value, pass-by-copy, pass-by-reference, and pass-by-name, but all without kernel involvement or overhead. The protocols are secure: neither party can see anything of the other except the explicit arguments to the communication. Neither caller nor callee code needs source changes to replace intra-process communication with Mill inter-process argument passing. However, the pass-by-reference protocol may require use of shims to delimit the extent of sharing in some languages. And granularity is no longer an issue: arguments can be of any size down to the byte.

    The talk describes the machinery behind the Mill IPC protocols, together with suggestions as to how the hardware facilities may be integrated with representative language runtime systems such as those found in Linux.

    Ivan Godard is CTO and a founder of Mill Computing, Inc., developer of the Mill family of general-purpose CPUs. He has written or led the development team for a dozen compilers, an OS, an OODBMS, and much other software. Ivan has been active in the field of computers since the 1960s.

  • Christophe Biocca
    Participant
    Post count: 3

    As far as the wiki is concerned, whether a call is a portal call or a regular call is not determined by the caller (both use the same call opcode). Portal-call vs. in-turf execution vs. fault is determined by the
    permissions of the calling turf on the entry point of the called code.

    In addition to the above, a turf can grant/revoke portal/execute permissions on their code to other turfs, as a way to achieve microkernel-style privilege separation.
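
    To make the dispatch rule concrete, here is a minimal sketch in Python. It is only a model of the behaviour described above, not Mill code: the `Perm` flag names and the `classify_call` helper are made up for illustration, and the choice to let a portal right win when a turf holds both rights is an assumption.

```python
from enum import Flag, auto


class Perm(Flag):
    """Hypothetical rights a calling turf may hold on a code address."""
    NONE = 0
    EXECUTE = auto()
    PORTAL = auto()


def classify_call(perms: Perm) -> str:
    """Model what the single call opcode does, given the caller's rights at the target."""
    if Perm.PORTAL in perms:
        # Assumption: a portal right takes precedence over plain execute.
        return "portal call"    # execution continues in the callee's turf
    if Perm.EXECUTE in perms:
        return "in-turf call"   # ordinary call, the caller's turf is unchanged
    return "fault"              # no rights at the entry point
```

    The caller's code is identical in all three cases; only the permission lookup differs, which is exactly why the caller cannot tell the cases apart without an explicit check.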

    This protects the callee, but the caller can’t know for certain that the foreign code they’re calling into will actually run in a different turf. I could write an evil portal service that grants plain execute to would-be callers, and they wouldn’t be able to tell.

    Am I missing something? Is this just a matter of policy (always have a separate, trusted, turf control the portal permissions so that you can trust that they’re set properly and never changed out from under you), or is there a way for a caller to say “this must be a portal call, fault otherwise”?

    • Ivan Godard
      Keymaster
      Post count: 689

      A portal causes turf switch to a turf id contained in the portal structure. There are barriers to the vulnerability you suggest.

      If the attacker gave the victim a code pointer that falsely purports to be a portal and the victim called it, then the victim would still be in his original turf, executing the code referenced by the passed pointer. However, the victim must have execute rights for any code it runs, so the substitute code must be executable by the victim’s turf; it can’t be attacker code because the victim does not have execute rights to attacker code. And the attacker cannot blindly give such rights to the victim; there is a check so that a suspicious victim must accept a proposed grant before it takes effect.

      Thus the target address must be a valid entry point in the victim’s own code. Of course, getting the victim to call one of his own functions when he didn’t intend to is problematic too. There is a check, a bit more general than you suggest, that an untrusting program can use for this. It returns, for a given address, what permissions the caller has at that address. That check is necessary in a number of ways, but it seems inelegant. We have been exploring alternatives, so far with nothing entirely satisfactory.

      Second, the portal structure itself is set up by trusted code, which always sets the associated turf to that of the thread creating the portal. That is, you can create portals into yourself, but not into anyone else.
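
      The three safeguards described in this post (grants that need acceptance, a permission query, and portals bound to their creator) can be modelled in a few lines of Python. This is a toy sketch under stated assumptions: the `Turf` class, its method names, and the dictionary returned by `create_portal` are all hypothetical and do not reflect the real Mill interfaces.

```python
class Turf:
    """Toy protection domain: accepted rights plus grant offers awaiting approval."""

    def __init__(self, name):
        self.name = name
        self.rights = {}    # address -> set of right names, e.g. {"execute"}
        self.pending = []   # offers (grantor name, address, right) not yet accepted

    def offer(self, grantee, address, right):
        """Propose a grant; it has no effect until the grantee accepts it."""
        grantee.pending.append((self.name, address, right))

    def accept(self, offer):
        """Accept one pending offer, which makes the right effective."""
        self.pending.remove(offer)
        _, address, right = offer
        self.rights.setdefault(address, set()).add(right)

    def rights_at(self, address):
        """Query which rights this turf holds at an address."""
        return frozenset(self.rights.get(address, ()))


def create_portal(creator, entry_point):
    """Trusted portal setup: the target turf is always the creator's own."""
    return {"entry": entry_point, "turf": creator.name}
```

      In this model an attacker can call `offer` as often as it likes, but `rights_at` on the victim stays empty until the victim itself calls `accept`.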

  • Christophe Biocca
    Participant
    Post count: 3

    > And the attacker cannot blindly give such rights to the victim; there is a check so that a suspicious victim must accept a proposed grant before it takes effect.

    That does solve the issue I had in mind: the attacker granting execute instead of portal permissions to its own code, the victim, unaware of this, calling into it and giving attacker-controlled code in-turf access.

    Making each grant subject to approval does solve this issue. It means the program can check a purported portal address once (confirming that it holds no execute permission there at all) and use it safely afterwards. Even a later revocation of the portal permission wouldn’t cause problems for the program (beyond faulting).

  • ShawnL
    Participant
    Post count: 9

    Can you compare your ideas behind IPC to the seL4 operating system, which is the first operating system that comes with proofs of correctness (including proofs that the compiler correctly translated the code to assembly!)? Awaiting features from CPU vendors, they are also on the cutting edge of doing time-dimension proofs. The feature that seems to be lacking in seL4 is futex(), with all its security coming from the page tables and seL4’s existing memory type system, but otherwise everything both you and seL4 seem to be talking about is properly utilizing the MMU to its full potential.

  • ShawnL
    Participant
    Post count: 9

    OK, I just watched the whole video. I would love to be part of supporting/developing the C API on top of turfs[1], as I think this is really cool. One question stuck out: you say you have global addresses and the TLB *after* the cache, and then you say you have a local bit. But if the local address space is 60 bits, you would have to put a TLB *before* the cache to support local addresses. Why not instead shrink the local address space to something smaller like 32 bits (convenient because applications generally have to support 32-bit architectures, so you can just compile apps in 32-bit mode), and then implicitly fill in the zero top bits with the thread+turf id, and thus preserve the global address space model while still supporting the local addressing abstraction?

    [1] As you pointed out, the C API is insufficient, because there is no concept of a slice (pointer and length, with constant property) to transfer address space.

  • ShawnL
    Participant
    Post count: 9

    Ahh, I guess I answered my question. You want the Mill to support existing code, and quirks are a great way of turning people away, and plenty of existing code needs more address space than 32 bits. And later you could release a Zinc that only supports 32-bit addresses in the local address space.

    • Ivan Godard
      Keymaster
      Post count: 689

      I see you have worked out a fair amount of the details 🙂

      A part you are missing is that the local address space as a whole is present in the global space, so no translation is needed to convert between a local and global address. The mapping is trivial, and does not require tables or look-up machinery such as a MMU. We have stolen a single bit in each pointer to distinguish whether the pointer refers to a local or global address. Locals are converted to global as part of the effective address calculation in memory-reference operations, and the memory hierarchy sees only global addresses.

      A 32-bit Mill is possible, or even 16 bit, so long as that’s enough memory for the application; embedded usage for example. The rest of the Mill is really too heavy-duty for microcontroller use though, so the Z-80 and 6502 markets are safe from us.
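
      As a rough illustration of converting a local pointer to a global one during effective-address calculation, here is a Python sketch. The post does not give the actual encoding, so everything concrete here is an assumption: the 60-bit address width (taken from the question above), the position of the stolen bit, and the idea that a local pointer is relocated by adding a per-turf `local_base`.

```python
ADDR_BITS = 60                  # assumed address width
LOCAL_FLAG = 1 << ADDR_BITS     # assumed position of the stolen local/global bit
ADDR_MASK = LOCAL_FLAG - 1


def effective_address(pointer: int, local_base: int) -> int:
    """Return the global address that the memory hierarchy would see."""
    address = pointer & ADDR_MASK
    if pointer & LOCAL_FLAG:
        # Local pointer: relocate into this turf's slice of the global space.
        # No table or look-up is involved, only arithmetic on a base value.
        return (local_base + address) & ADDR_MASK
    return address              # already a global address
```

      Whatever the real mapping is, the point of the sketch is that it is pure arithmetic on the pointer, so it can run before the cache without any translation tables.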

  • ShawnL
    Participant
    Post count: 9

    > such as a MMU

    You still need something to enable de-fragmentation, and that same machinery would give you the ability to implement CoW for fork(), and perhaps sparse maps.

    I am talking about having a 32-bit addressing mode on a 60-bit Mill, for performance/memory/cache reasons.

  • Ivan Godard
    Keymaster
    Post count: 689

    There are two kinds of potential fragmentation: of the physical memory space, and of the virtual address space. They are tied together in a legacy architecture, but separated on a Mill. The physical space is managed by the TLB, which can do paging and physical consolidation in a manner very similar to a legacy architecture. In contrast, the virtual space is never consolidated; this is a consequence of the fundamental design choice to have the caches use virtual addresses.

    There are no obvious advantages to having a 32-bit virtual space in a 60-bit physical space. True, pointers would be four bytes rather than eight, but one can use 32-bit indices just as easily. There’s the problem of programs that need more than 4G, but those could use large mode. But the big problem is mixed mode. Sandbox code will need to use facilities from outside the sandbox, and those would be reached by 8-byte addresses. Keeping track of near-vs.-far addresses is something that we left behind with the 8086.

    So yours is an interesting question: could it be done in the hardware? I think the answer is yes, the hardware could do it. But it would cause a great gnashing of teeth, both among our software teams and customers that used it. Would it sell enough more chips to justify the pain? I don’t think so.

  • mmeyerlein
    Participant
    Post count: 13

    Hi Mill team,
    I have a question about the transient grant.
    If A calls B via the portal, it can grant B permission to a memory area via pass. When the thread comes back, this permission is revoked, right?
    But if B wants to return a memory area to A, how is this managed? Does B have to allocate a new memory area and pass it to A?

    • Ivan Godard
      Keymaster
      Post count: 689

      Returning a dynamic object across a protection boundary is just as annoying as it is when caller and callee are in the same protection domain, and you can use the same methods to do it. Caller can create and grant-pass a result area for callee to fill. Callee can allocate a new object and explicitly grant it to caller. Caller and callee can share an allocation arena set up in advance and used for multiple calls. Callee can call a trusted registrar that receives and holds the object for caller to pick up later after callee returns. And so on.
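
      The first option (the caller grant-passes a result area for the callee to fill) can be modelled as follows. This is a Python sketch of the idea only: the `Memory` class, `transient_grant`, and `callee_fill` are hypothetical names, and the grant is assumed to cover a half-open byte range that is revoked automatically when the call returns.

```python
from contextlib import contextmanager


class Memory:
    """Toy byte-addressed memory with per-turf, byte-granular write grants."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.grants = set()     # (turf, start, end) half-open byte ranges

    @contextmanager
    def transient_grant(self, turf, start, end):
        """Grant [start, end) for the duration of one call, then revoke it."""
        grant = (turf, start, end)
        self.grants.add(grant)
        try:
            yield
        finally:
            self.grants.discard(grant)

    def write(self, turf, start, payload):
        """Write payload at start, but only inside a range granted to turf."""
        end = start + len(payload)
        allowed = any(t == turf and s <= start and end <= e
                      for t, s, e in self.grants)
        if not allowed:
            raise PermissionError(f"{turf} may not write [{start}, {end})")
        self.data[start:end] = payload


def callee_fill(mem, start):
    """Callee B writes its result into the area the caller passed."""
    mem.write("B", start, b"result")
```

      A caller would wrap the call in `with mem.transient_grant("B", 8, 14): callee_fill(mem, 8)`. The same write attempted after the `with` block raises `PermissionError`, which mirrors the revocation on return.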

      • mmeyerlein
        Participant
        Post count: 13

        Ah, OK. I saw the picture in the presentation and wondered how A should get the picture back from B if B resizes it. And if B provides an area from its own security area to A, then it will be hard for B to release that malloc’d memory later.
        But if both of them have an area like shared memory, then I understand.
        Thanks.

  • ShawnL
    Participant
    Post count: 9

    > Caller and callee can share an allocation arena set up in advance and used for multiple calls.

    You could return a read-only grant to found data, because those grants have byte-granularity.

    And a filesystem daemon could also provide shared mmapped files in the direct VM space (instead of process-specific VM space). Hmm, that wouldn’t work. How can IPC use the same parameters both when passing by copy and through zero-copy (VM manipulation)?

    • Ivan Godard
      Keymaster
      Post count: 689

      Yes, you can do a read-only persistent grant to data outside the frame, such as global or heap, when returning a large result. However, it will be difficult and complex to revert that grant at some future time; it’s not like returning an integer by copy. Having the caller transiently grant a return space to the callee removes this issue, but requires the caller to know how much space to grant.

      Language runtime systems can use the underlying hardware to provide copy semantics, but they must be able to track ownership. A single-owner language such as Rust could do this by creating an Ownership Manager that is the actual grantee of a new creation and which handles change of access and ownership. The performance would be similar to using a shared arena, but would have less need of trust between caller and callee.
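
      A minimal sketch of such an ownership manager, written in Python for brevity, might look like the following. It is an assumption-laden model: the `OwnershipManager` class and its methods are hypothetical, and the hardware grants that a real manager would issue and revoke are reduced to a lookup table.

```python
class OwnershipManager:
    """Toy single-owner registry standing in for a trusted ownership service."""

    def __init__(self):
        self._owner = {}     # object id -> name of the turf that currently owns it
        self._next_id = 0

    def create(self, owner):
        """Register a new object; the manager records who may use it."""
        obj = self._next_id
        self._next_id += 1
        self._owner[obj] = owner
        return obj

    def transfer(self, obj, current, new):
        """Move ownership; only the current owner may give an object away."""
        if self._owner.get(obj) != current:
            raise PermissionError(f"{current} does not own object {obj}")
        self._owner[obj] = new

    def may_access(self, obj, turf):
        """True only for the single current owner."""
        return self._owner.get(obj) == turf
```

      Because caller and callee each talk only to the manager, neither has to trust the other to release or revoke anything, which is the reduced-trust property mentioned above.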

  • MarcTheYogurtMan
    Participant
    Post count: 1

    When performing IPC, how does the caller specify how many return values to expect from the callee?

    For example, let’s say that the caller expects the (untrusted) callee to return exactly one value. If the callee ends up returning two values, then both values would end up on the caller’s belt, and it would interpret the first returned value as the return value of the function and the second return value would be interpreted as the result of a previous instruction, right?

    How does the Mill’s IPC mechanism work around this?

    Thanks!

    • Ivan Godard
      Keymaster
      Post count: 689

      The Mill call instructions take a static argument with the expected result count. For encoding efficiency reasons the common cases of no result or one result are special-cased, giving three opcodes: call0, call1, and calln. The hardware matches the number of results in the retn instruction with the expected count, and faults on a mismatch; the fault follows the standard fault sequence.
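
      The count check can be modelled in Python as below. This is a sketch of the behaviour only: the callee is assumed to return its belt results as a tuple, `ResultCountFault` stands in for the hardware fault, and the three helper functions merely borrow the opcode names.

```python
class ResultCountFault(Exception):
    """Stands in for the hardware fault raised on a result-count mismatch."""


def calln(function, expected_results, *args):
    """Run the callee, then compare its result count with the static count."""
    results = function(*args)   # modelled callee returns a tuple of belt results
    if len(results) != expected_results:
        raise ResultCountFault(
            f"expected {expected_results} results, got {len(results)}")
    return results


def call0(function, *args):
    """Special case: the caller expects no results."""
    calln(function, 0, *args)


def call1(function, *args):
    """Special case: the caller expects exactly one result."""
    return calln(function, 1, *args)[0]
```

      An untrusted callee that returns two values to a `call1` site raises the fault instead of silently dropping an extra value onto the caller's belt.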

      Only the count is checked, and only for belt results. You can still shoot yourself in the foot by returning a value of an unexpected width – say a short when a double is expected. However, it is also possible to write width-agnostic code in some cases, because each datum carries its width in metadata. The “sig(..)” instruction can be used to check the width without disturbing the belt.

      Arguments and results that are too big or too many to fit on the belt are passed in memory. Memory carries no metadata so it is not possible to verify the count and widths of memory operands. Memory bounds checking provides a limited form of sanity checking, but in general it is up to software to verify that the content of the memory arg/result area makes sense.
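
      Software-side validation of a memory result area could look like this Python sketch. The layout is purely an assumption made for the example (a small header holding an item count and an item width, followed by the items); the post does not define any such format, and `parse_results` is a hypothetical helper.

```python
import struct

# Hypothetical layout: little-endian item count and item width in bytes.
HEADER = struct.Struct("<HH")


def parse_results(area: bytes, expected_count: int, expected_width: int):
    """Validate a result area written by an untrusted callee, then split it."""
    if len(area) < HEADER.size:
        raise ValueError("result area too small for header")
    count, width = HEADER.unpack_from(area)
    if (count, width) != (expected_count, expected_width):
        raise ValueError(f"unexpected shape: {count} items of {width} bytes")
    needed = HEADER.size + count * width
    if len(area) < needed:
        raise ValueError("result area shorter than its header claims")
    start = HEADER.size
    return [area[start + i * width:start + (i + 1) * width]
            for i in range(count)]
```

      Every field the callee controls is checked against what the caller expects before any item is read, since memory itself carries no metadata to do this for you.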
