Forum Replies Created

Viewing 15 posts - 91 through 105 (of 674 total)
  • Ivan Godard
    Keymaster
    Post count: 689

    If the fabs can build a 1.2M chip for one company then they can and will for all; the ISA question is whether you can use those transistors. In the CPU space you cannot; you are pin-pad limited on the wafer, and more transistors are useless except for heating office buildings. You can use them in some kinds of AI work, and likely also in rare embedded applications, data reduction in your friendly local SuperCollider for example. They won’t run Windows, and the control processors they need (and for which you can see them as peculiar I/O devices) can be Mills.

    We still don’t see any prospect of a general change upcoming in the CPU ISA space. Except us. Thanks for your support; we will need it especially as we scale up, which was supposed to have happened this year, but 2020 🙁

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Multi core machine? #3619

    Google is your friend 🙂

    Try threading, multithread, multicore, cache coherence… as starting keys at Wikipedia.

  • Ivan Godard
    Keymaster
    Post count: 689

    Mill portals are not IPC (although they can be the entryway to IPC); they are secure service calls, and you remain in the same thread (though with different permissions) both in and out.

    A trusted IPC service can define get and put portal functions and do the argument copying between threads, if that is what you mean by IPC. Unfortunately there is no standard semantics for IPC, so each system has to define its own semantics in terms of the underlying Mill primitives, and cross-system communication is likely to be questionable.

    Revoke is a classic problem in all permission systems. As currently defined, Mill transient grants cannot be revoked, and persistent grants can be revoked by diddling the PLB. However, there are a host of semantic issues, especially if grants can be re-granted.

  • Ivan Godard
    Keymaster
    Post count: 689

    I haven’t seen any technical material on their chip, nothing but market puffery, so it’s hard to say how much is real. It’s good that they are exploring translation; it makes the notion more believable to potential customers. We can assume that their translation will be about as good as ours, so the competitive situation will boil down to the native ISA underneath. We’re not worried 🙂

  • Ivan Godard
    Keymaster
    Post count: 689

    There will be PCIe controllers in the initial configurations, and typically in most others as well; our config software and process lets the uncore be configured to suit the application, and not all applications need them. We do not anticipate configuring any graphics capability directly on the initial chips; it’s not our expertise. Later on, possibly, if resources permit and markets demand.

    We expect to port software support for most industry-standard interfaces, likely including OpenGL. As for others, it’s too soon to sketch out a path.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: The Belt #3658

    The separate conform op is no more; branches can now list the things they want preserved on the belt, in the order they want them. So branches now look like function calls, with belt arguments. The phi ops in the genAsm input to the specializer are removed during input, and replaced by arg lists on the exit arcs of blocks. Each arc is a go-to, while phis are come-from.
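
    A minimal sketch of the phi-to-argument-list rewrite described above. It assumes a hypothetical mini-IR (a dict of blocks, each with optional phis); it is not the real genAsm format or the specializer's code. Each phi destination becomes a parameter of its block, and every incoming arc gets a matching ordered argument list.

```python
# Hypothetical mini-IR for illustration only (not genAsm).
# block["phis"] maps a phi destination to {predecessor block: incoming value}.

def phis_to_branch_args(blocks):
    """Replace come-from phis with go-to argument lists on the arcs.

    Returns (params, arc_args):
      params[blk]            -> ordered values blk expects on entry
      arc_args[(pred, blk)]  -> values the branch pred->blk passes, same order
    """
    params = {}
    arc_args = {}
    for name, block in blocks.items():
        phis = block.get("phis", {})
        # dicts keep insertion order, so this fixes one order for all arcs
        params[name] = list(phis)
        for dest, sources in phis.items():
            for pred, value in sources.items():
                arc_args.setdefault((pred, name), []).append(value)
    return params, arc_args


blocks = {
    "entry": {},
    "loop": {
        "phis": {
            "i": {"entry": "zero", "loop": "i_next"},
            "acc": {"entry": "one", "loop": "acc_next"},
        },
    },
    "exit": {},
}

params, arc_args = phis_to_branch_args(blocks)
# params["loop"]               == ["i", "acc"]
# arc_args[("entry", "loop")]  == ["zero", "one"]
# arc_args[("loop", "loop")]   == ["i_next", "acc_next"]
```

    Because every arc into a block lists its values in the block's parameter order, the arcs agree on order by construction, much as call sites agree on argument order.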

  • Ivan Godard
    Keymaster
    Post count: 689

    As far as I know TRIPS was abandoned.

    The problem with dataflow architectures is the interconnect when a random producer must be connected to a random consumer. That’s true for wide non-dataflow too: the crossbar that feeds belt to FU is the clock constrainer on a Mill. Yes, Tachyum can package instructions into call-like blocks, but it’s rare for the blocks to be more than a few instructions. That’s because most code references globals frequently, and those references must preserve order because the languages don’t provide reliable ordering semantics. So one can turn a whole sqrt routine into one of their blocks, but trying to make a block out of a loop body, so that separate iterations can be spun out to their own core, is much harder. The analysis is the same one you have to do for auto-vectorization, and the result is not pretty, although it can look lovely on cherry-picked micro-benchmarks.

  • Ivan Godard
    Keymaster
    Post count: 689

    Interesting conversation; thank you all.

    There doesn’t seem to be anything in WASM (present and proposals I am aware of) that would be hindered in the Mill. In particular, translating to WASM from C/Rust/etc should work just like on any other machine. Naive JITing from WASM vm to Mill native should be slightly simpler on the Mill, because the stack maps naturally to the hardware belt. Non-naive JITing would be more complex on a Mill because the stack code would have to be first analyzed for inter-operation dependencies so that independent calculations can be scheduled to execute in parallel. That analysis would typically involve a translation to SSA-form, from which our existing schedulers can produce multi-instruction bundles. The benefit would be much better runtime at the cost of the extra analysis time, typical of JITs in general.
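
    A rough sketch of the non-naive path, assuming straight-line code and a tiny subset of WASM mnemonics. The function names and the tuple layout are made up for illustration, and the "level" computed here is only a dependency depth, not what the Mill schedulers actually do. The stack is executed symbolically so that each operation gets an SSA name; operations at the same depth do not depend on each other.

```python
# Illustrative only: a toy stack-to-SSA pass over straight-line code.
ARITY = {"i32.const": 0, "local.get": 0, "i32.add": 2, "i32.mul": 2}

def stack_to_ssa(code):
    """Symbolically run stack code; return (ssa, final_stack).

    Each SSA entry is (result_name, opcode, immediates, operand_names).
    """
    stack = []   # holds SSA names instead of runtime values
    ssa = []
    for op, *imm in code:
        n = ARITY[op]
        split = len(stack) - n
        if split < 0:
            raise ValueError(f"stack underflow at {op}")
        operands = tuple(stack[split:])   # pop n operands, oldest first
        del stack[split:]
        name = f"v{len(ssa)}"
        ssa.append((name, op, tuple(imm), operands))
        stack.append(name)
    return ssa, stack

def dependency_levels(ssa):
    """Depth of each value in the dependency graph (0 = no inputs)."""
    level = {}
    for name, _op, _imm, operands in ssa:
        level[name] = 1 + max((level[a] for a in operands), default=-1)
    return level

# (local0 + local1) * (local2 + local3)
code = [
    ("local.get", 0), ("local.get", 1), ("i32.add",),
    ("local.get", 2), ("local.get", 3), ("i32.add",),
    ("i32.mul",),
]
ssa, final_stack = stack_to_ssa(code)
levels = dependency_levels(ssa)
# The two adds (v2 and v5) both land at level 1, so they are independent.
```

    In the stack code the two adds are separated by other operations, but the SSA form makes it explicit that neither depends on the other.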

    Where x86 and other ISAs with page-level protection have to allocate a fixed page-granularity sandbox, on the Mill each WASM job would have a unique turf and would use the Mill native byte-granularity protection, and portal calls for access to libraries. This should economize on address space and TLB thrashing when compared to page-based systems. I would also expect that the Mill approach to shared libraries would make it more difficult for an attacker to leak out of its sandbox into the shared space.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Multi core machine? #3622

    You are right that such things are questions for the OS designers. As we are doing a reference kernel, that includes us 🙂

  • Ivan Godard
    Keymaster
    Post count: 689

    Yes, you can do a read-only persistent grant to data outside the frame, such as global or heap, when returning a large result. However, it will be difficult and complex to revert that grant at some future time; it’s not like returning an integer by copy. Having the caller transiently grant a return space to the callee removes this issue, but requires the caller to know how much space to grant.

    Language runtime systems can use the underlying hardware to provide copy semantics, but they must be able to track ownership. A single-owner language such as Rust could do this by creating an Ownership Manager that is the actual grantee of a new creation and which handles change of access and ownership. The performance would be similar to using a shared arena, but would have less need of trust between caller and callee.

  • Ivan Godard
    Keymaster
    Post count: 689

    Returning a dynamic object across a protection boundary is just as annoying as it is when caller and callee are in the same protection domain, and you can use the same methods to do it. Caller can create and grant-pass a result area for callee to fill. Callee can allocate a new object and explicitly grant it to caller. Caller and callee can share an allocation arena set up in advance and used for multiple calls. Callee can call a trusted registrar that receives and holds the object for caller to pick up later after callee returns. And so on.

  • Ivan Godard
    Keymaster
    Post count: 689

    Virtual Zero (for globals) and Implicit Zero (for stack) preclude rummaging in uninitialized rubble; you always get a zero. What you get from the heap is up to the heap allocator you use; that’s not under hardware control, although there are cache-control ops that make zeroing out an allocation particularly efficient, for those allocators that care.

    You will also get a zero for intra-struct and after-array padding. The bounds check is at power-of-two granularity. If your array is not power-of-two sized then there will be padding at the end to bring the allocation up to power-of-two. You can read (and write) into that padding, but will always read a zero (or whatever you wrote).
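
    A small model of the arithmetic, assuming sizes and offsets in bytes. The function names are invented for illustration, and this models only the observable behavior described above, not how the hardware encodes or performs the check.

```python
def round_up_pow2(size):
    """Smallest power of two >= size (size must be at least 1)."""
    if size < 1:
        raise ValueError("size must be positive")
    return 1 << (size - 1).bit_length()

def padding_bytes(size):
    """Padding added to bring an allocation up to a power of two."""
    return round_up_pow2(size) - size

def coarse_check_passes(offset, size):
    """Power-of-two granularity: anywhere in the rounded allocation is allowed."""
    return 0 <= offset < round_up_pow2(size)

def declared_check_passes(offset, size):
    """Exact check against the declared bound (the optional software check)."""
    return 0 <= offset < size

# A 24-byte array occupies a 32-byte allocation with 8 bytes of padding.
size = 24
in_padding = 30    # past the declared bound, inside the padding
wild = 40          # outside the allocation entirely
```

    With these numbers, offset 30 passes the coarse check but fails the declared-bound check, while offset 40 fails both.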

    This hardware check is not as comprehensive as a check against the declared bound, whatever it is. If you want the more complete check then you can turn on an option and will get the same kind of inline software checks as any other ISA, at similar cost in code space and power. You are right that the Mill width may save on the latency cost compared to other architectures, but wide superscalars with OOO can probably do as well for something like bounds checking, albeit at OOO power cost.

    Many apps find software checking too expensive for production code, and enable it only in debug builds. For them the hardware power-of-two check does provide complete wild-address and exploit security at no cost, even though it does not completely guarantee conformity with language semantics.

  • Ivan Godard
    Keymaster
    Post count: 689

    Thank you for the kudos.

    A detail: bounds checking on the Mill is free in both time and power. It is not free in space (allocations are rounded up to a power of two), and not free in aggravation for C users who routinely violate bounds.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: The Belt #3597

    One less than the whole belt. The extra position is needed for the predicate, if there is one.

    The ISA details in the Wiki had developed bitrot after Jan took a leave of absence. We now have a new team member who’s picking it back up again, but it’s not ready yet.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: The Belt #3595

    A belt needs to have congruence maintained wherever two or more control flows join. This includes call and return sites, and also branch-like control flow when a value dropped in one ebb is passed to and consumed by ops in a different ebb. The argument lists of call and return provide a ready-made way to enforce congruence: the arguments must be in the same order at each site, which is easy to enforce.

    The erstwhile “conforms” op rearranged the belt for branch join points. It was a separate op so that it could be omitted if by happenstance (or good scheduling) the live values were already congruent on the in-bound control flow arcs. It turns out that such natural congruence almost never happens, so we replaced the conforms op with a call-like argument list on the branches.

    Your suggestion to re-order call arguments to get them into belt lifetime order is valid, although the current specializer does not do it. In fact, it does no partial-parameterization yet at all: not just ordering, but also such things as folding common constant arguments into the body of the called functions. Some of that sort of thing is done by LLVM’s Link Time Optimization facility, and so will also be done for Mill targets. However, argument reordering is not done by LLVM, and we have no plans to add it to LLVM any time soon.
