Forum Replies Created

Viewing 15 posts - 526 through 540 (of 674 total)
  • Author
    Posts
  • Ivan Godard
    Keymaster
    Post count: 689

    Different efficiencies for different parts. vs. Haswell:
    functional units: the same
    caches: the double I$1 makes mill a bit bigger there, but same elsewhere for equal cache sizes;
    belt/bypass: same for equal numbers of pipelines
    decode: Mill much smaller for equal issue rate
    read/write buffers: don’t exist on Mill.
    rename and general registers: don’t exist on a Mill.
    prediction: unclear because approach taken is so different; wild guess: equal area consumed, Mill advantage taken as better miss rate rather than reduced area
    controller: equal for eqaual memory organization
    coherence: Mill much smaller
    spiller: does not exist on Haswell

    Wild guess: for equal performance, process and clock a Mill core will be 4x smaller than Haswell, while the caches and controller will be the same size.

  • Ivan Godard
    Keymaster
    Post count: 689

    This sounds like a plausible approach, at least for the start. It is certainly not a Mill, but we don’t mind if you describe it as “inspired by the Mill architecture” 🙂 Please let the Forum know how you progress.

    BTW, a personal bias in designing and implementing language front-ends: I never use a separate parser. Instead I write a recursive-descent LL1 recognizer that does semantics (and frequently at least the front of optimization and code generation) on the fly. I find that this approach is significantly cleaner and easier to maintain and extend than separate parsets (especially those produced by parser-generators), and give much better error messages for the user. YMMV 🙂

  • Ivan Godard
    Keymaster
    Post count: 689

    Mill code is member-dependent at the encoded bit level. In fact, it is even slot-dependent: the exact same add operation encodes differently if it is the first, second, or Nth operation in an instruction (usually, anyway, depending on what operations are supported in the different slots). Because operations that are not present cannot be encoded, the Mill encoding is very dense and compact,

    However, Mill code is member-independent at the load-module level. Compiler output does express an abstract form of the program, just as you suggest. That abstract code is then specialized to the concrete binary of the actual Mill member, and the specialized code is than cached back into the load module so that specialization only need be done once per different Mill that will run on.

    Typically code is specialized at install-time, although it may be done at load time or (to build ROMs and a few other purposes) as a manual command. Specialization is very fast, because its job is roughly what is done in the backend of a conventional compiler. A few other tools that work with concrete binaries also are bit-level Mill dependent (a debugger for example) so we’ve put the concrete-to-abstract and abstract-to-concrete conversion functions in a library for application use if desired.

  • Ivan Godard
    Keymaster
    Post count: 689

    Besides what has been revealed in the talks already, we are actively working to extend and improve the Mill security model. We know that in the long run it will be a Good Thing.

    But frankly, we are also worried that in the short run it may not be a Good Thing at all. Far too many programs are out there that are full of security holes that nobody has fallen into yet, or at least they aren’t talking. If the Mill has always-on security (and we feel that security you can shut off isn’t security) then programs will fail on the Mill that “work” on other machines. And Mr. Clueless will be all over the web, posting that Mills are broken because his (obviously bug-free) program works fine on other chips. We wouldn’t survive a reputation for flakiness, deserved or not.

    If you think these worries are far-fetched, all I can say is that I remember the transition to MMUs, and hearing people file reports about failing hardware, with comments like “What is this crap about segv?”.

    There is a rule of thumb in business, and Mill Computing is a business after all, that for the first sale you must give the user what he thinks he wants, while for the second sale you must have given him (in the first sale) what he actually needed. That makes for subtle design issues. We’ll do our best, and I hope you will be around to balance Mr. Clueless and his rants.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Security #966

    The following was from James Babcock and sent to me directly; repeated here for wider comment:
    \

    In the security talk, you said that the Mill will not generally have
    high-granularity entries in its PLB, for performance reasons, but I
    don’t think you said anything either way about the TLB. Will the Mill
    support fine-slicing of address spaces in the TLB? If so, how much do
    slices cost, and if not would it be feasible to add? I ask mainly
    because a finely-sliced address space in the TLB, combined with some
    memory-allocator tricks, could solve the use-after-free security
    problem, which Mill has not yet proposed a solution for.

    The essence of the fix is separating reuse-of-address-space from
    reuse-of-memory, and not reusing the address space of freed objects
    for as long as possible. If it were cheap to reclaim memory without
    having to reuse associated address space, for objects sized and
    aligned to >=32 bytes but smaller than a traditional 4kb page, then
    the use-after-free problem would be pretty much solved.

    • Ivan Godard
      Keymaster
      Post count: 689
      in reply to: Security #967

      The TLB supports scaled sizes, as many modern TLBs do, but is unusual in that the smallest size is one line rather than 4KB. However, one-line pages exist to permit on-th-fly allocation of backing DRAM when a dirty line is evicted from the last-level cache. The one-line pages are scavenged by a system process and consolidated into more conventionally-sized pages. If scavenging were not done then the number of one-line entries would grow to the point that the underlying page tables would occupy most of physical memory, which is rather pointless.

      There’s another difference between the PLB and the TLB structure: while the granularity of pages in the TLB varies, it is always a multiple of lines, whereas the PLB has byte granularity. Consequently you can protect a single byte (which is actually done, for example to protect an MMIO register for a device), but you can’t page one to disk.

      Moving on to your suggestion to use the TLB for checking for use-after-free. A crawling allocation only works if it has a crawling free, or you wind up with internal fragmentation and, again, a lot of tracking entries and poor performance. If you do have a crawling free, for example if you are using a copying garbage collector, then it becomes possible to protect the freed region. However, the TLB is probably not the right place to do it. That’s because on the Mill there is not necessarily any memory backing the address space – at all. Backless memory lets data have an existence only in cache, and is a significant performance win for transient data. The problem is that the hardware does not know that the software considered the data to be free, so if it is dirty (which it necessarily will be), it will eventually age out of cache, be evicted, and have physical DRAM allocated for it, all pointlessly.

      The hardware handles the case of transient data in the stack, because allocated frames are automatically freed on return, and the whole call/return protocol is built into the hardware so the hardware knows to discard the lines covering the exited frame. For random memory it doesn’t know about free unless the software tells it.

      The good news is that the software is able to tell it. There are several cache management operations in the Mill that let software convey useful information to the cache. One of those operations tells it that an address range is dead, and lines covering that range can be discarded, nominally dirty or not. However, these ops do not recover allocated backing DRAM if it exists, because tracking DRAM is too complex for hardware and the OS must get involved.

      And here’s where the TLB gets back in the act. Once the crawling free has freed a whole page, and discarded the lines in cache, it is possible to remap the page in the TLB from physical memory to an error trap. There’s your use-after-free case. However, I don’t think that it is quite what you were hoping for, because it doesn’t protect free spaces in fragmented memory, and it only protects a page granularity. Hence there’s a window of vulnerability while waiting for a partially free page to become wholly free.

      I’m not sure I have answered your question in this; feel free to expand or refine it here.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Security #1016

    You assume correctly; you can pass any subset of the rights you have, and the “r” is explicitly not passing “w”.

    About the arena: the concern is, if the arena is passed with “w” rights and contains both data and administration (which is the way many heaps are organized) the the recipient can tromp on the administration. However, if the administration is disjoint from the contained data (which is the way some heaps are organized) then only the data and not the administration is exposed. Of course, if the callee needs to add or remove nodes in the arena, the latter approach would require also giving the callee portals to the client routines that do the arena malloc and free, because the service would not be able to play with the administration directly itself.

    It’s clear that the Mill primitives can be used in many different ways to support many different models, and we expect the users and the field as a whole to spend happy hours exploring the possibilities and discovering what policies and protocols provide the best of both function and performance. Have fun 🙂

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Security #998

    We believe that the width of the Mill (on the sorts of family members one would expect to use for sim) makes a software implementation practical for low-grade RNGs like a LFSR. As mentioned, we expect to use an i/o like interface for a crypto box on those members that have one. The purpose of the hardware random seed generator that I mentioned is not for use as a RNG – not enough entropy even for low grade – but to provide a physics-based starting point that cannot be predicted even if an attacker has all the code and the machine circuits to study.

    Whether the seed generator uses oscillators, back-biased diodes, or the howling of cats is up to the hardware guys and likely will vary by member. 🙂

    However, all this projection is very premature, and will no doubt be changed as we get closer to actual market.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Security #996

    Your understanding is correct.

    There are further subtleties that make me say “like an i/o device”. For testing, programs often need to have repeatable random number sequences, so that it is important that a program be able to glom onto a RNG and take all the numbers generated, without some other thread or process grabbing one in the middle of the sequence. That is, the RNG needs to be (able to be) single-user exclusive use. Most of the other bulk-algorithm hardware has the same characteristic: you don’t want someone else getting the result of the MD4 that you started, and so on.

    This exclusive use capability is characteristic of many i/o devices. Even those we often think of as having multiple concurrent users (a disk for example) actually has only one user (the driver) when you look closely at the hardware.

    In contrast, operations are usable by all, and the hardware does the save-restore that gives the illusion that everybody has their own belt. It is impractical for hardware to provide the illusion that everyone has their own RNG. So the Mill provides an interface that supports exclusive use (using MMIO and the standard memory protection system) and asynchronous (or semi-synchronous) access.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Security #994

    The issue with a RNG and other hardware of similar character is not the bandwidth, it’s the latency: how long between first request and first result. That needs to be no more than 4-5 cycles, and uniformly consistent for every use. Otherwise the hardware can exist, but it needs to be accessed in a different way or the whole machine suffers whether the RNG is used or not. So the right place for a RNG is as a side hunk of hardware, not reached by the usual operator interface. The timing, bot latency and bandwidth, would be as good as if it were an operation, but without the penalty to machine performance of other code.

    Similarly, the belt cannot be indexed or have an address taken. Again, even the possibility of doing do would cut everybody’s performance 2X or more, whether they used the facility or not. So this is potentially another hunk of out-of-line hardware for users with special needs. We’ve identified several others of the sort: large look-up tables, Galois Field operations, etc. They would all use the same interface.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Security #992

    As with other very-long-latency operations, RNG does not fit well as an opcode. There is also the problem that RNG algorithms may be expected to improve with time. Consequently we see such a box to be optional per-member, and attached through a different interface that is not fixed-latency like regular ops.

    We do plan to provide a physical seed operation. This is not a RNG, but will provide a few bits of completely unpredictable entropy to start things off.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Execution #982

    As far as the specializer is concerned, a call is just a zero-latency op. F(G(), x*y) has the same constraints as F(a+b, x*y). It can schedule the {call G, add} any time in relation to the {mil}, but both have to be done before {call F} can issue. Becuase it knows all the latencies, it tries to have everything ready to go and already dropped to the belt when it is needed. In the case of the nested call, recall that even with cascading both calls are in the same instruction, and multiply drops between readerPhase and opPhase, which is before callPhase. So for best speed, the mul must retire before either call, cascaded or not. Remember, both calls happen in one of the consecutive call phases of the single instruction; unlike other phases, there can be several call phases.

    It doesn’t matter how long either function takes to execute; for scheduling, all calls take zero cycles, i.e. they complete in the same instruction that they issue in.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Execution #979

    Short answer: it’s all done in the naming.

    Longer answer: the decoders map logical belt numbers in the code to physical belt numbers in the execution engine. There are twice as many physical numbers as the logical length of the belt, to accommodate drops that occur during phasing.

    In your example of G cascading to F, say that G has a single result. Consequently the pre-existing belt value that would be b3 as an argument to G will be b4 as an argument to F (you see why hand-coding a Mill is not recommended 🙂 ). The decode hardware knows that G will drop one result (it’s in the encoding of the call op), so it adds one to the running logical-to-physical bias that it uses for the call of G to get the bias for the mapping for F. For a 16-long belt that’s only a 5-bit add, so speed is not an issue.

    When a call returns, the physical belt is the same as it was before the call courtesy the spiller; a previously unoccupied physical position number is now filled with the result, but no existing values change physical numbers. The arguments of the second call are then picked up by their physical numbers (already available due to the mapping in the decoder), and copies made in free buffers. The copies are assigned physical numbers with the new frame number. Bump the frame number, and you’re good to go.

    That’s what happens without cascading. It should be clear that it’s not actually necessary to reify the whole belt. All that is really needed is to assign the right physical number to the result of the first call, and to create copies of second-call arguments in the spiller with suitable numbers.

    The spiller doesn’t really have time to do that if there’s no cycle between return and next call and it doesn’t get the I-need-this-list for the arguments until after the return. But the decoder already has the I-need-this list for the second call even before the instruction is issued, so it just gives the list to the spiller to save along with the rest at the first call. The spiller can then just do the copies as soon as it knows the return is happening.

    There is actually a problem even with the advance notice if the return is conditional, because predicates are not evaluated until writer phase in the next cycle, which with cascading is actually the first cycle of the second-called function. We can go with the prediction and unwind if it is wrong, or simply not cascade conditional returns; its an implementation choice in the hardware, and members will likely differ.

    We also have a way to do call arguments without copying, but it’s NYF.

    Now for the rest of your question:

    1) The return in a cascade goes direct to the next called function, delta the conditional-return case above.

    2) Called functions start with no data stack frame; they must explicitly allocate one with stackf if they need it. Return does automatic cutback though. The called belt is created as described above – a call with no arguments need only bump the frame number and presto – new empty belt.

    The return op itself has no understanding of what it is returning to. It does have an argument list, encoded just like the list of a call, that gives the belt positions of the result(s). Necessarily these will become b0… in the caller (returns drop at the front as usual), so all it needs is the mapping bias and the frame number of the returned-to frame. Change the physical belt number and frame of those values and the result handling is done. The bias and frame have been saved in the spiller. Hardware implementations will vary, but I expect that the result handling is done by the part of the spiller that does cascading, and the return FU deals only with getting the address and predicate right and dealing with First Winner Rule.

    • This reply was modified 10 years, 9 months ago by  Ivan Godard.
  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Execution #972

    To expand a bit on Will’s answer, and introduce some terminology for the glossary (you there, Mill-fan-registered-as-“imbecile”?).

    CPUs that do not expose the pipeline go to great lengths to support the illusion that instructions are executed one-by-one to completion. They must be able to unwind absolutely any action, completed or not, in the event of an interrupt or other event that lets the program see state. The unwind goes back to just before the first incomplete (unretired) operation; everything after that, done or not, is discarded and has to be done over. This approach is called execution replay (or sometimes issue replay.

    No modern machine actually has one-at-a-time execution internally, and an exposed pipeline machine like the Mill does not even pretend that it does. IBM 7094 semantics are so 1960, after all. Instead, we let all operations run to completion, save the results, and on restart we inject the saved results with the same timing and semantics that would have occurred had the interrupt not happened. This approach is called result replay.

    Result replay has real advantages over issue replay: you never have to recompute things. Moreover, while you have to save the in-flight state if an interrupt happens, you don’t have to save all the state all the time in case an interrupt might happen. And the state is a lot smaller, too.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Security #969

    All of the finer-granularity solutions that we are aware of, where the granularity is fine enough to be used for red zones and to cover padding bytes in structures, have implementation and use costs that would restrict them to high-end machines that use (and pay for) non-standard DRAM configurations such as those needed for ECC. I could be considered if we enter the main-frame business.

    Line granularity poisoning (as opposed to silent zeroing) is possible at an area, bandwidth and performance cost of a few percent. Line granularity is sufficient to detect use-after-free, but not for use as red zones.

    All such schemes have a UI aspect too: the Mill is intended to runn programs straight off the net without rewrite. When there is hardware that detects use-after-free for example, we have to be wary of the reputational damage that may happen when the faulted user howls “But it works on an x86!”. We could easily be blamed for program bugs that other machines ignore. Sad, but human nature 🙂

Viewing 15 posts - 526 through 540 (of 674 total)