Forum Replies Created

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Instruction Encoding #399

    If you need to keep a belt value over a loop, run it out to the scratchpad before the loop ("spill" op) and bring it back when you next need it ("fill" op). This is described in the Belt talk.
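
    A sketch of the shape of that; the operand forms for spill and fill here are illustrative only, not the real assembler syntax, and the scratchpad addressing is elided:

    spill(<x>);              // before the loop: copy x from the belt to the scratchpad
    L("loop");
       ...                   // loop body; x would otherwise fall off the belt here
       brtr(<cond>, "loop");
    fill(<x>);               // after the loop: the saved value drops back onto the belt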

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Synchronization #359

    If you mean atomic primitives, the Mill uses optimistic concurrency based on the top-level cache, sometimes misnamed "transactional memory" and quite similar to operations in recent Intel and IBM architectures. There are no read-modify-write primitives; pessimistic things like compare-and-swap are built from the optimistic primitives.
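
    As a sketch of how a pessimistic primitive could be built from optimistic ones: tx_begin/tx_commit/tx_abort below are hypothetical stand-ins, used the way HTM-style begin/commit work, and are not the Mill's real operation names.

    #include <stdbool.h>

    /* Hypothetical intrinsics -- not real Mill or compiler builtins. A failed
       commit is assumed to abort back to tx_begin, which then returns false. */
    extern bool tx_begin(void);
    extern void tx_commit(void);
    extern void tx_abort(void);

    static bool compare_and_swap(int *addr, int expected, int desired)
    {
        for (;;) {
            if (tx_begin()) {
                if (*addr != expected) {   /* monitored read               */
                    tx_abort();
                    return false;          /* value has already changed    */
                }
                *addr = desired;           /* speculative write            */
                tx_commit();               /* succeeds if no conflict seen */
                return true;
            }
            /* region aborted by a conflicting access: retry from the top */
        }
    }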

    If you mean actual thread operations such as spawn and visit, then sorry, NYF.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Many core mill (GPU) #354

    A massively parallel Mill is possible, but not a GPU (which are very different architectures). Think Sun’s Niagara or Tilera. Currently most of that market seems to be 32-bit, and Mills would be overkill, but it may evolve up into our range.

    GPUs are wavefront machines, lacking real branches and recursion. The Mill would do a good job of software graphics for business-type machines, but for game-type horsepower you want a GPU handling the triangles.

    Or so we expect; we won't know until we see what people use them for 🙂

  • Ivan Godard
    Keymaster
    Post count: 689

    I posted some code to the "Application walkthrough" topic, but the forum software squeezed all multi-blank sequences to a single blank, destroying the formatting. If there is an option to preserve whitespace, please set it, or post to that topic explaining how to work around the problem.

  • Ivan Godard
    Keymaster
    Post count: 689

    This is a reply to post #407

  • Ivan Godard
    Keymaster
    Post count: 689

    And this is what I get if I click “reply” on post #406

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Instruction Encoding #405

    How do you break out of a loop? Execute a branch op. (I think I’m not understanding the question).

    What happens if two transfers would retire in the same cycle? First Winner Rule applies. FWR will be covered in the 2/5 talk (I think), but short story: there is a canonical ordering that both hardware and code generators (and asm programmers) use.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Instruction Encoding #393

    Branch operations can contain an explicit delay, the way loads can; a delay of zero may be omitted, indicating an immediate branch, which takes effect in the following cycle, as described. A delay of one takes effect in the cycle after that, and so on. As a result, there may be several branches in flight at the same time. All of these will take effect, each in its respective cycle.
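
    For concreteness, the two forms might be written like this (the way the delay is expressed here is illustrative only):

    br("target");        // immediate: transfer takes effect in the following cycle
    br("target", 2);     // deferred: takes effect two cycles after the immediate form would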

    Instructions may contain several different branch operations. For sensible code, these will normally use different predicates and there will normally be at most one unconditional branch. All predicates are evaluated, and any branches with unsatisfied predicates are ignored.

    For any given cycle, there will be a (possibly empty) set of branches which are due to retire in that cycle; some with zero delay from the current instruction, and some from prior instructions that are at last timing out. One of these retiring branches wins, according to the First Winner Rule. FWR says that shorter delay beats longer delay, and for equal delay higher slot number beats lower slot number. Operations in asm are packed into slots in reverse textual order, so slot zero is on the right of the instruction as written. Consequently you can read the code and the textually first branch that is satisfied will be the one taken; hence the First Winner Rule.

    As an example:

    if (a == 0) goto A0;
    else if (a == 1) goto A1;
    else goto Arest;

    codes as

    eql(<a>, 0), eql(<a>, 1);
    brtr(b0, "A0"), brtr(b1, "A1"), br("Arest");

    With phasing (due in the 2/5 talk) the code can be reduced to a single instruction rather than the two shown here. I use goto for simplicity here, but the same structure is used for any branches.

    Loops are no different:
    while (--i != 0) { a += a; }
    can be encoded as:

    L("loop");
    sub(<i>, 1);
    eql(<sub>, 0);
    brtr(<eql>, "xit");
    add(<a>, <a>), br("loop")

    but would be better coded as:

    L("loop");
    sub(<i>, 1);
    eql(<sub>, 0);
    add(<a>, <a>), brfl(<eql>, "loop");
    //falls through on exit

    Yes, the second form does the add an extra time on the last iteration, but the former value is still on the belt and is the correct value for “a” at the end. The belt permits optimizations like this that are not possible if the a+=a was updating a register. Combine this optimization with phasing and a NYF operation and the whole loop is one instruction.
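
    As a sketch of what that single-instruction form could look like ("brnzd" is a made-up stand-in for the NYF fused op, shown only to illustrate the shape):

    L("loop");
    add(<a>, <a>), brnzd(<i>, "loop");   // "brnzd": hypothetical fused
                                         // decrement-and-branch-if-not-zero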

    I’m not sure what a flat-out OOO superscalar would do with this code, but it clearly would not be better than the Mill 🙂

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: The Belt #402

    The operation that renumbers the belt is called “conform” and is a separate op from branches. It takes a belt number list, and those are moved to the front of the belt, in order, and the rest of the belt is invalidated. There is also a “rescue” operation that takes a similar list, but the rescued values are left on the belt in original belt order, not in list order. Obviously you can use conform to get the effect of rescue; rescue exists because it has a more compact encoding, for use when you don’t care what the order is so long as they don’t fall off.

    Any time you have a control-flow join point in which the in-bound arcs do not have congruent belts (such as at the top of a loop) then you pick one of the arcs, define its belt as canonical, and use conform on the other arcs to make everything congruent. For loops you generally want to define the most frequently taken backwards branch as the canonical one; that eliminates the conform that would be most frequently executed.
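
    A sketch of the shape of that (the operand forms for conform are illustrative only):

    // infrequent entry arc: reorder its belt to the canonical layout, then join
    conform(<i>, <a>), br("loop");

    L("loop");                    // canonical belt defined by the back edge below
       ...
       brtr(<cond>, "loop");      // the frequent arc needs no conform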

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: The Belt #401

    In reply to Jonathan (#397): Yes, regular branches have no belt consequences. Of course, the predicate for a conditional branch must be on the belt, but that's an argument to the branch, not an effect of the branch itself.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: The Belt #400

    Check out the scratchpad, described in the Belt talk. It’s where you put stuff that you don’t want to lose.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Instruction Encoding #389

    There are no jumps “within” an ebb; by definition ebbs can only be entered at the entry point.

    Branch and call operations are flow side and (ignoring code pointers) contain a manifest signed 32-bit offset that uses normal flow-side manifest encoding (and so can occupy from zero to 32 bits in the encoding, depending on value). The offset is relative to the entry point address of the ebb containing the transfer operation.

    Transfers using a code pointer take the pointer from the belt and transfer to the indicated address; there is no offset.
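
    Schematically (the pointer-taking form shown here is illustrative, not a published mnemonic):

    br("foo");      // manifest offset, relative to this ebb's entry point
    br(<fptr>);     // code pointer taken from the belt; no offset involved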

    Ignoring misprediction, latency of all transfer operations is effectively zero; the cycle after the instruction that contained the transfer will issue the instruction at the target address. Thus a one-instruction loop (a common situation on wider Mills) executes an iteration every cycle.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Memory #386

    I read through the Peyton Jones paper; I'm not sure whether I had looked at it before, if ever, though I had read the Miranda paper.

    Some general comments:
    1) The Mill design has generally been concerned with commercially important languages, and we haven't put much thought into support for esoterica, which Haskell still is 🙂 However, we do want to provide such support; we just don't know enough about the issues and details at this point.

    2) The scheme of using global state over function calls as a way to do tail calls doesn’t work on a Mill; call has much more complex semantics than just changing the code pointer. However, the Mill supports label variables and indirect branches, which should provide a satisfactory substitute.

    3) The data structure in which a pointer to a closure leads to a pointer to an info record, which in turn leads to a pointer to the code, would be very painful on any modern machine, Mill included. This suggests that a closure should contain both the code and the arguments, so that it can be entered with a single indirection, which is free if the closure pointer is already on the belt. However, there are issues with mixing code and data, because the Mill uses a Harvard architecture with separate i$ and d$. There are also issues with protection; the PLB has byte granularity, but you *really* don't want a PLB entry for every closure! As a naive alternative, the closure could contain the bindings, an info pointer, and also a code pointer, eliminating one but not both indirections (a sketch of this layout, and of the pair representation in point 4, appears after this list). I have to think about what the predictor consequences of that would be.

    4) One possibility is to use the function pointer representation of lexically nested functions as the representation of a closure pointer. That is, a closure pointer is the pair {code pointer, binding pointer}, as is used e.g. for Pascal and other languages with lexical nesting. We expect this form to be used for lambdas in C++. The drawback is that function pointers become 128 bits, which is OK on members with quad but not so good on smaller members.

    5) An alternative is to generate a thunk for each lambda or statically nested function, where the thunk loads a static pointer to the binding (or outer frame) and then jumps to the actual code. This approach is used by gcc for lambdas I think.

    6) Both of the above have issues with where to put the data when there's no GC; that's not relevant for Haskell and friends, but very relevant for C++.

    7) The natural place for the Node register is inpReg. There are interesting issues when the closure is in a different protection domain (“turf”) than the caller, which inpReg helps with (sorry, NYF). In particular, the caller must not be able to muck with the closure data before or during the call. This again suggests a thunk-like approach, but again the PLB can’t tolerate one thunk per closure-instance and even one per bind-clause may have performance issues.
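
    To make the layouts in (3) and (4) concrete, here is a minimal C sketch; all of the names are illustrative only, and nothing here is defined by the Mill tool chain:

    #include <stdint.h>

    struct info;   /* metadata record; its contents don't matter for the layout */

    /* Point 3, naive alternative: the closure holds its bindings, an info
       pointer, and a code pointer, so the code pointer comes straight from
       the closure rather than via the info record (one indirection removed,
       one remaining). */
    struct closure_flat {
        void (*code)(struct closure_flat *self);
        const struct info *info;
        uintptr_t bindings[];          /* captured values follow in line */
    };

    /* Point 4: the closure "pointer" is itself the pair
       {code pointer, binding pointer}, as for lexically nested functions in
       Pascal-family languages; two pointers wide (128 bits with 64-bit
       pointers). */
    struct closure_fat {
        void (*code)(void *bindings);
        void *bindings;
    };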

    I’m probably showing my ignorance here 🙁 We will want to revisit this when the full machine is public.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Memory #362

    A “display” is a mechanism to implement lexically nested functions.

    Are you thinking about closures created by a JIT, where the code could be anywhere and the code and data may be moved by the GC? Or compiled closures like lambdas in C++?

  • Ivan Godard
    Keymaster
    Post count: 689

    Yep, I hadn’t noticed the “code” button 🙁

    The code was hand written; as you see, Mill asm is distinctly *not* humane.

    The ability to write relAsm (which is what we have called specializer input) was initially supposed to exist, but was dropped when we realized that there was no need for it and great cost, if it could be done at all; writing graph structures is no friendlier than absAsm (what the assembler accepts).

    Dave has long advocated a form of absAsm that at least calculates the belt for you, and he even has a personal tool that sort of does that, which he used for some benchmarks. While there are issues that so far have no solutions, it looks like it could be done and would make asm easier. It remains a someday project, awaiting a volunteer and progress on higher-priority projects.

    Unfortunately the entire operation set contains a few NYFs, so we can’t publish it yet. However, it should be possible to write at least a few of the samples using only the known operations, or pseudo-ops standing in for ops that you don’t know the mnemonic for yet. The format I used on the slides omits belt numbers in favor of a comment indicating what was being referenced. Thus the first gcd would be:

    /* code based on Rosetta C++ example:
       int gcd(int u, int v) {
       return (v != 0)?gcd(v, u%v):u;
       } */
    F("gcd"); 
       neqs({v}, 0), rems({u}, {v});
       retnfl({neqs}, {u});
       nop(4);           // wait for rem
       call1("gcd", {v}, {rems});
       retn({call});
    

    Would that help?
