• Author
    Posts
  • staff
    Keymaster
    Post count: 49
    #634

    Talk by Ivan Godard – 2014-02-05
    at Stanford University’s Computer Systems Colloquium (EE380).


    Slides: execution.02 (.pptx)

    Instruction execution on the Mill CPU:

    Working at Mach 3

    A perennial objection to wide-issue CPU architectures such as VLIWs and the Mill is that there is insufficient instruction-level parallelism (ILP) in programs to make effective use of the available functional width. While software pipelining can reveal large quantities of ILP in loops, in open (non-loop) code studies have found maximum ILP on the order of two instructions per cycle (IPC), well below the capacity of even conventional VLIWs, never mind super-wide architectures such as high-end Mills. The problem is that program instructions tend to form chains connected by data dependencies, which precludes executing them in parallel.

    This talk addresses the ILP issue, describing how the Mill is able to achieve much higher IPC even when the nominal ILP is relatively low. The Mill is able to execute as many as six chained dependent operations in a single cycle; open code IPC numbers typically exceed nominal ILP by a factor of three. The talk will show in detail how this is achieved, and why we chose “Mach 3” as the name of the mechanism. In the course of the explanation, the talk will also introduce other operations for which the semantics of Mill execution differs from that of conventional CPUs.

  • allpowerful32
    Participant
    Post count: 1

    In the talk you describe how the Mill does tail calls “in hardware, for free”. You use the example of F(G(a+b)), such that the two calls happen in one instruction, and that the call to G actually returns directly to F, not to the calling frame.

    I’m having a hard time fitting these with the model of tail calls that I’m familiar with. This is where calls in “tail position” (basically, being the last instruction before a return that’s either void, or returns the result of that call) are optimized such that the stack frame of the function is destroyed and execution jumps (rather than calling) to the function in tail call position.

    This is most visible in recursive functions, where tail call optimization can often turn a function that will quickly hit a stack overflow (i.e. one that uses stack space proportional to some non-constant function of the input) into a function that completes fine, having used a constant amount of stack space.

    As always, wikipedia explains it better: http://en.wikipedia.org/wiki/Tail_call.

    Could someone help me reconcile these two models? How do the Mill’s “tail calls” (forgive the quotes) relate (if at all) to tail call optimization, which is often vitally important in functional languages?

    • Will_Edwards
      Moderator
      Post count: 98

      True, the analogy doesn’t stand up to such scrutiny 🙂

      Your compiler is still responsible for Tail-Call-Optimisation to turn recursion into loops. The CPU does what it’s told.
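
      For illustration (plain C, nothing Mill-specific, and the function names are made up), this is roughly the transformation a TCO-capable compiler performs:

          /* Tail-recursive form: the recursive call is the last thing done. */
          long sum_rec(long n, long acc) {
              if (n == 0)
                  return acc;
              return sum_rec(n - 1, acc + n);   /* tail call */
          }

          /* What the compiler effectively emits after TCO: a loop that reuses
             the current frame instead of growing the stack. */
          long sum_loop(long n, long acc) {
              for (;;) {
                  if (n == 0)
                      return acc;
                  acc += n;
                  n -= 1;
              }
          }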

    • Ivan Godard
      Keymaster
      Post count: 689

      Yes, Mill cascade is not the same as recursive tail-call, although the benefits are somewhat similar.

  • jrfeenst
    Participant
    Post count: 4

    Does the mill support (arbitrary) vector element swizzling? I’m just wondering if the same functionality that enables free pick might also allow free swizzles. I could see how it might be machine dependent due to different vector sizes.

    • Ivan Godard
      Keymaster
      Post count: 689

      “permute” is an operation that supports general vector reorder, including element duplication. There are some specialized forms that do specific permutations at much smaller encoding cost.

    • Will_Edwards
      Moderator
      Post count: 98

      Does the mill support (arbitrary) vector element swizzling?

      Yes. There is a shuffle op for arbitrary rearrangements of a vector.

      I’m just wondering if the same functionality that enables free pick might also allow free swizzles.

      I believe it’s in the op phase.

      I could see how it might be machine dependent due to different vector sizes.

      Well, you can always use the remainder op to create a mask that you then pick against with 0s or Nones to create a partially filled vector. This was covered in the strcpy example in the Metadata talk and the Introduction to the Mill CPU Programming Model post.
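
      As a rough scalar model of that idea (plain C, not Mill code; None is approximated by zero here, and the vector width is a made-up constant):

          /* Toy model of remainder + pick: keep the first `valid` elements of
             `src`, fill the rest of the "vector" with zero. */
          enum { VEC = 8 };   /* element count is member-dependent */

          void partial_fill(const int src[VEC], int dst[VEC], int valid) {
              for (int i = 0; i < VEC; i++) {
                  int keep = (i < valid);        /* the "remainder" mask */
                  dst[i] = keep ? src[i] : 0;    /* the "pick" against zeros */
              }
          }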

      • Joe Taber
        Participant
        Post count: 25

        So shuffle can duplicate elements, can it replace elements with 0?

  • Ivan Godard
    Keymaster
    Post count: 689

    Nope, nor with None, which would be even more useful. Use “pick” rather than shuffle to insert specific values where needed.

    • Joe Taber
      Participant
      Post count: 25

      Since shuffle is op phase and the pick phase is after the op phase (iirc), could shuffle+pick on a vector happen in a single instruction and on literal 0 or None? (That is, no explicit 0 or None vector needs to take up a belt position.)

      I guess that would get what I was asking for, except you’d need a way to generate and park the vector to pick on, unless you can pick on a vector literal, if such a thing exists.

      • Ivan Godard
        Keymaster
        Post count: 689

        Yes, pick can select within the result of a shuffle in the same instruction.

        And yes, there are vector literals, but restricted ones. The general mechanism is to use the con operation (reader phase) to get a scalar literal, and then the splat operation to make a vector out of it. However, certain popular constants, scalar or vector, are available from a form of the rd operation (reader phase). These are called popCons, short for “popular constants”. The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.

        So for zero and None, assuming you have a bool mask to drive the selection, the code would be (for word element size):
        rd(wv(0)), shuffle(<data>, <indices>), pick(<mask>, <shuffled>, <rd>);
        in one instruction.
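
        To make the data flow concrete, here is a scalar C sketch of what that sequence computes (toy names and a fixed width, not Mill code):

            /* shuffle re-orders (and may duplicate) elements by index, then
               pick substitutes zero (from the popCon) wherever the mask is false. */
            enum { VEC = 4 };

            void shuffle_pick_zero(const int data[VEC], const int idx[VEC],
                                   const int mask[VEC], int out[VEC]) {
                for (int i = 0; i < VEC; i++) {
                    int shuffled = data[idx[i]];      /* shuffle(<data>, <indices>) */
                    out[i] = mask[i] ? shuffled : 0;  /* pick(<mask>, <shuffled>, <rd>) */
                }
            }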

  • imbecile
    Participant
    Post count: 48

    Just a quick question about multiple inputs: Are the argument lists of the call operation implemented in terms of the args operation used in ganging, or are they completely different?

    My guess would be different, because I suspect the call operation is on the branch/pointer side of the instruction streams and ganging on the computation side. If both sides had both it would be kind of redundant.

    • baking
      Participant
      Post count: 7

      Function arguments are specified the same way as for operations except you can have up to 16 of them. They also return any reasonable number of results just like operations. See the first talk on instruction decoding.

      The function gets a brand-new belt with all of its arguments loaded in order. I think that was described in more detail in the second talk.

      Edit: I’m not sure if I actually answered your question about the mechanics of passing those arguments to functions in the assembly language. I don’t know if that has been specifically mentioned.

      • imbecile
        Participant
        Post count: 48

        > except you can have up to 16 of them

        Yes, those previous talks are the reason I have this question. If you can already supply an instruction with an arbitrary number of inputs up to 16, why do you need a special mechanism for other instructions to provide more than 2 inputs?

        Could just be an encoding issue. The 2-operand case is by far the most common for all instructions except call. The new belt for the callee is invisible to the caller anyway, so from the caller’s belt perspective it should be irrelevant whether arguments are provided in an argument list or with additional arg operations. The caller doesn’t care whether the arguments are combined into a new belt in one case or into internal data paths in the other. It doesn’t see either. All it sees are the eventual results.

        • Ivan Godard
          Keymaster
          Post count: 689

          The two-argument constraint is on the exu encoding side and the computational operations in general. The encoding optimizes the two-argument case for the slots that encode those operations. The flow side has a completely different encoding because it needs to support large constants (for offsets and the like) and long argument lists.

          Call, return, conform, con, rescue, and some NYF ops use the flow big-constant mechanism for arguments, and call, branch, load, and store use it for address offsets.

          • baking
            Participant
            Post count: 7

            Found this useful explanation in the split-stream encoding white paper:

            if operations with similar encoding requirements are assigned to the same stream, then the bundle or operation formats can be tailored to each stream, saving instruction space and likely reducing decoding difficulty. For example, all operations with an effective-address parameter (including loads, stores, and branches) can be assigned to one of the streams, so the logic to decode an address is not needed in the other decoder.

      • Ivan Godard
        Keymaster
        Post count: 689

        Yes, except the number 16 is member dependent. One slot is used for the call proper, including its address offset (or function pointer position on the belt), a count of the expected number of results if more than one, and so on. There can be three belt numbers per slot, so any not needed for this are available for arguments. If there are still arguments then they are spread over additional slots, each of which gets the same three belt numbers plus 32 bits of manifest that is also used for belt numbers.

        If a belt number is 4 bits (common in members), and the first slot is completely used for administrivia, then the ganged slots add 11 more arguments each (three morsels plus 8×4 bits). However, it is not meaningful to pass more arguments than the callee has belt positions. A 4-bit belt number means the belt is 16 long, so at worst the call needs two slots for arguments and three in all for the maximal argument list.

        Similar reasoning determines the number of slots needed for other belt lengths. We currently have belt lengths of 8 and 32 defined as well as 16; it’s not clear whether a belt length of 64 is ever needed in a practical chip.
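
        Spelling out that arithmetic for the 4-bit-morsel case (just restating the numbers above):

            belt length          = 2^4 = 16 positions
            args per ganged slot = 3 morsels + 32/4 manifest morsels = 3 + 8 = 11
            max useful arguments = 16 (no point passing more than the callee's belt)
            slots needed         = 1 (call admin) + 2 (arguments) = 3 in all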

        • imbecile
          Participant
          Post count: 48

          Exactly the answer I was looking for 🙂

          tldr; arg-ganging is used to extend the argument list of any instruction that may need it, including call.

    • Ivan Godard
      Keymaster
      Post count: 689

      Ganging is supported on both sides. On the flow side, ganging is used for bulky arguments that don’t fit in the encoding of one slot. A Quad-precision pi constant, for example, needs three slots to supply 128 bits. This is also used for long argument lists for operations that use them, including call.

  • JonathanThompson
    Participant
    Post count: 11

    Having now watched the slideshow after watching the original live broadcast, I’m wondering if my theory is correct: with a 32-long belt, it appears for predicated branches (first true wins) that it’s possible to have a 16-way speculative branch, as there’s 1 load and 1 branch specifier for each case.

    Is this correct?

    • Ivan Godard
      Keymaster
      Post count: 689

      The value of N for N-way branches is determined by the number of branch functional units, not the number of operands on the belt. It is quite common for two branches to reference the same condition value, but with opposite sense:
      brtr(b0, <then>), brfl(b0, <else>);

      • JonathanThompson
        Participant
        Post count: 11

        Ok, I’m not surprised: I was very hopeful, that’s true 😉

        A great example where such multiway branches can be very useful can be found in the recursive-descent lexer/parser code generated by ANTLR, as it generates (depending on generation options) lots of switch/case situations, recursively.

        Now, I’m wondering: is there a way in a single instruction to encode a rather complex if/then/else where you have multiple values meet some conditions (say if(a==b&&c==D|d=a))? I’d expect it’d take 3 CPU cycles, due to phasing, and the practicality of data to load, etc., but can it be encoded and set off in a single instruction?

        Using the ganged operations in instructions, I’m suspecting that could be synthesized, but I’m not sure: that could really make code fly.

        • Ivan Godard
          Keymaster
          Post count: 689

          I think you meant:
          if (((a==b) && (c==D)) || (d==a))
          then that becomes:
          brtr(<d==a>, <then>), brfl(<a==b>, <else>), brtr(<c==D>, <then>), br(<else>);
          i.e. one instruction if you have four branch units.

          Note that this code assumes that reordering the arguments of a conjunction or disjunction is permitted, which is true in some languages. If not permitted (as in C), you would use:

              brfl(<a==b>, <disj>), brtr(<c==D>, <then>), br(<disj>);
          disj:
              brtr(<d==a>, <then>), br(<else>);

          This code ensures correct C semantics if one of the predicates is a NaR.
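
          For reference, the C source this corresponds to is roughly the following (hypothetical variable names):

              int cond(int a, int b, int c, int D, int d) {
                  /* C evaluates && and || left to right and short-circuits,
                     so (d == a) is only reached after (a == b), and possibly
                     (c == D), have been evaluated -- the order the branch
                     sequence above preserves. */
                  if ((a == b && c == D) || d == a)
                      return 1;   /* <then> */
                  else
                      return 0;   /* <else> */
              }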

  • PeterH
    Participant
    Post count: 41

    Is there a summary of opcodes posted somewhere we could reference?

    • Ivan Godard
      Keymaster
      Post count: 689

      Not yet; there are still a few ops that are NYF.

  • baking
    Participant
    Post count: 7

    The first talk on decoding instructions explains how instructions are split into two streams (Exucode and Flowcode) with each half-instruction (bundle) composed of a header and three blocks, for a total of six blocks per instruction. The most recent (sixth) talk on execution explains how these blocks are decoded and executed in phases. Together, they create a more complete picture of the Mill instruction set.

    In the exucode stream, the three blocks correlate directly to the functional pipelines. (Numbers correspond to the “Gold” member of the Mill family.)

    Block 1x: 8 read operations (fill from scratchpad*)
    Block 2x: 4 integer or binary floating point (including multiply) operations
    and 4 integer only (no multiply) operations
    Block 3x: 4 pick** operations
    and 5 write operations (spill to scratchpad*)

    * The only exucode side read/write operations that we currently know of are the fill and spill scratchpad ops.
    ** The pickPhase is described as being between phase 2 and phase 3, but there are numerous reasons to include pick ops in the third block. For one, the pick logic resides in the crossbar that the writePhase ops use to select their inputs. Also, the maximum number of operations in one block is said to be nine.

    The flowcode stream is quite different with only 8 dedicated pipelines compared to 25 for the exucode stream. The reason for this is that flowcode operations are much larger due to their in-line address and constant operands so the code size (and cache size) for both streams balances out. Another difference is that the pipelines include functional hardware for multiple phases:

    4 pipes that can do either immediate constant, load/store or control transfer (branch, call) operations
    4 pipes that can do either immediate constant or load/store operations

    Suppose you have an instruction with 8 stores (writePhase), followed by an instruction with 8 loads (opPhase), followed by an instruction with 8 constants (readPhase). Because of the phasing, all 24 operations would hit the 8 pipelines during the last cycle. Would the separate functional units handle it easily, or are there constraints on the mix of flowcode ops to balance out the phasing?

    • Ivan Godard
      Keymaster
      Post count: 689

      Congratulations! A remarkably complete summary.

      The short answer is that there are no issue constraints in your example. However, because you are issuing loads and expecting immediate retire, you will get several cycles of stall between the load and con instructions; that’s why deferred loads exist, see the Memory talk for details. The data cache cannot get you data instantaneously.

      More generally however, the operation set and hardware have almost no issue hazards. “almost” being the operative word here. An example of an issue hazard that does exist is FMA. While you can issue an FMA every cycle in all slots that support it, if you issue an FMA now and then try to issue a plain FP add in the cycle when the multiply part of the FMA is done then both the add and the add part of the FMA will be trying to use the adder hardware in the same cycle; no dice. The hardware throws a fault if you try.

      And currently that’s the only case.
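
      A toy way to state that hazard (made-up latencies, not real Mill numbers):

          #include <stdbool.h>

          /* An FMA claims the shared FP adder `mul_lat` cycles after it issues
             (when its multiply part is done). A plain FP add issued in exactly
             that cycle wants the adder at the same time, and the hardware faults.

             Example schedule with mul_lat = 3:
               cycle 0: fma issues            -> multiplier busy, cycles 0..2
               cycle 3: fma's add part        -> adder busy in cycle 3
               cycle 3: plain fp add issued   -> also wants the adder: fault   */
          bool adder_collision(int fma_issue, int add_issue, int mul_lat) {
              return add_issue == fma_issue + mul_lat;
          }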

      • Ryan Hitchman
        Participant
        Post count: 3

        The integer/floating point dividers are fully pipelined?

  • Ivan Godard
    Keymaster
    Post count: 689

    There’s not much reason for hardware divide on a machine as wide as the Mill. As your question shows, conventional divide is implemented as a microcode loop that blocks the pipeline, and pipe blockage gets more painful with wider machines.

    While divides are defined in the abstract Mill, we do not expect them to be native hardware in any member. Instead, the specializer will substitute an emulation using Newton-Raphson. The emulation will make use of several “divide-helper” ops that will be native. Thus the emulation, while software visible, is doing essentially the same job as the microcode on a conventional machine. The result has the latency of native hardware, but the individual operations can be scheduled intermixed with other ops, including those from other divides.

    We expect to use the same strategy for sqrt and transcendentals.
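
    As a plain-C sketch of the idea (the textbook Newton-Raphson reciprocal iteration; the actual divide-helper ops are NYF and not shown, and this handles only positive, finite divisors and is not guaranteed correctly rounded):

        #include <math.h>

        /* x' = x * (2 - d*x) roughly doubles the number of correct bits per
           step, and the multiplies can be scheduled among unrelated work
           instead of blocking a pipeline. */
        double nr_divide(double n, double d) {
            double x = ldexp(1.0, -(ilogb(d) + 1)); /* seed so d*x is in [0.5, 1) */
            for (int i = 0; i < 6; i++)             /* 6 doublings: ample for double */
                x = x * (2.0 - d * x);
            return n * x;
        }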

  • Symmetry
    Participant
    Post count: 28

    Some questions:

    This wouldn’t be a Mill architecture, but I wonder if there would be a problem with adding more phases and having different, otherwise identical functional units do their work in different phases. For example, maybe read, pick1, op1, call, op2, pick2, write. I suppose that would induce limitations as to how quickly you could schedule instructions back to back, but it would make decoding easier for a given width. Were there other important benefits and limitations I’m not thinking of? How much of the selection of the current phasing setup was a result of the way the hardware works, and how much was from profiling actual code to see what would be useful in practice?

    With regard to calls, how many cycles do they normally take in overhead?

    What happens if you have an FMA instruction but fail to also issue an Args? Or is it that the existence of an Args in an instruction turns the multiply operation into a FMA operation?

    Am I right in guessing that which operations are connected to which in ganging is a result of which slot in the instruction the ops occupy?

    • Will_Edwards
      Moderator
      Post count: 98

      Answering particular parts of your question where I can:

      > How much of the selection of the current phasing setup was a result of the way the hardware works, and how much was from profiling actual code to see what would be useful in practice?

      Well, definitely informed by code analysis. There’s an upcoming talk on Configuration, and it’s been described in the talks and in the Hackaday interview.

      A central tenet of the Mill is that people building products can prototype custom processor specifications in the simulator very quickly and choose a configuration with the best power performance tradeoff informed by benchmarking representative code.

      > With regard to calls, how many cycles do they normally take in overhead?

      One. Calls and branches (that are correctly predicted) transfer already in the next cycle. In the case of mis-predictions where the destination is in the instruction cache, again the Mill is unusually fast; it has a penalty of just five or so cycles.

      Additionally, there is none of the conventional pre- and post-amble code.

      • Symmetry
        Participant
        Post count: 28

        >A central tenet of the Mill is that people building products can prototype custom processor specifications in the simulator very quickly and choose a configuration with the best power performance tradeoff informed by benchmarking representative code.

        Yes, I’m really looking forward to seeing what tech stack you use to do that. I guess I’ll have to wait for the specialization talk to come out to find out how.

    • Ivan Godard
      Keymaster
      Post count: 689

      This wouldn’t be a Mill architecture, but I wonder if there would be a problem with adding more phases and having different, otherwise identical functional units do their work in different phases. For example, maybe read, pick1, op1, call, op2, pick2, write. I suppose that would induce limitations as to how quickly you could schedule instructions back to back, but it would make decoding easier for a given width. Were there other important benefits and limitations I’m not thinking of? How much of the selection of the current phasing setup was a result of the way the hardware works, and how much was from profiling actual code to see what would be useful in practice?

      The phase structure is largely dictated by the encoding and the decode process. It takes three cycles to fully decode the extremely wide Mill instructions, but ops in some blocks are completed in each of those cycles and we issue them right away as soon as available; that gives us three phases. The pick op has no cycle, and the call op uses the cycle of the called function, which gives us five phases in all.

      We could do as you suggest and have added phases with dedicated FUs. However, the ops for those phases would already have been decoded and would just be waiting for their phase, so why not issue them at once? We could of course go to a five-block encoding, which would take four cycles to decode, requiring another phase. However, three seems to be all that’s useful.

      Assignment of ops to blocks (and hence phase) is ad hoc but guided by hardware constraints. For example, it is not possible to get the belt bypass set up in parallel with decoding the first (reader) block, so ops in reader phase cannot have belt arguments. Consequently the reader phase can only have pure-source ops or administration ops that do not take a belt argument. Reader phase also cannot have an op taking a large constant (such as the stackf frame-allocation op), because the constant would wreck the entropy of the encoding of the other reader ops, which are quite small.

    • Ivan Godard
      Keymaster
      Post count: 689

      What happens if you have an FMA instruction but fail to also issue an Args? Or is it that the existence of an Args in an instruction turns the multiply operation into a FMA operation?

      Am I right in guessing that which operations are connected to which in ganging is a result of which slot in the instruction the ops occupy?

      Decode never needs to look at an adjacent slot to disambiguate an operation; needing to do so would complicate and slow down decode. Consequently FMA is a distinct opcode and not a mul-with-baggage.

      The sanity check for ganging (at least on the exu side where these ops live) is done during execution. The FMA functional unit knows that it needs arguments from its buddy, but it has the execution cycle to work out whether or not it has a buddy with the right arguments. No buddy causes an invalidOperation fault, even though the decoder proper didn’t complain.

      Gangs are always encoded in adjacent slots. Otherwise there are encoding issues and added data paths to get everything to the FU involved.

  • David McCammond-Watts
    Participant
    Post count: 13

    I was wondering about what happens to FUs that are occupied at the time of a call. Do they complete while the call is executing and simply output their value to an output buffer for the slot (potentially to get saved and restored by the spiller) or is their internal execution state saved and restored?

    Suppose you have a mul and a call in the same instruction and that the mul takes 3 cycles. With phasing the mul will have completed 1 cycle’s worth of work prior to the call. Will the mul complete its work (in two more cycles) during the called function’s cycles, or would its work be stopped and its internal state be saved (again, to the spiller?) and then restored on return? If the mul completes during the called function then it could show up in the caller’s belt prematurely (unless that’s handled by a naming scheme?). If the state is saved and restored, then potentially more information would need to be spilled and filled, and the mechanism for doing such a thing must be costly in complexity (wires, area, power, etc.).

    • Will_Edwards
      Moderator
      Post count: 98

      Good question. It’s the former, and it’s done using naming, as you guessed.

      Internally, each belt item has a frame id; this is how the CPU ensures the caller’s and callee’s belts are separate.

      Operations mark their output belt items with the frame id their instruction was issued in. The spiller takes care of saving and restoring this in the background.

      This was described in the Belt talk, about 30 minutes in.

      • Ivan Godard
        Keymaster
        Post count: 689

        To expand a bit on Will’s answer, and introduce some terminology for the glossary (you there, Mill-fan-registered-as-“imbecile”?).

        CPUs that do not expose the pipeline go to great lengths to support the illusion that instructions are executed one-by-one to completion. They must be able to unwind absolutely any action, completed or not, in the event of an interrupt or other event that lets the program see state. The unwind goes back to just before the first incomplete (unretired) operation; everything after that, done or not, is discarded and has to be done over. This approach is called execution replay (or sometimes issue replay).

        No modern machine actually has one-at-a-time execution internally, and an exposed pipeline machine like the Mill does not even pretend that it does. IBM 7094 semantics are so 1960, after all. Instead, we let all operations run to completion, save the results, and on restart we inject the saved results with the same timing and semantics that would have occurred had the interrupt not happened. This approach is called result replay.

        Result replay has real advantages over issue replay: you never have to recompute things. Moreover, while you have to save the in-flight state if an interrupt happens, you don’t have to save all the state all the time in case an interrupt might happen. And the state is a lot smaller, too.

        • David McCammond-Watts
          Participant
          Post count: 13

          Thanks for the explanations. That makes a lot of sense. Just to make sure I have this right: if you have a “mul-call” instruction followed by an “add”, then the mul will complete during the called function (presuming it’s more than a couple of cycles) before the add is issued. Even though the mul ends up being computed before the add (because the mul’s result will show up in an output buffer during the cycles of the called function), it will show up in the caller’s belt at the appropriate time (via naming) *after* the add. The mul’s result might get saved and restored by the spiller if the called function is of sufficient length (or itself calls functions of sufficient length, etc.), but that is all transparent to the software. An interrupt is handled essentially the same way (via result replay). Honestly, that sounds pretty straightforward, given the alternatives. Let me know if that’s not correct.

          I’m curious to see how exceptions play into all of this. I know I’ll have to wait a while, but that must be an interesting story to tell.

          In any event, thanks, again!

          • Will_Edwards
            Moderator
            Post count: 98

            Yes, this is how it works. The compiler statically schedules everything, and everything happens in order. The spiller transparently takes care of storing results that were in flight during calls and makes them available when their retire cycle comes, if necessary.

            Exception is an overloaded term, but I guess you mean C++-like exceptions?

            These are in the language and runtime, and the core knows nothing about them. They are just rarely-taken branches and returns.

            The stack trace, however, may be recovered using a special system service which can talk to the hardware; or a language runtime may record the call stack itself using its own metadata.

          • David McCammond-Watts
            Participant
            Post count: 13

            Yes, I was thinking about C++ style exceptions, not hardware faults. In particular, how this all works if your exception handler (i.e. catch) is several stack frames up from the place the exception was thrown. I doubt that’s implemented as a normal branch or return (tell me if I’m wrong). How are all the intervening stack frames unwound, the internal spiller-space state including the separate return-address stack, any not-yet-executed “tail calls”? What happens to in-flight computation results that should be discarded, etc.? I imagine there must be a way for the compiler to tell the core to unwind all of that. Perhaps something similar to a facility for setjmp/longjmp (but not identical, I would imagine)?

            I understand if this has to wait for a future talk, but it’s hard to resist asking questions. It’s like having access to your favorite author who has yet to publish the last volume of a trilogy. How do you not ask?

  • David McCammond-Watts
    Participant
    Post count: 13

    In the talk you mention that if you have multiple calls in an instruction that the return from one call returns directly to the subsequent called function. You give the example of F(G(a+b)). The instruction would be “add-call-call” where the add is in op phase and the calls are in call phase. A couple of questions:

    1. Is this true if the code were G(a+b); F(c+d); (i.e. does the retn in G go directly to F)?
    2. How are the belt and stack set up for F? Presumably G’s stack must be cut back and F’s stack established (although with IZ, this should be a simpler process on the Mill), but the belt for the call to F must be established as well. Is this also information stored by the spiller (in addition to return addresses, as described in the security talk)? Does the FU that processes the retn have access to that saved state (something like a belt list you’d pass with a call) to set up the next call (i.e. the call to F)?

    Thanks in advance for your answers.

    • Ivan Godard
      Keymaster
      Post count: 689

      Short answer: it’s all done in the naming.

      Longer answer: the decoders map logical belt numbers in the code to physical belt numbers in the execution engine. There are twice as many physical numbers as the logical length of the belt, to accommodate drops that occur during phasing.

      In your example of G cascading to F, say that G has a single result. Consequently the pre-existing belt value that would be b3 as an argument to G will be b4 as an argument to F (you see why hand-coding a Mill is not recommended 🙂 ). The decode hardware knows that G will drop one result (it’s in the encoding of the call op), so it adds one to the running logical-to-physical bias that it uses for the call of G to get the bias for the mapping for F. For a 16-long belt that’s only a 5-bit add, so speed is not an issue.

      When a call returns, the physical belt is the same as it was before the call, courtesy of the spiller; a previously unoccupied physical position number is now filled with the result, but no existing values change physical numbers. The arguments of the second call are then picked up by their physical numbers (already available due to the mapping in the decoder), and copies are made in free buffers. The copies are assigned physical numbers with the new frame number. Bump the frame number, and you’re good to go.
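
      A toy C model of that mapping, sketching the non-cascaded case just described (my own simplification of the bias/frame-id bookkeeping, not the real hardware):

          /* The physical belt is a ring of 2*BELT entries; a logical number is
             resolved from the running drop count (the bias). Each drop adds one
             to the bias; each call bumps the frame id so the callee's drops are
             tagged separately from the caller's. */
          enum { BELT = 16, RING = 2 * BELT };

          struct belt_map {
              int bias;    /* running logical-to-physical bias (total drops) */
              int frame;   /* current frame id */
          };

          static int physical_of(const struct belt_map *m, int logical) {
              /* b0 is the most recent drop, b1 the one before, ... */
              return (m->bias - 1 - logical + RING) % RING;
          }

          static void drop_result(struct belt_map *m) {
              m->bias = (m->bias + 1) % RING;   /* everything shifts back by one */
          }

          static void enter_call(struct belt_map *m) {
              m->frame += 1;   /* fresh, empty belt for the callee */
          }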

      That’s what happens without cascading. It should be clear that it’s not actually necessary to reify the whole belt. All that is really needed is to assign the right physical number to the result of the first call, and to create copies of second-call arguments in the spiller with suitable numbers.

      The spiller doesn’t really have time to do that if there’s no cycle between return and next call, and it doesn’t get the I-need-this list for the arguments until after the return. But the decoder already has the I-need-this list for the second call even before the instruction is issued, so it just gives the list to the spiller to save along with the rest at the first call. The spiller can then just do the copies as soon as it knows the return is happening.

      There is actually a problem even with the advance notice if the return is conditional, because predicates are not evaluated until writer phase in the next cycle, which with cascading is actually the first cycle of the second-called function. We can go with the prediction and unwind if it is wrong, or simply not cascade conditional returns; it’s an implementation choice in the hardware, and members will likely differ.

      We also have a way to do call arguments without copying, but it’s NYF.

      Now for the rest of your question:

      1) The return in a cascade goes direct to the next called function, delta the conditional-return case above.

      2) Called functions start with no data stack frame; they must explicitly allocate one with stackf if they need it. Return does automatic cutback though. The called belt is created as described above – a call with no arguments need only bump the frame number and presto – new empty belt.

      The return op itself has no understanding of what it is returning to. It does have an argument list, encoded just like the list of a call, that gives the belt positions of the result(s). Necessarily these will become b0… in the caller (returns drop at the front as usual), so all it needs is the mapping bias and the frame number of the returned-to frame. Change the physical belt number and frame of those values and the result handling is done. The bias and frame have been saved in the spiller. Hardware implementations will vary, but I expect that the result handling is done by the part of the spiller that does cascading, and the return FU deals only with getting the address and predicate right and dealing with First Winner Rule.

      • David McCammond-Watts
        Participant
        Post count: 13

        The penny drops. I wasn’t fully appreciating how the physical belt worked in relation to the logical. Thanks.

        It also makes sense that multiple call ops in the same instruction are not always cascaded. I misunderstood what was claimed in the talk. Your explanation of conditional returns makes that clear. If I’d thought just a little more about it I think it would have been clear. For example, you can’t logically cascade something like F(G(a+b) * c). The mul would have to take place after G returns, and so they couldn’t be cascaded (well, and the call ops couldn’t be in the same instruction, either, I suppose).

        Is the decision whether to cascade done by the specializer or by the core on the fly?

        Also, are there possibly timing issues if the function is quick to execute? Take for example F(G(), x * y). Suppose G is simply a return of a literal value (one supported by con, so no load is needed). The call takes a cycle, the con takes a cycle, and the return presumably also takes a cycle, for three cycles total. If x * y is a multiply that takes more cycles than the call (an fmul, for example, taking more than the 3 cycles accounted for above) and the compiler didn’t schedule the multiply early enough to retire at the right spot (silly compiler), would the cascaded call simply stall waiting for x*y to be computed? Would the specializer know enough to simply not cascade the call in the first place? If done by the core, does it have better information to make that decision?

        I apologize if these questions seem too basic, and I appreciate the answers.

        • Ivan Godard
          Keymaster
          Post count: 689

          As far as the specializer is concerned, a call is just a zero-latency op. F(G(), x*y) has the same constraints as F(a+b, x*y). It can schedule the {call G, add} any time in relation to the {mul}, but both have to be done before {call F} can issue. Because it knows all the latencies, it tries to have everything ready to go and already dropped to the belt when it is needed. In the case of the nested call, recall that even with cascading both calls are in the same instruction, and multiply drops between readerPhase and opPhase, which is before callPhase. So for best speed, the mul must retire before either call, cascaded or not. Remember, both calls happen in one of the consecutive call phases of the single instruction; unlike other phases, there can be several call phases.

          It doesn’t matter how long either function takes to execute; for scheduling, all calls take zero cycles, i.e. they complete in the same instruction that they issue in.

      • LarryP
        Participant
        Post count: 78

        Please forgive me if these questions have been answered elsewhere. But since this topic mentions both logical/physical belt positions as well as frame IDs, I hope it’s OK to pose here.

        However the Mill implements the belt and new-belt-per-function-call, there’s a finite number of belt IDs in any given implementation.
        Q0: Correct?

        I assume that a sufficiently deep series of function calls will need more belt IDs than the (particular family member’s) hardware supports.
        Q1: In which case, what is the Mill’s behavior?

        I’ll guess that some mechanism (the spiller or a close relative) has to offload a sizable chunk of internal state, so there will be available Belt IDs for the rest of those function calls. (And the spilled Belts/Belt-ID-state must eventually be refilled to complete the deeply-nested series of calls.)

        Q2: If the Mill occasionally has to perform such Belt-ID spill/fill operations, does the compiler have any choice about the granularity of the spill/fill (e.g. a full spill, half, quarter, exactly Z beltIDs), or is the entire belt-ID internal state always spilled? The reason I ask is that infrequent, expensive digressions can cause problems for workloads with tight timing constraints. (And I’m hoping the Mill can be applied to such workloads, in a reasonably clean way.)

        Thanks in advance for any explanation.

  • Will_Edwards
    Moderator
    Post count: 98

    Q0 yes and Q1 yes, the spiller takes care of this. The spiller is clever enough not to spill empty belt positions, nor the unused space in positions that hold only scalars, etc., so it makes the best possible job of it all.

    Q2 is really no; the spill is automatic and lazy. The compiler writes to an intermediate IR and does not know the dimensions of the various targets, as Mill models differ in e.g. vector height and belt length.

    But the good news is that the Mill models are very very good on tight timed loops! DSPs eat software pipelined loops, and the Mill is very much a DSP when you need that ummph.

    There is a talk explaining pipelining on the Mill planned; watch the forum or subscribe to the mailing list for further info when the schedule is set!

  • LarryP
    Participant
    Post count: 78

    Does the pick operation get its third operand via ganging?

    The pick operation takes three operands:

    pick(<mask>, <shuffled>, <rd>)

    Since functional units (and the crossbars) are set up for two inputs per F.U., does the pick operation use ganging, or is there some other way for pick to get its third operand? If pick uses ganging to get its 3rd operand, what types of FUs can do Args, and thus gang with pick?

    • Ivan Godard
      Keymaster
      Post count: 689

      No, it occupies one three-input slot in the same encoding block with writers (writerBlock). It is able to do so because pick is not actually moving any data, the way that copying bits to the input register of an adder does. Pick only controls the routing of the data going someplace else. There’s no real pick FU, in the sense of an adder FU.

      The two-input block (exuBlock, adds and muls and such) does have real data movement and the wires to carry it. Real writers (not the pick that is encoded in the same block, but the pure sink operations that move data to the scratchpad or specRegs) have only one input. ReaderBlock operations (popular constants, specReg and scratchpad reads) have no inputs. And the whole flow side (control flow, memory) has an extremely ad hoc data movement pattern in which different slots can have different operand counts depending on what FUs are connected where.

      It’s as regular as we can make it, but no more than that 🙂

  • LarryP
    Participant
    Post count: 78

    Passing more than a belt’s worth of arguments into/out of functions?

    How do functions that require more than a belt’s worth of arguments get those input args that don’t fit on the callee’s belt? Conventional machines typically pass these via the stack. But stacks for Mill functions, so far as I understand them, are function-local and always initially zeroed. So passing the extras via the stack sounds like it can’t work for this case. One of the belt-passed input parameters could be a pointer to the rest of the arguments, somewhere else in memory. Is that how this case is handled, or is there some other mechanism? I know that huge argument lists are considered poor form (so this situation should be rare), but the machine + compiler will have to handle this case. Are output parameters beyond what fits on the belt handled the same as excess input args?

  • Ivan Godard
    Keymaster
    Post count: 689

    Excess arguments or returns are passed in memory in essentially the same way that a conventional ABI does. Ditto VARARGS and over-large args/results.

    The memory-pass ABI on a Mill has to cope with the fact that a Mill call can cross protection boundaries. Consequently it is important that the argument pass happens without the caller being able to see the callee’s frame, nor the callee see the caller’s frame. Yet still get the data across. Yes, yet more magic 🙂

    The full explanation requires pictures, but the short answer is that the caller piles the arguments contiguously up in his own space, and then the call automatically grants access to the pile to the callee, and the access is automatically revoked at the return. Addressing is via a pair of specRegs (unimaginatively called INP and OUTP) that can be used as base registers in load/store/LEA.
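
    A very rough C picture of the two halves (my own sketch; the grant/revoke and the INP/OUTP specRegs are hardware mechanisms with no C equivalent, so they appear here only as an ordinary pointer):

        /* Toy model only: the caller piles excess arguments contiguously in
           its own space; the callee reads them through a base pointer. */
        struct arg_pile { long a[32]; };

        long callee(const struct arg_pile *inp);    /* reads args via the base */

        long caller(void) {
            struct arg_pile pile;
            for (int i = 0; i < 32; i++)
                pile.a[i] = i;
            return callee(&pile);                   /* base of the pile ~ OUTP/INP */
        }

        long callee(const struct arg_pile *inp) {
            long sum = 0;
            for (int i = 0; i < 32; i++)
                sum += inp->a[i];
            return sum;
        }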

    Pictures in one of the future talks at some point, don’t hold your breath 🙁

  • LarryP
    Participant
    Post count: 78

    Ivan,

    Thanks for the explanation. Your reminder that calls can cross protection boundaries, and your outline of the access grant followed by an automatic rights rescind on return, gives me the gist of it.

    Larry

  • c3d
    Participant
    Post count: 6

    What is the maximum arity of operations on the Mill today? What impact does this maximum arity have on the other sizing factors (e.g. size of the belt, rotation, etc)? Do you for example have equivalents to the Itanium extract and deposit instructions?

    • Ivan Godard
      Keymaster
      Post count: 689

      Currently the highest fixed arity is four, in the fma family. However, many ops are polyadic with unbounded arity, such as call, and some ops have member-dependent maximum arity, such as conform. There are field extract and insert operations, similar to those of other machines.

  • LarryP
    Participant
    Post count: 78

    A subtle difference between hardware-native operations and emulation sub-graphs? (Re Mill Boolean predicate gangs)

    According to the presentation on execution, exu-side/op-phase operations can pass condition codes to the next higher exu slot — which can then turn those codes into a Boolean result if so programmed.

    If some Mill operation (let’s call it foo) is not native on a member, then foo will have to be emulated on that member by a sub-graph (which may be an emulating function call or an inlined sequence of operations). However, since this emulated foo will not have a slot of its own, I don’t see an emulated foo having an obvious way of mimicking a hardware-native foo’s ability to pass a CC set to another functional unit “horizontally” as a predicate gang, as would a native foo operation.

    How does the Mill architecture and tool-chain handle this apparent difference between native and emulated operations?

    Thanks in advance!

    • Ivan Godard
      Keymaster
      Post count: 689

      As far as the specializer is concerned, add.eql (add and a predicate gang) is a single op that encodes in two slots and has two results. The emulation can replace both together with code that also has two results. There is no way to replace only the add with emulation and leave the eql as native, because the horizontal signals are not visible in the program model.

  • Renesac
    Participant
    Post count: 6

    Can someone please do a diagram showing, cycle by cycle, the status of the Mill pipeline when a typical call happens and returns? That isn’t very clear to me. When does the belt renaming happen, and do different phases see different belt numbers?

    From my understanding, the existence of the reader phase is by itself bad for performance, unlike the subsequent ones. It sure is an ingenious way to enable denser instruction encoding, by allowing more complex instruction formats and also by reducing the number of instructions encoded, avoiding the fixed overhead of an extra instruction. But save maybe for the interaction with a function call, I don’t see where else there can be a gain.

    The abstract Mill instruction encoding format allows a lot of flexibility for the hardware and physical instruction set beneath to change and still be compatible with the abstract Mill. If for a certain Mill member the implementation trade-offs change, and it makes sense to encode both reader and computing phase operations together to be decoded and executed in the same cycle, would this still be a Mill? Or is the reader phase executing a cycle earlier a hardwired assumption that would break compatibility if changed?

    • Ivan Godard
      Keymaster
      Post count: 689

      Diagrams are hard 🙂 I’ll try an explanation.

      The phase of an operation (or the phase of a sub-action within an operation that is multi-phase, like store) is determined by explicit specification. The phases are reader, op, call, pick, and writer, and are evaluated/executed in that order; there are a couple more in the sim code for ease of implementation, but these don’t really exist in the hardware. There may be multiple instances of callPhase, one for each call in the instruction.

      All ops must be decoded before they can be executed, and on the Mill decode takes more than one cycle, and some blocks in the instruction complete decode before others do. Consequently, it is convenient to place ops that must execute in an earlier phase so that they are in blocks that are decoded earlier too. Consequently readerPhase ops are placed in the block that decodes first, while writerPhase ops, while they could go in any block, are in practice placed in a block that decodes last because they are not in a hurry to execute.

      The assignment of ops to blocks is also impacted by encoding format issues; slots in any block should contain only ops with similar bitwise format, for encoding entropy reasons and simpler decode. Thus the con operation, whose format is completely unlike the other readerPhase ops, is off in the same block with call (callPhase) and conform (writerPhase).

      Now, as to the call op. CallPhase occurs after opPhase (when we do things like add). The physical belt changes only at cycle boundaries. To accommodate the intra-cycle logical changes of belt content, the physical belt is actually twice as long as the logical belt in the program model. If the code in the current frame C is F(a+b), where a and b are on the belt, then the add happens in the opPhase cycle, and the result is physically available at the end of that cycle. The result will be identified as C:b0, assuming nothing else drops, and the former C:b0 will become C:b1, C:b1 becomes C:b2, and so on.

      Meanwhile, the hardware makes a copy of the result of the add and identifies it as F:b0, i.e. the first belt position in the called function. There are no other operands identified as F:bX, so the rest of the belt of the callee is empty. All this happens essentially at the cycle boundary, so that the first instruction of the callee sees the arguments and nothing else. In effect, callPhase is really that first callee instruction cycle. Because we go directly from the caller cycle containing the call to the callee cycle, the call in effect takes no time at all.

      ReaderPhase is likewise free. After an opPhase op has been decoded, it takes a cycle to set up the crossbar to get the op its input data, so an add (for example) cannot execute immediately after decode; instead there is an “issue” stage between the second decode and the execute stage. However, readerPhase ops do not take a belt argument, and so they don’t need a setup stage for the crossbar. So we can execute them in the same cycle in which the opPhase ops are getting ready for issue. So the timing is:

      decode 0
      decode 1
      readerPhase execute/opPhase issue
      opPhase execute
      callPhase/first called instruction
      … repeated
      writerPhase (which is also opPhase of the next instruction and readerPhase of the one after that)

      • Renesac
        Participant
        Post count: 6

        The phase of an operation (or the phase of a sub-action within an operation that is multi-phase, like store) is determined by explicit specification.

        Ok, so the abstract Mill accepts a processor that has no separate reader phase (or other phase), if such a thing is advantageous in some future implementation? For example, if you run an n-way interleaved multithreading (barrel processor) you have plenty of time between your cycles. Or maybe a Tin, where instructions are already small and simple, and likewise the crossbar is tiny and clocking targets may be low. Or something weirder.

        • Ivan Godard
          Keymaster
          Post count: 689

          The execution timing is the same so long as decoded ops are available, regardless of whether phasing is used or not: each execution pipe is fully pipelined and so can start an op each cycle, and the dataflow constrains the earliest that an op can be started. If you could decode it and had enough pipes, an OOO x86 would pipe dataflows into the pipes the same way that a Mill does. The barrel is no different.

          The gain from phasing occurs at the boundary conditions, where the sequence of instruction decodes detaches from the sequence of issue and execution. This occurs at control flow points. For example, when you call a function, unless you are OOO you must complete all issues in the caller, transfer, decode in the callee, and refill the pipelines in the callee. The same occurs at return, branches and, importantly, at mispredicts.

          Phasing is “really” a way to do short-range OOO without the OOO hardware. If the code is: 3-cycle dataflow -> branch -> 3 cycle dataflow -> branch -> 3 cycle dataflow, then an OOO machine will get it all done in three issue cycles and five cycles overall – and so will the Mill. A strict in-order will take 11 cycles, or nine if it has a predictor.
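
          Spelling out those counts (my own reading of the example):

              strict in-order:     3 + 1 + 3 + 1 + 3 = 11 cycles
              with a predictor:    the two branch cycles are hidden, 3 + 3 + 3 = 9
              OOO or Mill phasing: the three dataflows start on consecutive cycles
                                   and overlap, so 3 issue cycles + 2 cycles for the
                                   last dataflow to finish = 5 cycles overall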

          So phasing is just poor-man’s OOO, taking advantage of the fact that real code contains a lot of short independent dataflows that, if you have the MIMD width, you can overlap.

          Of course, once you have phasing then other possibilities are enabled: multi-call, First Winner Rule, 0-cycle pick, …

          • Renesac
            Participant
            Post count: 6

            Right, I was confused by your slide 28. It shows the con being moved to the brtr cycle. Apparently this is wrong. It’s the add that should be moved up to the brtr cycle, and the con should be moved even higher to the cycle before, and both will not appear to have executed if the branch wasn’t taken. And the store is in the same cycle as the following branch (writerPhase), not moved lower.

            Now I see that the reader phase is really advantageous in all situations (except sometimes for exit mispredictions, as with any other pipeline lengthening, but that is ok).

          • Ivan Godard
            Keymaster
            Post count: 689

            I see where we led you astray. The slide intends to show the “brtr” as the principal cycle (i.e. compute phase) of the instruction containing the brtr; the branch would happen in the following cycle, along with the add, or more correctly at the boundary leading to the add. That is, the slide describes instructions, and tries to show how they are spread out over the phases so as to overlap.

            It’s very hard to come up with lucid and unambiguous pictures for this, even with the animations to help. I don’t know how the slide could be improved; just putting it in terms of phases from the start and showing how those turn into instructions, as you interpreted it, would be just as confusing but in the opposite way.

            Perhaps a color coding, with one color to represent the op as an instruction and another to represent it in phase as executed? That would work for the con, but now there would be two “brtr”s and the other ops, and the clutter would be confusing too 🙁

            If any of you readers have access to PowerPoint I would welcome sample slides that you feel explain this better. Don’t worry about the color scheme, but use whatever you think looks nice.

      • LarryP
        Participant
        Post count: 78

        Renesac’s questions lead me to ask two more:

        1. Of the seven phases identified in the Wiki http://millcomputing.com/wiki/Phasing , which ones *require* a clock cycle vs. which ones do not? Based on Ivan’s most recent comment and the Wiki’s entry, my best guess is that pick, conform and transfer do not require, of themselves, another clock cycle. Ivan & Co., please correct me if I’ve misread things.

        2. In the execution talk and in the Wiki under phasing, we have diagrams of the reader/op/writer phases happening in steady state — but only for that case. What I’d like is some explanation of what happens when adjacent instructions do not all use the same phases. In such cases, are there essentially bubbles in the pipeline?

        IMHO, a diagram would be very good for illustrating this case and hopefully explaining how the Mill executes when adjacent instructions use different phases from one another.
        —-

        Side note:
        The presentation on instruction decoding presents details most relevant to the exu-side decoding — but the flow-side encoding is substantially different in terms of how the header info and blocks are used! So the Wiki entry http://millcomputing.com/wiki/Decode (specifically the section titled “Flow Stream”) is good reading to understand what Ivan wrote about how and why the con operation is encoded as it is.

        • Ivan Godard
          Keymaster
          Post count: 689

          1)
          Yes, pick takes no cycle at all, nor does conform, because both are handled in the belt-mapping/crossbar that is logically at the cycle boundary. It does take a cycle, or more, to set up a conform. The timing differs depending on whether the belt is CAM (logical) or mapping (physical) addressed, but for a physical belt the hardware tracks issues and drops a cycle ahead of when they happen, so it actually remaps the belt naming during what is the opPhase cycle in order to do a conform. However, that mapping is only effective for ops in the following cycle, so we can think of it as taking place at the cycle boundary, even though the hardware guys think of it as being in the X0 clock.

          Transfer phase doesn’t really have a hardware equivalent; it is an artifact of the way the sim is implemented as it steps through the state machine that does phasing. The Mill is running at least three cycles ahead down the predicted path, so we have already transferred long before we actually execute the branch; possibly several times, in fact. So transfer phase is really the point at which we recognize a mispredict and stall. A modern machine cannot stop on a dime when the core is several clock cycles across at the speed of light. Stall is one of the most difficult parts of hardware design.

          Call phase doesn’t take a clock if there are no calls, but if there are then in effect it preempts and uses the clock that writer phase would have used.

          2)
          Unused phases just don’t issue anything from the instruction. A sequence of instructions with nothing but compute-phase ops will execute them one per cycle, but there will not be any reader- or writer-phase ops overlapped with them, because the instructions don’t have any of those kinds of ops. Adjacent instructions run the same whether they have anything to do in a particular phase or not. There’s no cost to an unused phase, just a lost opportunity.

      • Renesac
        Participant
        Post count: 6

        decode 0
        decode 1
        readerPhase execute/opPhase issue
        opPhase execute
        callPhase/first called instruction
        … repeated
        writerPhase (which is also opPhase of the next instruction and readerPhase of the one after that)

        Let’s see if I understand it right. The first part could well be in the encoder talk topic:

        Decode 0 doesn’t produce anything? Are the readerPhase ops already ready, but waiting for the opPhase ones to be decoded so they can issue while the reader phase executes? Or do the reader-side ops on the flow side take one extra cycle to decode compared to the simpler encoding on the exu side, and is that why there is this delay?

        Decode 1 decodes the opPhase operations (and maybe the pick and writer phase ops too, finishing the effective decode?), so that the next cycle can be an issue stage (the Mill does have an issue stage after all, only masked by phasing) while the reader operations that were already ready execute.

        Is there a decode 2 stage?

        In parallel with the opPhase execute, is the readerPhase of the called function/loop executing, assuming that the branch/exit predictor did its job? Or is there a bubble there? That is the only place where I see the readerPhase increasing the work done in a cycle, assuming an infinitely powerful decoder.

        So, is the “callPhase/first called instruction” already at the opPhase of the called instruction?

        And in the cycle the function returns (already at transfer phase, which occurs at the same time as the writer phase?), is the calling site already at opPhase? Or is there a bubble? Again, assuming perfect exit prediction.

        • This reply was modified 9 years, 9 months ago by  Renesac.
        • Ivan Godard
          Keymaster
          Post count: 689

          Neither decode0 nor decode1 “produces” anything except wholly or partly decoded operations.

          Internally to the team, the stage between decode 1 and execute 0 is variously called decode 2, issue 0, or execute -1, depending on which aspect of it you are interested in. For external documentation we try to call it the issue stage, but there is still potential for confusion, because phasing means there are three issue stages.

          The phase sequence continues, with overlap, across all control flow boundaries, following the predicted path. It is entirely possible for a cycle’s reader, compute, and writer ops to belong to three different functions. A mispredict is a pain: the bogus state has to be discarded and things restarted. Big OOO machines have a separate retire stage at the end, with a reorder buffer, to make the updates to the architected state happen in the proper sequence; if they mispredict they throw away everything that is in flight and re-execute the instructions. This is issue replay. The Mill takes a different approach, marking each in-flight with the cycle in which it issued. At a mispredict we let the in-flights drain to the spiller buffers, the same way we do for in-flights over a call or interrupt, discarding those marked as coming from cycles down the wrong path.

          Meanwhile we are restarting the decoder, and as the new ops start execution we replay the former in-flights just as we do after a return. This is result replay.

          In summary: phasing proceeds in the presence of control flow just as it would have had all the control flow been inlined.
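
          To picture the result-replay bookkeeping described above, here is a minimal editorial sketch in C; the struct, its fields, and the function are invented for illustration and are not the Mill’s actual hardware or sim interfaces.

            #include <stdbool.h>
            #include <stdint.h>

            /* Sketch only: each in-flight result is tagged with the cycle in
               which it issued.  On a mispredict, results from wrong-path cycles
               drain to the spiller and are discarded; good-path results stay
               buffered and are replayed as execution resumes (result replay),
               instead of re-executing the instructions (issue replay). */
            typedef struct {
                uint64_t value;
                uint32_t issue_cycle;
                bool     valid;
            } InFlight;

            static void on_mispredict(InFlight *inflight, int n, uint32_t first_bad_cycle)
            {
                for (int i = 0; i < n; i++) {
                    if (inflight[i].valid && inflight[i].issue_cycle >= first_bad_cycle)
                        inflight[i].valid = false;   /* wrong-path result: discard */
                }
            }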

          • Renesac
            Participant
            Post count: 6

            Neither decode0 nor decode1 “produces” anything except wholly or partly decoded operations.

            I was referring exactly to those decoded or partially decoded instructions, of course. So here is the breakdown of what happens to one instruction, cycle by cycle, after it is fetched, as I understand it:

            1 – In the decode 0 stage, readerPhase operations are at least partially decoded, right?

            2 – Then in decode 1 the rest of the instruction is at least partially decoded, and the readerPhase operations must have finished decoding. Are some long-latency registers already probed at this stage, as a sort of early readerPhase? Or do all readerPhase operations patiently wait for the next cycle, because the opPhase needs a stage to issue and/or the flow-side constant encoding only finishes decoding in this cycle?

            3 – Reader-phase operations execute. It is also the issue stage for opPhase (also called decode 2). By now every operation is probably fully decoded, just waiting for its phase, right?

            • This reply was modified 9 years, 9 months ago by  Renesac.
          • Ivan Godard
            Keymaster
            Post count: 689

            At the end of decode0 (D0) all the exu-side reader block has been decoded. This is important because we can start the scratchpad fill operations in D1, with the data available at the end of D2 (I0, X-1) for use by the adds in X0. Two cycles to read the scratchpad SRAM is plenty.

            We also have the flow-side flow block decoded at the end of D0, but that doesn’t help because we need the flow-side extension and manifest blocks (which are really just data arrays) to make sense out of those decodes. The extension and manifest blocks are interleaved into the long (very long) constant in D1 and the selectors set up to extract the bit fields belonging to each op (based on the extCount/conCount fields in the already-decoded flow block) at the end of D1. The selectors extract the constants of any con ops and have them available at the end of D2, ready for the adds in X0 again.

            The clock-critical part of all this is the priority encode of the extCount/conCount fields that sets up the selectors. Priority encode is a linear parse, and with as many as eight flow slots that’s a lot to figure out. The priority encode and the actual selection can blur over the two cycles D1/D2, but this may constrain the clock rate on very wide Mills at high clock, forcing us to add another decode cycle. A Gold at 4GHz is not for the faint of heart! It’s all heavily process dependent, of course. Best guess is that the width-versus-clock constraints will pinch in the fastpath belt crossbar before they pinch in the con operation decode, but we don’t know.
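
            One way to picture why that setup is inherently linear: the bit offset where each slot’s constant begins is a running sum of all earlier slots’ counts, so slot k cannot be placed until slots 0..k-1 have been accounted for. A toy C sketch (the slot count and widths are made up; this is not the actual decoder):

              #include <stdint.h>

              enum { SLOTS = 8 };   /* illustrative only; real slot counts are per-member */

              /* Given per-slot constant bit counts (already decoded from the flow
                 block), find where each slot's constant starts in the concatenated
                 constant block.  offset[k] depends on every earlier count, which is
                 what makes the setup a linear parse in hardware. */
              static void constant_offsets(const uint8_t bits_per_slot[SLOTS],
                                           uint16_t offset[SLOTS])
              {
                  uint16_t running = 0;
                  for (int k = 0; k < SLOTS; k++) {
                      offset[k] = running;
                      running += bits_per_slot[k];
                  }
              }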

  • LarryP
    Participant
    Post count: 78

    Changes in the divide instructions — now native on Silver?

    Greetings all,

    According to the wiki: http://millcomputing.com/wiki/Instruction_Set/divu

    Divide operations are no longer all emulated via loops around helper operations (which are less expensive in hardware). That is an interesting departure from the emulate-by-looping-around-a-cheap-helper strategy.

    It seemed to me that this might happen at some point, for some Mill core(s), but I had assumed it wasn’t on the critical path, so I’m curious why it changed now.

    In some ways, this reminds me of what seemed to happen with the ARM7 series of chips. The base ARM7 lacked hardware divide, and although there were software emulations, I think ARM licensed far more ARM7s with hardware divide than without. I think system designers were worried about worst-case soft-divide performance, should the software guys go and write divide-heavy code. (Shame on them!)

    Thoughts?

    • Ivan Godard
      Keymaster
      Post count: 689

      You happened to catch work in progress. The Wiki shows a periodically updated snapshot of the actual specs for the members, and so may be out of step with reality at times as various things are being worked on. We still have no plans for hardware div, rem, or divRem operations, but they are part of the abstract Mill, so the tool chain and sim must support them for eventual family members. Of course, to test the software we need a member spec that does define these ops, and Silver seems to have gotten elected. I expect it will go away again.

  • LarryP
    Participant
    Post count: 78

    Effective IPC during divide emulation?

    What happens to a Mill’s effective IPC while it’s emulating a divide operation?
    An iterated-in-software approach to divide (and its relatives) would appear to require, in the general case, both multiple cycles and a number of cycles unknown at compile time. If a divide emulation takes on average N cycles and there’s no way to get anything else done, because the number of iterations is unknown at compile time, then the effective ops/clock would seem to drop to something proportional to 1/N, unless there’s a way to use the rest of the Mill’s width (or at least a good portion of it) during the emulated divide. If there’s no way to schedule other ops during the emulation, that sounds like a substantial performance hit!

    I won’t speculate publicly (I hate that constraint, but understand the need!),
    but I’m very curious about:

    a) How can the Mill CPU and toolchain get anything else useful done during an emulated divide?

    b) What does the Mill’s approach to divide do to its effective IPC in codes that need to do a substantial number of division or remainder ops?

    Apologies if this is a sensitive issue, but eventually people are going to want to know. (On the off chance that you’re not satisfied with your solution, I’d be happy to share my speculations via email.)

    As always, thanks in advance for anything you can share publicly.

    • Ivan Godard
      Keymaster
      Post count: 689

      Frankly, nobody has written the divide emulation code yet, so we have no numbers; it’s in the pipeline, but the emulation team has mostly been worrying about FP. If you’d like to help then feel free to post candidate code for div, rem, and divRem for the various widths. Remember that either or both inputs can be NaR/None. I do not expect there to be a loop; simply unroll the whole thing.

      That said, the emulation substitution occurs at the start of the specializer where we are still working with dataflows, so if there is any code that is part of a different flow than the divide then the code will overlap.
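
      For readers who want a feel for what such emulation code might look like, here is a hedged C sketch of a fully unrolled restoring divide for 8-bit unsigned operands. It is illustration only: real Mill emulation would be written in genAsm, handle NaR/None, and return NaR for a zero divisor.

        #include <stdint.h>

        /* Classic restoring shift-subtract division, unrolled so there is no
           data-dependent loop: one step per quotient bit, high bit first. */
        static void divrem_u8(uint8_t n, uint8_t d, uint8_t *q, uint8_t *r)
        {
            uint16_t rem = 0;
            uint8_t  quo = 0;

        #define STEP(bit)                                                       \
            do {                                                                \
                rem = (uint16_t)((rem << 1) | ((n >> (bit)) & 1u));             \
                if (rem >= d) { rem -= d; quo |= (uint8_t)(1u << (bit)); }      \
            } while (0)

            STEP(7); STEP(6); STEP(5); STEP(4);
            STEP(3); STEP(2); STEP(1); STEP(0);
        #undef STEP

            *q = quo;
            *r = (uint8_t)rem;
        }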

  • LarryP
    Participant
    Post count: 78

    Ivan et all at Millcomputing,

    No promises, but I can take a stab at emulating integer division, probably unsigned to start out. In order to do that, I’ll need a clear definition of exactly what the rdivu operation calculates, including width(s) and any constant of proportionality.

    I find examples very helpful to understand key details. For simplicity, let’s start with uint8 as the size of the input numerator and denominator. Would you kindly post what rdivu(x) returns for, respectively: 0, 1, 63, 64, 127, 128, and 255?

    (I’ve got an idea what I’d like rdiv to return, but I don’t want to mis-assume — nor do I know ALUish hardware well enough to guess what’s fast and cheap to implement as a native op.) Far better to start with what rdivu really does.

    Also, what should integer division (OK, divRem) on the Mill return for quotient and remainder when given a zero divisor? NaRs?

    Thanks,

    • Ivan Godard
      Keymaster
      Post count: 689

      Like I said, we haven’t done anything about it yet. As for the correct values, use the C/C++ standard for everything. For divide by zero, return NaR(undefinedValueNaR).

      Things are tricky when the domain is signed and either or both arguments are negative; see https://en.wikipedia.org/wiki/Modulo_operation. Currently we do not have a mode argument in the spec for these ops, but we probably should; the natural candidate is the roundingCode attribute that FP ops use. The integral result would then be the value produced by an infinitely precise FP divide operation followed by the FP integerize operation, plus the Euclidean rule for the remainder. As far as I can see the available rounding codes provide all the sign rules demanded by the various languages listed in the Wikipedia article, plus some (nearest-ties-to-even for example) that don’t seem to be used.
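
      To make those sign rules concrete, here is a small C illustration (not Mill code) comparing the truncating, flooring, and Euclidean conventions for n = -7, d = 2; each pair still satisfies n == q*d + r.

        #include <stdio.h>

        int main(void)
        {
            int n = -7, d = 2;

            int qt = n / d, rt = n % d;        /* truncating: C semantics, q=-3 r=-1 */

            int qf = qt - ((rt != 0) && ((rt < 0) != (d < 0)));   /* flooring: q=-4 */
            int rf = n - qf * d;                                  /* r=1 */

            int qe = (rt < 0) ? (d > 0 ? qt - 1 : qt + 1) : qt;   /* Euclidean: r >= 0 */
            int re = n - qe * d;       /* same as flooring here; differs when d < 0 */

            printf("trunc q=%d r=%d  floor q=%d r=%d  eucl q=%d r=%d\n",
                   qt, rt, qf, rf, qe, re);
            return 0;
        }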

      I suggest that you see how far you get in unsigned, and when that gets through the specializer (you’ll write in genAsm) then we can revisit signed and decide whether to use roundingCode, a new attribute, or roundingCode with some enumerates faulting as undefined.

      It’s OK to post your code here, although I’d suggest starting a new topic. Other readers/posters may chime in on your efforts; please be polite. When you have something that you believe works then either we will have published access to the tool chain for all or (if that’s still not done) we can NDA and let you use our servers.

  • LarryP
    Participant
    Post count: 78

    Ivan,

    If you folks haven’t defined rdivu, may I please have permission to suggest (that is, to speculate publicly about) what rdivu might calculate? Or would you prefer I email that to you first, for your approval?

    I really doubt that the rudiments of numerical analysis, Taylor series, and the Newton-Raphson algorithm{1} contain anything patentable, but I’d like to help — and thus try my best not to cause Millcomputing any IP headaches.

    Unless I’m mis-remembering, the convergence of any Newton-Raphson-like algorithm will depend strongly on the accuracy of what rdivu produces. So there’ll probably be a trade-off between the hardware/speed cost of implementing rdivu and the precision — and thus how many iterations are needed to converge (for a given width of input args.)
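
    To illustrate that trade-off, here is a tiny sketch of the Newton-Raphson reciprocal iteration in plain C doubles; it is only meant to show the convergence behaviour, since rdivu itself isn’t defined yet and a real emulation would work in fixed point.

      #include <math.h>
      #include <stdio.h>

      /* x <- x*(2 - d*x) roughly doubles the number of correct bits per step,
         so the precision of the seed (what an rdivu-like op would supply)
         determines how many iterations a w-bit divide needs. */
      int main(void)
      {
          double d = 7.0;
          double x = 0.125;              /* crude seed for 1/7 ~= 0.142857 */

          for (int i = 1; i <= 4; i++) {
              x = x * (2.0 - d * x);     /* one Newton step */
              printf("iter %d: x = %.15f, error = %.2e\n", i, x, fabs(x - 1.0 / d));
          }
          return 0;
      }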

    Is there an existing forum category you’d like to see this under?
    Actually, my instinct is to put it on the Wiki (once I have something coherent), under community-contributed software. That way we have history and multiple people can try their hand at emulating division.

    —-

    {1} According to a lecture I heard some years ago at NIST (I wish I could recall the speaker’s name), Newton’s original approximation method, in his Principia Mathematica, is rather different from the modern Newton-Raphson root-finding algorithm. (But Raphson still got second billing. So it goes….)

    • Ivan Godard
      Keymaster
      Post count: 689

      You have permission, and can conduct your business here or on the Wiki as seems best to you.

      In addition, you may propose novel helper ops beyond rdivu if you find the need, but please send the proposal for such an op to me directly first for vetting before putting it in the public space.

  • LarryP
    Participant
    Post count: 78

    Greetings all,

    I’ve started a Wiki page for user-contributed division suggestions, algorithms, pseudocode — and eventually genAsm code, at:

    http://millcomputing.com/wiki/Division:_Algorithms_and_user-submitted_genAsm_code

    I don’t see a “code” formatting button on the Wiki, as I do here in the forums.

    Would somebody who knows the Wiki look into adding a “code” formatting button on the Wiki?

    • Veedrac
      Participant
      Post count: 25

      It seems you should use <pre> tags for code on the Wiki. I’ve edited it in.

  • LarryP
    Participant
    Post count: 78

    Veedrac,

    Thank you much!

    Larry

  • mkbosmans
    Participant
    Post count: 1

    It’s curious to see the rdiv instruction specified with only a vague idea of its semantics. Maybe it is just the public information that is unclear about the specifics, but from Ivan’s comments it looks like this really is uncharted territory for the Mill team.

    On the wiki, at LarryP’s page on division, I posted a small comment about an alternative algorithm, Goldschmidt division. It could very well be that when that is used as the software div implementation, the instruction providing the initial guess should be a different operation, or could be omitted altogether.
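
    For readers unfamiliar with it, here is a rough sketch of the idea in C doubles (illustration only, not a proposed implementation): scale so the divisor is near 1, then repeatedly multiply numerator and divisor by F = 2 - divisor; the divisor converges to 1 and the numerator converges to the quotient. The two multiplies in each step are independent, which is why Goldschmidt pipelines better than Newton-Raphson, whose multiplies are serially dependent.

      #include <stdio.h>

      int main(void)
      {
          double n = 22.0, d = 7.0;          /* compute 22/7; assumes d >= 1 */

          while (d >= 1.0) { n *= 0.5; d *= 0.5; }   /* scale d into (0.5, 1] */

          for (int i = 1; i <= 5; i++) {
              double f = 2.0 - d;
              n *= f;                 /* these two multiplies are independent */
              d *= f;
              printf("iter %d: quotient estimate = %.15f\n", i, n);
          }
          return 0;
      }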

    Anyway, it’s probably time for a separate topic on integer division.

    • Ivan Godard
      Keymaster
      Post count: 689

      Hardware div units are expensive and don’t pipeline well. On a microcoded OOO box that doesn’t matter much, but the Mill doesn’t microcode and is in-order. Other machines also use macrocode after a seed value, so we knew it was possible and would fit with the rest of the machine regardless of the algorithm, so we left the details to arm-waving until later. The emulation codes are like that in general: obviously software emulation of e.g. floating-point is possible, but it’s a rat’s nest that needs a specialist, so the details were deferred until we had a specialist.

      We now have those specialists (Terje and Allen, and now Larry), so the arm-waving is being turned into code. Right now the bottleneck is genAsm. Speaking of which, I better get back to work 🙂

  • PeterH
    Participant
    Post count: 41

    In the Compiler talk it was mentioned that launching coroutines is a user level operation. How is a process prevented from DoSing the chip by endlessly launching coroutines?

  • Ivan Godard
    Keymaster
    Post count: 689

    You can only DoS yourself, just as you can by coding an infinite loop. A coroutine is not an OS process; it’s part of your program, and your coroutines collectively run under your timeslice. If you want a true OS process you have to ask the OS for it as usual. Or you can get five processes from the OS and run 20 coroutines in them; mix and match.

  • ralphbecket
    Participant
    Post count: 9

    I’ve seen mention in these fora that the Mill provides some support for garbage collectors. I’d like to ask a few questions on the topic if I may.

    (1) Garbage collection typically involves a “stop the world” event where threads in the mutator are temporarily suspended, at least for the purposes of identifying roots. Stopping the world can be quite costly. Does the Mill offer any special support in this regard?

    (2) Each concurrently executing thread may have pointers sitting on its belt. How can the GC get access to these during root marking?

    (3) Many collectors compact memory. Since pointers can live on the belt and/or may have been squirreled away by the spiller, how should the GC take this into account?

    Apologies for asking more than one question in a single post, but they are closely related.

    • Ivan Godard
      Keymaster
      Post count: 689

      I’ll give you the most recent word about GC, but must caution that we have not yet implemented a GC and this may not be as final as we’d like.

      1) Pointers reserve three bits for use by the garbage collector; the hardware knows this and, for example, does not examine these bits when testing pointer equality. There is a set of hardware masks that are checked when a pointer is used to access memory; the three bits index into the mask, or, when storing a pointer, the concatenated bits of the pointer being stored and of the address being stored to are used as the index. If the indexed bit is set then the operation traps to software. With this, one can write a generational GC that does not “stop the world”: by setting the masks and allocators up right, one gets a trap whenever there is an up-generation store of a pointer and can add the stored pointer to the roots. (A rough sketch of this check appears at the end of this post.)

      2) There are several situations in which software must be able to access program state held in the spiller. GC is one such situation, but also debuggers and various statistics and diagnostics tools as well. System software presents a trusted API for such inspection; there is no direct access because spiller state is not in application address space. The API interacts with the portal call mechanism and permission granting. For example, if the app makes a portal call passing a callback function which the service then calls, and then takes a segv in the callback, the debugger will be able to inspect the state of the callback and of the app, but not the state of the service that is between the two (unless it requests and is granted permission for that address space too, of course).

      The API only has bitwise access to the contents of the spilled data stack, but not the spilled control stack. The API provides enough information to construct a backtrace, but the debugger does not have direct access to return addresses, downstack links, and the rest of the call machinery; Mill is very attentive to matters of security.

      3) GC and other software such as debuggers can get permissions to mutate saved state such as spilled pointers.
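
      The barrier check mentioned under 1) amounts to something like the following C sketch; the bit positions and mask layout are invented for illustration and are not the Mill’s encoding.

        #include <stdbool.h>
        #include <stdint.h>

        /* Three reserved bits in each pointer select a bit in a mask; for a
           pointer store, the stored pointer's bits and the target address's
           bits are concatenated into a 6-bit index.  A set bit means "trap to
           the GC", e.g. to record an up-generation store as a new root. */
        #define GC_BITS(p)   (((uintptr_t)(p) >> 61) & 0x7u)   /* illustrative position */

        static uint64_t store_barrier_mask;    /* one bit per (stored, target) pair */

        static bool pointer_store_traps(const void *stored, const void *target)
        {
            unsigned idx = (unsigned)((GC_BITS(stored) << 3) | GC_BITS(target));
            return (store_barrier_mask >> idx) & 1u;
        }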

      • ralphbecket
        Participant
        Post count: 9

        Apologies for the late response: thank you for your answer. This sounds as though it will make GC substantially cheaper in practice and easier to implement. I look forward to someday seeing an implementation.
