Mill Computing, Inc. Forums The Mill Architecture Loop pipelining and aliasing

  • Author
    Posts
  • mseaborn
    Participant
    Post count: 3
    #1174 |

    I’ve got a question about yesterday’s talk about software pipelining of loops (for which the slides and video haven’t been posted yet).

    One of the examples in the talk was:

    for (int i = 1; i < N; i++)
    A[i] = A[i] + A[i-1];

    This is equivalent to the following (if N > 1):

    Type prev = A[0];
    for (int i = 1; i < N; i++)
    prev = A[i] = A[i] + prev;

    I found the pipeline diagrams for the loop confusing, because they seemed to show A[i-1] as coming from the “load” (of “A[i]”) operation from the previous iteration, but really it should come from the “add” (“A[i] + A[i-1]”) operation from the previous iteration (or, equivalently, “prev” in the transformed version).

    Was the example deliberately ignoring the aliasing between iterations of the loop? The diagram would have made sense to me if the loop were reading from one array but writing to a different array. Or did I misunderstand the diagram?

    Or was the intention to make use of the Mill’s deferred load operation? Presumably one iteration of the loop can issue two loads:

    load A[i], deferred by N cycles
    load A[i], deferred by N+1 cycles

    If a store to A[i] occurs, these two loads can produce different results.

    Is the idea that the compiler can successfully pipeline the first loop above without having to reason about aliasing, and without having to transform it into the second version above, by virtue of using deferred loads?

  • Ivan Godard
    Keymaster
    Post count: 612

    Oops.

    Yes, you are entirely right. It should be A[i] = A[i] + A[i+1] to match the animations. And none of us caught that in reviewing the slides. No, nothing about deferred loads, just a goof.

    Back to the PowerPoint 🙁 Sorry. I think we can keep the video we shot and sync it with corrected slides in post-production.

    Ivan

    • mseaborn
      Participant
      Post count: 3

      OK. I’m still curious whether deferred loads could help with pipelined loops in the way I suggested.

      We might have a more complex loop where the compiler can’t tell whether memory locations will alias. For example:

      for (int i = 0; i < N; i++)
      A[i] = A[f(i)] + 1;

      • Will_Edwards
        Moderator
        Post count: 98

        The thing is, the Mill is immune to aliasing because all loads are deferred as you describe it.

        Loads specify their retire time, and the stations are snooping on stores so they know when their fetched value is stale.

        So the compiler never ever has to even consider aliasing!

  • LarryP
    Participant
    Post count: 78

    Greetings all,

    Regarding pipelining loops, deferred loads and loop-carried data:

    1. I’m glad a confusing issue was caught by an alert audience member, and I’ll look forward to the corrected slides/video as soon as posted. With the noticed mistake and the resulting need to redo slides, plus extra video editing, I suspect that we won’t have the video + animated slides online for at least a couple weeks. 🙁
    IMHO, better to be later and correct, than quick and confusing. (Though I’m still waiting with imperfect patience.)

    2. If I understand correctly, the number of *simultaneously-executing* deferred loads varies considerably among family members, because family members differ in the number of traffic-monitoring load buffers they have. Since the Mill intermediate representation apparently compiles for an abstract Mill — and thus doesn’t now how many (aliasing-immune) load buffers the specific target has, then how does the combination of the compiler and specializer handle the case where the compiler’s implementation of a pilelined loop requires more *simultaneous* in-flight deferred loads than the target hardware has to do such loads?

    It seems to me that the compiler will consider most loops pipeline-able for the *abstract* Mill and thus emit its intermediate representation for a fully-software-pipelined loop — only to have the specializer potentially need to (partially, or worse fully) “de-pipeline” it, due to the target’s constraint on the number of simultaneous in-flight loads it can do.

    How will the compiler and specializer jointly handle this case? Is this one of the cases where the compiler has to emit a series of alternative sequences for the specializer to choose from? Or can the specializer always implement any software-pipelined loop sequence, though it may need more instructions to do the pipeline-required loads using its limited load hardware?

    This issue about whether/how-to software pipeline loops for a given target member sounds like (a) rather a common case, especially given the Mill’s key goal to be the CPU that performs dramatically better on loop-heavy code and (b) one with great variability (that the specializer will have to handle well), given the range of family member’s and their functional-unit populations.

    IMHO, the details will need to wait for talks on the compiler and specializer, but I’m hoping for some early/high-level clarification that the Mill’s software pipelining of loops will work as close to optimally as possible on the full range of Mill targets — without additional/major surgery on the compiler and specializer.

    Any comments would be most welcome.

    • imbecile
      Participant
      Post count: 48

      > It seems to me that the compiler will consider most loops pipeline-able for the *abstract* Mill and thus emit its intermediate representation for a fully-software-pipelined loop — only to have the specializer potentially need to (partially, or worse fully) “de-pipeline” it, due to the target’s constraint on the number of simultaneous in-flight loads it can do.

      I’m only speculating here, but I get the impression that this resource allocation problem is of the same class as the varying amount of belt slots on the different cores. And will be solved the same way: the compiler emits code as if there was no restriction at all, i.e. as if there is an unlimited number of belt slots and load/retire stations. It doesn’t even know of the spill and fill instrucions for example, since it wouldn’t know where to place them.

      The specializer knows those exact limits and then schedules/inserts loads and stores and spills and fills exactly to those limits at the appropriate place. Figuring out how many parallel loads and stores you can have and how much pipelining/unrolling loops you can do on a core is pretty much the same as figuring out how many belt slots you have and thus how to link consumers and producers together, except in one case you emit spills and fills at the limits and in the other you emit branches at the limits. The loads and stors in both cases are just placed according to best latency hiding behavior.

      • Ivan Godard
        Keymaster
        Post count: 612

        Exactly.

        In more detail: a retire station is a dedicated, blocking, not-hardware-pipelined device, much like a classic divide unit in which the unit can have only one divide in flight at a time; contrast with a hardware-pipelined device like a multiplier where you can start a new multiply every cycle and have several in-flight at different stages of execution.

        Now apply that information to the scheduler in the specializer. A software pipeline is constrained by the resources it needs for the code it executes, so if (for example) the loop body contains six adds and the target member has four ALUs then the software pipeline must be at least two cycle long. The scheduler determines this minimum by considering blocking ranges as it schedules. For an ordinary, hardware-piped operation like an add the blocking range is the one cycle for issue, regardless of the latency of the operation; you can issue an add every cycle. In contrast, a 14 cycle blocking divide has a 14 cycle blocking range, so if the loop body has a divide and the hardware has one divide unit then the software pipeline must be at least 14 cycles long, even it has to fill those cycles with no-ops.

        Now to apply that to loads. The amount of deferral of a load is up to the code, up to a maximum number of cycles determined by the representation (for count-deferred loads) or with no upper limit (for tag-deferred loads). In particular, the deferral can be zero so that a load issued in this cycle can retire in the next issue cycle, albeit not in the next clock cycle. A load uses the retire station as a blocking resource, but the blocking range is only for the period from issue to retire of that load. Consequently, if loads use zero deferral, the limit of parallelism is the lesser of the number of load units (i.e. the number of loads that can be decoded) and the number of retire stations. All Mill configurations will have more retire stations than load units (otherwise deferral would be pointless) so in practice the constraint for non-deferred loads is the number of load units. Thus, non-deferred loads can be pipelined just like adds, because they have a one-cycle blocking range just like an add.

        Now introduce deferral. If there are eight retire stations, then we can have eight loads in flight at any one time. If there are four loads in the loop body, then each load could have a blocking range of two, which means that loop can be software pipelined with each of the four loads specifying a one-cycle deferral (i.e. a two cycle flight time). This constraint is static and the blocking range can be figured as soon as you know what the loop does and have the specs for the target, i.e. at specialize time.

        Now it happens that stalls are expensive because modern hardware doesn’t stop on a dime, nor start immediately either. A two-cycle deferral will always cause a stall, even if the data is in the top level cache, because the cache access latency is longer than that. So it is better to have a deferral that is at least the d$1 latency, assumed to be three cycles for initial Mills.

        So for performance the specializer (when scheduling a pipelined loop) wants to use a deferral of at least three. That means that it cannot issue all four loads every cycle, because that would require 12 loads in-flight and there are only eight retire stations. Consequently the schedule must add additional cycles to the software-pipeline, just like it does with a blocking divide. The resulting code will contain no-ops.

        Of course, if the schedule does not run out of retire stations even with a deferral of three, it can use a larger deferral for all or some of the loads until the cumulative blocking region exceeds the available stations. The same algorithm is use for all ops with a blocking region greater than one; the only thing that is different about loads is that the size of the blocking region is determined by the code rather than by the hardware spec.

        Incidentally, loads are uncommon in Mill software-pipelines, for reasons that are NYF 🙂

    • Ivan Godard
      Keymaster
      Post count: 612

      Most of this is addressed in post http://millcomputing.com/topic/loop-pipelining-and-aliasing/#post-1180.

      We won;t have to shoot new video for the pipelining talk; only the slides have to be changed so it should not add much delay (famous last words, I know).

      Yes, only the specializer knows how many in-flight deferred loads are possible on the target. That constraint may force the specializer to produce a longer (more cycles) schedule for the piped loop than would be necessary without the constraint. As a simple example, if specializer policy is to always use a deferral large enough to hide the d$1 cache latency (three cycles) and there is only one load in the loop and only one retire station, then the schedule must use at least three cycles per iteration, whether the other slots have anything useful to do or not. Pipelining is still quite possible, and will be done, although the constraint may force no-ops that would not be necessary if loads were true one-cycle ops. This consideration was ignored in the talk, which was already over-long.

      There are fewer retire stations on smaller members, so a random loop is more likely to hit this constraint than on a bigger member. However, other constraints (such as shortage of ALU units for the loop work) will also pinch on smaller members, causing the pipeline to need more cycles per iteration anyway. These effects offset, and the members are configured so that over the range of loops in average code no particular one of the constraints is exceptionally pinchy compared to the others. The result is that pipelines will be slower on smaller and faster on bigger members, as one would expect.

      The compiler does not emit alternates for pipes (we’re not even sure that it is worth having the compiler emit alternates in any cases, but the possibility is there). The compiler also does not do scheduling: compiler output is a pure dataflow graph, with no temporal notions at all. The algorithm that the specializer uses to pipe loads is described in the cited post.

      Ask again if the explanation isn’t clear.

      • LarryP
        Participant
        Post count: 78

        Ivan,

        Thanks for your detailed explanations about how the compiler and scheduler work together to implement pipelined loops! I found your detailed explanation very helpful in understanding more about how the specializer schedules code and handles resource allocation/tracking as part of its scheduling. The explanation about blocking ranges (and the impact of non-pipelined functional units) on the number of instructions needed to implement a loop body makes clear the downside of having a dedicated, long-latency functional unit (like some implementations of monolythic divide) to a Mill machine’s overall performance.

        From what you wrote above, the answer to my previous questions appear to be:

        How will the compiler and specializer jointly handle this case?

        A. Quite well indeed.

        (To this reader, it sounds like there’s already been lots of sim results and tests that have exercised the existing compiler, specializer and simulator on a whole bunch of loop-heavy code. So there’s a body of NYF/NY-disclosable evidence that the tools and software pipelining of looks works well and will continue to do so. No hand waving involved.)

        Is this one of the cases where the compiler has to emit a series of alternative sequences for the specializer to choose from? Or can the specializer always implement any software-pipelined loop sequence, though it may need more instructions to do the pipeline-required loads using its limited load hardware?

        A. The compiler could issue alternative sequences, but at present there appears to be no need to do so for loops.

        A. The specializer can schedule any intermediate-level “code” the compiler gives it — for any Mill target — including for all known and foreseen pipelined loops. On lower-end Mills, with fewer slots/functional units (including fewer load units and retire stations), the pipelined loop body may, and probably will, take more instructions to do the loop body, as should be expected.

        —-

        Incidentally, loads are uncommon in Mill software-pipelines, for reasons that are NYF :-)

        Especially given the example, that iterates through an array, I’ll be very interested in hearing about how/why loads are uncommon in Mill software pipelines! Of course, I’m already eagerly awaiting the availability of the pipelining talk, and I’ll hope that the post-production work goes smoothly and you don’t have the hassle of re-shooting any video.

        Thanks again for your cogent and detailed responses,

        Larry

        • Ivan Godard
          Keymaster
          Post count: 612

          How will the compiler and specializer jointly handle this case?

          A. Quite well indeed.

          (To this reader, it sounds like there’s already been lots of sim results and tests that have exercised the existing compiler, specializer and simulator on a whole bunch of loop-heavy code. So there’s a body of NYF/NY-disclosable evidence that the tools and software pipelining of looks works well and will continue to do so. No hand waving involved.)

          Well, I could have said “quite well”, but don’t like to brag :-)

          However, I’m sorry to say that we really don’t have the kind of exercise-based evidence you suggest. The prototype compiler never did dataflow analysis (which is necessary for pipelining), and the new one isn’t to that point yet either. So the architecture, and the discussion, is from first principles and not from experimentation.

          It is certainly possible that subsequent experimentation will expose flaws in our first-principles analysis, and if so we’ll go back and rethink the analysis. One of our reasons for taking the Mill public as fast as we can get the filings in is so we can get a large number of skeptical eyeballs onto the design; eyeballs that may not be as comprehensive as a test suite but are likely to be already familiar with troublesome corner cases.

          We thank you, and your legion of friends who ask us hard questions.

You must be logged in to reply to this topic.