Mill Computing, Inc. Forums The Mill Architecture Memory level parallelism and HW scouts

  • NXTangl
    Participant
    Post count: 21
    #3999

    The Mill’s design is excellent at extracting ILP from general code without resorting to OoO. That’s great when stalls are rare, and the Mill’s deferred loads can mostly hide stalls as long as the address is known early. However, more complicated dependency chains can’t be deferred: even something as simple as a[i]->member + b[j]->member has a case where the Mill necessarily stalls twice but an OoO processor can get it down to a single stall. This article is probably relevant. Its arguments that ILP and branch prediction don’t matter don’t apply here, since even with absurd stall times the Mill gets you software pipelining and speculation through width, plus instruction prefetch far ahead of execution through the exit predictor. But the fact remains that the more MLP you can find, the better the rest of your performance can be. How do you plan on extracting MLP in cases with heavy memory indirection? After all, almost every object-oriented programming language (especially the dynamically typed kind) involves a ton of indirection, and the Mill is supposed to be able to run commercial software reasonably quickly without a rewrite.
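
    To make the dependency structure concrete, here is the expression in plain C (a minimal sketch; the struct and function names are mine, not from the thread). Each term needs one load whose address is known up front (the pointer fetch) and a second, dependent load whose address is unknown until the first completes, so the second level cannot be issued or deferred any earlier:

    ```c
    #include <assert.h>

    struct node { int member; };

    int sum_members(struct node **a, struct node **b, int i, int j) {
        struct node *pa = a[i];  /* level-1 load: address known early, deferrable */
        struct node *pb = b[j];  /* level-1 load: independent of the first */
        /* level-2 loads: each address depends on a level-1 result */
        return pa->member + pb->member;
    }

    int main(void) {
        struct node x = { 40 }, y = { 2 };
        struct node *a[1] = { &x }, *b[1] = { &y };
        assert(sum_members(a, b, 0, 0) == 42);
        return 0;
    }
    ```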

    I noticed that between the spiller, the metadata, and the exit predictor, you could probably make a very effective hardware scout: spill enough state to be able to restore to the point of the stall, run ahead dropping <Placeholder> values on the belt wherever a stall would occur, and fall back on the exit prediction whenever control flow depends on a <Placeholder> value.

  • nachokb
    Participant
    Post count: 11

    something as simple as a[i]->member + b[j]->member has the case where the Mill necessarily stalls twice

    I fail to see the first stall. Normally the first load batch, a[i] and b[j], should be amenable to being hoisted as much as possible (assuming a, b, i and j are known), such that, in cases where there’s not much to do in the meantime, it would mostly stall only on the second load batch.

    Also, I don’t see why the OoO would do any better in this context. In any case, the Mill compiler can see a “bigger window” to try to find something to do.

    From what I remember from the talks, the case of chained memory indirection (including OOP but not exclusively) was explicitly mentioned time and again. And it’s a problem everyone faces. Their solution is also mentioned: try really hard to hoist as early as possible.

    [a few minutes later]
    I now believe you bring this example in the context of loops (in which case the first hoist would be very hard to achieve). I’d think your best bet is unrolling as much as possible (the limit would be the number of retire stations?). Not sure it’s that much different on an OoO, though.
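
    The unrolling idea above can be sketched in C (my illustrative code, not from the thread): by unrolling, all the level-1 pointer loads are issued together, so their potential misses overlap in flight, and likewise for the dependent member loads, up to the number of outstanding loads (retire stations) the hardware supports:

    ```c
    #include <assert.h>

    struct node { int member; };

    /* Unroll by 4: the four pointer loads can all be in flight at once,
     * then the four dependent member loads can all be in flight at once. */
    int sum4(struct node **a, int i) {
        struct node *p0 = a[i + 0];
        struct node *p1 = a[i + 1];
        struct node *p2 = a[i + 2];
        struct node *p3 = a[i + 3];
        return p0->member + p1->member + p2->member + p3->member;
    }

    int main(void) {
        struct node n0 = { 1 }, n1 = { 2 }, n2 = { 3 }, n3 = { 4 };
        struct node *a[4] = { &n0, &n1, &n2, &n3 };
        assert(sum4(a, 0) == 10);
        return 0;
    }
    ```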

    I noticed that between the spiller, the metadata, and the exit predictor, you could probably make a very effective hardware scout

    Neat. I’d like to read their response. My guess: it all hangs on power consumption.

    • NXTangl
      Participant
      Post count: 21

      Maybe it was poorly worded. What I meant was that an OoO could find a way to overlap stalls where static instruction bundles could not.

      Suppose the scheduling is:

      ```
      load_offset(a, i, delay), load_offset(b, j, delay);
      // a[i], b[j] drop
      con(member_offset), load_offset(b0, b2, delay), load_offset(b1, b2, delay);
      ```

      Imagine a[i] hits but b[j] misses: now we have no choice but to stall. Then imagine a[i]->member misses but b[j]->member hits: now we have no choice but to stall again. An OoO processor, by contrast, can see that a[i] has hit and issue the load for a[i]->member before the load for b[j] retires.
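
      The difference can be shown with a toy latency model (entirely hypothetical numbers: hit = 1 cycle, miss = 100 cycles; function names are mine). In-order bundles serialize the two levels, so the two misses add; OoO lets each dependent load issue as soon as its own producer completes, so the two chains overlap:

      ```c
      #include <assert.h>

      enum { HIT = 1, MISS = 100 };  /* toy latencies, illustrative only */

      /* In-order bundles: the second bundle cannot issue until both
       * level-1 loads have retired, so the levels serialize. */
      static int inorder_finish(int lat_a1, int lat_b1, int lat_a2, int lat_b2) {
          int level1 = (lat_a1 > lat_b1) ? lat_a1 : lat_b1;
          int level2 = (lat_a2 > lat_b2) ? lat_a2 : lat_b2;
          return level1 + level2;
      }

      /* OoO: each dependent load issues as soon as its own producer
       * completes, so the a-chain and b-chain overlap. */
      static int ooo_finish(int lat_a1, int lat_b1, int lat_a2, int lat_b2) {
          int chain_a = lat_a1 + lat_a2;
          int chain_b = lat_b1 + lat_b2;
          return (chain_a > chain_b) ? chain_a : chain_b;
      }

      int main(void) {
          /* a[i] hits, b[j] misses; a[i]->member misses, b[j]->member hits */
          assert(inorder_finish(HIT, MISS, MISS, HIT) == 200); /* two full stalls */
          assert(ooo_finish(HIT, MISS, MISS, HIT) == 101);     /* one overlapped stall */
          return 0;
      }
      ```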
