Memory level parallelism and HW scouts

  • NXTangl
    Participant
    Post count: 21

The Mill’s design is excellent at extracting ILP from general code without resorting to OoO. That’s great when stalls are rare, and the Mill’s deferred loads can mostly hide stalls as long as the address is known. However, more complicated dependency chains can’t be deferred: even something as simple as a[i]->member + b[j]->member has a case where the Mill necessarily stalls twice while an OoO processor can get it down to a single stall. This article is probably relevant. Its arguments that ILP and branch prediction don’t matter don’t apply, since even with absurd stall times the Mill gets you software pipelining and speculation through width, and instruction prefetching way ahead of execution through the exit predictor; but the fact remains that the more MLP you can find, the better the rest of your performance can be. How do you plan on extracting MLP in cases with excessive memory indirection? After all, almost every object-oriented programming language (especially the dynamically typed kind) is going to have a ton of indirection, and the Mill is supposed to be able to run commercial software reasonably quickly without a rewrite.
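    For concreteness, here is a minimal C sketch of the double-indirection pattern in question (the struct and function names are illustrative, not from any Mill material):

        struct S { int member; };

        /* Two dependent loads per operand: the address of p->member is not
         * known until the load of a[i] retires, so only the first-level
         * loads can be hoisted/deferred; the second level can still miss. */
        int sum(struct S **a, struct S **b, int i, int j) {
            struct S *p = a[i];            /* first-level load: address known up front */
            struct S *q = b[j];
            return p->member + q->member;  /* second-level loads: addresses depend on p, q */
        }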

    I noticed that between the spiller, the metadata, and the exit predictor, you could probably make a very effective hardware scout: spill enough to be able to restore the state at the stall, run ahead dropping <Placeholder> values on the belt wherever a stall would happen, and fall back on the exit prediction whenever control flow depends on a <Placeholder> value.

  • nachokb
    Participant
    Post count: 12

    something as simple as a[i]->member + b[j]->member has the case where the Mill necessarily stalls twice

    I fail to see the first stall. Normally the first load batch, a[i] and b[j], should be amenable to being hoisted as much as possible (assuming a, b, i and j are known), such that, in cases where there’s not much to do in the meantime, it would mostly stall only on the second load batch.

    Also, I don’t see why an OoO would do any better in this context. In any case, the Mill compiler can see a “bigger window” in which to try to find something to do.

    From what I remember of the talks, the case of chained memory indirection (including, but not limited to, OOP) was explicitly mentioned time and again. And it’s a problem everyone faces. Their solution was also mentioned: try really hard to hoist as early as possible.

    [a few minutes later]
    I now believe you brought this example up in the context of loops (in which case the first hoist would be very hard to achieve). I’d think your best bet is unrolling as much as possible (the limit would be the number of retire stations?); see the sketch below. Not sure it’s that much different on an OoO, though.
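    A minimal C sketch of the unrolling idea (illustrative; whether four is the right factor would depend on the member’s retire-station count, as the parenthetical guesses):

        struct S { int member; };

        /* Unrolled loop: four independent first-level loads are in flight
         * at once, so their misses can overlap instead of serializing. */
        int sum_members(struct S **a, int n) {
            int total = 0, k = 0;
            for (; k + 4 <= n; k += 4) {
                struct S *p0 = a[k],     *p1 = a[k + 1];  /* all four loads issue   */
                struct S *p2 = a[k + 2], *p3 = a[k + 3];  /* before any must retire */
                total += p0->member + p1->member + p2->member + p3->member;
            }
            for (; k < n; k++)   /* scalar tail for leftover elements */
                total += a[k]->member;
            return total;
        }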

    I noticed that between the spiller, the metadata, and the exit predictor, you could probably make a very effective hardware scout

    Neat. I’d like to read their response. My guess: it all hangs on power consumption.

    • NXTangl
      Participant
      Post count: 21

      Maybe it was poorly worded. What I meant was that an OoO could find a way to overlap stalls where static instruction bundles could not.

      Suppose the scheduling is:

          load_offset(a, i, delay), load_offset(b, j, delay);
          // a[i], b[j] drop
          con(member_offset), load_offset(b0, b2, delay), load_offset(b1, b2, delay);

      Imagine a[i] hits but b[j] misses: now we have no choice but to stall. Then imagine a[i]->member misses but b[j]->member hits: now we have no choice but to stall again. An OoO processor, by contrast, can see that a[i] has arrived and issue the load for a[i]->member before the load for b[j] retires, overlapping the two misses.
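      With illustrative latencies (hit ≈ 3 cycles, miss ≈ 300 cycles; the numbers are assumptions for the sake of arithmetic, not measurements), the static schedule serializes the two misses:

          cycle   0:  issue load a[i], load b[j]
          cycle   3:  a[i] arrives (hit); b[j] still outstanding
          cycle 300:  b[j] arrives; issue load a[i]->member, load b[j]->member
          cycle 303:  b[j]->member arrives (hit); a[i]->member still outstanding
          cycle 600:  a[i]->member arrives; the add can finally issue  (~2 miss latencies)

      An OoO core can instead issue the load of a[i]->member around cycle 3, overlapping its miss with b[j]’s, and finish around cycle ~303: one miss latency instead of two.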

  • peceed
    Participant
    Post count: 5

    From what I understand, the problem is with code like the kernel below:

        aLotOfComputationsA(a[hash(i)]); // predictors don’t work
        aLotOfComputationsB(b[hash(i)]);
        aLotOfComputationsC(c[hash(i)]);
        aLotOfComputationsD(d[hash(i)]);
        // “One miss to stall them all!”

    The Mill is not able to handle such a scenario!

    It must be remembered that we have entered the era of huge L3 caches with access times of around 40 CPU cycles. This is an area where OoO does great and the Mill does not, being doomed to work in bursts. Additionally, OoO can add SMT practically for free, which reduces the occurrence of stalls by a factor of two; this is hard to beat, and the Mill has fundamental difficulties implementing SMT.
    Similarly, when processing the elements of a linked list, an OoO core can work through the chain of dependencies as fast as possible (see the sketch below). It should be noted, however, that garbage collectors sort memory objects, and this considerably improves the performance of data predictors; that’s why the hash example is the better one.
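    For contrast, a minimal C sketch of the linked-list case (illustrative): each next pointer is a serially dependent load, so any core is bound by the chain latency, and the best it can do is overlap the per-node work with the next fetch.

        #include <stddef.h>

        struct node { struct node *next; int val; };

        /* Pointer chasing: p->next is unknown until the previous load
         * retires, so the next-pointer loads form a serial chain; the
         * work on val can overlap the fetch of the following node. */
        int sum_list(const struct node *p) {
            int s = 0;
            while (p != NULL) {
                s += p->val;   /* independent work per node */
                p = p->next;   /* serially dependent load */
            }
            return s;
        }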
    If we add to this the widespread use of vector instructions, which can parallelize many algorithms quite analogously to the Mill, and the simultaneous migration of most computation from CPUs to GPU/tensor cores, then we can see that the economic window of opportunity for the Mill is closing, and it is in danger of becoming an alternative past.

    • Findecanor
      Participant
      Post count: 36

      To the question of OoO vs. The Mill’s deferred loads: I’d hold off on any judgement until I’ve seen benchmarks.
      I fully expect The Mill to be worse at some workloads but better at others. With statically scheduled code, performance relies a lot on the compiler.

      GPUs and CPUs are now so far apart that they are used for completely different workloads. Only some algorithms can be data-parallelised. AI workloads are also moving away from GPUs to dedicated low-precision matmul cores.
      Therefore I think that comparing them that way is a bit silly.

      We’re also entering the many-core era for CPUs. You can get a 192-core system today, with kilo-core systems on the way. I expect memory stalls to be more common on those than anywhere else.
      Those are found mostly in servers, where overall throughput of general-purpose code is important, as is power consumption.
      Again, this is an area where I’d await benchmarks.

      But being a data-flow processor with deferred loads is only one area where The Mill is different from conventional architectures.

      • peceed
        Participant
        Post count: 5

        Eleven years have passed since I first heard about the Mill. It’s a long time.
        The lack of realistic benchmarks is unfortunately very disappointing, especially since they could produce them with the simulator.
        There is no point in hiding from the competition; it does its own thing at its own pace anyway.

        I think the whole project lacks a business approach. It would have been possible to use European Union grants to create a Linux version that copes with the Mill memory model; thanks to the energy savings it is ‘pro-ecological’.
        All the fun of creating an ideal architecture immediately becomes pointless when it still relies on intermediate code, and the only markets where it can be adopted quickly are those based on compiled sources or virtual machines (LAMP/Java/.NET/Android/Apple). Nvidia and AMD change their GPU architecture every few years and it doesn’t matter.
        It was necessary to ‘show meat’ as soon as possible: how well Gold and Silver cope with mobile applications and JS, with the rest of the improvements shown later. The new security model is unnecessary for a first version, and the prototype could use ordinary uniform physical memory and run trusted code only.
        You need a ‘proof of concept’. The lack of one suggests that the benefits only appear after implementing the whole innovation, and that would mean the core is essentially weak.
        As I once pointed out, there is an embedded market that can do without the bells and whistles, and a first-generation product could operate with a 32-bit address space.
