Mill Computing, Inc. Forums › The Mill › Architecture › Memory level parallelism and HW scouts › Reply To: Memory level parallelism and HW scouts

peceed
Participant
Post count: 5

From what I understand, the problem is with code like the kernel below:

aLotOfComputationsA(a[hash(i)]); // predictors don't work
aLotOfComputationsB(b[hash(i)]);
aLotOfComputationsC(c[hash(i)]);
aLotOfComputationsD(d[hash(i)]);
// "One miss to stall them all!"

Mill is not able to handle such a scenario!

It must be remembered that we have entered the era of huge L3 caches with access times of around 40 CPU cycles, and this is an area where OoO does great while the Mill does not, being doomed to work in bursts. Additionally, OoO can use SMT practically for free, which roughly halves the occurrence of stalls – this is hard to beat, and the Mill has fundamental difficulties implementing SMT.
Similarly, when processing the elements of a linked list, OoO can work through the chain of dependencies as fast as possible. It should be noted, however, that garbage-collector algorithms sort memory objects, and this greatly improves the performance of data predictors – that's why the hash example is the better one.
If we add to this the widespread use of vector instructions, which can parallelize many algorithms quite analogously to the Mill, and the simultaneous migration of most computation from CPUs to GPU/tensor cores, then we can see that the economic window of opportunity for the Mill is closing, and it is in danger of becoming an alternative past.