Regarding pipelining loops, deferred loads and loop-carried data:
1. I’m glad a confusing issue was caught by an alert audience member, and I’ll look forward to the corrected slides/video as soon as posted. With the noticed mistake and the resulting need to redo slides, plus extra video editing, I suspect that we won’t have the video + animated slides online for at least a couple weeks. 🙁
IMHO, better to be later and correct, than quick and confusing. (Though I’m still waiting with imperfect patience.)
2. If I understand correctly, the number of *simultaneously-executing* deferred loads varies considerably among family members, because family members differ in the number of traffic-monitoring load buffers they have. Since the Mill intermediate representation is apparently compiled for an abstract Mill, the compiler doesn’t know how many (aliasing-immune) load buffers the specific target has. So how does the combination of the compiler and specializer handle the case where the compiler’s implementation of a pipelined loop requires more *simultaneous* in-flight deferred loads than the target hardware can support?
It seems to me that the compiler will consider most loops pipeline-able for the *abstract* Mill and thus emit its intermediate representation for a fully-software-pipelined loop — only to have the specializer potentially need to (partially, or worse fully) “de-pipeline” it, due to the target’s constraint on the number of simultaneous in-flight loads it can do.
How will the compiler and specializer jointly handle this case? Is this one of the cases where the compiler has to emit a series of alternative sequences for the specializer to choose from? Or can the specializer always implement any software-pipelined loop sequence, though it may need more instructions to do the pipeline-required loads using its limited load hardware?
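To make the concern concrete, here is a rough back-of-the-envelope sketch in Python. It is only my own illustration of the generic resource arithmetic behind software pipelining, not a description of how the Mill compiler or specializer actually works. All names and numbers (`loads_per_iter`, `load_latency`, `load_buffers`, the initiation interval `ii`) are hypothetical, and I am assuming each in-flight deferred load occupies one load buffer for its whole latency.

```python
import math


def loads_in_flight(loads_per_iter, load_latency, ii):
    """Peak number of loads in flight at steady state.

    Assumption: a new iteration starts every `ii` cycles and each load
    stays in flight for `load_latency` cycles, so up to
    ceil(load_latency / ii) copies of each load overlap.
    """
    return loads_per_iter * math.ceil(load_latency / ii)


def smallest_feasible_ii(loads_per_iter, load_latency, load_buffers, min_ii=1):
    """Smallest initiation interval whose in-flight loads fit the buffers.

    Returns None if even one iteration's worth of loads does not fit,
    i.e. the loop cannot be pipelined in this simple form at all.
    """
    # Beyond load_latency the overlap factor is already 1, so searching
    # further cannot reduce the number of loads in flight.
    upper = max(min_ii, load_latency)
    for ii in range(min_ii, upper + 1):
        if loads_in_flight(loads_per_iter, load_latency, ii) <= load_buffers:
            return ii
    return None


if __name__ == "__main__":
    # Hypothetical loop: 2 loads per iteration, 6-cycle load latency.
    for buffers in (16, 4, 1):
        print(buffers, smallest_feasible_ii(2, 6, buffers))
```

Under these assumptions, a member with plenty of buffers could run the loop at the tightest interval, a smaller member would have to stretch the interval (partially "de-pipelining" the loop), and a very small member could not use this form of the loop at all. Which of these outcomes the specializer can produce from a single intermediate representation is exactly what I am asking about.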
This issue of whether and how to software-pipeline loops for a given target member sounds like (a) rather a common case, especially given the Mill’s key goal to be the CPU that performs dramatically better on loop-heavy code, and (b) one with great variability (that the specializer will have to handle well), given the range of family members and their functional-unit populations.
IMHO, the details will need to wait for talks on the compiler and specializer, but I’m hoping for some early/high-level clarification that the Mill’s software pipelining of loops will work as close to optimally as possible on the full range of Mill targets — without additional/major surgery on the compiler and specializer.
Any comments would be most welcome.