For all posters, I recommend that you put one question in each posting, or at least collect questions on a single topic together so that replies can be coupled to that topic.
This reply addresses LarryP’s scratch-related questions.
1) several different pieces of loop-carried data
This makes no difference at all. The rotate operator is invoked once per iteration, and all the scratchpad data local to the iteration rotates, no matter where it came from. The scratch space must cover a loop’s distance worth of data, and individual LCVs may have shorter distances, for example,
A[i] = B[i] + B[i-6] + C[i] + C[i-10];
Here the loop distance is 10, but B has a distance of only six. It is true that the values from B will reside in the rotating scratch for four more iterations after they are dead before they are overwritten with newer values, but that’s harmless.
2) where the loop distance (times the size of each loop-carried datum) is greater than the largest-guaranteed scratchpad allocation.
This is our old friend the running-out-of-scratch problem. The Mill solution is “extended scratchpad” in spiller-owned memory, with scratch-like spill/fill semantics. There wasn’t time for extended in the talk, but it was described elsewhere here.
As for compilation, we believe (not that far yet to say “we have done”) that it is sufficient for the compiler to identify LCVs and their distances, and let the specializer deal with all spill/fill. Compiler output is a dataflow graph, so if the scheduler finds that a live datum will fall of the belt it must allocate scratch and insert the necessary spill and fill. Because the datum is marked with its distance, the specializer knows that a rotate op is needed and how many copies of the LCV will be live in the scratchpad. The rest is reasonably straightforward.