Constraints on Mill software pipelining:
Based on the pipelining talk and the discussion above, I’ve been thinking about how far Mill pipelining can be pushed vs. where it cannot go. Specifically, what properties would make a loop un-pipeline-able (or unprofitably pipelinable) on the Mill? — and how to recognize those cases. So far, I can see two situations where the Mill approach to pipelining will likely have problems. This isn’t meant as a criticism. I suspect that universal loop pipelining is probably infeasible with finite hardware. IMHO, the key is to recognize when pipelining either won’t work, or would give poorer performance than leaving the loop unpipelined.
1. Loops where there are several different pieces of loop-carried data, for example, coming from different arrays. Since the Mill’s rotate() operation acts only on the most recent scratchpad allocation, I don’t see an obvious way to software pipeline a loop that has loop-carried data from multiple arrays, especially if the loop-carried data have differing loop distances. I suspect this is a fairly rare case.
2. Loops with a single chunk of loop-carried data, but where the loop distance (times the size of each loop-carried datum) is greater than the largest-guaranteed scratchpad allocation. Again, I suspect this situation will happen occasionally, though most loops won’t run into this constraint.
The second of these situations suggests that there are some parameters that the compiler will need to know about, in order to emit intermediate code that the specializer can cause to run correctly on all Mills, even the little ones. In my second situation, the maximum scratchpad allocation permitted in a single scratchpad allocation operation (scratchf()?) on the smallest-scratchpad Mill, is probably one such parameter.
Ivan, if it won’t run into NYF or related issues, can you give us an idea of how many belt operands the scratchpad can hold, before scratchpad needs help from the spiller? Even a rough Tin — Gold range would be nice to have a rough idea about. (I have a rough guess, but I won’t speculate here, so as not to cause IP/patent problems.)
Similarly, I’m curious about what other parameters are “baked” into the Mill architecture. If I recall correctly, the bit-width of the address space is another such. (64 bit pointers, with 60 bits of true address + four reserved bits for garbage collection and forking.) Since pointers must fit in (single) belt positions, it sounds like this requires a minimum height of 8 bytes for all Mill family members. The shortest belt length mentioned to date is 8 for the Tin. I suspect that a shorter belt length (e.g. 4) would effectively rule out wide issue (since results need to live on the belt for at least the spill/fill latency.)
Similarly, two streams of half-bundles (at least until we get that hypercube memory ) and three blocks to decode within each half bundle. Other baked-in Mill-architecture parameters?