One must distinguish pipelining from vectorization; the two are often conflated but present very different issues.
Essentially any closed loop can be scalar pipelined on a Mill, although it may not be worthwhile. Because the compiler does not have all the code available (due to specialize-time emulation substitution and inlining of external functions and templates) we must determine the applicability of SPL in the specializer; this is done after all the control-flow folding has been done and the code is ready to schedule. If-conversion (including pick) is done at this stage, as well as conventional control-flow simplification like jump-to-jump elimination. We also flop the sense of branches so that the fall-through case is the likely one (if profile info is available) or the longest path (which gives better schedules). We also recognize loops at this point, and replace exiting branches with leave ops and put an inner op in the loop head (the dominator that is the entry to the loop).
The heuristic for SPL is straightforward: we count the latency of one full iteration (following the dataflow of execution) as if it were straight-line code, and count the number of ops in that code in the various resource categories. Because we know how many resources (of each category) are present on the target, we know the minimal number of issue cycles that are needed if every op were completely packed into the available width. If the linear latency (following the dataflow) is twice or more the packed latency (maximal entropy) then we SPL. The “twice” is heuristic’ we could pack three linear into two packed, for example, but don’t have enough experience to know if it’s worth it.
Scheduling is pretty much the same as for open code, except that open code uses an extensible linear tableau to allocate slots to ops, whereas SPL uses a slice of the tableau that is packed-latency high and schedules it as a torus, with wraparound as it looks for free slots. The schedule is guaranteed to be feasible by the definition of the packed latency, but the placement is greedy and does not necessarily lead to a feasible placement (the bin-packing problem is NP). If it fails then the packed latency is increased by one and the whole mess is rescheduled. This is guaranteed to lead to a feasible placement, because eventually the increase in packed-latency will be enough for an open-code placement, which is always feasible.
Once placement is done, we calculate belt lifetimes of intermediates. This is the same as for open code, although we must distinguish the intermediates from different iterations that are simultaneously live. As with open code, those lifetimes are measured in terms of relative position on an infinite belt, rather than in terms of cycles, and if the relative position between producer and consumer exceeds the actual belt size then we insert spill and fill ops at the producer and the first consumer that needs a lost operand. There can be several of these of course. If this is the first spill/fill we add a rotate op at the top of the loop and put a scratchf op (scratchpad allocation) in the loop head; these get fixed up later when we know how much scratchpad will be used. If we added any spill/fill then we throw away the schedule and redo everything from the top with the modified code. This too is guaranteed to reach a feasible schedule, because the limit is for every produced value to be spilled and the belt reduced to a single position. In practice, one reschedule is needed ~20% of the time, two in single-digit percents, and we’ve not found any code that needs more.
The torus schedule is a normal schedule and may embed control flow, and represents the steady state. Any dataflow on the back arc to the entry point represents either a recurrence (for example the loop control variable) or a loop-carried temporary. The recurrences are those with a dataflow from the loop head (or earlier) into the loop; those with flows in from the head but not on a back arc are initialization, not recurrences. Initialization flows are handled like any other dataflow across a join point, and may require injection of a conforms op to get all the arcs into the top of the loop to have a congruent belt. Back-arc-only flows require creation of a None value in the loop head, which acts as an initializer just like explicit initialization. And true recurrences need both the initial value and the injection of a recurs op at the join. We could be smarter about recurs injection; it’s only needed if a None can reach the recurrence. Truth be told, there’s probably other cases where we could be smarter, but by and large the resulting code looks decent.
While SPL needs very little from the compiler and is mostly specializer only, vectorization needs a lot from the compiler. We need the compiler to tell us the iteration interval, the expected iteration count (if it can tell), and the strides of all recurrences. These are hard for the specializer to figure out, so we don’t try to recognize vector loops in the specializer, so far anyway. From the target description we know what the vector size will be for each reference, and from the iteration interval we can tell if that size is feasible or if we have intra-vector recurrences. There are ways to handle such recurrences in the lit, but don’t hold your breath. If it’s feasible we transform the loop to one that is still a scalar loop but where each “scalar” is a vector of the appropriate size; this involves scaling the control variable calculations. The smear operations for while loops are inserted at this point, and the trailing reductions when the loop yields a vector that must be reduced to a scalar. The result is just a normal loop, but with some funny operations in the code, and is scheduled (including SPL) normally.
Only some of all this is working as I write this; in particular none of what the specializer needs from the compiler is yet available. If you have compiler experience (especially LLVM) and are willing and able to put in significant time on a sweat-equity basis then send me a resume.