Phasing is a separate issue and has nothing to do with piping. It makes a difference only in open code; in loops it is redundant with piping. We would still pipeline to the same degree if there were only one phase.
Analysis inside LLVM discovers that the value assigned to A[i] is the same value that the next iteration reads as A[i-1]. This is a routine analysis/optimization done by all production compilers. As a result of this analysis, the genAsm that the Mill specializer receives has only one load op in the loop body, plus a “warm-up” load of A[-1] before the loop is entered. The modified loop body pipes normally. Doing two iterations per cycle is essentially unrolling the loop once and then piping the unrolled loop; it adds complications in the prologue and epilogue if it cannot be shown that the total iteration count is evenly divisible by two, but the piping process itself is normal.
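As a rough source-level illustration of that transformation (not actual genAsm or LLVM output), here is a sketch in C. The loop `A[i] = A[i] + A[i-1]` is a hypothetical stand-in for the loop under discussion, and the function names are made up for the example. The second function carries the stored value forward in a register after a warm-up load of A[-1]; the third unrolls that loop once and handles an odd iteration count with one leftover iteration.

```c
#include <stddef.h>

/* Hypothetical source loop. `a` must point at least one element into
   its buffer so that a[-1] is a valid location. Two loads per iteration. */
void recur_naive(int *a, ptrdiff_t n) {
    for (ptrdiff_t i = 0; i < n; i++)
        a[i] = a[i] + a[i - 1];
}

/* After forwarding: the value stored to a[i] is kept in `prev` and reused
   as the next iteration's a[i-1], so only the load of a[i] remains. */
void recur_forwarded(int *a, ptrdiff_t n) {
    int prev = a[-1];                 /* warm-up load before the loop */
    for (ptrdiff_t i = 0; i < n; i++) {
        prev = prev + a[i];           /* the single load in the body */
        a[i] = prev;
    }
}

/* Unrolled once: two source iterations per trip through the body. */
void recur_unrolled2(int *a, ptrdiff_t n) {
    int prev = a[-1];                 /* same warm-up load */
    ptrdiff_t i = 0;
    for (; i + 1 < n; i += 2) {
        int x0 = prev + a[i];
        int x1 = x0 + a[i + 1];
        a[i] = x0;
        a[i + 1] = x1;
        prev = x1;
    }
    if (i < n)                        /* odd count: one leftover iteration */
        a[i] = prev + a[i];
}
```

All three functions should leave the array in the same state for any non-negative `n`; the leftover-iteration test in the last one is the source-level counterpart of the extra prologue/epilogue work needed when the count is not known to be even.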
The Mill has some special operations to simplify the prologue (retire(), inner()) and epilogue (leave()); these are touched on in the videos.