The Wikipedia article (https://en.wikipedia.org/wiki/Software_pipelining) is a pretty good intro to pipelining; the academic papers that Google turns up are more for the specialist due to often opaque notation, or are CS slide decks that don’t get very far into the subject.
In general, pipelining lets you fold the body of a loop in such a way that more than a single iteration is in flight concurrently. Along with other constraints on folding, there is a fundamental resource constraint: if the loop has two floating-point operations and the hardware has only one FP unit then you can’t fold the loop to fewer than two instructions. If the hardware has two FP units then you can (well, may be able to) fold the whole loop into one instruction. And if it has four then you may be able to fold two iterations into a single instruction. You do that by unrolling the loop once and then scheduling it as if the result were a single, bigger, iteration. There is also a notion of fractional iterations: if the hardware has three FP units you can unroll twice to get a body of three iterations with six FP ops that will schedule into two instructions with no FP units left over.
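The arithmetic behind those cases can be sketched in a few lines of Python. This is only a back-of-the-envelope model that looks at a single resource (FP units) and ignores every other constraint on folding; the function name best_unroll is made up for illustration and is not part of any Mill tool.

```python
from math import gcd

def best_unroll(ops_per_iteration, units):
    """Smallest loop body that keeps every unit busy, for one resource only.

    Returns (iterations_in_body, instructions_for_body). This is a
    hypothetical helper: it ignores dependencies and all other resources.
    """
    g = gcd(ops_per_iteration, units)
    iterations = units // g               # copies of the loop in the unrolled body
    instructions = ops_per_iteration // g # instructions needed to issue that body
    return iterations, instructions

# The loop from the text has two FP ops per iteration.
for fp_units in (1, 2, 3, 4):
    iterations, instructions = best_unroll(2, fp_units)
    print(fp_units, "FP units:", iterations, "iteration(s) in", instructions, "instruction(s)")
```

With three FP units the body holds three iterations (the original plus two unrolled copies), which is the fractional case: 1.5 iterations per instruction on average.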
As a matter of actual practice, the demands of a particular loop very rarely match the available resources exactly, so you schedule for the tightest constraint and leave some resources unused. In your example loop, on a Gold, there is one FP op but at least three flow ops (load, store, branch) and maybe more depending on how the compiler did the address update recurrence. Consequently on a Gold the tightest constraint will be the flow ops. Currently Gold has five flow slots, not quite enough to fit two iterations, although we have been playing around with a slightly richer configuration that has six and would handle two iterations. So yes, a Gold can issue ~30 ops per cycle, but only if they are the right kinds of ops that match the available set of functional units. Your loop is very unbalanced in its demands: Gold has four FP units, so your expression in the loop could have been:
A[i] = A[i]*2.0 + A[i+1]*3.0 + 4.0
and it would still fit in one instruction.
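To make the "tightest constraint" idea concrete, here is a rough Python sketch that counts slots per resource class. The Gold figures (four FP units, five flow slots, or six in the richer configuration) come from the text above; the op counts for the bigger expression (four FP ops, four flow ops) are my own assumption about how it would compile, and the function names are hypothetical.

```python
def iterations_per_instruction(demand, supply):
    """Whole iterations that fit in one instruction, counting slots only."""
    return min(supply[kind] // count for kind, count in demand.items() if count > 0)

def tightest_constraint(demand, supply):
    """Resource class with the least headroom relative to what the loop needs."""
    used = [kind for kind, count in demand.items() if count > 0]
    return min(used, key=lambda kind: supply[kind] / demand[kind])

gold = {"fp": 4, "flow": 5}         # figures quoted in the text
gold_richer = {"fp": 4, "flow": 6}  # the six-flow-slot configuration

simple_loop = {"fp": 1, "flow": 3}  # one FP op; load, store, branch
bigger_loop = {"fp": 4, "flow": 4}  # assumed: 2 mul + 2 add; 2 loads, store, branch

print(tightest_constraint(simple_loop, gold))                # flow
print(iterations_per_instruction(simple_loop, gold))         # 1
print(iterations_per_instruction(simple_loop, gold_richer))  # 2
print(iterations_per_instruction(bigger_loop, gold))         # 1
```

A real scheduler also has to respect latencies and dependencies, so these counts are only an upper bound on what can be folded.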
The FU population selected for each Mill family member is partly a matter of intended market and partly a matter of hardware practicality. Thus a member intended for HPC supercomputer use will have more FP than one intended for use as a micro-controller (which may have no hardware FP at all, and will rely on software emulation). On the hardware side the belt limits the total number of values dropped per cycle, and it’s unclear whether we can build belts larger than 32 positions. It’s easy and relatively unconstrained to add more ALUs and FP units, but four accesses to the data cache per cycle is about all the current hardware design can handle at the desired clock rate.
About the version with the “i-1”: loops like this have what are called loop-carried variables, where a value computed in one iteration is used in a following iteration. The standard technique for dealing with them on a general-register (genreg) machine, as originated by Monica Lam, is to unroll and inject reg-to-reg copies. This is completely unnecessary on the Mill, because the drops from prior iterations are still on the belt and can be addressed directly without unrolling or copying. The Cydra 5 and later the Itanium also addressed this issue, and Bob Rau invented rotating registers to deal with it. His solution made the registers into a circular queue, whereas the belt is non-circular, but for the purposes of loop-carried data they are the same.
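A toy Python model may help show the difference. It is not Mill code and does not reflect real belt timing: the belt is modelled as a fixed-length queue where each new result drops in at position 0 and older values slide back and eventually fall off the end. The recurrence (each result is twice the previous result plus the current input), the belt length of 8, and both function names are invented for illustration.

```python
from collections import deque

def run_with_copies(xs, seed):
    """General-register style: the carried value is moved by an explicit copy."""
    prev = seed
    out = []
    for x in xs:
        cur = prev * 2.0 + x
        out.append(cur)
        prev = cur  # stands in for the reg-to-reg copy
    return out

def run_with_belt(xs, seed, belt_length=8):
    """Belt style: the previous result is still on the belt, addressed by position."""
    belt = deque([seed], maxlen=belt_length)  # oldest values fall off the far end
    out = []
    for x in xs:
        belt.appendleft(x)                        # the load drops x at position 0
        belt.appendleft(belt[1] * 2.0 + belt[0])  # last result is now at position 1
        out.append(belt[0])
    return out

print(run_with_copies([1.0, 2.0, 3.0], 1.0))  # [3.0, 8.0, 19.0]
print(run_with_belt([1.0, 2.0, 3.0], 1.0))    # [3.0, 8.0, 19.0]
```

The belt version never names or copies the carried value; it only needs to know how many drops ago it was produced, which is fixed for a given loop body.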