Your explanations give a more realistic feel for what the Mill is capable of and what the relevant factors are. Assuming vector registers, one might even do a 4×4 matrix-vector multiply in a couple of cycles. That is impressive, imho.
About loop-carried variables: I do see how, conceptually, the carried variables can just stay on the belt, so the maximal loop distance is proportional to the length of the belt (ignoring spill/fill ops). What I was trying to get at is that for the “i-1” version of the loop you can’t do two iterations in one instruction because of the data dependency, whereas the “i+1” version would allow more iterations per instruction because nothing is carried from one iteration to the next (ignoring other factors/limits).
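To make the two cases concrete, here is a minimal C sketch of what I mean by the “i-1” and “i+1” versions. The function names and the exact loop bodies are my own illustration (I am assuming a simple add over two arrays), not code from the earlier posts.

```c
#include <stddef.h>

/* "i-1" form: iteration i reads a[i-1], which iteration i-1 has just
 * written. That is a true (read-after-write) loop-carried dependency,
 * so iteration i cannot start its add before iteration i-1 finishes. */
void scan_prev(int *a, const int *b, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* "i+1" form: iteration i reads a[i+1], which no earlier iteration has
 * written. No value is carried between iterations; the only ordering
 * constraint is that a[i+1] is loaded before iteration i+1 stores to it. */
void scan_next(int *a, const int *b, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = a[i + 1] + b[i];
}
```

In `scan_prev` the adds form one long chain, while in `scan_next` the adds of several iterations are independent of each other and could, in principle, be issued side by side.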
So what I really wanted to ask: How can a single instruction contain/execute data-dependent integer or FPU ops (say a+b+c)?
According to the Execution talk (#6), an instruction is phased/pipelined internally. As I understand it, an instruction (or bundle) is decoded over 3 cycles and executed over 3 cycles, but only one of the execution cycles is an ops phase, in which the execution units can be used (is that true?). I assume a+b+c can’t be computed during this one ops phase/cycle.