Pipelining on a Gold member of the Mill family

Author
Posts
Gregor Budweiser
Participant
October 24, 2018 at 5:44 am
Post count: 3
#3342
Hello everyone,
the Mill CPU is very interesting and all the talks helped me to better understand CPUs in general. Thanks for making the videos/slides available!
If I’m not completely mistaken, software pipelining roughly means loop unrolling (if not, the following probably doesn’t make sense).
1) How would the example from the pipelining talk (talk #9) look when compiled to a Gold member?
The example is this:
```
for(int i = 0; i < N; ++i)
  A[i] = A[i] + A[i+1];
```
Assuming a Gold Mill can decode ~33 instructions per cycle and that the load, add and store operations each take 1 cycle (as in the talk), that would give around 6-7 loop iterations per cycle (6 loads + 6 stores + 6 adds + loop stuff).
Is that roughly how it works?
2) Also what about this (note the -1 to the index):
```
for(int i = 0; i < N; ++i)
  A[i] = A[i] + A[i-1];
```
Can this be pipelined and executed in one cycle with the given dependencies?
Kind regards,
Gregor
Ivan Godard
Keymaster
October 24, 2018 at 10:37 am
Post count: 689
#3343
The Wikipedia article (https://en.wikipedia.org/wiki/Software_pipelining) is a pretty good intro to pipelining; the academic papers that Google turns up are more for the specialist due to often opaque notation, or are CS slide decks that don’t get very far into the subject.
In general, piping lets you fold the body of a loop in such a way that more than a single iteration is in flight concurrently. As well as other constraints on folding, there is a fundamental resource constraint: if the loop has two floating-point operations and the hardware has only one FP unit then you can’t fold the loop to fewer than two instructions. If the hardware has two FP units then you can (well, may be able to) fold the whole loop into one instruction. And if it has four then you may be able to fold two iterations into a single instruction. You do that by unrolling the loop once and then scheduling it as if the result were a single, bigger, iteration. There is also a notion of fractional iterations: if the hardware has three FP units you can unroll twice to get a body with six FP ops that will schedule into two instructions with no FP units left over.
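The resource arithmetic above can be sketched as a small calculation. This is only an illustration of the counting argument for a single kind of functional unit, not how the Mill specializer actually schedules; `fold_for_resource` and `struct fold` are hypothetical names, and all other constraints on folding are ignored.

```c
/* Greatest common divisor, used to find the smallest body that fills
   the units exactly. */
static unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical result type: how many loop iterations are scheduled
   together, and how many instructions that combined body needs. */
struct fold {
    unsigned iterations;
    unsigned instructions;
};

/* ops_per_iter: ops of one kind (e.g. FP) in one loop iteration.
   units:        functional units of that kind per instruction.
   Both must be nonzero. The result is the smallest body for which
   iterations * ops_per_iter == instructions * units, so no unit of
   this kind is left idle. */
static struct fold fold_for_resource(unsigned ops_per_iter, unsigned units) {
    unsigned g = gcd(ops_per_iter, units);
    struct fold f;
    f.iterations = units / g;
    f.instructions = ops_per_iter / g;
    return f;
}
```

For a loop with two FP ops, `fold_for_resource(2, 1)` gives one iteration in two instructions, `fold_for_resource(2, 4)` gives two iterations in one instruction, and `fold_for_resource(2, 3)` gives the fractional case: three iterations (six FP ops) in two instructions.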
As a matter of actual practice, the demands of a particular loop very rarely exactly match the resources available, so you schedule for the tightest constraint and leave some resources unused. In your example loop, on a Gold, there is one FP op but at least three flow ops (load, store, branch) and maybe more depending on how the compiler did the address update recurrence. Consequently on a Gold the tightest constraint will be the flow ops. Currently Gold has five flow slots, not quite enough to fit two iterations, although we have been playing around with a slightly richer configuration that has six and would handle two iterations. So yes, a Gold can issue ~30 ops per cycle, but only if they are the right kinds of ops that match the available set of functional units. Your loop is very unbalanced in its demands: Gold has four FP units, so your expression in the loop could have been:
A[i] = A[i]*2.0 + A[i+1]*3.0 + 4.0
and it would still fit in one instruction.
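The "tightest constraint" reasoning can be written down as a minimum over resource classes. The sketch below assumes exactly three flow ops per iteration (the post says "at least three") and uses the slot counts quoted above; `struct resource` and `iterations_per_instruction` are made-up names for illustration, and only whole iterations are counted.

```c
#include <stddef.h>

/* One resource class: how many ops of this kind a single iteration
   needs, and how many slots of this kind one instruction offers. */
struct resource {
    unsigned ops_per_iter;
    unsigned slots;
};

/* Number of whole iterations that fit in one instruction: the minimum
   over all classes of slots / ops_per_iter. Classes the loop does not
   use are skipped. Returns 0 if some class cannot hold even one
   iteration, or if the loop uses no class at all. */
static unsigned iterations_per_instruction(const struct resource *r, size_t n) {
    unsigned best = 0;
    int seen = 0;
    for (size_t i = 0; i < n; ++i) {
        if (r[i].ops_per_iter == 0)
            continue;
        unsigned fit = r[i].slots / r[i].ops_per_iter;
        if (!seen || fit < best) {
            best = fit;
            seen = 1;
        }
    }
    return best;
}
```

With `{1, 4}` for FP and `{3, 5}` for flow the result is 1; changing flow to `{3, 6}` gives 2, matching the six-slot configuration mentioned above. The richer expression has four FP ops, so `{4, 4}` with `{3, 5}` still gives 1.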
The FU population selected for each Mill family member is partly a matter of intended market and partly a matter of hardware practicality. Thus a member intended for HPC supercomputer use will have more FP than one intended for use as a micro-controller (which may have no hardware FP at all, and will rely on software emulation). On the hardware side the belt limits the total number of values dropped per cycle, and it’s unclear whether we can build belts larger than 32 positions. It’s easy and relatively unconstrained to add more ALUs and FP units, but four accesses to the data cache per cycle is about all the current hardware design can handle at the desired clock rate.
About the version with the “i-1”: loops like this have what are called loop-carried variables, where a value computed in one iteration is used in a following iteration. The standard technique for dealing with them on a genreg machine, as originated by Monica Lam, is to unroll and inject reg-to-reg copies. This is completely unnecessary on the Mill, because the drops from prior iterations are still on the belt and can be addressed directly without unrolling or copying. The Cydra 5 and later the Itanium also addressed this issue, and Bob Rau invented rotating registers to deal with it. His solution made the registers into a circular queue, whereas the belt is non-circular, but for the purposes of loop-carried data they are the same.
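As a rough illustration of why no copies are needed, here is a toy software model of a belt running the “i-1” loop. Everything in it is an assumption made for the sketch: the belt length of 8, the `struct belt` layout, and the `drop` and `run_carried_loop` helpers are invented names and say nothing about how the hardware is built. The point is only that the previous iteration's result sits at a known belt position when the add needs it.

```c
#include <stddef.h>

#define BELT_LEN 8 /* assumed length for this toy only */

struct belt {
    double pos[BELT_LEN]; /* pos[0] is the most recent drop */
};

/* Drop a new value at position 0. Older values move one position
   back and the oldest falls off the end. */
static void drop(struct belt *b, double v) {
    for (size_t i = BELT_LEN - 1; i > 0; --i)
        b->pos[i] = b->pos[i - 1];
    b->pos[0] = v;
}

/* Computes a[i] = a[i] + a[i-1] for i in [0, n).
   a must point at element 0 of the range, and a[-1] must be valid. */
static void run_carried_loop(double *a, size_t n) {
    struct belt b = {{0}};
    drop(&b, a[-1]); /* value carried into the first iteration */
    for (size_t i = 0; i < n; ++i) {
        drop(&b, a[i]);                /* load: pos[0] = A[i], pos[1] = carried value */
        drop(&b, b.pos[0] + b.pos[1]); /* add:  pos[0] = A[i] + A[i-1] */
        a[i] = b.pos[0];               /* store; pos[0] is the carried value next time */
    }
}
```

Called as `run_carried_loop(buf + 1, 4)` on `double buf[5] = {1, 2, 3, 4, 5}`, the carried value is read from the belt by position each time, with no separate copy step.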
Gregor Budweiser
Participant
October 25, 2018 at 5:50 am
Post count: 3
#3344
Many thanks!
Your explanations give a more realistic feel of what the Mill is capable of and what the relevant factors are. Assuming vector registers one might even do a 4×4-Matrix-Vector multiply in a couple of cycles. That is impressive imho.
About loop-carried variables: I do see how conceptually the carried variables can just stay on the belt and therefore the maximal loop distance is proportional to the length of the belt (ignoring spill/fill ops). What I was trying to get at is that for the “i-1” version of the loop you can’t do two iterations in one instruction because of the data dependency, whereas the “i+1” version would allow more iterations per instruction due to no dependencies (ignoring other factors/limits).
So what I really wanted to ask: How can a single instruction contain/execute data-dependent integer or FPU ops (say a+b+c)?
According to the Execution talk (#6), an instruction is phased/pipelined internally. As I understand it, an instruction or bundle is decoded over 3 cycles and executed over 3 cycles, but only one of the execution cycles is an op phase in which execution units can be used (is that true?). I assume a+b+c can’t be computed during this one op phase/cycle.
- Ivan Godard
  Keymaster
  October 25, 2018 at 10:36 am
  Post count: 689
  #3345
  Phasing is a separate issue, having nothing to do with piping. It makes a difference only in open code, not loops, where it is redundant to piping. We would still pipeline to the same degree if there were only one phase.
  Analysis inside LLVM discovers that the assignment to A[i] is the same value as the A[i-1] of the next iteration. This is a routine analysis/optimization done by all production compilers. As a result of this analysis, the genAsm the Mill specializer receives has only one load op in the loop body, and a “warm-up” load of A[-1] before the loop is entered. The modified loop body pipes normally. Doing two iterations per cycle is essentially unrolling the loop once and then piping the unrolled loop; it adds complications in the prologue and epilogue if it is not demonstrable that the total iteration count is evenly divisible by two, but the piping process itself is normal.
  The Mill has some special operations to simplify the prologue (retire(), inner()) and epilogue (leave()); these are touched on in the videos.
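The transformation described in this reply can be sketched at the source level. This is a hand-written C approximation of what the analysis produces, not actual LLVM or genAsm output; the function names are invented for the example, and `a[-1]` is assumed to be a valid element, as in the original loop.

```c
#include <stddef.h>

/* Original form: two loads per iteration (A[i] and A[i-1]). */
static void sum_prev_reference(double *a, ptrdiff_t n) {
    for (ptrdiff_t i = 0; i < n; ++i)
        a[i] = a[i] + a[i - 1];
}

/* After the analysis: one load per iteration. The value stored to
   A[i] is kept in prev and reused as A[i-1] by the next iteration. */
static void sum_prev_carried(double *a, ptrdiff_t n) {
    double prev = a[-1]; /* "warm-up" load before the loop */
    for (ptrdiff_t i = 0; i < n; ++i) {
        prev = a[i] + prev;
        a[i] = prev;
    }
}

/* Unrolled once (two iterations per pass), with an epilogue for the
   case where n is not evenly divisible by two. */
static void sum_prev_unrolled2(double *a, ptrdiff_t n) {
    double prev = a[-1];
    ptrdiff_t i = 0;
    for (; i + 1 < n; i += 2) {
        double x = a[i] + prev; /* first of the two iterations */
        prev = a[i + 1] + x;    /* second depends on the first */
        a[i] = x;
        a[i + 1] = prev;
    }
    if (i < n) /* leftover iteration when n is odd */
        a[i] = a[i] + prev;
}
```

All three functions produce the same array contents. The unrolled version still has the add-to-add dependency inside its body; it is the piping of that body, not the unrolling, that determines how much work overlaps.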
Gregor Budweiser
Participant
October 25, 2018 at 12:11 pm
Post count: 3
#3346
> Phasing is a separate issue, having nothing to do with piping. It makes a difference only in open code, not loops, where it is redundant to piping. We would still pipeline to the same degree if there were only one phase.
Oh dear, I didn’t realize that at all. In the decode talk you actually briefly mention that there are multiple execute phases (X0-X4+) per instruction/bundle. That would make dependent operations possible in a single instruction because they can execute in different cycles. I guess at this point I’ll have to carefully watch the talks again.
Thanks again. Your help is much appreciated!
- Ivan Godard
  Keymaster
  October 25, 2018 at 12:26 pm
  Post count: 689
  #3347
  Yes, phasing lets us put entire multi-operation dataflows in a single instruction. Piping lets us overlap what would be different instructions if it were open code and not a loop.