Reply To: Loop pipelining and aliasing

Ivan Godard
Keymaster
Post count: 689

Most of this is addressed in post http://millcomputing.com/topic/loop-pipelining-and-aliasing/#post-1180.

We won't have to shoot new video for the pipelining talk; only the slides have to be changed, so it should not add much delay (famous last words, I know).

Yes, only the specializer knows how many in-flight deferred loads are possible on the target. That constraint may force the specializer to produce a longer (more cycles) schedule for the piped loop than would be necessary without the constraint. As a simple example, if specializer policy is to always use a deferral large enough to hide the d$1 cache latency (three cycles) and there is only one load in the loop and only one retire station, then the schedule must use at least three cycles per iteration, whether the other slots have anything useful to do or not. Pipelining is still quite possible, and will be done, although the constraint may force no-ops that would not be necessary if loads were true one-cycle ops. This consideration was ignored in the talk, which was already over-long.
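As a rough illustration of that constraint (this is only a toy model, not Mill tooling; the function name and parameters are made up for the example): each in-flight load occupies a retire station for the length of its deferral, so the station-cycles demanded per iteration bound the cycles the schedule must span.

```c
#include <stdio.h>

/* Hypothetical sketch of the deferred-load constraint described above:
 * with a fixed deferral (e.g. 3 cycles to hide d$1 latency) and a limited
 * number of retire stations, each iteration must span enough cycles that
 * in-flight loads never exceed the stations available. */
static int min_cycles_per_iter(int loads_per_iter,
                               int deferral_cycles,
                               int retire_stations)
{
    /* Each load ties up a retire station for 'deferral_cycles' cycles,
     * so demand per iteration is loads * deferral station-cycles, while
     * supply per cycle is 'retire_stations'; hence the ceiling division. */
    int demand = loads_per_iter * deferral_cycles;
    return (demand + retire_stations - 1) / retire_stations;
}

int main(void)
{
    /* The case from the text: one load, 3-cycle deferral, one station. */
    printf("%d\n", min_cycles_per_iter(1, 3, 1)); /* prints 3 */
    return 0;
}
```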

There are fewer retire stations on smaller members, so a random loop is more likely to hit this constraint than on a bigger member. However, other constraints (such as shortage of ALU units for the loop work) will also pinch on smaller members, causing the pipeline to need more cycles per iteration anyway. These effects offset, and the members are configured so that over the range of loops in average code no particular one of the constraints is exceptionally pinchy compared to the others. The result is that pipelines will be slower on smaller and faster on bigger members, as one would expect.
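Continuing the same toy model (again, a sketch with invented names, not the actual specializer logic): each resource imposes its own lower bound on cycles per iteration, and the pipelined schedule can do no better than the tightest of them, which is why the load and ALU constraints tend to offset rather than compound.

```c
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Hypothetical sketch: the minimum cycles per iteration is the maximum of
 * the per-resource bounds (retire stations for loads, ALU slots for the
 * loop's arithmetic, and so on). */
static int min_ii(int loads, int deferral, int stations,
                  int alu_ops, int alu_slots)
{
    int by_loads = ceil_div(loads * deferral, stations);
    int by_alus  = ceil_div(alu_ops, alu_slots);
    return by_loads > by_alus ? by_loads : by_alus;
}
```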

The compiler does not emit alternates for pipes (we’re not even sure that it is worth having the compiler emit alternates in any cases, but the possibility is there). The compiler also does not do scheduling: compiler output is a pure dataflow graph, with no temporal notions at all. The algorithm that the specializer uses to pipe loads is described in the cited post.
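For a concrete picture of what "pure dataflow, no temporal notions" might look like, here is a hypothetical node layout; the real compiler/specializer interface is not shown in this thread, so take the field names as illustrative only.

```c
/* Hypothetical sketch: compiler output carries only operations and their
 * data dependencies; any cycle assignment is left to the specializer,
 * which knows the target's retire stations, slots, and latencies. */
typedef struct DataflowNode {
    const char           *op;       /* e.g. "load", "add", "store" */
    struct DataflowNode **inputs;   /* producers this node consumes */
    int                   n_inputs;
    int                   cycle;    /* -1 (unscheduled) in compiler output;
                                       filled in by the specializer */
} DataflowNode;
```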

Ask again if the explanation isn’t clear.