Phasing

Mill instructions are potentially very wide on the higher end machines, much wider even than in any of the VLIW designs of the past or in current GPUs: over 30 operations per instruction and per cycle.

To utilize such levels of parallelism, the operations within one instruction are assigned to different phases, which issue in consecutive cycles.

[[File:Phasing.png|alt=Phasing Chart]]

== The Phases in Order ==

Which phase an operation is assigned to depends on how it produces and consumes values, because this determines how the functional units implementing the operations must be wired on the chip.

The order those phases execute in is determined by data flow: producers before consumers, to maximize instruction level parallelism.

The phasing is almost directly expressed in the instruction Encoding, and the Decode order maps to phase order, too.

Just as the decode of one instruction is stretched over 3 cycles, so is the issuing of the operations in an instruction. Across 3 consecutive instructions, the phase 1 operations of instruction 3 issue in the same cycle as the phase 2 operations of instruction 2 and the phase 3 operations of instruction 1. And barring stalls this is the steady state in the system, over branches and everything: the operations of 3 different phases from 3 different instructions are issued every cycle.
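As a rough illustration, here is a toy C model of that steady state (not Mill code; the uniform three-phase split and the instruction count are simplifying assumptions), printing which instruction contributes which phase in each cycle:

<pre>
#include <stdio.h>

/* Toy model of phased issue: every instruction is split into 3 phase
   groups that issue on consecutive cycles. With that skew, a given
   cycle issues phase 1 of instruction N, phase 2 of instruction N-1
   and phase 3 of instruction N-2: three instructions in flight. */
int main(void) {
    const int PHASES = 3;
    const int INSNS  = 5;

    for (int cycle = 0; cycle < INSNS + PHASES - 1; cycle++) {
        printf("cycle %d:", cycle);
        for (int phase = 0; phase < PHASES; phase++) {
            int insn = cycle - phase;   /* instruction in this phase now */
            if (insn >= 0 && insn < INSNS)
                printf("  i%d/p%d", insn + 1, phase + 1);
        }
        printf("\n");
    }
    return 0;
}
</pre>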

=== Reader Phase ===

Reader operations produce values for later consumption and put them on the Belt, but have no dynamic operands that need to be pulled from the belt. Reader operations can occur in both the exu and the flow instruction stream. There are no data dependencies in reader operations.

Because the exu stream instruction format is optimized for many small operations with short hardware operands, the reader operations that query internal hardware state are placed on the exu side.

The flow side is optimized for large immediates, so larger direct constants are encoded there.

Reader phase operations are almost all latency 1. They take one cycle, so their results can immediately be consumed by the operations in the next phase. This is quite easy to guarantee, because they all either have immediate results available right from the decoder, or they access low latency hardware registers.

=== Compute Phase ===

This is where all the major data manipulation happens. Compute operations take values from the belt, although most have limited support for small immediates too, and produce new values from them to put on the belt again. Compute operation latency is always defined and fixed for each operation, as static scheduling requires, but can vary significantly. Compute operations can also occur in both instruction streams.

On the exu side are all the standard arithmetic and logic operations, and all floating point operations too. And since comparison results are produced explicitly and are not implicitly routed over global condition codes, the comparisons are all in there as well.

The most important operation on the flow side in the compute phase is [[instruction_Set/Load|load]]. It requires belt operands to compute addresses from, and then schedules the load to retire some time later. Pure address arithmetic without loads is also done on the flow side, sharing the same hardware.
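As a loose analogy in plain C (not Mill code; the cycle numbers in the comments are an assumed example latency, not real Mill timings), the point of issuing a load early is that independent compute work can fill the gap until the value retires:

<pre>
#include <stdio.h>

int main(void) {
    int a[4] = { 10, 20, 30, 40 };
    int i = 2;

    int *addr = &a[i]; /* cycle 0: flow side computes the address, load issues  */
    int x = 5 * 7;     /* cycles 0..2: independent compute work hides the delay */
    int v = *addr;     /* cycle 3 (assumed latency): load retires, value usable */

    printf("%d %d\n", x, v);
    return 0;
}
</pre>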

=== Call Phase ===

The call phase is a bit special. Its operations are only available in the flow instruction stream. And from the phasing perspective within an instruction they don't require any cycles, although they of course do need cycles to execute, and in reality take by far the most cycles of all the phases.

The call operation itself takes one cycle. But it puts the data flow within the current frame on hold and executes all the instructions in the called EBB chain, including all the other calls in there. And when there is more than one call in an instruction, they are called back to back, chaining into each other.

So while an instruction is oblivious to all the cycles spent in calls, they are by no means free. The only way an instruction might be able to tell is that the results of long latency compute operations become available after the call phase; but this, too, is statically scheduled.

Call is actually not the only operation in the call phase. There are several variants of call, like conditional calls. The inner operation for looping and promote belong in here as well.

=== Pick Phase ===

The pick phase truly contains only one operation, pick. And a special operation it is. It is encoded on the exu side and takes 3 Belt parameters. It has zero latency, because it is implemented in the rerouting of the Crossbar. This means its parallelism is not determined by the number of functional units that implement it; the number of available pick slots in an instruction is a quarter of the belt length.

What pick does is select one of two values based on a predicate and drop it at the front of the belt. This is the final step before all the data manipulations of the previous phases can be made permanent and become global state, and it is where you make sure the correct values are the ones that survive.
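In effect, pick is a hardware conditional select, roughly comparable to the C ternary operator. A minimal sketch (plain C, without the belt; the clamping use case is just an illustrative example):

<pre>
#include <stdio.h>

/* pick takes a predicate and two values and yields one of them --
   roughly the ternary operator in hardware, except that on the Mill
   it costs zero latency, being only a crossbar route rather than a
   functional unit operation. */
static int pick(int predicate, int if_true, int if_false) {
    return predicate ? if_true : if_false;
}

int main(void) {
    int value = 42, limit = 40;
    int over = value > limit;               /* compute-phase comparison */
    int clamped = pick(over, limit, value); /* pick-phase selection     */
    printf("%d\n", clamped);                /* prints 40                */
    return 0;
}
</pre>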

=== Writer Phase ===

The writer operations have input parameters from the belt, but drop no values back on it. As far as the internal data flow is concerned, writer operations are pure sinks.
That is not to say they do nothing. Quite the opposite: they almost all turn local values that were held on the belt into global state.

This is true in particular for the flow side writer operations, the two major categories being stores and branches. Stores obviously write values into memory, and branches define the control flow.

The writer operations on the exu side again are aimed at internal state. This can be local data, like spilling belt values into the Scratchpad, but also can be global state like granting and revoking access rights.

== Issue vs. Retire ==

It's important to emphasize that phasing determines the order operations issue in within an instruction, not the order they retire in. While the majority of operations take only one cycle, in which case the issue order indeed defines the retire order, there are many operations for which that is not so. Putting the operations into the right instructions to order their retire times appropriately is what Static Scheduling does at compile time.
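A minimal sketch of that scheduling constraint, assuming hypothetical latencies (plain C, not an actual compiler algorithm): an operation issued with instruction N and fixed latency L retires with instruction N + L, so its consumers must be placed no earlier.

<pre>
#include <stdio.h>

/* Static-scheduling constraint: the result of an op issued in
   instruction N with fixed latency L is first usable in instruction
   N + L. The latencies below are illustrative, not real Mill timings. */
struct op { const char *name; int issue_insn; int latency; };

int main(void) {
    struct op mul = { "mul", 1, 3 };  /* assumed 3-cycle multiply */
    struct op add = { "add", 4, 1 };  /* consumer of mul's result */

    int ready = mul.issue_insn + mul.latency;
    printf("%s retires with instruction %d; %s issues in %d: %s\n",
           mul.name, ready, add.name, add.issue_insn,
           add.issue_insn >= ready ? "valid schedule" : "INVALID");
    return 0;
}
</pre>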

This difference between issue and retire cycle is also what makes the cycle saving gains of phasing across control flow possible.
The branches of an instruction and the reader operations of the next instruction issue in the same cycle. Since reader operations don't depend on Belt operands, it is always safe to start decoding and issuing them. When the branch Prediction is correct, that cycle in the readers is saved; when it is not, the reader retires are canceled and there is a stall.
And at the end of a basic block, all the data writes and the logically following branches are issued in the same cycle, and no matter which way the branches go, the writes retire.

== Theory ==

The operations a computer performs can always be divided into different categories, depending on what they do. This goes back to the 3 kinds of terms in lambda calculus (variable, abstraction, application) or the equivalent 3 sets of symbols a Turing machine works with (state, alphabet, shift).

Even a one-instruction set computer has to perform the 3 tasks in its one instruction, generally as a compute, a compare and a control flow part of the instruction.
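The classic example is subleq ("subtract and branch if less than or equal to zero"). A small C interpreter shows all three aspects packed into the single operation; the program laid out in memory here (adding two numbers) is just illustrative:

<pre>
#include <stdio.h>

/* subleq a, b, c:  mem[b] -= mem[a];      -- the compute part
                    if (mem[b] <= 0)       -- the compare part
                        jump to c;         -- the control flow part
   A negative jump target halts the machine. */
int main(void) {
    int mem[] = {
        /* pc 0 */  9, 11,  3,   /* Z -= A; result <= 0, jump to 3 */
        /* pc 3 */ 11, 10,  6,   /* B -= Z, i.e. B += A            */
        /* pc 6 */ 11, 11, -1,   /* Z -= Z clears Z; 0 <= 0 halts  */
        /* 9  */    7,           /* A                              */
        /* 10 */    5,           /* B; ends up holding A + B       */
        /* 11 */    0            /* Z, scratch cell                */
    };
    int pc = 0;
    while (pc >= 0) {
        int a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
        mem[b] -= mem[a];                 /* compute      */
        pc = (mem[b] <= 0) ? c : pc + 3;  /* compare+jump */
    }
    printf("7 + 5 = %d\n", mem[10]);      /* prints 12    */
    return 0;
}
</pre>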

In all those cases the reading and writing of data happens implicitly or is described separately from the formalism, but it can go the other way around too. In transport triggered architectures only the reading and writing of data is explicit, and all the other aspects of computation, comparison and control are implicit, as side effects.

And there are many, many more equivalent theories of computing that all choose to make some of these aspects more explicit and others less so. Pretty much all programming languages and computer architectures do the same.

What is common to all these approaches is that they define an order, a precedence of how those parts interact. They all define a most basic directed data flow, and the power comes from repeating it over and over.

== Rationale ==

All computer architectures of course have to account for this. But in historical designs the engineering was primarily driven by the physical constraints of the hardware at the gate level. And all computer architectures in common use today are actually historical designs, conceived 30-40 years ago. So on the instruction level the logical data flow grouping was more or less ad hoc, wherever the bits and wires fit. The instruction streams are flat, and the data and control flows emerge from them ad hoc, too. The hardware had no real structure to work with, to expect and be prepared for. That is one of the reasons out-of-order execution exists: it looks ahead in the instruction flow and tries to bring all those flat, opaque instructions into a better ordered data and control flow for the available hardware. This is what the long pipelines are for: to have a window in which to find the data flows and then reorder them to best fit the hardware resources.

On the Mill those core data flows and their ordered phases are explicitly encoded in the instructions. Their most basic order is right there, within every instruction; usually several of them in parallel within every instruction, in fact. And stringing them together into an instruction stream funnels the data flow through the available hardware in an almost direct mapping.

In doing so, the usable instruction level parallelism is essentially tripled on average, because all three phases of the most basic data flow can be executed in parallel, just phase-shifted by one iteration. This works across control flow barriers, and indeed it is only across control flow barriers that it is a gain when compared to VLIW architectures. But control flow is very common, and usually the most severe roadblock to data flow; it is common enough to occur in almost every Mill instruction. And when there is no control flow, the Mill can exploit all its available width even more freely.

== Media ==

Presentation on Execution and Phasing by Ivan Godard - Slides