Pipeline

From Mill Computing Wiki

The functional units, organized into pipelines that are each addressed by a dedicated [[Decode|decoding hardware unit]] called a [[Slot]], are the workhorses where the computing and data manipulation happens.
  
<div style="position: absolute; left: 24em;">
[[File:Slot.png|Functional]]
</div>
  
== Overview ==

A pipeline is a cluster of functional units that contain circuits like multipliers, adders, shifters, or floating point units, but also circuits for calls, branches, loads, and stores. This grouping generally follows the [[Phasing|phases]] of the operations the functional units implement, and it also depends on which [[Slot]] issues the operations to them.
  
All the <abbr title="Functional Unit">FU</abbr>s in a pipeline also share the same two input channels from the belt, at least in the phases that take inputs from the belt.
  
The different operations grouped into one pipeline may have different latencies, i.e. they take different numbers of cycles to complete. Each FU in the pipeline can still be issued one new operation every cycle, because the FUs are fully pipelined.
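
As a rough software sketch of this behavior (operation names and latencies are invented for illustration, not taken from any real Mill specification), a fully pipelined FU can accept a new operation every cycle even though each individual operation takes several cycles to finish:

```python
# Toy model of a fully pipelined functional unit: one new operation
# may be issued every cycle, and each result retires a fixed number
# of cycles (its latency) after issue. Latencies here are invented.
LATENCY = {"add": 1, "shift": 1, "mul": 3}

def simulate(issues):
    """issues: list of (cycle, op, a, b). Returns {retire_cycle: [result, ...]}."""
    retired = {}
    for cycle, op, a, b in issues:
        result = {"add": a + b, "shift": a << b, "mul": a * b}[op]
        done = cycle + LATENCY[op]
        retired.setdefault(done, []).append(result)
    return retired

# Issue one operation per cycle; the mul issued in cycle 0 retires in
# cycle 3, together with the add issued in cycle 2.
out = simulate([(0, "mul", 6, 7), (1, "add", 1, 1), (2, "add", 2, 1)])
print(out)   # {3: [42, 3], 2: [2]}
```

Note how two results retire in the same cycle even though they were issued two cycles apart; the latency registers described below exist to catch exactly this case.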
  
To catch all the results that are issued in different cycles but may retire in the same cycle, there is a latency register for each possible latency in the pipeline for the values to be written into. These registers are dedicated: only their own pipeline can write into them.
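
As a toy illustration (the register count and latency set are invented), results that finish in the same cycle each land in the register dedicated to their operation's latency, so no result is lost:

```python
# Toy model: one dedicated register per possible latency in a pipeline.
# A 3-cycle op issued at cycle 0 and a 1-cycle op issued at cycle 2
# both finish at cycle 3; each writes its own latency register.
LATENCIES = (1, 2, 3)

class LatencyRegisters:
    def __init__(self):
        self.reg = {lat: None for lat in LATENCIES}

    def finish(self, latency, value):
        self.reg[latency] = value     # only this pipeline writes here

regs = LatencyRegisters()
# cycle 3: both operations retire simultaneously
regs.finish(3, "mul result")          # issued at cycle 0
regs.finish(1, "add result")          # issued at cycle 2
print(regs.reg)   # {1: 'add result', 2: None, 3: 'mul result'}
```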

=== Operands and Results ===

This is to reduce the number of connecting circuits between the functional units and registers. Those same latency registers are what is visible at a higher level as [[Belt]] locations. The trick is that they are addressed differently for writing, when new values are created, than for reading, when they serve as sources. As mentioned above, the registers can only be written by their dedicated pipeline; they are locally addressed. But they can serve as belt operands globally, to all pipelines. Just as the global addressing cycles through all the registers for reading, the local addressing of each pipeline cycles through its dedicated latency registers for writing. In contrast to the belt addressing, this is completely hidden from the outside; it is purely machine-internal state.
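
A minimal sketch of this dual addressing (purely illustrative; the class names and register counts are invented and the real hardware numbering is certainly different): each pipeline writes round-robin into its own dedicated registers, while reads are resolved through a global belt-position mapping onto those same registers:

```python
# Illustrative model: the same physical registers are written with
# per-pipeline local addressing and read with global belt addressing.
class Pipeline:
    def __init__(self, n_regs):
        self.regs = [None] * n_regs   # dedicated latency registers
        self.next_write = 0           # local index, cycles through own regs

    def write(self, value):
        self.regs[self.next_write] = value
        self.next_write = (self.next_write + 1) % len(self.regs)

class Belt:
    def __init__(self, pipelines):
        self.pipelines = pipelines
        self.order = []               # belt positions -> (pipeline, register)

    def drop(self, pipe_id, value):
        pipe = self.pipelines[pipe_id]
        slot = pipe.next_write
        pipe.write(value)             # local write addressing
        self.order.insert(0, (pipe_id, slot))

    def read(self, belt_pos):         # global read addressing
        pipe_id, slot = self.order[belt_pos]
        return self.pipelines[pipe_id].regs[slot]

pipes = [Pipeline(4), Pipeline(4)]
belt = Belt(pipes)
belt.drop(0, 10)   # pipeline 0 produces 10
belt.drop(1, 32)   # pipeline 1 produces 32
print(belt.read(0), belt.read(1))   # 32 10  (belt position 0 = newest drop)
```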
  
 
=== Result Replay ===

There are actually twice as many latency registers in each pipeline as you would need just to accommodate the values produced by the functional units. This is because while an operation is in flight over several cycles, a call or interrupt can happen, and the frame changes. The operations executing in the new frame need places to store their results, but the operations still running for the old frame do as well.
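
An illustrative model of why the registers are doubled (the "frame color" banking scheme below is an invented simplification, not the actual mechanism): results belonging to the interrupted frame and results produced by the new frame need homes at the same time, so each pipeline keeps one register set per frame side:

```python
# Invented sketch: two banks of latency registers per pipeline, so an
# in-flight result of the old frame and a result of the new frame can
# retire in the same window without clobbering each other.
class LatencyRegs:
    def __init__(self, n_latencies):
        self.banks = [[None] * n_latencies, [None] * n_latencies]
        self.current = 0              # which bank the current frame uses

    def frame_change(self):           # a call or interrupt happens
        self.current ^= 1

    def retire(self, latency, value, frame):
        self.banks[frame][latency] = value

regs = LatencyRegs(4)
old = regs.current
regs.retire(1, "fast-op", old)        # 1-cycle op of the old frame
regs.frame_change()                   # call arrives mid-flight
regs.retire(0, "callee-op", regs.current)
regs.retire(3, "slow-op", old)        # old frame's 3-cycle op lands safely
print(regs.banks)
```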
  
Within a call, of course, more calls can happen, and then the latency registers aren't enough anymore. There is other frame-specific state too. This is where the [[Spiller]] comes in: it monitors frame transitions and saves and restores all the frame-specific state that might still be needed, safely and transparently in the background. A big part of that is saving the latency register values that would be overwritten in the new frames, and then restoring them on return.
  
This saving of results when the control flow is interrupted or suspended in some way is called result replay. This is in contrast to execution replay on conventional machines, which throw away all transient state in that case and then restart all the, potentially very expensive, computations from the beginning.
  
=== Interaction between Pipelines ===

The main way pipelines interact with each other is by exchanging operands over the belt: the results one operation drops onto the belt become the operands for the next.
  
But pipelines have that two-operand input limitation, and there are some common, more complex operations that need more than two operands, or that have additional implicit operands on conventional architectures.
  
For such cases, certain functional units in adjacent pipelines work together, bundling their input and compute resources to implement those operations cooperatively. This process is called [[Ganging]].
  
The coordination between the two pipelines happens via two channels. For one, there are special two-slot operations defined that the decoder issues to the corresponding pipelines. But there are also simple and fast data connections between neighboring pipelines; those are the blue arrows in the diagram above.
Through those side channels, when triggered by those special operations, additional input operands, transient results, or operation state predicates can be passed from one pipeline to another without going through all the [[Crossbar]] wiring, even within the same cycle and the same phase. But only for those specially defined operations.
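
As a sketch of what ganging buys (the fused multiply-add here is chosen only as an example of a three-input operation; the source does not say which operations gang): each pipeline can read only two belt operands, so the work is split across two neighbors that hand an intermediate value over a direct side channel:

```python
# Illustrative ganging sketch: a three-input operation split across two
# neighboring two-input pipelines. The intermediate result crosses on a
# direct side channel instead of going through the belt and crossbar.
def gang_fma(a, b, c):
    partial = a * b          # pipeline N computes the product
    # side channel: partial goes straight to pipeline N+1,
    # bypassing the belt, within the same cycle
    return partial + c       # pipeline N+1 adds the third operand

print(gang_fma(3, 4, 5))     # 17
```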
  
Those data paths can also be used to save result replay values into the latency registers of a neighboring slot in case a pipeline's own latency registers are full, so the [[Spiller]] doesn't have to intervene unless it really needs to.
  
== Kinds of Pipelines ==

Each slot on a Mill processor is [[Specification|specified]] to have its own capabilities by having a specific set of functional units in its connected pipeline. This can go so far as every slot pipeline on every chip being unique, but usually this is not the case. The basic reader slots and the load and store units tend to be quite uniform, as do the call slots. The <abbr title="Arithmetic Logic Unit">ALU</abbr> pipelines are a little more diverse, depending on the expected workloads the processor is configured for, but still not wildly different. But the Mill architecture is certainly flexible enough to implement special purpose operations and functional units.
  
 
Consequently, the different blocks in the instructions [[Encoding|encode]] the operations for different kinds of slots. And if the [[Specification]] calls for it, every operation slot has its own unique operation encoding. All this is automatically tracked and generated by the Mill [[Synthesis]] software stack.

== Media ==

[http://www.youtube.com/watch?v=43kh4y3Mnhw Presentation on Execution by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2014/02/execution.02.pptx Slides]<br />
[http://www.youtube.com/watch?v=QGw-cy0ylCc Presentation on the Belt and Pipelines by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/2013-07-11_mill_cpu_belt.pptx Slides]

Latest revision as of 22:29, 31 October 2014
