Pipeline

The functional units, organized into pipelines addressable by and referred to as slots from the decoder, are the workhorses, where the computing and data manipulation happens.

== Barber Shop Metaphor ==

Like with all metaphors, a lot of detail is muddled or left out for the sake of a more general understanding, but here we go:

A slot can be compared to a barber shop. There are a number of barbers and hairdressers in the shop, each one specialized for a specific task. These are your functional units: the adders, shifters, multipliers etc.<br />
Each cycle a new customer, or operation request, can come in. Only customers with a statically scheduled appointment are allowed. So as soon as a customer arrives, it gets referred to the right barber in front of the right mirror, depending on what kind of work needs to be done.<br />
If it's just a buzz cut (like an integer add), the client is out as soon as the next one comes in.<br />
Perms (multiplies) take longer. So while the perm hairdresser is doing her thing, more customers with buzz cuts or shaves arrive and leave, and customers who also want perms can take their seats and will get treated. And when the first perm is done, that customer leaves together with all the other non-perm customers who came in in the meantime and happen (or rather, are scheduled) to be done at the same moment.

The weird thing about this barber shop is that, because it only takes appointments, there is no waiting for new arrivals. But sometimes one of the bars across the street (the call functional units) takes over the whole street (the [[Belt]]) to throw a party, and none of the customers can leave. That is why there is a wait queue for leaving the barber shop (the latency registers). The partygoers even send people over to shave their heads and the like. But as long as the party goes on, nobody can leave who doesn't belong to the party. This can get to the point where customers are sent to wait in the queues of the neighboring barber shops, if they have room.<br />
And if it really gets out of hand and more bars start throwing parties, the waiting customers have to take a room in the pension in the backyard (the [[Spiller]]) until it all quiets down. When things are normal again, everyone in the queues can leave at once, and the others come back from the pension into the queues and then leave too.

== Overview ==

A slot is a cluster of functional units, like multipliers, adders, shifters, or floating point units, but also for calls and branches and loads and stores. This grouping generally follows the [[Phasing|phases]] of the operations the functional units implement, and it also depends on which [[Decode|decoder]] issues the operations to them.

All the <abbr title="Functional Unit">FU</abbr>s in a slot pipeline also share the same 2 input channels from the belt, at least for the phases that have inputs from the belt.

The different operations grouped into one slot may have different latencies, i.e. they take a different number of cycles to complete. Each FU in the slot can still be issued one new operation every cycle, because the FUs are fully pipelined.
  
To catch all the results that are issued in different cycles, but may retire in the same cycle, there are latency registers for each latency possible in the pipeline for the values to be written into. They are dedicated to each slot, so only that pipeline can write into them.
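
To make the timing concrete, here is a minimal, purely illustrative sketch (not Mill hardware or tooling; the operation names and latencies are invented) of a single slot that is issued one operation per cycle. A 3-cycle multiply and a later 1-cycle add finish in the same cycle, and each result lands in the latency register matching its own latency:

<syntaxhighlight lang="python">
# Toy model of one fully pipelined slot with per-latency result registers.
OPS = {"add": 1, "mul": 3}   # invented latencies, in cycles

def run(schedule):
    """schedule: one (op, operands) pair issued per cycle."""
    in_flight = []                                  # (retire_cycle, latency, result)
    for cycle, (op, args) in enumerate(schedule):
        lat = OPS[op]
        result = args[0] + args[1] if op == "add" else args[0] * args[1]
        in_flight.append((cycle + lat, lat, result))

        # Everything whose time has come retires now, each value into the
        # latency register for its own latency.
        retiring = {f"lat{l}": r for (c, l, r) in in_flight if c == cycle + 1}
        if retiring:
            print(f"end of cycle {cycle + 1}: {retiring}")

# The mul issued in cycle 0 and the add issued in cycle 2 retire together.
run([("mul", (6, 7)), ("add", (1, 1)), ("add", (2, 3))])
</syntaxhighlight>
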
=== Operands and Results ===

Dedicating the latency registers to one slot for writing is what reduces the number of connecting circuits between all the functional units and registers. Those same latency registers are what is visible on a higher level as [[Belt]] locations. The trick here is that they are addressed differently for writing into them, when new values are created, and for reading them, when they serve as sources. As mentioned above, the registers can only be written by their dedicated slot; they are locally addressed. But they can serve as belt operands globally to all slots. In the same way that the global addressing cycles through all the registers for reading, the local addressing cycles, per slot, through all the dedicated latency registers for writing. In contrast to the belt addressing, this is completely hidden from the outside; it is purely machine-internal state.

As explained above, each slot can retire at least as many results each cycle as it has result-yielding operations of different latency in the pipeline. There can be more if there are multi-result operations. But they all go onto the belt in a defined order.<br />
The number of input operands is far more restricted. Most slots only allow 2. In the reader phase there are a few slots with only one operand, and those usually are immediates.

This reduction of data paths is one of the major factors that contributes to the Mill's low power consumption.
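
As a rough sketch of the two addressing schemes (purely illustrative; the sizes and the renaming mechanism here are assumptions, not the actual hardware design), the physical storage can be modelled as per-slot latency registers that are written by local index only, while belt reads go through a global map from belt position to whichever register currently holds that operand:

<syntaxhighlight lang="python">
# Per-slot latency registers (local write addressing) fronted by a global
# belt map (global read addressing). All sizes here are invented.
class SlotRegs:
    def __init__(self, n_latencies):
        self.regs = [None] * n_latencies

    def write(self, local_index, value):       # only the owning slot writes here
        self.regs[local_index] = value

class Belt:
    def __init__(self, slots, length=8):
        self.slots = slots
        self.length = length
        self.map = []                          # belt position -> (slot, local index)

    def drop(self, slot_id, local_index, value):
        self.slots[slot_id].write(local_index, value)
        self.map.insert(0, (slot_id, local_index))   # newest drop becomes b0
        del self.map[self.length:]                   # oldest positions fall off

    def read(self, pos):                       # any slot may read any position
        slot_id, local_index = self.map[pos]
        return self.slots[slot_id].regs[local_index]

slots = {0: SlotRegs(3), 1: SlotRegs(3)}
belt = Belt(slots)
belt.drop(0, 1, 42)          # slot 0 writes into its own latency register 1
belt.drop(1, 0, 7)           # slot 1 writes into its own latency register 0
print(belt.read(0), belt.read(1))   # -> 7 42, addressed globally by belt position
</syntaxhighlight>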
  
 
=== Result Replay ===

Because of the different latencies of the different operations in one slot, any time there are delays, from frame changes in interrupts or calls, or just from scheduled delays, the results of those operations must be saved until they can be dropped on the [[Belt]]. This is what the latency registers are for. There is one latency register dedicated to the results of each possible latency of the operations within a slot, so when one operation of each latency finishes in the same cycle, all the results have a place to go. The latency registers also catch the results of consecutive cycles if they cannot be dropped on the belt yet, and so form a kind of stack, shifting the results along. If too many results stack up, eventually the [[Spiller]] saves the oldest values, and when interrupts and subroutines return and delays end, those values are restored in the correct order and finally retired to the belt.

There is actually double the number of latency registers for each pipeline than you would need just to accommodate the values produced by the functional units. This is because while an operation is in flight over several cycles, a call or interrupt can happen, and the frame changes. The operations executing in the new frame need places to store their results, but the operations still running for the old frame do as well.

Within a call, of course, more calls can happen, and then the latency registers aren't enough anymore. And there is other frame-specific state too. This is where the [[Spiller]] comes in: it monitors frame transitions, and saves and restores all the frame-specific state that might still be needed, safely and transparently in the background. A big part of that is saving the latency register values that would be overwritten in the new frames, and then restoring them on return.

This saving of results when the control flow is interrupted or suspended in some way is called result replay. It is in contrast to execution replay on conventional machines, which throw away all transient state in that case and then restart all the, potentially very expensive, computations from the beginning.
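
The mechanism can be pictured with a toy model (purely conceptual; the number of banks and the spiller protocol here are assumptions for illustration, not the actual design): each frame gets its own bank of latency registers, the pipeline keeps banks for the two most recent frames, and anything deeper is handed to the spiller and restored on return.

<syntaxhighlight lang="python">
# Conceptual sketch of result replay for one slot pipeline. Invented sizes.
class ReplaySlot:
    def __init__(self, n_latencies=3):
        self.banks = [[None] * n_latencies]    # banks[-1] = current frame
        self.spilled = []                      # older banks, kept by the spiller
        self.n = n_latencies

    def write(self, latency_index, value):     # an in-flight result lands here
        self.banks[-1][latency_index] = value

    def call(self):                            # frame change on call/interrupt
        self.banks.append([None] * self.n)
        if len(self.banks) > 2:                # only two banks per pipeline,
            self.spilled.append(self.banks.pop(0))   # so the spiller takes the oldest

    def ret(self):                             # frame change back on return
        self.banks.pop()
        if self.spilled and len(self.banks) < 2:
            self.banks.insert(0, self.spilled.pop())  # spiller restores a bank

slot = ReplaySlot()
slot.write(2, "caller result, still in flight")
slot.call()                                    # callee runs with fresh registers
slot.write(0, "callee result")
slot.call()                                    # deeper call: oldest bank spills
slot.ret()
slot.ret()
print(slot.banks[-1][2])   # -> caller result, still in flight (replayed)
</syntaxhighlight>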
  
 
=== Interaction between Slots ===
 

The main way slots interact with each other is by exchanging operands over the [[Belt]]: the results one operation drops onto the belt become the operands of the next.

Some types of neighboring slots, though, can pass operands along to each other in the middle of the pipeline. The primary use of this is operand gangs, which overcome the severe input operand restrictions for some special-case operations.
Those data paths can also be used to save result replay values when a slot's own latency registers are full, which is a lot faster than going to the [[Spiller]] right away.
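
The following toy sketch shows the idea of a gang (the 3-input fused multiply-add and the channel layout are hypothetical examples, not the actual Mill operation set): an operation that needs more inputs than one slot's two channels provide borrows an input channel of the neighboring slot.

<syntaxhighlight lang="python">
# Toy illustration of an operand gang: a hypothetical 3-input fused
# multiply-add borrows one input channel from the neighboring slot.
def ganged_fma(slot_inputs, neighbor_inputs):
    a, b = slot_inputs         # the two input channels of the issuing slot
    c, _ = neighbor_inputs     # one channel passed over from the neighbor
    return a * b + c           # one fused result, dropped by the issuing slot

print(ganged_fma((3, 4), (5, None)))   # -> 17
</syntaxhighlight>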

=== Kinds of Slots ===

Each slot on a Mill processor is specified to have its own capabilities through the specific set of functional units in it. This can go so far as every slot pipeline on a chip being unique, but usually this is not the case. The basic reader slots and the load and store units tend to be quite uniform, as do the call slots. The ALU slots are a little more diverse, depending on the expected workloads the processor is configured for, but still not wildly different. But the Mill architecture certainly is flexible enough to implement special purpose operations and functional units.

Consequently, the different blocks in the instructions encode the operations for the different kinds of slots. And if the Specification calls for it, every operation slot has its own unique operation encoding. All this is automatically tracked and generated by the Mill Synthesis software stack.
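
As a purely hypothetical illustration of what a per-member slot specification might capture (the member, slot names, and FU mixes are invented, not the actual specification format used by the Mill tools):

<syntaxhighlight lang="python">
# Hypothetical per-member slot specification: which functional units each
# slot pipeline contains on one imaginary Mill member.
EXAMPLE_MEMBER = {
    "reader0": {"con", "rd"},
    "reader1": {"con", "rd"},
    "exu0":    {"add", "shift", "mul"},
    "exu1":    {"add", "shift"},
    "flow0":   {"load", "store", "call", "branch"},
}

def slots_supporting(op, member=EXAMPLE_MEMBER):
    """Which slots an operation could be encoded for on this member."""
    return [slot for slot, fus in member.items() if op in fus]

print(slots_supporting("add"))    # -> ['exu0', 'exu1']
print(slots_supporting("mul"))    # -> ['exu0']
</syntaxhighlight>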