Pipeline

From Mill Computing Wiki
Revision as of 21:29, 8 August 2014 by Jan (Talk | contribs)

The functional units are the workhorses, where the computing and data manipulation happen. They are organized into pipelines, which the decoder addresses and refers to as slots.

Barber Shop Metaphor

As with all metaphors, a lot of detail is muddled or left out for the sake of a more general understanding, but here we go:

A slot can be compared to a barber shop. There are a number of barbers and hairdressers in the shop, each one specialized for a specific task. These are your functional units: the adders, shifters, multipliers, etc.
Each cycle a new customer, or operation request, can come in. Only customers with a statically scheduled appointment are allowed. So as soon as a customer arrives, they are referred to the right barber in front of the right mirror, depending on what kind of work needs to be done.
If it's just a buzz cut (like an integer add), the customer is out as soon as the next one comes in.
Perms (multiplies) take longer. So while the perm hairdresser is doing her thing, more customers wanting buzz cuts or shaves arrive and leave, and customers who also want perms can take their seats and get treated. And when the first perm is done, that customer leaves together with all the other non-perm customers who came in in the meantime and happen (or rather, are scheduled) to be done at the same moment.

The weird thing about this barber shop is that, because it only takes appointments, there is no waiting for new arrivals. But sometimes one of the bars across the street (the call functional units) takes over the whole street (the Belt) to throw a party, and none of the customers can leave. That's why there is a wait queue for leaving the barber shop (the latency registers). The party even sends people over to have their heads shaved and the like. But as long as the party goes on, nobody can leave who doesn't belong to the party. This can get to the point where customers are sent to wait in the queues of the neighboring barber shops, if those have room.
And if it really gets out of hand and more bars start throwing parties, the waiting customers have to take a room in the pension in the backyard (the Spiller) until it all quiets down. When things are normal again, everyone in the queues can leave at once; the others come back from the pension into the queues and then leave too.

Technical Overview

Now for real. A slot is a cluster of functional units (FUs), such as multipliers, adders, shifters, or floating point units, but also units for calls, branches, loads, and stores. The grouping generally follows the phases of the operations the functional units implement. The operations grouped into one slot may still have different latencies and take different numbers of cycles to complete. This is possible because the FUs are fully pipelined and can be issued an operation every cycle.
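
To make the effect of full pipelining concrete, here is a minimal Python sketch of a single slot. The operation names and latencies in `LATENCY` are made-up illustration values, not figures for any real Mill family member, and `simulate_slot` is a hypothetical helper rather than part of any Mill tool.

```python
from collections import defaultdict

# Hypothetical latencies in cycles; real values depend on the Mill member.
LATENCY = {"add": 1, "shift": 1, "mul": 3}


def simulate_slot(schedule):
    """Issue at most one operation per cycle and report when each retires.

    `schedule` maps an issue cycle to an operation name. The return value
    maps each retire cycle to the operations that finish in that cycle.
    """
    retiring = defaultdict(list)
    for cycle in sorted(schedule):
        op = schedule[cycle]
        # Fully pipelined: issuing never waits for operations still in flight.
        retiring[cycle + LATENCY[op]].append(op)
    return dict(retiring)


# A multiply issued in cycle 0, followed by two adds in cycles 1 and 2.
schedule = {0: "mul", 1: "add", 2: "add"}
result = simulate_slot(schedule)
for cycle in sorted(result):
    print(cycle, result[cycle])
```

In this toy schedule the multiply and the second add finish in the same cycle, which is the situation in the metaphor where the perm customer leaves together with a buzz cut customer who arrived later.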

Result Replay

Because the operations in one slot have different latencies, any delay, whether from frame changes in interrupts or calls or simply from scheduled delays, means the results of those operations must be saved until they can be dropped on the Belt. This is what the latency registers are for. There is one latency register dedicated to the results of each possible latency of the operations within a slot. This way, when one operation of each latency finishes in the same cycle, all the results have a place to go. The latency registers also catch the results of consecutive cycles if they cannot be dropped on the Belt yet, forming a kind of stack that shifts the results along. If too many results stack up, the Spiller eventually saves the oldest values. When interrupts and subroutines return and delays end, those values are restored in the correct order and finally retired to the Belt.

This saving of results when the control flow is interrupted or suspended is called result replay. It contrasts with execution replay on conventional machines, which throw away all transient state in that case and then restart all the computations, some of them potentially very expensive, from the beginning.
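
The bookkeeping behind result replay can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: the class `ResultBuffer`, its `capacity` parameter, and the result names are hypothetical, and the real hardware keeps one latency register per latency rather than a single queue.

```python
from collections import deque


class ResultBuffer:
    """Toy model of one slot's result holding: latency registers plus Spiller."""

    def __init__(self, capacity):
        self.capacity = capacity   # hypothetical number of latency registers
        self.registers = deque()   # held results, oldest at the left
        self.spilled = deque()     # older results handed over to the Spiller

    def hold(self, value):
        """Keep a finished result while the Belt is unavailable."""
        if len(self.registers) == self.capacity:
            # Registers are full: the oldest value moves out to the Spiller.
            self.spilled.append(self.registers.popleft())
        self.registers.append(value)

    def replay(self):
        """Return all held results, oldest first, and empty the buffer."""
        ordered = list(self.spilled) + list(self.registers)
        self.spilled.clear()
        self.registers.clear()
        return ordered


buffer = ResultBuffer(capacity=2)
for value in ["r0", "r1", "r2", "r3"]:   # results finishing during a call
    buffer.hold(value)

spilled_during_call = list(buffer.spilled)   # what overflowed to the Spiller
belt_drops = buffer.replay()                 # the call returns, results retire
print(spilled_during_call, belt_drops)
```

The point of the sketch is that nothing is recomputed: every result is produced once, parked, and later retired in its original order.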

Operands and Results

As explained above, each slot can retire at least as many results each cycle as it has result-yielding operations of different latencies in the pipeline. There can be more if there are multi-result operations. But all of those go on the Belt in a defined order.
The number of input operands is far more restricted. Most slots allow only two. In the reader phase there are a few slots with only one operand, and those operands are usually immediates.

This reduction of data paths is one of the major factors contributing to the Mill's low power consumption.

Interaction between Slots

The main way slots interact with each other is by exchanging operands over the Belt. The results that one operation drops onto the Belt become the operands for the next.
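
As a rough illustration of this hand-off, the following Python sketch models the Belt as a fixed-length queue in which new results are dropped at the front and operands are named by their current position. That queue model, the belt length of 4, and the `Belt` class itself are assumptions made for the example; this page does not specify them.

```python
from collections import deque


class Belt:
    """Toy belt: a fixed-length queue, newest value at position 0 (assumed)."""

    def __init__(self, length):
        # With maxlen set, appendleft discards the oldest value at the far end.
        self.items = deque(maxlen=length)

    def drop(self, value):
        """Retire a result onto the front of the belt."""
        self.items.appendleft(value)

    def read(self, position):
        """Fetch an operand by its current belt position."""
        return self.items[position]


belt = Belt(length=4)
belt.drop(6)                             # e.g. two values from reader slots
belt.drop(7)
product = belt.read(0) * belt.read(1)    # a multiply slot consumes both
belt.drop(product)
total = belt.read(0) + belt.read(2)      # an add slot uses the fresh product
belt.drop(total)
print(list(belt.items))
```

Each operation takes its inputs from belt positions and its result becomes a new belt entry that later operations, possibly in other slots, can pick up.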

Some kinds of neighboring slots, though, can pass operands to each other in the middle of the pipeline. The primary use of this is operand gangs, which overcome the severe input operand restrictions for some special-case operations.
Those data paths can also be used to save result replay values when a slot's own latency registers are full, which is a lot faster than going to the Spiller right away.

Kinds of Slots

Each slot on a Mill processor is specified to have its own capabilities by having a specific set of functional units in it. This can go so far that every slot pipeline on a chip is unique, but usually this is not the case. The basic reader slots and the load and store units tend to be quite uniform, as do the call slots. The ALU slots are a little more diverse, depending on the expected workloads the processor is configured for, but still not wildly different. The Mill architecture is certainly flexible enough, though, to implement special-purpose operations and functional units.

Consequently the different blocks in the instructions encode the operations for different kinds of slots. And if the Specification calls for it, every operation slot has its own unique operation encoding. All this is automatically tracked and generated by the Mill Synthesis software stack.
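
The idea of per-slot encodings derived from a specification can be sketched as follows. Everything here is hypothetical: the `SPEC` dictionary, the slot and operation names, and the `build_encodings` function are invented for illustration and do not reflect the actual specification format or the output of the Mill tools.

```python
# Hypothetical specification: the operations each slot supports.
SPEC = {
    "alu0": ["add", "sub", "shift", "mul"],
    "alu1": ["add", "sub", "shift"],
    "call0": ["call", "branch"],
}


def build_encodings(spec):
    """Give every slot its own opcode table with the fewest bits it needs."""
    tables = {}
    for slot, ops in spec.items():
        # Bits needed to number len(ops) operations, with a minimum of one.
        width = max(1, (len(ops) - 1).bit_length())
        tables[slot] = {
            op: format(code, f"0{width}b") for code, op in enumerate(ops)
        }
    return tables


encodings = build_encodings(SPEC)
for slot, table in encodings.items():
    print(slot, table)
```

Because each table is generated per slot, the same bit pattern can mean different operations in different slots, and a slot with few operations spends fewer bits on its opcode field.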