Single register
Tagged: belt length, markets, morsel
int endIndex = 300;
int index = 0;
while (index < endIndex) {
    // significant loop code here
    index++;
}

Am I right in thinking endIndex will be bounced to the scratchpad and back to the belt each time through the loop?
Would it make sense to have a single register to hold a loop condition value in order to prevent such bouncing?
In this code fragment “endIndex” is a literal and does not need to be spilled; the comparison will test against the literal directly. However, if “endIndex” is a computed value then it must be available for each iteration. In that case, “endIndex” will not be run out to scratchpad unless the EBB that forms the loop body is so long that it would fall off the end anyway if the EBB were straight line code rather than a loop. That is, the fact that it’s a loop doesn’t matter to the spilling.
Looping doesn’t matter because belt values carried over a control-flow join (such as the bottom of the loop) are kept on the belt; the inbound branches carry the values as arguments, and the join makes the belt congruent regardless of where control flow came from. This is a hardware version of the “phi” operation used in Static Single Assignment code representation. Irrelevant belt positions are discarded, and “endIndex” will start the next iteration at or near the front of the belt.
When live values (including loop-carried values, as here) drop off the belt before passing through a join, they are spilled to scratchpad and then filled for the next use. In the hardware, the technology used for the scratchpad is in fact the same as is used for a large register file; the only difference is how you get at the bits, and that save-restore across function calls is automatic. Consequently there is no power or latency improvement in using a register over using scratchpad, and adding a register for loops would needlessly complicate the compiler and the access logic.
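To make the spill/fill picture concrete, here is a toy C model of that behavior (not Mill code: the belt length, the drop/spill/fill helpers, and the scratchpad slot numbers are all invented for illustration):

#include <stdio.h>

#define BELT_LEN 8      /* hypothetical belt length for a small member */
#define SCRATCH_LEN 16  /* hypothetical scratchpad capacity */

/* Toy model: the belt is a FIFO of values. Each drop() pushes a new
   result onto the front and ages everything else by one position; the
   oldest value falls off the back. */
typedef struct {
    int belt[BELT_LEN];
    int scratch[SCRATCH_LEN];
} Machine;

static void drop(Machine *m, int value) {
    for (int i = BELT_LEN - 1; i > 0; i--)
        m->belt[i] = m->belt[i - 1];
    m->belt[0] = value;
}

static void spill(Machine *m, int pos, int slot) { m->scratch[slot] = m->belt[pos]; }
static void fill(Machine *m, int slot) { drop(m, m->scratch[slot]); }

int main(void) {
    Machine m = {0};
    drop(&m, 300);       /* a computed endIndex lands at the front */
    for (int i = 0; i < 6; i++)
        drop(&m, i);     /* loop-body results age it toward the back */
    spill(&m, 6, 0);     /* endIndex is now at position 6; save it */
    for (int i = 6; i < 20; i++)
        drop(&m, i);     /* it would have fallen off the belt here */
    fill(&m, 0);         /* restore it to the front for the compare */
    printf("front of belt: %d\n", m.belt[0]);  /* prints 300 */
    return 0;
}

The point of the model is only that the spill happens because the value would age off the belt; whether the code is a loop or straight-line makes no difference.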
For those cases where live values would normally drop off the belt, is there a keep operation to copy them to the front of the belt, or would endIndex += 0; compile to do that?
I’m thinking of, say, a renderer that tests ray intersections on 100,000s of polygons, which let’s say for argument’s sake is running on a mid-range Mill and can’t quite keep a live value on the belt between iterations of the loop.
Belt contents are immutable; the belt is strictly SSA. However, although you cannot move a value on the belt, you can create a new value that is the same as an existing value, and your example of adding zero would do that. OR with zero would likely consume marginally less power than add with zero, especially if the operand you want to save is a vector.
So there’s a better way: the rescue op. This does the same belt normalization that control flow operations do, except without actually transferring to a new address. Rescue comes in two forms. One specifies the belt positions to move to the front using an argument list. This form can arbitrarily reorder the rescued operands, and can also duplicate operands; in this it acts exactly like (and uses the same hardware as) the call and retn operations. However, the position-list argument uses one morsel of encoding space per position.
The other form of rescue specifies the desired operands using a bitmap of the positions, which are moved to the front in numeric order; dups are not possible. The underlying implementation and effect are exactly the same, but the bitmap is a more compact encoding if you rescue more than two operands.
Because rescue is done entirely in the name-mapping logic it has zero latency, just like the equivalent functionality in branches and calls. It also does not require a trip through an ALU, so it uses less power than e.g. adding zero, especially as the name-mapping is updated every cycle anyway. And although the belt effect is the same as that of a branch to the locally following instruction, there is no actual control flow, so the transfer-prediction hardware (with its power overhead and risk of a prediction or icache miss) doesn’t get involved.
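As a sketch of the belt effect of the bitmap form (a toy model, not Mill code; the belt length, the DEAD marker, and the function name are invented for illustration):

#include <stdint.h>
#include <stdio.h>

#define BELT_LEN 16   /* hypothetical member */
#define DEAD (-1)     /* stand-in for a discarded position */

/* Bitmap-form rescue as pure renaming: positions whose bit is set move
   to the front in numeric order (no reordering, no dups), and everything
   not rescued is discarded, just as at a control-flow join. */
static void rescue_bitmap(int belt[BELT_LEN], uint16_t map) {
    int out[BELT_LEN], n = 0;
    for (int pos = 0; pos < BELT_LEN; pos++)
        if (map & (1u << pos))
            out[n++] = belt[pos];
    while (n < BELT_LEN)
        out[n++] = DEAD;
    for (int i = 0; i < BELT_LEN; i++)
        belt[i] = out[i];
}

int main(void) {
    int belt[BELT_LEN] = {10, 11, 12, 13, 14, 15, 16, 17};
    rescue_bitmap(belt, 0x0091);  /* rescue positions 0, 4, and 7 */
    printf("%d %d %d\n", belt[0], belt[1], belt[2]);  /* 10 14 17 */
    return 0;
}

In hardware none of this copying actually happens; the same result falls out of updating the name mapping, which is why rescue is free.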
- Anonymous, October 15, 2016 at 5:38 pm (post count: 7)
How does this interact with decoding?
Way back in lecture #1, Ivan went on at some length about the importance of simple decoding for high IPC. The Mill doesn’t waste space like fixed-width VLIW, but each of its two instruction streams has decoding pipelined so that each pipeline stage presents the stream to an array of decoders. The unused slots are later nullified.
(Using the standard Mill nomenclature, “instructions” are executed one per cycle and are made up of a number of parallel “operations” which are similar to instructions on (super)scalar CPUs.)
The important thing is that operations are fixed size. It’s only their variable number that makes the number of instruction bits decoded in a given stage variable.
However, I keep hearing about these variadic call/retn/rescue/inner/leave operations, and now compressed versions of those based on bitmaps. Doesn’t that break fixed-size operations?
One thing I can think of is that if this is the only operation decoded in this stream this phase, then a variable-sized operand list would work. If there is no following operation to be decoded this phase, then it’s okay. Basically, the operation gets presented to an array of operand decoders, and the unused ones are nullified.
But lecture 6 (@28:00) clearly says that one instruction can contain multiple call operations.
How can this work?
Each instruction (really a half-instruction) on each side comprises some number of blocks, each of which carries a variable amount of fixed-size info. On the Mill there are three blocks per side, though the encoding mechanism can have any number. On each side, one block decodes in the initial cycle, just as you describe; thereafter two blocks can decode per cycle, because we decode the array of blocks from both ends; this is why an odd number of blocks makes the most sense.
On the exu side (where the usual two-in, N-out ops reside) the info in all three blocks is treated as operations. Consequently we get one block’s worth of ops at the end of D0, and two more at the end of D1. D2 is used to match the logical belt numbers from the ops to the physical data locations, and the crossbar routing happens during the transition between D2 and X0 (aka D3). One-cycle ops (ALU, typically) execute in X0, and their results are available to other ops in the following cycle. Thus the basic pipeline has four stages: three decode and one execute.
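A rough sketch of how the blocks might be located for the double-ended parse (the header layout here is entirely hypothetical; only the decode order, one block in D0 and two from opposite ends in D1, is taken from the description above):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a header gives the half-instruction's total byte
   length plus the lengths of blocks 0 and 2; block 1 is whatever remains
   in the middle. Block 0 decodes in D0; in D1, block 1 is parsed forward
   from where block 0 ended while block 2 is parsed backward from the end
   of the instruction, so the two decode in parallel. */
typedef struct {
    uint8_t total_len;
    uint8_t block0_len;
    uint8_t block2_len;
} Header;

typedef struct { const uint8_t *start; size_t len; int backward; } Block;

static void find_blocks(const uint8_t *insn, Header h, Block out[3]) {
    const uint8_t *body = insn + sizeof(Header);
    const uint8_t *end  = insn + h.total_len;
    size_t b1_len = (size_t)(end - body) - h.block0_len - h.block2_len;

    out[0] = (Block){ body,                h.block0_len, 0 };  /* D0 */
    out[1] = (Block){ body + h.block0_len, b1_len,       0 };  /* D1, forward */
    out[2] = (Block){ end - h.block2_len,  h.block2_len, 1 };  /* D1, backward */
}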
The flow side is slightly more complicated. Only the initial (D0) block contains operations; the remaining two blocks contain arguments to those operations. All flow ops contain two two-bit fields that specify how many arguments that op takes from the other two blocks. The flow decoder can pick out those fields at the start of D0, run them through what amounts to a priority encoder, and have control lines at the start of D1 to route the arguments from their block to the intended flow operation.
Of the two arg blocks, one is a simple vector of morsels. Each morsel is big enough to hold a belt operand number, so (depending on member) morsels are 3/4/5/6 bits long, and the two-bit field selects 0/1/2/3 of them. The morsel arguments are used for belt references, small literals and other incidental data. The other flow-side arg block is a byte array. The two-bit field in the flow op selects 0/1/2/4 bytes from this array.
Consequently each flow-side op has a main opcode (from the initial block), up to three argument morsels, and up to four bytes of immediate data. The opcode is available at the end of D0, the args at the end of D1, and D2 is used to map logical belt numbers to physical, just like on the exu side. In fact the two half-instructions become one during D2.
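The argument routing amounts to a pair of prefix sums over the two-bit count fields; a toy version in C (the struct names are invented, but the 0/1/2/3 morsel and 0/1/2/4 byte selections are as described above):

#include <stdint.h>

/* Each flow op's two 2-bit fields say how many morsels and how many
   literal bytes it consumes. A prefix sum over those counts gives each
   op its starting offset into the morsel vector and the byte array,
   which is all that is needed to route arguments at the start of D1. */
static const int byte_sel[4] = {0, 1, 2, 4};  /* 2-bit field -> byte count */

typedef struct { uint8_t morsels, bytes; } ArgCounts;  /* the two 2-bit fields */
typedef struct { int morsel_off, byte_off; } ArgRoute;

static void route_args(const ArgCounts ops[], int n, ArgRoute out[]) {
    int m = 0, b = 0;
    for (int i = 0; i < n; i++) {
        out[i].morsel_off = m;
        out[i].byte_off   = b;
        m += ops[i].morsels;          /* selects 0/1/2/3 morsels */
        b += byte_sel[ops[i].bytes];  /* selects 0/1/2/4 bytes */
    }
}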
Back to your question: where is the bitmap? In the literal argument, of course. And it only uses as many bytes as are actually needed. And there can be two or more ops using bitmaps, because the arg bytes that hold the bitmaps are not in the block that has the opcodes and the two-bit fields.
So the only part of Mill encoding that is actually variable length is the block, and each block’s content is fixed-length. We do have a linear parse at the block level, but not at the operation level. It’s as if we needed to parse two consecutive x86 instructions, and had two cycles to do it in, which is easy.
BTW, some flow ops can need more arg data than the four bytes plus three morsels that come with the slot – a call with a long arg list, for example, or a con(stant) op with a quad (128-bit) argument. For these we gang additional slots to the main slot, each adding four bytes plus three morsels to the available argument data. By ganging you lose a potential operation in that instruction (the gang slot has a unique opcode), but you can have all the data you want, up to the maximum that the block shifters handle. The shifters can handle the maximal single operation/argument combination, so all possible operations can be encoded, if necessary in one-op instructions. The specializer knows the instruction capacity, and won’t put more ops (with their arguments) into an instruction than will fit.
So if you have three slots that support the call op, a Mill instruction can contain three calls: each has the code branch offset in its immediate, and has morsels for up to three belt arguments to the call. If one call takes more arguments, that will reduce the number of calls you can have in the instruction, as the big call grabs slots for its additional arguments. In practice slot overflow for call is uncommon; most functions have three or fewer arguments. The big consumer of flow-side argument encoding is floating-point literals.
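For example, under the ganging rule above (the main slot carrying up to three argument morsels and each gang slot adding three more), the slot cost of a call with n belt arguments works out to roughly this illustrative arithmetic:

/* Slots consumed by a call with n belt arguments, assuming the main slot
   holds the branch offset plus up to three argument morsels and each
   gang slot contributes three more (illustrative arithmetic only). */
static int call_slots(int n_args) {
    if (n_args <= 3)
        return 1;                     /* fits in the main slot */
    return 1 + (n_args - 3 + 2) / 3;  /* one gang slot per three extra */
}
/* call_slots(3) == 1, call_slots(4) == 2, call_slots(9) == 3 */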
Each morsel is big enough to hold a belt operand number, so (depending on member) morsels are 3/4/5/6 bits long
A six-bit morsel suggests a Mill with a belt length of 64! (Thus wider than the Gold model.) Is such a design in the works? What’s the anticipated market, high-performance/scientific computing?
It does, doesn’t it?