Mill Computing, Inc. › Forums › The Mill › Architecture › Control flow divergence and The Belt
I’m curious how an unbalanced fork/join control flow between hyperblocks would work with The Belt.
In the diagram above, EBB-A can jump to EBB-B (which then jumps to EBB-C) or directly to EBB-C. If EBB-B produces results, they will interfere with the temporal addressing within EBB-C if EBB-C's code was generated assuming a direct path from EBB-A to EBB-C.
What is the general solution to this problem?
Thanks!
The Mill has two forms of Branch-like operations: those with carried belt arguments and those without. Either can have additional belt arguments for a predicate and/or label pointer. Branches without carried arguments do not modify the belt; the target sees the same belt that the branch did. Branches with carried arguments replace the belt contents with the argument list, essentially the same way a call does.
In your example, the branch from EBB-A to EBB-B is likely to not have carried arguments; EBB-B will start with EBB-A’s belt as of the branch. In contrast, both branches to EBB-C will have carried arguments conveying the values needed by EBB-C. The inbound carried argument lists will have the same number of entries (possibly zero, up to the belt size) and the entries will have matching widths, so EBB-C will start with a congruent belt regardless of which path was taken.
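As a rough illustration of that congruence requirement, here is a small Python sketch (not Mill toolchain code; the class and method names are invented for this example) of how a carrying branch rebuilds the belt so that EBB-C sees the same layout from either predecessor:

class Belt:
    def __init__(self, size=16):
        self.size = size
        self.items = []          # front of the list is b0 (most recent drop)

    def drop(self, value):
        # A new result drops at b0; older operands move toward fall-off.
        self.items.insert(0, value)
        del self.items[self.size:]

    def branch_carrying(self, positions):
        # A branch with carried args replaces the belt contents with the
        # listed operands, in list order, much as a call builds the callee belt.
        self.items = [self.items[p] for p in positions]

# EBB-A drops x then y: the belt is [y, x] (y at b0, x at b1).
a = Belt(); a.drop("x"); a.drop("y")

# Path A->C: carry b0 and b1, so EBB-C starts with y at b0 and x at b1.
ac = Belt(); ac.items = list(a.items)
ac.branch_carrying([0, 1])

# Path A->B->C: A->B carries nothing, EBB-B drops z, then B->C carries b1, b2.
abc = Belt(); abc.items = list(a.items)
abc.drop("z")                  # belt is now [z, y, x]
abc.branch_carrying([1, 2])    # carry y and x forward to EBB-C

assert ac.items == abc.items == ["y", "x"]   # congruent belt either way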
In addition, when a branch is taken there may be a functional unit at work on a long-latency operation that has not yet retired. This is not a problem for branches without carried args; EBB-A can start an operation that will retire in EBB-B, because the belt in EBB-B is unambiguous both before and after the retire. In principle, the same could be done over branches that have carried args, but only if every source of an inbound branch arranged to have its long-latency operation retire at the same point in the target. Such a coincidence is extremely far-fetched and the specializer does not attempt to use it. Instead, it forces all operations (except loads) to have retired before any branch with carried args.
Loads are the exception, and the Mill has the special Pickup Load form to deal with loads that retire across arg-carrying branches. A pickup load operation carries a tag and does not retire normally. Instead, sometime later the code can issue a Pickup operation with the same tag and get the retire dropped to the belt. The Load and its Pickup can be in different ebbs; the only requirement is that each Load must eventually be picked up, and each pickup must have a previous load with the same tag.
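A minimal sketch of the pickup pairing, again in illustrative Python rather than conAsm (the tag value and helper names are made up for this example):

# Pending retires are held by tag instead of dropping when the load completes.
pending = {}

def load_pickup(tag, address, memory):
    # Issue a pickup-form load: the retire is held under the tag, not dropped.
    pending[tag] = memory[address]

def pickup(tag, belt):
    # A later pickup with the same tag drops the held retire onto the belt;
    # the load and its pickup may be in different ebbs.
    belt.insert(0, pending.pop(tag))

memory = {0x100: 42}
belt = []
load_pickup(tag=7, address=0x100, memory=memory)   # issued in one ebb...
pickup(7, belt)                                    # ...picked up later
assert belt[0] == 42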
Well, in the example above I was thinking that EBB-B could modify a variable that is produced in EBB-A and consumed in EBB-C. I wasn’t aware of the branches with arguments. Could you point me to an example of this in the ISA?
If we imagine EBB-A (and predecessors) producing many results and EBB-C (and successors) consuming said results, the branch from EBB-A to EBB-B may have to pass tens (possibly hundreds) of arguments in order to keep The Belt coherent. How can The Mill’s ISA encode more than three or four without running out of bits?
The ISA documentation on the site is woefully out of date and we have no volunteer to resume the work on it; please accept my apologies. Here’s a conAsm fragment (from the printf() function as it happens) that shows carrying branches in use:
eql(b4 %46, 108) %435 ^0, eql(b4 %46, 104) %434 ^0, brtrs(b0 %434, "printf$17_75", b5 %51) ^0, brfls(b1 %435, "printf$17_9", b8 %48/460, b10 %50/461) ^0, brs("printf$17_75", b2 %399) ^0, rd(w(128)) %399 ^263;
You see that the two branches to ebb $17_75 are carrying two different arguments, %51 for one and %399 for the other.
Ebbs don’t really “produce results”, except in the sense that they contain store operations that update memory. I think you mean more a notion of liveness or in-flight operands, those produced by an operation but that have not yet been consumed by all consumers. Live operands are initially created on the belt, and are consumed from the belt. If at any point there are more simultaneously live operands than fit in the fixed capacity of the belt (16 positions on a Silver, from which the above fragment was drawn) then the code must spill the excess, initially to scratchpad and ultimately, if scratch itself fills up, to memory. The specializer uses heuristics to decide which operands to spill. Spill and fill are a fact of life for any machine because there are always codes that exceed local resource constraints, although the compiler algorithms to do spilling are very different on a belt machine than on a general register machine.
Spilling is also done in a few other situations. For example, if an operand is created in one ebb and not consumed until much later after a long chain of branches then it is better code to spill it at point of production and fill it at point of consumption than to pass it along across many value-carrying branches.
Thanks for the extra information. I’m still a little uncertain of some details (apologies because I’m woefully ignorant about many facets of your ISA).
brtrs(b0 %434, "printf$17_75", b5 %51) ^0,
brs("printf$17_75", b2 %399) ^0,
For these forms of branch instructions, what happens to the last argument, i.e. the belt positions? It seems like the values of b5/b2 are moved/copied to b0 before the branch is retired. Is this correct?
If so, I'm still a little puzzled by the example in the original diagram. Let's say that EBB-A produces exactly 10 values, pushing them onto The Belt at positions 0-9, and later EBB-C consumes all of these values. If EBB-B produces exactly one value and pushes it onto The Belt at position 0, then the 10 values produced in EBB-A are subsequently moved to positions 1-10. The question is: how does EBB-C know where to find the values it needs? Are they in positions 0-9 or 1-10? This seems very complicated without spatial addressing. Could you perhaps elaborate a little on how you would resolve this particular scenario?
Simplifying to two items.
Say A produces two items, which are at b2 and b5 when A->B happens and at b4/b7 when A->C happens. A->B does not carry, so B inherits A's belt; hence at B those items are still at b2/b5. B adds an item, so the items are now at b3/b6 when B->C happens.
A->C carries, mapping b4 and b7. Let us say b4 maps to b0 and b7 to b1 (it could have been the other way around; source order is not significant).
B-C carries, mapping b3 to b0 and b6 to b1.
C receives the items in b0 and b1 regardless of which path was taken. In this case the items themselves are the same regardless of path, although B could have passed different items instead of just relaying what it got from A.
A carrying branch acts like half a call, except there’s no new frame and no return.
The passing branches are the same regardless of the number of operands passed; if there are more arguments the list is simply longer: brs("foo", b4, b6) with two, brs("foo", b4, b6, b2, b2, b3, b8, b0, b9, b3, b11) with ten.
All inbound arcs must carry the same number of operands, so if B passes 10 to C then A must also. Consequently C sees the same belt structure regardless of how or from where it received control.
Perhaps you are thinking that C will only need so many from B and fewer from A, and can tell the difference based on the values of the common arguments. Such a situation would be analogous to a VARARGS function call. This notion does not exist in common languages and is not directly supported by the Mill, although in principle the specializer could supply dummy arguments for A->C to fill out the branch.
What would brs("foo", b4, b6, b2, b2, b3, b8, b0, b9, b3, b11) look like in binary form though? If I understood your video correctly, one Mill VLIW instruction contains three variable-sized blocks; however, inside each block every operation is fixed length. How can the brs operation be fixed size if it needs to encode anywhere from zero to N belt positions as parameters?
Yes, each side has three variable-sized blocks, and within each block the element sizes are statically fixed (though not necessarily the same across blocks). On the exu side each block has a different op set, with the sets partitioned by phase, and each operation is complete in its set.
The flow side is different: only the flowBlock has operation elements like on the exu side, while in the con block the elements are bytes and in the ext block the elements are morsels big enough to hold a belt position number. Those flow ops that take an argument list (besides the branches there are the calls and various special ops including rescue, retire, and con) encode as much as they can in the flow block entry and put the rest of the list in the con and ext blocks. Each flow block entry has two bits to indicate how much has been put in the other blocks: 0/1/2/4 bytes and 0/1/2/3 morsels.
If the list is long enough that it doesn’t fit in an entry plus 4 bytes and 3 morsels then the op is ganged with a flowArgs pseudo op in the next slot, which provides another 4 bytes and three morsels. Rinse repeat. The maximal sizes of the blocks (for each family member) are set so that it is possible to encode a belt list with entries for every position, or a con with enough bits to express the largest possible operand. The constraints of the ext and con blocks are recognized by the specializer, so that (for example) it may encode two calls with few arguments in a single instruction, but only one call with a long argument list even though there are two slots supporting call.
I think I understand better now. One op can address a maximum of three belt operands (via the morsels) and this can be chained indefinitely using the special flowArgs op. So in the case of 10 belt operands being passed to EBB-C, the encoding would require one branch op w/ three morsels followed by three flowArg ops w/ three, three and one morsels?
Could you provide an intuition for what the logic of the decoder looks like? I presume the branch pipeline's decoder would require logic to buffer N operands (where N is the length of The Belt)? Also, how does this affect branch latency? A single-cycle branch op with N operands would have to rename N belt entries in parallel, correct? Perhaps you anticipate doing this over several cycles; however, I imagine this might hamper ILP, since it would be complicated to produce values onto The Belt while simultaneously renaming said entries?
Any further insights would be greatly appreciated.
Close.
For polyadic arguments the arg list uses the bits of both the bytes in the con block and the morsels in the ext block, interleaved by flow slot. The decoder has a compositing buffer holding all these bits for all slots. For a four-bit-morsel machine, each slot provides 44 bits (4 bytes + three morsels), so with 7 slots (typical of a mid-range Mill) the compositor has 308 bits. As the maximal arg list for such a member has 16 args of four bits each, the maximal list needs 64 bits and the buffer can hold almost five such lists. More to the point, two slots provide 88 bits, whereas the maximal list needs 64, so any arg list will need at most two slots, one with the main op and one with a flowArgs op for the extras.
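To make that arithmetic concrete, here is a back-of-the-envelope calculation (illustrative Python, not part of any Mill tool) using the figures above for a 4-bit-morsel member; it treats every slot as supplying the full 4 bytes plus 3 morsels and ignores whatever part of the list fits in the flow entry itself, so it is a rough bound rather than the exact specializer packing:

import math

def slots_for_arg_list(n_args, morsel_bits=4, con_bytes=4, ext_morsels=3):
    # Bits needed for the belt-number list versus bits one slot can supply.
    bits_needed = n_args * morsel_bits
    bits_per_slot = con_bytes * 8 + ext_morsels * morsel_bits   # 44 bits here
    return max(1, math.ceil(bits_needed / bits_per_slot))

print(slots_for_arg_list(10))   # 1: 40 bits fit within one slot's 44
print(slots_for_arg_list(16))   # 2: 64 bits need the op plus one flowArgs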
Actually, composition demand is greatest for con ops (literals). If our member has 128-bit (quad) operands then the 308 compositing bits can hold two quad literals, each literal using three slots with a slot left over for lagniappe.
In operation, the decoder selects the pair of two-bit fields for each slot right out of the flow block in the instruction buffer. That can be done without even knowing how many flow slots the instruction uses; the decoder will just select garbage bits beyond the end. The selected pairs go through a priority decode to set up muxes to select which bytes/morsels in the con/ext blocks belong to each slot. That decode/select takes a cycle, during which the con and ext blocks are being isolated by the instruction-level shifters. The following cycle the previously set up muxes route bits from the isolated con/ext blocks to the correct position in the compositing buffer, zero-filling as necessary. At this point we have the full 308 bits as one bit-array; any garbage bits are at the end of that array and are ignored because they belong to slots that the instruction doesn’t use.
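A simplified software analogue of that priority decode (illustrative Python; the real hardware sets up muxes in parallel rather than looping, and the mapping of field values to the 0/1/2/4-byte and 0/1/2/3-morsel counts is my assumption) computes, from each slot's two-bit count fields, where that slot's bytes and morsels begin within the isolated con and ext blocks:

BYTE_COUNT = {0: 0, 1: 1, 2: 2, 3: 4}     # two-bit field -> con bytes taken
MORSEL_COUNT = {0: 0, 1: 1, 2: 2, 3: 3}   # two-bit field -> ext morsels taken

def slot_offsets(byte_fields, morsel_fields):
    # Prefix sums: each slot's starting byte in con and starting morsel in ext.
    byte_off, morsel_off = [], []
    b = m = 0
    for bf, mf in zip(byte_fields, morsel_fields):
        byte_off.append(b)
        morsel_off.append(m)
        b += BYTE_COUNT[bf]
        m += MORSEL_COUNT[mf]
    return byte_off, morsel_off

# Three flow slots: the first consumes 4 bytes + 3 morsels, the second nothing,
# the third 1 byte + 2 morsels.
print(slot_offsets([3, 0, 1], [3, 0, 2]))   # ([0, 4, 4], [0, 3, 3])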
The compositing buffer has a bunch of selector muxes to peel off particular fields that would be of interest to one or another op. Starting at each 44-bit slot boundary, there's a selector that grabs a four-byte word that had come from the con block. Each of the three morsels that came from the ext block has its own selector. There is a selector that starts at the slot and extends over the bits of adjacent slots to provide big con literals, and another that grabs big arg lists. Cons are right-aligned on slot boundaries in the compositor, so leading zeroes happen; there's a "complement" bit in the main opcode that causes the selected literal to be ones'-complemented during selection so we can express negative literals (note: the encoding uses ones' complement literals so we don't need an adder in the decoder). Arg lists are left-aligned on slot boundaries in the compositor, so garbage bits are at the end; we know from other arguments in the operation how long the list is, so the decoder actually gives the FU a single maximal list (64 bits) plus a one-hot mask (16 bits) that says which list position is valid.
All the various selecting for all the slots takes place in parallel in the second decode cycle. During that same cycle we have decoded the main opcode (from the flow block that was isolated in the first cycle) enough to know which of the selected fields are meaningful for this particular op, so only the meaningful fields go to the FUs as operation arguments. Many of the FU arguments are belt numbers, either singly or as lists. Belt remapping is done by the flow and exu sides together, because both sides are evaluating or dropping operands concurrently. This takes place in the third decode cycle, a cycle ahead of use. That is, the remapper tracks what will be happening on the belt in the following cycle. Bulk rename (br/call/retn/rescue) arg lists are actually easier than simple belt eval/drop (add, LEA, etc.) because they don't happen until after still another instruction boundary. The tightest timing is for con, which drops a cycle ahead of everything else. To make that work at high clock the belt addresses the literal directly inside the compositing buffer without moving it someplace else. Of course, the compositor will be reused for a different instruction the following cycle, so the literal does get moved out too, but there's a whole cycle for that.
Whew! Well, you wanted to know.
Belt remapping is done by the flow and exu sides together, because both sides are evaluating or dropping operands concurrently. This takes place in the third decode cycle, a cycle ahead of use. That is, the remapper tracks what will be happening on the belt in the following cycle. Bulk rename (br/call/retn/rescue) arg lists are actually easier than simple belt eval/drop (add, LEA, etc.) because they don't happen until after still another instruction boundary.
Could you expand on this a little more? Let's take your example implementation with sixteen Belt positions. If a passing branch is issued with all sixteen Belt positions, what would the logic look like to rename those positions within a single cycle?
If I understood your video correctly, The Belt is implemented as a dispersed set of registers, each with an associated tag indicating which Belt position it currently represents. To address a particular Belt position, it's first necessary to perform an associative lookup to determine which physical register is mapped to the specified logical Belt position. It follows that the microarchitecture must be capable of performing sixteen associative lookups in parallel. It would then be necessary to update all sixteen tags in parallel with ascending identifiers in order to rearrange The Belt before the branch is completed. Is this correct?
Is the idea to do these sixteen lookup/updates in parallel within one cycle? That sounds like a fairly complex crossbar with difficult timing constraints. Have you modelled this in RTL? Have you considered how this would scale beyond sixteen registers?
Thanks in advance.
There are several possible implementations for the remapper. I’m not a hardware guy, but I’m told that there’s no one-size-fits-all; some approaches work better for short belts but don’t scale to very long belts, which use a different approach, and where the implementation cutover(s) are is also process and clock dependent.
One representative approach is to attach a physical number to each operand when created using a simple take-a-number. Each consumer knows what physical # it wants, and broadcasts that to all the sources; the source with an operand with matching physnum responds with the operand. This is a straightforward MxN crossbar, and comprises a large part of the area and power on the larger members. A conventional genreg machine has a similar crossbar running from all registers to all FUs. I’m told that there are well known ways to use hardware cleverness to reduce the cost of such crossbars, but they are not well known to me.
The crossbar works in physnums; it is the job of the remapper to convert from belt position numbers (beltnums) to physnums. The mapper (for an example 16-position belt) is a circular array of 4-bit adders, where each adder holds the physnum corresponding to the beltnum index in the array. Translation is just indexing into the array based off the current origin. Advancing the belt is just bumping the origin with wraparound. Dropping new physnums is taking the operation drop count and the statically defined drop order to get a zero-origin sequence of dropping physnums, which is biased by the origin and used to replace the content of the affected adders. Bulk remap (including branch) just builds a new zero-origin drop list by using the beltnums from the op to index the adders (this is just like indexing for arguments of any op) and then dropping the whole thing normally. There is lots of time in the cycle to do a 4-bit index for a 4-bit result and a couple of 4-bit adds in time for the physnums to be available for broadcast.
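As a toy software model of that scheme (illustrative Python only; as the post notes, real implementations vary by member and this collapses the adder details into plain indexing), the remapper can be pictured as a circular array indexed from a moving origin:

class Remapper:
    def __init__(self, belt_size=16):
        self.belt_size = belt_size
        self.phys = [0] * belt_size   # physnum stored at each array slot
        self.origin = 0               # array index currently acting as b0
        self.next_phys = 0            # take-a-number counter for new drops

    def lookup(self, beltnum):
        # Translate a belt position into the physnum the crossbar broadcasts.
        return self.phys[(self.origin + beltnum) % self.belt_size]

    def drop_new(self, count):
        # New results drop in front: back the origin up and write freshly
        # issued physnums into the vacated slots.
        for _ in range(count):
            self.origin = (self.origin - 1) % self.belt_size
            self.phys[self.origin] = self.next_phys
            self.next_phys += 1

    def bulk_remap(self, beltnums):
        # Carrying branch / call / retn / rescue: look up the listed belt
        # positions, then drop that whole list so its first entry lands at b0.
        carried = [self.lookup(b) for b in beltnums]
        for p in reversed(carried):
            self.origin = (self.origin - 1) % self.belt_size
            self.phys[self.origin] = p

r = Remapper()
r.drop_new(2)                              # b0 holds physnum 1, b1 holds 0
assert (r.lookup(0), r.lookup(1)) == (1, 0)
r.bulk_remap([1, 0])                       # carry b1 then b0 across a branch
assert (r.lookup(0), r.lookup(1)) == (0, 1)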
An alternative, useful for the larger members, does not go from beltnum to physnum to broadcast but instead maps directly from beltnum to residence number. Each possible location that can hold an operand gets a resnum; there are a lot of locations on the bigger members, so resnums are 6 or 7 bits. The mapper works the same way, but the adders respond to the index step with a resnum, not a physnum, and the drops are resnums not physnums too. The resnums directly address the datapaths feeding the crossbar, so there's no broadcast, although the crossbar itself remains. However, when an operand is moved from one residence location to another the mapper adder content must be updated with the new resnum; this is in effect a drop. In the broadcast approach the physnum of an operand never changes, so operands can be moved without updating the mapper.
Or so I’m told. It’s all transparent to the compiler level that I work at.