Reply To: Control flow divergence and The Belt

Ivan Godard


For polyadic arguments the arg list uses the bits in both the bytes in the con block and the morsels in the ext block, interleaved by flow slot. The decoder has a compositing buffer holding all these bits for all slots. For a machine with four-bit morsels, each slot provides 44 bits (four bytes + three morsels), so with 7 slots (typical of a mid-range Mill) the compositor has 308 bits. As the maximal arg list for such a member has 16 args of four bits, the maximal list needs 64 bits and the buffer can hold almost five such lists. More to the point, two slots provide 88 bits, whereas the maximal list needs 64, so any arg list will need at most two slots, one with the main op and one with a flowArgs op for the extras.

Actually, composition demand is greatest for con ops (literals). If our member has 128-bit (quad) operands then the 308 compositing bits can hold two quad literals, each literal using three slots with a slot left over for lagniappe.
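The arithmetic in the two paragraphs above can be checked with a few lines of Python. This is only a sanity check of the numbers quoted in the post; the constant and function names are made up for illustration and are not Mill terminology.

```python
# Sizes quoted in the post for a four-bit-morsel, mid-range member.
# All names here are illustrative, not taken from any Mill specification.
MORSEL_BITS = 4
CON_BYTES_PER_SLOT = 4      # bytes a slot can take from the con block
EXT_MORSELS_PER_SLOT = 3    # morsels a slot can take from the ext block
FLOW_SLOTS = 7
MAX_ARGS = 16
QUAD_BITS = 128


def slot_bits():
    """Bits one flow slot contributes to the compositing buffer."""
    return CON_BYTES_PER_SLOT * 8 + EXT_MORSELS_PER_SLOT * MORSEL_BITS


def compositor_bits():
    """Total width of the compositing buffer."""
    return FLOW_SLOTS * slot_bits()


def slots_needed(payload_bits):
    """Whole slots needed to carry a payload (ceiling division)."""
    return -(-payload_bits // slot_bits())


max_list_bits = MAX_ARGS * MORSEL_BITS
quad_slots = slots_needed(QUAD_BITS)

print("bits per slot:", slot_bits())
print("compositor bits:", compositor_bits())
print("maximal arg list bits:", max_list_bits)
print("slots for a maximal arg list:", slots_needed(max_list_bits))
print("slots per quad literal:", quad_slots)
print("quad literals per instruction:", FLOW_SLOTS // quad_slots)
print("slots left over:", FLOW_SLOTS % quad_slots)
```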

In operation, the decoder selects the pair of two-bit fields for each slot right out of the flow block in the instruction buffer. That can be done without even knowing how many flow slots the instruction uses; the decoder will just select garbage bits beyond the end. The selected pairs go through a priority decode to set up muxes to select which bytes/morsels in the con/ext blocks belong to each slot. That decode/select takes a cycle, during which the con and ext blocks are being isolated by the instruction-level shifters. In the following cycle the previously set-up muxes route bits from the isolated con/ext blocks to the correct position in the compositing buffer, zero-filling as necessary. At this point we have the full 308 bits as one bit-array; any garbage bits are at the end of that array and are ignored because they belong to slots that the instruction doesn’t use.
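As a rough software analogy of the two steps just described, here is a minimal sketch. It assumes that one two-bit field per slot encodes how many con bytes the slot takes and the other how many ext morsels; the code-to-count tables, the function name `route` and the padding position are all hypothetical, since the post does not give the actual encoding. The running sums stand in for the priority decode, and the slicing stands in for the muxes.

```python
from itertools import accumulate

# Hypothetical encodings: the post only says each slot has a pair of
# two-bit fields. How a two-bit code maps to a count is an assumption.
CON_BYTES_FOR_CODE = (0, 1, 2, 4)
EXT_MORSELS_FOR_CODE = (0, 1, 2, 3)


def route(fields, con_block, ext_block):
    """Distribute con bytes and ext morsels to flow slots.

    fields    -- list of (con_code, ext_code) pairs, one per slot
    con_block -- bytes object holding the con block
    ext_block -- list of four-bit ints holding the ext block
    Returns one (con_bytes, ext_morsels) pair per slot, zero-filled
    to four bytes and three morsels.
    """
    con_counts = [CON_BYTES_FOR_CODE[c] for c, _ in fields]
    ext_counts = [EXT_MORSELS_FOR_CODE[e] for _, e in fields]

    # "Cycle 1": running sums give each slot its starting offset in the
    # con and ext blocks. This stands in for the priority decode.
    con_starts = [0] + list(accumulate(con_counts))[:-1]
    ext_starts = [0] + list(accumulate(ext_counts))[:-1]

    # "Cycle 2": use the offsets to pull each slot's bytes and morsels,
    # zero-filling whatever the slot does not use.
    slots = []
    for i in range(len(fields)):
        con = list(con_block[con_starts[i]:con_starts[i] + con_counts[i]])
        ext = list(ext_block[ext_starts[i]:ext_starts[i] + ext_counts[i]])
        con += [0] * (4 - len(con))
        ext += [0] * (3 - len(ext))
        slots.append((con, ext))
    return slots


example = route(
    [(3, 1), (1, 0), (0, 2)],
    bytes([0x11, 0x22, 0x33, 0x44, 0x55]),
    [0xA, 0xB, 0xC],
)
for con, ext in example:
    print([hex(b) for b in con], [hex(m) for m in ext])
```

In hardware every slot is handled at once; the loop here is purely a convenience of the software model.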

The compositing buffer has a bunch of selector muxes to peel off particular fields that would be of interest to one or another op. Starting at each 44-bit slot boundary, there’s a selector that grabs a four-byte word that had come from the con block. Each of the three morsels that came from the ext block has its own selector. There’s a selector that starts at the slot and extends over the bits of adjacent slots to provide big con literals, and another that grabs big arg lists. Cons are right-aligned on slot boundaries in the compositor, so leading zeroes happen; there’s a “complement” bit in the main opcode that causes the selected literal to be ones’ complemented during selection so we can express negative literals (note: the encoding uses ones’ complement literals so we don’t need an adder in the decoder). Arg lists are left-aligned on slot boundaries in the compositor so garbage bits are at the end; we know from other arguments in the operation how long the list is, so the decoder actually gives the FU a single maximal list (64 bits) plus a one-hot mask (16 bits) that says which list position is valid.
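The complement trick and the arg-list mask can be illustrated with a short sketch. It assumes the selected literal is afterwards read as an ordinary two's-complement value, that the first arg sits in the most significant morsel of the 64-bit list, and that the mask has one flag per list position; none of these details are spelled out in the post, and all function names are invented for the example.

```python
def encode_literal(value):
    """Encoder side: return (stored_bits, complement_flag).

    A negative value is stored as its bitwise inverse, which is a small
    non-negative number with leading zeroes.
    """
    if value < 0:
        return ~value, True
    return value, False


def select_literal(stored_bits, width, complement):
    """Decoder side: zero-extend to `width` bits, then optionally invert.

    Only an XOR is needed; there is no carry chain, hence no adder.
    """
    mask = (1 << width) - 1
    raw = stored_bits & mask
    return raw ^ mask if complement else raw


def as_signed(pattern, width):
    """Read a bit pattern as a two's-complement integer (assumption)."""
    return pattern - (1 << width) if pattern >> (width - 1) else pattern


def arg_list(list_bits, count, morsel_bits=4, max_args=16):
    """Split a left-aligned maximal list into args plus validity flags."""
    field = (1 << morsel_bits) - 1
    args = [
        (list_bits >> (morsel_bits * (max_args - 1 - i))) & field
        for i in range(max_args)
    ]
    valid = [i < count for i in range(max_args)]
    return args, valid


stored, flag = encode_literal(-5)
pattern = select_literal(stored, 32, flag)
print(stored, flag, hex(pattern), as_signed(pattern, 32))

args, valid = arg_list(0x3A7 << (64 - 12), count=3)
print(args[:4], valid[:4])
```

Storing -5 as 4 plus the complement bit keeps the encoded literal short, which is the point of right-aligning cons and accepting leading zeroes.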

All the various selecting for all the slots takes place in parallel in the second decode cycle. During that same cycle we have decoded the main opcode (from the flow block that was isolated in the first cycle) enough to know which of the selected fields are meaningful for this particular op, so only the meaningful fields go to the FUs as operation arguments. Many of the FU arguments are belt numbers, either singly or as lists. Belt remapping is done by the flow and exu sides together, because both sides are evaluating or dropping operands concurrently. This takes place in the third decode cycle, a cycle ahead of use. That is, the remapper tracks what will be happening on the belt in the following cycle. Bulk rename (br/call/retn/rescue) arg lists are actually easier than simple belt eval/drop (add, LEA, etc.) because they don’t happen until after still another instruction boundary. The tightest timing is for con, which drops a cycle ahead of everything else. To make that work at high clock the belt addresses the literal directly inside the compositing buffer without moving it someplace else. Of course, the compositor will be reused for a different instruction the following cycle, so the literal does get moved out too, but there’s a whole cycle for that.
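For readers unfamiliar with what "belt remapping" means at its simplest, here is a toy model. It is not the Mill's remapper, which per the above has to combine drops from both sides and work a cycle ahead of use. It only shows the basic idea that a belt number is a position relative to the most recent drop, so a drop renumbers every operand without moving any data. The class name, the belt length of 8 and the ring-buffer layout are all assumptions made for the sketch.

```python
class ToyBelt:
    """Fixed-length belt modelled as a ring buffer with a moving front.

    Position 0 is always the most recent drop. Dropping a value moves
    the front pointer rather than shifting the stored operands.
    """

    def __init__(self, length=8):  # length is an arbitrary choice here
        self.slots = [None] * length
        self.front = 0

    def drop(self, value):
        # Step the front back one place; the oldest operand is overwritten,
        # which models it falling off the end of the belt.
        self.front = (self.front - 1) % len(self.slots)
        self.slots[self.front] = value

    def read(self, position):
        # Remap a belt number to a physical slot index.
        return self.slots[(self.front + position) % len(self.slots)]


belt = ToyBelt()
for v in (10, 20, 30):
    belt.drop(v)
print(belt.read(0), belt.read(1), belt.read(2))
```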

Whew! Well, you wanted to know.