The first talk on decoding instructions explains how instructions are split into two streams (Exucode and Flowcode) with each half-instruction (bundle) composed of a header and three blocks, for a total of six blocks per instruction. The most recent (sixth) talk on execution explains how these blocks are decoded and executed in phases. Together, they create a more complete picture of the Mill instruction set.
In the exucode stream, the three blocks correspond directly to the functional pipelines. (The counts below are for the “Gold” member of the Mill family.)
Block 1x: 8 read operations (fill from scratchpad*)
Block 2x: 4 integer or binary floating-point (including multiply) operations
and 4 integer-only (no multiply) operations
Block 3x: 4 pick** operations
and 5 write operations (spill to scratchpad*)
* The only exucode-side read/write operations that we currently know of are the fill and spill scratchpad ops.
** The pickPhase is described as falling between phase 2 and phase 3, but there are several reasons to include pick ops in the third block. For one, the pick logic resides in the crossbar that the writePhase ops use to select their inputs. Also, the maximum number of operations in one block is said to be nine, which matches 4 pick plus 5 write operations.
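As a quick sanity check on the counts above, here is a small Python tally. It is only a sketch: the dictionary layout and the names (EXUCODE_BLOCKS, slots_per_block) are mine, and putting pick into block 3x is the assumption argued for in the second footnote, not something the talks state outright.

```python
# Slot counts per exucode block, as listed above for the "Gold" member.
# Assumption: pick ops are grouped with write ops in block 3x.
EXUCODE_BLOCKS = {
    "1x": {"read": 8},
    "2x": {"int_or_float": 4, "int_only": 4},
    "3x": {"pick": 4, "write": 5},
}


def slots_per_block(blocks):
    """Return the total number of operation slots in each block."""
    return {name: sum(kinds.values()) for name, kinds in blocks.items()}


per_block = slots_per_block(EXUCODE_BLOCKS)
total_exu = sum(per_block.values())        # all exucode pipelines
largest_block = max(per_block.values())    # biggest single block

print(per_block)       # {'1x': 8, '2x': 8, '3x': 9}
print(total_exu)       # 25
print(largest_block)   # 9
```

With pick counted in the third block, the total comes to 25 pipelines and the largest block holds nine operations, which is consistent with both figures quoted from the talks.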
The flowcode stream is quite different, with only 8 dedicated pipelines compared to 25 for the exucode stream. The reason is that flowcode operations are much larger because of their in-line address and constant operands, so the code size (and cache size) of the two streams balances out. Another difference is that each flowcode pipeline includes functional hardware for multiple phases:
4 pipes that can do either immediate constant, load/store or control transfer (branch, call) operations
4 pipes that can do either immediate constant or load/store operations
Suppose you have an instruction with 8 stores (writePhase), followed by an instruction with 8 loads (opPhase), followed by an instruction with 8 constants (readPhase). Because of the phasing, all 24 operations would hit the 8 pipelines during the last cycle. Would the separate functional units handle it easily, or are there constraints on the mix of flowcode ops to balance out the phasing?
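To make the arithmetic behind the question concrete, here is a rough Python sketch. The phase offsets (readPhase in the issue cycle, opPhase one cycle later, writePhase two cycles later) are my simplifying assumption, as are the names PHASE_OFFSET and ops_per_cycle; this is not a model of the real hardware, and it says nothing about whether the hardware can absorb the load.

```python
from collections import Counter

# Assumed cycle offset of each phase relative to the issue cycle.
PHASE_OFFSET = {"readPhase": 0, "opPhase": 1, "writePhase": 2}

# (issue cycle, phase of the ops, number of ops) for three
# back-to-back flowcode half-instructions.
instructions = [
    (0, "writePhase", 8),  # 8 stores
    (1, "opPhase", 8),     # 8 loads
    (2, "readPhase", 8),   # 8 constants
]


def ops_per_cycle(instrs, offsets):
    """Count how many operations become active in each cycle."""
    load = Counter()
    for issue_cycle, phase, count in instrs:
        load[issue_cycle + offsets[phase]] += count
    return dict(load)


demand = ops_per_cycle(instructions, PHASE_OFFSET)
print(demand)  # {2: 24}
```

Under these assumptions all 24 operations land in cycle 2, the cycle in which the third instruction issues, so each of the 8 pipelines would see three operations at once, one per phase.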