Diagrams are hard 🙂 I’ll try an explanation.
The phase of an operation (or the phase of a sub-action within an operation that is multi-phase, like store) is determined by explicit specification. The phases are reader, op, call, pick, and writer, and are evaluated/executed in that order; there are a couple more in the sim code for ease of implementation, but these don’t really exist in the hardware. There may be multiple instances of callPhase, one for each call in the instruction.
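To make the ordering concrete, here is a minimal Python sketch, not anything from the sim: the `Phase` enum and `execution_order` helper are made-up names, and the sample instruction is just an illustration. The op-to-phase assignments for con, add, call and conform follow the text; putting a pick op in pickPhase is my assumption.

```python
from enum import IntEnum

class Phase(IntEnum):
    # The five hardware phases, in the order they are evaluated.
    READER = 0
    OP = 1
    CALL = 2
    PICK = 3
    WRITER = 4

def execution_order(instruction):
    """Return (name, phase) pairs sorted by phase.

    sorted() is stable, so ops in the same phase keep their encoded order.
    """
    return sorted(instruction, key=lambda op: op[1])

# A hypothetical instruction, listed in no particular order.
instruction = [
    ("conform", Phase.WRITER),
    ("call F", Phase.CALL),
    ("add", Phase.OP),
    ("con", Phase.READER),
    ("pick", Phase.PICK),  # assumed to be pickPhase
]

order = [name for name, _ in execution_order(instruction)]
```

Here `order` comes out as con, add, call F, pick, conform, regardless of how the ops were laid out in the instruction.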
All ops must be decoded before they can be executed, and on the Mill decode takes more than one cycle, with some blocks in the instruction completing decode before others do. Consequently, it is convenient to place ops that must execute in an earlier phase in blocks that are decoded earlier too. So readerPhase ops are placed in the block that decodes first. WriterPhase ops could go in any block, but in practice they are placed in a block that decodes last because they are not in a hurry to execute.
The assignment of ops to blocks is also impacted by encoding format issues; slots in any block should contain only ops with similar bitwise format, for encoding entropy reasons and simpler decode. Thus the con operation, whose format is completely unlike the other readerPhase ops, is off in the same block with call (callPhase) and conform (writerPhase).
Now, as to the call op. CallPhase occurs after opPhase (when we do things like add). The physical belt changes only at cycle boundaries. To accommodate the intra-cycle logical changes of belt content, the physical belt is actually twice as long as the logical belt in the program model. If the code in the current frame C is F(a+b), where a and b are on the belt, then the add happens in the opPhase cycle, and the result is physically available at the end of that cycle. The result will be identified as C:b0, assuming nothing else drops, and the former C:b0 will become C:b1, C:b1 becomes C:b2, and so on.
Meanwhile, the hardware makes a copy of the result of the add and identifies it as F:b0, i.e. the first belt position in the called function. There are no other operands identified as F:bX, so the rest of the belt of the callee is empty. All this happens essentially at the cycle boundary, so that the first instruction of the callee sees the arguments and nothing else. In effect, callPhase is really that first callee instruction cycle. Because we go directly from the caller cycle containing the call to the callee cycle, the call in effect takes no time at all.
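The F(a+b) example can be modelled logically in a few lines of Python. This is only a sketch of the program-visible behaviour, not of the hardware: the `Belt` class and `call` helper are hypothetical names, the belt length of 8 is an assumption (it varies by Mill family member), and the model really copies values where the hardware only renames them.

```python
from collections import deque

class Belt:
    """Logical belt: position 0 is the newest drop, old values fall off the end."""
    def __init__(self, length=8):  # length 8 is an assumed value
        self.slots = deque(maxlen=length)

    def drop(self, value):
        # A new result becomes b0; everything else shifts one position down.
        self.slots.appendleft(value)

    def __getitem__(self, position):
        return self.slots[position]

def call(caller_belt, arg_positions, length=8):
    """Build the callee's belt: it holds the arguments and nothing else."""
    callee = Belt(length)
    # Drop in reverse so the first argument ends up at the callee's b0.
    for position in reversed(arg_positions):
        callee.drop(caller_belt[position])
    return callee

caller = Belt()
caller.drop(3)                       # a
caller.drop(4)                       # b; the belt is now b0=4, b1=3
caller.drop(caller[0] + caller[1])   # opPhase add: the sum becomes C:b0
callee = call(caller, [0])           # callPhase: F sees only the argument
```

After this runs, the caller's belt reads 7, 4, 3 from b0 onward, and the callee's belt holds the single value 7 at F:b0.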
ReaderPhase is likewise free. After an opPhase op has been decoded, it takes a cycle to set up the crossbar to get the op its input data, so an add (for example) cannot execute immediately after decode; instead there is an “issue” stage between the second decode and the execute stage. However, readerPhase ops do not take a belt argument, and so they don’t need a setup stage for the crossbar. We can therefore execute them in the same cycle in which the opPhase ops are getting ready for issue. So the timing is:
readerPhase execute/opPhase issue
callPhase/first called instruction
writerPhase (which is also opPhase of the next instruction and readerPhase of the one after that)
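The overlap in that last line can be tabulated with a small Python sketch. It assumes one instruction issues per cycle with no calls or stalls, and the cycle offsets (reader 0, op 1, writer 2) are my reading of the timing above rather than anything taken from the hardware; `PHASE_OFFSET` and `schedule` are made-up names.

```python
# Cycle offset of each phase relative to the instruction's readerPhase.
# Assumed values, inferred from "writerPhase is also opPhase of the next
# instruction and readerPhase of the one after that".
PHASE_OFFSET = {"reader": 0, "op": 1, "writer": 2}

def schedule(num_instructions):
    """Map each cycle to the (instruction index, phase) pairs active in it."""
    timeline = {}
    for i in range(num_instructions):
        for phase, offset in PHASE_OFFSET.items():
            timeline.setdefault(i + offset, []).append((i, phase))
    return timeline

timeline = schedule(3)
for cycle in sorted(timeline):
    print(cycle, timeline[cycle])
```

In cycle 2 the table shows instruction 0 in writerPhase, instruction 1 in opPhase and instruction 2 in readerPhase, which is the three-way overlap described above.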