This wouldn’t be a Mill architecture, but I wonder if there would be a problem with adding more phases and having different, otherwise identical functional units do their work in different phases. For example, maybe read, pick1, op1, call, op2, pick2, write. I suppose that would impose limits on how quickly you could schedule instructions back to back, but it would make decoding easier for a given width. Were there other important benefits and limitations I’m not thinking of? How much of the selection of the current phasing setup was a result of the way the hardware works, and how much came from profiling actual code to see what would be useful in practice?
The phase structure is largely dictated by the encoding and the decode process. It takes three cycles to fully decode the extremely wide Mill instructions, but the ops in one block are completed in each of those cycles, and we issue them right away, as soon as they are available; that gives us three phases. The pick op has no cycle of its own, and the call op uses the cycles of the called function, which gives us five phases in all.
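The key point above is that decode is pipelined: each block of an instruction finishes decoding in a different cycle, and its ops issue immediately, so consecutive instructions overlap phase by phase. A minimal sketch of that idea, with invented block names, op names, and timings (illustrative only, not the actual Mill pipeline):

```python
# Sketch: phased issue from a 3-cycle block decode.
# Phase names and the one-block-per-cycle timing are assumptions
# made for illustration, not the real Mill decode.

PHASES = ["reader", "compute", "writer"]  # one phase per decoded block

def issue_schedule(instructions):
    """Each instruction is a dict mapping phase -> list of ops.
    Block k of instruction i is decoded in cycle i + k, and its ops
    issue immediately in that same cycle."""
    schedule = {}  # cycle -> list of (instruction index, phase, op)
    for i, instr in enumerate(instructions):
        for k, phase in enumerate(PHASES):
            for op in instr.get(phase, []):
                schedule.setdefault(i + k, []).append((i, phase, op))
    return schedule

insns = [
    {"reader": ["con(5)"], "compute": ["add"], "writer": ["store"]},
    {"reader": ["con(7)"], "compute": ["mul"], "writer": ["store"]},
]
sched = issue_schedule(insns)
# In cycle 1, instruction 0's compute ops issue alongside
# instruction 1's reader ops: the phases of adjacent
# instructions overlap, so back-to-back issue still works.
```

The overlap in cycle 1 is why phasing does not cost the back-to-back scheduling rate the question worries about: each new instruction starts its reader phase while its predecessor is still in compute.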
We could do as you suggest and add phases with dedicated FUs. However, the ops for those phases would already have been decoded and would just be sitting there waiting for their phase, so why not issue them at once? We could of course go to a five-block encoding, which would take four cycles to decode and so require another phase. However, three seems to be all that’s useful.
Assignment of ops to blocks (and hence to phase) is ad hoc but guided by hardware constraints. For example, it is not possible to get the belt bypass set up in parallel with decoding the first (reader) block, so reader-phase ops cannot take belt arguments. Consequently the reader phase can only contain pure-source ops or administrative ops that take no belt argument. The reader phase also cannot contain an op taking a large constant (such as the stackf frame-allocation op), because the constant would wreck the entropy of the encoding of the other reader ops, which are quite small.
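The entropy argument can be made concrete with made-up numbers (these bit widths are purely illustrative and not the actual Mill encoding): if every slot in a block must be wide enough for the largest op that might occupy it, one big-constant op taxes all the small ops sharing the block.

```python
# Illustrative only: invented bit widths, not the real Mill encoding.
# Assume a uniform slot width per block, sized for the worst-case op.

def slot_bits(op_sizes):
    """Bits per slot when every slot must fit the largest op."""
    return max(op_sizes)

small_reader_ops = [7] * 8                     # eight small ops, ~7 bits each
with_big_const = small_reader_ops + [7 + 32]   # plus one op with a 32-bit constant

print(slot_bits(small_reader_ops))  # 7
print(slot_bits(with_big_const))    # 39
```

With the big-constant op admitted, every slot balloons from 7 to 39 bits even though eight of the nine ops never use the extra room, which is the sense in which one large constant "wrecks the entropy" of an otherwise compact block.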