This wouldn’t be a Mill architecture, but I wonder if there would be a problem with adding more phases and having otherwise identical functional units do their work in different phases. For example, maybe read, pick1, op1, call, op2, pick2, write. I suppose that would limit how quickly you could schedule instructions back to back, but it would make decoding easier for a given width. Are there other important benefits and limitations I’m not thinking of? How much of the current phasing setup was a result of the way the hardware works, and how much came from profiling actual code to see what would be useful in practice?
With regard to calls, how many cycles of overhead do they normally take?
What happens if you have an FMA instruction but fail to also issue an Args? Or is it that the presence of an Args in an instruction turns the multiply operation into an FMA operation?
Am I right in guessing that which operations are connected to which in ganging is a result of which slot in the instruction the ops occupy?