Answering particular parts of your question where I can:
> How much of the selection of the current phasing setup was a result of the way the hardware works, and how much came from profiling actual code to see what would be useful in practice?
Well, it was definitely informed by code analysis. There’s an upcoming talk on Configuration, and it’s been described in the talks and in the Hackaday interview.
A central tenet of the Mill is that people building products can prototype custom processor specifications in the simulator very quickly, and then choose the configuration with the best power/performance tradeoff, informed by benchmarking representative code.
> With regard to calls, how many cycles do they normally take in overhead?
One. Calls and branches that are correctly predicted transfer in the very next cycle. For mispredictions where the destination is in the instruction cache, the Mill is again unusually fast: the penalty is just five or so cycles.
Additionally, there are none of the conventional preambles and postambles.
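To put those numbers together, here is a rough back-of-the-envelope sketch of the average transfer cost per call or branch. It is not anything from the Mill tools: the function name and parameters are made up for illustration, it assumes the roughly five-cycle penalty is added on top of the one-cycle transfer, and it only covers the case where the destination is in the instruction cache.

```python
def expected_transfer_cycles(mispredict_rate, hit_cycles=1, mispredict_penalty=5):
    """Average cycles per call/branch under a simple linear model.

    Assumptions (illustrative, not official Mill figures):
      - a correctly predicted transfer costs `hit_cycles` (one cycle)
      - a misprediction adds `mispredict_penalty` cycles ("five or so")
      - the destination is always in the instruction cache
    """
    if not 0.0 <= mispredict_rate <= 1.0:
        raise ValueError("mispredict_rate must be between 0 and 1")
    return hit_cycles + mispredict_rate * mispredict_penalty


# Example: a hypothetical workload that mispredicts 5% of its transfers.
average = expected_transfer_cycles(0.05)
```

With a 5% misprediction rate this model gives an average of 1.25 cycles per transfer. The real figure depends on the actual misprediction rate of your code and on instruction cache misses, which this sketch ignores.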