Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. Thus for example unrolling loops prevents software pipelining and forces spill/fill of transients that otherwise would live only on the belt for their lifetime; shut unrolling off for better performance.
Maybe this is a vocabulary problem, but there’s definitely a kind of unrolling that the Mill benefits from; we could call it “horizontal” unrolling as opposed to “vertical” unrolling which would be the traditional version.
But if you have code that looks like this:
If your Mill is wide enough, you will definitely benefit from turning it into:
CON(…), CON(…), CON(…),
ADD(…), LOAD(…), ADD(…), LOAD(…), ADD(…), LOAD(…),
STORE(…), STORE(…), STORE(…);
I don’t know exactly how you’d call that operation, but to me it’s “unrolling” of a sort. And it definitely hits that “efficiency / code size” tradeoff that PGO helps optimize.
Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. […] It is also unclear how much function and block reordering will actually pay;
That doesn’t sound right. As long as your codegen is based on heuristics, you’ll benefit from PGO somewhere.
To give a Mill-specific example, there’s
load delays. Your current heuristic is “as long as you can get away with”, but a PGO analysis might say “actually, this load almost always hits L1, so we can give it a delay of 3 to start this following load earlier / to better pack instructions”.