This is getting into internal subtleties as well as definitional ones. The code you want can be reached by pipelining alone, by scheduling further down the pipe, with no unroll. Or maybe that should be called “unrolling” too? I assume that the loop is something like
for (int i = 0; i < N; ++i) a[i] += 5;
so there is a recurrence in the update to the control variable, and the latency of the load must be dealt with. Assuming three-cycle loads, there will be three iterations in flight in the load, plus one in the add, before the first store has anything to do. The pipelining logic has a known (previously computed) number of bundles to work with, and does a normal schedule that wraps around at that number. If there are unused resources (like load slots in the bundles) when it places the last instruction of the hyperblock, it can just continue modulo placement until those run out. I suppose it would be fair to call this “unrolling”, although I think of it more as piping.
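To make the "iterations in flight" picture concrete, here is a rough sketch in plain C of what the piped loop does on each trip. It is only an emulation, not what the tool actually emits. The stage numbers are assumptions (three-cycle load as above, a one-cycle add), and the names `add5_pipelined`, `loaded`, `summed` and `RING` are made up for illustration. Each trip of the loop stands for one cycle: it stores for the iteration that started four cycles ago, adds for the one that started three cycles ago, and issues the load for the new one. The guards play the role of the prologue and epilogue.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stage placement: load issues at cycle 0 of an iteration,
   its value is usable at cycle 3, the add takes one cycle, then the store. */
enum { LOAD_LATENCY = 3, ADD_STAGE = LOAD_LATENCY, STORE_STAGE = ADD_STAGE + 1, RING = 8 };

/* Piped form of: for (i = 0; i < n; ++i) a[i] += 5; */
void add5_pipelined(int *a, size_t n)
{
    int loaded[RING];   /* values in flight between load and add */
    int summed[RING];   /* values in flight between add and store */
    assert(STORE_STAGE < RING);   /* ring must hold every in-flight iteration */

    /* One trip per cycle; STORE_STAGE extra trips drain the pipe. */
    for (size_t t = 0; t < n + STORE_STAGE; ++t) {
        if (t >= STORE_STAGE) {                      /* store, iteration t-4 */
            size_t i = t - STORE_STAGE;
            a[i] = summed[i % RING];
        }
        if (t >= ADD_STAGE && t - ADD_STAGE < n) {   /* add, iteration t-3 */
            size_t i = t - ADD_STAGE;
            summed[i % RING] = loaded[i % RING] + 5;
        }
        if (t < n) {                                 /* load, iteration t */
            loaded[t % RING] = a[t];
        }
    }
}
```

Note that the first store does nothing useful until the fifth trip, which matches the count above: three iterations in the load plus one in the add. Nothing here is unrolled in the source sense; the body appears once and the overlap comes purely from where each operation sits relative to the iteration that owns it.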
As for PGO in general, I don’t see much to gain. Full disclosure, though: PGO is not implemented, and measurement often surprises me.