The execution timing is the same so long as decoded ops are available, whether or not phasing is used: each execution pipe is fully pipelined and so can start an op every cycle, and the dataflow constrains the earliest cycle at which an op can start. If it could decode the ops and had enough pipes, an OOO x86 would feed dataflows into its pipes the same way that a Mill does. The barrel is no different.
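As a toy illustration of that dataflow constraint (my own sketch, nothing Mill- or x86-specific): an ASAP schedule starts each op as soon as its inputs are ready, and no amount of extra pipes can start a dependent op any earlier. The op names, latencies, and dependence graph below are invented for the example.

```python
# Toy ASAP (as-soon-as-possible) scheduler: illustrative only.
def asap_schedule(latency, deps):
    """Earliest issue cycle for each op, given per-op latencies and a
    dependence map {op: [ops it reads from]}.
    Assumes ops are listed in topological (dataflow) order."""
    start = {}
    for op in latency:
        start[op] = max((start[d] + latency[d] for d in deps.get(op, [])),
                        default=0)
    return start

# A chain of three dependent 1-cycle ops: b reads a, c reads b.
# The earliest starts are cycles 0, 1, 2 no matter how many pipes exist.
lat  = {"a": 1, "b": 1, "c": 1}
deps = {"b": ["a"], "c": ["b"]}
print(asap_schedule(lat, deps))  # {'a': 0, 'b': 1, 'c': 2}
```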
The gain from phasing occurs at the boundary conditions, where the sequence of instruction decodes detaches from the sequence of issue and execution. This happens at control-flow points. For example, when you call a function, unless you are OOO you must complete all issues in the caller, transfer, decode in the callee, and refill the pipelines in the callee. The same occurs at return, at branches, and, importantly, at mispredicts.
Phasing is “really” a way to do short-range OOO without the OOO hardware. If the code is: 3-cycle dataflow -> branch -> 3-cycle dataflow -> branch -> 3-cycle dataflow, then an OOO machine will get it all done in three issue cycles and five cycles overall – and so will the Mill. A strict in-order machine will take eleven cycles, or nine if it has a predictor.
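The in-order cycle arithmetic can be checked with a toy model (my own, under stated assumptions: each dataflow is a chain of three dependent single-cycle ops, each branch costs one cycle in-order, and with enough issue width the independent chains issue one op each per cycle):

```python
# Toy cycle-count model for the dataflow/branch example; all assumptions mine.
def in_order_cycles(chain_lengths, branch_cost):
    """Strict in-order: the chains and the branches between them serialize."""
    return sum(chain_lengths) + branch_cost * (len(chain_lengths) - 1)

def overlapped_issue_cycles(chain_lengths):
    """Enough MIMD width: independent chains each issue one op per cycle,
    so issue finishes when the longest chain does."""
    return max(chain_lengths)

chains = [3, 3, 3]                      # three 3-cycle dataflows
print(in_order_cycles(chains, 1))       # 11: each branch costs a cycle
print(in_order_cycles(chains, 0))       # 9: a predictor hides the branch cycles
print(overlapped_issue_cycles(chains))  # 3: the three issue cycles
```

This model only counts issue cycles; it doesn't capture the pipeline-latency tail that makes the overlapped case five cycles overall.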
So phasing is just a poor man’s OOO, taking advantage of the fact that real code contains a lot of short independent dataflows that, given the MIMD width, you can overlap.
Of course, once you have phasing, other possibilities open up: multi-call, the First Winner Rule, 0-cycle pick, …