It is easy to find cases where OOO is better, and cases where static is better. The only fundamental resolution of your question is measurement of real programs, and we’re not far enough along for that.
To take your particular example: MEM++ will only have visible impact if MEM is in DRAM and MEM++ is executed frequently. That combination is something of an oxymoron: if it is accessed frequently it will be in cache, and if it isn’t then the memory stall will happen but its effect will be in the noise. So the case we have to worry about is when there are lots of different MEMs that are individually infrequently accessed but collectively matter. That’s a description of a “working set”; it is important that the working set fit in cache, or performance dies on any machine. The Mill pays a lot of attention to working set: the high entropy code, backless memory, streamers, whatnot.
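To make the working-set point concrete, here is a toy sketch. It assumes a fully associative LRU cache, which is an illustration only and not a model of the Mill's (or anyone's) actual cache hierarchy. The function name `lru_miss_rate` and the sizes used are made up for the example.

```python
from collections import OrderedDict

def lru_miss_rate(accesses, capacity):
    """Fraction of accesses that miss in a toy fully associative LRU cache.

    Assumption: one entry per address, pure LRU replacement. Real caches
    have sets, ways, and line sizes; this only illustrates the trend.
    """
    cache = OrderedDict()
    misses = 0
    for addr in accesses:
        if addr in cache:
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return misses / len(accesses)

CAPACITY = 64  # hypothetical cache size, in lines

hot = [0] * 1000                 # one MEM, accessed frequently
fits = list(range(64)) * 100     # many MEMs, working set fits in cache
thrash = list(range(65)) * 100   # working set one line too large

for name, trace in (("hot", hot), ("fits", fits), ("thrash", thrash)):
    print(name, lru_miss_rate(trace, CAPACITY))
```

In this model the single hot address misses only on its first touch, the 64-line loop misses only while warming up, and the 65-line loop misses on every access, because LRU always evicts the line that is needed next. That cliff is the "performance dies on any machine" case.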
Now if MEM is in fact coming from DRAM then OOO doesn’t buy much. Yes, it can execute ahead of the stall, but in practice in general-purpose code an OOO machine runs out of one resource or another quickly, and the DRAM stall happens anyway. (An exception is when there is a ton of independent unrelated ops available, as in BLAS benchmarks or any other code that SIMDs well, but that’s not your case and we do as well as OOO on that kind of code anyway.)
So let’s get more realistic and say that both the OOO and the Mill have the working set in the (LLC) cache, and that MEM is unrelated to what follows the return. Then yes, Mill deferred loads won’t help, because the function is doing nothing but the increment, while the OOO deferral of execution will help because the increment can be overlapped with the caller, so long as nothing happens to drain the pipe, like an interrupt or a mispredict. Like I said, it’s easy to find bounded cases where OOO is better.
But customers don’t care about cases; they care about whole programs. You hypothesize that the MEM++ function cannot be inlined. Well, sure, but if you cared about your performance why would you prevent inlining? And there are system-level tradeoffs to consider: in this case OOO might save a 10-cycle LLC delay, but then the cost is a 15-cycle mispredict stall (every time) instead of the Mill’s five-cycle one, and a heat/power budget that prevents adding another core or two that the Mill can have.
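As a back-of-the-envelope sketch of that tradeoff, using only the illustrative cycle counts above: assume (and this is a simplification, not a measurement) that OOO hides the full LLC delay on every call to the non-inlined function, and that each mispredict costs the OOO machine the difference between the two stall lengths. The function names here are hypothetical.

```python
def ooo_net_cycles_saved(calls, mispredicts,
                         llc_delay=10, ooo_penalty=15, mill_penalty=5):
    """Cycles OOO gains over the Mill in this toy model.

    Assumptions: OOO fully overlaps the LLC delay on every call, and the
    only offsetting cost is the longer mispredict stall. Power, core
    count, and everything else are ignored. A negative result means OOO
    comes out behind.
    """
    hidden = calls * llc_delay
    extra_stall = mispredicts * (ooo_penalty - mill_penalty)
    return hidden - extra_stall

def break_even_mispredicts_per_call(llc_delay=10, ooo_penalty=15,
                                    mill_penalty=5):
    """Mispredicts per call at which the two effects cancel."""
    return llc_delay / (ooo_penalty - mill_penalty)

# Example: 1000 calls, with a varying number of mispredicts in the program.
for mispredicts in (100, 1000, 2000):
    print(mispredicts, ooo_net_cycles_saved(1000, mispredicts))
print(break_even_mispredicts_per_call())
```

With these particular numbers the two effects cancel at one mispredict per call to the function. Which side of that line a whole program lands on is exactly the kind of thing only measurement can settle, and the model says nothing about the heat/power side of the tradeoff.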
In some cases CPU engineering provides absolute guarantees that one approach is better than another: more instruction cache at the same power and latency is better, for example. But other matters are complex tradeoffs, for which the design answers often boil down to a seat-of-the-pants feeling that “real programs don’t do that”.
We can’t wait to publish our tool chain and sim so people can actually measure real programs. No doubt we will find that some of the things we thought were cool ideas turn out to not buy much, and we will find unsuspected bottlenecks. Measurement will tell. As we have long said as a sort of inside joke: “Only Mr. Sim knows!”