- Josh | Participant | July 12, 2015 at 6:02 am | Post count: 2
I’ve gone through the talks and it’s a beautiful architecture. There’s one thing I’m not sure of regarding deferred loads; maybe I’ve missed it from the talks.
As I see it, one of the strengths of OOO machines is that they can schedule dynamically across control flow boundaries. Mill relies on the compiler, and in many cases it’s difficult to find enough work ahead of time to fill even the L1 load latency.
For example, take a leaf function that just does a *mem++ and, for whatever reason, can’t be inlined. That’s a load-add-store, and there’s possibly nothing you could do to move the load away from its consumer. An OOO machine handles this just fine, but with deferred loads you’d stall for the full (cache) access.
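A minimal C sketch of the kind of leaf function described above. The name `bump` and the `NOINLINE` macro are hypothetical, and the GCC/Clang `noinline` attribute is only an assumption standing in for "can't be inlined for whatever reason":

```c
/* Hypothetical stand-in for "cannot be inlined": on GCC/Clang this
 * attribute keeps the call from being inlined; elsewhere it expands
 * to nothing and the compiler is free to inline. */
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* The whole body is one dependent chain: load *mem, add 1, store back.
 * There is no independent work inside the function that a static
 * scheduler could place between the load and the add. */
NOINLINE void bump(int *mem)
{
    (*mem)++;
}
```

Because the load's only consumer is the add that immediately follows it, any latency hiding has to come from outside the function, which is the point of the question.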
As far as I can tell Mill doesn’t speculatively fire loads, you always have to at least cover for the L1 latency. Did I get something wrong?
- Ivan Godard | Keymaster | July 12, 2015 at 1:40 pm | Post count: 629
It is easy to find cases where OOO is better, and cases where static is better. The only fundamental resolution of your question is measurement of real programs, and we’re not far enough along for that.
To take your particular example: MEM++ will only have visible impact if MEM is in DRAM and MEM++ is executed frequently. That combination is something of an oxymoron: if it is accessed frequently it will be in cache, and if it isn’t then the memory stall will happen but its effect will be in the noise. So the case we have to worry about is when there are lots of different MEMs that are individually infrequently accessed but collectively matter. That’s a description of a “working set”; it is important that the working set fit in cache, or performance dies on any machine. The Mill pays a lot of attention to working set: the high entropy code, backless memory, streamers, whatnot.
Now if MEM is in fact coming from DRAM then OOO doesn’t buy much. Yes, it can execute ahead of the stall, but in practice in general-purpose code an OOO machine runs out of one resource or another quickly, and the DRAM stall happens anyway. (An exception is when there is a ton of independent unrelated ops available, as in BLAS benchmarks or any other code that SIMDs well – but that’s not your case and we do as well as OOO on that kind of code anyway).
So let’s get more realistic and say that both the OOO and the Mill have the working set in the (LLC) cache, and that MEM is unrelated to what follows the return. Then yes, Mill deferred loads won’t help, because the function is doing nothing but the increment, while the OOO deferral of execution will help because the increment can be overlapped with the caller – so long as nothing happens to drain the pipe, like an interrupt or mispredict. Like I said, it’s easy to find bounded cases where OOO is better.
But customers don’t care about cases, they care about whole programs. You hypothesize that the MEM++ function cannot be inlined. Well, sure, but if you cared about your performance why would you prevent inlining? And there are system-level tradeoffs to consider: in this case OOO might save a 10-cycle LLC delay, but then the cost is a 15-cycle mispredict stall – every time – instead of the Mill’s five-cycle stall, and a heat/power budget that prevents adding another core or two that the Mill can have.
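The tradeoff can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: it takes the illustrative figures from this post (10 cycles of LLC delay, 15-cycle versus five-cycle mispredict stalls) and assumes, optimistically for OOO, that the whole LLC delay is hidden on every call. The function name `ooo_net_cycles` and its parameters are hypothetical, and the power budget is left out entirely.

```c
/* Net cycles an OOO design gains (positive) or loses (negative)
 * relative to the Mill under the simplified model described above.
 *   leaf_calls  - calls to the MEM++ function that hit in the LLC
 *   mispredicts - branch mispredicts over the same stretch of code */
long ooo_net_cycles(long leaf_calls, long mispredicts)
{
    const long llc_delay_hidden = 10; /* assumed fully hidden by OOO   */
    const long ooo_mispredict   = 15; /* OOO stall per mispredict      */
    const long mill_mispredict  = 5;  /* Mill stall per mispredict     */

    long saved = leaf_calls * llc_delay_hidden;
    long extra = mispredicts * (ooo_mispredict - mill_mispredict);
    return saved - extra;
}
```

With these particular figures the two effects cancel at one mispredict per leaf call; which side of that line a real program falls on is exactly what measurement would have to show.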
In some cases CPU engineering provides absolute guarantees that one approach is better than another: more instruction cache at the same power and latency is better, for example. But other matters are complex tradeoffs, for which the design answers often boil down to a seat-of-the-pants feeling that “real programs don’t do that”.
We can’t wait to publish our tool chain and sim so people can actually measure real programs. No doubt we will find that some of the things we thought were cool ideas turn out to not buy much, and we will find unsuspected bottlenecks. Measurement will tell. As we have long said as a sort of inside joke: “Only Mr. Sim knows!”
- Josh | Participant | July 12, 2015 at 7:59 pm | Post count: 2
Thank you for your detailed analysis.
I had the belief that needing a loaded value early in a function would be a common case, and a 10-cycle or even 3-cycle stall would be kind of a big deal on a machine as wide as the Mill.
But yeah, I don’t have data to really back it up.
This also seems to happen from the caller side, since you have to drop everything onto the belt in the right order before the call. In this case it could be mitigated (at some cost) if the hardware didn’t stall until the corresponding belt position is accessed by the callee.
This is all hand-waving at this point though.
- mermerico | Moderator | July 19, 2015 at 10:25 am | Post count: 10
Streamers are mentioned in a couple places on the site, but I don’t remember seeing them in the talks. Have they been filed?