Streamers NYF 🙂
Deferral is limited by the encoding: how many cycles can you encode. The deferral is in a morsel and is biased by the minimum, so a Copper can defer 11 cycles and a Silver 20. This is enough to hide L1 and usually L2 in open code.
The number of retire stations is another limit that sometimes bites in open code but is especially a problem in piped loops. For compute loops a rule of thumb is two loads and a store per iteration to size the FU counts. We try to saturate to a piped one-bundle loop, so with a latency of 3 we have six in flight, which helps size the retire station count. And yes, that gives a miss any time we need to go to the L2 or further. However, this is mitigated by the cache hardware that holds whole cache lines. If the loop is stepping through arrays (typical) then only the first miss of a line will take a hit: later loads to the same line will hit in the L1. So for typical line sizes you are missing on fewer than 10% of the loads regardless of where the line is.
Of course, that doesn’t help if the loop has no locality of reference: a hash table perhaps. But piping helps there too, because the pipe has several loads in flight. If the first misses, the later ones can miss too (miss under miss) without additional delay. The more loads the code can issue before one needs one to retire, the better. But that is bounded by the number of stations. Member configuration tuning is an art 🙂
As for compilers, we don’t try to do much of anything beyond piping and hoisting. Mill can do anything that can be done in other ISAs, such as prefetching in software or hardware, but we have no experience yet that suggests that we should. The big win was the architecture that split issue from retire in loads and made the load latency visible.