I have a question about hoisted deferred loads and function boundaries.
Consider a relatively large function that requires multiple loads. I would expect the start of the function to be a flurry of address calculations and deferred loads, with as many issued as possible before getting on with the rest of its functionality, in order to hide cache/DRAM latency. I might even call it a ‘deferred load preamble’: not an official term, but I could see the pattern being common enough to deserve a name.
So my first question: Does this scenario sound reasonable? Would you expect it to be that common?
Now let’s extend it. Break up the function into three smaller functions. Let’s assume it’s very simple and you can just group instructions together into their own functions, with outputs flowing to inputs etc. So instead of one big section at the beginning where all the loads are issued, each smaller function has its own ‘deferred load preamble’. This would mean that, e.g., the last of the three cannot issue its loads as far ahead of their use and may suffer more from memory latency.
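To make the two shapes concrete, here is a rough C sketch. The function names and the arithmetic are made up purely for illustration, and plain C can't express a Mill deferred load, so the comments only mark where I would expect the compiler (or specializer) to issue them.

```c
/* Hypothetical example: names and arithmetic are illustrative only. */

/* Monolithic version: all three loads sit at the top, so each one has
 * the whole preceding computation in which to complete. */
int big_function(const int *pa, const int *pb, const int *pc)
{
    /* 'deferred load preamble': all loads could be issued here */
    int a = *pa;
    int b = *pb;
    int c = *pc;

    int x = a * 3 + 1;   /* stage 1 consumes a */
    int y = x + b * 5;   /* stage 2 consumes b */
    int z = y ^ c;       /* stage 3 consumes c */
    return z;
}

/* Split version: each stage carries its own small preamble. */
static int stage1(const int *pa)
{
    int a = *pa;         /* load issued on entry to stage1 */
    return a * 3 + 1;
}

static int stage2(int x, const int *pb)
{
    int b = *pb;         /* load issued on entry to stage2 */
    return x + b * 5;
}

static int stage3(int y, const int *pc)
{
    int c = *pc;         /* load issued only once stage3 is entered */
    return y ^ c;
}

/* Unless the stages are inlined, the load of *pc cannot start until
 * stage1 and stage2 have already returned. */
int split_function(const int *pa, const int *pb, const int *pc)
{
    int x = stage1(pa);
    int y = stage2(x, pb);
    return stage3(y, pc);
}
```

Both versions compute the same result; the only difference I care about is how early the load feeding the last stage can be issued relative to its use.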
Does this also sound reasonable? Is it just the compiler’s (|| specializer’s) responsibility to inline functions and hoist loads as much as possible, or does the Mill hardware offer any mitigation for this issue? It’s not OOO, so I wouldn’t really expect it to “peek ahead” to see those loads, but then again the Mill’s tolerance of speculation would really help such an implementation.