Beyond object method dispatch, lambda passing style would be another source of unstable transfer points. Custom lambdas passed into central utilities are often per-use closures. On the upside, the actual function that will be called is hoisted all the way up to an input parameter, which again would be an ideal for software injecting “upcoming|next EBB” hints into the prefetcher.
For context, how many cycles ahead of an exit would the exit table need to be notified in order to prefetch code from cache to execute without any stalls? I suspect this would vary between family members. If it’s under 10, I would imagine software hints could eliminate stalls for many truly unstable dispatched exits.
I agree that DRAM latency isn’t worth considering in these optimization scenarios. However, if the 5-cycle mispredict penalties are a concern, the fact remains that the absolute correct target for fully dynamic dispatch should be available in the software view far enough ahead of the call in enough situations to be beneficial to the hardware mechanism. The problem is communicating it from software into hardware.
The Mill has software-controlled prefetching of data via speculation, but not software-controlled prefetching of code (that we’ve seen). If the hardware predictor consistently fails for a dispatched usage case, there’s no other way to update or augment its functionality.
Having a compiler decide between whether to generate preemptive dispatch hints vs letting the predictor attempt to compensate would probably best be left to runtime log feedback, and might not be used by all compilers. But not having that option at all seems to me to be missing functionality that manifests in penalizing dynamic code.
(Obviously, hopping through multiple layers of dynamic indirection quickly would likely cause stalls no matter the prefetch mechanism, but most dynamic designs boil down to just one indirection per function call.)