– 1) Exit prediction with dynamic addresses
What’s the current story for exit prediction with dynamic function pointers, vtables and such? Obviously, if you have a loop iterating over an array of virtual objects, the predicted address for a single call instruction isn’t going to be too helpful. But it’s probably suboptimal to throw up your hands and say “it can’t be predicted”, since sometimes these addresses will be available tens of cycles ahead of time.
I think you mentioned potentially using deferred branches, but eventually gave up on the idea. If nothing else, deferred branches wouldn’t help you with the “array of objects” cases, since they could presumably only predict the first one.
Ideally, you would want to have predictions available for each vtable in the array, so you can call the next method as soon as you return from the last one. Maybe you could have a prediction queue of some sort? Or an alternate encoding scheme for when the Exit Table predicts that a given exit will always be a dynamic branch?
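To make the “array of objects” case concrete, here is a minimal C++ sketch of the kind of loop in question. The Shape, Square and Rect classes and the total_area function are made-up names for illustration only; nothing here is Mill-specific.

```cpp
#include <memory>
#include <vector>

// Hypothetical class hierarchy, only here to illustrate the question.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

struct Rect : Shape {
    double w, h;
    Rect(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

// One call site, many possible targets: the target of shape->area()
// comes from the vtable of the current element, so a single predicted
// address for this call is wrong whenever the dynamic type changes.
double total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double sum = 0.0;
    for (const auto& shape : shapes) {
        sum += shape->area();
    }
    return sum;
}
```

Note that the vtable pointer of element i+1 can be loaded while the call for element i is still running, which is why the target is, in principle, known well before the next call is issued.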
– 2) Hardware prefetching
How does the Mill handle data-side hardware prefetching? Traditional CPUs mostly rely on detecting stride patterns, eg “If you asked for addresses 12037, 12047 and 12057, you’re probably going to ask for 12067”. Do you expect Mill cores to do stride detection?
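For reference, the stride heuristic described above can be modelled in a few lines of C++. This is a toy software model under simple assumptions (one stream, predict after two equal deltas), not a description of how any real prefetcher or the Mill works; StrideDetector and observe are hypothetical names.

```cpp
#include <cstdint>

// Toy model of a single-stream stride detector (hypothetical, simplified).
class StrideDetector {
public:
    // Records one access. Returns true and fills `predicted` once the same
    // non-zero stride has been seen on two consecutive accesses.
    bool observe(std::uint64_t addr, std::uint64_t& predicted) {
        bool hit = false;
        if (has_last_) {
            // Unsigned subtraction wraps, so negative strides work too.
            const std::uint64_t stride = addr - last_addr_;
            if (has_stride_ && stride == last_stride_ && stride != 0) {
                predicted = addr + stride;
                hit = true;
            }
            last_stride_ = stride;
            has_stride_ = true;
        }
        last_addr_ = addr;
        has_last_ = true;
        return hit;
    }

private:
    std::uint64_t last_addr_ = 0;
    std::uint64_t last_stride_ = 0;
    bool has_last_ = false;
    bool has_stride_ = false;
};
```

Feeding it 12037, 12047 and 12057 yields a prediction of 12067 on the third access, matching the example above.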
Deferred loads can help hide the latency of an L1 miss, but obviously don’t help with an L3 miss. There are also some common patterns (eg tree traversal, arrays of pointers) where stride detection and deferred loads don’t help at all, but the hardware would still have enough information to prefetch lots of data ahead of time. For instance, a foreach loop iterating depth-first over a binary tree might want to take the left branch immediately and prefetch the right branch, thus skipping half the L3 misses. Does the Mill provide anything to facilitate this?
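As a point of comparison, this is roughly what the tree case looks like when done by hand in software on a conventional CPU. It is a sketch only: Node and sum_depth_first are made-up names, and __builtin_prefetch is a GCC/Clang builtin, not anything Mill-specific.

```cpp
// Hypothetical binary tree node, for illustration only.
struct Node {
    int value;
    Node* left;
    Node* right;
};

// Depth-first sum that descends left immediately and issues a prefetch
// hint for the right child, so its cache miss overlaps the left subtree.
long sum_depth_first(const Node* node) {
    if (node == nullptr) {
        return 0;
    }
#if defined(__GNUC__)
    // A prefetch hint does not fault, so a null right child is harmless.
    __builtin_prefetch(node->right);
#endif
    long sum = node->value;
    sum += sum_depth_first(node->left);
    sum += sum_depth_first(node->right);
    return sum;
}
```

The hint only pays off when the left subtree takes long enough to cover the miss on the right child, which is part of why a hardware-assisted mechanism would be interesting.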
– 3) Inner-loop specific optimizations
The “inner” instruction seems like it would open a lot of potential for optimizations. Have you explored loop-specific optimizations?
I’m thinking about things like exact exit prediction (if you’re looping from “i = 0” to “i < array.size”, you know exactly how many iterations the loop will run), smart prefetching (in a loop with no branches, you know exactly what loads the next iterations will run, so you can prefetch them ahead of time), etc.
I know that a lot of software prefetching is wasted because it prefetches past the end of a loop and things like that, but the hardware would have enough info to be precise.
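To illustrate the “prefetching past the end” problem, here is a sketch of a counted loop with manual software prefetching in C++. The names sum_with_prefetch and kPrefetchDistance are hypothetical, the distance of 16 elements is an arbitrary assumption, and __builtin_prefetch is a GCC/Clang builtin.

```cpp
#include <cstddef>

// Assumed prefetch distance in elements; a real value would need tuning.
constexpr std::size_t kPrefetchDistance = 16;

long sum_with_prefetch(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // The trip count is known on entry, so the guard below keeps the
        // prefetch inside the array. Without it, the last
        // kPrefetchDistance iterations would fetch data that is never used.
        if (i + kPrefetchDistance < n) {
#if defined(__GNUC__)
            __builtin_prefetch(&data[i + kPrefetchDistance]);
#endif
        }
        sum += data[i];
    }
    return sum;
}
```

In software the bounds check costs an extra compare and branch per iteration, which is why it is often omitted; hardware that knows the loop bounds could in principle apply the same cutoff for free.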