Prediction
The main task of prediction is knowing what code to prefetch in the instruction stream in the face of branches, to avoid stalls. The Mill features precise prediction that knows how much to prefetch and from where, usually several branches ahead.
Exit Table
The data structure that enables these unique features is the exit table. It doesn't track whether each individual branch is taken or not; it tracks at which branch control flow leaves an EBB. This of course drastically reduces the number of branches that need to be tracked, and the aggressive speculation possible on the Mill eliminates even more branches.
For this reason it becomes feasible to carry far more information for each exit than would be possible for every branch. An exit entry holds:
- an offset to the target EBB's entry address
- how many cache lines to load of the exit's own EBB
- which branch in the last cache line is the exit; this is not a real address, but serves as an identifier to construct the hashes for the exit table, together with the cache line counts
- what kind of branch it is: a call, a return, an unconditional branch, etc.
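The fields above can be modeled as a small record. This is only a sketch with hypothetical field names; the real bit layout is member-specific and not given here.

```python
from dataclasses import dataclass

@dataclass
class ExitEntry:
    target_offset: int   # offset to the target EBB's entry address
    line_count: int      # how many cache lines of this EBB to load
    branch_index: int    # which branch in the last line is the exit;
                         # an identifier used in hashing, not a real address
    kind: str            # "call", "return", "branch", ...

# Example: an exit that is a call, 3 cache lines into its EBB.
entry = ExitEntry(target_offset=0x40, line_count=3, branch_index=1, kind="call")
```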
The exit table must be fast, but it doesn't need 100% coverage: a missing entry just means a stall, the same as a mispredict. It is a cost/benefit trade-off. For this reason the exit table is organized as a hash table of carefully chosen size.
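A minimal model of that trade-off, assuming a plain fixed-size, direct-mapped table (size and hash function are hypothetical here): a missing or aliased entry simply returns nothing, which costs a stall but never affects correctness.

```python
TABLE_SIZE = 1024  # hypothetical; chosen per member by cost/benefit analysis

class ExitTable:
    def __init__(self):
        self.slots = [None] * TABLE_SIZE  # each slot holds a (key, entry) pair

    def _index(self, key):
        return hash(key) % TABLE_SIZE

    def insert(self, key, entry):
        # Collisions simply evict the old entry: less coverage, never wrong results.
        self.slots[self._index(key)] = (key, entry)

    def lookup(self, key):
        slot = self.slots[self._index(key)]
        if slot is not None and slot[0] == key:
            return slot[1]
        return None  # a miss just means a stall, like a mispredict
```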
Exit Chains
And this is how the prefetch gets ahead of the instruction stream: from any exit table entry, the address of the next exit can be constructed. This is used to look up the next exit, and so forth, until the hardware either finds no entry or comes across one it has already prefetched, which means there is a loop.
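The chain walk can be sketched as follows. Here `next_key` stands in for the address construction described above, and the depth limit is an assumption; the walk stops on a missing entry or on the first repeat, which indicates a loop.

```python
def walk_exit_chain(table, start_key, limit=16):
    """Follow the exit chain, collecting entries to prefetch."""
    seen = set()
    chain = []
    key = start_key
    while key is not None and len(chain) < limit:
        if key in seen:
            break  # already prefetched this one: we have found a loop
        entry = table.get(key)
        if entry is None:
            break  # no entry in the table: stop prefetching here
        seen.add(key)
        chain.append(entry)
        key = entry.get("next_key")  # stand-in for next-exit address construction
    return chain

# A two-EBB loop: a -> b -> a; the walk prefetches both, then stops.
loop_table = {"a": {"next_key": "b"}, "b": {"next_key": "a"}}
```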
Prefetch and Fetch
The prefetcher pulls cache lines it thinks it might need, according to the exit table, into the I$1, and places the prediction entries into the fast, fully associative prediction cache. This cannot fault or fail. One level above that, the fetcher pulls cache lines it is certain to need into the I$0, the microcaches the decoders directly access, and it also gives the decoder the next prediction, i.e. the next predicted EBB entry point, so it knows where to go next.
Prediction
This might, of course, still turn out to be wrong in actual execution; then there was a mispredict. The delay from execution back to decode is 3 cycles (the 3 phases, branches being in the writer phase), and the delay from the decoder to the prediction cache (which was preseeded by the prefetcher) is 2 more cycles, for a 5 cycle penalty on a mispredict.
Although the prediction cache is kept big enough for most inner loops, in exceedingly rare cases it still might not contain the correct entry. The prefetch delay from the exit table to the prediction cache is another 3 cycles.
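The penalty arithmetic above can be summed up in a few lines. The cycle counts are those given in the text; that the exit table delay simply adds on top of the mispredict penalty is an assumption based on the wording "another 3 cycles".

```python
EXEC_TO_DECODE = 3            # the 3 phases; branches resolve in the writer phase
DECODE_TO_PRED_CACHE = 2      # decoder back to the prediction cache
EXIT_TABLE_TO_PRED_CACHE = 3  # prefetch delay when the prediction cache misses

# Ordinary mispredict: prediction cache already holds the right entry.
mispredict_penalty = EXEC_TO_DECODE + DECODE_TO_PRED_CACHE

# Rare case: the correct entry must also come from the exit table.
pred_cache_miss_penalty = mispredict_penalty + EXIT_TABLE_TO_PRED_CACHE
```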
On a mispredict, the fetcher also updates the exit table with the new correct prediction. This can use any history-based prediction algorithm, and is probably member-dependent, tuned to the expected workloads. This means the exit table represents the history of program flow: it actively keeps track of it, and this history information also contributes to the lookup keys for the exit table.
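A sketch of that update path, assuming a simple history value folded into the lookup key; the real algorithm is member-dependent, and this only illustrates history contributing to the key and the entry being replaced with the exit actually taken.

```python
def history_key(ebb_address, history, bits=8):
    # Fold recent flow history into the lookup key, so the same EBB can
    # hold different predictions in different control-flow contexts.
    # The 8-bit history width is a hypothetical choice.
    mask = (1 << bits) - 1
    return (ebb_address, history & mask)

def on_mispredict(table, ebb_address, history, actual_exit):
    # Replace the stale prediction with the exit that was actually taken.
    table[history_key(ebb_address, history)] = actual_exit

# Example: record the corrected exit for an EBB in one flow context.
exit_table = {}
on_mispredict(exit_table, 0x1000, 0b1011, "exitB")
```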