Prediction
The main task of prediction is to know what code to prefetch in the instruction stream in the face of branches to avoid stalls. The Mill features precise prediction, that knows how much to prefetch from where, usually several branches ahread.
Exit Table
The data structure that enables those unique features is the exit table. It doesn't monitor whether to take each branch or not. I monitors at which branch control flow leaves an EBB. This of course drastically reduces the amount of branches that need to be tracked. And the aggressive Speculation possible on the Mill eliminates even more branches.
For this reason it becomes feasible to carry a lot more information for each exit, which wouldn't be possible for every branch. An exit entry holds:
- an offset to the target EBB entry address
- how many cache lines to load in your own EBB
- at which branch in the last cache line the exit is, this is not a real address, it serves more as an identifier to construct the hashes for the exit table, together with the cache line counts
- what kind of branch it is, a call, a return, an unconditional branch etc.
The exit table must be fast, but it doesn't need 100% coverage. A missing entry just means a stall, the same as a mispredict. It is a cost/benefit trade-off. For this reason the exit table is organized as a hash table of carefully chosen size.
Not all branches even need to be in the table. You don't really want rarely used error code in there crowding out often used entries. And since it is a hash table, newer entries can overwrite older entries on hash collisions. The hardware tracks hash collisions, and on lookup they are a normal miss.
Exit Chains
And this is how the prefetch gets ahead of the instruction stream: From any EBB's exit table entry, the address of the next exit can be constructed. This is used to construct the keys for the next probable exits and so forth. Until the hardware either doesn't find an entry or comes accross one it already prefetched, which means there is a loop.
Prefetch and Fetch
The prefetcher pulls cache lines it thinks it might need, according to the exit table, into the I$1 and places the prediction entries into the fast and fully associative prediction cache. This cannot fault or fail.
One level above that, the fetcher pulls cache lines it is certain to need into the I$0, which is the micro caches the decoders directly access, and it also gives the decoder the next prediction, the next predicted EBB entry point, so it knows where to go next.
Prediction
This might, of course, still turn out to be wrong in actual execution. There just was a mispredict. The delay from execution to decode is 3 cycles (the 3 phases, branches being in the writer phase), the delay from the decoder to the prediction cache (which was preseeded by the prefetcher), is 2 more cycles, for a 5 cycle penalty for a mispredict.
Although the prediction cache is kept big enough for most inner loops, it of course still might not contain the correct entry in exceedingly rare cases. The prefetch delay from exit table to prediction cache is another 3 cycles.
On a mispredict, the fetcher also updates the exit table with the new correct prediction. This can use any history based prediction algorithm, and probably is member dependend for the expected workloads. This means the exit table represents the history of program flow. It actively keeps track of it, and this history information also contributes to the lookup keys for the exit table.
Cold Code
The exit table is seeded from a saved exit table from a dedicated load module section. This predefined exit table is intially produced by the compiler, which can perfectly create all unconditional brances and often has a pretty good idea of conditional branches as well. This can be improved upon with profilers. And most conveniently, the hardware provides facilities to dump the hardware exit table to memory. The system runtime can utilize this to update the exit table section in the load modules with real world usage data. Properly automated this can improve future program performance every time you run your program. And manually, in the hands of a skilled developer this can tickle out the last drop of performance.
The preseeding of the exit table happens in batches, to actually gain performance from prediction and not do memory lookups every time you get stuck. Whenever a call target isn't in the exit table, all exit table entry predicictions connected to that call are loaded at once from the load module. That way you don't inch through the code with a load on every branch. You have a prediciton load and a code load on the first few calls (often constructors and main()), and after that you are up and running almost at normal speed with the actual runtime updates getting you the last few percent.