The main task of prediction is to know what code to prefetch in the instruction stream in the face of branches to avoid stalls. The Mill features precise prediction, that knows how much to prefetch from where, usually several branches ahead.
The data structure that enables those unique features is the exit table. It doesn't monitor whether to take each branch or not. It monitors at which branch control flow leaves an EBB. This drastically reduces the amount of branches that need to be tracked. And the aggressive Speculation possible on the Mill eliminates even more branches.
For this reason it becomes feasible to carry a lot more information for each exit, which wouldn't be possible for every branch. An exit entry holds:
- an offset to the target EBB entry address
- how many cache lines to load in the user's EBB
- the instruction in the last cache line the branch is in, so the decoder knows how many instructions to issue
- what kind of branch it is: a call, a return, an unconditional branch etc.
The exit table must be fast, but it doesn't need 100% coverage. A missing entry just means a stall, the same as a mispredict. It is a cost/benefit trade-off. For this reason the exit table is organized as a hash table of carefully chosen size.
Not all branches even need to be in the table. Rarely used error code crowding out often used entries is not ideal. And since it is a hash table, newer entries can overwrite older entries on hash collisions. The hardware tracks hash collisions, and on lookup they are a normal miss.
This is how the prefetch gets ahead of the instruction stream: From any EBB's exit table entry, the address of the next exit can be constructed. This is used to construct the keys for the next probable exits and so forth. Which continues until the hardware either doesn't find an entry or comes across one it already prefetched, meaning there is a loop.
Each EBB can have multiple entries. The reason for this is calls. Calls return to the same place they are called from, so they are not really a diversion from the control flow even if they are conditional. They end up in the same place and are still part of the same EBB. But subroutines do represent an own EBB that needs to be prefetched.
Prefetch and Fetch
The prefetcher pulls cache lines it thinks it might need, according to the exit table, into the I$1 and places the prediction entries into the fast and fully associative prediction cache. This cannot fault or fail.
One level above that, the fetcher pulls cache lines it is certain to need into the I$0, which is the micro cache the decoders directly access. It also gives the decoder the next prediction, the next predicted EBB entry point, so it knows where to go next.
This might, of course, still turn out to be wrong in actual execution. There was just a mispredict. The delay from execution to decode is 3 cycles (the 3 phases, branches being in the writer phase). The delay from the decoder to the prediction cache (which was pre-seeded by the prefetcher), is 2 more cycles, for a 5 cycle penalty for a mispredict.
Although the prediction cache is kept big enough for most inner loops, it still might not contain the correct entry in exceedingly rare cases. The prefetch delay from exit table to prediction cache is another 3 cycles.
On a mispredict, the fetcher also updates the exit table with the new correct prediction. This can use any history based prediction algorithm, and probably is member dependent for the expected workloads. This means the exit table represents the history of program flow. It actively keeps track of it, and this history information also contributes to the lookup keys for the exit table.
The exit table is seeded from a saved exit table from a dedicated load module section. This predefined exit table is initially produced by the compiler, which can perfectly create all unconditional branches and often has a pretty good idea of conditional branches as well. This can be improved upon with profilers. And most conveniently, the hardware provides facilities to dump the hardware exit table to memory. The system runtime can utilize this to update the exit table section in the load modules with real world usage data. Properly automated, this can improve future program performance every time a user's program is run. And manually, in the hands of a skilled developer this can tickle out the last drop of performance.
The pre-seeding of the exit table happens in batches, to actually gain performance from prediction and not do memory lookups every time a user gets stuck. Whenever a call target isn't in the exit table, all exit table entry predictions connected to that call are loaded at once from the load module. That way a user doesn't inch through the code with a load on every branch. There is a prediction load and a code load on the first few calls (often constructors and main()), and after that it is up and running almost at normal speed with the run time history updates getting the last few percent.