Prediction

The main task of prediction is to know, in the face of branches, what code to prefetch in the instruction stream in order to avoid stalls. The Mill features precise prediction that knows how much to prefetch from where, usually several branches ahead.

Prefetch Overview

Exit Table

The data structure that enables these unique features is the exit table. It doesn't monitor whether each branch is taken or not; it monitors at which branch control flow leaves an EBB. This drastically reduces the number of branches that need to be tracked, and the aggressive Speculation possible on the Mill eliminates even more branches.
For this reason it becomes feasible to carry a lot more information for each exit than would be possible for every branch. An exit entry holds the following (a sketch in C follows the list):

  • an offset to the target EBB entry address
  • how many cache lines to load in your own EBB
  • at which branch in the last cache line the exit occurs; this is not a real address, but serves as an identifier to construct the hashes for the exit table, together with the cache line counts
  • what kind of branch it is, a call, a return, an unconditional branch etc.
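
The exact field names and widths of an entry are not public; the following C sketch is purely illustrative of what such an entry might hold:

    #include <stdint.h>

    /* Hypothetical layout of one exit table entry; field names and
     * widths are illustrative, not the actual Mill encoding. */
    typedef enum {
        EXIT_JUMP,          /* unconditional branch */
        EXIT_CALL,          /* call                 */
        EXIT_RETURN,        /* return               */
        EXIT_CONDITIONAL    /* conditional branch   */
    } exit_kind_t;

    typedef struct {
        int32_t     target_offset;  /* offset to the target EBB entry address     */
        uint8_t     line_count;     /* cache lines to load in the current EBB     */
        uint8_t     exit_in_line;   /* which branch in the last line exits; an    */
                                    /* identifier for hashing, not a real address */
        exit_kind_t kind;           /* call, return, unconditional branch, etc.   */
    } exit_entry_t;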

The exit table must be fast, but it doesn't need 100% coverage. A missing entry just means a stall, the same as a mispredict. It is a cost/benefit trade-off. For this reason the exit table is organized as a hash table of carefully chosen size.
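
To illustrate that trade-off, a direct-mapped table that simply tolerates misses might look like the following sketch; the table size, tagging, and indexing scheme are all assumptions, and exit_entry_t is reused from the sketch above:

    #include <stdbool.h>
    #include <stdint.h>

    #define EXIT_TABLE_SIZE 4096   /* assumed size, chosen per cost/benefit */

    typedef struct {
        uint64_t     tag;     /* full key, to detect hash collisions */
        exit_entry_t entry;
        bool         valid;
    } exit_slot_t;

    static exit_slot_t exit_table[EXIT_TABLE_SIZE];

    /* Look up an exit by key. A miss is not an error: the pipeline
     * simply stalls, the same as it would on a mispredict. */
    static bool exit_lookup(uint64_t key, exit_entry_t *out)
    {
        exit_slot_t *slot = &exit_table[key % EXIT_TABLE_SIZE];
        if (!slot->valid || slot->tag != key)
            return false;          /* miss: costs a stall, nothing more */
        *out = slot->entry;
        return true;
    }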

Exit Chains

And this is how the prefetch gets ahead of the instruction stream: from any exit table entry, the address of the next exit can be constructed. This is used to get the next exit, and so forth, until the hardware either doesn't find an entry or comes across one it has already prefetched, which means there is a loop.
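
In pseudo-C, the chain walk could be sketched as follows; exit_lookup comes from the sketch above, while already_prefetched, prefetch_lines, and next_key are hypothetical stand-ins for whatever the hardware actually does:

    /* Hypothetical helpers standing in for hardware behavior: */
    bool     already_prefetched(uint64_t key);              /* entry seen already?   */
    void     prefetch_lines(const exit_entry_t *e);         /* pull lines toward I$1 */
    uint64_t next_key(uint64_t key, const exit_entry_t *e); /* key of the next exit  */

    /* Walk the exit chain until the table misses or we revisit an
     * entry already prefetched, i.e. the chain has closed a loop. */
    static void prefetch_chain(uint64_t key)
    {
        exit_entry_t e;
        while (exit_lookup(key, &e)) {
            if (already_prefetched(key))
                break;                  /* loop detected: stop here      */
            prefetch_lines(&e);
            key = next_key(key, &e);    /* construct the next exit's key */
        }
    }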

Prefetch and Fetch

The prefetcher pulls cache lines it thinks it might need, according to the exit table, into the I$1 and places the prediction entries into the fast, fully associative prediction cache. This cannot fault or fail. One level above that, the fetcher pulls cache lines it is certain to need into the I$0, the micro caches the decoders access directly, and it also gives the decoder the next prediction, i.e. the next predicted EBB entry point, so it knows where to go next.
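
Schematically, and with every helper name assumed, the two levels divide the work like this:

    /* Hypothetical helpers for the two cache levels: */
    void pull_lines_into_i1(const exit_entry_t *e);
    void prediction_cache_insert(uint64_t key, const exit_entry_t *e);
    void pull_lines_into_i0(const exit_entry_t *e);
    void hand_decoder_next_entry(const exit_entry_t *e);

    /* The prefetcher is speculative and must never fault; a miss in
     * the exit table simply ends the attempt. */
    void prefetch_step(uint64_t key)
    {
        exit_entry_t e;
        if (!exit_lookup(key, &e))
            return;                        /* silent: prefetch cannot fail */
        pull_lines_into_i1(&e);            /* speculative lines into I$1   */
        prediction_cache_insert(key, &e);  /* seed the prediction cache    */
    }

    /* The fetcher works one level up, on lines it is certain to need. */
    void fetch_step(const exit_entry_t *taken)
    {
        pull_lines_into_i0(taken);         /* lines into the I$0 micro caches */
        hand_decoder_next_entry(taken);    /* next predicted EBB entry point  */
    }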

Prediction

This might, of course, still turn out to be wrong in actual execution: a mispredict. The delay from execution back to decode is 3 cycles (the 3 phases, branches being in the writer phase), and the delay from the decoder to the prediction cache (which was pre-seeded by the prefetcher) is 2 more cycles, for a 5 cycle penalty on a mispredict.

Although the prediction cache is kept big enough for most inner loops, in exceedingly rare cases it still might not contain the correct entry. The prefetch delay from the exit table to the prediction cache is another 3 cycles.
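
Reading those figures together, and assuming the delays simply add up, which the text implies but does not state outright:

    /* Cycle counts as given in the text above. */
    enum {
        EXEC_TO_DECODE           = 3, /* the 3 phases; branches are in the writer phase */
        DECODE_TO_PRED_CACHE     = 2, /* decoder back to the prediction cache           */
        EXIT_TABLE_TO_PRED_CACHE = 3, /* refill of the prediction cache from exit table */

        MISPREDICT_PENALTY      = EXEC_TO_DECODE + DECODE_TO_PRED_CACHE,       /* 5 cycles */
        PRED_CACHE_MISS_PENALTY = MISPREDICT_PENALTY + EXIT_TABLE_TO_PRED_CACHE /* 8, if additive */
    };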

On a mispredict, the fetcher also updates the exit table with the new, correct prediction. This can use any history-based prediction algorithm, and is probably member-dependent, tuned for the expected workloads. This means the exit table represents the history of program flow: it actively keeps track of it, and this history information also contributes to the lookup keys for the exit table.
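
The actual update algorithm is unspecified and member-dependent; purely as a sketch, an update keyed by a hashed flow history might look like this, reusing the table from the earlier sketch:

    static uint64_t history;  /* running hash of recent control flow */

    static uint64_t make_key(uint64_t ebb_entry)
    {
        /* The mixing constant and scheme are assumptions. */
        return ebb_entry ^ (history * 0x9E3779B97F4A7C15ull);
    }

    static void on_mispredict(uint64_t ebb_entry, const exit_entry_t *actual)
    {
        uint64_t key = make_key(ebb_entry);
        exit_slot_t *slot = &exit_table[key % EXIT_TABLE_SIZE];
        slot->tag   = key;        /* record the corrected prediction */
        slot->entry = *actual;
        slot->valid = true;
        /* Fold the taken exit into the history, so future keys reflect
         * the path taken to reach an EBB, not just its address. */
        history = (history << 1) ^ (uint64_t)actual->exit_in_line;
    }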

Cold Code

Media

Presentation on Prediction by Ivan Godard: http://www.youtube.com/watch?v=e9Tb2ZB3nN4 - Slides: http://millcomputing.com/blog/wp-content/uploads/2013/12/2013-11-12_prediction.02.pptx