Reposted from comp.arch with permission, as likely of general interest:
On 6/28/2017 1:47 PM, Stephen Fuld wrote:
The discussion in another thread about branch prediction versus exit prediction got me thinking, and that led me to realize that I didn’t fully understand Mill exit prediction. So let me ask some questions as a means of increasing my understanding.
What exactly are you keeping? The address of the instruction that is predicted to cause the exit? The address of the EBB that is predicted to be executed next? The number of cycles until the exit? I can see advantages for each, and perhaps even multiple things.
The latter two; see below.
I presume the “index” for the predictor is the starting address of the EBB. Is that correct?
More likely the entry index into the SRAM prediction table. The entry address of the executing ebb is in a specReg (whimsically called “entryReg”). That plus the target offset from the prediction gives the entry address you will exit to. Rinse, repeat.
If the EBB contains a (perhaps) conditional subroutine call, is that temporary exit from the EBB handled by the exit predictor? I see problems with both yes and no answers. Or is there some other mechanism to handle this case?
Yes, every transfer of control involves a prediction. Eventually there will be a return prediction which takes the target from a conventional return-address stack.
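A conventional return-address stack can be sketched as below; this is a generic illustration of the mechanism mentioned above, with hypothetical names and a made-up depth, not Mill hardware:

```python
class ReturnPredictor:
    """Sketch of a conventional return-address stack (RAS); the class name,
    depth, and method names are invented for illustration."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        # Push the address the matching return should transfer to.
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: drop the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        # Predict the most recent unreturned call site; empty stack = no prediction.
        return self.stack.pop() if self.stack else None
```

Nested calls pop in last-in-first-out order, which is why a stack predicts returns so well in practice.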
What about conditional branches within the EBB? Are they not predicted at all?
There is no such thing on a Mill. You can branch to the beginning of yourself (one-ebb loops are *very* common on the Mill), but you can’t branch into the middle of any ebb including yourself.
Thanks in advance for helping my understanding.
This is more fully explained in http://millcomputing.com/technology/docs/#prediction, but here’s an overview:
Each logical entry contains some prediction history state, the details depending on the prediction algorithm(s) used; those are not program-visible and may vary by member. Possible keys include the entry address of the ebb that is going to exit; some history of such addresses; some perceptron or TAGE logic derived from such addresses; and so on. In general the Mill can use the same algorithms as any other ISA, but there are fewer entries (one per ebb rather than one per branch), each with more prediction state (as described below, versus one taken/untaken bit). However it is done, the prediction logic must produce a prediction-table entry index and from that a prediction.
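As one purely illustrative possibility, a gshare-style hash of the exiting ebb's entry address with a path-history register could produce the table index; the function name, field widths, and shift amounts here are invented, and the real algorithm is member-dependent:

```python
def table_index(entry_address, path_history, index_bits=10):
    """Illustrative gshare-style index hash: fold the ebb entry address with
    a history register of recent entry addresses. All widths are invented."""
    folded = (entry_address >> 4) ^ path_history  # drop low alignment bits, mix in history
    return folded & ((1 << index_bits) - 1)       # wrap to the table size
```

The same entry address with a different path history lands on a different entry, which is what lets history-based predictors separate the behaviors of one ebb reached along different paths.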
Each prediction contains two critical pieces, plus assorted niceties. One is the target, conceptually the address of the next ebb to execute. In practice it is the offset from the entry of the executing ebb, because all Mill branch-like things are entry-relative. As a result we don’t have to have a 60-bit address for each. In effect entry offsets form a linked list in the prediction table, so the prefetch logic can run arbitrarily ahead of fetch and execution. Other than representation, the target is similar to what is in a BTB.
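The linked-list behavior can be sketched as follows; `prediction_table` and its layout are hypothetical, with each value an entry-relative byte offset as described above:

```python
def prefetch_chain(prediction_table, entry, depth):
    """Walk the predicted chain of ebb entry addresses, as prefetch logic
    might, arbitrarily far ahead of fetch and execution."""
    chain = [entry]
    for _ in range(depth):
        entry += prediction_table[entry]  # entry-relative target offset
        chain.append(entry)
    return chain

# Three ebbs whose predicted exits form a cycle (e.g. a small loop nest):
table = {0x1000: 0x40, 0x1040: 0x20, 0x1060: -0x60}
```

Here `prefetch_chain(table, 0x1000, 3)` yields `[0x1000, 0x1040, 0x1060, 0x1000]`: each lookup feeds the next, so the prefetcher never needs to see the code itself to keep running ahead.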
The other crucial part of a prediction is a duration: the number of cycles between when the ebb is entered and when it exits. Duration is statically known; Mill is a statically scheduled exposed pipeline machine. If the duration in the prediction is three (for example) then the fetch and decode logic knows to fetch and decode exactly three (very wide) instructions before starting to fetch from the next ebb in the target chain.
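A toy model of that handoff, assuming a hypothetical `predictions` map from entry address to a (duration-in-instructions, target-offset) pair; real durations are encoded differently, as described below:

```python
def fetch_stream(predictions, start_entry, total):
    """Return the (ebb_entry, instruction_index) pairs a fetcher would
    produce, switching ebbs after each predicted duration. The table
    layout is invented for illustration."""
    entry = start_entry
    out = []
    while len(out) < total:
        duration, offset = predictions[entry]
        for i in range(duration):
            out.append((entry, i))
            if len(out) == total:
                break
        entry += offset  # hand off to the next ebb in the target chain
    return out
```

With `predictions = {0x100: (3, 0x30), 0x130: (2, -0x30)}`, fetch issues three instructions from the first ebb, two from the second, and then (per the second ebb's prediction) returns to the first, with no gap in the stream.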
Some of the niceties: rather than the duration being in cycles from the beginning of the ebb, the duration is in cache lines plus instructions within the last cache line. This both tells the prefetcher how many lines to munge, and also reduces the number of bits needed for the duration of large ebbs. Duration can overflow any fixed bit assignment of course; we have not yet determined whether injecting a dummy branch or simply permitting a mispredict at the end of large ebbs is better. Currently our working sims use a dedicated set of logN representations for long durations in lines. This will always mispredict (5 cycles) and may be over-eager in fetching from DRAM, but so far it looks like over-eager is better than under-eager. Pending measurement.
Another nicety: Mill code can have several (possibly conditional) calls in a single instruction; taken calls are executed in slot order and each can see the results of its predecessors. Each unconditional or taken call will have its own prediction, so an instruction-count duration is ambiguous if there is more than one. Consequently the duration is not only cache-lines+instructions but cache-lines+instructions+taken-calls.
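The three-component duration can be pictured as a packed field; the field widths and names below are invented for illustration and are not Mill's actual encoding:

```python
from collections import namedtuple

# Illustrative packing of the three duration components into one field.
Duration = namedtuple("Duration", "cache_lines instructions taken_calls")

LINE_BITS, INSTR_BITS, CALL_BITS = 5, 3, 2  # invented widths

def pack_duration(d):
    # Overflow of any component must be handled some other way
    # (e.g. a dummy branch, or accepting a mispredict).
    assert d.cache_lines < (1 << LINE_BITS)
    assert d.instructions < (1 << INSTR_BITS)
    assert d.taken_calls < (1 << CALL_BITS)
    return ((d.cache_lines << (INSTR_BITS + CALL_BITS))
            | (d.instructions << CALL_BITS)
            | d.taken_calls)

def unpack_duration(word):
    return Duration(word >> (INSTR_BITS + CALL_BITS),
                    (word >> CALL_BITS) & ((1 << INSTR_BITS) - 1),
                    word & ((1 << CALL_BITS) - 1))
```

The taken-calls count resolves the ambiguity: two ebbs can decode the same number of instructions yet exit at different points if one of those instructions contains a taken call.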
There is a pretty conventional return-predictor; a table entry for a return uses a dedicated target value to indicate a return.
There is a transfer-kind code in the entry. This is used both to distinguish returns and to let the transfer hardware get a head start on the side-effects of the transfer: calls create stack and spiller frames and returns unwind them, the leave op discards in-flights, etc. Mill is very CISC, in the sense that single ops do a lot, not in the VAX-like sense in which ops are composed out of subops that must be parsed and interpreted in the context of the rest of the instruction; unlike x86 there is at most one memory access per operation. There’s good and bad in that. On the one hand a call (say) is conceptually simple, encodes densely, and is free of special cases. On the other hand the ISA gives you exactly one way to make a call, and you can’t tailor your own call protocol.
A great many loops on a Mill are one or only a few instructions in one or a few ebbs. As in many architectures there is a loop buffer on a Mill, or rather two of them. One of them holds the hardware signals coming from the decoder; the decoder is a 3-cycle process so we buffer up to three signal sets, and a special prediction tells the decode-and-issue to feed from the signal buffers rather than decoding around the loop. This doesn’t speed anything up (we issue every non-stall cycle regardless), but it does save power. The second loop buffer is the i$0 microcache that holds some small (per member) number of cache lines fully associative. If the lines of a loop, whether one or several ebbs, fit in i$0 then we avoid banging on the main instruction cache, again saving power.
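The i$0 behavior can be sketched as a tiny fully associative cache; the class name, size, and LRU replacement policy here are assumptions for illustration, not the member-specific hardware:

```python
from collections import OrderedDict

class MicroCache:
    """Sketch of a small fully associative instruction microcache with
    LRU replacement; names and the line count are invented."""

    def __init__(self, nlines=4):
        self.nlines = nlines
        self.lines = OrderedDict()  # line address -> line data, LRU order

    def fetch(self, addr, backing):
        """Return (line_data, hit). A hit avoids touching `backing`
        (the main instruction cache), which is where the power saving is."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # refresh LRU position
            return self.lines[addr], True
        data = backing[addr]              # miss: go to the main i$
        self.lines[addr] = data
        if len(self.lines) > self.nlines:
            self.lines.popitem(last=False)  # evict least recently used line
        return data, False
```

A loop whose lines all fit pays for the main-cache accesses once, on the first iteration; every later iteration hits in the microcache.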