- staffKeymasterDecember 29, 2013 at 8:34 pmPost count: 46
Talk by Ivan Godard – 2013-11-12 at
IEEE CS Santa Clara Valley
Slides: PowerPoint (.pptx)
Run-ahead transfer prediction in the Mill CPU architecture
Programs frequently execute only a handful of operations between transfers of control: branches, calls, and returns. Yet modern wide-issue VLIW and superscalar CPUs can issue similar handfuls of operations every cycle, so the hardware must be able to change to a new point of execution each cycle if performance is not to suffer from stalls. Changing the point of execution requires determining the new execution address, fetching instructions at that address from the memory hierarchy, decoding the instructions, and issuing them—steps that can take tens of cycles on modern out-of-order machines. Without hardware help, a machine could take 20 cycles to transfer, for just one cycle of actual work.
The branch predictor hardware in conventional out-of-order processors does help a lot. It attempts to predict the taken vs. untaken state of conditional branches based on historical behavior of the same branch in earlier executions. Modern predictors achieve 95% accuracy, and large instruction-decode windows can hide top-level cache latency. Together these effects are sufficient for programs like benchmarks that are regular and small. However, on real-world problems today’s CPUs can spend a third or more of their cycles stalled for instructions.
The Mill uses a novel prediction mechanism to avoid these problems; it predicts transfers rather than branches. It can do so for all code, including code that has not yet ever been executed, running well ahead of execution so as to mask all cache latency and most memory latency. It needs no area- and power-hungry instruction window, using instead a very short decode pipeline and direct in-order issue and execution. It can use all present and future prediction algorithms, with the same accuracy as any other processor. On those occasions in which prediction is in error, the mispredict penalty is four cycles, a quarter that of superscalar designs. As a result, code stall is a rarity on a Mill, even on large programs with irregular control flow.
The talk describes the prediction mechanism of the Mill and compares it with the conventional approach.
- SymmetryParticipantFebruary 1, 2014 at 8:27 amPost count: 28
I was re-watching the prediction video this morning. It seems like there are some serious advantages to the EBB chaining prediction approach, but it also seems like one drawback of it is that you completely lose the ability to make predictions about secondary or tertiary exits from an EBB, such as you might find in a loop within a loop. It seems like this makes the mill better at branchy or unstructred code at the expense of making it worse at the sort of code DSPs would typically excel at. Am I missing something?
- Ivan GodardKeymasterFebruary 1, 2014 at 2:18 pmPost count: 506
Unfortunately, to fit in time and mind-boggle limits the talks have to gloss over quite a few important details. 🙁
It’s been mentioned that Mill EBBs can contain embedded calls (not permitted in the usual definition of EBB). That means that an EBB must have multiple predictions if it contains calls: one from entry to the first taken call (calls can be conditional), one from the return of that call to the next taken call, and so on until one from the last taken call to the taken return (returns can also be conditional). Each of these predictions needs a key for lookup in the prediction table. The first key is obviously the address of the EBB entry point; what should the rest be?
It’s tempting to use the “from” address of the taken call (where the call will return to) as the key for the next prediction. However that doesn’t work. To begin with, there are two “from” addresses due to split-stream encoding, and the exu address may be ambiguous due to no-op elision, but both these could be worked around. Unworkable is that the return address is not itself predicted from the information in the prediction: the pred contains a line count and an instruction count, but not a byte count. Without a byte count, we could not know from one prediction what the key of the next would be (if the key is the byte address), which means that we cannot do prediction chaining with the code not yet in memory, which is the goal of the whole exercise.
Could we use a byte count in the preds instead of line/cycle counts? No, because the decoder needs to stage lines into the instruction shifter on a cycle-by-cycle basis. We don’t know how bytes translate into cycles without looking at the instruction headers, and by the time we have figured out from the remaining-byte count that we have done the last instruction it’s too late to get the next line into the shifter. And adding a byte count as well as line/inst counts would blow up the pred entry too much.
The solution is to use the entry address for the initial key, and for each later key to decrement the entry address by two. The initial key is known from the prediction that enters the EBB, and each subsequent key is known just by arithmetic on the entry key; consequently chaining is possible even with embedded calls (including multiple calls within a single instruction, which is possible on a Mill – more on that in the Execution talk on Wednesday).
I mentioned above that there are conditional calls, which need keys. If (say) we enter the EBB at 0x12345 and the first call is conditional and predicted taken then the next key (for use after the return) will be 0x12343, two less than the entry. If however the first call is predicted not taken and we expect to execute until the next call, then the return from that (and the next key to use in the chain) is 0x12341, i.e. the entry decremented by two, twice. When chaining, how do we know which next key to use? The answer comes from realizing that we need to decrement the key not just for taken calls but for untaken ones as well. To do that, each pred also contains a small field for untakenCallCount, the number of untaken calls stepped over by the prediction. With that, we can decrement the right number of times and get the correct next key for chaining.
There’s a potential ambiguity in using decremented entry addresses as subsequent keys: what if the decremented address was decremented enough to move the key pseudo-address into the adjacent EBB’s code and collide with that EBB’s entry address that it uses for its own prediction? Can’t happen: a decrement is moving in the flow-code direction from the entry point (increment would move in the exu direction). The call operation is a flow-side operation, and all possible call encodings occupy at least two bytes. Consequently, a two-byte decrement will necessarily still lie in the flow side of the EBB because the call that requires another key must occupy enough bytes to keep the pseudo-key within the memory range of the EBB. Increment would not work, because EBBs with no exu side are possible and enough increments could move the key into ambiguity.
Lastly: why decrement by two; why not by one? Answer: each prediction (the “main” pred) may have an alternate pred that has the same key minus one. If we mispredict, we probe for the alternate key. There may not be an alternate; by definition alternates are used less than main predictions, and the alternate may never have been taken (and hence added to the table) or may have aged out, but if the transfer is at all loosey-goosey then there is a decent chance of an alternate. We chain normally from the alternate, and get prefetch etc started on that chain.
Alternates deal with the case in which a program usually stays on the main line, but sometimes goes off into something complicated. Once we know it’s been rerouted, we want to be able to get prediction running ahead of it as quickly as possible, without additional misses. We already had a miss; without the alternate prediction, there would be another miss as soon as we got to the end of wherever the first miss took us. With the alternate, we avoid that second miss, and in addition have gotten prefetch and fetch working early.
Obviously this alternate mechanism applies to loops, and in fact it was first developed as support for exactly the problem you identify. Since then we have added more loop support (NYF), but have kept the alternate machinery because it is valuable even outside the loop context.
Now you see why I didn’t try to include the details in the prediction talk 🙂
Speaking of which, anybody want to volunteer to write a white paper on the full prediction mechanism? You would get some glory (name on paper, will get published on our site and in Computer Architecture News) and some sweat-equity in the Mill/OOTBC.
- Will_EdwardsModeratorFebruary 1, 2014 at 2:24 pmPost count: 98
The Mill has some mitigations, perhaps? It has an extremely small mispredict penalty (5 cycles) if the taken path is in the instruction cache. It can execute up to 6 dependent instructions in a single cycle. It also makes classic VLIW definitions of Very seem exaggerated 😉
What is predicted is very novel to the Mill, but the how is normal. There are predictors that try to predict the number of iterations and so on; these are implementation choices and models may differ.
- bakingParticipantFebruary 16, 2014 at 5:03 pmPost count: 7
This is obviously one of those details that had to be glossed over in the talk, but the line and instruction counts in the exit table can only refer to one stream. Are there two pairs of counts, one for each stream? Or have you worked out some trick to get by with just one.
- SymmetryParticipantMarch 26, 2014 at 3:41 pmPost count: 28
I seem to recall that you mentioned you loaded new predictions on a function call in the talk, and that seemed rather frequent to me. Now that the security talk is out, are they actually loaded on portal calls?
- Ivan GodardKeymasterMarch 26, 2014 at 4:29 pmPost count: 506
Predictions load only if we have no prediction for the call, and the loading is in parallel with the miss processing, so it will necessarily result in faster execution than simply missing and building up experience without loading.
The prediction table holds the target address of a prediction as an offset from the entry address of the caller EBB (the same addressing that branch/call uses). The offsets are within a single load module, which has bounded size much smaller than the total address space; that way we need only 20-30 bits per address in the table, rather than 60 bits for an arbitrary address.
In contrast, portals may be set up anywhere in the global address space, and so would need more bits in the prediction. We expect portal calls to be rare compared to ordinary calls, so we economize on table space and wires and don’t try to predict portal calls.
- PeterHParticipantMarch 27, 2014 at 6:54 pmPost count: 38
So while a function call, aided by prediction, will usually take a single cycle, a portal call will need at least as long as 2 fetches from L1 code cache, one load for the portal itself and a second for the destination code?
- Ivan GodardKeymasterMarch 27, 2014 at 7:20 pmPost count: 506
Yes. The load of the info block (in case of a callback) overlaps with the target code fetch, so there are only two fetches on the latency path: the portal itself, and the code miss. If you work through the detailed timing, this is the same as is needed for a trampoline (thunk) calling a dynamically linked (and local in the same protection environment as the caller) function that did not use a portal.
Consequently we expect that portals with wildcard ids (which use the caller’s ids) will be used for local dynamic linking as well as (with explicit ids) for actual protection transit; there’s no need for two mechanisms.
It is unfortunate that the predictor cannot predict far transits, but permitting it to do so would nearly double the size of the prediction tables for the same capacity. Measurement may show that far transits are common enough to justify the extra bits, in which case we will change the prediction formats and make portal calls be predictable, but our best guess for now says that far transits are rare enough compared to near transits that it is better to save the bits.
- castenParticipantMay 10, 2014 at 10:03 pmPost count: 4
It was mentioned in thew video that when there is a cold load of a program there is no history to use for prediction, so branch operations just guess. Might it be possible to prime the history with an annotation in source code? I suppose this could be an innate field in a branch instruction, or a separate instruction to to inform the predictor similar to __builtin_expect. The difference from __builtin_expect being, the value would only inform a cold predictor. It seems like this might be a large benefit for initial program performance; i.e., one of the larger factors in users’ subjective performance perception.
- Ivan GodardKeymasterMay 11, 2014 at 5:01 amPost count: 506
The Mill doesn’t put “expects” in the code, because the entropy cost of the field in the encoding then would have to be paid on every use of such instructions. Instead “expects” go in the loadable prediction table in the load module. The effect as far as initial prediction quality is the same, but the predictor table is promptly overwritten by actual experience history.
Also, “expects” in the code only can supply taken/not taken information, which arrives much too late to be useful in guiding decode. Instead the Mill predictor carries “where to next” information so that it can run-ahead of fetch and thereby avoid code fetch stalls.
Will is right that anything the program can do to help guide the process is useful and will be incorporated in the program for better prediction. The difference is in how that added information is represented, and where.
- castenParticipantMay 20, 2014 at 12:27 amPost count: 4
I didn’t mean to explicitly ask about __builtin_expect. I only meant to reference it as something with a similar function. Articles on the net say programmers are more likely to mispredict frequencies unless they use the proper tools in specific circumstances.
My intent was to suggest using a similar annotation or instruction to inform a cold predictor. Being in a base state, a programmer can probably have a reasonable idea about the likely code path. And then once the predictor is warmed up, the annotation should be ignored.
Of course support for __builtin_expect is nice. But I think that something like __builtin_cold_expect would be generally more practical. Once things are primed, I wouldn’t want to second guess predictors that do 90+%.
As far as entropy cost, it is understable that an additional instruction might be too costly. What about an extra, optional or unused bit on a branch instruction?
Another related question. When speculating on a branch, might there any added benefit to knowing the probability within a granularity greater than 50% accuracy? It seems like if the decision were not simple and a weight could be applied, e.g. one branch had a more expensive subsequent instruction than the alternative, this too might be useful. But I’m probably speculating too much myself. I’ll stop this line of questioning as I give it a probability of .05 of being potentially useful.
- This reply was modified 4 years, 3 months ago by casten.
- Ivan GodardKeymasterMay 20, 2014 at 12:53 amPost count: 506
Rather than manual annotation, why not just run the program a few times and let it train?
Modifying the binary, such as by adding bits to the branch ops, hits the problem that the decoders cannot know they have a branch until after they have already committed to taking or not taking the transfer. This is a consequence of the pipelining of the decode logic; we have to know to issue the line fetch at least a cache-latency ahead of getting the line, and it’s a couple of cycles more until decode has figured out that a transfer is to be executed. There are two possible resolutions of this: predictive fetch, or retarded execution. We use the former, and have generalized predictive fetch to achieve arbitrary run-ahead. Retarded execution is used by OOO machines, by delaying operations that depend on control flow until the control flow is resolved, using the OOO hardware to keep the execute units busy once the pipe has started up.
The two approaches can approximate each other in the steady state, but predictive fetch has much less startup cost than retarded, so we have a five cycle mispredict penalty rather than the 15 or so typical of retarded OOO. Predictive does require more table state than retarded for equivalent accuracy, but the Mill avoids that issue through our trainable prediction, which in effect gives us arbitrarily large tables in fixed and small hardware.
There may be a third approach to deal with pipelined decoder timing beyond predictive and retarded, but I don’t know of one.
- DavidParticipantOctober 16, 2014 at 2:04 pmPost count: 32
Something that was not addressed in this talk is virtual and indirect calls/jumps. Predicting these has significant effect not just with dynamic languages but as low-level as C++.
But often the type varies for a given call site. Unlike branch prediction, it’s not limited to a binary decision between two potential addresses. Opportunity to prefetch the next EBB can come from hoisting the load of the dispatch address, which can go pretty far in functions which perform processing themselves before the dispatch call. However, the software would need to be able to give this address to the exit chaining mechanism in order to gain value from this prefetch.
There are also concepts of history tables linked to object identity instead of call site address. With these, multiple call sites could benefit from a single knowledge update, but I don’t think they’re as appropriate for general purpose CPUs as they’re usually specific to object system ABIs.
- Ivan GodardKeymasterOctober 16, 2014 at 6:08 pmPost count: 506
When the prediction facility was being designed, we decided that any unstable transfer point such as method calls (what Cliff Click calls “megamorphic”) would be of interest only if control passed through it frequently enough to keep the various target EBBs in the L2. Consequently we did not concern ourselves with issues of DRAM latency, and were primarily concerned with the miss-predict penalty (yes, the Mill has a penalty that is a third that of other architectures, but a stall is still a stall) and the L1/L2 latencies.
Next, we observed that the codes we looked at, and in our personal experience, megamorphic transfers tended to not be calls on leaf function. Instead, a megamorphic transfer usually led to a sequence of transfers that were not megamorphic, until control returned back through the megamorphic point. This reflects a program model in which a test point figures out the type of a datum, and then does stuff with the known type. Mill run-ahead prediction works well with this model, because once through the megamorph the subsequent control flow will predict well and will be prefetched ahead of need. Incidentally, the original source may not have been written to this model, but the compiler’s value-set analysis and inlining will tend to remove the type-checks internally to the known-type cluster, and even if it does not, the type will in fact be fixed and the transfers not be megamorphic.
The alternative model works less well, where the same object is subjected to a succession of leaf-function calls. Each of the calls will be megamorphic if the object is (and so will miss-predict frequently), and the leaf functions don’t have long enough transfer chains for run-ahead to win much.
This suggests an obvious optimization for the compiler: transform the latter structure into the former. This dimensionality inversion or transpose is the control space analogy to the re-tiling optimizations that are used to improve array traversal in the data space, and for all I know some compilers already do it. It should benefit any prediction mechanism, not just ours.
- DavidParticipantOctober 18, 2014 at 12:59 amPost count: 32
Beyond object method dispatch, lambda passing style would be another source of unstable transfer points. Custom lambdas passed into central utilities are often per-use closures. On the upside, the actual function that will be called is hoisted all the way up to an input parameter, which again would be an ideal for software injecting “upcoming|next EBB” hints into the prefetcher.
For context, how many cycles ahead of an exit would the exit table need to be notified in order to prefetch code from cache to execute without any stalls? I suspect this would vary between family members. If it’s under 10, I would imagine software hints could eliminate stalls for many truly unstable dispatched exits.
I agree that DRAM latency isn’t worth considering in these optimization scenarios. However, if the 5-cycle mispredict penalties are a concern, the fact remains that the absolute correct target for fully dynamic dispatch should be available in the software view far enough ahead of the call in enough situations to be beneficial to the hardware mechanism. The problem is communicating it from software into hardware.
The Mill has software-controlled prefetching of data via speculation, but not software-controlled prefetching of code (that we’ve seen). If the hardware predictor consistently fails for a dispatched usage case, there’s no other way to update or augment its functionality.
Having a compiler decide between whether to generate preemptive dispatch hints vs letting the predictor attempt to compensate would probably best be left to runtime log feedback, and might not be used by all compilers. But not having that option at all seems to me to be missing functionality that manifests in penalizing dynamic code.
(Obviously, hopping through multiple layers of dynamic indirection quickly would likely cause stalls no matter the prefetch mechanism, but most dynamic designs boil down to just one indirection per function call.)
- Will_EdwardsModeratorOctober 18, 2014 at 4:07 amPost count: 98
the actual function that will be called is hoisted all the way up to an input parameter, which again would be an ideal for software injecting “upcoming|next EBB” hints into the prefetcher.
Luckily this hoisting would happen in the specializer (or jit), which can use deferred calls or other hints if the concrete Mill supports it. It is nothing a conventional compiler generating the flow graph need concern itself with.
- gideony2ParticipantJuly 4, 2015 at 10:01 amPost count: 7
In an embedded system, with code in ROM, how can the prediction table get back into the load module?
- LarryPParticipantOctober 15, 2015 at 4:22 amPost count: 76
For some perspective on the economic value of improved prediction, you might enjoy reading about a recent, large ($862 million US) patent-infringement award against Apple. Here’s a news article:
And the underlying patent (US5781752):
To me, this says that:
* Improvements to prediction meaningfully improve performance of real CPUs.
* Even incremental improvements in prediction are worth hundreds of millions of dollars!
It will be most interesting to see how the Mill’s prediction (and anti-false-aliasing deferred load) mechanisms work in practice to improve single-thread performance.
- Ivan GodardKeymasterOctober 16, 2015 at 2:58 pmPost count: 506
The prediction in the patent addresses data prediction in OOO machines, which we are not 🙂 For data access, the Mill does not have a better answer than the patent. Instead, the Mill has changed the question, and the patent’s answer is irrelevant.
As for code prediction, the Mill innovations also change the question: we ask not whether a branch is taken, but where control flow will go in the future. This can be wrapped around any mechanism for recognizing similarity with the past; we don’t have anything novel in the predictor itself. Our initial predictor, for testing purposes, is the dumbest possible two saturating-bit local predictor, known for 30 years. But that predictor could be replaced by a global predictor, or a perceptron, or a … – regardless of the prediction mechanism, the Mill can run fetch ahead without needing to see the code.
Two-bit local predictors run 90-95% accurate (the best modern predictors run 98-99%), so ours will miss every 10 or 20 taken transfers. But that means that the Mill can run 10-20 EBBs ahead, more than enough to hide memory latency. Should be interesting 🙂
- PeterHParticipantOctober 19, 2015 at 10:13 amPost count: 38
Considering how the Mill uses metadata and select statements, 10-20 deep prediction may be a match for 20-40 deep prediction on a more conventional out-of-order superscalar. If that’s enough to effectively hide memory latency, likely better to keep with the simple two-bit predictor leaving more silicon real-estate for other mechanisms.
- Ivan GodardKeymasterJune 28, 2017 at 4:26 pmPost count: 506
Reposted from comp.arch with permission, as likely of general interest:
On 6/28/2017 1:47 PM, Stephen Fuld wrote:
The discussion in another thread about branch prediction versus exit prediction got me thinking, and that led me to realize that I didn’t fully understand Mill exit prediction. So let me ask some questions as a means of increasing my understanding.
What exactly are you keeping? The address of the instruction that is predicted to cause the exit? The address of the EBB that is predicted to be executed next? The number of cycles until the exit? I can see advantages for each, and perhaps even multiple things.
The latter two; see below.
I presume the “index” for the predictor is the starting address of the EBB. Is that correct?
More likely the entry index into the SRAM prediction table.The entry address of the executing ebb is in a specReg (whimsically called “entryReg”). That plus the target offset from the prediction gives the entry address you will exit to. Rinse repeat.
If the EBB contains a (perhaps) conditional subroutine call, is that temporary exit from the EBB handled by the exit predictor? I see problems with both yes and no answers. Or is there some other mechanism to handle this case?
Yes, every transfer of control involves a prediction. Eventually there will be a return prediction which takes the target from a conventional return-address stack.
What about conditional branches within the EBB? Are they not predicted at all?
There is no such thing on a Mill. You can branch to the beginning of yourself (one-ebb loops are *very* common on the Mill), but you can’t branch into the middle of any ebb including yourself.
Thanks in advance for helping my understanding.
This is more fully explained in http://millcomputing.com/technology/docs/#prediction, but here’s an overview:
Each logical entry contains some prediction history state, the details depending on the prediction algorithm(s) used; those are not program-visible and may vary by member. Possible keys include the entry address of the ebb that is going to exit; some history of such addresses; some perceptron or TAGE logic derived from such addresses, and so on. In general the Mill can use the same algorithms as any other ISA, but there are fewer entries (one per ebb rather than one per branch) but each entry has more prediction state (as described below vs. one taken/untaken bit). However it is done, the prediction logic must produce a prediction-table entry index and from that a prediction.
Each prediction contains two critical pieces, plus assorted niceties. One is the target, conceptually the address of the next ebb to execute. In practice it is the offset from the entry of the executing ebb, because all Mill branch-like things are entry-relative. As a result we don’t have to have a 60-bit address for each. In effect entry offsets form a linked list in the prediction table, so the prefetch logic can run arbitrarily ahead of fetch and execution. Other than representation, the target is similar to what is in a BTB.
The other crucial part of a prediction is a duration: the number of cycles between when the ebb is entered and when it exits. Duration is statically known; Mill is a statically scheduled exposed pipeline machine. If the duration in the prediction is three (for example) then the fetch and decode logic knows to fetch and decode exactly three (very wide) instructions before starting to fetch from the next ebb in the target chain.
Some of the niceties: rather than the duration being in cycles from the beginning of the ebb, the duration is in cache lines plus instructions within the last cache line. This both tells the prefetcher how many lines to munge, and also reduces the number of bits needed for duration in for large ebbs. Duration can overflow any fixed bit assignment of course; we have not yet determined whether injecting a dummy branch or simply permitting a mispredict at the end of large ebbs is better. Current our working sims use a dedicated set of logN representations for the long duration in lines. This will always mispredict (5 cycles) and may be over-eager in fetching from DRAM, but so far it looks like over-eager is better than under-eager. Pending measurement.
Another nicety: Mill code can have several (possibly conditional) calls in a single instruction; taken calls are executed in slot order and each can see the results of its predecessors. Each unconditional or taken call will have its own prediction, so an instruction-count duration is ambiguous if there are more than one. Consequently the duration is not only cache-lines+instructions but cache-lines+instruction+taken-calls.
There is a pretty conventional return-predictor; a table entry for a return uses a dedicated target value to indicate a return.
There is a transfer-kind code in the entry. This is used both to distinguish returns and to let the transfer hardware get a head start on the side-effects of the transfer: calls create stack and spiller frames and returns unwind them, the leave op discards in-flights, etc. Mill is very CISC, in the sense that single ops do a lot, not in the VAX-like sense in which ops are composed out of subops that must be parsed and interpreted in the context of the rest of the instruction; unlike x86 there is at most one memory access per operation. There’s good and bad in that. On the one hand a call (say) is conceptually simple, encodes densely, and is free of special cases. On the other hand the ISA gives you exactly one way to make a call, and you can’t tailor your own call protocol.
A great many loops on a Mill are one or only a few instructions in one or a few ebbs. As in many architectures there is a loop buffer on a Mill, or rather two of them. One of them holds the hardware signals coming from the decoder; the decoder is a 3-cycle process so we buffer up to three signal sets, and a special prediction tells the decode-and-issue to feed from the signals buffers rather than decoding around the loop. This doesn’t speed anything up (we issue every non-stall cycle regardless), but it does save power. The second loop buffer is the i$0 microcache that holds some small (per member) number of cache lines fully associative. If the lines of a loop, whether one or several ebbs, fit in i$0 then we avoid banging on the main instruction cache, again saving power.
- rolandpjParticipantOctober 19, 2017 at 11:12 amPost count: 4
An idea for debunking – this seems as good a place as any.
A fundamental issue that branch prediction attempts to solve is that there is no explicit clue as to a branch (or computed jump or call or return) target, until the actual PC-mutating instruction itself (is decoded).
This is directly analogous to the memory load problem – you don’t see the memory address until (the same instruction as when) the load result is supposed to be available.
So… why not solve it with explicit support for preload, rather than prediction.
In more (deliberately vague) detail:
Add a second register/buffer in the decode stage. Include an explicit ISA instruction to fetch the alternative instruction stream. As with ‘load from memory’, you can do this as soon as you know what the next (possible) PC/IP target is, which for most conditional and absolute branches/jumps/calls is known at compile-time. C++/JVM etc. ‘virtual’ call destinations, while dynamic, are also typically available at least several instructions before the actual PC/IP switch point (from linear execution).
You want, of course, to implement this in hardware at the level of i-cache lines, probably 32/64-byte units. At all times, your primary source of instructions is the linear execution path, but you have a prepared line of ‘alternative’ (possibly conditional or virtual or return) address instruction buffer.
The actual branch/jump/call/return instruction then reduces to a ‘switch-instruction-buffer’.
Even better, call/return for leaf call-points is a natural consequence – the return point is immediately available as the alternative instruction buffer after the ‘call’.
I seem to recall some history in this approach, but I don’t have any references at hand.
At a high level, it is entirely analogous to ‘prefetch… load’, which the Mill does explicitly as I recall.
So, is branch prediction actually necessary, or is it a reaction to pipelining on ISA’s which didn’t originally consider that they might be pipelined?
- Ivan GodardKeymasterOctober 19, 2017 at 1:17 pmPost count: 506
This goes about half-way to what the Mill does. You are correct that the usual legacy branch-predictor doesn’t know its got a branch until too late, after it has decoded the branch. There are known improvements, such as marking branches in the I$1 cache lines, which lets you know you have a branch coming when you fetch the line. However this is also usually too late in branch-rich code. Still more complex approaches use the regular predictor to build branch-free code sequences, often already partly decoded; this can save a lot in loops but doesn’t help in open code.
The need to look at “here” before you can realize that you will need to go “there” introduces potential memory delays in prefetch. Adding an explicit prefetch operation doesn’t really help, because you have to decode and execute the prefetch op and generally that comes too late too. A prefetch…load sequence would hide the I$1 latency, but not a 300 cycle DRAM latency – the compiler has no idea where it will be 300 cycles in the future. However, runtime often can have a pretty good idea, and we use that.
The Mill moves the whole predict and fetch machinery away from the code into a side structure with its own engine unrelated to the decoders. That engine can run arbitrarily far ahead of decode; in particular, it can run more than a memory-fetch-time ahead.
Your approach of a second buffer might help on a legacy single-issue architecture, but not on a Mill where an instruction can contain several branches, only one (or none) taken; a side buffer for each would blow out the hardware, especially as the following instruction can have several too – see the code example in the “Switches” talk. There is an I$0 microcache in the Mill, which is vaguely similar to multiple side buffers, but this is to cut the power/bandwidth demand in loops more than to cover mispredictions.
But even if you keep a side buffer per instruction the side buffers greatly increase the bandwidth demand into the decoders; instruction bandwidth is a limiting factor for high-end Mills. The Mill only moves the predicted path to the decoders; the alternate paths are prefetched to the caches but not fetched to the decoder. That lets prefetch be slow and relatively rare (and so not in the way), with the run-ahead prediction giving the necessary advance notice. We still have to get a missed target from the caches, but we avoid the great bulk of the cost of getting from memory.
To answer your final question: no, branch prediction is not actually necessary, and Mill uses exit prediction instead. But so long as memory latency is much longer than a compiler can see, and path fanout remains larger than hardware can handle in parallel, prediction of some kind will be necessary for performance.
- NXTanglParticipantFebruary 23, 2018 at 11:13 amPost count: 12
Don’t some Mills also have deferred branches? Since multiple branches that retire in the same cycle are defined to resolve to the first address in issue+slot order, this has some potential to lock in a code path for prediction: issue a deferred branch some cycles ahead, and if it’s known taken, we know enough to lock in that address and start the I$ load immediately. Of course, that mechanism works better for open code, since most Mill loops will be single-cycle pipelined loops that can’t schedule the exit branch as anything other than immediate.
- Ivan GodardKeymasterFebruary 24, 2018 at 4:07 pmPost count: 506
There are multiple branches but no deferred branches. Deferred branches were included once but we got rid of them ages ago. One can think of phasing as being a single-cycle deferral – however all branches phase the same way, whereas the essence of deferral is to support variable latency.
The basic problem with deferred branches is belt congruency. If the target of a branch is a join then the branch carries belt arguments giving what belt objects are to be passed to the target, and their belt order; multiple incoming branches carry the belt arguments to match the target’s belt signature. If we hoist a branch then an argument it wants to pass may not exist yet when the branch issues; conversely, if we restrict its arguments to those that pre-exist it then those may have fallen off the belt by the time the branch is taken. Of course, we could split the branch op into a (hoisted) branch part and an (unhoisted) argument part, but then the hoisted part is really no more than a prefetch.
It would still be possible to use a deferred branch when the target is not a join; in that case the target just inherits the belt as it is when the branch is taken. But such an op, and a prefetch op, wait on having big enough code sample (and enough free engineering time) to do serious measuring to see if they would pay.
You must be logged in to reply to this topic.