Can branch predictors perfectly predict counting loops?
A lot of loops run for a number of iterations that is known (or can easily be calculated) from the very beginning. E.g. an iteration might count from 0 up to array.length - 1, or vice versa. Even if there is another termination condition, it MUST terminate after the iteration for array.length - 1 is done. My question is whether this data can actually be transported over to the decoder part of the pipeline where prediction occurs, so that the branch predictor knows for certain when a loop will be done: do a decrement in the predictor each cycle and predict an exit once the value is decremented to 0 (or 5). The talks said the Mill’s mispredict penalty is ~5 cycles, which means it might take a few cycles to move data to the place where branch prediction happens; however, since we know that distance, we should be able to account for it. As long as the loop lasts longer than the time it takes to move the upper bound on the number of iterations over to the predictor, we will know exactly when we have to predict an exit. I would think that having a static schedule would enable this type of optimization, because we know how long the first trip through the loop takes and how many iterations we can compute per cycle, and with this information it shouldn’t be hard to figure out EXACTLY when we HAVE TO exit, given the upper bound on the number of iterations.

Is this actually doable? Is it worth it in hardware? Would some information need to be added to the loop-starting instruction to support this? One problem I foresee is that the number of iterations we can execute per cycle might be something other than a power of 2, which means it might make sense to encode that value (if it can’t be calculated by the machine during decoding) at the start of the loop and do a subtraction by that value every cycle.
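To make the idea concrete, here is a minimal software sketch of the predictor-side countdown I have in mind. The 5-cycle transport delay, the iterations-per-cycle field, and all the names are assumptions for illustration only, not anything the Mill actually does:

    /* Illustrative model of a countdown exit predictor.
     * Assumptions (not Mill specifics): a fixed transport delay from the
     * execution core to the predictor, and a known number of iterations
     * retired per cycle. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     armed;            /* a live trip count has been delivered  */
        int64_t  iters_remaining;  /* iterations left, as seen by predictor */
        uint32_t iters_per_cycle;  /* loop iterations completed per cycle   */
    } countdown_predictor_t;

    /* Called once the loop's trip count reaches the predictor.  The count
     * is adjusted for the cycles it spent in flight. */
    static void arm(countdown_predictor_t *p, int64_t trip_count,
                    uint32_t iters_per_cycle, uint32_t transport_cycles)
    {
        p->iters_per_cycle = iters_per_cycle;
        p->iters_remaining = trip_count
                           - (int64_t)transport_cycles * iters_per_cycle;
        p->armed = (p->iters_remaining > 0);
    }

    /* Called every cycle; returns true when the back branch should be
     * predicted not-taken, i.e. the loop exit should be predicted. */
    static bool predict_exit_this_cycle(countdown_predictor_t *p)
    {
        if (!p->armed)
            return false;          /* fall back to history-based prediction */
        p->iters_remaining -= p->iters_per_cycle;
        return p->iters_remaining <= 0;
    }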
As far as I understand it, on the Mill the compiler converts each loop into a recursive function (which is a basic block). But that’s beside the point: as far as I understand it, the Mill’s predictor predicts that basic block A exits to basic block B. So the cache loads basic block A, preloads basic block B, and the number of iterations is mostly meaningless to efficiency. Then again, I have slept many times since watching that talk.
Prediction offers little value in lengthy loops. Even in architectures with no form of prediction there may be a miss on entry and another on exit, but for all the other iterations the loop body is already in the top-level cache. At most, a predictor can let fetch and decode proceed through the back branch without waiting for that branch to execute and the target to resolve. That wait can be significant in very tight loops, but predictor hardware isn’t free, and in cost-conscious embedded use predictors are an engineering choice.
This also applies to the Mill exit predictor. You are right that ours permits fetch-ahead without decode of the branch, because the next ebb’s address is in the prediction, similar to a BTB entry. The entry also contains a duration, in cycles, which tells the decoder when to stop the current ebb and start on the new one. If the prediction was correct, the target ebb (and likely many more) is already in the TLC and the correct sequence of bundles passes through decode to issue with no bubbles.
That’s if there is no miss. When we execute the branch and discover that the prediction was wrong, or if we take an exit that wasn’t predicted, the hardware has already issued two cycles down the wrong path. Miss recovery has to unwind that, fetch the right target, and decode it, which takes 5 cycles if the target is in cache (legacy miss costs are usually around 20 cycles). A naive predictor will always miss at the bottom of a loop on the last iteration. If the loop has a low trip count the miss can be significant. To avoid it, a more sophisticated predictor uses prior branch history or iteration counts to switch the prediction on the last time through. We’ve tried both approaches, and found that which works best depends on the kind of apps being run, so that’s another configuration choice.
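For concreteness, a toy model of the two policies just described might look like this; the field names and layout are invented for illustration, not a description of the actual hardware:

    /* Illustrative exit-prediction entry: a predicted target ebb address,
     * a duration in cycles, plus a learned trip count so the last
     * iteration can be predicted as the exit.  All names are invented. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t target_ebb;      /* predicted next ebb (like a BTB target)  */
        uint16_t duration;        /* cycles of the current ebb before switch */
        uint32_t learned_trips;   /* trip count seen on prior executions     */
        uint32_t trips_so_far;    /* iterations observed this execution      */
    } exit_pred_entry_t;

    /* Naive policy: always predict "stay in the loop", so the last
     * iteration always misses.  Trip-count policy: flip the prediction
     * when the iteration count reaches the learned value. */
    static bool predict_take_back_branch(const exit_pred_entry_t *e,
                                         bool use_trip_count)
    {
        if (!use_trip_count)
            return true;                          /* miss on final iteration */
        return e->trips_so_far + 1 < e->learned_trips;
    }

    /* On loop exit, remember the observed trip count for next time. */
    static void train_on_exit(exit_pred_entry_t *e)
    {
        e->learned_trips = e->trips_so_far;
        e->trips_so_far  = 0;
    }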
I am not sure that my specific question has been answered. I am wondering if it is conceivable that there could be a mechanism whereby data from the running program that indicates how many iterations the loop plans on doing could be shipped over to the predictor. I know it often runs on historical data, but my question is whether “live” data could be used to override the history-based prediction. E.g. let’s say I have a loop that is supposed to run 5 times, this time. Maybe it ran 10 times last time. For simplicity let’s say I can only run one iteration of the loop at a time, and each iteration takes 4 cycles. So we know the loop will run for 20 cycles, then exit. We know we are going to run the loop 5 times because there is a counter on the belt with 5 in it. Because the loop runs for longer than the pipeline length of 5 cycles, we can send some live data over to the predictor and tell it, “when I sent this value, we had 5 iterations left to go”. Now, because X cycles must have passed since that value was sent, maybe we have 4 iterations left to go, but still, the predictor now knows with certainty when an exit should occur, right? This could potentially be a side effect of some of the Mill’s loop-specific instructions. My apologies if this is completely untenable to implement; I do not know the pains of designing hardware.

Could that be generalized to any arbitrary branch?
For example, consider this code:
    int foo(int a) {
        // just whatever random code
        int b = a * a;
        int c = b * b;
        int d = a + b + c;
        // a branch that could be precalculated at the beginning of the function
        if (a) {
            return bar(d);
        } else {
            return baz(d);
        }
    }
At the beginning of the function, we have enough information to know whether we will take that branch or not. A compiler could potentially introduce an instruction at the beginning to precalculate whether the branch will be taken and notify the predictor, or there could be a “delayed branch” instruction, something like “branch in 4 cycles if a is non-zero”.
This kind of code shows up all the time in loops and in precondition checking in functions.
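Something like the following, with an invented __branch_hint intrinsic standing in for that hypothetical instruction; nothing like it exists in the Mill ISA as published, it is purely a sketch of where the compiler could place the hint:

    /* Hypothetical rewrite of foo() using an invented "delayed branch"
     * hint.  __branch_hint() is not a real intrinsic; it is declared here
     * only so the sketch is self-contained, and stands for "a branch on
     * this predicate will execute in roughly this many cycles". */
    void __branch_hint(int predicate, int cycles_until_branch);

    int bar(int d);
    int baz(int d);

    int foo(int a)
    {
        /* The predicate is already known here, well before the transfer. */
        __branch_hint(a != 0, 4);

        int b = a * a;
        int c = b * b;
        int d = a + b + c;

        if (a)
            return bar(d);
        else
            return baz(d);
    }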
Your idea is possible in principle, but the details are problematic. In essence you propose a deferred branch whose transfer is detached from the evaluation of the predicate, similar to the way Mill separates load issue from load retire.
The deferred branch, DB, would contain a predicate, a target address, and a deferral cycle count. At DB issue time the predictor has already made an exit prediction and fetch has already run down the prediction chain loading the code of the predicted path. If the DB predicate evals false then the DB is ignored. If it evals true then the DB target address should be compared with the predicted target and the deferral with the remaining count of the prediction; if equal then the DB is asserting the same transfer as the prediction, and can be ignored. If they differ then the fetch/decode logic needs to be reset, essentially in the same way as with a mispredict, by updating the pending count and target and restarting the fetch chain at the new target.
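In rough pseudo-C, the comparison just described would be something like this; all names are invented, and it models the rule, not real hardware:

    /* Software model of the DB check: a deferred branch carries a
     * predicate, a target, and a deferral count, and is compared against
     * the prediction currently driving fetch. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t target;       /* predicted transfer target            */
        uint32_t cycles_left;  /* cycles remaining before the transfer */
    } prediction_t;

    typedef enum { DB_IGNORE, DB_RESET_FETCH } db_action_t;

    static db_action_t deferred_branch_issue(bool predicate,
                                             uint64_t db_target,
                                             uint32_t db_deferral,
                                             const prediction_t *pred)
    {
        if (!predicate)
            return DB_IGNORE;                    /* branch will not be taken */

        if (db_target == pred->target && db_deferral == pred->cycles_left)
            return DB_IGNORE;   /* DB asserts the same transfer as predicted */

        /* Disagreement: update the pending count and target and restart
         * the fetch chain at the new target, at a cost comparable to a
         * mispredict if the target is not already in the I0 microcache. */
        return DB_RESET_FETCH;
    }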
This could be done. However, there are both semantic and implementation issues. One is the time required to reset the fetch. If the target is not in the I0 microcache then reset would take roughly as long as mispredict recovery, i.e. five cycles in our test configs. Even an I0 hit would likely need three cycles to reset. If the deferral was less than the remaining count as predicted then we would have already executed a ways down the predicted (before the reset) path, and would need a full unwind miss recovery, so the DB buys nothing. How often can we eval a predicate five cycles before the transfer? Not often, I’d guess, but I have no numbers.
A semantic issue is how the DB interacts with other branches. Suppose a regular branch is executed between DB issue/eval and transfer? Who wins? I’m sure you can think of similar conundrums.
Nihil novi (nothing new), it seems.
I had exactly the same idea; my obvious case was quicksort, which switches to more regular quadratic algorithms for short arrays.
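Something like this sketch (the cutoff of 16 is arbitrary): once a partition drops below the cutoff, the fallback sort runs loops whose short trip counts are known the moment they are entered.

    /* Sketch of the quicksort pattern being described: below a small
     * cutoff the sort falls back to insertion sort, whose outer loops
     * have short trip counts known at entry. */
    #define CUTOFF 16

    static void insertion_sort(int *a, int n)
    {
        for (int i = 1; i < n; i++) {            /* trip count known: n - 1 */
            int key = a[i], j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    static void quicksort(int *a, int lo, int hi)
    {
        if (hi - lo + 1 <= CUTOFF) {
            insertion_sort(a + lo, hi - lo + 1);
            return;
        }
        int pivot = a[hi], i = lo - 1;
        for (int j = lo; j < hi; j++)            /* trip count known: hi - lo */
            if (a[j] < pivot) { i++; int t = a[i]; a[i] = a[j]; a[j] = t; }
        int t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
        quicksort(a, lo, i);
        quicksort(a, i + 2, hi);
    }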
This way I have rediscovered Loop Termination Prediction, Count Register, Counted Loop Branch, etc.

Prediction offers little value in lengthy loops
Yes, but there is a surprisingly high number of relatively short loops.
And don’t forget that vectorization makes loops up to 32 times shorter!

When we execute the branch and discover that the prediction was wrong, or if we take an exit that wasn’t predicted, the hardware has already issued two cycles down the wrong path. Miss recovery has to unwind that, fetch the right target, and decode it, which takes 5 cycles if the target is in cache
5 cycles on the Mill is like 15 cycles on a conventional processor – quite a big loss!
In essence you propose a deferred branch whose transfer is detached from the evaluation of the predicate, similar to the way Mill separates load issue from load retire.
IIRC Elbrus E2k uses this technique; it has Prepare Branch and Execute Branch instructions.
One is the time required to reset the fetch. If the target is not in the I0 microcache then reset would take roughly as long as mispredict recovery, i.e. five cycles in our test configs.
It looks like you should split speculative fetch and speculative code prefetch.
How often can we eval a predicate five cycles before the transfer? Not often I’d guess, but I have no numbers.
But you can start resetting the fetch earlier! Prepare Branch and the following instructions before Branch Execute are valid…
A semantic issue is how the DB interacts with other branches. Suppose a regular branch is executed between DB issue/eval and transfer? Who wins?
It doesn’t matter as long as it is consistent! We could treat it in the same way as delay slots, so the most “powerful” answer is: both of them; unfortunately it won’t work with the Belt, I suppose.
One is the time required to reset the fetch. If the target is not in the I0 microcache then reset would take roughly as long as mispredict recovery, i.e. five cycles in our test configs. Even an I0 hit would likely need three cycles to reset.
It is hard to believe that a reset at such an early stage of instruction execution (no register/belt/memory operations yet) is anything other than a stall of the subsequent “correct” stages. So there should be no lower limit on its duration when the information is available earlier: just issue the correct instruction and wait for its “execution front”.