Prediction

staff

Talk by Ivan Godard – 2013-11-12 at

IEEE CS Santa Clara Valley

Run-ahead transfer prediction in the Mill CPU architecture

Programs frequently execute only a handful of operations between transfers of control: branches, calls, and returns. Yet modern wide-issue VLIW and superscalar CPUs can issue similar handfuls of operations every cycle, so the hardware must be able to change to a new point of execution each cycle if performance is not to suffer from stalls. Changing the point of execution requires determining the new execution address, fetching instructions at that address from the memory hierarchy, decoding the instructions, and issuing them—steps that can take tens of cycles on modern out-of-order machines. Without hardware help, a machine could take 20 cycles to transfer, for just one cycle of actual work.

The branch predictor hardware in conventional out-of-order processors does help a lot. It attempts to predict the taken vs. untaken state of conditional branches based on historical behavior of the same branch in earlier executions. Modern predictors achieve 95% accuracy, and large instruction-decode windows can hide top-level cache latency. Together these effects are sufficient for programs like benchmarks that are regular and small. However, on real-world problems today’s CPUs can spend a third or more of their cycles stalled for instructions.

The Mill uses a novel prediction mechanism to avoid these problems; it predicts transfers rather than branches. It can do so for all code, including code that has not yet ever been executed, running well ahead of execution so as to mask all cache latency and most memory latency. It needs no area- and power-hungry instruction window, using instead a very short decode pipeline and direct in-order issue and execution. It can use all present and future prediction algorithms, with the same accuracy as any other processor. On those occasions in which prediction is in error, the mispredict penalty is four cycles, a quarter that of superscalar designs. As a result, code stall is a rarity on a Mill, even on large programs with irregular control flow.

The talk describes the prediction mechanism of the Mill and compares it with the conventional approach.

Symmetry

I was re-watching the prediction video this morning. It seems like there are some serious advantages to the EBB chaining prediction approach, but it also seems like one drawback of it is that you completely lose the ability to make predictions about secondary or tertiary exits from an EBB, such as you might find in a loop within a loop. It seems like this makes the mill better at branchy or unstructred code at the expense of making it worse at the sort of code DSPs would typically excel at. Am I missing something?

ivan

Unfortunately, to fit in time and mind-boggle limits the talks have to gloss over quite a few important details. :-(

It's been mentioned that Mill EBBs can contain embedded calls (not permitted in the usual definition of EBB). That means that an EBB must have multiple predictions if it contains calls: one from entry to the first taken call (calls can be conditional), one from the return of that call to the next taken call, and so on until one from the last taken call to the taken return (returns can also be conditional). Each of these predictions needs a key for lookup in the prediction table. The first key is obviously the address of the EBB entry point; what should the rest be?

It's tempting to use the "from" address of the taken call (where the call will return to) as the key for the next prediction. However that doesn't work. To begin with, there are two "from" addresses due to split-stream encoding, and the exu address may be ambiguous due to no-op elision, but both these could be worked around. Unworkable is that the return address is not itself predicted from the information in the prediction: the pred contains a line count and an instruction count, but not a byte count. Without a byte count, we could not know from one prediction what the key of the next would be (if the key is the byte address), which means that we cannot do prediction chaining with the code not yet in memory, which is the goal of the whole exercise.

Could we use a byte count in the preds instead of line/cycle counts? No, because the decoder needs to stage lines into the instruction shifter on a cycle-by-cycle basis. We don't know how bytes translate into cycles without looking at the instruction headers, and by the time we have figured out from the remaining-byte count that we have done the last instruction it's too late to get the next line into the shifter. And adding a byte count as well as line/inst counts would blow up the pred entry too much.

The solution is to use the entry address for the initial key, and for each later key to decrement the entry address by two. The initial key is known from the prediction that enters the EBB, and each subsequent key is known just by arithmetic on the entry key; consequently chaining is possible even with embedded calls (including multiple calls within a single instruction, which is possible on a Mill - more on that in the Execution talk on Wednesday).

I mentioned above that there are conditional calls, which need keys. If (say) we enter the EBB at 0x12345 and the first call is conditional and predicted taken then the next key (for use after the return) will be 0x12343, two less than the entry. If however the first call is predicted not taken and we expect to execute until the next call, then the return from that (and the next key to use in the chain) is 0x12341, i.e. the entry decremented by two, twice. When chaining, how do we know which next key to use? The answer comes from realizing that we need to decrement the key not just for taken calls but for untaken ones as well. To do that, each pred also contains a small field for untakenCallCount, the number of untaken calls stepped over by the prediction. With that, we can decrement the right number of times and get the correct next key for chaining.

There's a potential ambiguity in using decremented entry addresses as subsequent keys: what if the decremented address was decremented enough to move the key pseudo-address into the adjacent EBB's code and collide with that EBB's entry address that it uses for its own prediction? Can't happen: a decrement is moving in the flow-code direction from the entry point (increment would move in the exu direction). The call operation is a flow-side operation, and all possible call encodings occupy at least two bytes. Consequently, a two-byte decrement will necessarily still lie in the flow side of the EBB because the call that requires another key must occupy enough bytes to keep the pseudo-key within the memory range of the EBB. Increment would not work, because EBBs with no exu side are possible and enough increments could move the key into ambiguity.

Lastly: why decrement by two; why not by one? Answer: each prediction (the "main" pred) may have an alternate pred that has the same key minus one. If we mispredict, we probe for the alternate key. There may not be an alternate; by definition alternates are used less than main predictions, and the alternate may never have been taken (and hence added to the table) or may have aged out, but if the transfer is at all loosey-goosey then there is a decent chance of an alternate. We chain normally from the alternate, and get prefetch etc started on that chain.

Alternates deal with the case in which a program usually stays on the main line, but sometimes goes off into something complicated. Once we know it's been rerouted, we want to be able to get prediction running ahead of it as quickly as possible, without additional misses. We already had a miss; without the alternate prediction, there would be another miss as soon as we got to the end of wherever the first miss took us. With the alternate, we avoid that second miss, and in addition have gotten prefetch and fetch working early.

Obviously this alternate mechanism applies to loops, and in fact it was first developed as support for exactly the problem you identify. Since then we have added more loop support (NYF), but have kept the alternate machinery because it is valuable even outside the loop context.

Now you see why I didn't try to include the details in the prediction talk :-)

Speaking of which, anybody want to volunteer to write a white paper on the full prediction mechanism? You would get some glory (name on paper, will get published on our site and in Computer Architecture News) and some sweat-equity in the Mill/OOTBC.

Will_Edwards

The Mill has some mitigations, perhaps? It has an extremely small mispredict penalty (5 cycles) if the taken path is in the instruction cache. It can execute up to 6 dependent instructions in a single cycle. It also makes classic VLIW definitions of Very seem exaggerated ;)

What is predicted is very novel to the Mill, but the how is normal. There are predictors that try to predict the number of iterations and so on; these are implementation choices and models may differ.

baking

This is obviously one of those details that had to be glossed over in the talk, but the line and instruction counts in the exit table can only refer to one stream. Are there two pairs of counts, one for each stream? Or have you worked out some trick to get by with just one.

ivan

There are two pairs of counts in a prediction, one pair for each side.

Symmetry

I seem to recall that you mentioned you loaded new predictions on a function call in the talk, and that seemed rather frequent to me. Now that the security talk is out, are they actually loaded on portal calls?

ivan

Predictions load only if we have no prediction for the call, and the loading is in parallel with the miss processing, so it will necessarily result in faster execution than simply missing and building up experience without loading.

The prediction table holds the target address of a prediction as an offset from the entry address of the caller EBB (the same addressing that branch/call uses). The offsets are within a single load module, which has bounded size much smaller than the total address space; that way we need only 20-30 bits per address in the table, rather than 60 bits for an arbitrary address.

In contrast, portals may be set up anywhere in the global address space, and so would need more bits in the prediction. We expect portal calls to be rare compared to ordinary calls, so we economize on table space and wires and don't try to predict portal calls.

PeterH

So while a function call, aided by prediction, will usually take a single cycle, a portal call will need at least as long as 2 fetches from L1 code cache, one load for the portal itself and a second for the destination code?

ivan

Yes. The load of the info block (in case of a callback) overlaps with the target code fetch, so there are only two fetches on the latency path: the portal itself, and the code miss. If you work through the detailed timing, this is the same as is needed for a trampoline (thunk) calling a dynamically linked (and local in the same protection environment as the caller) function that did not use a portal.

Consequently we expect that portals with wildcard ids (which use the caller's ids) will be used for local dynamic linking as well as (with explicit ids) for actual protection transit; there's no need for two mechanisms.

It is unfortunate that the predictor cannot predict far transits, but permitting it to do so would nearly double the size of the prediction tables for the same capacity. Measurement may show that far transits are common enough to justify the extra bits, in which case we will change the prediction formats and make portal calls be predictable, but our best guess for now says that far transits are rare enough compared to near transits that it is better to save the bits.

casten

It was mentioned in thew video that when there is a cold load of a program there is no history to use for prediction, so branch operations just guess. Might it be possible to prime the history with an annotation in source code? I suppose this could be an innate field in a branch instruction, or a separate instruction to to inform the predictor similar to __builtin_expect. The difference from __builtin_expect being, the value would only inform a cold predictor. It seems like this might be a large benefit for initial program performance; i.e., one of the larger factors in users' subjective performance perception.

Will_Edwards

Yes, this is a very good idea and I expect we'll be able to pick up __builtin_expect and equivalents when we prepare the starting statistics.

ivan

The Mill doesn't put "expects" in the code, because the entropy cost of the field in the encoding then would have to be paid on every use of such instructions. Instead "expects" go in the loadable prediction table in the load module. The effect as far as initial prediction quality is the same, but the predictor table is promptly overwritten by actual experience history.

Also, "expects" in the code only can supply taken/not taken information, which arrives much too late to be useful in guiding decode. Instead the Mill predictor carries "where to next" information so that it can run-ahead of fetch and thereby avoid code fetch stalls.

Will is right that anything the program can do to help guide the process is useful and will be incorporated in the program for better prediction. The difference is in how that added information is represented, and where.

casten

I didn't mean to explicitly ask about __builtin_expect. I only meant to reference it as something with a similar function. Articles on the net say programmers are more likely to mispredict frequencies unless they use the proper tools in specific circumstances.

My intent was to suggest using a similar annotation or instruction to inform a cold predictor. Being in a base state, a programmer can probably have a reasonable idea about the likely code path. And then once the predictor is warmed up, the annotation should be ignored.

Of course support for __builtin_expect is nice. But I think that something like __builtin_cold_expect would be generally more practical. Once things are primed, I wouldn't want to second guess predictors that do 90+%.

As far as entropy cost, it is understable that an additional instruction might be too costly. What about an extra, optional or unused bit on a branch instruction?

Another related question. When speculating on a branch, might there any added benefit to knowing the probability within a granularity greater than 50% accuracy? It seems like if the decision were not simple and a weight could be applied, e.g. one branch had a more expensive subsequent instruction than the alternative, this too might be useful. But I'm probably speculating too much myself. I'll stop this line of questioning as I give it a probability of .05 of being potentially useful.

ivan

Rather than manual annotation, why not just run the program a few times and let it train?

Modifying the binary, such as by adding bits to the branch ops, hits the problem that the decoders cannot know they have a branch until after they have already committed to taking or not taking the transfer. This is a consequence of the pipelining of the decode logic; we have to know to issue the line fetch at least a cache-latency ahead of getting the line, and it's a couple of cycles more until decode has figured out that a transfer is to be executed. There are two possible resolutions of this: predictive fetch, or retarded execution. We use the former, and have generalized predictive fetch to achieve arbitrary run-ahead. Retarded execution is used by OOO machines, by delaying operations that depend on control flow until the control flow is resolved, using the OOO hardware to keep the execute units busy once the pipe has started up.

The two approaches can approximate each other in the steady state, but predictive fetch has much less startup cost than retarded, so we have a five cycle mispredict penalty rather than the 15 or so typical of retarded OOO. Predictive does require more table state than retarded for equivalent accuracy, but the Mill avoids that issue through our trainable prediction, which in effect gives us arbitrarily large tables in fixed and small hardware.

There may be a third approach to deal with pipelined decoder timing beyond predictive and retarded, but I don't know of one.

casten

Makes sense. Thanks for the explanation.

David

Something that was not addressed in this talk is virtual and indirect calls/jumps. Predicting these has significant effect not just with dynamic languages but as low-level as C++.

One facet would be virtual dispatch where the type of object being dispatched is generally the same across multiple executions at the same call site. The existing Mill predictor would naturally be able to learn this, and Java and JavaScript tend to have function rewriting to perform this type of prediction in software.

But often the type varies for a given call site. Unlike branch prediction, it's not limited to a binary decision between two potential addresses. Opportunity to prefetch the next EBB can come from hoisting the load of the dispatch address, which can go pretty far in functions which perform processing themselves before the dispatch call. However, the software would need to be able to give this address to the exit chaining mechanism in order to gain value from this prefetch.

There are also concepts of history tables linked to object identity instead of call site address. With these, multiple call sites could benefit from a single knowledge update, but I don't think they're as appropriate for general purpose CPUs as they're usually specific to object system ABIs.

ivan

Exactly right.

When the prediction facility was being designed, we decided that any unstable transfer point such as method calls (what Cliff Click calls "megamorphic") would be of interest only if control passed through it frequently enough to keep the various target EBBs in the L2. Consequently we did not concern ourselves with issues of DRAM latency, and were primarily concerned with the miss-predict penalty (yes, the Mill has a penalty that is a third that of other architectures, but a stall is still a stall) and the L1/L2 latencies.

Next, we observed that the codes we looked at, and in our personal experience, megamorphic transfers tended to not be calls on leaf function. Instead, a megamorphic transfer usually led to a sequence of transfers that were not megamorphic, until control returned back through the megamorphic point. This reflects a program model in which a test point figures out the type of a datum, and then does stuff with the known type. Mill run-ahead prediction works well with this model, because once through the megamorph the subsequent control flow will predict well and will be prefetched ahead of need. Incidentally, the original source may not have been written to this model, but the compiler's value-set analysis and inlining will tend to remove the type-checks internally to the known-type cluster, and even if it does not, the type will in fact be fixed and the transfers not be megamorphic.

The alternative model works less well, where the same object is subjected to a succession of leaf-function calls. Each of the calls will be megamorphic if the object is (and so will miss-predict frequently), and the leaf functions don't have long enough transfer chains for run-ahead to win much.

This suggests an obvious optimization for the compiler: transform the latter structure into the former. This dimensionality inversion or transpose is the control space analogy to the re-tiling optimizations that are used to improve array traversal in the data space, and for all I know some compilers already do it. It should benefit any prediction mechanism, not just ours.

David

Beyond object method dispatch, lambda passing style would be another source of unstable transfer points. Custom lambdas passed into central utilities are often per-use closures. On the upside, the actual function that will be called is hoisted all the way up to an input parameter, which again would be an ideal for software injecting "upcoming|next EBB" hints into the prefetcher.

For context, how many cycles ahead of an exit would the exit table need to be notified in order to prefetch code from cache to execute without any stalls? I suspect this would vary between family members. If it's under 10, I would imagine software hints could eliminate stalls for many truly unstable dispatched exits.

I agree that DRAM latency isn't worth considering in these optimization scenarios. However, if the 5-cycle mispredict penalties are a concern, the fact remains that the absolute correct target for fully dynamic dispatch should be available in the software view far enough ahead of the call in enough situations to be beneficial to the hardware mechanism. The problem is communicating it from software into hardware.

The Mill has software-controlled prefetching of data via speculation, but not software-controlled prefetching of code (that we've seen). If the hardware predictor consistently fails for a dispatched usage case, there's no other way to update or augment its functionality.

Having a compiler decide between whether to generate preemptive dispatch hints vs letting the predictor attempt to compensate would probably best be left to runtime log feedback, and might not be used by all compilers. But not having that option at all seems to me to be missing functionality that manifests in penalizing dynamic code.

(Obviously, hopping through multiple layers of dynamic indirection quickly would likely cause stalls no matter the prefetch mechanism, but most dynamic designs boil down to just one indirection per function call.)

Will_Edwards

Perceptive. Some concrete Mills may have deferred control flow (e.g. the delay on the br op) and this may need careful simulator profiling for the case of call too.

the actual function that will be called is hoisted all the way up to an input parameter, which again would be an ideal for software injecting “upcoming|next EBB” hints into the prefetcher.

Exactly.

Luckily this hoisting would happen in the specializer (or jit), which can use deferred calls or other hints if the concrete Mill supports it. It is nothing a conventional compiler generating the flow graph need concern itself with.