Unfortunately, to fit in time and mind-boggle limits the talks have to gloss over quite a few important details. 🙁
It’s been mentioned that Mill EBBs can contain embedded calls (not permitted in the usual definition of EBB). That means that an EBB must have multiple predictions if it contains calls: one from entry to the first taken call (calls can be conditional), one from the return of that call to the next taken call, and so on until one from the last taken call to the taken return (returns can also be conditional). Each of these predictions needs a key for lookup in the prediction table. The first key is obviously the address of the EBB entry point; what should the rest be?
It’s tempting to use the “from” address of the taken call (where the call will return to) as the key for the next prediction. However that doesn’t work. To begin with, there are two “from” addresses due to split-stream encoding, and the exu address may be ambiguous due to no-op elision, but both these could be worked around. Unworkable is that the return address is not itself predicted from the information in the prediction: the pred contains a line count and an instruction count, but not a byte count. Without a byte count, we could not know from one prediction what the key of the next would be (if the key is the byte address), which means that we cannot do prediction chaining with the code not yet in memory, which is the goal of the whole exercise.
Could we use a byte count in the preds instead of line/cycle counts? No, because the decoder needs to stage lines into the instruction shifter on a cycle-by-cycle basis. We don’t know how bytes translate into cycles without looking at the instruction headers, and by the time we have figured out from the remaining-byte count that we have done the last instruction it’s too late to get the next line into the shifter. And adding a byte count as well as line/inst counts would blow up the pred entry too much.
The solution is to use the entry address for the initial key, and for each later key to decrement the entry address by two. The initial key is known from the prediction that enters the EBB, and each subsequent key is known just by arithmetic on the entry key; consequently chaining is possible even with embedded calls (including multiple calls within a single instruction, which is possible on a Mill – more on that in the Execution talk on Wednesday).
I mentioned above that there are conditional calls, which need keys. If (say) we enter the EBB at 0x12345 and the first call is conditional and predicted taken then the next key (for use after the return) will be 0x12343, two less than the entry. If however the first call is predicted not taken and we expect to execute until the next call, then the return from that (and the next key to use in the chain) is 0x12341, i.e. the entry decremented by two, twice. When chaining, how do we know which next key to use? The answer comes from realizing that we need to decrement the key not just for taken calls but for untaken ones as well. To do that, each pred also contains a small field for untakenCallCount, the number of untaken calls stepped over by the prediction. With that, we can decrement the right number of times and get the correct next key for chaining.
There’s a potential ambiguity in using decremented entry addresses as subsequent keys: what if the decremented address was decremented enough to move the key pseudo-address into the adjacent EBB’s code and collide with that EBB’s entry address that it uses for its own prediction? Can’t happen: a decrement is moving in the flow-code direction from the entry point (increment would move in the exu direction). The call operation is a flow-side operation, and all possible call encodings occupy at least two bytes. Consequently, a two-byte decrement will necessarily still lie in the flow side of the EBB because the call that requires another key must occupy enough bytes to keep the pseudo-key within the memory range of the EBB. Increment would not work, because EBBs with no exu side are possible and enough increments could move the key into ambiguity.
Lastly: why decrement by two; why not by one? Answer: each prediction (the “main” pred) may have an alternate pred that has the same key minus one. If we mispredict, we probe for the alternate key. There may not be an alternate; by definition alternates are used less than main predictions, and the alternate may never have been taken (and hence added to the table) or may have aged out, but if the transfer is at all loosey-goosey then there is a decent chance of an alternate. We chain normally from the alternate, and get prefetch etc started on that chain.
Alternates deal with the case in which a program usually stays on the main line, but sometimes goes off into something complicated. Once we know it’s been rerouted, we want to be able to get prediction running ahead of it as quickly as possible, without additional misses. We already had a miss; without the alternate prediction, there would be another miss as soon as we got to the end of wherever the first miss took us. With the alternate, we avoid that second miss, and in addition have gotten prefetch and fetch working early.
Obviously this alternate mechanism applies to loops, and in fact it was first developed as support for exactly the problem you identify. Since then we have added more loop support (NYF), but have kept the alternate machinery because it is valuable even outside the loop context.
Now you see why I didn’t try to include the details in the prediction talk 🙂
Speaking of which, anybody want to volunteer to write a white paper on the full prediction mechanism? You would get some glory (name on paper, will get published on our site and in Computer Architecture News) and some sweat-equity in the Mill/OOTBC.