Control-flow folding

Author
Posts
Ivan Godard
Keymaster
March 5, 2017 at 8:19 pm
Post count: 689
#2806 |
As a wide machine that can safely speculate most operations even while checking for such things as integer overflow, the Mill permits a lot of control flow folding. For trivial example, this source:
```
   int foo(int a);
   void bar(int b) {
   if (b != 17)
       bar(b+1);
   if (b != 0)
       foo(b+17);
   }
```
gets this one-instruction asm code from the specializer on Silver:
```
   F("bar") %0;
        add(b0 %0, 17) %2,
          eqlb(b0 %0, 17) %1,
          eqlb(b0 %0, 0) %4,
          add(b0 %0, 1) %3,
          callfl0(b2 %1, "bar", b0 %3),
          callfl0(b1 %4, "foo", b3 %2),
          retn();
```
The app would not throw a fault even if one of the adds overflows, unless the corresponding call was taken. Mill predicts exits, not branches, so there may be from 1-3 predictions for this function, one for the return and 0/1/2 for the calls. Conservation of prediction space is an important consideration when optimizing Mill code.
The (possible) trailing call to foo prevents LLVM from applying tail recursion elimination to bar, but in fact the whole function with its recursive call can be replaced by an assignment followed by a loop over foo in Mill code; it’s hard for LLVM to find that though. You might see what other compilers can do with this.
goldbug
Participant
March 17, 2017 at 6:40 am
Post count: 53
#2808
From the lecture you gave on prediction, I get that you predict exits.
In this case, there can be up to 3 exits, one for each call and the return. In fact, in this particular function there will always be at least 2 exits taken. At least one of the calls will be taken and the return will be taken. If I understood your talk correctly, you only have 1 prediction per EBB. Does that mean that your prediction would fail at least 50% of the time in this function? Or you have some mechanism to predict both exits?
I would think this is pretty common, we (developers) often make functions that call other functions and then return.
Also, on the lecture you gave some pretty interesting numbers at the beginning of the talk about how much time is wasted on cold runs in conventional hardware. You explain what the bottlenecks are and how you are addressing them in the Mill. But you don’t mention how effective your solutions are. Do you have any numbers as to how your solution performs vs conventional hardware regarding mispredictions?
- This reply was modified 7 years, 4 months ago by goldbug.
- This reply was modified 7 years, 4 months ago by goldbug.
- Ivan Godard
  Keymaster
  March 17, 2017 at 10:11 am
  Post count: 689
  #2811
  Perhaps the talk is oversimplified and misleading. In greater detail, the ebb is divided into fragments, one fragment between each possible transfer. Each sequence of fragments that ends with a taken transfer has a prediction, even unconditional transfers, so we are actually predicting exit from a fragment chain rather than from an ebb. The difference is visible when the ebb contains taken calls, as in this example. Thus while all ebbs will have at least one prediction, those containing calls will have an additional one for each (predicted-)taken call.
  The behavior, cost, and advantages of the Mill predictor and a legacy predictor are quite different and not readily comparable. The actual prediction mechanism, be it history based, AI or whatever, is irrelevant; Mill can use any of them in a family member, and the chosen method will predict as well for Mill as anyone. What is different is the information retained by the predictor and what we do with it. Conventional predictors carry one bit (taken/not taken) for each conditional transfer in the code, and an address (in the branch target buffer or BTB) for each dynamically-addressed transfer, either unconditional or taken. They also have a queue of decoded instructions awaiting issue. In the usual decode pipeline that instruction buffering gives them enough lead time to fetch a predicted target instruction from the I$1 without a stall. However, there’s not usually enough lead to handle an I$1 miss, and definitely not enough to go deeper in the memory hierarchy. In addition, the decode queue is yet more work that has to be thrown away on a mispredict.
  The Mill predictor contains a (compressed) address in every prediction, essentially folding the BTB into the branch predictor. And there’s an entry for every taken transfer, conditional or not and dynamic or not (a conventional BTB holds addresses only for dynamic transfers). As a result, Mill doesn’t have to look at the code itself to find the target address of a transfer, nor to see which instructions are conditional branches to look up. Mill predictions form a chain, each to the next, extending into future execution; that chain is broken only by a misprediction. And that chain can be prefetched, even out to DRAM, with all fragments fetched in parallel rather than sequentially as a conventional must, even on a conventional with a scout thread.
  The downside of the Mill approach is that the space occupied by the predictor for a given hunk of code is much bigger than a simple taken/not taken bit table for the same code. Conventional predictors were developed when transistors were scarce and the table size mattered.
  The difference between Mill and legacy approaches shows up differently in different workloads. On loop-heavy loads with few transfers and fewer mispredicts both approaches work well. In branch-heavy code with lots of mispredicts but with a code working set that fits in the I$1 (a lexer for example) then Mill wins not because it predicts better but because its mispredict recovery is fast, in part because it doesn’t need an instruction queue. But the big difference shows up in workloads with very large code working sets and fairly common mispredicts, such as database and browser work loads. A mispredict in such loads commonly means that the application has entered a sub-function that has its own working set, and that working set is not in the I$1 and typically not on the chip at all.
  That sub-function will need all its working set. A legacy CPU will fetch that working set one transfer at a time as the I$1 misses the fetch of the code that has to be decoded to find the address of the next code to fetch, which will miss in the I$1 again… Meanwhile, the Mill predictor will have issued fetches for the whole working set, as fast as the DRAM bandwidth will go, by chaining in the predictor without looking at the code itself. Yes, the first few transfers will have to wait for the fetched code, but then the fetcher catches up and decode finds the code it wants in cache, just waiting to execute.
  There are actually two costs in a mispredict: the immediate stall as state is flushed and reloaded, and the long tail of deferred costs as the instruction caches reload to the new working set. The numbers you see published for mispredict costs show only the first cost, and if the transfer doesn’t change to a new working set then that is the only cost. But (neglecting task swap) any time you see a flurry of instruction cache misses you are seeing a change in working set, and the flurry was preceded by a mispredict and the costs of the cache misses are part of the cost of that prediction miss.
  Mill addresses the immediate cost with short pipes and little state; mispredict stall runs a third to a fifth of that of legacy machines. For many applications that’s the only cost that matters, but for others the subsequent fetch-miss cost is very significant, and Mill run-ahead prediction addresses that.
  The cost is hardware table size. Because the Mill is configurable, different Mill members can adopt different predictors. The video describes the expected default predictor configuration. A Mill for a market that cares only about price (== chip area) and not about latency can be configured with a conventional predictor, or no predictor at all as is the case of some embedded designs.
  This reply was modified 7 years, 4 months ago by Ivan Godard. Reason: typo
- Ivan Godard
  Keymaster
  March 17, 2017 at 10:33 am
  Post count: 689
  #2813
  (Please put one question per post, rather than adding more by edits that may cross with responses)
  About numbers: we’re working on it, but we have nothing big enough yet to really exercise the predictor and especially the inter-run history mechanism. Something like mySQL would do, or a Windows port maybe, but for now we claim benefits based on first principles and seat-of-the-pants, not on measurement. And yes, we are even more impatient about that than you are 🙂
goldbug
Participant
March 18, 2017 at 1:54 pm
Post count: 53
#2814
Thanks for the detailed explanation. Seems like you guys have thought of everything :). I look forward to play around with the simulator when it is available.
Author
Posts

You must be logged in to reply to this topic.