Mill Computing, Inc. Forums The Mill Architecture Control-flow folding Reply To: Control-flow folding

Ivan Godard
Post count: 689

Perhaps the talk is oversimplified and misleading. In greater detail, the ebb is divided into fragments, one fragment between each possible transfer. Each sequence of fragments that ends with a taken transfer has a prediction, even unconditional transfers, so we are actually predicting exit from a fragment chain rather than from an ebb. The difference is visible when the ebb contains taken calls, as in this example. Thus while all ebbs will have at least one prediction, those containing calls will have an additional one for each (predicted-)taken call.

The behavior, cost, and advantages of the Mill predictor and a legacy predictor are quite different and not readily comparable. The actual prediction mechanism, be it history based, AI or whatever, is irrelevant; Mill can use any of them in a family member, and the chosen method will predict as well for Mill as anyone. What is different is the information retained by the predictor and what we do with it. Conventional predictors carry one bit (taken/not taken) for each conditional transfer in the code, and an address (in the branch target buffer or BTB) for each dynamically-addressed transfer, either unconditional or taken. They also have a queue of decoded instructions awaiting issue. In the usual decode pipeline that instruction buffering gives them enough lead time to fetch a predicted target instruction from the I$1 without a stall. However, there’s not usually enough lead to handle an I$1 miss, and definitely not enough to go deeper in the memory hierarchy. In addition, the decode queue is yet more work that has to be thrown away on a mispredict.

The Mill predictor contains a (compressed) address in every prediction, essentially folding the BTB into the branch predictor. And there’s an entry for every taken transfer, conditional or not and dynamic or not (a conventional BTB holds addresses only for dynamic transfers). As a result, Mill doesn’t have to look at the code itself to find the target address of a transfer, nor to see which instructions are conditional branches to look up. Mill predictions form a chain, each to the next, extending into future execution; that chain is broken only by a misprediction. And that chain can be prefetched, even out to DRAM, with all fragments fetched in parallel rather than sequentially as a conventional must, even on a conventional with a scout thread.

The downside of the Mill approach is that the space occupied by the predictor for a given hunk of code is much bigger than a simple taken/not taken bit table for the same code. Conventional predictors were developed when transistors were scarce and the table size mattered.

The difference between Mill and legacy approaches shows up differently in different workloads. On loop-heavy loads with few transfers and fewer mispredicts both approaches work well. In branch-heavy code with lots of mispredicts but with a code working set that fits in the I$1 (a lexer for example) then Mill wins not because it predicts better but because its mispredict recovery is fast, in part because it doesn’t need an instruction queue. But the big difference shows up in workloads with very large code working sets and fairly common mispredicts, such as database and browser work loads. A mispredict in such loads commonly means that the application has entered a sub-function that has its own working set, and that working set is not in the I$1 and typically not on the chip at all.

That sub-function will need all its working set. A legacy CPU will fetch that working set one transfer at a time as the I$1 misses the fetch of the code that has to be decoded to find the address of the next code to fetch, which will miss in the I$1 again… Meanwhile, the Mill predictor will have issued fetches for the whole working set, as fast as the DRAM bandwidth will go, by chaining in the predictor without looking at the code itself. Yes, the first few transfers will have to wait for the fetched code, but then the fetcher catches up and decode finds the code it wants in cache, just waiting to execute.

There are actually two costs in a mispredict: the immediate stall as state is flushed and reloaded, and the long tail of deferred costs as the instruction caches reload to the new working set. The numbers you see published for mispredict costs show only the first cost, and if the transfer doesn’t change to a new working set then that is the only cost. But (neglecting task swap) any time you see a flurry of instruction cache misses you are seeing a change in working set, and the flurry was preceded by a mispredict and the costs of the cache misses are part of the cost of that prediction miss.

Mill addresses the immediate cost with short pipes and little state; mispredict stall runs a third to a fifth of that of legacy machines. For many applications that’s the only cost that matters, but for others the subsequent fetch-miss cost is very significant, and Mill run-ahead prediction addresses that.

The cost is hardware table size. Because the Mill is configurable, different Mill members can adopt different predictors. The video describes the expected default predictor configuration. A Mill for a market that cares only about price (== chip area) and not about latency can be configured with a conventional predictor, or no predictor at all as is the case of some embedded designs.

  • This reply was modified 7 years, 1 month ago by  Ivan Godard. Reason: typo