Forum Topic: Prediction
Talk by Ivan Godard – 2013-11-12 at IEEE CS Santa Clara Valley
NOTE: The slides require genuine Microsoft PowerPoint to view; open-source PowerPoint clones cannot show the animations, which are essential to the slide content. If you do not have access to PowerPoint, watch the video, which shows the slides as intended.
Slides: PowerPoint (.pptx)
Run-ahead transfer prediction in the Mill CPU architecture
Programs frequently execute only a handful of operations between transfers of control: branches, calls, and returns. Yet modern wide-issue VLIW and superscalar CPUs can issue similar handfuls of operations every cycle, so the hardware must be able to change to a new point of execution each cycle if performance is not to suffer from stalls. Changing the point of execution requires determining the new execution address, fetching instructions at that address from the memory hierarchy, decoding the instructions, and issuing them—steps that can take tens of cycles on modern out-of-order machines. Without hardware help, a machine could take 20 cycles to transfer, for just one cycle of actual work.
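The cost in that last sentence can be made concrete with a little arithmetic. The sketch below assumes a deliberately simple model, not taken from the talk, in which every burst of work is followed by one fully exposed transfer; the function name is illustrative.

```python
def useful_fraction(work_cycles: int, transfer_cycles: int) -> float:
    """Fraction of all cycles spent on real work when each burst of work
    is followed by one transfer whose latency is not hidden at all."""
    return work_cycles / (work_cycles + transfer_cycles)


# The figures from the paragraph above: 1 cycle of work per 20-cycle transfer.
fraction = useful_fraction(work_cycles=1, transfer_cycles=20)
print(f"{fraction:.1%}")  # prints 4.8%
```

Under this model the machine does useful work in fewer than one cycle out of twenty, which is why some form of prediction is needed.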
The branch predictor hardware in conventional out-of-order processors does help a lot. It attempts to predict the taken vs. untaken state of conditional branches based on historical behavior of the same branch in earlier executions. Modern predictors achieve 95% accuracy, and large instruction-decode windows can hide top-level cache latency. Together these effects are sufficient for small, regular programs such as benchmarks. On real-world problems, however, today’s CPUs can spend a third or more of their cycles stalled for instructions.
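As an illustration of predicting taken vs. untaken from a branch's own history, here is a minimal sketch of a two-bit saturating counter, a classic textbook scheme. It is not claimed to be what any particular CPU or the Mill uses; the class name, the initial counter state, and the example address are all assumptions made for illustration.

```python
class TwoBitPredictor:
    """Per-branch two-bit saturating counter: states 0-1 predict
    not taken, states 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state in 0..3

    def predict(self, address: int) -> bool:
        # Branches never seen before start in state 1 (weakly not taken).
        return self.counters.get(address, 1) >= 2

    def update(self, address: int, taken: bool) -> None:
        state = self.counters.get(address, 1)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[address] = state


def accuracy(predictor: TwoBitPredictor, address: int, outcomes: list) -> float:
    """Replay a history of outcomes for one branch and score the predictor."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(outcomes)


# A loop-closing branch: taken nine times, then not taken, repeated ten times.
outcomes = ([True] * 9 + [False]) * 10
print(accuracy(TwoBitPredictor(), 0x400, outcomes))
```

On this regular pattern the counter mispredicts only the first iteration and each loop exit, giving 89 correct predictions out of 100. Irregular control flow gives a history-based scheme much less to work with.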
The Mill uses a novel prediction mechanism to avoid these problems; it predicts transfers rather than branches. It can do so for all code, including code that has never been executed, running well ahead of execution so as to mask all cache latency and most memory latency. It needs no area- and power-hungry instruction window, using instead a very short decode pipeline and direct in-order issue and execution. It can use all present and future prediction algorithms, with the same accuracy as any other processor. When a prediction is wrong, the mispredict penalty is four cycles, a quarter that of superscalar designs. As a result, code stall is a rarity on a Mill, even on large programs with irregular control flow.
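The effect of the smaller penalty can be estimated from the figures quoted in this abstract: 95% prediction accuracy, a four-cycle Mill penalty, and a superscalar penalty four times larger (16 cycles). The sketch below is a back-of-the-envelope model that counts only the mispredict penalty and ignores cache and memory stalls; the function name is illustrative.

```python
def expected_penalty(accuracy: float, mispredict_cycles: int) -> float:
    """Average cycles lost per transfer to mispredictions alone."""
    return (1.0 - accuracy) * mispredict_cycles


ACCURACY = 0.95          # same predictor accuracy assumed for both machines
MILL_PENALTY = 4         # cycles, as stated above
SUPERSCALAR_PENALTY = 4 * MILL_PENALTY  # "a quarter that of superscalar designs"

print(f"Mill:        {expected_penalty(ACCURACY, MILL_PENALTY):.2f} cycles per transfer")
print(f"Superscalar: {expected_penalty(ACCURACY, SUPERSCALAR_PENALTY):.2f} cycles per transfer")
```

With equal accuracy, the expected loss per transfer scales directly with the penalty: 0.20 cycles versus 0.80 cycles in this model.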
The talk describes the prediction mechanism of the Mill and compares it with the conventional approach.
Speaker bio
Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIP Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE 754-2008 floating-point standard, adopted by ISO as ISO/IEC/IEEE 60559:2011.
Ivan is currently CTO at Mill Computing, a startup now emerging from stealth mode. Mill Computing has developed the Mill, a clean-sheet rethink of general-purpose CPU architectures. The Mill is the subject of this talk.