Summarizing Mill prediction:
Most machines predict branches (taken/not taken) and have a separate table (the Branch Target Table, or BTT) holding the actual target addresses for dynamic branches. These tables are consulted only when execution reaches the decoder, and so must be very fast and hence small, especially the BTT, which causes pain in code with lots of indirect branches such as virtual function dispatch.
The Mill predicts exits, not branches. Mill code is organized into Extended Basic Blocks (EBBs), one-entry-many-exit code sequences. Unusually, the Mill form of EBB may contain embedded calls. If you enter an EBB, then no matter how many exit branches the EBB contains, only one will be taken, and the Mill predicts that exit, including the target address. In effect, the Mill has only a BTT and no branch table at all, and no entries for exits not predicted to be taken.
The Mill organization permits run-ahead prediction – we don’t need to look at the code to follow a prediction chain. Thus the tables can be well away from the decoders, and hence can be big, slow, and cheap.
Now, that Mill advantage assumes that things actually are predictable; it doesn’t buy anything (no predictor does) when control flow is random and you miss all the time. Cliff Click calls such code megamorphic, and besides virtual function dispatch, the poster child for megamorphic code is the dispatch loop/switch of interpreters.
Consequently a performance interpreter, on the Mill or anything else, has to avoid switch statements in the dispatcher. It is far better to take a little more time in encoding or translation to change op indices into actual jump/call instructions in execution order. Even if the result of a JIT is just a sequence of call ops to exec functions that are not inlined, the JITed code will run *much* better than if the original byte codes were indices into a switch.
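To make the contrast concrete, here is a minimal C sketch; the toy opcodes and exec functions are invented for illustration. The switch dispatcher funnels every opcode through a single megamorphic indirect branch, while the translated form is just straight-line direct calls in execution order, each one trivially predictable.

```c
#include <stddef.h>

enum { OP_PUSH1, OP_ADD, OP_HALT };            /* toy bytecode */

typedef struct { int stack[16]; int sp; } VM;

/* exec functions, one per opcode */
static void exec_push1(VM *vm) { vm->stack[vm->sp++] = 1; }
static void exec_add(VM *vm)   { int b = vm->stack[--vm->sp];
                                 vm->stack[vm->sp - 1] += b; }

/* Switch dispatch: one indirect branch serves every opcode, so its
 * target history is megamorphic and it mispredicts constantly. */
static int run_switch(const unsigned char *code, VM *vm) {
    for (size_t pc = 0;; pc++) {
        switch (code[pc]) {
        case OP_PUSH1: exec_push1(vm); break;
        case OP_ADD:   exec_add(vm);   break;
        case OP_HALT:  return vm->stack[vm->sp - 1];
        }
    }
}

/* What a translator would emit for {PUSH1, PUSH1, ADD}: a sequence of
 * direct calls in execution order, with no dispatch branch left at all. */
static int run_translated(VM *vm) {
    exec_push1(vm);
    exec_push1(vm);
    exec_add(vm);
    return vm->stack[vm->sp - 1];
}
```

Both routes compute the same result; only the predictability of the control flow differs.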
Of course, if everything gets inlined then you have one machine EBB for each EBB that is present in the (Java, say) bytecode source – and you run like a house afire.
There are two ways to do the inlining. The slow way, which is member-independent and will give the best performance, is to turn each bytecode into a function call on the exec function, all represented either in genAsm or directly in genForm. Mark all the exec functions (well, the simple ones anyway) as “inline”, use the genAsm and specializer libraries to get your target binary, and then call the OS to mark the result as executable. Not fast. But of course, you can do it once per package at package-load time, and thus amortize the cost.
The fast way to do the inlining, which is member-dependent and will not give as good performance as the other route, is to look up in a pre-built table (one per member) the actual raw binary bit pattern of a Mill instruction. That instruction contains a call on the exec function and whatever belt diddling is needed for the interpretation. Table entries will have two pieces, one for the flow side and one for the exu side, and your baby-JIT just glues the pieces onto the correct end of the EBB being built. When you hit the end of an EBB in the bytecode, give your new binary EBB to the OS to bless as code and start on the next. If you can’t tell where the EBB boundaries are in the bytecode, just treat each basic block (which you should be able to delimit) as an EBB.
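The gluing step can be sketched as follows. The fragment sizes and the two-buffer layout are assumptions for illustration; a real per-member table would hold actual Mill instruction encodings, and the two streams would be emitted in the member’s real instruction format.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-bytecode table entry: two raw fragments, one for
 * the Mill's flow side and one for its exu side. */
typedef struct {
    unsigned char flow[4]; size_t flow_len;
    unsigned char exu[4];  size_t exu_len;
} Fragment;

/* The EBB under construction: one growing buffer per side. */
typedef struct {
    unsigned char flow_buf[256]; size_t flow_len;
    unsigned char exu_buf[256];  size_t exu_len;
} EbbBuilder;

/* Glue each fragment onto the correct end of the EBB being built:
 * flow-side pieces accumulate in one stream, exu-side in the other. */
static void append_op(EbbBuilder *b, const Fragment *f) {
    memcpy(b->flow_buf + b->flow_len, f->flow, f->flow_len);
    b->flow_len += f->flow_len;
    memcpy(b->exu_buf + b->exu_len, f->exu, f->exu_len);
    b->exu_len += f->exu_len;
}
```

At an EBB boundary the two buffers would be handed to the OS to bless as code, the builder reset, and translation continued with the next EBB.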
Easy, huh? 🙂