Mill Computing, Inc

Keymaster

January 8, 2014 at 12:25 pm


Branch operations can carry an explicit delay, the way loads can. A delay of zero may be omitted and indicates an immediate branch, which takes effect in the following cycle, as described. A delay of one takes effect in the cycle after that, and so on. As a result, several branches may be in flight at the same time, each coming due in its own cycle.
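To make the timing concrete, here is a minimal Python sketch of the delay arithmetic. It is only a model of the rule as stated above, not of the hardware; the names transfer_cycle and in_flight and the (issue_cycle, delay) tuples are made up for illustration.

```python
def transfer_cycle(issue_cycle, delay=0):
    """Cycle in which a branch issued at issue_cycle takes effect.

    Per the rule above: delay 0 -> the following cycle,
    delay 1 -> the cycle after that, and so on.
    """
    return issue_cycle + 1 + delay


def in_flight(branches, cycle):
    """Branches (issue_cycle, delay) issued by `cycle` but not yet in effect."""
    return [b for b in branches if b[0] <= cycle < transfer_cycle(*b)]


# Hypothetical schedule: a delay-2 branch issued in cycle 10 and an
# immediate branch issued in cycle 11 overlap during cycle 11.
branches = [(10, 2), (11, 0)]
overlapping = in_flight(branches, 11)
```

Predicates are left out here; this only shows how more than one branch can be pending at once.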

Instructions may contain several different branch operations. For sensible code, these will normally use different predicates and there will normally be at most one unconditional branch. All predicates are evaluated, and any branches with unsatisfied predicates are ignored.

For any given cycle, there will be a (possibly empty) set of branches which are due to retire in that cycle: some with zero delay from the current instruction, and some from prior instructions that are at last timing out. One of these retiring branches wins, according to the First Winner Rule. FWR says that shorter delay beats longer delay, and for equal delay higher slot number beats lower slot number. Operations in asm are packed into slots in reverse textual order, so slot zero is on the right of the instruction as written. Consequently you can read the code and the textually first branch that is satisfied will be the one taken; hence First Winner Rule.
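The selection itself can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of the actual hardware: the Branch record, the first_winner function and the slot numbers 2, 1, 0 are all invented for illustration, with the textually first branch given the highest slot as the rule describes. The dispatch helper mirrors the three-way goto example in this post.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    target: str
    slot: int           # higher slot = textually earlier in the instruction
    delay: int = 0      # delay the branch was issued with
    taken: bool = True  # predicate result; unconditional branches are always taken


def first_winner(retiring):
    """Pick the winning target among branches retiring in the same cycle.

    Branches with unsatisfied predicates are ignored. Among the rest,
    shorter delay wins; for equal delay, the higher slot number wins.
    Returns None when nothing is taken (execution falls through).
    """
    candidates = [b for b in retiring if b.taken]
    if not candidates:
        return None
    return min(candidates, key=lambda b: (b.delay, -b.slot)).target


def dispatch(a):
    """Three branches in one instruction, all with zero delay."""
    return first_winner([
        Branch("A0", slot=2, taken=(a == 0)),
        Branch("A1", slot=1, taken=(a == 1)),
        Branch("Arest", slot=0),  # unconditional, lowest priority
    ])
```

Note that the unconditional branch is always a candidate; it only wins when neither predicate is satisfied.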

As an example:

if (a == 0) goto A0;
else if (a == 1) goto A1;
else goto Arest;

codes as

eql(<a>, 0), eql(<a>, 1);
brtr(b0, "A0"), brtr(b1, "A1"), br("Arest");

With phasing (due in the 2/5 talk) the code can be reduced to a single instruction rather than the two shown here. I use goto for simplicity here, but the same structure is used for any branches.

Loops are no different:
while (--i != 0) { a += a; }
can be encoded as:

L("loop");
sub(<i>, 1);
eql(<sub>, 0);
brtr(<eql>, "xit");
add(<a>, <a>), br("loop");

but would be better coded as:

L("loop");
sub(<i>, 1);
eql(<sub>, 0);
add(<a>, <a>), brfl(<eql>, "loop");
//falls through on exit

Yes, the second form does the add an extra time on the last iteration, but the former value is still on the belt and is the correct value for “a” at the end. The belt permits optimizations like this, which would not be possible if the a += a were updating a register. Combine this optimization with phasing and a NYF operation and the whole loop is one instruction.
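If the extra add looks suspicious, a quick Python check shows why it is harmless. This is only a sketch: the list named drops stands in for the successive values of “a” dropped on the belt, the function names are invented, and it assumes i starts at 1 or more and that the former value has not yet fallen off the end of the belt.

```python
def loop_reference(a, i):
    """Direct transcription of: while (--i != 0) { a += a; }  (assumes i >= 1)"""
    i -= 1
    while i != 0:
        a += a
        i -= 1
    return a


def loop_belt_form(a, i):
    """Model of the second encoding, where the add runs on every pass."""
    drops = [a]  # successive values of a, newest last
    while True:
        i -= 1                                # sub(<i>, 1)
        done = (i == 0)                       # eql(<sub>, 0)
        drops.append(drops[-1] + drops[-1])   # add(<a>, <a>) always executes
        if done:                              # brfl not taken: fall through
            break
    # The last add was one too many; its operand is the value we want,
    # and nothing overwrote it.
    return drops[-2]
```

With a register the final a += a would have destroyed the value needed at exit; here the older drop is simply read instead.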

I’m not sure what a flat-out OOO superscalar would do with this code, but it clearly would not be better than the Mill 🙂

This reply was modified 10 years, 6 months ago by Ivan Godard.
