Here’s what the compiler gives you today for your example (Silver, no pipelining or vectorization, -O2):
F("f") %0 %1;
load(dp, gl("n"), w, 3) %4/27 ^14,
scratchf(s[1]) ^11;
// V%0 V%1
spill(s[0], b0 %0/26) ^10,
spill(s[4], b1 %1/29) ^10;
// ^%0 ^%1
spill(s[1], b0 %4/27) ^14;
// L7C24F0=/home/ivan/mill/specializer/src/test7.c
// V%4 ^%4
nop();
nop();
gtrsb(b1 %4/27, 0) %5 ^15, // L7C22F0
brfls(b0 %5, "f$0_3", b1 %2, b1 %2) ^16, // L7C4F0
rd(w(0)) %2 ^12;
// V%2 ^%4 V%5 ^%5 ^%2 ^%2
load(dp, gl("A1"), d, 3) %9/28 ^21;
nop();
spill(s[1], b0 %9/28) ^21;
// V%9 ^%9
nop();
nop();
inners("f$1_2", b2 %2, b2 %2, b2 %2) ^22; // L7C4F0
// ^%2 ^%2 ^%2
L("f$0_3") %20 %21;
// for.cond.cleanup
mul(b1 %20, b0 %29/1) %22 ^38, // L12C20F0
fill(s[4], w) %29/1 ^10;
// V%21 V%20 V%29 ^%20 ^%29
nop();
nop(); // V%22
mul(b0 %22, b3 %21) %23 ^39; // L12C27F0
// ^%22 ^%21
nop();
retn(b0 %23) ^40; // L12C8F0
// V%23 ^%23
L("f$1_2") %10 %11 %12;
// for.body // loop=$2 header
mul(b3 %11, b1 %26/0) %14 ^28, // L8C26F0
loadus(b0 %28/9, 0, b3 %11, w, 3) %13 ^27,
fill(s[0], w) %26/0 ^10,
fill(s[1], d) %28/9 ^21;
// V%11 V%10 V%12 V%26 V%28 ^%11 ^%28 ^%26 ^%11
nop();
nop(); // V%13 V%14
mul(b0 %14, b1 %13) %15 ^29; // L9C23F0
// ^%14 ^%13
nop();
nop(); // V%15
fma(b0 %15, b6 %11, b7 %12) %17 ^31; // L10C15F0
// ^%12 ^%11 ^%15
add(b6 %11, 1) %18 ^32; // L7C28F0
// ^%11 V%18
lsssb(b1 %18, b0 %27/4) %19 ^33, // L7C22F0
add(b2 %15, b7 %10) %16 ^30, // L9C14F0
backtr(b1 %19, "f$1_2", b0 %16, b4 %18, b2 %17) ^34, // L7C4F0
leaves("f$0_3", b2 %17, b0 %16) ^34, // L7C4F0
fill(s[1], w) %27/4 ^14;
// L7C24F0=/home/ivan/mill/specializer/src/test7.c
// V%27 ^%18 ^%27 ^%10 ^%15 V%19 V%17 V%16 ^%19 ^%16 ^%18 ^%17 ^%17 ^%16
Your three EBBs (extended basic blocks) are there; f$1_2 is the loop body. The back branch is backtr() and the exit is leaves(). Both take belt arguments (%N…), which are passed to the formals at the start of the target EBB (the formals are listed after the L(…) label). The initial entry is the inners() op, which supplies the starting values for the formals.
Branch argument passing works like call argument passing: there is no actual movement of data, just a renumbering of the belt to put the actuals into the formals order. Absent a mispredict, a branch (including the back/inner/leave forms) takes zero time, including passing the arguments.
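If it helps to picture the renumbering, here is a toy model in Python. It is only a sketch, not Mill code and not a description of the hardware: the class ToyBelt and its methods are made up for illustration, the values are arbitrary labels, and the order in which the formals land on the belt is an assumption of the toy. It mimics the argument list of the backtr() above (b0, b4, b2) and shows that the stored values are never copied; only the mapping from belt position to storage changes.

```python
class ToyBelt:
    """Toy belt: positions are just names for slots in a fixed value store."""

    def __init__(self, values):
        # "Physical" storage; a branch never copies or rewrites these values.
        self.store = list(values)
        # names[i] is the store index currently visible as belt position b<i>.
        self.names = list(range(len(values)))

    def read(self, pos):
        """Return the value currently visible at belt position b<pos>."""
        return self.store[self.names[pos]]

    def branch(self, arg_positions):
        """Renumber so the chosen actuals become b0, b1, ... at the target."""
        self.names = [self.names[p] for p in arg_positions]


# Belt positions b0..b4 before the branch; the numbers are arbitrary labels.
belt = ToyBelt([16, 19, 17, 99, 18])
store_before = list(belt.store)

# Pass b0, b4 and b2 as the three formals of the target (hypothetical order).
belt.branch([0, 4, 2])

print(belt.read(0), belt.read(1), belt.read(2))  # 16 18 17
print(belt.store == store_before)                # True
```

The point of the toy is the last line: the store is untouched, so the cost of passing arguments does not depend on how many there are.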
The imported loop-constant values “n” and “x” are first spilled to the scratchpad in the prologue, and then filled for each iteration. It would also be possible to pass the values around the loop as an additional formal and back() argument. The specializer did not do that in this case; had it done so, the latency would be the same, the entropy slightly less, and the power usage hardware dependent.
Incidentally, the loop body here is nine cycles (count the semicolons); with pipelining enabled it drops to three.
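If you would rather not count by eye, the semicolon rule is easy to script. The sketch below is a throwaway Python helper (count_instructions is a made-up name, not part of any Mill tool). It assumes, from the listing above, that each instruction ends with a semicolon, that ops within one instruction are separated by commas, that // starts a comment, and that the L(…) and F(…) label lines end in a semicolon but are not instructions.

```python
# The loop body EBB, copied verbatim from the listing above.
LOOP_BODY = r'''
L("f$1_2") %10 %11 %12;
// for.body // loop=$2 header
mul(b3 %11, b1 %26/0) %14 ^28, // L8C26F0
loadus(b0 %28/9, 0, b3 %11, w, 3) %13 ^27,
fill(s[0], w) %26/0 ^10,
fill(s[1], d) %28/9 ^21;
// V%11 V%10 V%12 V%26 V%28 ^%11 ^%28 ^%26 ^%11
nop();
nop(); // V%13 V%14
mul(b0 %14, b1 %13) %15 ^29; // L9C23F0
// ^%14 ^%13
nop();
nop(); // V%15
fma(b0 %15, b6 %11, b7 %12) %17 ^31; // L10C15F0
// ^%12 ^%11 ^%15
add(b6 %11, 1) %18 ^32; // L7C28F0
// ^%11 V%18
lsssb(b1 %18, b0 %27/4) %19 ^33, // L7C22F0
add(b2 %15, b7 %10) %16 ^30, // L9C14F0
backtr(b1 %19, "f$1_2", b0 %16, b4 %18, b2 %17) ^34, // L7C4F0
leaves("f$0_3", b2 %17, b0 %16) ^34, // L7C4F0
fill(s[1], w) %27/4 ^14;
// L7C24F0=/home/ivan/mill/specializer/src/test7.c
// V%27 ^%18 ^%27 ^%10 ^%15 V%19 V%17 V%16 ^%19 ^%16 ^%18 ^%17 ^%17 ^%16
'''


def count_instructions(listing):
    """Count instructions (one per issue cycle) by their terminating semicolons."""
    count = 0
    for line in listing.splitlines():
        # Drop everything from the first // onward, then surrounding whitespace.
        code = line.split("//", 1)[0].strip()
        # Skip blank/comment-only lines and the L(...) / F(...) label lines.
        if not code or code.startswith(("L(", "F(")):
            continue
        count += code.count(";")
    return count


print(count_instructions(LOOP_BODY))  # 9
```

The count covers issue slots in the unpipelined body only, nops included; it says nothing about what the pipelined schedule looks like.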