Mill Computing, Inc

Keymaster

April 1, 2019 at 7:35 am

Here’s what the compiler gives you today for your example (Silver, no pipelining or vectorization, -O2):

F("f") %0 %1;
    load(dp, gl("n"), w, 3) %4/27 ^14,
    scratchf(s[1]) ^11;
                // V%0 V%1

    spill(s[0], b0 %0/26) ^10,
    spill(s[4], b1 %1/29) ^10;
                // ^%0 ^%1

    spill(s[1], b0 %4/27) ^14;
        // L7C24F0=/home/ivan/mill/specializer/src/test7.c
                // V%4 ^%4

    nop();
    nop();
    gtrsb(b1 %4/27, 0) %5 ^15,                                  // L7C22F0
    brfls(b0 %5, "f$0_3", b1 %2, b1 %2) ^16,                    // L7C4F0
    rd(w(0)) %2 ^12;
                // V%2 ^%4 V%5 ^%5 ^%2 ^%2

    load(dp, gl("A1"), d, 3) %9/28 ^21;

    nop();
    spill(s[1], b0 %9/28) ^21;
                // V%9 ^%9

    nop();
    nop();
    inners("f$1_2", b2 %2, b2 %2, b2 %2) ^22;                   // L7C4F0
                // ^%2 ^%2 ^%2

L("f$0_3") %20 %21;
        // for.cond.cleanup
    mul(b1 %20, b0 %29/1) %22 ^38,                              // L12C20F0
    fill(s[4], w) %29/1 ^10;
                // V%21 V%20 V%29 ^%20 ^%29

    nop();
    nop();      // V%22
    mul(b0 %22, b3 %21) %23 ^39;                                // L12C27F0
                // ^%22 ^%21

    nop();
    retn(b0 %23) ^40;                                           // L12C8F0
                // V%23 ^%23

L("f$1_2") %10 %11 %12;
        // for.body // loop=$2 header
    mul(b3 %11, b1 %26/0) %14 ^28,                              // L8C26F0
    loadus(b0 %28/9, 0, b3 %11, w, 3) %13 ^27,
    fill(s[0], w) %26/0 ^10,
    fill(s[1], d) %28/9 ^21;
                // V%11 V%10 V%12 V%26 V%28 ^%11 ^%28 ^%26 ^%11

    nop();
    nop();      // V%13 V%14
    mul(b0 %14, b1 %13) %15 ^29;                                // L9C23F0
                // ^%14 ^%13

    nop();
    nop();      // V%15
    fma(b0 %15, b6 %11, b7 %12) %17 ^31;                        // L10C15F0
                // ^%12 ^%11 ^%15

    add(b6 %11, 1) %18 ^32;                                     // L7C28F0
                // ^%11 V%18

    lsssb(b1 %18, b0 %27/4) %19 ^33,                            // L7C22F0
    add(b2 %15, b7 %10) %16 ^30,                                // L9C14F0
    backtr(b1 %19, "f$1_2", b0 %16, b4 %18, b2 %17) ^34,        // L7C4F0
    leaves("f$0_3", b2 %17, b0 %16) ^34,                        // L7C4F0
    fill(s[1], w) %27/4 ^14;
        // L7C24F0=/home/ivan/mill/specializer/src/test7.c
                // V%27 ^%18 ^%27 ^%10 ^%15 V%19 V%17 V%16 ^%19 ^%16 ^%18 ^%17 ^%17 ^%16

Your three ebbs (extended basic blocks) are there: the entry ebb f, the exit ebb f$0_3, and the loop body f$1_2. The back branch is backtr() and the exit is leaves(). Both take belt arguments (%N…), which are passed to the formals at the start of the target ebb (the formals are listed after the L(…) label). The initial entry is the inners() op, which supplies the starting values for the formals.
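
To make the three-block shape easier to see, here is a rough C sketch of a loop of the same general form, written once normally and once with the blocks and their formals spelled out as labels and gotos. It is not a reconstruction of test7.c: the names demo_n, demo_a, demo_loop and demo_ebbs and the arithmetic are made up for illustration, and the mapping of each goto to a Mill op is only approximate.

```c
#include <stdint.h>

/* Hypothetical globals; names and values are invented for this sketch. */
static int32_t demo_n = 4;
static int32_t demo_a[4] = {1, 2, 3, 4};

/* Ordinary structured form of the loop. */
int32_t demo_loop(int32_t x) {
    int32_t sum = 0;
    for (int32_t i = 0; i < demo_n; ++i)
        sum += demo_a[i] * x;
    return sum;
}

/* Same computation with the three blocks and their formals made explicit. */
int32_t demo_ebbs(int32_t x) {
    int32_t i, sum;          /* formals of the body and exit blocks */

    /* entry block: test the trip count once, then enter or skip the loop */
    if (!(demo_n > 0)) {     /* roughly the role of brfls() */
        sum = 0;
        goto cleanup;
    }
    i = 0;                   /* roughly the role of inners(): */
    sum = 0;                 /* supplies the initial formals  */
    goto body;

body:                        /* loop-body block, formals: i, sum */
    sum += demo_a[i] * x;
    i += 1;
    if (i < demo_n)
        goto body;           /* roughly the role of backtr() */
    goto cleanup;            /* roughly the role of leaves() */

cleanup:                     /* exit block, formal: sum */
    return sum;
}
```

With these made-up values, demo_ebbs(3) and demo_loop(3) both return 30. The point is only that every edge into a block carries the values that block needs, which is what the formals after each L(…) label express.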

Branch argument passing works like call argument passing: there is no actual movement of data, just a renumbering of the belt to put the actuals into the formals order. Absent a mispredict, a branch (including the back/inner/leave forms) takes zero time, including passing the arguments.
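
As a toy model of "renumbering, not moving", the C sketch below keeps values in fixed storage and represents belt positions as a small index map; passing arguments rewrites only the map. This is an illustration of the idea, not a description of the hardware. The Belt type, BELT_LEN, the function names, and the choice that the first actual becomes position 0 and that only the actuals stay visible after the branch are all assumptions of the sketch.

```c
enum { BELT_LEN = 8 };       /* arbitrary length chosen for the sketch */

typedef struct {
    long values[BELT_LEN];   /* physical storage; never rewritten by a branch */
    int  slot[BELT_LEN];     /* slot[k] = storage index seen at belt position k */
    int  live;               /* number of valid belt positions */
} Belt;

/* Load an initial belt: position k sees src[k]. Requires count <= BELT_LEN. */
void belt_load(Belt *b, const long *src, int count) {
    for (int k = 0; k < count; ++k) {
        b->values[k] = src[k];
        b->slot[k] = k;
    }
    b->live = count;
}

/* Read the value currently visible at a belt position. */
long belt_read(const Belt *b, int pos) {
    return b->values[b->slot[pos]];
}

/* Branch with arguments: actuals[] lists the belt positions being passed.
 * Afterwards position k names the k-th actual. Only the map changes. */
void belt_branch(Belt *b, const int *actuals, int count) {
    int renamed[BELT_LEN];
    for (int k = 0; k < count; ++k)
        renamed[k] = b->slot[actuals[k]];   /* resolve through the old map */
    for (int k = 0; k < count; ++k)
        b->slot[k] = renamed[k];
    b->live = count;
}
```

For example, loading {10, 20, 30, 40} and then branching with actuals {2, 0} leaves position 0 reading 30 and position 1 reading 10, while the values array is untouched.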

The imported loop-constant values “n” and “x” are first spilled to the scratchpad in the prologue, and then filled on each iteration. It would also be possible to pass the values around the loop as an additional formal and backtr() argument. The specializer did not do that in this case; the latency would be the same, entropy slightly less, and power usage hardware dependent.
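
The two options can be contrasted in a small C sketch. In the first variant the loop constant is parked in a variable standing in for a scratchpad slot and re-read every iteration; in the second it travels with the other loop-carried values as one more formal. The function names, the Formals struct and scratch_x are hypothetical, and a C compiler is of course free to turn both into the same machine code; the sketch only shows where the value lives from the loop's point of view.

```c
#include <stdint.h>

/* Stand-in for one scratchpad slot (hypothetical name). */
static int32_t scratch_x;

/* Variant A: spill once in the prologue, fill on every iteration. */
int32_t sum_scaled_fill(const int32_t *a, int32_t n, int32_t x) {
    scratch_x = x;                    /* spill in the prologue */
    int32_t i = 0, sum = 0;           /* loop formals: i, sum */
    while (i < n) {
        int32_t xf = scratch_x;       /* fill at the top of the body */
        sum += a[i] * xf;
        i += 1;
    }
    return sum;
}

/* Variant B: carry x around the loop as an extra formal. */
typedef struct {
    int32_t i;
    int32_t sum;
    int32_t x;                        /* the loop constant, passed along */
} Formals;

int32_t sum_scaled_carried(const int32_t *a, int32_t n, int32_t x) {
    Formals f = { 0, 0, x };          /* initial entry supplies all three */
    while (f.i < n) {
        /* arguments handed to the next iteration on the back branch */
        Formals next = { f.i + 1, f.sum + a[f.i] * f.x, f.x };
        f = next;
    }
    return f.sum;
}
```

Both variants return the same result for the same inputs, for example 30 for the array {1, 2, 3} with n = 3 and x = 5. The difference is only whether x is re-read from storage each trip or passed along with i and sum.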

Incidentally, the loop body here is nine cycles (count the semicolons); with pipelining enabled it drops to three.
