Loop compilation

Author
Posts
Witold Baryluk
Participant
December 16, 2018 at 8:04 am
Post count: 33
#3365 |
Hi,
If I have a C code that does something like this (at bit more than extremely primitive trivial loop):
```
extern int n;
extern int* A1;

int f(int x, int y) {
  int tmp1 = 0;
  int tmp2 = 0;
  for (int i = 0; i < n; i++) {
    int mytmp = A1[i] * i;
    tmp1 += mytmp * x;
    tmp2 += mytmp * x * i;
  }
  return tmp1 * tmp2 * y;
}
```
and assuming no unrolling and pipelining and other optimization, my understanding is that it will be separated into 3 EBBs:
<init_ebb> % setup the belt with tmp1 and tmp2 for loop_ebb and epilog_ebb, but also with x and y.
<loop_ebb> % carry stuff over, do computations, and at the end restore belt into proper shape for loop_ebb reentry or epilog_ebb
<epilog_ebb> % trivial consume stuff from the belt and return
with loop_ebb needing to carry over the tmp1, tmp2, x, y, even if it temporarily used belt to store some additional value, or didn’t even use y.
My question: What kind of operation would be used at the exit of the loop_ebb to jump again to loop_ebb or epilog_ebb ? how the belt would be configured for loop_ebb, so the epilog_ebb can see properly all used variables? I also guess that the n, usually will be loaded from memory once by init_ebb, and put on belt too (lets assume the compiler figured out there is no aliasing, or the n to change during loop execution), and preserved across calls to loop_ebb and epilog_ebb, and ultimately ignored by epilog_ebb.
I understand that first step would be to convert this into SSA form, but at the same time, it feels the end result will look like a tail call style looping from functional languages, and the issue of moving arguments on belt to be on the same positions as before (so next EBB works correctly), could lead to a big per loop iteration overhead (especially if there is a lot of carry over values, even if they are not used or updated in the loop itself, like ‘y’ above.).
A discussion, or output of the current Mill conAsm could help too. 🙂
Thanks!
Veedrac
Participant
December 18, 2018 at 1:01 am
Post count: 25
#3374
No optimizations means you have a hideous long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix but for sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like so, with differences down to a lack of an up-to-date reference, and that I don’t really get how nops work.
```
    L("loop") %A1 %x %i %nsubi %tmp1 %tmp2;
        load(%A1, 0, $i, 4, 4) %a,
        mul(%x,  %i) %xi;

        nop; nop;

        mul(%xi, %i) %xii;

        nop; nop;

        mul(%a, %xi)  %tmp1inc,
        mul(%a, %xii) %tmp2inc;

        nop; nop;

        add1u(%i) %i_,
        sub1u(%nsubi) %nsubi_,
        eql(),
        add(%tmp1, %tmp1inc) %tmp1_,
        add(%tmp2, %tmp2inc) %tmp2_;
        brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
```
Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
- This reply was modified 5 years, 7 months ago by Veedrac.
Veedrac
Participant
December 18, 2018 at 1:03 am
Post count: 25
#3376
No optimizations means you have a hideous long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix but for sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like so, with differences down to a lack of an up-to-date reference, and that I don’t really get how nops work.
```
    L("loop") %A1 %x %i %nsubi %tmp1 %tmp2;
        load(%A1, 0, $i, 4, 4) %a,
        mul(%x,  %i) %xi;

        nop; nop;

        mul(%xi, %i) %xii;

        nop; nop;

        mul(%a, %xi)  %tmp1inc,
        mul(%a, %xii) %tmp2inc;

        nop; nop;

        add1u(%i) %i_,
        sub1u(%nsubi) %nsubi_,
        eql(),
        add(%tmp1, %tmp1inc) %tmp1_,
        add(%tmp2, %tmp2inc) %tmp2_;
        brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
```
Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
Veedrac
Participant
December 18, 2018 at 1:05 am
Post count: 25
#3377
(reposting due to forum issues; apologies if this spams)
No optimizations means you have a hideous long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix but for sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like so, with differences down to a lack of an up-to-date reference, and that I don’t really get how nops work.
```
    L("loop") %A1 %x %i %nsubi %tmp1 %tmp2;
        load(%A1, 0, $i, 4, 4) %a,
        mul(%x,  %i) %xi;

        nop; nop;

        mul(%xi, %i) %xii;

        nop; nop;

        mul(%a, %xi)  %tmp1inc,
        mul(%a, %xii) %tmp2inc;

        nop; nop;

        add1u(%i) %i_,
        sub1u(%nsubi) %nsubi_,
        eql(),
        add(%tmp1, %tmp1inc) %tmp1_,
        add(%tmp2, %tmp2inc) %tmp2_;
        brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
```
Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
- Witold Baryluk
  Participant
  December 18, 2018 at 4:38 am
  Post count: 33
  #3378
  Hi Veedrac. This is really helpful! Thanks.
  So, when we also include the ‘y’, and not use spiller it will be something like this?
  L("init") load(&A1, 0, 0, 4, 0) %A1, or(x, 0) %x, xor(0, 0) %i, load(&n, 0, 0, 4, 0) %nsubi, xor(0, 0) %tmp1, xor(0, 0) %tmp2, or(y, 0) %y; % please ignore the exact order of ops here, load delays, or exceeding number of AGU and ALU FUs. <stalling few cycles for loads of A1 and nsubi to finish> eql(%nsubi); brtr("epilog", %A1 %x %i %nsubi %tmp1 %tmp2 %y); % %A1 and %nsubi are not needed, but whatever. L("loop") %A1 %x %i %nsubi %tmp1 %tmp2 %y; load(%A1, 0, $i, 4, 4) %a, % shouldn't be "load(%A1, 0, $i, 4, 5)"? mul(%x, %i) %xi; nop; nop; mul(%xi, %i) %xii; nop; nop; mul(%a, %xi) %tmp1inc, mul(%a, %xii) %tmp2inc; nop; nop; add1u(%i) %i_, sub1u(%nsubi) %nsubi_, eql(), add(%tmp1, %tmp1inc) %tmp1_, add(%tmp2, %tmp2inc) %tmp2_; brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_ %y); L("epilog") mul(%y, %tmp1) %tmp1y_; nop; nop; mul(%tmp1y_, %tmp2) %r_; nop; nop; ret(%r_);
  (with mul with y done first, so it is closer on belt, which might be useful if the loop is even larger).
  Is brtr("epilog", ...) even allowed like that? The entire thing is no longer EBB AFAIK, but I think hardware should be ok with this.
  I am thinking to write a simulator / dynamic recompiler for Mill-like ISA to be simulated (for research mostly), with input machine code mostly written by hand at this time.
  - Ivan Godard
    Keymaster
    April 1, 2019 at 7:49 am
    Post count: 689
    #3442
    Witold – Mill branches can only be to the head of an ebb, so you can’t both fall into the epilog and also branch to it. That’s why there’s a “leaves” op in the actual asm. Having the prologue branch around the loop body is not something the specializer does, it’s from LLVM and you should see it when compiling for other architectures too.
  - Ivan Godard
    Keymaster
    April 1, 2019 at 7:54 am
    Post count: 689
    #3443
    “I am thinking to write a simulator / dynamic recompiler for Mill-like ISA to be simulated (for research mostly), with input machine code mostly written by hand at this time.”
    If so, why not just join the Mill team? You’d have access to the real tools, and would get equity in the company.
    Be aware that hand writing Mill assembler is not recommended 🙂 It’s hard enough to read, even when you know the machine.
- Ivan Godard
  Keymaster
  April 1, 2019 at 7:45 am
  Post count: 689
  #3441
  Veedrac, you make an excellent code generator 🙂 The compiler got rid of the add by using a fma() op instead, and you’re using brtr rather than the loop branching forms, but otherwise it’s right on by my eyeball.
Veedrac
Participant
December 18, 2018 at 9:02 am
Post count: 25
#3380
So, when we also include the ‘y’, and not use spiller it will be something like this?
As I understand it, yes, basically. There are a few differences I know of.
You won’t use or(x, 0) to reorder ops, or xor(0, 0) to load zeros. Reordering would probably be done here by using a branch instruction encoding that allows for reordering, or the conform instruction (the Wiki isn’t up-to-date on this). Loading a zero would probably be done with rd, since that phases well.
eql takes two arguments unless it’s ganged (then it’s 0). I think you would have to rd in a zero or compare to %i, but I’m not sure. eql is also phased such that it can be in the same cycle as brtr; my semicolon was a typo.
- Veedrac
  Participant
  December 18, 2018 at 9:09 am
  Post count: 25
  #3381
  (not editing because that broke things last time)
  There’s also no reason to load multiple zeros in when using a conform-like-brtr (or conform) since it presumably lets you specify a belt item in multiple places.
  - Witold Baryluk
    Participant
    December 19, 2018 at 4:32 am
    Post count: 33
    #3383
    Thanks again Veedrac. I will read on branch belt reordering, (I am already reading on conform and rescue). I am guessing it is very similar to what call with multiple arguments and “new belt” is doing, but without actually creating new frame, but instead dropping multiple new things on the belt during branch (when taken).
    Thanks for the eql hint. I am just free styling most of this code, as I didn’t really go deep into wiki, because I wanted more conceptual ideas first clarified. 🙂
    As for the phasing and ganging, I am still eluded by exact semantic, but I will do check a wiki and talks again!
    As for the rd, immediate constants can be done with con. rd looks like a good choice for quick copying belt values (more compact than or, and works with vectors / floating points values on belt).
    Still not sure, why rd has ability to read from scratchpad, because fill does the same apparently. One is called “extended” scratchpad tho, so maybe rd on scratchpad is just for debugging, and thread state saving, without touching spilled stuff.
- Ivan Godard
  Keymaster
  April 1, 2019 at 7:59 am
  Post count: 689
  #3444
  Yes, sadly the Wiki has not kept up with the machine. We are very resource constrained until the next funding round.
Veedrac
Participant
December 19, 2018 at 7:19 am
Post count: 25
#3388
rd can’t copy belt values; the whole point of that phase is that it has no belt inputs. It has four purposes:
1. Dropping predefined “popular constants” (popCons) like [0, 1, 2, 3], π, e, √2. “The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.” con should work as well, but is significantly larger.
2. Operating on the in-core scratchpad, allocated with scratchf. spill and fill are for the extended scratchpad, which is stored in memory and allocated with exscratchf, and can be much larger.
3. Reading registers. These aren’t belt locations, they are things like hardware counters, threadIDs, and supposedly also “thisPointer” for method calls.
4. Stream handling, a magic secret thing for “cacheless bulk memory access” we’ve never heard the details of, as far as I know.
- Ivan Godard
  Keymaster
  April 1, 2019 at 8:02 am
  Post count: 689
  #3445
  rd() and wr() are no longer used to reach scratchpad; we now use another overload of spill/fill instead. It’s just a nomenclature change.
  Streams remain undisclosed (and incompletely implemented). There will be a talk someday.
Ivan Godard
Keymaster
April 1, 2019 at 6:53 am
Post count: 689
#3439
Please accept my apologies, folks. Somehow a recent update of the Forum software removed me from the notify list, so I was unaware of the discussion. I’ll be working through the backlog promptly.
Ivan

Keymaster

April 1, 2019 at 7:35 am

Post count: 689

Here’s what the compiler gives you today for your example (Silver, no pipelining or vectorization, –O2):

F("f") %0 %1;
    load(dp, gl("n"), w, 3) %4/27 ^14,
    scratchf(s[1]) ^11;
                // V%0 V%1

    spill(s[0], b0 %0/26) ^10,
    spill(s[4], b1 %1/29) ^10;
                // ^%0 ^%1

    spill(s[1], b0 %4/27) ^14;
        // L7C24F0=/home/ivan/mill/specializer/src/test7.c
                // V%4 ^%4

    nop();
    nop();
    gtrsb(b1 %4/27, 0) %5 ^15,                                  // L7C22F0
    brfls(b0 %5, "f$0_3", b1 %2, b1 %2) ^16,                    // L7C4F0
    rd(w(0)) %2 ^12;
                // V%2 ^%4 V%5 ^%5 ^%2 ^%2

    load(dp, gl("A1"), d, 3) %9/28 ^21;

    nop();
    spill(s[1], b0 %9/28) ^21;
                // V%9 ^%9

    nop();
    nop();
    inners("f$1_2", b2 %2, b2 %2, b2 %2) ^22;                   // L7C4F0
                // ^%2 ^%2 ^%2

L("f$0_3") %20 %21;
        // for.cond.cleanup
    mul(b1 %20, b0 %29/1) %22 ^38,                              // L12C20F0
    fill(s[4], w) %29/1 ^10;
                // V%21 V%20 V%29 ^%20 ^%29

    nop();
    nop();      // V%22
    mul(b0 %22, b3 %21) %23 ^39;                                // L12C27F0
                // ^%22 ^%21

    nop();
    retn(b0 %23) ^40;                                           // L12C8F0
                // V%23 ^%23

L("f$1_2") %10 %11 %12;
        // for.body // loop=$2 header
    mul(b3 %11, b1 %26/0) %14 ^28,                              // L8C26F0
    loadus(b0 %28/9, 0, b3 %11, w, 3) %13 ^27,
    fill(s[0], w) %26/0 ^10,
    fill(s[1], d) %28/9 ^21;
                // V%11 V%10 V%12 V%26 V%28 ^%11 ^%28 ^%26 ^%11

    nop();
    nop();      // V%13 V%14
    mul(b0 %14, b1 %13) %15 ^29;                                // L9C23F0
                // ^%14 ^%13

    nop();
    nop();      // V%15
    fma(b0 %15, b6 %11, b7 %12) %17 ^31;                        // L10C15F0
                // ^%12 ^%11 ^%15

    add(b6 %11, 1) %18 ^32;                                     // L7C28F0
                // ^%11 V%18

    lsssb(b1 %18, b0 %27/4) %19 ^33,                            // L7C22F0
    add(b2 %15, b7 %10) %16 ^30,                                // L9C14F0
    backtr(b1 %19, "f$1_2", b0 %16, b4 %18, b2 %17) ^34,        // L7C4F0
    leaves("f$0_3", b2 %17, b0 %16) ^34,                        // L7C4F0
    fill(s[1], w) %27/4 ^14;
        // L7C24F0=/home/ivan/mill/specializer/src/test7.c
                // V%27 ^%18 ^%27 ^%10 ^%15 V%19 V%17 V%16 ^%19 ^%16 ^%18 ^%17 ^%17 ^%16

Your three ebbs are there; f$1_2 is the loop body. The back branch is backtr(), the exit is leaves(). These take belt arguments (%N…) which are passed to the formals at the start of the target ebb (the formals are listed after the L(…) label). The initial entry is the inners() op, which supplies the starting values for the formals.

Branch argument passing works like call argument passing: there is no actual movement of data, just a renumbering of the belt to put the actuals into the formals order. Absent a mispredict, a branch (including the back/inner/leave forms) takes zero time, including passing the arguments.

The imported loop-constant values “n” and “x” are first spilled to the scratchpad in the prologue, and then filled for each iteration. It would also be possible to pass the values around the loop as an addition formal and back() argument. The specializer did not do that in this case; the latency would be the same, entropy slightly less, and power usage hardware dependent.

Incidentally, the loop body here is nine cycles (count the semicolons); with pipelining enabled it drops to three.

Author
Posts

You must be logged in to reply to this topic.