Mill Computing Forums: Loop compilation

  • Witold Baryluk
    Participant
    Post count: 27
    #3365

    Hi,

    If I have C code that does something like this (a bit more than an extremely primitive trivial loop):

    
    extern int n;
    extern int* A1;
    
    int f(int x, int y) {
      int tmp1 = 0;
      int tmp2 = 0;
      for (int i = 0; i < n; i++) {
        int mytmp = A1[i] * i;
        tmp1 += mytmp * x;
        tmp2 += mytmp * x * i;
      }
      return tmp1 * tmp2 * y;
    }
    

    and assuming no unrolling, pipelining, or other optimizations, my understanding is that it will be separated into 3 EBBs:

    <init_ebb> % set up the belt with tmp1 and tmp2 for loop_ebb and epilog_ebb, but also with x and y.
    <loop_ebb> % carry stuff over, do the computations, and at the end restore the belt into the proper shape for loop_ebb re-entry or for epilog_ebb
    <epilog_ebb> % trivially consume stuff from the belt and return

    with loop_ebb needing to carry over tmp1, tmp2, x, and y, even if it temporarily used the belt to store some additional values, or didn’t even use y.
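
    At the C level, I picture that split roughly like this (just my own sketch; the block labels and the hypothetical f_split and n_local names are mine, and the comments note what I think each block needs to see on the belt):

    extern int n;
    extern int* A1;

    int f_split(int x, int y) {
      /* init_ebb: load n once, zero the accumulators, keep x and y live */
      int n_local = n;
      int tmp1 = 0, tmp2 = 0;
      int i = 0;
      if (i >= n_local) goto epilog_ebb;

    loop_ebb:   /* carries A1, x, i, n_local, tmp1, tmp2 (and y, even though it is unused here) */
      {
        int mytmp = A1[i] * i;
        tmp1 += mytmp * x;
        tmp2 += mytmp * x * i;
        i++;
      }
      if (i < n_local) goto loop_ebb;

    epilog_ebb: /* consumes tmp1, tmp2, y; everything else is dead here */
      return tmp1 * tmp2 * y;
    }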

    My question: what kind of operation would be used at the exit of loop_ebb to jump back to loop_ebb or on to epilog_ebb? How would the belt be configured for loop_ebb so that epilog_ebb can see all the used variables properly? I also guess that n will usually be loaded from memory once by init_ebb and put on the belt too (let’s assume the compiler figured out there is no aliasing and that n doesn’t change during loop execution), preserved across the iterations of loop_ebb and into epilog_ebb, and ultimately ignored by epilog_ebb.

    I understand that the first step would be to convert this into SSA form, but at the same time it feels like the end result will look like tail-call style looping from functional languages, and the issue of moving arguments on the belt to the same positions as before (so the next EBB works correctly) could lead to a big per-iteration overhead (especially if there are a lot of carried-over values, even ones that are not used or updated in the loop itself, like y above).
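
    To make the tail-call analogy concrete, here is the same loop in that style (again only my illustration; each hypothetical function stands for one EBB, and its parameter list is exactly what has to be carried on the belt, including the untouched y):

    extern int n;
    extern int* A1;

    static int epilog_ebb(int tmp1, int tmp2, int y) {
      return tmp1 * tmp2 * y;
    }

    /* Every value the loop needs is an explicit parameter, even y, which is only
       passed through so the epilog can see it. */
    static int loop_ebb(const int *A1p, int x, int y, int i, int n_local,
                        int tmp1, int tmp2) {
      if (i >= n_local)
        return epilog_ebb(tmp1, tmp2, y);
      int mytmp = A1p[i] * i;
      /* the tail call's argument list is the belt reshuffle at the back branch */
      return loop_ebb(A1p, x, y, i + 1, n_local,
                      tmp1 + mytmp * x, tmp2 + mytmp * x * i);
    }

    int f_tail(int x, int y) {
      /* init_ebb: load n once, zero the accumulators */
      return loop_ebb(A1, x, y, 0, n, 0, 0);
    }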

    A discussion, or output from the current Mill conAsm, would help too. 🙂

    Thanks!

  • Veedrac
    Participant
    Post count: 19

    (reposting due to forum issues; apologies if this spams)

    No optimizations means you have a hideously long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix, but for the sake of demonstration I’ll stop at reordering the multiplies (the C equivalent of that reordering is sketched after the asm). The loop would look something like so, with any differences being down to a lack of an up-to-date reference, and the fact that I don’t really get how nops work.

    
        L("loop") %A1 %x %i %nsubi %tmp1 %tmp2;
            load(%A1, 0, $i, 4, 4) %a,
            mul(%x,  %i) %xi;
    
            nop; nop;
    
            mul(%xi, %i) %xii;
    
            nop; nop;
    
            mul(%a, %xi)  %tmp1inc,
            mul(%a, %xii) %tmp2inc;
    
            nop; nop;
    
            add1u(%i) %i_,
            sub1u(%nsubi) %nsubi_,
            eql(),
            add(%tmp1, %tmp1inc) %tmp1_,
            add(%tmp2, %tmp2inc) %tmp2_;
            brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
    
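    In C terms, the reorder is just this reassociation (a throwaway helper, only to show the shape of the chain):

    extern int* A1;

    /* one iteration after reassociating: the chain is load -> mul -> add instead
       of load -> mul -> mul -> mul -> add */
    static void body(int i, int x, int *tmp1, int *tmp2) {
      int a   = A1[i];     /* load */
      int xi  = x * i;     /* independent of the load, can start alongside it */
      int xii = xi * i;    /* still independent of the load */
      *tmp1 += a * xi;     /* the two products of a don't depend on each other */
      *tmp2 += a * xii;
    }
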

    Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
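
    A toy model of that remapping step, to make “permute the rename bytes” concrete (the belt size, tag width, helper names, and the invalid marker are all my own guesses, not anything published):

    #include <stdint.h>
    #include <string.h>

    #define BELT_SIZE 32   /* "32ish" for a big member like Gold; an assumption */

    /* logical belt position -> physical slot, one small tag per entry */
    typedef struct { uint8_t phys[BELT_SIZE]; } belt_map;

    /* The conform-like part of a taken branch: for each destination position,
       pick some existing logical position's physical tag. One byte-wide
       gather/permute over the map; the operand data itself never moves. */
    static belt_map branch_remap(const belt_map *old, const uint8_t src[], int nsrc) {
      belt_map next;
      memset(&next, 0xFF, sizeof next);   /* positions not named become invalid */
      for (int d = 0; d < nsrc; d++)
        next.phys[d] = old->phys[src[d]];
      return next;
    }

    In this model, brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_) is just a six-entry src list, and the cost is the same whether a value was updated in the loop or merely carried through.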

    • Witold Baryluk
      Participant
      Post count: 27

      Hi Veedrac. This is really helpful! Thanks.

      So, when we also include y and don’t use the spiller, it will be something like this?

      
      L("init")
      load(&A1, 0, 0, 4, 0) %A1, or(x, 0) %x, xor(0, 0) %i, load(&n, 0, 0, 4, 0) %nsubi, xor(0, 0) %tmp1, xor(0, 0) %tmp2, or(y, 0) %y;  % please ignore the exact order of ops here, load delays, or exceeding number of AGU and ALU FUs.
      
      <stall a few cycles for the loads of A1 and nsubi to finish>
      
      eql(%nsubi);
      
      brtr("epilog", %A1 %x %i %nsubi %tmp1 %tmp2 %y);   % %A1 and %nsubi are not needed, but whatever.
      
      L("loop") %A1 %x %i %nsubi %tmp1 %tmp2 %y;
              load(%A1, 0, $i, 4, 4) %a,    % shouldn't be "load(%A1, 0, $i, 4, 5)"?
              mul(%x,  %i) %xi;
      
              nop; nop;
      
              mul(%xi, %i) %xii;
      
              nop; nop;
      
              mul(%a, %xi)  %tmp1inc,
              mul(%a, %xii) %tmp2inc;
      
              nop; nop;
      
              add1u(%i) %i_,
              sub1u(%nsubi) %nsubi_,
              eql(),
              add(%tmp1, %tmp1inc) %tmp1_,
              add(%tmp2, %tmp2inc) %tmp2_;
              brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_ %y);
      
      L("epilog")
      mul(%y, %tmp1) %tmp1y_;
      
      nop; nop;
      
      mul(%tmp1y_, %tmp2) %r_;
      
      nop; nop;
      
      ret(%r_);
      

      (with the mul by y done first, so it is closer on the belt, which might be useful if the loop were even larger).

      Is brtr("epilog", ...) even allowed like that? The entire thing is no longer an EBB AFAIK, but I think the hardware should be OK with this.

      I am thinking of writing a simulator / dynamic recompiler for a Mill-like ISA (mostly for research), with the input machine code mostly written by hand at this point.

  • Veedrac
    Participant
    Post count: 19

    So, when we also include y and don’t use the spiller, it will be something like this?

    As I understand it, yes, basically. There are a few differences I know of.

    You won’t use or(x, 0) to reorder ops, or xor(0, 0) to load zeros. Reordering would probably be done here by using a branch instruction encoding that allows for reordering, or the conform instruction (the Wiki isn’t up-to-date on this). Loading a zero would probably be done with rd, since that phases well.

    eql takes two arguments unless it’s ganged (then it’s 0). I think you would have to rd in a zero or compare to %i, but I’m not sure. eql is also phased such that it can be in the same cycle as brtr; my semicolon was a typo.

    • Veedrac
      Participant
      Post count: 19

      (not editing because that broke things last time)

      There’s also no reason to load multiple zeros in when using a conform-like-brtr (or conform) since it presumably lets you specify a belt item in multiple places.

      • Witold Baryluk
        Participant
        Post count: 27

        Thanks again, Veedrac. I will read up on branch belt reordering (I am already reading about conform and rescue). I am guessing it is very similar to what a call with multiple arguments and a “new belt” does, but without actually creating a new frame, instead dropping multiple new things onto the belt during the branch (when taken).
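
        To check my understanding against your remap sketch above, this is roughly how I picture the call/branch difference (purely my own model, nothing from the wiki; the names, the frame depth of 16, and the missing overflow checks are all just for illustration):

        #include <stdint.h>
        #include <string.h>

        #define BELT_SIZE 32
        typedef struct { uint8_t phys[BELT_SIZE]; } belt_map;

        /* one gather step shared by both cases: pick sources out of an existing map */
        static belt_map gather(const belt_map *from, const uint8_t src[], int n) {
          belt_map out;
          memset(&out, 0xFF, sizeof out);
          for (int d = 0; d < n; d++)
            out.phys[d] = from->phys[src[d]];
          return out;
        }

        typedef struct { belt_map frame[16]; int top; } frame_stack;  /* spiller-backed in reality */

        /* call: the gathered arguments become the whole belt of a brand-new frame */
        static void do_call(frame_stack *fs, const uint8_t args[], int n) {
          fs->frame[fs->top + 1] = gather(&fs->frame[fs->top], args, n);
          fs->top++;
        }

        /* taken branch with a carry list: same gather, but replacing the current frame */
        static void do_branch(frame_stack *fs, const uint8_t carry[], int n) {
          fs->frame[fs->top] = gather(&fs->frame[fs->top], carry, n);
        }

        i.e. both are the same gather over the map; the call just lands the result in a fresh frame, while the branch rewrites the frame it is already in.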

        Thanks for the eql hint. I am mostly freestyling this code, as I haven’t really gone deep into the wiki; I wanted the conceptual ideas clarified first. 🙂

        As for phasing and ganging, the exact semantics still elude me, but I will check the wiki and the talks again!

        As for rd, immediate constants can be done with con. rd looks like a good choice for quickly copying belt values (more compact than or, and it works with vector / floating-point values on the belt).

        Still not sure why rd has the ability to read from the scratchpad, because fill apparently does the same. One is called the “extended” scratchpad though, so maybe rd on the scratchpad is just for debugging and thread-state saving, without touching spilled stuff.

  • Veedrac
    Participant
    Post count: 19

    rd can’t copy belt values; the whole point of that phase is that it has no belt inputs. It has four purposes:

    1. Dropping predefined “popular constants” (popCons) like [0, 1, 2, 3], π, e, √2. “The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.” con should work as well, but is significantly larger.

    2. Operating on the in-core scratchpad, allocated with scratchf. spill and fill are for the extended scratchpad, which is stored in memory and allocated with exscratchf, and can be much larger.

    3. Reading registers. These aren’t belt locations, they are things like hardware counters, threadIDs, and supposedly also “thisPointer” for method calls.

    4. Stream handling, a magic secret thing for “cacheless bulk memory access” we’ve never heard the details of, as far as I know.
