Mill Computing, Inc. › Forums › The Mill › Tools › Compilers › Loop compilation
- AuthorPosts
- #3365 |
Hi,
If I have a C code that does something like this (at bit more than extremely primitive trivial loop):
extern int n; extern int* A1; int f(int x, int y) { int tmp1 = 0; int tmp2 = 0; for (int i = 0; i < n; i++) { int mytmp = A1[i] * i; tmp1 += mytmp * x; tmp2 += mytmp * x * i; } return tmp1 * tmp2 * y; }
and assuming no unrolling and pipelining and other optimization, my understanding is that it will be separated into 3 EBBs:
<init_ebb>
% setup the belt with tmp1 and tmp2 for loop_ebb and epilog_ebb, but also with x and y.
<loop_ebb>
% carry stuff over, do computations, and at the end restore belt into proper shape for loop_ebb reentry or epilog_ebb
<epilog_ebb>
% trivial consume stuff from the belt and returnwith loop_ebb needing to carry over the tmp1, tmp2, x, y, even if it temporarily used belt to store some additional value, or didn’t even use y.
My question: What kind of operation would be used at the exit of the loop_ebb to jump again to loop_ebb or epilog_ebb ? how the belt would be configured for loop_ebb, so the epilog_ebb can see properly all used variables? I also guess that the n, usually will be loaded from memory once by init_ebb, and put on belt too (lets assume the compiler figured out there is no aliasing, or the n to change during loop execution), and preserved across calls to loop_ebb and epilog_ebb, and ultimately ignored by epilog_ebb.
I understand that first step would be to convert this into SSA form, but at the same time, it feels the end result will look like a tail call style looping from functional languages, and the issue of moving arguments on belt to be on the same positions as before (so next EBB works correctly), could lead to a big per loop iteration overhead (especially if there is a lot of carry over values, even if they are not used or updated in the loop itself, like ‘y’ above.).
A discussion, or output of the current Mill conAsm could help too. 🙂
Thanks!
No optimizations means you have a hideous long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix but for sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like so, with differences down to a lack of an up-to-date reference, and that I don’t really get how nops work.
L("loop") %A1 %x %i %nsubi %tmp1 %tmp2; load(%A1, 0, $i, 4, 4) %a, mul(%x, %i) %xi; nop; nop; mul(%xi, %i) %xii; nop; nop; mul(%a, %xi) %tmp1inc, mul(%a, %xii) %tmp2inc; nop; nop; add1u(%i) %i_, sub1u(%nsubi) %nsubi_, eql(), add(%tmp1, %tmp1inc) %tmp1_, add(%tmp2, %tmp2inc) %tmp2_; brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
- This reply was modified 5 years, 11 months ago by Veedrac.
No optimizations means you have a hideous long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix but for sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like so, with differences down to a lack of an up-to-date reference, and that I don’t really get how nops work.
L("loop") %A1 %x %i %nsubi %tmp1 %tmp2; load(%A1, 0, $i, 4, 4) %a, mul(%x, %i) %xi; nop; nop; mul(%xi, %i) %xii; nop; nop; mul(%a, %xi) %tmp1inc, mul(%a, %xii) %tmp2inc; nop; nop; add1u(%i) %i_, sub1u(%nsubi) %nsubi_, eql(), add(%tmp1, %tmp1inc) %tmp1_, add(%tmp2, %tmp2inc) %tmp2_; brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
(reposting due to forum issues; apologies if this spams)
No optimizations means you have a hideous long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix but for sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like so, with differences down to a lack of an up-to-date reference, and that I don’t really get how nops work.
L("loop") %A1 %x %i %nsubi %tmp1 %tmp2; load(%A1, 0, $i, 4, 4) %a, mul(%x, %i) %xi; nop; nop; mul(%xi, %i) %xii; nop; nop; mul(%a, %xi) %tmp1inc, mul(%a, %xii) %tmp2inc; nop; nop; add1u(%i) %i_, sub1u(%nsubi) %nsubi_, eql(), add(%tmp1, %tmp1inc) %tmp1_, add(%tmp2, %tmp2inc) %tmp2_; brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
Hi Veedrac. This is really helpful! Thanks.
So, when we also include the ‘y’, and not use spiller it will be something like this?
L("init") load(&A1, 0, 0, 4, 0) %A1, or(x, 0) %x, xor(0, 0) %i, load(&n, 0, 0, 4, 0) %nsubi, xor(0, 0) %tmp1, xor(0, 0) %tmp2, or(y, 0) %y; % please ignore the exact order of ops here, load delays, or exceeding number of AGU and ALU FUs. <stalling few cycles for loads of A1 and nsubi to finish> eql(%nsubi); brtr("epilog", %A1 %x %i %nsubi %tmp1 %tmp2 %y); % %A1 and %nsubi are not needed, but whatever. L("loop") %A1 %x %i %nsubi %tmp1 %tmp2 %y; load(%A1, 0, $i, 4, 4) %a, % shouldn't be "load(%A1, 0, $i, 4, 5)"? mul(%x, %i) %xi; nop; nop; mul(%xi, %i) %xii; nop; nop; mul(%a, %xi) %tmp1inc, mul(%a, %xii) %tmp2inc; nop; nop; add1u(%i) %i_, sub1u(%nsubi) %nsubi_, eql(), add(%tmp1, %tmp1inc) %tmp1_, add(%tmp2, %tmp2inc) %tmp2_; brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_ %y); L("epilog") mul(%y, %tmp1) %tmp1y_; nop; nop; mul(%tmp1y_, %tmp2) %r_; nop; nop; ret(%r_);
(with mul with y done first, so it is closer on belt, which might be useful if the loop is even larger).
Is
brtr("epilog", ...)
even allowed like that? The entire thing is no longer EBB AFAIK, but I think hardware should be ok with this.I am thinking to write a simulator / dynamic recompiler for Mill-like ISA to be simulated (for research mostly), with input machine code mostly written by hand at this time.
Witold – Mill branches can only be to the head of an ebb, so you can’t both fall into the epilog and also branch to it. That’s why there’s a “leaves” op in the actual asm. Having the prologue branch around the loop body is not something the specializer does, it’s from LLVM and you should see it when compiling for other architectures too.
“I am thinking to write a simulator / dynamic recompiler for Mill-like ISA to be simulated (for research mostly), with input machine code mostly written by hand at this time.”
If so, why not just join the Mill team? You’d have access to the real tools, and would get equity in the company.
Be aware that hand writing Mill assembler is not recommended 🙂 It’s hard enough to read, even when you know the machine.
Veedrac, you make an excellent code generator 🙂 The compiler got rid of the add by using a fma() op instead, and you’re using brtr rather than the loop branching forms, but otherwise it’s right on by my eyeball.
So, when we also include the ‘y’, and not use spiller it will be something like this?
As I understand it, yes, basically. There are a few differences I know of.
You won’t use
or(x, 0)
to reorder ops, orxor(0, 0)
to load zeros. Reordering would probably be done here by using a branch instruction encoding that allows for reordering, or the conform instruction (the Wiki isn’t up-to-date on this). Loading a zero would probably be done withrd
, since that phases well.eql
takes two arguments unless it’s ganged (then it’s 0). I think you would have tord
in a zero or compare to%i
, but I’m not sure.eql
is also phased such that it can be in the same cycle asbrtr
; my semicolon was a typo.(not editing because that broke things last time)
There’s also no reason to load multiple zeros in when using a conform-like-brtr (or conform) since it presumably lets you specify a belt item in multiple places.
Thanks again Veedrac. I will read on branch belt reordering, (I am already reading on
conform
andrescue
). I am guessing it is very similar to what call with multiple arguments and “new belt” is doing, but without actually creating new frame, but instead dropping multiple new things on the belt during branch (when taken).Thanks for the
eql
hint. I am just free styling most of this code, as I didn’t really go deep into wiki, because I wanted more conceptual ideas first clarified. 🙂As for the phasing and ganging, I am still eluded by exact semantic, but I will do check a wiki and talks again!
As for the
rd
, immediate constants can be done withcon
.rd
looks like a good choice for quick copying belt values (more compact thanor
, and works with vectors / floating points values on belt).Still not sure, why
rd
has ability to read from scratchpad, becausefill
does the same apparently. One is called “extended” scratchpad tho, so mayberd
on scratchpad is just for debugging, and thread state saving, without touching spilled stuff.
Yes, sadly the Wiki has not kept up with the machine. We are very resource constrained until the next funding round.
rd
can’t copy belt values; the whole point of that phase is that it has no belt inputs. It has four purposes:1. Dropping predefined “popular constants” (popCons) like [0, 1, 2, 3], Ï€, e, √2. “The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.” con should work as well, but is significantly larger.
2. Operating on the in-core scratchpad, allocated with
scratchf
.spill
andfill
are for the extended scratchpad, which is stored in memory and allocated withexscratchf
, and can be much larger.3. Reading registers. These aren’t belt locations, they are things like hardware counters, threadIDs, and supposedly also “thisPointer” for method calls.
4. Stream handling, a magic secret thing for “cacheless bulk memory access” we’ve never heard the details of, as far as I know.
rd() and wr() are no longer used to reach scratchpad; we now use another overload of spill/fill instead. It’s just a nomenclature change.
Streams remain undisclosed (and incompletely implemented). There will be a talk someday.
Please accept my apologies, folks. Somehow a recent update of the Forum software removed me from the notify list, so I was unaware of the discussion. I’ll be working through the backlog promptly.
IvanHere’s what the compiler gives you today for your example (Silver, no pipelining or vectorization, –O2):
F("f") %0 %1; load(dp, gl("n"), w, 3) %4/27 ^14, scratchf(s[1]) ^11; // V%0 V%1 spill(s[0], b0 %0/26) ^10, spill(s[4], b1 %1/29) ^10; // ^%0 ^%1 spill(s[1], b0 %4/27) ^14; // L7C24F0=/home/ivan/mill/specializer/src/test7.c // V%4 ^%4 nop(); nop(); gtrsb(b1 %4/27, 0) %5 ^15, // L7C22F0 brfls(b0 %5, "f$0_3", b1 %2, b1 %2) ^16, // L7C4F0 rd(w(0)) %2 ^12; // V%2 ^%4 V%5 ^%5 ^%2 ^%2 load(dp, gl("A1"), d, 3) %9/28 ^21; nop(); spill(s[1], b0 %9/28) ^21; // V%9 ^%9 nop(); nop(); inners("f$1_2", b2 %2, b2 %2, b2 %2) ^22; // L7C4F0 // ^%2 ^%2 ^%2 L("f$0_3") %20 %21; // for.cond.cleanup mul(b1 %20, b0 %29/1) %22 ^38, // L12C20F0 fill(s[4], w) %29/1 ^10; // V%21 V%20 V%29 ^%20 ^%29 nop(); nop(); // V%22 mul(b0 %22, b3 %21) %23 ^39; // L12C27F0 // ^%22 ^%21 nop(); retn(b0 %23) ^40; // L12C8F0 // V%23 ^%23 L("f$1_2") %10 %11 %12; // for.body // loop=$2 header mul(b3 %11, b1 %26/0) %14 ^28, // L8C26F0 loadus(b0 %28/9, 0, b3 %11, w, 3) %13 ^27, fill(s[0], w) %26/0 ^10, fill(s[1], d) %28/9 ^21; // V%11 V%10 V%12 V%26 V%28 ^%11 ^%28 ^%26 ^%11 nop(); nop(); // V%13 V%14 mul(b0 %14, b1 %13) %15 ^29; // L9C23F0 // ^%14 ^%13 nop(); nop(); // V%15 fma(b0 %15, b6 %11, b7 %12) %17 ^31; // L10C15F0 // ^%12 ^%11 ^%15 add(b6 %11, 1) %18 ^32; // L7C28F0 // ^%11 V%18 lsssb(b1 %18, b0 %27/4) %19 ^33, // L7C22F0 add(b2 %15, b7 %10) %16 ^30, // L9C14F0 backtr(b1 %19, "f$1_2", b0 %16, b4 %18, b2 %17) ^34, // L7C4F0 leaves("f$0_3", b2 %17, b0 %16) ^34, // L7C4F0 fill(s[1], w) %27/4 ^14; // L7C24F0=/home/ivan/mill/specializer/src/test7.c // V%27 ^%18 ^%27 ^%10 ^%15 V%19 V%17 V%16 ^%19 ^%16 ^%18 ^%17 ^%17 ^%16
Your three ebbs are there; f$1_2 is the loop body. The back branch is backtr(), the exit is leaves(). These take belt arguments (%N…) which are passed to the formals at the start of the target ebb (the formals are listed after the L(…) label). The initial entry is the inners() op, which supplies the starting values for the formals.
Branch argument passing works like call argument passing: there is no actual movement of data, just a renumbering of the belt to put the actuals into the formals order. Absent a mispredict, a branch (including the back/inner/leave forms) takes zero time, including passing the arguments.
The imported loop-constant values “n” and “x” are first spilled to the scratchpad in the prologue, and then filled for each iteration. It would also be possible to pass the values around the loop as an addition formal and back() argument. The specializer did not do that in this case; the latency would be the same, entropy slightly less, and power usage hardware dependent.
Incidentally, the loop body here is nine cycles (count the semicolons); with pipelining enabled it drops to three.
- AuthorPosts
You must be logged in to reply to this topic.