With no optimizations you get a hideously long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix, but for the sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like the following; any differences from the real syntax are down to my not having an up-to-date reference, and to my not really getting how nops work.
L("loop") %A1 %x %i %nsubi %tmp1 %tmp2;
    load(%A1, 0, %i, 4, 4) %a, mul(%x, %i) %xi;
    nop; nop;
    mul(%xi, %i) %xii;
    nop; nop;
    mul(%a, %xi) %tmp1inc, mul(%a, %xii) %tmp2inc;
    nop; nop;
    add1u(%i) %i_, sub1u(%nsubi) %nsubi_, eql(), add(%tmp1, %tmp1inc) %tmp1_, add(%tmp2, %tmp2inc) %tmp2_;
    brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
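For anyone who doesn't want to read the listing above, here is a rough Python sketch of what I take it to compute, next to the chained version it replaces. The function names are made up for illustration, and I'm assuming %i starts at 0 and that the loop runs once per element; the exit test isn't spelled out in the listing, so that part is a guess. Python integers also don't wrap the way fixed-width hardware integers would.

```python
def loop_chained(a, x):
    """Chained form: load -> mul -> mul -> mul -> add, all serialized."""
    tmp1 = tmp2 = 0
    for i in range(len(a)):
        t = a[i] * x       # mul has to wait for the load
        t = t * i          # second mul waits for the first
        tmp1 += t
        tmp2 += t * i      # third mul, then the add
    return tmp1, tmp2


def loop_reordered(a, x):
    """Reordered form: x*i and x*i*i do not depend on the load."""
    tmp1 = tmp2 = 0
    for i in range(len(a)):
        ai = a[i]          # load
        xi = x * i         # can start alongside the load
        xii = xi * i       # still independent of the load
        tmp1 += ai * xi    # only one mul sits between load and add
        tmp2 += ai * xii
    return tmp1, tmp2


a = [3, 1, 4, 1, 5]
print(loop_chained(a, 2))    # (64, 212)
print(loop_reordered(a, 2))  # (64, 212)
```

Both orderings give the same sums; the only thing that changes is how much work is stuck behind the load.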
The arguments get shuffled around during the branch. It’s fast because it’s hardware magic. IIUC, for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO core’s register renaming, but 32ish wide instead of the 348ish physical registers (180 integer plus 168 vector) on Skylake. The branch would just rewrite the bytes of that mapping using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem, because branches are predicted and belt pushes are determined statically. In most cases the permutation can be encoded as a small bitmask, so that shouldn’t be an issue either.
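To make the remapping idea concrete, here is a toy Python model. It is not how the hardware is actually built: `BELT_SIZE`, `branch_remap`, and `keep_mask_remap` are names I made up, the physical slot numbers are arbitrary, and the bitmask variant is just one possible reading of "encoded with a small bitmask" (keep a subset of belt positions in their existing order).

```python
BELT_SIZE = 32  # assumed belt length, per the "32ish" figure


def branch_remap(mapping, arg_positions):
    """Build the logical-to-physical map seen by the branch target.

    mapping[k] is the physical slot holding logical belt position k.
    arg_positions lists the logical positions the branch passes, in the
    order the target expects them (front of the belt first).
    """
    # One gather through the old map: this is the "SIMD permute" step.
    passed = [mapping[p] for p in arg_positions]
    # Positions that are not passed simply fall off in this toy model.
    return passed + [None] * (len(mapping) - len(passed))


def keep_mask_remap(mapping, mask):
    """Bitmask variant: keep the positions whose bit is set, order unchanged."""
    positions = [k for k in range(len(mapping)) if (mask >> k) & 1]
    return branch_remap(mapping, positions)


# Physical slots 100..131 stand in for wherever the values really live.
mapping = list(range(100, 100 + BELT_SIZE))

print(branch_remap(mapping, [5, 0, 3])[:3])      # [105, 100, 103]
print(keep_mask_remap(mapping, 0b101001)[:3])    # [100, 103, 105]
```

Nothing moves except the small map, which is why I'd expect the shuffle to be cheap: the values themselves stay where they are.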