  • Ivan Godard
    Keymaster
    Post count: 689
    #2818

    An exchange from comp.arch repeated here:
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++
    @Stephen Fuld:
    The C switch statement, and equivalents in other languages, are interesting. They allow substantial economy of source code, and quite compact generated code, but they do have some issues. The indexed load of the target address is frequently a cache miss, and it is followed immediately by an indexed jump that is almost always a branch mispredict. Which leads me to my question: does the Mill have any tricks up its sleeve that make switch statements less costly?

  • Ivan Godard
    Keymaster
    Post count: 689

    The compiler currently turns switches into a binary search branch tree. The case space is partitioned by less-than compares down to the level at which equal-compares are better, and equal-compares are used thereafter. The logic deals well with runs of consecutive values in a case, such as all alphabetic characters; less-than compares are cheap for that. We do not currently generate a jump table. Although early versions of the ISA had jump tables in the code adjacent to the instruction that used them, we took the operation out; code-side jump tables seriously muck up the code data paths.
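
    To illustrate the shape (a hand sketch in C of the lowering idea, not actual tool output; the case values are invented):

    /* less-than compares split the sorted case space; equality tests
       finish off below the split. A consecutive run like 'a'..'z' is a
       single range and costs just two cheap less-thans. */
    int kind(int x) {
        if (x < 'a') {                  /* lss: split the case space */
            if (x == '0') return 1;     /* eql leaves below the split */
            if (x == '9') return 2;
            return 0;                   /* default */
        }
        if (x <= 'z') return 3;         /* the whole alphabetic run */
        return 0;                       /* default */
    }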

    This strategy is tuned for the wide issue and multi-branch capability in larger Mill family members. For a member that is Silver or wider, all the equal compares and their branches fit in one instruction, and a big hunk of the less-than tree will be in prior instructions of the same ebb. Consequently execution can get well down the tree before one of the branches is actually taken and search enters a sub-tree. As the whole ebb has only one prediction (where it will exit to) there are many fewer misses than would occur on a legacy architecture that does branch prediction in a search tree. A bit of arithmetic will show how many taken branches (on average) are necessary to get to the target case (assuming cases are equiprobable and sparse, which is the worst case) for any member configuration of ALU exu slots (for the compares) and branch flow slots (for the branches).

    This works quite well for switches up to the size of say a lexer, but will obviously fail for something like a YACC grammar with cases in the thousands. For those we need to add a cost function and data-side dispatch tables (Mill branches already support label variables) to the specializer; we just haven’t gotten around to that yet.
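
    For flavor: the nearest C analogue of such a data-side dispatch table driven by label variables is a computed goto. A sketch using the GCC labels-as-values extension (not specializer output; names invented):

    int dispatch(int c) {
        /* the dispatch table lives on the data side, holding label addresses */
        static void *tbl[] = { &&case0, &&case1, &&case1, &&caseDefault };
        if (c < 0 || c > 3) goto caseDefault;  /* bounds-check into default */
        goto *tbl[c];                          /* indexed load + dynamic branch */
    case0:       return 1;
    case1:       return 2;
    caseDefault: return 0;
    }

    The table is ordinary data, so the indexed load can be hoisted well ahead of the branch itself.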

    • Ivan Godard
      Keymaster
      Post count: 689

      @EricP:
      In what way do code side jump tables muck up the code data paths?
      For example, if there was an indexed branch instruction immediately
      followed in the code by a table of displacements, how would that
      be worse than any other kind of jump?

      Eric

      • Ivan Godard
        Keymaster
        Post count: 689

        That was pretty much what our initial design did, except that it was a real switch op, not just an indexed branch, and bounds-checked the argument into the default case.

        The mucking arose because the Mill code-side is both heavily streamed and heavily pipelined. It’s a fire-hose, and the code side has to do all it can to keep everything out of the way. The whole code side is done on a whole-cache-line basis. Recall that each instruction is variable length and may be up to a whole line long and may cross a line boundary. The instruction header says how long it is, so the front end of decode is an instruction shifter that is two lines wide and cracks the line stream into an instruction stream.
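
        A toy C model of that cracker (my sketch only, nothing like the real hardware; the line size and the header-byte convention are invented):

        #include <string.h>

        enum { LINE = 32 };              /* invented line size for the toy */
        typedef struct { unsigned char buf[2 * LINE]; int pos; } Cracker;

        /* toy convention: byte 0 of an instruction is its total length */
        static int instLen(const unsigned char *p) { return p[0]; }

        /* crack the next instruction out of the two-line-wide buffer */
        const unsigned char *crack(Cracker *c,
                                   const unsigned char *(*fetchLine)(void)) {
            if (c->pos >= LINE) {        /* first line consumed: shift down */
                memmove(c->buf, c->buf + LINE, LINE);
                memcpy(c->buf + LINE, fetchLine(), LINE);  /* refill second */
                c->pos -= LINE;          /* a predicted taken transfer would
                                            instead refill both lines from
                                            the target */
            }
            const unsigned char *inst = c->buf + c->pos;
            c->pos += instLen(inst);     /* may cross the line boundary */
            return inst;                 /* valid until the next crack() */
        }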

        The cracker line fetch sequence is controlled by the predictor, so a predicted taken transfer causes the lines containing the target to replace the lines containing the transfer in the cracker. After the cracker the line relationship is lost. A miss discards the inflight instruction stream and starts the cracker with a new line-pair just as if it had acted on a predicted transfer.

        In those streams there’s no place for an arbitrarily large data structure like a jump table; it won’t be fetched as part of the line stream, and whatever part of it was in the remainder of the line containing the switch op would have been discarded as the cracker was refilled from the predicted switch target. Moreover, we don’t even know that we have a switch until two cycles after the line-cracker, and the cracker is quite likely to have followed one or two predictions before we know – the table is probably no longer in the decoder at all, and we have to be prepared to get it, or at least the desired address, from cache.

        Consequently, other than the potential value of maybe part of the table having been brought to the icache along with the instruction containing the switch, there’s no value in having the table adjacent to the switch op except that it gives a cheap way to find the address of the table.

        Now that we know the address of the table (having decoded the switch op enough to discover we need an address), we have to do an indexed load and pull up eight bytes that we can compare against the prediction we followed a couple of cycles ago to see if we went where we should have. That’s a round-trip to the cache. However, the icache is set up to do whole lines – and nothing but whole lines. So to get the case target address we have to pull up a line (needs an adder somewhere to add the table address and the case index), and isolate eight bytes out of that line, which means there’s a shifter somewhere too. Can’t use the ALUs or L/D address adders for the add; they are busy working down the predicted path. Can’t use the ALU shifters for the isolate; they aren’t line wide and are also busy. So we need an adder in the branch unit and an isolate shifter in the icache, equivalent to what is on the front of a dcache, and both adder and isolater are used for nothing but the rare case of a switch. And you have several more data paths to mux, and another command signal or packet for the icache to deal with.

        The latency is bad too. We don’t know we have a switch op until D2, plus a cycle for the add, plus maybe two-three cycles to get the line from the i$1 (the table won’t be in the i$0 microcache – I don’t even want to think about that), plus a cycle for the isolater, plus a cycle back and a cycle for the success comparison (eight or nine cycles all told), just to confirm that you did indeed predict correctly. Then what happens if the switch target (as predicted) immediately does a switch too? Is switch handling fully pipelined? Probably not – you will stall, but then you will be stalling anyway while you figure out the address.

        Muck.

        Nowadays the jump table (when we get around to it) is on the data side, which already has the address-adder for an indexed load and the isolaters in the data path from the D$1, so no logic need be added to support switches. The ISA has dynamic branches that take the target address from the belt, for which the prediction confirmation takes zero latency, just like any other branch. The indexed load is separate from the branch and so can be hoisted into the future as far as the switch argument value is known. With any luck the pointer will already be on the belt by the time we need it in the dynamic branch, and if there’s a data dependence that prevents hoisting the load then we are still no worse off than a D$1 latency.

        No added logic, no added data paths, no added muxes, no high fixed latency, no stalls (if hoistable).

        No muck.

  • Ivan Godard
    Keymaster
    Post count: 689

    p.s. For greater detail, here’s the comment from the specializer function that does this.

    /*
    Each of the ranges can be isolated by a “less than”
    comparison, so the maximal number of branches is the size
    of the ranges vector. A less-than compare requires one
    branch per range, while an equality compare requires one
    branch per non-default value. Consequently equality takes
    one branch for two ranges when a non-default range of extent
    one is adjacent to a default range whereas less-than would
    take two branches; a non-default range of extent greater
    than one takes one branch using less-than and a number of
    branches equal to the extent using equality; and a
    non-default branch of extent one adjacent to a non-default
    of any size takes one branch in both less-than and
    equality.

    The worst case is a run of extent-one ranges. If there are
    N ranges, then sequential equality needs N branches and
    will execute N/2 of them. With less-than you need one
    branch to split N to 2*(N/2), 2 to split 2*(N/2) into
    4*(N/4) and so on. That is 2N-1 total to the bottom level,
    and log2(N) levels. So less-than will always need more
    branches than equality, because the bottom level of a
    less-than is the same as an equality and there are all the
    upper levels on top of that.

    As for latency, the dividing point is when log2(N) is
    smaller than N/2; the break-even is 8. So we less-than
    down until the number of cases is eight or less, and do
    sequential equality after that.

    The job of this function is to partition
    the ranges by adding tests on pred that branch to interior nodes
    of the tree until the test is trivial and we can go to the leaf
    ebbs given by the switch statement.

    “home” is the original ebb that ended with the switch dispatch.

    “node” is the root ebb that ends with the switch, or recursively
    an inner node in the tree of ebbs that results from splitting
    the case space with lss operations.

    “pred” is the switch expression that is being tested; it will
    be one argument to all the comparisons in the tree.

    “ranges” is all the value ranges that must be resolved leafward
    of home in the splitting tree; each has a leaf ebb and a
    contiguous range of cases that go to that ebb.
    */
    void switchPartition(ebbId home, ebbId node, dropId pred,
                         row ranges) {
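
    (The body itself isn’t shown here. As a toy reconstruction of the strategy the comment describes, my sketch rather than Mill source, something like this prints the tree for sorted extent-1 cases:)

    #include <stdio.h>

    void partitionToy(const int *cases, int n, int depth) {
        if (n <= 8) {                        /* the break-even from above */
            for (int i = 0; i < n; ++i)      /* sequential equality */
                printf("%*seql %d -> leaf ebb\n", 2 * depth, "", cases[i]);
            printf("%*sbr default\n", 2 * depth, "");
            return;
        }
        int mid = n / 2;                     /* split the case space */
        printf("%*slss %d ? left : right\n", 2 * depth, "", cases[mid]);
        partitionToy(cases, mid, depth + 1);
        partitionToy(cases + mid, n - mid, depth + 1);
    }

    (Whether a real cost function splits exactly at the midpoint is a detail I’m hand-waving.)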

  • goldbug
    Participant
    Post count: 53

    Do you think you could provide an example? How would this code be compiled?

    
    int bar(int x) {
        switch(x) {
           case 0:
             return 1;
           case 2:
           case 3:
             return foo(x);
           case 5:
             return 5;
           case 6:
             return 6;
           case 7:
             return 7;
           case 100:
             break;
           default:
             return 4;
        }
        return 3;
    }
    
    • Ivan Godard
      Keymaster
      Post count: 689

      Sorry for the delay in response; a few posts here got blackholed in my mailer somehow 🙁

      I edited your example to provide a declaration for function foo, but it is otherwise unchanged. On Silver with the LLVM default and the specializer at -O1 (multiple ops per instruction but minimal other changes) you get this conAsm:

       F("bar") %0;
          br("bar$0_8"),
          rd(w(1ul)) %1;
              // L3C1F1=/home/ivan/mill/specializer/src/test1.c
      
      L("bar$0_1");
          br("bar$0_2", b0 %2),
          rd(w(4ul)) %2;                                              // L18C1F1
      
      L("bar$0_2") %3/0/1/2/4/5;
          retn(b0 %3/0/1/2/4/5);                                      // L21C1F1
      
      L("bar$0_7");
          br("bar$0_2", b0 %5),
          rd(w(3ul)) %5;                                              // L20C1F1
      
      L("bar$0_8");
          eql(b1 %0, 0) %7;
      
          brfl(b0 %7, "bar$0_9"),
          br("bar$0_2", b1 %1);
      
      L("bar$0_9");
          lsss(b2 %0, 2) %8;
      
          brfl(b0 %8, "bar$0_10"),
          br("bar$0_1");
      
      L("bar$0_10");
          eql(b3 %0, 4) %9;
      
          brfl(b0 %9, "bar$0_11"),
          br("bar$0_1");
      
      L("bar$0_11");
          eql(b4 %0, 5) %10;
      
          brfl(b0 %10, "bar$0_12"),
          br("bar$0_2", b5 %0);
      
      L("bar$0_12");
          eql(b5 %0, 6) %11;
      
          brfl(b0 %11, "bar$0_13"),
          br("bar$0_2", b6 %0);
      
      L("bar$0_13");
          eql(b6 %0, 7) %12;
      
          brfl(b0 %12, "bar$0_14"),
          br("bar$0_2", b7 %0);
      
      L("bar$0_14");
          lsss(b7 %0, 8) %13, nope();
      
          brfl(b0 %13, "bar$0_15");
      
          call1("foo", b8 %0) %4,                                     // L14C1F1
          br("bar$0_2", b0 %4);
      
      L("bar$0_15");
          eql(b9 %0, b0 %15) %14,
          con(w(100ul)) %15;
      
          brtr(b0 %14, "bar$0_7"),
          br("bar$0_1");
      

      This shows how the switch is turned into a cascade of tests for the various cases, nested cascades if there are enough cases. The code is poor because there are so many EBBs (extended basic blocks) and we would expect several mis-predictions as control worked its way to the actual working case. (Mis-predicts are cheap on the Mill, but far from free.)

      Working from the same genAsm from LLVM but with the specializer at -O3 you get this conAsm:

      F("bar") %0;
          eql(b2 %0, 4) %9,
          lsss(b2 %0, 2) %8,
          eql(b2 %0, 0) %7,
          retntr(b0 %7, b3 %1),                                       // L21C1F1
          retntr(b1 %8, b4 %2),                                       // L21C1F1
          retntr(b2 %9, b4 %2),                                       // L21C1F1
          rd(w(4ul)) %2,                                              // L18C1F1
          rd(w(1ul)) %1;
              // L3C1F1=/home/ivan/mill/specializer/src/test1.c
      
          eql(b5 %0, 7) %12,
          eql(b5 %0, 6) %11,
          eql(b5 %0, 5) %10,
          retntr(b0 %10, b8 %0),                                      // L21C1F1
          retntr(b1 %11, b8 %0),                                      // L21C1F1
          retntr(b2 %12, b8 %0);                                      // L21C1F1
      
          lsss(b8 %0, 8) %13, nope(),
          calltr1(b0 %13, "foo", b9 %0) %4;                           // L14C1F1
      
          eql(b14 %0, b0 %15) %14,
          con(w(100ul)) %15,
          retntr(b6 %13, b5 %4),                                      // L21C1F1
          retntr(b0 %14, b2 %5),                                      // L21C1F1
          retnfl(b0 %14, b14 %2),                                     // L21C1F1
          rd(w(4ul)) %17,                                             // L18C1F1
          rd(w(4ul)) %16,                                             // L18C1F1
          rd(w(3ul)) %5;                                              // L20C1F1
      

      All the excess EBBs have gone away, so there will be at most one misprediction in the switch, which is as good as a table-based switch can do, but without the possibility of a data miss on the table. As the whole code is four instructions, executing it will take at most four cycles (plus a miss if any), which is substantially better than the alternative.

      Some of what the specializer does in -O3 is new, and if you read the code carefully you can find a couple of places where the result is less than optimal. Have fun finding them 🙂

  • Veedrac
    Participant
    Post count: 25

    I had a play around with optimizing this with what I, probably incorrectly, think I understand about the hardware. Is the following a valid program for Silver?

    
    F('bar') %0;
        con    (w(100ul))              %1,  //          F3 ↥
        rd     (w(1ul))                %2,  // R0          ↥
        rd     (w(3ul))                %3,  // R0          ↥
        rd     (w(4ul))                %4,  // R1          ↥
        eql    (b4 %0,  b3 %1)         %5,  //    E0       ▸
        gtrs   (b4 %0,      7)         %6,  //    E1       ▸
        gtrs   (b4 %0,      3)         %7,  //    E2       ▸
        eql    (b4 %0,      0)         %8,  //    E3       ▸
        pick   (b0 %8,  b6 %2, b6 %4)  %9,  //       P0    ?
        retntr (b4 %5,  b6 %3),             //          F0 ↦
        retntr (b3 %6,  b5 %4),             //          F1 ↦
        retntr (b2 %7,  b9 %0);             //          F2 ↦

        gtrs   (b9 %0,      1)         %10, //    E0       ▸
        calltr1(b0 %10, 'foo', b10 %0) %11, //          F0 ↱
        retnfl (b1 %10, b2 %9),             //          F1 ↦
        retn   (b1 %11);                    //          F2 ↦
    
  • Ivan Godard
    Keymaster
    Post count: 689

    I’m very impressed that you could write this; reading conAsm gives me hives, and writing it manually is much worse 🙂 Your code looks reasonable to casual inspection, which was all I could give it. Silver has four ALUs and three controlFUs, so you haven’t exceeded the available slot population. Using the pick to avoid another retntr is clever but doesn’t actually buy much, because you have a free control slot in the second instruction and pick encoding is not that much smaller than retn encoding.

    You did catch the duplicate rd(w(4ul)) in the original, and that the second of the retntr/retnfl pair could be a simple retn, saving one argument morsel in the encoding.

    Another thing you caught is one we have LLVM to thank for, not our specializer: several of the result values are the same as the corresponding selector value (cases 5/6/7), and they could be collected by CSE into a group picked up by a lss() rather than eql()s, saving several rd/eql/retn sequences in the specializer output and letting you go from four to two instructions. Alas, LLVM doesn’t catch that and the author didn’t write it in the source; CSE/value numbering is not something to do in a mere code generator like the specializer. Still, it’s nice to know that there is optimization headroom even with the decent code we are now producing: you have an average ILP of 8 and 16 ops, while the original only got an ILP of 6 and 24 ops.
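
    In source terms, the missed transform would look something like this (my sketch):

    int bar2(int x) {
        if (x >= 5 && x <= 7) return x;   /* one lss pair covers cases 5, 6, 7 */
        /* ... the remaining cases handled as before ... */
        return 4;
    }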

    My thanks to @goldbug for an interesting test case. It not only shows how switches get encoded, but also demonstrates how phasing and width boosts the ILP.

    And @veedrac – please go to millcomputing.com->about->join us and consider what you find there 🙂

    • Veedrac
      Participant
      Post count: 25

      Nah, man, conAsm is awesome (says the guy who’s written ~20 lines total). I know exactly where everything is and where it will be, and I can see exactly what resources are available to make it happen. Much nicer than doing the same thing to an OoO core.

      The wiki says returns with values can only go in slots F0, F1 and F2, and I’ve used those all in the second instruction, so I don’t see how I could avoid using pick. Actually, I’m tempted to say it’s the other way around: I have a pick slot I didn’t use, so I’m wasting resources!

      I like how this lets you treat the problem as a resource optimization problem, with the ability to move parts from an overburdened phase to another without playing guessing games with the hardware. Pick was an example, but I can think of a lot, like speculating more when there’s spare capacity. Writing an optimizer for this sounds like a ton of fun, but I’m pretty sure my employer wouldn’t want me helping the competition ;).

      • Ivan Godard
        Keymaster
        Post count: 689

        Yes, you’re right, the pick was essential; I overlooked the call op in my (admitted cursory) look through.

        It’s not so much that returns with values need Fn as that any op using the controlFU does (there are three controlFUs on Silver), including br, call, retn, inner, leave, setjmp, longjmp and a few NYF. (As a special case, a retn with no result that is the only op in its instruction can be encoded as a flow skinny, but that’s just an encoding optimization; it still occupies a controlFU slot.) However, a Mill can be configured with more flow slots than it has controlFUs. That’s true for Silver, which has four flow slots of which only three carry controlFUs. The relevant part of the Silver spec is:

                            c->flowSlots = newRow<slot>(4);
                            c->flowSlots[0] << cacheFU << conFU << conformFU <<
                                controlFU << lsFU << lsbFU;
                            c->flowSlots[1] << conFU << conformFU << controlFU <<
                                lsFU << lsbFU << miscFU;
                            c->flowSlots[2] << conFU << conformFU << controlFU <<
                                lsFU << lsbFU << miscFU;
                            c->flowSlots[3] << conFU << lsFU << lsbFU << miscFU;
        

        As you see, the fourth slot lacks a controlFU. BTW, the provisioning of all the Mill test configurations (in the Metal series like Silver and elsewhere) can and will change as we tune for workload/market; don’t take this spec as stone-graven.

        As for writing conAsm manually: selecting the ops to match the source is quite easy because Mill ISA is high level and closely matches the HLL input; for any given source there’s usually only one way to do it on the core. Op placement (putting the selected ops into instructions with maximal density) is a little more difficult but not that bad. But then you have to determine the numeric value of belt references, which means you must in effect symbolically execute the program and keep track of where everything is on the belt, and that’s a real pain; I admit I did not check that part of your code. The pain is acute when there are in-flight ops started in one ebb that retire somewhere else after possibly several control transfers. Then lastly you must verify that a value you need hasn’t fallen off the belt, and insert spill/fill or rescue ops as necessary – and then go recalculate the belt numbers again. Software is much better at that sort of thing than are humans.
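
        A toy of that belt bookkeeping (my invention, not one of our tools; the belt length is per-member, and 16 here is just an assumption):

        #include <string.h>

        enum { BELTLEN = 16 };               /* assumed; varies by member */
        typedef struct { const char *name[BELTLEN]; int n; } Belt;

        void drop(Belt *b, const char *v) {  /* one result retires */
            if (b->n < BELTLEN) b->n++;      /* else the oldest falls off */
            for (int i = b->n - 1; i > 0; --i)
                b->name[i] = b->name[i - 1]; /* every belt number shifts */
            b->name[0] = v;                  /* the new value is b0 */
        }

        int beltPos(const Belt *b, const char *v) { /* the bN to encode */
            for (int i = 0; i < b->n; ++i)
                if (strcmp(b->name[i], v) == 0) return i;
            return -1;                       /* fell off: spill/rescue needed */
        }

        Every control transfer and in-flight retire reruns that renumbering, which is why doing it by hand is such a pain.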

        Remember us when you want something more fun than your present employer ๐Ÿ™‚ Mill Computing is a distributed virtual company; we have people all over the world.

  • goldbug
    Participant
    Post count: 53

    Is there a reference for conAsm? I can’t figure out what half the operations do.

    I am very impressed that the compiler & specializer managed to condense my test case to just 4 instructions! WOW.

  • Ivan Godard
    Keymaster
    Post count: 689

    You will find some info about the ISA on the Mill wiki: http://millcomputing.com/wiki/Main_Page. Be aware that some of the info there is outdated, especially the details about the provisioning of the various configurations. Unfortunately, the person who had been working on the Wiki had personal issues that prevented him from finishing, and no one has picked it up since. If you or any reader wants to really get into the architecture and is able to explain what you find to others then go to millcomputing.com -> About -> Join us; we have a home for you 🙂 Much of the ISA material is mechanically generated (Python) from the specs that drive all our software, but there’s English text needed too.

    As for the conAsm for your test case, the specializer output is decent, but what really impressed me was what Veedrac did with it, by hand and with as little conAsm documentation as there is. Kudos.

    As for the ops in the various code samples:
    con – arbitrary literal constant
    rd – popCon, specially recognized literal constant
    eql – integer equal compare
    lsss – signed integer less compare
    gtrs – signed integer greater compare
    retn{tr,fl} – conditional return
    retn – unconditional return
    br{tr,fl} – conditional branch
    br – unconditional branch
    pick – select one of two, i.e. a ? b : c
    calltr1 – conditional call with one belt result

  • goldbug
    Participant
    Post count: 53

    I finally managed to digest that code, impressive indeed!

    I can also see that it would be easy to make the entire function just one instruction on Gold.

    Veedrac also has the right idea about how to format conAsm. He puts the rd and con operations first, which more closely resembles phasing; he spaces instructions so that it is easy to visually scan the code; and he shows in comments the slot where each result goes. I am not sure what the arrows at the end mean though.

    • Ivan Godard
      Keymaster
      Post count: 689

      Gold has four controlFUs and four pick slots, so if the trailing two retns were pickified like Veedrac did, and the added control slot were used for the call, I think it fits. I’ll look into getting the specializer to do pickification.

    • Ivan Godard
      Keymaster
      Post count: 689

      FWIW, here’s what -O3 on Gold gives you. It still needs work.

      F("bar") %0;
          lsss(b3 %0, 2) %8,
          eql(b3 %0, 100) %14,
          eql(b3 %0, 0) %7,
          retntr(b1 %7, b4 %1),                                       // L22C1F1
          retntr(b3 %8, b0 %17),                                      // L22C1F1
          pick(b1 %14, b4 %2, b5 %5) %17,
          rd(w(3ul)) %5,                                              // L21C1F1
          rd(w(4ul)) %2,                                              // L19C1F1
          rd(w(1ul)) %1;
              // L4C1F1=/home/ivan/mill/specializer/src/test3.c
      
          eql(b7 %0, 7) %12,
          eql(b7 %0, 6) %11,
          eql(b7 %0, 5) %10,
          eql(b7 %0, 4) %9,
          retntr(b0 %9, b4 %17),                                      // L22C1F1
          retntr(b1 %10, b11 %0),                                     // L22C1F1
          retntr(b2 %11, b11 %0),                                     // L22C1F1
          retntr(b3 %12, b11 %0);                                     // L22C1F1
      
          lsss(b13 %0, 8) %13,
          calltr1(b0 %13, "foo", b14 %0) %4,                          // L15C1F1
          retntr(b1 %13, b0 %4),                                      // L22C1F1
          retn(b8 %17),                                               // L22C1F1
          rd(w(4ul)) %16,                                             // L19C1F1
          rd(w(4ul)) %15;                                             // L19C1F1

      Gold has just enough ALUs (8 of 8), and enough popCon slots (5 of 8, although two are redundant) and pick slots (1 of 4), to do this in one instruction, but this code needs 9 of 4 control slots, so three instructions is the best it can do without pickifying some of the retns. Even using all the pick slots we only drop the flow count to 6, or two instructions. To get to one, we have to recognize that the return value is the selector value (which I’m sure we’re not going to get LLVM to do).

      Still, an ILP of 7.6 ain’t too shabby 🙂

    • Veedrac
      Participant
      Post count: 25

      @goldbug

      The letter-number pairs are the slots, and the arrows are the phase (reader ↥, compute ▸, call ↱, pick ?, writer ↧, conform =, transfer ↦). The symbol choice is from http://millcomputing.com/instructions.html, with the phase column collapsed.

      Though if I had my say it’d be reader ⤒, compute ⏩, call ↺, writer ↧, conform ♺, transfer ⤷.

  • Veedrac
    Participant
    Post count: 25

    I also tried my hand at this guy for Gold, and got

    
    // ILP of 5⅓, mostly because of the silly load latency :P.
    F("parseval") %str;
        load(%str, 0, b) %l0,
        load(%str, 1, b) %l1, // I'd fill this with loads if the two loops had the
        load(%str, 2, b) %l2; // same cycle length, but alas they do not.
    
        eqlb(%l0, 45) %neg,
        addp(%str, 1, b) %strplus1,
        pick(%neg, %strplus1, %str) %start,
        pick(%neg, %l1, %l0) %L0,
        pick(%neg, %l2, %l1) %L1;
    
        // Now we can finally start doing fun stuff.
        rd(b(0)) %zero,
        load(%start, 2, b) %char,     // Not loading early means we take a stall in hexmode,
        eqlb(%L0,   48) %L00,         // but loading early means decimalmode gets early loads. :(
        eqlb(%L1,  120) %L1x,
        addp(%start, 2) %startplus2,
        pick(%L00, %L1x, %L00) %hexmode,
        innertr(%hexmode, "parseval$hexmode", %zero, %startplus2, %neg),
        inner(%decimalmode, "parseval$decimalmode", %zero, %startplus2, %L1, %L0, %neg);
    
    // ILP of 9! Cycle length equals loop-carried dependency.
    L("parseval$hexmode") %total %str %neg;
        // a wild %char appears, right on time!
        load  (%str, b, 2) %char,     // Ideally this load would be fetching further ahead,
        gequ  (%char,  48) %digitlo,  // but we want one byte every 2 cycles, whereas the
        lequ  (%char,  57) %digithi,  // other branch wants one every 3 cycles, so you can't
        gequ  (%char,  97) %letterlo, // hoist the setup into the header.
        lequ  (%char, 102) %letterhi,
        sub   (%char,  48) %digitval,
        sub   (%char,  87) %letterval,
        shiftl(%total,  4) %shifttot,
        addp  (%str, 1, b) %newstr,
        pick  (%digitlo,  %digithi,  %digitlo)  %digit,
        pick  (%letterlo, %letterhi, %letterlo) %letter;
    
        orl   (%digit,    %letter)    %inrange,
        add   (%shifttot, %digitval)  %digittot,
        add   (%shifttot, %letterval) %lettertot,
        pick  (%digit, %digittot, %lettertot) %newtotal,
        conform(%newtotal, %newstr, %neg, %total, %char),
        leavefl(%inrange, "parseval$finish", %total, %char, %neg),
        br("parseval$hexmode");
    
    // ILP of only 4⅓. Cycle length equals loop-carried dependency.
    // Not much one can do when the multiply is slow... :(
    L("parseval$decimalmode") %total %str %newchar %char %neg;
        // A wild %newnewchar appears... six cycles early.
        // That's pretty lame, but fixing it takes a branch.
        load  (%str, b, 3) %newnewchar,
        gequ  (%char,  48) %digitlo,
        lequ  (%char,  57) %digithi,
        addp  (%str, 1, b) %newstr,
        sub   (%char,  48) %digitval,
        shiftl(%total,  1) %totalx2,
        shiftl(%total,  3) %totalx8,
        pick  (%digitlo, %digithi,  %digitlo) %digit,
        leavefl(%digit, "parseval$finish", %total, %char, %neg);
    
        // This is a very sad instruction.
        add   (%totalx2,  %totalx8)  %totalx10;
    
        add   (%totalx10, %digitval) %newtotal,
        rescue(%newtotal, %newstr, %newnewchar, %newchar, %neg),
        br("parseval$hexmode");
    
    // ILP of 4½. Meh, it works.
    L("parseval$finish") %total %char %neg;
        // a wild %fakechar appears
        eqlb(%char, 75) %K,
        eqlb(%char, 77) %M,
        neg(%total) %negtotal,
        pick(%neg, %negtotal, %total) %totalsigned;
    
        shiftl(%totalsigned, 10) %totalK,
        shiftl(%totalsigned, 20) %totalM,
        pick(%K, %totalK, %totalsigned) %totalelse,
        retntr(%M, %totalM),
        retn(%totalelse);
    

    Interesting architecture indeed. I’ve not handled the busywork belt numbers, but I’ve checked they’re in a consistent position. I expect one’s better off having a branch before entering either loop where you flood it with correctly-delayed loads ahead of time; that way you don’t risk stalls in the loop itself, but it adds a cycle (and a jump) to every call to parseval. Delayed loads don’t work because the loop count is different, and tagged loads don’t work because they don’t pipeline.

    • Ivan Godard
      Keymaster
      Post count: 689

      I’m just blown away!

      The great bulk of what you have done to improve the code is software pipelining, which the Mill is designed for. The pipelining done in our tool chain is still experimental (i.e. breaks on everything), so I can’t give you a comparison yet, but in-loop ILPs of 9 are actually low on high-end Mills because you can pipeline up overlapping iterations until you run out of FUs. Then there’s vectors…

      Some minor notes on your code:
      1) the “retire” operation (a.k.a. “Dave’s Device” here in-house) will let you get rid of all the loop setup (prologue) code, so there is no gain from sharing the prologue across the two loops.
      2) “leave” discards all inflight operations; you don’t have to worry about multiplies or loads that you start in the loop dropping after you exit
      3) the “conform” op is no more; it was obviated by letting branches supply a carry list. Similarly, the “rescue” in your code can be folded into the following branch op.
      4) Long dependency chains in open (non-loop) code produce “sad” instructions even on a Mill ๐Ÿ™‚
      5) Your parseval$decimalmode EBB should end with a branch back to the same EBB; sending it to parseval$hexmode is a bug, I think.
      6) It is possible to write the hex loop to eat a new character every cycle; the code has to split the OR reduction in two, shifting by two nibbles out of phase with each other so the shifted nibbles interleave. The interleave can be running in the loop, or a final OR after it if the loop exports both partials; our pipeliner tries to move all reductions to after the loop, but has problems with that (see the sketch after this list). Have fun 🙂
      7) If you get hexMode to one byte per cycle, try the same on the decimalMode loop. Hint: unroll by three, then schedule on a torus so that corresponding operations on different iterations schedule at the same point. It’s fairly easy when there are no loop-carried dependencies, but it gets harder when, as here, the loop is a reduction.
      8) and if you got all that, remember that Silver has two mulFUs and Gold has four, so you can in fact eat 2 or 4 bytes per cycle. You were regretting “only” 4.3 ILP; try ILP 19. Mill eats loops for breakfast, lunch and dinner (I just love to brag about the baby).
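
      Here’s a C sketch of the split reduction in note 6 (my reading of it; the hexval helper and the final combine rule are mine). Even- and odd-position digits feed two partials, each shifted two nibbles per pair, so each loop-carried chain advances only once per two characters:

      static unsigned hexval(char c) {      /* digits assumed valid */
          return c <= '9' ? (unsigned)(c - '0') : (unsigned)(c - 'a' + 10);
      }

      unsigned parseHexSplit(const char *s, int n) {
          unsigned a = 0, b = 0;            /* the two out-of-phase partials */
          for (int i = 0; i + 1 < n; i += 2) {
              a = (a << 8) | hexval(s[i]);      /* even-position digits */
              b = (b << 8) | hexval(s[i + 1]);  /* odd-position digits */
          }
          if (n & 1) {                      /* odd count: last digit joins a */
              a = (a << 8) | hexval(s[n - 1]);
              return (b << 4) | a;          /* b holds the higher nibbles */
          }
          return (a << 4) | b;              /* interleave the two partials */
      }

      With the two chains scheduled a cycle apart, the loop retires one character per cycle, which is the claim in note 6.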

  • Veedrac
    Participant
    Post count: 25

    1. Sure, you can use retire to get you that fetch-ahead load if you don’t mind stalling while the loop warms up. Alternatively you can just fix the instruction set so I can do it the right way ;).

    2. Neat, I knew there must be something like that. It was too obvious to overlook.

    6, 7, 8. I explicitly chose not to unroll or vectorize to make the comparison more meaningful. I’m sure these things work, but I tried to stick to things an OoO core would be trying to do in hardware. No maths tricks beyond what I could believe stock GCC would give.

    • Ivan Godard
      Keymaster
      Post count: 689

      Re #1: the expectation of the ISA is that a loop will start one (or more) iterations every cycle, and will saturate the load slots; for a variety of hardware reasons load slots can’t just be cookie-cuttered out the way that ALUs can, and we expect load to be a scaling constraint. If it is, then a manual prologue is no better than a retireOp-based prologue. Of course, loads will not always be the constraint: for those cases, what would you like to see in the ISA that would let you “do it right”?

  • Veedrac
    Participant
    Post count: 25

    I’d love to explain the thing you’re missing and how to fix it, but alas I expect there exist lawyers that would be upset if I did. You’re going to have to take what you’ve got.

  • Ivan Godard
    Keymaster
    Post count: 689

    Lawyers do get in the way of things, don’t they? We have our own things we can’t talk about yet too. 🙁

  • PeterH
    Participant
    Post count: 41

    At the risk of diverging the topic, I’m wondering at the impact of different optimization settings on code size. With the architectures I’m familiar with, -O3 and above have an unfortunate tendency to produce larger code in the interest of speed. But techniques like unwinding loops don’t appear to be desirable on the Mill.

    • Ivan Godard
      Keymaster
      Post count: 689

      There’s a -Os setting to specialize for space. At present it shuts off cloning speculable operations up into inbound control flow arcs, except in the special case where the EBB has only one op (commonly a retn), so there is no size bloat because the cloned op replaces the otherwise necessary branch.

      We don’t unroll loops at present, and actually have had quite a bit of argument with the LLVM code that really really wants to unroll. There is one case where we know we want to unroll, in loop pipelines in which there is enough hardware functionality to do more than one iteration concurrently and there are no loop-carried dependencies. However, the specializer isn’t smart enough yet to do that.

  • Ghadi
    Participant
    Post count: 4

    I’m very excited to see how to implement an interpreter minus megamorphic dispatch. Is the talk from Wednesday available somewhere?

    • Ivan Godard
      Keymaster
      Post count: 689

      It usually takes a couple of weeks to do the video editing. There will be an announcement mail when it goes up.

  • goldbug
    Participant
    Post count: 53

    So I come back to the site to check what’s new and I find a 1:20-hour lecture on how you compiled my code, and you even credited me for it.

    Priceless!

    Tx Ivan, your talks are really interesting and fun to watch.

    • Ivan Godard
      Keymaster
      Post count: 689

      You’re welcome. It was a nice tidy test case that pointed up different aspects of the architecture, which is actually hard to do – most tests are too big or too complex to grok in the context of a talk. Feel free to cite the credit in your CV 🙂 Veedrac got a credit too, did you notice?

      • goldbug
        Participant
        Post count: 53

        I sure did, at the very end. Well deserved, his code was amazing!

        I’ll keep an eye out for other topics and maybe I can do it again :).

  • goldbug
    Participant
    Post count: 53

    Sorry if it looks like I am spamming, the forum keeps deleting my post.

    The pick operation can be used to build a range check. For example, to check if (x >= 4 && x <= 7) one could write:

    F("bar") %0;
        rd     (w(0ul))               %1,
        geqs   (b1 %0, 4)             %2,
        leqs   (b2 %0, 7)             %3,
        pick   (b1 %2, b0 %3, b2 %1)  %4;

    Range checks would be fairly common in switches. In this example, if (2 <= x && x <= 3) it should call foo(), and if (4 <= x && x <= 7) it can return x. It seems like a fairly simple optimization to add to the specializer instead of a bunch of eql. Range checks could also be useful to perform array bounds check.

    The only issue, if I understand phasing correctly, is that the result of a pick cannot be used in a calltr1 within the same instruction.

    I must admit I scratched my head about the fma operation. For the life of me I can’t imagine it being used that often. It seems like an oddly specific instruction to add, but maybe I am missing something. If you can have operations that use more than one ALU, then perhaps a “between” operation that does a range check would be useful. As shown above it can easily be done with a pick, but there could be a gain if the result of the between operation was usable in calltr1.

    • Ivan Godard
      Keymaster
      Post count: 689

      Pick is a good way to implement the within() and without() functions (or their inverses includes() and excludes()). For a range check that involves branching rather than a boolean result, one can just use two branches. For calltr1 you need a single bool, and the pick works. Yes, the predicates on predicated ops like calltr1 can see a pick; ain’t phasing wonderful?

      We don’t generate range checks in switches; just a chain of less-than compares. Each less-than peels off the next range in sorted order.

      • goldbug
        Participant
        Post count: 53

        That is cool that you can use the pick results in the calltr. In the documentation it sounds like the pick phase is after the call phase so I was not sure.

        The chain of less-than is a great approach if you have a bunch of returns or branches, but it is not great when you have a bunch of calls. Consider this example:

        
        int foo(int x)
        {
           switch(x) {
              case 0:
              case 1:
              case 2:
                 return bar();
              case 3:
              case 4:
              case 5:
                 return baz();
              default:
                 return 0;
           }
        }

        Since calls don’t stop execution, if you try to do a chain of lsss you will end up making calls you shouldn’t, because the predicates overlap. You can do this in one instruction if you checked ranges instead. Like this:

        F("foo") %0;
            rd     (w(0ul))     %1,
            geqs   (%0, 0)      %2,
            leqs   (%0, 2)      %3,
            leqs   (%0, 5)      %4,
            pick   (%2, %3, %1) %5,    // range check x >= 0 && x <= 2
            pick   (%3, %1, %4) %6,    // range check !(x <= 2) && x <= 5, same as x >= 3 && x <= 5
            calltr1 (%5, "bar") %7,    // call only if in range; a simple lsss would not be enough
            calltr1 (%6, "baz") %8,    // call only if in range; a simple lsss would not be enough
            retntr (%5, %7),
            retntr (%6, %8),
            retn (%1);
        

        I would think that switching function calls is common, perhaps not with ranges though, so maybe I am overthinking it.

        • Ivan Godard
          Keymaster
          Post count: 689

          Pick phase is after call phase, and you got it right from the doc. For some reason I was thinking of branches, for which the predicate is evaluated in writerPhase, after the pick. Sorry for the brain freeze 🙁

          Hence your pick-and-call needs an extra instruction. As for ranges of calls, currently the compiler transforms your example into (in effect):

          if (x < 0) return 0;
          else if (x < 3) return bar();
          else if (x < 6) return baz();
          else return 0;
          

          Your pick-based solution is better than this, even with the extra instruction, but it’s unclear that the case is common enough to look for.

          • goldbug
            Participant
            Post count: 53

            I agree, it probably is not that common.

            Regarding phasing: it would be highly undesirable to speculate calls. Calls will probably be expensive, may have side effects, and might cause your current belt to spill/fill, unlike say an add or an eql that have no side effects.

            Therefore, would it not make more sense to move the pick phase up in front of the call phase, so that you can use picks to avoid calls? I assume pick is currently after the calls because calls can return values, but if you think about it, you would never want to execute a call speculatively, which greatly diminishes the value of picking call results.

            This is a much more generic and common occurrence than the switch ranges.

          • Ivan Godard
            Keymaster
            Post count: 689

            The machine has no physical ordering required here, so the choice of pick vs call order is arbitrary. We chose pick-after-call several years ago, before there were predicated calls, based on whether we thought f()?x:y was more or less common than b?f():x. Perhaps we should re-examine that. The way our devtech works, it would be three one-line changes in the code base to switch, and all the tools would be using the other order.

            Or we might leave it as is until we have enough of a testbase for measurement both ways 🙂

          • goldbug
            Participant
            Post count: 53

            Hmmm, on second thought, maybe picking using a function result as selector is not that uncommon, for example:

            
            if (isEmpty(mylist))
                return 3;
            else
                return 5;
            

            Sorry if I am wasting precious Ivan cycles.

  • goldbug
    Participant
    Post count: 53

    I thought of a better way. The Mill supports vector reducing operations like all and any. I don’t know the syntax for them, but perhaps the between operation could take a number and a vector and tell you if the number is in that range. For example:

    
    F("bar") %0;
        con     (v(4ul, 7ul))         %1,  // whatever syntax to build a vector
        betw    (b1 %0, b0 %1)        %2;  // true if 4 <= x && x <= 7

    Since the value is a vector, it can use only one functional unit to calculate whether x is within the range. That would make switch ranges trivial to optimize.

    • Ivan Godard
      Keymaster
      Post count: 689

      The basic problem with using the vector hardware for scalar operations is what do you do when you want to do a vector betw()? We have gone to considerable trouble to make sure that every scalar operation can be applied element-wise to a vector. The reverse is not true: there are reducing vector operations (like any() and all()) that are not meaningful in scalar.

      • goldbug
        Participant
        Post count: 53

        I see the point. As it is, the instruction set is beautiful for its uniformity, and that betw operation would break it.
