• Author
  • Ivan Godard
    Post count: 418
    #2818 |

    An exchange from comp.arch repeated here:
    @Stephen Fuld:
    The C switch statement, and equivalents in other languages, are interesting. They allow substantial economy of source code, and quite compact generated code, but they do have some issues. The indexed load of the target address is frequently a cache miss, and it is followed immediately by an indexed jump, that is almost always a branch mispredict. Which leads me to my question. Does the Mill have any tricks up it sleeve that make switch statements less costly?

  • Ivan Godard
    Post count: 418

    The compiler currently turns switches into a binary search branch tree. The case space is partitioned by lss-than compares down to the level at which equal-compares are better, and then equal-compares thereafter. The logic deals well with runs of consecutive values in a case, such as all alphabetic characters; less-than compares are cheap for that. We do not currently generate a jump table. Although early versions of the ISA had jump tables in the code adjacent the the instruction that used them, we took the operation out; code side jump tables seriously muck up the code data paths.

    This strategy is tuned for the wide issue and multi-branch capability in larger Mill family members. For a member that is Silver or wider all the equal compares and their branches fit in one instruction, and a big hunk of the less-than tree will be in prior instructions of the same ebb. Consequently execution can get well down the tree before one of the branches is actually taken and search enters a sub-tree. As the whole ebb has only one prediction (where it will exit to) there are many fewer misses than would occur on a legacy architecture that does branch prediction in a search tree. A bit of arithmetic will show how many taken branches (on average) are necessary to get to the target case (assuming cases are equiprobable and sparse, which is the worst case) for any member configuration of ALU exu slots (for the compares) and branch flow lots (for the branches).

    This works quite well for switches up to the size of say a lexer, but will obviously fail for something like a YACC grammar with cases in the thousands. For those we need to add a cost function and data-side dispatch tables (Mill branches already support label variables) to the specializer; we just haven’t gotten around to that yet.

    • Ivan Godard
      Post count: 418

      In what way do code side jump tables muck up the code data paths?
      For example, if there was an indexed branch instruction immediately
      followed in the code by a table of displacements, how would that
      be worse than any other kind of jump?


      • Ivan Godard
        Post count: 418

        That was pretty much what our initial design did, except that it was a real switch op, not just an indexed branch, and bounds-checked the argument into the default case.

        The mucking arose because the Mill code-side is both heavily streamed and heavily pipelined. It’s a fire-hose and the code side has to do all it can to keep everything out of the way. The whle code side is done on a whole-cache-line basis. Recall that each instruction is variable length and may be up to a whole line long and may cross a line boundary. The instruction header says how long it is, so the front end of decode is an instruction shifter that is two lines wide and cracks the line stream into an instruction stream.

        The cracker line fetch sequence is controlled by the predictor, so a predicted taken transfer causes the lines containing the target to replace the lines containing the transfer in the cracker. After the cracker the line relationship is lost. A miss discards the inflight instruction stream and starts the cracker with a new line-pair just as if it had acted on a predicted transfer.

        In those streams there’s no place for an arbitrarily large data structure like a jump table; it won’t be fetched as part of the line stream, and whatever part of it was in the remainder of the line containing the switch op would have been discarded as the cracker was refilled from the predicted switch target. Moreover, we don’t even know that we have a switch until two cycles after the line-cracker, and the cracker is quite likely to have followed one or two predictions before we know – the table is probably no longer in the decode at all any more, and we have to be prepared to get it, or at least the desired address, from cache.

        Consequently, other than the potential value of maybe part of the table having been brought to the icache along with the instruction containing the switch, there’s no value in having the table adjacent to the switch op except that it gives a cheap way to find the address of the table.

        Now that we know the address of the table (having decoded the switch op enough to discover we need an address), we have to do an indexed load and pull up eight bytes that we can compare against the prediction we followed a couple of cycles ago to see if we went where we should have. That’s a round-trip to the cache. However, the icache is set up to do whole lines – and nothing but whole lines. So to get the case target address we have to pull up a line (needs an adder somewhere to add the table address and the case index), and isolate eight bytes out of that line, which means there’s a shifter somewhere too. Can’t use the ALUs or L/D address adders for the add; they are busy working down the predicted path. Can’t use the ALU shifters for the isolate; they aren’t line wide and are also busy. So we need an adder in the branch unit and an isolate shifter in the icache, equivalent to what is on the front of a dcache, and both adder and isolater are used for nothing but the rare case of a switch. And you have several more data paths to mux, and another command sugnal or packet for the icache to deal with

        The latency is bad too. We don’t know we have a switch op until D2, plus a cycle for the add, plus maybe two-three cycles to get the line from the i$1 (the table won’t be in the i$0 microcache – I don’t even want to think about that), plus a cycle for the isolater, plus a cycle back and a cycle for the success comparison, just to confirm that you did indeed predict correctly. Then what happens if the switch target (as predicted) immediately does a switch too? Is switch handling fully pipelined? Probably not – you will stall, but then you will be stalling anyway while you figure out the address.


        Nowadays the jump table (when we get around to it) is on the data side, which already has the address-adder for an indexed load and the isolaters in the data path from the D$1, so no logic need be added to support switches. The ISA has dynamic branches that take the target address from the belt, for which the prediction confirmation takes zero latency, just like any other branch. The indexed load is separate from the branch and so can be hosted into the future as far as the switch argument value is known. With any luck the pointer will already be on the belt by the time we need it in the dynamic branch, and if there’s a data dependence that prevents hoisting the load then we are still no worse off than a D$1 latency.

        No added logic, no added data paths, no added muxes, no high fixed latency, no stalls (if hoistable).

        No muck.

  • Ivan Godard
    Post count: 418

    p.s. For greater detail, here’s the comment from the specializer function that does this.

    Each of the ranges can be isolated by a “less than”
    comparison, so the maximal number of branches is the size
    of the ranges vector. A less-than compare requires one
    branch per range, while an equality compare requires one
    branch per non-default value. Consequently equality takes
    one branch for two ranges when a non-default range of extent
    one is adjacent to a default range wheres less-than would
    take two branches; a non-default range of extent greater
    than one takes one branch using less-than and a number of
    branches equal to the extent using equality; and a
    non-default branch of extent one adjacent to a non-default
    of any size takes one branch in both less-than and

    The worst case is a run of extent-one ranges. If there are
    N ranges, then sequential equality needs N branches and
    will execute N/2 of them. With less-than you need one
    branch to split N to 2*(N/2), 2 to split 2*(N/2) into
    4*(N/4) and so on. That is 2N-1 total to the bottom level,
    and log2(N) levels. So less-than will always need more
    branches than equality, because the bottom level of a
    less-than is the same as an equality and there are all the
    upper levels on top of that.

    As for latency, the dividing point is when is log2(N)
    smaller than N/2; the break-even is 8. So we less-than
    down until the number of cases is eight or less, and do
    sequential equality after that.

    The job of this function is to partition
    the ranges by adding tests on pred that branch to interior nodes
    of the tree until the test is trivial and we can go to the leaf
    ebbs given by the switch statement.

    “home” is the original ebb that ended with the switch dispatch.

    “node” is the root ebb that ends with the switch, or recursively
    an inner node in the tree of ebbs that results from splitting
    the case space with lss operations.

    “pred” is the switch expression that is being tested; it will
    be one argument to all the comparisons in the tree.

    “ranges” is all the value ranges that must be resolved leafward
    of home in the splitting tree; each has a leaf ebb and a
    contiguous range of cases that go to that ebb.
    void switchPartition(ebbId home, ebbId node, dropId pred,
    row ranges) {

  • goldbug
    Post count: 6

    Do you think you could provide an example? How would this code be compiled?

    int bar(int x) {
        switch(x) {
           case 0:
             return 1;
           case 2:
           case 3:
             return foo(x);
           case 5:
             return 5;
           case 6:
             return 6;
           case 7:
             return 7;
           case 100:
             return 4;
        return 3;
    • This reply was modified 2 months ago by  goldbug.
    • Ivan Godard
      Post count: 418

      Sorry for the delay in response; a few posts here got blackholed in my mailer somehow ๐Ÿ™

      I edited your example to provide a declaration for function foo, but it is otherwise unchanged. On Silver with the LLVM default and the specializer at –O1 (multiple ops per instruction but minimal other changes) you get this conAsm:

       F("bar") %0;
          rd(w(1ul)) %1;
              // L3C1F1=/home/ivan/mill/specializer/src/test1.c
          br("bar$0_2", b0 %2),
          rd(w(4ul)) %2;                                              // L18C1F1
      L("bar$0_2") %3/0/1/2/4/5;
          retn(b0 %3/0/1/2/4/5);                                      // L21C1F1
          br("bar$0_2", b0 %5),
          rd(w(3ul)) %5;                                              // L20C1F1
          eql(b1 %0, 0) %7;
          brfl(b0 %7, "bar$0_9"),
          br("bar$0_2", b1 %1);
          lsss(b2 %0, 2) %8;
          brfl(b0 %8, "bar$0_10"),
          eql(b3 %0, 4) %9;
          brfl(b0 %9, "bar$0_11"),
          eql(b4 %0, 5) %10;
          brfl(b0 %10, "bar$0_12"),
          br("bar$0_2", b5 %0);
          eql(b5 %0, 6) %11;
          brfl(b0 %11, "bar$0_13"),
          br("bar$0_2", b6 %0);
          eql(b6 %0, 7) %12;
          brfl(b0 %12, "bar$0_14"),
          br("bar$0_2", b7 %0);
          lsss(b7 %0, 8) %13, nope();
          brfl(b0 %13, "bar$0_15");
          call1("foo", b8 %0) %4,                                     // L14C1F1
          br("bar$0_2", b0 %4);
          eql(b9 %0, b0 %15) %14,
          con(w(100ul)) %15;
          brtr(b0 %14, "bar$0_7"),

      This shows how the switch is turned into a cascade of tests for the various cases, nested cascades if there are enough cases. The code is poor because there are so many EBBs (extended basic blocks) and we would expect several mis-predictions as control worked it way to the actual working case. (Mis-predicts are cheap on the Mill, but far from free).

      Working from the same genAsm from LLVM but with the specializer at –O3 you get this conAsm:

      F("bar") %0;
          eql(b2 %0, 4) %9,
          lsss(b2 %0, 2) %8,
          eql(b2 %0, 0) %7,
          retntr(b0 %7, b3 %1),                                       // L21C1F1
          retntr(b1 %8, b4 %2),                                       // L21C1F1
          retntr(b2 %9, b4 %2),                                       // L21C1F1
          rd(w(4ul)) %2,                                              // L18C1F1
          rd(w(1ul)) %1;
              // L3C1F1=/home/ivan/mill/specializer/src/test1.c
          eql(b5 %0, 7) %12,
          eql(b5 %0, 6) %11,
          eql(b5 %0, 5) %10,
          retntr(b0 %10, b8 %0),                                      // L21C1F1
          retntr(b1 %11, b8 %0),                                      // L21C1F1
          retntr(b2 %12, b8 %0);                                      // L21C1F1
          lsss(b8 %0, 8) %13, nope(),
          calltr1(b0 %13, "foo", b9 %0) %4;                           // L14C1F1
          eql(b14 %0, b0 %15) %14,
          con(w(100ul)) %15,
          retntr(b6 %13, b5 %4),                                      // L21C1F1
          retntr(b0 %14, b2 %5),                                      // L21C1F1
          retnfl(b0 %14, b14 %2),                                     // L21C1F1
          rd(w(4ul)) %17,                                             // L18C1F1
          rd(w(4ul)) %16,                                             // L18C1F1
          rd(w(3ul)) %5;                                              // L20C1F1

      All the excess EBBs have gone away, so there will be at most one misprediction in the switch, which is as good as a table-based switch can do, but without the possibility of a data miss on the table. As the whole code is four instructions, executing it will take at most four cycles (plus a miss if any) which is substantially better than the alternative.

      Some of what the specializer does in –O3 is new, and if you read the code carefully you can find a couple of places where the result is less than optimal. Have fun finding them ๐Ÿ™‚

      • This reply was modified 1 month, 2 weeks ago by  Ivan Godard.
  • Veedrac
    Post count: 11

    I had a play around with optimizing this with what I, probably incorrectly, think I understand about the hardware. Is the following a valid program for Silver?

    F('bar') %0;
        con    (w(100ul))              %1,  //          F3 โ†ฅ
        rd     (w(1ul))                %2,  // R0          โ†ฅ
        rd     (w(3ul))                %3,  // R0          โ†ฅ
        rd     (w(4ul))                %4,  // R1          โ†ฅ
        eql    (b4 %0,  b3 %1)         %5,  //    E0       โ–ธ
        gtrs   (b4 %0,      7)         %6,  //    E1       โ–ธ
        gtrs   (b4 %0,      3)         %7,  //    E2       โ–ธ
        eql    (b4 %0,      0)         %8,  //    E3       โ–ธ
        pick   (b0 %8,  b6 %2, b6 %4)  %9,  //       P0    ?
        retntr (b4 %5,  b6 %3),             //          F0 โ†ฆ
        retntr (b3 %6,  b5 %4),             //          F1 โ†ฆ
        retntr (b2 %7,  b9 %0);             //          F2 โ†ฆ
        gtrs   (b9 %0,      1)         %10, //    E0       โ–ธ
        calltr1(b0 %10, 'foo', b10 %0) %11, //          F0 โ†ฑ
        retnfl (b1 %10, b2 %9),             //          F1 โ†ฆ
        retn   (b1 %11);                    //          F2 โ†ฆ
  • Ivan Godard
    Post count: 418

    I’m very impressed that you could write this; reading conAsm gives me hives, and writing it manually is much worse ๐Ÿ™‚ Your code looks reasonable to casual inspection, which was all I could give it. Silver has four ALUs and three controlFUs, so you haven’t exceeded the available slot population. Using the pick to avoid another retntr is clever but doesn’t actually buy much, because you have a free control slot in the second instruction and pick encoding is not that much smaller than retn encoding.

    You did catch the duplicate rd((w(4)) in the original, and that the second of the retntr/retnfl pair could be a simple retn, saving one argument morsel in the encoding.

    Another thing you caught we have LLVM to thank for, not our specializer: several of the result values are the same as the corresponding selector value (cases 5/6/7), and could be collected by CSE into a group picked up by a lss() rather than eql()s, saving several rd/eql/retn sequences in the specializer output and letting you go from four to two instructions. Alas, LLVM doesn’t catch that and the author didn’t write it in the source; CSE/value numbering is not something to do in a mere code generator like the specializer. Still, it’s nice to know that there is optimization headroom still, even with the decent code we are now producing: you have an average ILP of 8 and 16 ops, while the original only got 6 ILP and 24 ops.

    My thanks to @goldbug for an interesting test case. It not only shows how switches get encoded, but also demonstrates how phasing and width boosts the ILP.

    And @veedrac – please go to millcomputing.com->about->join us and consider what you find there ๐Ÿ™‚

    • Veedrac
      Post count: 11

      Nah, man, conAsm is awesome (says they guy that’s written ~20 lines total). I know exactly where everything is and where it will be, and I can see exactly what resources are available to make it happen. Much nicer than doing the same thing to an OoO core.

      The wiki says returns with values can only go in slots F0, F1 and F2, and I’ve used those all in the second instruction, so I don’t see how I could avoid using pick. Actually, I’m tempted to say it’s the other way around: I have a pick slot I didn’t use, so I’m wasting resources!

      I like how this lets you treat the problem as a resource optimization problem, with the ability to move parts from an overburdened phase to another without playing guessing games with the hardware. Pick was an example, but I can think of a lot, like speculating more when there’s spare capacity. Writing an optimizer for this sounds like a ton of fun, but I’m pretty sure my employer wouldn’t want me helping the competition ;).

      • Ivan Godard
        Post count: 418

        Yes, you’re right, the pick was essential; I overlooked the call op in my (admitted cursory) look through.

        It’s not so much returns with values need Fn, as any op that uses the controlFU (of which there are three on Silver), including br, call, retn, inner, leave, setjmp, longjmp and a few NYF. (As a special case, a retn with no result that is the only op in its instruction can be encoded as a flow skinny, but that’s just an encoding optimization; it still occupies a controlFU slot.) However, a Mill can be configured with more flow slots than it has controlFUs. That’s true for Silver, which has four flow slots but only three carry controlFUs. The relevant part of the Silver spec is:

                            c->flowSlots = newRow<slot>(4);
                            c->flowSlots[0] << cacheFU << conFU << conformFU <<
                                controlFU << lsFU << lsbFU;
                            c->flowSlots[1] << conFU << conformFU << controlFU <<
                                lsFU << lsbFU << miscFU;
                            c->flowSlots[2] << conFU << conformFU << controlFU <<
                                lsFU << lsbFU << miscFU;
                            c->flowSlots[3] << conFU << lsFU << lsbFU << miscFU;

        As you see, slot 4 lacks a controlFU. BTW, the provisioning of all the Mill test configurations (in the Metal series like Silver and elsewhere) can and will change as we tune for workload/market; don’t take this spec as stone-graven.

        As for writing conAsm manually: selecting the ops to match the source is quite easy because Mill ISA is high level and closely matches the HLL input; for any given source there’s usually only one way to do it on the core. Op placement (putting the selected ops into instructions with maximal density) is a little more difficult but not that bad. But then you have to determine the numeric value of belt references, which means you must in effect symbolically execute the program and keep track of where everything is on the belt, and that’s a real pain; I admit I did not check that part of your code. The pain is acute when there are in-flight ops started in one ebb that retire somewhere else after possibly several control transfers. Then lastly you must verify that a value you need hasn’t fallen off the belt, and insert spill/fill or rescue ops as necessary – and then go recalculate the belt numbers again. Software is much better at that sort of thing than are humans.

        Remember us when you want something more fun than your present employer ๐Ÿ™‚ Mill Computing is a distributed virtual company; we have people all over the world.

  • goldbug
    Post count: 6

    Is there a reference for conAsm? I can’t figure out what half the operations do.

    I am very impressed that the compiler & specializer managed to condense my test case to just 4 instructions! WOW.

  • Ivan Godard
    Post count: 418

    You will find some info about the ISA on the Mill wiki: http://millcomputing.com/wiki/Main_Page. Be aware that some of the info there is outdated, especially the details about the provisioning of the various configurations. Unfortunately, the person who had been working on the Wiki had personal issues that prevented him finishing, and no new person joining has picked it back up. If you or any reader wants to really get into the architecture and is able to explain what you find to others then go to millcomputing.com -> About -> Join us; we have a home for you ๐Ÿ™‚ Much of the ISA material is mechanically generated (Python) from the specs that drive all our software, but there’s English text needed too.

    As for the conAsm for your test case, the specializer output is decent, but what really impressed me was what Veedrac did with it, by hand and with as little conAsm documentation as there is. Kudos.

    As for the ops in the various code samples:
    con – arbitrary literal constant
    rd – popCon, specially recognized literal constant
    eql – integer equal compare
    lsss – signed integer less compare
    gtrs – signed integer greater compare
    retn{tr,fl} – conditional return
    retn – unconditional return
    br{tr,fl} – conditional branch
    br – unconditional branch
    pick – select one of two, i.e. a ? b : c
    calltr1 – conditional call with one belt result

  • goldbug
    Post count: 6

    I finally managed to digest that code, impressive indeed!

    I can also see that it would be easy to make the entire function just 1 instruction in gold.

    Veedrac also has the right idea about how to format conAsm. He puts the rd and con operations first which more closely resembles phasing, he spaces instructions so that it is easy to visually scan the code and he shows in comments the slot where the result go. I am not sure what the arrows at the end mean though.

    • Ivan Godard
      Post count: 418

      Gold has four controlFUs and four pick slots, so if the trailing two retns were pickified like Veedrac did, and the added control slot were used for the call, I think it fits. I’ll look into getting the specializer to do pickification.

    • Ivan Godard
      Post count: 418

      FWIW, here’s what –O3 on gold gives you. It still needs work.

      F("bar") %0;
          lsss(b3 %0, 2) %8,
          eql(b3 %0, 100) %14,
          eql(b3 %0, 0) %7,
          retntr(b1 %7, b4 %1),                                       // L22C1F1
          retntr(b3 %8, b0 %17),                                      // L22C1F1
          pick(b1 %14, b4 %2, b5 %5) %17,
          rd(w(3ul)) %5,                                              // L21C1F1
          rd(w(4ul)) %2,                                              // L19C1F1
          rd(w(1ul)) %1;
              // L4C1F1=/home/ivan/mill/specializer/src/test3.c
          eql(b7 %0, 7) %12,
          eql(b7 %0, 6) %11,
          eql(b7 %0, 5) %10,
          eql(b7 %0, 4) %9,
          retntr(b0 %9, b4 %17),                                      // L22C1F1
          retntr(b1 %10, b11 %0),                                     // L22C1F1
          retntr(b2 %11, b11 %0),                                     // L22C1F1
          retntr(b3 %12, b11 %0);                                     // L22C1F1
          lsss(b13 %0, 8) %13,
          calltr1(b0 %13, "foo", b14 %0) %4,                          // L15C1F1
          retntr(b1 %13, b0 %4),                                      // L22C1F1
          retn(b8 %17),                                               // L22C1F1
          rd(w(4ul)) %16,                                             // L19C1F1
          rd(w(4ul)) %15;                                             // L19C1F1

      Gold has just enough ALUs (8 of 8), and enough popCon slots (5 of 8, although two are redundant) and pick slots (1 of 4) to do this in one instruction, but this code needs 9 of 4 control slots so three instructions is the best it can do without pickifying some of the retns. Even using all the pick slots we only drop the flow count to 6, or two instruction. To get to one, we have to recognize that the return value is the selector value (which I’m sure we’re not going to get LLVM to do).

      Still, an ILP of 7.6 ain’t too shabby ๐Ÿ™‚

    • Veedrac
      Post count: 11


      The letter-number pairs are the slots, and the arrows are the phase (reader โ†ฅ, compute โ–ธ, call โ†ฑ, pick ?, writer โ†ง, conform =, transfer โ†ฆ). The symbol choice is from http://millcomputing.com/instructions.html, with the phase column collapsed.

      Though if I had my say it’d be reader โค’, compute โฉ, call โ†บ, writer โ†ง, conform โ™บ, transfer โคท.

      • This reply was modified 1 month, 1 week ago by  Veedrac.
      • This reply was modified 1 month, 1 week ago by  Veedrac.
  • Veedrac
    Post count: 11

    I also tried my hand at this guy for Gold, and got

    // ILP of 5โ…“, mostly because of the silly load latency :P.
    F("parseval") %str;
        load(%str, 0, b) %l0,
        load(%str, 1, b) %l1, // I'd fill this with loads if the two loops had the
        load(%str, 2, b) %l2; // same cycle length, but alas they do not.
        eqlb(%l0, 45) %neg,
        addp(%str, 1, b) %strplus1,
        pick(%neg, %strplus1, %str) %start,
        pick(%neg, %l1, %l0) %L0,
        pick(%neg, %l2, %l1) %L1;
        // Now we can finally start doing fun stuff.
        rd(b(0)) %zero,
        load(%start, 2, b) %char,     // Not loading early means we take a stall in hexmode,
        eqlb(%L0,   48) %L00,         // but loading early means decimalmode gets early loads. :(
        eqlb(%L1,  120) %L1x,
        addp(%start, 2) %startplus2,
        pick(%L00, %L1x, %L00) %hexmode,
        innertr(%hexmode, "parseval$hexmode", %zero, %startplus2, %neg),
        inner(%decimalmode, "parseval$decimalmode", %zero, %startplus2, %L1, %L0, %neg);
    // ILP of 9! Cycle length equals loop-caried dependency.
    L("parseval$hexmode") %total %str %neg;
        // a wild %char appears, right on time!
        load  (%str, b, 2) %char,     // Ideally this load would be fetching further ahead,
        gequ  (%char,  48) %digitlo,  // but we want one byte every 2 cycles, whereas the
        lequ  (%char,  57) %digithi,  // other branch want one every 3 cycles, so you can't
        gequ  (%char,  97) %letterlo, // hoist the setup into the header.
        lequ  (%char, 102) %letterhi,
        sub   (%char,  48) %digitval,
        sub   (%char,  87) %letterval,
        shiftl(%total,  4) %shifttot,
        addp  (%str, 1, b) %newstr,
        pick  (%digitlo,  %digithi,  %digitlo)  %digit,
        pick  (%letterlo, %letterhi, %letterlo) %letter;
        orl   (%digit,    %letter)    %inrange,
        add   (%shifttot, %digitval)  %digittot,
        add   (%shifttot, %letterval) %lettertot,
        pick  (%digit, %digittot, %lettertot) %newtotal,
        conform(%newtotal, %newstr, %neg, %total, %char),
        leavefl(%inrange, "parseval$finish", %total, %char, %neg),
    // ILP of only 4โ…“. Cycle length equals loop-caried dependency.
    // Not much one can do when the multiply is slow... :(
    L("parseval$decimalmode") %total %str %newchar %char %neg;
        // A wild %newnewchar appears... six cycles early.
        // That's pretty lame, but fixing it take a branch.
        load  (%str, b, 3) %newnewchar,
        gequ  (%char,  48) %digitlo,
        lequ  (%char,  57) %digithi,
        addp  (%str, 1, b) %newstr,
        sub   (%char,  48) %digitval,
        shiftl(%total,  1) %totalx2,
        shiftl(%total,  3) %totalx8,
        pick  (%digitlo, %digithi,  %digitlo) %digit,
        leavefl(%digit, "parseval$finish", %total, %char, %neg);
        // This is a very sad instruction.
        add   (%totalx2,  %totalx8)  %totalx10;
        add   (%totalx10, %digitval) %newtotal,
        rescue(%newtotal, %newstr, %newnewchar, %newchar, %neg),
    // ILP of 4ยฝ. Meh, it works.
    L("parseval$finish") %total %char %neg;
        // a wild %fakechar appears
        eqlb(%char, 75) %K,
        eqlb(%char, 77) %M,
        neg(%total) %negtotal,
        pick(%neg, %negtotal, %total) %totalsigned;
        shiftl(%totalsigned, 10) %totalK,
        shiftl(%totalsigned, 20) %totalM,
        pick(%K, %totalK, %totalsigned) %totalelse,
        retntr(%M, %totalM),

    Interesting architecture indeed. I’ve not handled the busywork belt numbers, but I’ve checked they’re in a consistent position. I expect one’s better off having a branch before entering either loop where you flood it with correctly-delayed loads ahead of time; that way you don’t risk stalls in the loop itself, but it adds a cycle (and a jump) to every call to parseval. Delayed loads don’t work because the loop count is different, and tagged loads don’t work because they don’t pipeline.

    • This reply was modified 1 month, 1 week ago by  Veedrac.
    • Ivan Godard
      Post count: 418

      I’m just blown away!

      The great bulk of what you have done to improve the code is software pipelining, which the Mill is designed for. The pipelining done in our tool chain is still experimental (i.e. breaks on everything), so I can’t give you a comparison yet, but in-loop ILPs of 9 are actually low on high-end Mills because you can pipeline up overlapping iterations until you run out of FUs. Then there’s vectors…

      Some minor notes on your code:
      1) the “retire” operation (a.k.a. “Dave’s Device” here in-house) will let you get rid of all the loop setup (prologue) code, so there is no gain from sharing the prologue across the two loops.
      2) “leave” discards all inflight operations; you don’t have to worry about multiplies or loads that you start in the loop dropping after you exit
      3) the “conform” op is no more; it was obviated by letting branches supply a carry list. Similarly, the “rescue” in your code can be folded into the following branch op.
      4) Long dependency chains in open (non-loop) code produce “sad” instructions even on a Mill ๐Ÿ™‚
      5) Your parseval$decimalMode EBB should end with a branch back to the same EBB; sending it to parseval$hexMode is a bug I think.
      6) It is possible to write the hex loop to eat a new character every cycle; the code has to split the OR reduction in two, shifting by two nibbles out of phase with each other so the shifted nibbles interleave. The interleave can be running in the loop, or a final OR after it if the loop exports both partials; our pipeliner tries to move all reductions to after the loop, but has problems with that. Have fun ๐Ÿ™‚
      7) If you get hexMode to one byte per cycle, try the same on the decimalMode loop. Hint: unroll by three, then schedule on a torus so that corresponding operations on different iterations schedule at the same point. It’s fairly easy when there are no loop-carried dependencies, but it gets harder when, as here, the loop is a reduction.
      8) and if you got all that, remember that Silver has two mulFUs and Gold has four, so you can in fact eat 2 or 4 bytes per cycle. You were regretting “only” 4.3 ILP; try ILP 19. Mill eats loops for breakfast, lunch and dinner (I just love to brag about the baby).

  • Veedrac
    Post count: 11

    1. Sure, you can use retire to get you that fetch-ahead load if you don’t mind stalling while the loop warms up. Alternatively you can just fix the instruction set so I can do it the right way ;).

    2. Neat, I knew there must be something like that. It was too obvious to overlook.

    6, 7, 8. I explicitly chose not to unroll or vectorize to make the comparison more meaningful. I’m sure these things work, but I tried to stick to things an OoO core would be trying to do in hardware. No maths tricks beyond what I could believe stock GCC would give.

    • This reply was modified 1 month ago by  Veedrac.
    • Ivan Godard
      Post count: 418

      Re #1: the expectation of the ISA is that a loop will start one (or more) iterations every cycle, and will saturate the load slots; for a variety of hardware reasons load slots can’t just be cookie-cuttered out the way that ALUs can, and we expect load to be a scaling constraint. If it is, then a manual prologue is no better than a retireOp based prologue. Of source, loads will not always be the constraint: for those cases, what would you like to see in the ISA that would let you “do it right”?

  • Veedrac
    Post count: 11

    I’d love to explain the thing you’re missing and how to fix it, but alas I expect there exist lawyers that would be upset if I did. You’re going to have to take what you’ve got.

  • Ivan Godard
    Post count: 418

    Lawyers do get in the way of things, don’t they? We have our own things we can’t talk about yet too. ๐Ÿ™

  • PeterH
    Post count: 33

    At the risk of diverging the topic, I’m wondering at the impact of different optimization settings on code size. With the architectures I’m familiar with, -O3 and above have an unfortunate tendency to produce larger code in the interest of speed. But techniques like unwinding loops don’t appear to be desirable on the Mill.

    • Ivan Godard
      Post count: 418

      There’s a –Os setting to specialize for space. At present it shuts off cloning speculable operations up into inbound control flow arcs, except in the special case where the EBB has only one op (commonly a retn), so there is no size bloat because the cloned op replaces the otherwise necessary branch.

      We don’t unroll loops at present, and actually have had quite a bit of argument with the LLVM code that really really wants to unroll. There is one case where we know we want to unroll, in loop pipelines in which there is enough hardware functionality to do more than one iteration concurrently and there’s no loop-carried dependencies. However, the specializer isn’t smart enough yet to do that.

You must be logged in to reply to this topic.