Page 19 – Mill Computing, Inc

Forum Replies Created

Viewing 15 posts - 271 through 285 (of 674 total)

Previous 1 … 18 19 20 … 45 Next

Author
Posts
Ivan Godard
Keymaster
May 23, 2017 at 6:08 pm
Post count: 689
in reply to: switches #2856
Lawyers do get in the way of things, don’t they? We have our own things we can’t talk about yet too. 🙁
Ivan Godard
Keymaster
May 19, 2017 at 3:42 pm
Post count: 689
in reply to: switches #2842
You will find some info about the ISA on the Mill wiki: http://millcomputing.com/wiki/Main_Page. Be aware that some of the info there is outdated, especially the details about the provisioning of the various configurations. Unfortunately, the person who had been working on the Wiki had personal issues that prevented him finishing, and no new person joining has picked it back up. If you or any reader wants to really get into the architecture and is able to explain what you find to others then go to millcomputing.com -> About -> Join us; we have a home for you 🙂 Much of the ISA material is mechanically generated (Python) from the specs that drive all our software, but there’s English text needed too.
As for the conAsm for your test case, the specializer output is decent, but what really impressed me was what Veedrac did with it, by hand and with as little conAsm documentation as there is. Kudos.
As for the ops in the various code samples:
con – arbitrary literal constant
rd – popCon, specially recognized literal constant
eql – integer equal compare
lsss – signed integer less compare
gtrs – signed integer greater compare
retn{tr,fl} – conditional return
retn – unconditional return
br{tr,fl} – conditional branch
br – unconditional branch
pick – select one of two, i.e. a ? b : c
calltr1 – conditional call with one belt result
Ivan Godard
Keymaster
May 16, 2017 at 9:37 pm
Post count: 689
in reply to: switches #2837
I’m very impressed that you could write this; reading conAsm gives me hives, and writing it manually is much worse 🙂 Your code looks reasonable to casual inspection, which was all I could give it. Silver has four ALUs and three controlFUs, so you haven’t exceeded the available slot population. Using the pick to avoid another retntr is clever but doesn’t actually buy much, because you have a free control slot in the second instruction and pick encoding is not that much smaller than retn encoding.
You did catch the duplicate rd((w(4)) in the original, and that the second of the retntr/retnfl pair could be a simple retn, saving one argument morsel in the encoding.
Another thing you caught we have LLVM to thank for, not our specializer: several of the result values are the same as the corresponding selector value (cases 5/6/7), and could be collected by CSE into a group picked up by a lss() rather than eql()s, saving several rd/eql/retn sequences in the specializer output and letting you go from four to two instructions. Alas, LLVM doesn’t catch that and the author didn’t write it in the source; CSE/value numbering is not something to do in a mere code generator like the specializer. Still, it’s nice to know that there is optimization headroom still, even with the decent code we are now producing: you have an average ILP of 8 and 16 ops, while the original only got 6 ILP and 24 ops.
My thanks to @goldbug for an interesting test case. It not only shows how switches get encoded, but also demonstrates how phasing and width boosts the ILP.
And @veedrac – please go to millcomputing.com->about->join us and consider what you find there 🙂
Ivan Godard
Keymaster
May 15, 2017 at 9:51 pm
Post count: 689
in reply to: LLVM and pointers #2835
Sorry for the delay in response; several posts here got blackholed by my mailer somehow 🙁
To answer your question: It depends on whether the program is using bounded pointers or not.
Bounded pointers constrain address arithmetic to the rules of C and C++: you can only use pointer arithmetic to move within an allocation or to one past the end of an array. This catches wild addresses that wind up pointing to areas for which the program has valid permissions; segv checking via page table or the Mill’s PLB can’t catch such wild addresses. The effect of bounded pointers is similar to what tools like Purify or valgrind do, but at hardware speeds and always on. Bounded pointers are the default, but you can override all or parts of legacy programs to let them do nasties to what they think is a flat address.
Bounds checking can only be done when the hardware knows that the value is a pointer, not an integer. So if you turn your pointer into an integer, diddle it, turn it back to a pointer, and dereference it, then the hardware will interpret its diddleship as an address. What happens then depends on the bit value, but is unlikely to be very useful. However, if you first convert the bounded pointer to an unbounded one, before diddling it, and do not overflow, then the resulting address will act like the equivalent address on a conventional legacy machine that doesn’t have the Mill security/reliability features. Of course it is an unbound pointer at that point; if you want it be again bounded for further use then you have to tell it what allocation you want to bind it to. There are ops for all of that.
However, the largest problem with LLVM’s conflation of pointer and integer is not pointer arithmetic, it is pointer comparisons. It is quite possible for two pointers to refer to the same object but not be bitwise identical. Bounded pointers provide an example: one pointer may be bound to a larger allocation, while another is bound to a contained sub-object. Both may refer to the same byte address, but the bounds information is different. Integer equal doesn’t work for that.
Ivan Godard
Keymaster
March 28, 2017 at 11:51 am
Post count: 689
in reply to: switches #2820
p.s. For greater detail, here’s the comment from the specializer function that does this.
/*
Each of the ranges can be isolated by a “less than”
comparison, so the maximal number of branches is the size
of the ranges vector. A less-than compare requires one
branch per range, while an equality compare requires one
branch per non-default value. Consequently equality takes
one branch for two ranges when a non-default range of extent
one is adjacent to a default range wheres less-than would
take two branches; a non-default range of extent greater
than one takes one branch using less-than and a number of
branches equal to the extent using equality; and a
non-default branch of extent one adjacent to a non-default
of any size takes one branch in both less-than and
equality.
The worst case is a run of extent-one ranges. If there are
N ranges, then sequential equality needs N branches and
will execute N/2 of them. With less-than you need one
branch to split N to 2*(N/2), 2 to split 2*(N/2) into
4*(N/4) and so on. That is 2N-1 total to the bottom level,
and log2(N) levels. So less-than will always need more
branches than equality, because the bottom level of a
less-than is the same as an equality and there are all the
upper levels on top of that.
As for latency, the dividing point is when is log2(N)
smaller than N/2; the break-even is 8. So we less-than
down until the number of cases is eight or less, and do
sequential equality after that.
The job of this function is to partition
the ranges by adding tests on pred that branch to interior nodes
of the tree until the test is trivial and we can go to the leaf
ebbs given by the switch statement.
“home” is the original ebb that ended with the switch dispatch.
“node” is the root ebb that ends with the switch, or recursively
an inner node in the tree of ebbs that results from splitting
the case space with lss operations.
“pred” is the switch expression that is being tested; it will
be one argument to all the comparisons in the tree.
“ranges” is all the value ranges that must be resolved leafward
of home in the splitting tree; each has a leaf ebb and a
contiguous range of cases that go to that ebb.
*/
void switchPartition(ebbId home, ebbId node, dropId pred,
row ranges) {
Ivan Godard
Keymaster
March 28, 2017 at 11:51 am
Post count: 689
in reply to: switches #2819
The compiler currently turns switches into a binary search branch tree. The case space is partitioned by lss-than compares down to the level at which equal-compares are better, and then equal-compares thereafter. The logic deals well with runs of consecutive values in a case, such as all alphabetic characters; less-than compares are cheap for that. We do not currently generate a jump table. Although early versions of the ISA had jump tables in the code adjacent the the instruction that used them, we took the operation out; code side jump tables seriously muck up the code data paths.
This strategy is tuned for the wide issue and multi-branch capability in larger Mill family members. For a member that is Silver or wider all the equal compares and their branches fit in one instruction, and a big hunk of the less-than tree will be in prior instructions of the same ebb. Consequently execution can get well down the tree before one of the branches is actually taken and search enters a sub-tree. As the whole ebb has only one prediction (where it will exit to) there are many fewer misses than would occur on a legacy architecture that does branch prediction in a search tree. A bit of arithmetic will show how many taken branches (on average) are necessary to get to the target case (assuming cases are equiprobable and sparse, which is the worst case) for any member configuration of ALU exu slots (for the compares) and branch flow lots (for the branches).
This works quite well for switches up to the size of say a lexer, but will obviously fail for something like a YACC grammar with cases in the thousands. For those we need to add a cost function and data-side dispatch tables (Mill branches already support label variables) to the specializer; we just haven’t gotten around to that yet.
- Ivan Godard
  Keymaster
  March 28, 2017 at 11:52 am
  Post count: 689
  in reply to: switches #2821
  @EricP:
  In what way do code side jump tables muck up the code data paths?
  For example, if there was an indexed branch instruction immediately
  followed in the code by a table of displacements, how would that
  be worse than any other kind of jump?
  Eric
  - Ivan Godard
    Keymaster
    March 28, 2017 at 11:53 am
    Post count: 689
    in reply to: switches #2822
    That was pretty much what our initial design did, except that it was a real switch op, not just an indexed branch, and bounds-checked the argument into the default case.
    The mucking arose because the Mill code-side is both heavily streamed and heavily pipelined. It’s a fire-hose and the code side has to do all it can to keep everything out of the way. The whle code side is done on a whole-cache-line basis. Recall that each instruction is variable length and may be up to a whole line long and may cross a line boundary. The instruction header says how long it is, so the front end of decode is an instruction shifter that is two lines wide and cracks the line stream into an instruction stream.
    The cracker line fetch sequence is controlled by the predictor, so a predicted taken transfer causes the lines containing the target to replace the lines containing the transfer in the cracker. After the cracker the line relationship is lost. A miss discards the inflight instruction stream and starts the cracker with a new line-pair just as if it had acted on a predicted transfer.
    In those streams there’s no place for an arbitrarily large data structure like a jump table; it won’t be fetched as part of the line stream, and whatever part of it was in the remainder of the line containing the switch op would have been discarded as the cracker was refilled from the predicted switch target. Moreover, we don’t even know that we have a switch until two cycles after the line-cracker, and the cracker is quite likely to have followed one or two predictions before we know – the table is probably no longer in the decode at all any more, and we have to be prepared to get it, or at least the desired address, from cache.
    Consequently, other than the potential value of maybe part of the table having been brought to the icache along with the instruction containing the switch, there’s no value in having the table adjacent to the switch op except that it gives a cheap way to find the address of the table.
    Now that we know the address of the table (having decoded the switch op enough to discover we need an address), we have to do an indexed load and pull up eight bytes that we can compare against the prediction we followed a couple of cycles ago to see if we went where we should have. That’s a round-trip to the cache. However, the icache is set up to do whole lines – and nothing but whole lines. So to get the case target address we have to pull up a line (needs an adder somewhere to add the table address and the case index), and isolate eight bytes out of that line, which means there’s a shifter somewhere too. Can’t use the ALUs or L/D address adders for the add; they are busy working down the predicted path. Can’t use the ALU shifters for the isolate; they aren’t line wide and are also busy. So we need an adder in the branch unit and an isolate shifter in the icache, equivalent to what is on the front of a dcache, and both adder and isolater are used for nothing but the rare case of a switch. And you have several more data paths to mux, and another command sugnal or packet for the icache to deal with
    The latency is bad too. We don’t know we have a switch op until D2, plus a cycle for the add, plus maybe two-three cycles to get the line from the i$1 (the table won’t be in the i$0 microcache – I don’t even want to think about that), plus a cycle for the isolater, plus a cycle back and a cycle for the success comparison, just to confirm that you did indeed predict correctly. Then what happens if the switch target (as predicted) immediately does a switch too? Is switch handling fully pipelined? Probably not – you will stall, but then you will be stalling anyway while you figure out the address.
    Muck.
    Nowadays the jump table (when we get around to it) is on the data side, which already has the address-adder for an indexed load and the isolaters in the data path from the D$1, so no logic need be added to support switches. The ISA has dynamic branches that take the target address from the belt, for which the prediction confirmation takes zero latency, just like any other branch. The indexed load is separate from the branch and so can be hosted into the future as far as the switch argument value is known. With any luck the pointer will already be on the belt by the time we need it in the dynamic branch, and if there’s a data dependence that prevents hoisting the load then we are still no worse off than a D$1 latency.
    No added logic, no added data paths, no added muxes, no high fixed latency, no stalls (if hoistable).
    No muck.
Ivan Godard
Keymaster
May 23, 2017 at 4:02 pm
Post count: 689
in reply to: switches #2854
Re #1: the expectation of the ISA is that a loop will start one (or more) iterations every cycle, and will saturate the load slots; for a variety of hardware reasons load slots can’t just be cookie-cuttered out the way that ALUs can, and we expect load to be a scaling constraint. If it is, then a manual prologue is no better than a retireOp based prologue. Of source, loads will not always be the constraint: for those cases, what would you like to see in the ISA that would let you “do it right”?
Ivan Godard
Keymaster
May 23, 2017 at 10:29 am
Post count: 689
in reply to: switches #2851
I’m just blown away!
The great bulk of what you have done to improve the code is software pipelining, which the Mill is designed for. The pipelining done in our tool chain is still experimental (i.e. breaks on everything), so I can’t give you a comparison yet, but in-loop ILPs of 9 are actually low on high-end Mills because you can pipeline up overlapping iterations until you run out of FUs. Then there’s vectors…
Some minor notes on your code:
1) the “retire” operation (a.k.a. “Dave’s Device” here in-house) will let you get rid of all the loop setup (prologue) code, so there is no gain from sharing the prologue across the two loops.
2) “leave” discards all inflight operations; you don’t have to worry about multiplies or loads that you start in the loop dropping after you exit
3) the “conform” op is no more; it was obviated by letting branches supply a carry list. Similarly, the “rescue” in your code can be folded into the following branch op.
4) Long dependency chains in open (non-loop) code produce “sad” instructions even on a Mill 🙂
5) Your parseval$decimalMode EBB should end with a branch back to the same EBB; sending it to parseval$hexMode is a bug I think.
6) It is possible to write the hex loop to eat a new character every cycle; the code has to split the OR reduction in two, shifting by two nibbles out of phase with each other so the shifted nibbles interleave. The interleave can be running in the loop, or a final OR after it if the loop exports both partials; our pipeliner tries to move all reductions to after the loop, but has problems with that. Have fun 🙂
7) If you get hexMode to one byte per cycle, try the same on the decimalMode loop. Hint: unroll by three, then schedule on a torus so that corresponding operations on different iterations schedule at the same point. It’s fairly easy when there are no loop-carried dependencies, but it gets harder when, as here, the loop is a reduction.
8) and if you got all that, remember that Silver has two mulFUs and Gold has four, so you can in fact eat 2 or 4 bytes per cycle. You were regretting “only” 4.3 ILP; try ILP 19. Mill eats loops for breakfast, lunch and dinner (I just love to brag about the baby).

Ivan Godard

Keymaster

May 22, 2017 at 1:37 pm

Post count: 689

in reply to: switches #2845

FWIW, here’s what –O3 on gold gives you. It still needs work.

F("bar") %0;
    lsss(b3 %0, 2) %8,
    eql(b3 %0, 100) %14,
    eql(b3 %0, 0) %7,
    retntr(b1 %7, b4 %1),                                       // L22C1F1
    retntr(b3 %8, b0 %17),                                      // L22C1F1
    pick(b1 %14, b4 %2, b5 %5) %17,
    rd(w(3ul)) %5,                                              // L21C1F1
    rd(w(4ul)) %2,                                              // L19C1F1
    rd(w(1ul)) %1;
        // L4C1F1=/home/ivan/mill/specializer/src/test3.c

    eql(b7 %0, 7) %12,
    eql(b7 %0, 6) %11,
    eql(b7 %0, 5) %10,
    eql(b7 %0, 4) %9,
    retntr(b0 %9, b4 %17),                                      // L22C1F1
    retntr(b1 %10, b11 %0),                                     // L22C1F1
    retntr(b2 %11, b11 %0),                                     // L22C1F1
    retntr(b3 %12, b11 %0);                                     // L22C1F1

    lsss(b13 %0, 8) %13,
    calltr1(b0 %13, "foo", b14 %0) %4,                          // L15C1F1
    retntr(b1 %13, b0 %4),                                      // L22C1F1
    retn(b8 %17),                                               // L22C1F1
    rd(w(4ul)) %16,                                             // L19C1F1
    rd(w(4ul)) %15;                                             // L19C1F1

Gold has just enough ALUs (8 of 8), and enough popCon slots (5 of 8, although two are redundant) and pick slots (1 of 4) to do this in one instruction, but this code needs 9 of 4 control slots so three instructions is the best it can do without pickifying some of the retns. Even using all the pick slots we only drop the flow count to 6, or two instruction. To get to one, we have to recognize that the return value is the selector value (which I’m sure we’re not going to get LLVM to do).

Still, an ILP of 7.6 ain’t too shabby 🙂

Ivan Godard
Keymaster
May 22, 2017 at 9:38 am
Post count: 689
in reply to: switches #2844
Gold has four controlFUs and four pick slots, so if the trailing two retns were pickified like Veedrac did, and the added control slot were used for the call, I think it fits. I’ll look into getting the specializer to do pickification.
Ivan Godard
Keymaster
May 17, 2017 at 7:01 pm
Post count: 689
in reply to: switches #2839
Yes, you’re right, the pick was essential; I overlooked the call op in my (admitted cursory) look through.
It’s not so much returns with values need Fn, as any op that uses the controlFU (of which there are three on Silver), including br, call, retn, inner, leave, setjmp, longjmp and a few NYF. (As a special case, a retn with no result that is the only op in its instruction can be encoded as a flow skinny, but that’s just an encoding optimization; it still occupies a controlFU slot.) However, a Mill can be configured with more flow slots than it has controlFUs. That’s true for Silver, which has four flow slots but only three carry controlFUs. The relevant part of the Silver spec is:
```
                    c->flowSlots = newRow<slot>(4);
                    c->flowSlots[0] << cacheFU << conFU << conformFU <<
                        controlFU << lsFU << lsbFU;
                    c->flowSlots[1] << conFU << conformFU << controlFU <<
                        lsFU << lsbFU << miscFU;
                    c->flowSlots[2] << conFU << conformFU << controlFU <<
                        lsFU << lsbFU << miscFU;
                    c->flowSlots[3] << conFU << lsFU << lsbFU << miscFU;
```
As you see, slot 4 lacks a controlFU. BTW, the provisioning of all the Mill test configurations (in the Metal series like Silver and elsewhere) can and will change as we tune for workload/market; don’t take this spec as stone-graven.
As for writing conAsm manually: selecting the ops to match the source is quite easy because Mill ISA is high level and closely matches the HLL input; for any given source there’s usually only one way to do it on the core. Op placement (putting the selected ops into instructions with maximal density) is a little more difficult but not that bad. But then you have to determine the numeric value of belt references, which means you must in effect symbolically execute the program and keep track of where everything is on the belt, and that’s a real pain; I admit I did not check that part of your code. The pain is acute when there are in-flight ops started in one ebb that retire somewhere else after possibly several control transfers. Then lastly you must verify that a value you need hasn’t fallen off the belt, and insert spill/fill or rescue ops as necessary – and then go recalculate the belt numbers again. Software is much better at that sort of thing than are humans.
Remember us when you want something more fun than your present employer 🙂 Mill Computing is a distributed virtual company; we have people all over the world.

Ivan Godard

Keymaster

May 15, 2017 at 9:24 pm

Post count: 689

in reply to: switches #2833

Sorry for the delay in response; a few posts here got blackholed in my mailer somehow 🙁

I edited your example to provide a declaration for function foo, but it is otherwise unchanged. On Silver with the LLVM default and the specializer at –O1 (multiple ops per instruction but minimal other changes) you get this conAsm:

 F("bar") %0;
    br("bar$0_8"),
    rd(w(1ul)) %1;
        // L3C1F1=/home/ivan/mill/specializer/src/test1.c

L("bar$0_1");
    br("bar$0_2", b0 %2),
    rd(w(4ul)) %2;                                              // L18C1F1

L("bar$0_2") %3/0/1/2/4/5;
    retn(b0 %3/0/1/2/4/5);                                      // L21C1F1

L("bar$0_7");
    br("bar$0_2", b0 %5),
    rd(w(3ul)) %5;                                              // L20C1F1

L("bar$0_8");
    eql(b1 %0, 0) %7;

    brfl(b0 %7, "bar$0_9"),
    br("bar$0_2", b1 %1);

L("bar$0_9");
    lsss(b2 %0, 2) %8;

    brfl(b0 %8, "bar$0_10"),
    br("bar$0_1");

L("bar$0_10");
    eql(b3 %0, 4) %9;

    brfl(b0 %9, "bar$0_11"),
    br("bar$0_1");

L("bar$0_11");
    eql(b4 %0, 5) %10;

    brfl(b0 %10, "bar$0_12"),
    br("bar$0_2", b5 %0);

L("bar$0_12");
    eql(b5 %0, 6) %11;

    brfl(b0 %11, "bar$0_13"),
    br("bar$0_2", b6 %0);

L("bar$0_13");
    eql(b6 %0, 7) %12;

    brfl(b0 %12, "bar$0_14"),
    br("bar$0_2", b7 %0);

L("bar$0_14");
    lsss(b7 %0, 8) %13, nope();

    brfl(b0 %13, "bar$0_15");

    call1("foo", b8 %0) %4,                                     // L14C1F1
    br("bar$0_2", b0 %4);

L("bar$0_15");
    eql(b9 %0, b0 %15) %14,
    con(w(100ul)) %15;

    brtr(b0 %14, "bar$0_7"),
    br("bar$0_1");

This shows how the switch is turned into a cascade of tests for the various cases, nested cascades if there are enough cases. The code is poor because there are so many EBBs (extended basic blocks) and we would expect several mis-predictions as control worked it way to the actual working case. (Mis-predicts are cheap on the Mill, but far from free).

Working from the same genAsm from LLVM but with the specializer at –O3 you get this conAsm:

F("bar") %0;
    eql(b2 %0, 4) %9,
    lsss(b2 %0, 2) %8,
    eql(b2 %0, 0) %7,
    retntr(b0 %7, b3 %1),                                       // L21C1F1
    retntr(b1 %8, b4 %2),                                       // L21C1F1
    retntr(b2 %9, b4 %2),                                       // L21C1F1
    rd(w(4ul)) %2,                                              // L18C1F1
    rd(w(1ul)) %1;
        // L3C1F1=/home/ivan/mill/specializer/src/test1.c

    eql(b5 %0, 7) %12,
    eql(b5 %0, 6) %11,
    eql(b5 %0, 5) %10,
    retntr(b0 %10, b8 %0),                                      // L21C1F1
    retntr(b1 %11, b8 %0),                                      // L21C1F1
    retntr(b2 %12, b8 %0);                                      // L21C1F1

    lsss(b8 %0, 8) %13, nope(),
    calltr1(b0 %13, "foo", b9 %0) %4;                           // L14C1F1

    eql(b14 %0, b0 %15) %14,
    con(w(100ul)) %15,
    retntr(b6 %13, b5 %4),                                      // L21C1F1
    retntr(b0 %14, b2 %5),                                      // L21C1F1
    retnfl(b0 %14, b14 %2),                                     // L21C1F1
    rd(w(4ul)) %17,                                             // L18C1F1
    rd(w(4ul)) %16,                                             // L18C1F1
    rd(w(3ul)) %5;                                              // L20C1F1

All the excess EBBs have gone away, so there will be at most one misprediction in the switch, which is as good as a table-based switch can do, but without the possibility of a data miss on the table. As the whole code is four instructions, executing it will take at most four cycles (plus a miss if any) which is substantially better than the alternative.

Some of what the specializer does in –O3 is new, and if you read the code carefully you can find a couple of places where the result is less than optimal. Have fun finding them 🙂

This reply was modified 7 years, 8 months ago by Ivan Godard.

Ivan Godard
Keymaster
March 17, 2017 at 10:33 am
Post count: 689
in reply to: Control-flow folding #2813
(Please put one question per post, rather than adding more by edits that may cross with responses)
About numbers: we’re working on it, but we have nothing big enough yet to really exercise the predictor and especially the inter-run history mechanism. Something like mySQL would do, or a Windows port maybe, but for now we claim benefits based on first principles and seat-of-the-pants, not on measurement. And yes, we are even more impatient about that than you are 🙂
Author
Posts

Viewing 15 posts - 271 through 285 (of 674 total)

Previous 1 … 18 19 20 … 45 Next