Forum Replies Created

Viewing 8 posts - 16 through 23 (of 23 total)
  • Author
    Posts
  • Veedrac
    Participant
    Post count: 25
    in reply to: switches #2846

    I also tried my hand at this guy for Gold, and got

    
    // ILP of 5⅓, mostly because of the silly load latency :P.
    F("parseval") %str;
        load(%str, 0, b) %l0,
        load(%str, 1, b) %l1, // I'd fill this with loads if the two loops had the
        load(%str, 2, b) %l2; // same cycle length, but alas they do not.
    
        eqlb(%l0, 45) %neg,
        addp(%str, 1, b) %strplus1,
        pick(%neg, %strplus1, %str) %start,
        pick(%neg, %l1, %l0) %L0,
        pick(%neg, %l2, %l1) %L1;
    
        // Now we can finally start doing fun stuff.
        rd(b(0)) %zero,
        load(%start, 2, b) %char,     // Not loading early means we take a stall in hexmode,
        eqlb(%L0,   48) %L00,         // but loading early means decimalmode gets early loads. :(
        eqlb(%L1,  120) %L1x,
        addp(%start, 2) %startplus2,
        pick(%L00, %L1x, %L00) %hexmode,
        innertr(%hexmode, "parseval$hexmode", %zero, %startplus2, %neg),
        inner(%decimalmode, "parseval$decimalmode", %zero, %startplus2, %L1, %L0, %neg);
    
    // ILP of 9! Cycle length equals loop-caried dependency.
    L("parseval$hexmode") %total %str %neg;
        // a wild %char appears, right on time!
        load  (%str, b, 2) %char,     // Ideally this load would be fetching further ahead,
        gequ  (%char,  48) %digitlo,  // but we want one byte every 2 cycles, whereas the
        lequ  (%char,  57) %digithi,  // other branch want one every 3 cycles, so you can't
        gequ  (%char,  97) %letterlo, // hoist the setup into the header.
        lequ  (%char, 102) %letterhi,
        sub   (%char,  48) %digitval,
        sub   (%char,  87) %letterval,
        shiftl(%total,  4) %shifttot,
        addp  (%str, 1, b) %newstr,
        pick  (%digitlo,  %digithi,  %digitlo)  %digit,
        pick  (%letterlo, %letterhi, %letterlo) %letter;
    
        orl   (%digit,    %letter)    %inrange,
        add   (%shifttot, %digitval)  %digittot,
        add   (%shifttot, %letterval) %lettertot,
        pick  (%digit, %digittot, %lettertot) %newtotal,
        conform(%newtotal, %newstr, %neg, %total, %char),
        leavefl(%inrange, "parseval$finish", %total, %char, %neg),
        br("parseval$hexmode");
    
    // ILP of only 4⅓. Cycle length equals loop-caried dependency.
    // Not much one can do when the multiply is slow... :(
    L("parseval$decimalmode") %total %str %newchar %char %neg;
        // A wild %newnewchar appears... six cycles early.
        // That's pretty lame, but fixing it take a branch.
        load  (%str, b, 3) %newnewchar,
        gequ  (%char,  48) %digitlo,
        lequ  (%char,  57) %digithi,
        addp  (%str, 1, b) %newstr,
        sub   (%char,  48) %digitval,
        shiftl(%total,  1) %totalx2,
        shiftl(%total,  3) %totalx8,
        pick  (%digitlo, %digithi,  %digitlo) %digit,
        leavefl(%digit, "parseval$finish", %total, %char, %neg);
    
        // This is a very sad instruction.
        add   (%totalx2,  %totalx8)  %totalx10;
    
        add   (%totalx10, %digitval) %newtotal,
        rescue(%newtotal, %newstr, %newnewchar, %newchar, %neg),
        br("parseval$hexmode");
    
    // ILP of 4½. Meh, it works.
    L("parseval$finish") %total %char %neg;
        // a wild %fakechar appears
        eqlb(%char, 75) %K,
        eqlb(%char, 77) %M,
        neg(%total) %negtotal,
        pick(%neg, %negtotal, %total) %totalsigned;
    
        shiftl(%totalsigned, 10) %totalK,
        shiftl(%totalsigned, 20) %totalM,
        pick(%K, %totalK, %totalsigned) %totalelse,
        retntr(%M, %totalM),
        retn(%totalelse);
    

    Interesting architecture indeed. I’ve not handled the busywork belt numbers, but I’ve checked they’re in a consistent position. I expect one’s better off having a branch before entering either loop where you flood it with correctly-delayed loads ahead of time; that way you don’t risk stalls in the loop itself, but it adds a cycle (and a jump) to every call to parseval. Delayed loads don’t work because the loop count is different, and tagged loads don’t work because they don’t pipeline.

    • This reply was modified 7 years, 7 months ago by  Veedrac.
  • Veedrac
    Participant
    Post count: 25
    in reply to: switches #2836

    I had a play around with optimizing this with what I, probably incorrectly, think I understand about the hardware. Is the following a valid program for Silver?

    
    F('bar') %0;
        con    (w(100ul))              %1,  //          F3 ↥
        rd     (w(1ul))                %2,  // R0          ↥
        rd     (w(3ul))                %3,  // R0          ↥
        rd     (w(4ul))                %4,  // R1          ↥
        eql    (b4 %0,  b3 %1)         %5,  //    E0       ▸
        gtrs   (b4 %0,      7)         %6,  //    E1       ▸
        gtrs   (b4 %0,      3)         %7,  //    E2       ▸
        eql    (b4 %0,      0)         %8,  //    E3       ▸
        pick   (b0 %8,  b6 %2, b6 %4)  %9,  //       P0    ?
        retntr (b4 %5,  b6 %3),             //          F0 ↦
        retntr (b3 %6,  b5 %4),             //          F1 ↦
        retntr (b2 %7,  b9 %0);             //          F2 ↦
    
        gtrs   (b9 %0,      1)         %10, //    E0       ▸
        calltr1(b0 %10, 'foo', b10 %0) %11, //          F0 ↱
        retnfl (b1 %10, b2 %9),             //          F1 ↦
        retn   (b1 %11);                    //          F2 ↦
    
  • Veedrac
    Participant
    Post count: 25
    in reply to: switches #2848

    @goldbug

    The letter-number pairs are the slots, and the arrows are the phase (reader ↥, compute ▸, call ↱, pick ?, writer ↧, conform =, transfer ↦). The symbol choice is from http://millcomputing.com/instructions.html, with the phase column collapsed.

    Though if I had my say it’d be reader ⤒, compute ⏩, call ↺, writer ↧, conform ♺, transfer ⤷.

    • This reply was modified 7 years, 7 months ago by  Veedrac.
    • This reply was modified 7 years, 7 months ago by  Veedrac.
  • Veedrac
    Participant
    Post count: 25
    in reply to: switches #2838

    Nah, man, conAsm is awesome (says they guy that’s written ~20 lines total). I know exactly where everything is and where it will be, and I can see exactly what resources are available to make it happen. Much nicer than doing the same thing to an OoO core.

    The wiki says returns with values can only go in slots F0, F1 and F2, and I’ve used those all in the second instruction, so I don’t see how I could avoid using pick. Actually, I’m tempted to say it’s the other way around: I have a pick slot I didn’t use, so I’m wasting resources!

    I like how this lets you treat the problem as a resource optimization problem, with the ability to move parts from an overburdened phase to another without playing guessing games with the hardware. Pick was an example, but I can think of a lot, like speculating more when there’s spare capacity. Writing an optimizer for this sounds like a ton of fun, but I’m pretty sure my employer wouldn’t want me helping the competition ;).

  • Veedrac
    Participant
    Post count: 25

    All (but one) of the previous years have videos linked, so I imagine we just need to be patient. The Wayback Machine is hinting that it might take a while.

  • Veedrac
    Participant
    Post count: 25
    in reply to: Execution #1751

    It seems you should use <pre> tags for code on the Wiki. I’ve edited it in.

  • Veedrac
    Participant
    Post count: 25

    By far the most of the operations in the Mill instruction set can be speculated. What this means is, if an operand to the operation is None or Nar, all the operation does is to make the result None or NaR, or combine the status flags in the case of floating point operations in the result. This is even true for the load operation.

    From http://millcomputing.com/wiki/Speculation#None_and_NaR

  • Veedrac
    Participant
    Post count: 25

    I imagine you would do something like

    add b0, b1
    pick cond, b0, None
    add b0, 1

    “None” effectively absorbs all computation without error. So you would have:

    [12, 11, cond=1, ...]
    [23, 12, 11, cond=1, ...] (add b0, b1)
    [23, 23, 12, 11, cond=1, ...] (pick cond, b0, None)
    [24, 23, 23, 12, 11, cond=1, ...] (add b0, 1)

    or

    [12, 11, cond=0, ...]
    [23, 12, 11, cond=0, ...] (add b0, b1)
    [None, 23, 12, 11, cond=0, ...] (pick cond, b0, None)
    [None, None, 23, 12, 11, cond=0, ...] (add b0, 1)

    It’s not obvious how to turn None into its value, though. You could do

    add b0, b1
    # store
    pick cond, b0, None
    add b0, 1
    # store (may do nothing)
    # load

    but that seems like a lot of overhead to do very little.

    Obviously, though, in this case the best option is just

    add b0, b1
    add b0, cond

    🙂

    • This reply was modified 10 years, 1 month ago by  Veedrac.
    • This reply was modified 10 years, 1 month ago by  Veedrac.
    • This reply was modified 10 years, 1 month ago by  Veedrac.
Viewing 8 posts - 16 through 23 (of 23 total)