Forum Replies Created

Viewing 15 posts - 1 through 15 (of 23 total)
  • Veedrac
    Participant
    Post count: 25

    The issue with Tachyum is that their sales pitch isn’t that compelling any more, because of the new Arm server chips. At best they might have slightly better throughput on a slightly smaller die at slightly lower power, but the Ampere Altra already ships with more cores (soon 128 in the Altra Max), and a 3.3 GHz OoO core will be more consistent than a 4.0 GHz Tachyum, especially as the Tachyum only looks 4-wide, even if Tachyum claim to win some benchmarks here and there. 20% differences won’t be enough to steal market share from established architectures.

    This is not to say the Tachyum Prodigy isn’t extremely cool. They’ve progressed very quickly and I’m a fan of architectural diversity.

  • Veedrac
    Participant
    Post count: 25

    This is really neat, nice. Some comments on the things I’ve found out about the arch:

    con(v(0xe0, 0xf0, 0xf8)) is a length-3 vector? This is a little confusing; it needs to be 128 bits IIUC.

    andlu(%first, %prefmask) won’t work since the Mill doesn’t splat automatically.

    I don’t think you can return immediates as per retntr(%onebyte, %first, 1).

    andlu’s immediate is morsel-sized, so andlu(%cont, 0xc0) won’t fit, and I don’t think the Mill will splat immediates either.

    smearx(%picked) will return two elements, so you can dump the any(%picked).

    con(v(0, 0, 0, 0)) can be a rd() of the appropriate constant.

    Overall I don’t know if SIMD was the right choice; using pick and interleaving the different paths would probably be faster.
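    For context (my reading, not stated in the post): the constants 0xe0/0xf0/0xf8 are the standard UTF-8 lead-byte masks, and the 0xc0 mask is the continuation-byte check, so the code under discussion looks like a UTF-8 decoder. In scalar form, the classification those masks perform is:

```python
def utf8_seq_len(lead: int) -> int:
    """Length of a UTF-8 sequence given its lead byte; 0 for a
    continuation byte or an invalid lead."""
    if lead & 0x80 == 0x00:
        return 1      # 0xxxxxxx: ASCII
    if lead & 0xe0 == 0xc0:
        return 2      # 110xxxxx
    if lead & 0xf0 == 0xe0:
        return 3      # 1110xxxx
    if lead & 0xf8 == 0xf0:
        return 4      # 11110xxx
    return 0          # 10xxxxxx continuation, or invalid (0xf8..0xff)
```

    The SIMD version effectively runs all of these mask comparisons at once across a vector of bytes, which is what makes the pick-vs-SIMD trade-off interesting here.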

  • Veedrac
    Participant
    Post count: 25

    I don’t know the details, but there are some trivial counters listed at http://millcomputing.com/wiki/Registers.

    There are so many cool things a Mill is uniquely capable of in this space relative to an OoO machine; it would be a shame not to hear that they have something interesting planned.

  • Veedrac
    Participant
    Post count: 25
    in reply to: Loop compilation #3388

    rd can’t copy belt values; the whole point of that phase is that it has no belt inputs. It has four purposes:

    1. Dropping predefined “popular constants” (popCons) like [0, 1, 2, 3], π, e, √2. “The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.” con should work as well, but is significantly larger.

    2. Operating on the in-core scratchpad, allocated with scratchf. spill and fill are for the extended scratchpad, which is stored in memory and allocated with exscratchf, and can be much larger.

    3. Reading registers. These aren’t belt locations, they are things like hardware counters, threadIDs, and supposedly also “thisPointer” for method calls.

    4. Stream handling, a magic secret thing for “cacheless bulk memory access” we’ve never heard the details of, as far as I know.
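    A toy illustration of points 1 and 2 (my sketch — the belt length, method names, and addressing are made up, and it glosses over the in-core vs extended scratchpad distinction the post draws): the belt is a fixed-length FIFO, so a value needed past its belt lifetime has to be saved to the scratchpad and re-dropped later.

```python
from collections import deque

class Belt:
    """Toy belt model: drops push at the front; the oldest value
    falls off the far end once the belt is full."""
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)
        self.scratch = {}            # stand-in for the scratchpad

    def drop(self, value):
        """Any op result lands at the front of the belt."""
        self.slots.appendleft(value)

    def rd(self, popcon):
        """Point 1: drop a predefined popular constant (no belt inputs)."""
        self.drop(popcon)

    def spill(self, pos, addr):
        """Save a belt value to the scratchpad before it falls off."""
        self.scratch[addr] = self.slots[pos]

    def fill(self, addr):
        """Re-drop a previously spilled value at the front."""
        self.drop(self.scratch[addr])
```

    The point of the model is only the lifetime problem: once more drops happen than the belt is long, the old value is gone unless it was spilled first.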

  • Veedrac
    Participant
    Post count: 25
    in reply to: Loop compilation #3380

    “So, when we also include the ‘y’, and not use spiller it will be something like this?”

    As I understand it, yes, basically. There are a few differences I know of.

    You won’t use or(x, 0) to reorder ops, or xor(0, 0) to load zeros. Reordering would probably be done here by using a branch instruction encoding that allows for reordering, or the conform instruction (the Wiki isn’t up-to-date on this). Loading a zero would probably be done with rd, since that phases well.

    eql takes two arguments unless it’s ganged (then it’s 0). I think you would have to rd in a zero or compare to %i, but I’m not sure. eql is also phased such that it can be in the same cycle as brtr; my semicolon was a typo.

    • Veedrac
      Participant
      Post count: 25
      in reply to: Loop compilation #3381

      (not editing because that broke things last time)

      There’s also no reason to load multiple zeros in when using a conform-like-brtr (or conform) since it presumably lets you specify a belt item in multiple places.
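      To illustrate that last point (a software analogy of mine, not actual Mill machinery — names and belt contents are made up): if the remapping branch is just an index list over current belt positions, an index can repeat, so one dropped zero can land in several output slots.

```python
def conform_remap(belt, wanted):
    """Toy model of a remapping branch: 'wanted' lists which current
    belt positions become b0, b1, ... on the far side of the branch.
    Indices may repeat, so one value can fill several slots."""
    return [belt[i] for i in wanted]

belt = ["%acc", "0", "%i"]               # current belt: b0, b1 (a zero), b2
new_belt = conform_remap(belt, [1, 1, 2, 0])
# the single zero now occupies two positions of the new belt
```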

  • Veedrac
    Participant
    Post count: 25
    in reply to: Loop compilation #3377

    (reposting due to forum issues; apologies if this spams)

    No optimizations means you have a hideously long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix, but for the sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like this, with differences down to the lack of an up-to-date reference and to me not really getting how nops work.

    
        L("loop") %A1 %x %i %nsubi %tmp1 %tmp2;
            load(%A1, 0, %i, 4, 4) %a,
            mul(%x,  %i) %xi;
    
            nop; nop;
    
            mul(%xi, %i) %xii;
    
            nop; nop;
    
            mul(%a, %xi)  %tmp1inc,
            mul(%a, %xii) %tmp2inc;
    
            nop; nop;
    
            add1u(%i) %i_,
            sub1u(%nsubi) %nsubi_,
            eql(),
            add(%tmp1, %tmp1inc) %tmp1_,
            add(%tmp2, %tmp2inc) %tmp2_;
            brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
    

    Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
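    For reference, my reading of what the scalar loop above computes (a reconstruction from the schedule — the original source isn’t shown, and the exact load addressing is a guess):

```python
def loop(A1, x, n):
    """Scalar equivalent of the scheduled loop: two accumulators,
    matching %tmp1 and %tmp2."""
    tmp1 = tmp2 = 0
    for i in range(n):       # %i counts up while %nsubi counts down
        a = A1[i]            # load(%A1, 0, %i, 4, 4) %a
        xi = x * i           # mul(%x, %i)  %xi
        xii = xi * i         # mul(%xi, %i) %xii
        tmp1 += a * xi       # mul(%a, %xi)  then add into %tmp1
        tmp2 += a * xii      # mul(%a, %xii) then add into %tmp2
    return tmp1, tmp2
```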

  • Veedrac
    Participant
    Post count: 25
    in reply to: switches #2855

    I’d love to explain the thing you’re missing and how to fix it, but alas I expect there exist lawyers that would be upset if I did. You’re going to have to take what you’ve got.

  • Veedrac
    Participant
    Post count: 25
    in reply to: switches #2852

    1. Sure, you can use retire to get you that fetch-ahead load if you don’t mind stalling while the loop warms up. Alternatively you can just fix the instruction set so I can do it the right way ;).

    2. Neat, I knew there must be something like that. It was too obvious to overlook.

    6, 7, 8. I explicitly chose not to unroll or vectorize to make the comparison more meaningful. I’m sure these things work, but I tried to stick to things an OoO core would be trying to do in hardware. No maths tricks beyond what I could believe stock GCC would give.

    • This reply was modified 7 years, 7 months ago by  Veedrac.
  • Veedrac
    Participant
    Post count: 25

    The real win with a predictor is not in avoiding miss rewinds (at least on a Mill, where a miss is five cycles), which the author’s scheme helps with; it’s in moving instruction lines up the memory hierarchy.

    In a top-end OoO it’s perhaps more true to say the real win is filling the reorder buffer. Your predictor doesn’t have to be that good to fetch code in time, given caches are getting so large and looping is so common, but if you want to fill a 560ish-instruction ROB you need to be predicting 560 instructions ahead, and predictor quality and performance are intensely important for that. But yeah, that’s not so relevant for a Mill.
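    Rough arithmetic for why that compounds so badly (my numbers, not from the post — assuming roughly one branch per six instructions, a common rule of thumb): running 560 instructions ahead means on the order of 93 unresolved branches in flight, and every one must be right for the window to survive.

```python
# Per-branch accuracy compounds over every unresolved branch in the window.
insns_ahead = 560
branches_in_flight = insns_ahead // 6     # ~one branch per six instructions

for acc in (0.95, 0.99, 0.999):
    p_window = acc ** branches_in_flight  # chance the whole window is right
    print(f"per-branch {acc:.3f} -> whole window correct {p_window:.1%}")
```

    Even 99% per-branch accuracy leaves the full window correct well under half the time, which is why top-end predictors chase the last fractions of a percent.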

  • Veedrac
    Participant
    Post count: 25
    in reply to: news? #3493

    A2O, not A20. Presumably shorthand for ‘apples-to-oranges’.

  • Veedrac
    Participant
    Post count: 25

    “I can’t imagine that a Tin can only retire three values a cycle”

    According to the Wiki, Tin peaks at five: two constant loads (flow slots 0/1), one operation (exu slot 0), one condition code (exu slot 1), and a pick.

  • Veedrac
    Participant
    Post count: 25

    ignore this

    • This reply was modified 5 years, 12 months ago by  Veedrac.