Mill Computing, Inc

Forum Replies Created

Veedrac
Participant
September 13, 2020 at 12:27 pm
Post count: 25
in reply to: Is the Tachyum Prodigy perceived as a potential market threat? #3602
The issue with Tachyum is that their sales pitch isn’t that compelling any more because of new Arm server chips. At best they might have slightly better throughput on a slightly smaller die at slightly lower power, but the Ampere Altra already comes with more cores (soon even 128 cores in the Altra Max), and a 3.3 GHz OoO will be more consistent than a 4.0 GHz Tachyum, especially as the Tachyum only looks 4-wide, even if Tachyum claim to win some benchmarks here and there. 20% differences won’t be enough to steal market share from established architectures.
This is not to say the Tachyum Prodigy isn’t extremely cool. They’ve progressed very quickly and I’m a fan of architectural diversity.
Veedrac
Participant
December 28, 2018 at 10:13 pm
Post count: 25
in reply to: UTF-8 Decode Routine #3417
This is really neat, nice. Some comments on the things I’ve found out about the arch:
con(v(0xe0, 0xf0, 0xf8)) is a length-3 vector? This is a little confusing, it needs to be 128 bits IIUC.
andlu(%first, %prefmask) won’t work since the Mill doesn’t splat automatically.
I don’t think you can return immediates as per retntr(%onebyte, %first, 1).
andlu’s immediate is morsel-sized, so andlu(%cont, 0xc0) won’t fit, and I don’t think the Mill will splat immediates either.
smearx(%picked) will return two elements, so you can dump the any(%picked).
con(v(0, 0, 0, 0)) can be a rd() of the appropriate constant.
Overall I don’t know if SIMD was the right choice; using pick and interleaving the different paths would probably be faster.
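For readers who haven't seen the routine under discussion, here is a minimal scalar sketch in Python of what the masks above are presumably for: 0xe0, 0xf0 and 0xf8 classify the lead byte, and 0xc0 checks continuation bytes. This is the standard UTF-8 layout, not the original poster's Mill code; the function names are hypothetical, and it does not reject overlong encodings or surrogates.

```python
# Assumption: the masks from the post, paired with the standard UTF-8 prefix values.
PREFIX_MASKS = (0xE0, 0xF0, 0xF8)    # masks for 2-, 3- and 4-byte lead bytes
PREFIX_VALUES = (0xC0, 0xE0, 0xF0)   # expected lead-byte value after masking
CONT_MASK, CONT_VALUE = 0xC0, 0x80   # continuation bytes look like 10xxxxxx


def sequence_length(first: int) -> int:
    """Return the encoded length implied by a lead byte, or 0 if it is not a lead byte."""
    if first < 0x80:
        return 1
    for length, (mask, value) in enumerate(zip(PREFIX_MASKS, PREFIX_VALUES), start=2):
        if (first & mask) == value:
            return length
    return 0


def decode_one(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one code point starting at pos; return (code_point, bytes_consumed)."""
    first = data[pos]
    length = sequence_length(first)
    if length == 0:
        raise ValueError("invalid lead byte")
    if length == 1:
        return first, 1
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    # Keep only the payload bits of the lead byte (5, 4 or 3 bits).
    code = first & (0xFF >> (length + 1))
    for byte in data[pos + 1:pos + length]:
        if (byte & CONT_MASK) != CONT_VALUE:
            raise ValueError("invalid continuation byte")
        code = (code << 6) | (byte & 0x3F)
    return code, length
```

The three mask comparisons are independent of each other, which is why either a SIMD compare or a chain of pick operations could express them.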
Veedrac
Participant
December 23, 2018 at 2:38 am
Post count: 25
in reply to: Performance counters #3394
I don’t know the details, but there are some trivial counters listed at http://millcomputing.com/wiki/Registers.
There are so many cool things a Mill is uniquely capable of in this space relative to an OoO machine; it would be a shame not to hear that they have something interesting planned.
Veedrac
Participant
December 19, 2018 at 7:19 am
Post count: 25
in reply to: Loop compilation #3388
rd can’t copy belt values; the whole point of that phase is that it has no belt inputs. It has four purposes:
1. Dropping predefined “popular constants” (popCons) like [0, 1, 2, 3], π, e, √2. “The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.” con should work as well, but is significantly larger.
2. Operating on the in-core scratchpad, allocated with scratchf. spill and fill are for the extended scratchpad, which is stored in memory and allocated with exscratchf, and can be much larger.
3. Reading registers. These aren’t belt locations, they are things like hardware counters, threadIDs, and supposedly also “thisPointer” for method calls.
4. Stream handling, a magic secret thing for “cacheless bulk memory access” we’ve never heard the details of, as far as I know.
Veedrac
Participant
December 18, 2018 at 9:02 am
Post count: 25
in reply to: Loop compilation #3380
“So, when we also include the ‘y’, and not use spiller it will be something like this?”
As I understand it, yes, basically. There are a few differences I know of.
You won’t use or(x, 0) to reorder ops, or xor(0, 0) to load zeros. Reordering would probably be done here by using a branch instruction encoding that allows for reordering, or the conform instruction (the Wiki isn’t up-to-date on this). Loading a zero would probably be done with rd, since that phases well.
eql takes two arguments unless it’s ganged (then it’s 0). I think you would have to rd in a zero or compare to %i, but I’m not sure. eql is also phased such that it can be in the same cycle as brtr; my semicolon was a typo.
- Veedrac
  Participant
  December 18, 2018 at 9:09 am
  Post count: 25
  in reply to: Loop compilation #3381
  (not editing because that broke things last time)
  There’s also no reason to load multiple zeros in when using a conform-like-brtr (or conform) since it presumably lets you specify a belt item in multiple places.
Veedrac
Participant
December 18, 2018 at 1:05 am
Post count: 25
in reply to: Loop compilation #3377
(reposting due to forum issues; apologies if this spams)
No optimizations means you have a hideously long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix, but for the sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like so; any inaccuracies are down to my lacking an up-to-date reference and not really understanding how nops work.
```
    L("loop") %A1 %x %i %nsubi %tmp1 %tmp2;
        load(%A1, 0, $i, 4, 4) %a,
        mul(%x,  %i) %xi;

        nop; nop;

        mul(%xi, %i) %xii;

        nop; nop;

        mul(%a, %xi)  %tmp1inc,
        mul(%a, %xii) %tmp2inc;

        nop; nop;

        add1u(%i) %i_,
        sub1u(%nsubi) %nsubi_,
        eql(),
        add(%tmp1, %tmp1inc) %tmp1_,
        add(%tmp2, %tmp2inc) %tmp2_;
        brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
```
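As a rough reading of the assembly above, sketched in Python: the original source loop isn't quoted in this post, so the shape of `naive` is inferred from the load-mul-mul-mul-add chain and from the ops in the listing, and both function names are hypothetical.

```python
def naive(a: list[int], x: int) -> tuple[int, int]:
    """Every multiply waits on the loaded element a[i]."""
    tmp1 = tmp2 = 0
    for i, ai in enumerate(a):
        tmp1 += ai * x * i        # load -> mul -> mul -> add
        tmp2 += ai * x * i * i    # load -> mul -> mul -> mul -> add
    return tmp1, tmp2


def reordered(a: list[int], x: int) -> tuple[int, int]:
    """x*i and x*i*i do not depend on the load, so they can overlap its latency."""
    tmp1 = tmp2 = 0
    for i, ai in enumerate(a):
        xi = x * i            # mul(%x, %i)
        xii = xi * i          # mul(%xi, %i)
        tmp1 += ai * xi       # mul(%a, %xi), then add
        tmp2 += ai * xii      # mul(%a, %xii), then add
    return tmp1, tmp2
```

With the multiplies reordered, only one multiply and one add sit behind the load on each path.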
Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases, so encoding size also shouldn’t be an issue.
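To make the remapping idea concrete, here is a toy Python model. It is only my guess at the concept, not how any Mill member actually implements it: `ToyBelt`, its methods, and the choice that the first-listed branch argument lands at position 0 are all assumptions for illustration.

```python
class ToyBelt:
    """Toy model: logical belt positions are indices into a small mapping table."""

    def __init__(self, length: int) -> None:
        self.length = length
        self.storage: list[int] = []   # physical values; append-only in this toy
        self.mapping: list[int] = []   # mapping[k] = storage index of belt position k

    def drop(self, value: int) -> None:
        """A new result becomes position 0; the oldest name falls off the end."""
        self.storage.append(value)
        self.mapping.insert(0, len(self.storage) - 1)
        del self.mapping[self.length:]

    def read(self, pos: int) -> int:
        return self.storage[self.mapping[pos]]

    def branch(self, positions: list[int]) -> None:
        """Pass belt arguments across a branch by permuting names, not moving data."""
        self.mapping = [self.mapping[p] for p in positions]


belt = ToyBelt(8)
for value in (10, 20, 30):
    belt.drop(value)
belt.branch([2, 0])   # new belt: old position 2 first, then old position 0
```

Only the small `mapping` table is rewritten by `branch`; the values in `storage` stay where they are, which is the sense in which the shuffle resembles a permute over a few bytes.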
Veedrac
Participant
May 23, 2017 at 5:15 pm
Post count: 25
in reply to: switches #2855
I’d love to explain the thing you’re missing and how to fix it, but alas I expect there exist lawyers that would be upset if I did. You’re going to have to take what you’ve got.
Veedrac
Participant
May 23, 2017 at 2:28 pm
Post count: 25
in reply to: switches #2852
1. Sure, you can use retire to get you that fetch-ahead load if you don’t mind stalling while the loop warms up. Alternatively you can just fix the instruction set so I can do it the right way ;).
2. Neat, I knew there must be something like that. It was too obvious to overlook.
6, 7, 8. I explicitly chose not to unroll or vectorize to make the comparison more meaningful. I’m sure these things work, but I tried to stick to things an OoO core would be trying to do in hardware. No maths tricks beyond what I could believe stock GCC would give.
- This reply was modified 6 years, 11 months ago by Veedrac.
Veedrac
Participant
August 5, 2020 at 6:19 am
Post count: 25
in reply to: Meltdown and Spectre #3556
The real win with a predictor is not in avoiding miss rewinds (at least on a Mill where a miss is five cycles) which the authors’ scheme helps with, it’s in moving instruction lines up the memory hierarchy.
In a top-end OoO it’s perhaps more true to say the real win is filling the reorder buffer. Your predictor doesn’t have to be that good to deal with fetching code in time, given caches are getting so large and looping is so common, but if you want to fill a 560ish instruction ROB, you need to be predicting 560 instructions ahead, and predictor quality and performance are intensely important for that. But yeah, that’s not so relevant for a Mill.
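A back-of-envelope sketch of why accuracy matters so much at that depth, in Python. The branch density (one branch per five instructions) and the independence of mispredictions are assumptions for illustration, not measured figures, and `window_survival` is a hypothetical helper.

```python
def window_survival(rob_entries: int, instrs_per_branch: float, accuracy: float) -> float:
    """Chance that every branch in a full ROB window was predicted correctly,
    assuming each prediction is independently right with probability `accuracy`."""
    branches_in_flight = rob_entries / instrs_per_branch
    return accuracy ** branches_in_flight


# Assumed: a 560-entry window with one branch every 5 instructions (112 branches).
for accuracy in (0.95, 0.99, 0.999):
    chance = window_survival(560, 5, accuracy)
    print(f"accuracy {accuracy:.3f}: window fully correct {chance:.1%} of the time")
```

Under these assumptions, 99% per-branch accuracy leaves the whole window on the correct path only about a third of the time, while 99.9% gets it to roughly nine times in ten.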
Veedrac
Participant
July 21, 2019 at 5:33 pm
Post count: 25
in reply to: news? #3493
A2O, not A20. Presumably shorthand for ‘apples-to-oranges’.
Veedrac
Participant
February 5, 2019 at 12:15 pm
Post count: 25
in reply to: Scratchpad design decision #3423
“I can’t imagine that a Tin can only retire three values a cycle”
According to the Wiki, Tin peaks at five: two constant loads (flow slots 0/1), one operation (exu slot 0), one condition code (exu slot 1), and a pick.
Veedrac
Participant
December 23, 2018 at 5:22 pm
Post count: 25
in reply to: Performance counters #3397
ignore this
- This reply was modified 5 years, 3 months ago by Veedrac.