Forum Replies Created
The issue with Tachyum is that their sales pitch isn’t that compelling anymore, because of the new Arm server chips. At best they might have slightly better throughput on a slightly smaller die at slightly lower power, but the Ampere Altra already comes with more cores (soon even 128 cores in the Altra Max), and a 3.3 GHz OoO core will be more consistent than a 4.0 GHz Tachyum, especially as the Tachyum only looks 4-wide, even if Tachyum claim to win some benchmarks here and there. 20% differences won’t be enough to steal market share from established architectures.
This is not to say the Tachyum Prodigy isn’t extremely cool. They’ve progressed very quickly and I’m a fan of architectural diversity.
- in reply to: UTF-8 Decode Routine #3417
This is really neat. Some comments based on what I’ve found out about the arch:
con(v(0xe0, 0xf0, 0xf8)) is a length-3 vector? This is a little confusing; it needs to be 128 bits, IIUC.
andlu(%first, %prefmask) won’t work since the Mill doesn’t splat automatically.
I don’t think you can return immediates as per retntr(%onebyte, %first, 1).
andlu’s immediate is morsel-sized, so andlu(%cont, 0xc0) won’t fit, and I don’t think the Mill will splat immediates either.
smearx(%picked) will return two elements, so you can dump the any(%picked).
con(v(0, 0, 0, 0)) can be a rd() of the appropriate constant.
Overall I don’t know if SIMD was the right choice; using pick and interleaving the different paths would probably be faster.
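For reference (this is not Mill code, just my own illustration): the masks in question are the standard UTF-8 lead-byte classes, and the pick-and-interleave alternative would branch on exactly this classification per byte rather than smearing it across a vector. A minimal Python sketch of the scalar classification:

```python
# Classify a UTF-8 lead byte using the same masks the vector version
# tests (0xe0/0xf0/0xf8 prefixes; continuation bytes are 10xxxxxx).
def utf8_seq_len(lead: int) -> int:
    """Sequence length implied by a lead byte; 0 for a continuation byte."""
    if lead & 0x80 == 0x00:
        return 1          # 0xxxxxxx: ASCII
    if lead & 0xe0 == 0xc0:
        return 2          # 110xxxxx: two-byte sequence
    if lead & 0xf0 == 0xe0:
        return 3          # 1110xxxx: three-byte sequence
    if lead & 0xf8 == 0xf0:
        return 4          # 11110xxx: four-byte sequence
    return 0              # 10xxxxxx: continuation (or invalid)
```

A scalar decoder would dispatch on this length and handle each path separately, which is the interleaving being suggested.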
- in reply to: Performance counters #3394
I don’t know the details, but there are some trivial counters listed at http://millcomputing.com/wiki/Registers.
There are so many cool things a Mill is uniquely capable of in this space relative to an OoO machine; it would be a shame not to hear that they have something interesting planned.
- in reply to: Loop compilation #3388
rd can’t copy belt values; the whole point of that phase is that it has no belt inputs. It has four purposes:
1. Dropping predefined “popular constants” (popCons) like [0, 1, 2, 3], π, e, √2. “The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.” con should work as well, but is significantly larger.
2. Operating on the in-core scratchpad, allocated with scratchf. spill and fill are for the extended scratchpad, which is stored in memory, is allocated with exscratchf, and can be much larger.
3. Reading registers. These aren’t belt locations; they are things like hardware counters, thread IDs, and supposedly also “thisPointer” for method calls.
4. Stream handling, a magic secret thing for “cacheless bulk memory access” we’ve never heard the details of, as far as I know.
- in reply to: Loop compilation #3380
“So, when we also include the ‘y’ and don’t use the spiller, it will be something like this?”
As I understand it, yes, basically. There are a few differences I know of.
You won’t use or(x, 0) to reorder ops, or xor(0, 0) to load zeros. Reordering would probably be done here by using a branch instruction encoding that allows for reordering, or the conform instruction (the Wiki isn’t up-to-date on this). Loading a zero would probably be done with rd, since that phases well.
eql takes two arguments unless it’s ganged (then it’s 0). I think you would have to rd in a zero or compare to %i, but I’m not sure. eql is also phased such that it can be in the same cycle as brtr; my semicolon was a typo.
- in reply to: Loop compilation #3381
(not editing because that broke things last time)
There’s also no reason to load multiple zeros in when using a conform-like-brtr (or conform) since it presumably lets you specify a belt item in multiple places.
- in reply to: Loop compilation #3377
(reposting due to forum issues; apologies if this spams)
No optimizations means you have a hideously long dependency chain in your loop (load-mul-mul-mul-add). This is easy to fix, but for the sake of demonstration I’ll stop at reordering the multiplies. The loop would look something like this; any differences come down to the lack of an up-to-date reference and to my not really understanding how nops work.
L("loop") %A1 %x %i %nsubi %tmp1 %tmp2;
load(%A1, 0, %i, 4, 4) %a, mul(%x, %i) %xi;
nop; nop;
mul(%xi, %i) %xii;
nop; nop;
mul(%a, %xi) %tmp1inc, mul(%a, %xii) %tmp2inc;
nop; nop;
add1u(%i) %i_, sub1u(%nsubi) %nsubi_, eql(), add(%tmp1, %tmp1inc) %tmp1_, add(%tmp2, %tmp2inc) %tmp2_;
brtr("loop", %A1 %x %i_ %nsubi_ %tmp1_ %tmp2_);
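To show what the reordering buys, here is my Python reading of the loop body (the exact recurrence is inferred from the mul/add operands above, so treat it as a guess): the two multiplies by i are hoisted off the load’s critical path, so they no longer wait on the loaded value.

```python
# Sketch of the reassociated loop: per iteration it loads a = A1[i]
# and accumulates a*(x*i) and a*(x*i*i) into two running sums.
def loop(A1, x, n):
    tmp1 = tmp2 = 0
    for i in range(n):
        xi = x * i       # independent of the load: can issue early
        xii = xi * i     # also independent of the load
        a = A1[i]        # the load-mul-mul-mul-add chain is now load-mul-add
        tmp1 += a * xi
        tmp2 += a * xii
    return tmp1, tmp2
```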
Shuffling the arguments around is done during the branch. It’s fast because it’s hardware magic. IIUC for something big like Gold the basic idea is that there’s a remapping phase where your logical belt names get mapped to physical locations, akin to an OoO’s register renaming, but 32ish wide instead of 368ish as on Skylake. The branch would just rewrite these bytes using what amounts to a SIMD permute instruction. A small latency in this process isn’t a problem because branches are predicted and belt pushes are determined statically. The instruction can be encoded with a small bitmask in most cases so also shouldn’t be an issue.
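A toy model of that remapping idea (my own construction, not documented Mill behavior): logical belt names are just a small table of physical locations, and a belt-reordering branch is a permutation applied to that table, with no data actually moving.

```python
# Remap a belt name table: slot i of the result holds the physical
# location previously named by logical position wanted[i].
def branch_remap(name_table, wanted):
    return [name_table[i] for i in wanted]

# Physical locations 'p0'..'p5' currently named by belt positions 0..5.
table = ['p0', 'p1', 'p2', 'p3', 'p4', 'p5']
# A brtr-style reorder: choose which old positions form the new belt
# front (the index pattern here is invented for illustration).
table = branch_remap(table, [0, 1, 4, 5, 2, 3])
```

In hardware this is the SIMD-permute step described above: a handful of small indices rewritten in parallel, cheap enough to hide behind branch prediction.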
1. Sure, you can use retire to get you that fetch-ahead load if you don’t mind stalling while the loop warms up. Alternatively you can just fix the instruction set so I can do it the right way ;).
2. Neat, I knew there must be something like that. It was too obvious to overlook.
6, 7, 8. I explicitly chose not to unroll or vectorize to make the comparison more meaningful. I’m sure these things work, but I tried to stick to things an OoO core would be trying to do in hardware. No maths tricks beyond what I could believe stock GCC would give.
- in reply to: Meltdown and Spectre #3556
The real win with a predictor is not in avoiding miss rewinds (at least on a Mill, where a miss is five cycles), which the author’s scheme helps with; it’s in moving instruction lines up the memory hierarchy.
In a top-end OoO it’s perhaps more true to say the real win is filling the reorder buffer. Your predictor doesn’t have to be that good to fetch code in time, given caches are getting so large and looping is so common, but if you want to fill a 560ish-instruction ROB, you need to be predicting 560 instructions ahead, and predictor quality and performance are intensely important for that. But yeah, that’s not so relevant for a Mill.
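A back-of-the-envelope check of that claim (the fetch width here is my own assumption, not from the discussion): keeping the ROB full means predicting roughly ROB-size instructions ahead, which at any plausible fetch width is many cycles of correct lookahead.

```python
rob_size = 560                          # ~top-end ROB, per the text
fetch_width = 8                         # assumed fetch/decode width
cycles_ahead = rob_size / fetch_width   # cycles of prediction needed
```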
- in reply to: Scratchpad design decision #3423
“I can’t imagine that a Tin can only retire three values a cycle”
According to the Wiki, Tin peaks at five: two constant loads (flow slots 0/1), one operation (exu slot 0), one condition code (exu slot 1), and a pick.
- in reply to: Performance counters #3397
ignore this