Scratchpad design decision

Tagged: belt, rescue

Author
Posts
NoDot
Participant
December 22, 2017 at 1:32 am
Post count: 6
#3119 |
Last I heard (and according to the wiki) the Scratchpad structure is addressed as an array of bytes. Bytes that also store metadata. The size is member-dependent, but IIRC, it’s intended to store about ten entries. Larger members therefore need more Scratchpad bytes for their larger vectors.
Maxi, IIRC, was the extreme of this, needing about 2KB for its Scratchpad because of the large vector size and use. Yet still storing only about ten entries.
I assume (ha!) it was discussed-but-discarded, so may I ask why the Scratchpad doesn’t simply use entries like the belt?
It seems logical to me, considering the belt already is agnostic to the width of the entry. Jumping back to byte addressing does not follow. Especially given the Scratchpad preserves metadata. (Are entries power-of-two-plus-some? Or is it stored side-along in some way? Or does it vary? NYF?)
Ivan Godard
Keymaster
December 22, 2017 at 1:57 am
Post count: 689
#3120
The number of entries is a member config decision; ten would be small.
Storing by entry number would require a mapping from entry number to position, or (trivial mapping) with all entries being maximal size. We pack them (saving space/increasing capacity) and reference by start byte number. The byte number needs more bits to encode than an entry number would, but the scratch ops are otherwise small and currently we just burn the entropy. The belt uses full sized entries and doesn’t try to pack because the actual belt is a set of latches/regs that must be full width anyway.
The choice for scratch implementation is left to the hardware guys, and might be different in different members due to hardware/cost/power considerations.
- NoDot
  Participant
  December 22, 2017 at 2:40 am
  Post count: 6
  #3121
  Storing by entry number would require a mapping from entry number to position, or (trivial mapping) with all entries being maximal size.
  How odd. I would expect that, in practice, older belt items would leave the output latches for some space in the spiller – entries that haven’t fallen off yet and need to still be kept around, but have still been around for a few cycles; or things spilled back in after some nested function calls. I expected that an entry-based Scratchpad could simply be an extension of those locations.
  But this is likely just gut-feeling not matching reality.
  - Ivan Godard
    Keymaster
    December 22, 2017 at 10:42 am
    Post count: 689
    #3124
    We can’t leave scratchpad-usage data in the spiller because the data is both spatially and temporally random access, while the spiller is at heart just a glorified stack. Items can be left in the scratchpad for arbitrarily long times without increasing the latency of later access, whereas items in the spiller eventually migrate to memory and get memory latency.
    Instead we want the scratchpad to have uniform latency and simple random access, without the expensive mux crossbar needed for spiller access even to limited depth. So really scratch acts, and is mostly implemented like, register files in conventional memory. The differences include the inclusion of metadata, the self-defining data widths, and the packing at byte granularity.
NoDot
Participant
December 22, 2017 at 12:53 pm
Post count: 6
#3125
(We’re slipping away from the topic, but it’s been answered, so…)
I think this is a communication issue. I assume there simply aren’t enough output latches in a single ALU to hold a belt’s worth of outputs. Therefore either those results disappear before they would drop off the belt, or there’s a place they get sent to to wait for the belt to advance them away.
(This would likely require a pathological or artificial case to cause, but I think it deserves an answer.)
I would assume that such a place would be part of the Spiller system-its top layer, perhaps. And that an entry-based Scratchpad would be a logical extension of such a structure.
- Ivan Godard
  Keymaster
  December 22, 2017 at 1:42 pm
  Post count: 689
  #3126
  It’s a lifetime issue. The scratchpad is not a simple extension of the belt, it’s a repository for values with long or indeterminate lifetimes. The spill op copies a value from the belt to scratch, and that same value may move into the spiller if there were a call while it’s still live. But a value computed before a loop and used after the loop (and maybe in the loop) has an unknown lifetime, so we need to save it for the duration of the loop. Mill execution makes values with great abandon, and we can’t save them all as if there were an infinite belt. So we need a way for the compiler to tell the hardware that a particular value is of continuing interest, and be able the request it again later. That’s the spill and fill ops.
  In contrast the spiller saves everything in-flight and on the belt, but that’s a tiny population compared to everything that has ever been on the belt, which it the potential population for the scratchpad. Different expected lifetimes, different reference patterns, different latency, complexity, and power constraints -> different mechanisms.
NoDot
Participant
December 22, 2017 at 2:05 pm
Post count: 6
#3127
You… aren’t answering the question I’m asking, but I think I see where you’re getting confused.
I get that the Scratchpad is a different place than the belt. I do. It isn’t a extension of the belt, it’s a different physical location.
I said “an entry-based Scratchpad would be a logical extension of such a structure”. Allow me to expand and unbox:
I an only talking about storing and addressing an entry-based list of items. That the same mechanism that stores and addresses those still-live-but-older values would be the same or similar to the mechanism that stores or addresses the items in an entry-based Scratchpad.
This entry-based Scratchpad would still be populated by items sent there specifically. It would just share part of its mechanism with where older-but-still-live belt items stay until falling off.
And physically locating those places in the Spiller system is what I assumed. (In retrospect, the Scratchpad as a whole likely can’t be fit in the Spiller without pointlessly ballooning the size and scope of that subsystem.)
edit: By “sharing the same mechanism” I meant they would share the same design, not the same physical one. That the design would already need to exist and giving the Scratchpad its own copy would be “simple.” Your first post says it’s not worth the effort, though.
- This reply was modified 6 years, 7 months ago by NoDot. Reason: Additional disambiguation and correcting spelling mistakes
- Ivan Godard
  Keymaster
  December 22, 2017 at 4:19 pm
  Post count: 689
  #3129
  I’ll try again. The belt uses temporal addressing; the scratchpad uses spatial addressing. There are two addresses involved in a spill/fill: the belt address of the value to be spilled somewhere, and the “somewhere” address needed to choose what value to fill. The present Mill uses temporal for the first, and like any reference the spill must execute before its target drops off the belt. If scratch were part of the spiller then fill would need a (arbitrarily large) address to look into the spiller history to find the value.
  You can’t use temporal addressing for long- or indefinite-lived values because the temporal address range is unbounded. Hardware doesn’t do unbounded. With spatial addressing the address range is bounded by the configured size of the scratchpad. Hardware does that, although the tool chain must deal with running out of the bounds.
  Perhaps you are thinking of a scheme whereby the spill op would push the value into a side stack and the fill would use a stack offset rather than a temporal reference to address it. That’s possible, but the stack management hardware is more than is needed for a simple regfile-like array of values. And, returning to the first question, one would need either maximal sized entries, or a map from entry number to packed byte offset, or make the stack byte addressable.
  I’m not saying that one couldn’t put the scratchpad in the belt space so that scratch entries could sit in the same latches as belt operands. But the addressing logic to get such a scratch entry back into the space where adds and such could address it is too expensive because it would push up the size of the crossbar. So we keep the addresses space separate.
NoDot
Participant
December 22, 2017 at 4:50 pm
Post count: 6
#3130
I’m sorry that I sound far more ignorant than I am. Apparently I’m bad a communication. Sorry.
So this:
The belt uses temporal addressing; the scratchpad uses spatial addressing. There are two addresses involved in a spill/fill: the belt address of the value to be spilled somewhere, and the “somewhere” address needed to choose what value to fill. The present Mill uses temporal for the first, and like any reference the spill must execute before its target drops off the belt.
and this:
But the addressing logic to get such a scratch entry back into the space where adds and such could address it is too expensive because it would push up the size of the crossbar. So we keep the addresses space separate.
These are all things I know, have known, or figured were the case. I have never been confused on them, but it seems I sounded that way. Again, sorry.
Perhaps you are thinking of a scheme whereby the spill op would push the value into a side stack and the fill would use a stack offset rather than a temporal reference to address it.
While I have considered a stack or second belt for the Scratchpad (as I’m sure you have), no, I was thinking of a spatially-addressed, disjoint Scratchpad addressed by entry rather than one of those. I figured something similar was already present in the hardware for another purpose (holding old-but-still-present items on the belt).
Therefore either the team never thought of it or there were complications that prevented its use. And if the later, I wondered what those were. And I now have an answer to that.
The only question left is whether this holding place for “out of the output latches but not yet fallen off” belt entries exists or not. And if so, whether its mechanism would actually be similar to what a spatially-addressed, entry-based Scratchpad would require for mapping entries to its storage.
(I thought the Spiller would be a sensible place for this storage space-saving transfer time on function call spill/fill. Hence our entire Spiller digression.)
Ivan Godard
Keymaster
December 22, 2017 at 5:22 pm
Post count: 689
#3131
The spiller holds ongoing program state as of a stall, mispreduct, call or return event. That state includes much more than just the then-current belt. Belt values, in their latches, are moved lazily into the spiller and its internal SRAM and eventually migrate to the spillets in DRAM. However, these are relatively simple to handle. The most difficult part of the spiller deals with in-flights, which do not yet exist at event time, but will be produced eventually and must then be captured for subsequent replay. That requires temporal ordering information that is not an address, but may be thought of as a stream or pipe.
So there is a part of the spiller that does indeed hold full operands (possibly compressed at hardware option), but this is not addressable in the sense that DRAM or scratchpad is. Instead the operands (not necessarily contiguous) are organized for ordered replay. As the “address” changes continuously during replay and the operands will have random and varying other state intermixed, it does not seem practical to try to use spiller hardware for the functionality that is the present scratchpad.
Thomas D
Participant
January 27, 2019 at 11:18 pm
Post count: 24
#3418
I’ve been plotting to write a belt virtual machine and thinking about its consequences. I think that for virtual machines, the stack machine will always rule due to ease of programming (that is, it is an easy target because the compiler doesn’t care how deep a computation goes, it just keeps dropping things on the stack and the machine takes care of it).
The questions I have are probably proprietary, but here goes:
How did you decide on a 32 entry belt (for Gold, and 8 entry belt for Tin)?
Why a scratchpad instead of a second (slower) belt?
How was the size of the scratchpad decided on?
I’ve been tossing around two ideas of alternative to a scratchpad. One is to have four belts, with the opcode deciding which belt the result drops onto (and all four belts addressable for inputs). (That sounds horribly complicated for hardware, but easy for a VM). The second is adding a second (and maybe third) belt that results are moved onto, forming a hierarchy of gradually slower belts. As you’ve probably thought of these ideas, what gotchas am I not seeing?
- jessta
  Participant
  February 2, 2019 at 8:54 am
  Post count: 1
  #3419
  This isn’t an official answer, but I think at least some of your questions are answered in the Belt talk.
  * How did you decide on a 32 entry belt (for Gold, and 8 entry belt for Tin)?
  The scratchpad has a three cycle spill to fill latency, if you spill a value you won’t be able to get it back for 3 cycles because of this the length of the belt is set so that nearly everything lives for three
  cycles on the belt. So the length of the belt needs to be 3 times the number of results that can be produced by functional units in one instruction for that family member.
  
  * Why a scratchpad instead of a second (slower) belt?
  The belt is quite different from the scratchpad, the scratchpad only has two operations “spill a value from the belt and put it in the scratchpad”(spill) and “take a value from the scratchpad and put it on the belt”(fill).
  * How was the size of the scratchpad decided on?
  The videos mention that the scratchpad is on chip memory, that can be spilled out to caches and eventually out to DRAM if necessary. The size of a scratchpad available for a certain function is allocated by the specializer for that function. The function says how much scratchpad it needs upfront. The size of the available on chip memory is the same cost/speed trade off you make when buying DRAM, buy as much as you can afford so that typical program use won’t need to swap memory out to disk.
  I’m not sure I understand specifically what you mean by a ‘slower belt’. The belt described in the videos is per function call. Every function call gets it’s own empty belt and scratchpad, it’s caller’s belt(and scratchpad) is still around and the caller’s caller’s belt is also still around. You could say that the belts of callers were ‘slower belts’ than the belt of the current function call as they don’t change or move until the current function call completes and returns it’s result.
Thomas D
Participant
February 4, 2019 at 9:26 pm
Post count: 24
#3422
The scratchpad has a three cycle spill to fill latency, if you spill a value you won’t be able to get it back for 3 cycles because of this the length of the belt is set so that nearly everything lives for three
cycles on the belt. So the length of the belt needs to be 3 times the number of results that can be produced by functional units in one instruction for that family member.
That makes sense, but, I can’t imagine that a Tin can only retire three values a cycle, though. Then again, maybe I just suck at understanding real hardware.
The belt is quite different from the scratchpad
…
I’m not sure I understand specifically what you mean by a ‘slower belt’.
If you think of the belt abstraction: you’ve got this conveyor belt that values go onto and you pull some off, operate on them, and put the result on the belt. The newest results go on the front of the belt and the oldest results fall of the back of the belt. Now, imagine two of these belts. A spill operation moves a value onto the slower belt. It is the only reason the slower belt moves. The fill operation takes a value off the slow belt and puts it back onto the fast belt. The ALU (etc) operates off the fast belt. Values cycle on that belt quickly: it is fast. The slow belt only changes when we need to rename something as being slow.
The only thing I see with this is that people will find pathological algorithms which require an inane amount of working set to run.
The size of the available on chip memory is the same cost/speed trade off you make when buying DRAM.
Tin has only 128 bytes of scratchpad, and Gold has 512. Why so small? I realize that the scratchpad isn’t expected to be used frequently. Then again, maybe the Tin should have more Scratchpad to make up for its lack of Belt.
- Veedrac
  Participant
  February 5, 2019 at 12:15 pm
  Post count: 25
  #3423
  I can’t imagine that a Tin can only retire three values a cycle
  According to the Wiki, Tin peaks at five: two constant loads (flow slots 0/1), one operation (exu slot 0), one condition code (exu slot 1), and a pick.
- NDxTreme
  Participant
  April 6, 2019 at 7:47 pm
  Post count: 2
  #3451
  My understanding is there is no belt, when you get down to the hardware level. Any “belts” you add is a new structure to add to the Mill. The current belt is essentially the result slots from the different arithmetic units, which allows for the reduction of hardware characterized by the register laden designs found in tradition cpus.
- Ivan Godard
  Keymaster
  April 7, 2019 at 3:43 am
  Post count: 689
  #3452
  A “slow belt” is an interesting idea. The problem is congruence over control flow. When the execution diverges at a branch, the various subsequent paths may drop different numbers of operands, so when the paths rejoins later the older belt items may have been pushed further along on one path than another. The two (or more) possible belt layouts at the join point must be brought into congruence so that in subsequent references each operand has a unique belt offset for its temporal address. The “rescue” and branch operations make things congruent. On the fast belt, the lifetimes are brief enough that there are relatively few operands that need to be shuffled to establish congruence, so the instruction bit space for arguments in rescue and branches is relatively small.
  However, the scratchpad, or a slow belt, is for long-lived values by definition, and the lifetime of its working set would extend over much control flow. A slow belt would have to be kept congruent, just as the fast belt must be. However, the working set at the scratchpad lifetime level is much larger than the fast belt working set, so slow rescues or other slow reshuffling ops would need to name operands from a much larger name space, and would need many more bits to encode. A maximal fast rescue on a Silver (fast belt 16) is 16*4 = 64 bits, which is easily encodable. Meanwhile a slow rescue on a 256-position space would need 256*8 = 2048 bits, which we can’t encode. In addition the logical-to-physical mapping hardware for the slow namespace would need to be much bigger than that of the fast belt, likely even bigger than the logical-to-rename-register mapping hardware of a legacy OOO machine.
  By the way, a maximal rescue on a 32-belt member needs 160 bits, which needs four flow slots to encode; on a 64-belt a maximal needs 384 bits which is past anything we have configured or have thought to configure; and more than that is right out in the present encoding. It would be nice to be able to encode such ultra-wide constants, not only for wide rescue but also for things like bit-matrix multiply, but we don’t support that at present.
  In exchange for the expensive congruence, a slow belt would have a more compact “spill” operation than scratchpad spills do, because slow-belt spills would not need to express the destination address. However, because the slow belt is populated by long-lived values that are going to be referenced many times during their lifetime, a slow belt would have a larger fill/spill (read/write) ratio than the fast belt, which reduces the value of the compact slow spill.
  Our scratchpad alternative uses spatial addressing, which moves the logical-to-physical mapping to the compiler. As a result the scratchpad is always congruent and control flow can be ignored regardless of the scratchpad size. The spill op needs to carry an address, but scratch addresses are small (likely 8 to 16 bits in different configurations), and spill is as relatively rare in scratch as it would be in a slow belt.
  All in all, conceptually a “slow belt” is possible, but I doubt it would be practical. Definitely worth thinking about though; thank you.
  - NoDot
    Participant
    April 8, 2019 at 4:29 pm
    Post count: 6
    #3456
    The problem is congruence over control flow.
    I’m sad I didn’t think of this before. I’l take note.
    A maximal fast rescue on a Silver (fast belt 16) is 16*4 = 64 bits, which is easily encodable. Meanwhile a slow rescue on a 256-position space would need 256*8 = 2048 bits, which we can’t encode.
    256 bytes stores 32 64-bit values (ignoring metadata), but do you need 256 positions? I would (naively) expect the slow belt to have the same number of entries as the main belt.
    However, because the slow belt is populated by long-lived values that are going to be referenced many times during their lifetime, a slow belt would have a larger fill/spill (read/write) ratio than the fast belt
    I does? I’m afraid I don’t follow.
    - Ivan Godard
      Keymaster
      April 8, 2019 at 7:33 pm
      Post count: 689
      #3457
      256 bytes stores 32 64-bit values (ignoring metadata), but do you need 256 positions? I would (naively) expect the slow belt to have the same number of entries as the main belt.
      You need as many positions as the working set you design for has distinct operands, regardless of size. Working set is per algorithm, not per member, although market segmentation will let us downgrade a little in the smaller members. We haven’t paid much attention to tuning this, but seat-of-the-pants suggest that 64 would be good, maybe 127 better.
      However, because the slow belt is populated by long-lived values that are going to be referenced many times during their lifetime, a slow belt would have a larger fill/spill (read/write) ratio than the fast belt
      I does? I’m afraid I don’t follow.
      You need one spill per operand, and one or more fills per operand. You need a new fill for each reference that no longer has the previous fill reachable. Generally, longer lived values are referenced more times than shorter-lived values, and are more likely to need a new fill rather than re-using a previous fill. As the spill count is constant, and the slow belt can be expected to have more references than the fast belt (and hence more fills) the ratio is different, which drives the fill bandwidth provision among other things.
  - Old coder
    Participant
    April 13, 2019 at 10:39 am
    Post count: 4
    #3458
    First, I am sure I do not fully understand the architecture (but I like it as far as I do).
    Conceptually I see the belt at an array, while I know technically its really not.
    And I see scratchpad also conceptually as an array of some sort.
    When it comes to numbers falling off and later recovering them I can imagine them landing to some other space where these can be recovered from (or not). With memory as the final backstop, all fine so far.
    Now, I did read a comment here that recovering/addressing values would take so many bits to encode and I was wondering why that would need to be. Sure more positions and selective recovery/fetching would indeed cost bits, but that is not the most compact method of encoding I thought. With that background some ideas crossed my find and like to see your opinion on them.
    Idea one:
    Specify a “sector” + “some offset bit-mask” to recover/select multiple values that are near each-other (logically).
    Idea two (more latency):
    Specify one or several “big” sectors that have values that need recovering soon.
    Then use the prior idea to pick specific values within.
    Sectors:
    Sectors could be static relative to something else and instead of a small bit-mask, a single offset could be used instead for fetching single values. Both methods assume values clustered/produced together will likely be needed around the same time. If correct, this could simplify the encoding problem. A compiler could also group constants together based on use, supporting compact addressing schemes.
    It has some similarities with the old Segment + Offset kind of addressing x86 processors used to do back in the days and near pointers in C. If its for selecting values near to each-other or accessing some part of a conceptual array where part of the addressing is constant, it will be very compact.
    This reply was modified 5 years, 3 months ago by Old coder.
    This reply was modified 5 years, 3 months ago by Old coder. Reason: Better formulating my points and correcting typo's
    - Ivan Godard
      Keymaster
      April 13, 2019 at 10:09 pm
      Post count: 689
      #3461
      I’m not sure I’ve understood your suggestions; if my response here seems inapt to you then please post again.
      First, as to “sector”. This seems to be a notion where there are multiple distinct address spaces, with one of them being designated as “current” some how, so one can address within a sector using fewer bits than would be required to address across sectors. If that’s right, then in effect we already have such sectoring: “sectors” are created and destroyed automatically by the call/return mechanism, so each called function has its own private sector. This is a little different from your sectors though, because there is no inter-sector addressing; a function is stuck with its sector and cannot switch to another.
      Then, as to the “chain of waterfalls” idea, where each space “falls off the end” into the next space. This won’t really help, because eventually everything ever produced at the top will eventually fall over every waterfall – which means that the path to DRAM (the last) would have to have the same bandwidth as the rate of result production at the top, in the core; the whole machine would run at memory speed, which in core terms is sloooow. The chain is also pointless, because most operands are finished with before they fall off, so there’s no need to preserve them over the subsequent falls.
      So there needs to be some way to designate which operands are to be discarded and which passed on to the next level. The Mill belt by default discards everything, with the program explicitly saving any still alive. It could be done the other way, saving everything except what is explicitly discarded, but then one would still need to say where things are to be saved, and yes, an auto-save to a belt-like next level is possible. However, that level in turn would have to be kept congruent over control flow, which costs bits too. Which is the better trade-off depends on the likelihood of control flow over the life of an operand on the second level and the size of that level; looking at the code we are trying to execute, we decided that using spatial addressing on the second (scratchpad) level worked best.
      Then, you suggest a “save many” and “restore many” so as to reduce the save/restore entropy, analogous to the register bulk save/restore of legacy ISAs. A “many” operation would be easier on a microcoded machine (Mill doesn’t use microcode) but could certainly be implemented on a non-micro physical architecture. However, to be useful the lifetimes of the operands in the “many” would need to be similar, so they could be saved and restored as a group. While the start of lives in a group would be similar (everything needing saving that was produced since the last save), the end of lives turn out to be very different, almost random, obviating a “restore many”. This suggests that a “save many” and “restore one” design would be possible. It’s not clear to me whether the entropy saving of a “save many” operation would justify its hardware complication; needs thinking about.
      Then about sector+offset addressing. The value of this depends on whether multiple references are contiguous enough to pay the cost of switching sectors (and keeping the “current sector” congruent over control flow). Looking into that and doing actual measurement would be a good intern project, but by eyeball based on the code that we have running now I’m inclined to think that the reference patterns are too random to offer enough clustering value. It would also complicate the software, which is why sectoring has dropped out of use in legacy architectures.
      - Old coder
        Participant
        April 14, 2019 at 4:18 am
        Post count: 4
        #3462
        Thank you for your swift reply.
        You were very close to the mark on addressing what I meant to ask (except for the waterfall part).
        My primary points were handling multiple values at once and the sector/page/segment/block, …
        (give it a name) addressing to be able to be compact with respect to details.
        It assumes there will be address locality similarities to exploit, but this is just an assumption on my part. It mainly came up when there was talk about the size of the scratchpad. Bigger size means harder time to encode and any locality if you will could be exploited to lessen the negative side of increased size.
        As a slight variation on the multiple values in one go theme, would a “touch” like operation that specifies multiple belt positions (maybe just in the last N positions) that are needed soon again help in any way? Say the belt has 32 positions, and the compiler knows some of the last 8 values will be needed shortly again, explicitly copying them as fresh belt values might be more compact than explicitly saving/restoring them elsewhere. It would be compiler controlled removal of “junk” from the belt by duplicating good values as new. Conceptually a belt could be increased in size with only this operation being able to operate on the extended part and nowhere else. It might even allow for a smaller directly addressed belt in the process, saving bits in every operand encoding. I probably should read up on the existing instructions to understand some more, I am currently shooting a bit from the hip here.
        Also, never having worked with an architecture that exposed a scratchpad to me, I am wondering how its typically used. If it would be for constants that are needed a few times during execution of code, I imagine normal memory addressing and the cache system would work just fine. Is there a typical cutoff point where the scratchpad starts to benefit and at what sizes?
        This reply was modified 5 years, 3 months ago by Old coder. Reason: Typo's and clarification
        This reply was modified 5 years, 3 months ago by staff.
      - Ivan Godard
        Keymaster
        April 14, 2019 at 10:20 am
        Post count: 689
        #3475
        As a slight variation on the multiple values in one go theme, would a “touch” like operation that specifies multiple belt positions (maybe just in the last N positions) that are needed soon again help in any way?
        That’s what “rescue” does.
        Also, never having worked with an architecture that exposed a scratchpad to me, I am wondering how its typically used. If it would be for constants that are needed a few times during execution of code, I imagine normal memory addressing and the cache system would work just fine. Is there a typical cutoff point where the scratchpad starts to benefit and at what sizes?
        The compiler doesn’t load anything twice from memory/cache that it knows is unchanged; memory bandwidth is limited and cache is a power hog. Generally literal constants and some LEAs are simply redone if needed a second time, rather than being run off to scratch. We should probably revisit this for quad FP literals, assuming that a spill/fill is cheaper than a 16-byte literal, but we don’t have much real experience with quad yet. Some programs have memory arrays of constants that they load with explicit addressing; LLVM is a little erratic in how it treats those loads (to constexpr or not), but we just take what LLVM gives us.
        This reply was modified 5 years, 3 months ago by Ivan Godard.
      - Old coder
        Participant
        April 14, 2019 at 5:31 am
        Post count: 4
        #3464
        Thanks for your quick reply.
        I seems that after editing my previous response a few times to correct spelling and improve wording, it vanished. This is probably some automated action. With this response I let you know it exists and focus a bit more.
        After looking at the instruction set, I realized that what I suggested in my response is essentially what is currently the “rescue” operation :).
        It seems my line of though regarding addressing is apparently converging with what is already happening. Now I wonder why the “rescue” operation would not be made to operate on a larger belt than all other operations. Is having a larger belt very hardware intensive even if only rescue can operate on the oldest half (or even just +8 or +16 positions)?
        This reply was modified 5 years, 3 months ago by Old coder. Reason: Better formulating my points and correcting typo's
        This reply was modified 5 years, 3 months ago by Old coder. Reason: Better formulating my points and correcting typo's
      - Ivan Godard
        Keymaster
        April 14, 2019 at 8:43 am
        Post count: 689
        #3474
        Rescue does in fact have twice the range of other references. phases are an abstraction, and there are actually no intra-cycle data movements. Consequently, at the end of the cycle (when rescue takes place) there are the belt’s worth of operand that started the cycle plus however many were dropped during the cycle physically available, which is up to twice belt size. Congratulations; I don’t think anyone before has spotted that to ask about it.
      - Old coder
        Participant
        April 14, 2019 at 11:04 am
        Post count: 4
        #3476
        Blushes 🙂
        Thanks again for your clear answer.
        I am looking forward to more videos on the architecture, in there are many clever solutions to typical problems. It sometimes blows me away. Especially the code stream in two directions and associated split caches and the concept of NaR which is very powerful. But it doesn’t stop there…it all around pure innovation and that I like a lot! I hope to see it in action one day!
      - Ivan Godard
        Keymaster
        April 17, 2019 at 9:43 am
        Post count: 689
        #3479
        We expect that the next talk will be less about how the architecture works and more about how well it works – measured performance numbers, guesstimates for area and power, that kind of thing.
      - LarryP
        Participant
        May 6, 2019 at 4:07 am
        Post count: 78
        #3482
        @ivan,
        Your explanation of rescue’s need for a reach-back of twice the belt length clarifies a cryptic (to me, at least) comment you made (in the belt lecture, I think) about needing (operand/metadata) buffers for up to twice the belt length.
- Ivan Godard
  Keymaster
  April 7, 2019 at 3:57 am
  Post count: 689
  #3453
  Tin has only 128 bytes of scratchpad, and Gold has 512. Why so small? I realize that the scratchpad isn’t expected to be used frequently. Then again, maybe the Tin should have more Scratchpad to make up for its lack of Belt.
  These sizes are more a matter of marketing than technology. The scratch size is a performance limit, not an architectural limit; on any member a program whose working set runs out of scratch can use the (practically) unbounded DRAM address space via the extended spill/fill ops, they just cost more.
  Any given program on any given member will find some architectural limit to increased performance: FU count, cache sizes, retire bandwidth, etc. – and scratchpad size. So for any given configuration there will be customers who will tell us “if you would only increase the size of this feature then my program would run so much faster; pretty please with sugar!”. But as we scale any member’s configuration pretty soon we get to the performance of the next bigger member. So we’ll say “just buy the bigger member”, to which they will say “but that costs more money!”.
  Everybody wants Gold performance at a Tin price. We have to set the configurations to provide reasonable coverage of the price/performance spectrum. The currently defined sizes of scratch on the various members are just placeholders, and will certainly be tuned as the members go to market. But the tuning will be driven by market segmentation considerations, not technical ones. Welcome to the high-tech biz.
Thomas D
Participant
April 7, 2019 at 3:08 pm
Post count: 24
#3454
Thanks for the answer.
So for any given configuration there will be customers who will tell us “if you would only increase the size of this feature then my program would run so much faster; pretty please with sugar!”. But as we scale any member’s configuration pretty soon we get to the performance of the next bigger member. So we’ll say “just buy the bigger member”, to which they will say “but that costs more money!”.
The best response to this is: “Not as much money as licensing the lesser member and having TSMC fab it with your specific changes.” (Well, if you are open to such an arrangement. Are you looking to be an Intel or an AMD?)
- Ivan Godard
  Keymaster
  April 7, 2019 at 5:15 pm
  Post count: 689
  #3455
  More of an Intel vs ARM split; we don’t expect to ever do our own fabbing. Of Intel vs. ARM we plan to model Intel and sell chips, although we’d be happy to do custom chips if the price and NRE were right. Of course, no business plan survives contact with the enemy (apologies Moltke the Elder).
Author
Posts