Mill Computing, Inc. Forums › The Mill Architecture › Scratchpad design decision

  • NoDot
    Participant
    Post count: 5

    Last I heard (and according to the wiki) the Scratchpad structure is addressed as an array of bytes. Bytes that also store metadata. The size is member-dependent, but IIRC, it’s intended to store about ten entries. Larger members therefore need more Scratchpad bytes for their larger vectors.

    Maxi, IIRC, was the extreme of this, needing about 2KB for its Scratchpad because of the large vector size and use. Yet still storing only about ten entries.

    I assume (ha!) it was discussed-but-discarded, so may I ask why the Scratchpad doesn’t simply use entries like the belt?

    It seems logical to me, considering the belt is already agnostic to the width of the entry. Jumping back to byte addressing does not follow, especially given the Scratchpad preserves metadata. (Are entries power-of-two-plus-some? Or is the metadata stored alongside in some way? Or does it vary? NYF?)

  • Ivan Godard
    Keymaster
    Post count: 518

    The number of entries is a member config decision; ten would be small.

    Storing by entry number would require a mapping from entry number to position, or (trivial mapping) with all entries being maximal size. We pack them (saving space/increasing capacity) and reference by start byte number. The byte number needs more bits to encode than an entry number would, but the scratch ops are otherwise small and currently we just burn the entropy. The belt uses full-sized entries and doesn’t try to pack because the actual belt is a set of latches/regs that must be full width anyway.
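
    To make the packing trade-off concrete, here is a back-of-the-envelope sketch in C; the sizes are invented for illustration (a 2KB scratchpad and a 64-byte maximal operand) and are not an actual member configuration:

        #include <stdio.h>

        /* Illustrative sizes only -- not a real Mill member configuration. */
        #define SCRATCH_BYTES   2048  /* total scratchpad SRAM               */
        #define MAX_ENTRY_BYTES   64  /* a maximal vector operand + metadata */

        int main(void) {
            /* Entry addressing with the trivial mapping: every slot must be
               large enough for the biggest possible operand.                */
            int fixed_slots = SCRATCH_BYTES / MAX_ENTRY_BYTES;  /* 32 slots  */

            /* Byte addressing with packing: small scalars (say 8 bytes each)
               can be packed, so the same SRAM holds many more values, at the
               cost of a wider address field in the spill/fill encoding.     */
            int packed_scalars = SCRATCH_BYTES / 8;             /* 256 values */

            printf("fixed slots: %d, packed 8-byte scalars: %d\n",
                   fixed_slots, packed_scalars);
            /* Address entropy: 5 bits name 32 slots; 11 bits name 2048 bytes. */
            return 0;
        }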

    The choice for scratch implementation is left to the hardware guys, and might be different in different members due to hardware/cost/power considerations.

    • NoDot
      Participant
      Post count: 5

      Storing by entry number would require a mapping from entry number to position, or (trivial mapping) with all entries being maximal size.

      How odd. I would expect that, in practice, older belt items would leave the output latches for some space in the spiller – entries that haven’t fallen off yet and still need to be kept around, but have been around for a few cycles; or things spilled back in after some nested function calls. I expected that an entry-based Scratchpad could simply be an extension of those locations.

      But this is likely just gut-feeling not matching reality.

      • Ivan Godard
        Keymaster
        Post count: 518

        We can’t leave scratchpad-usage data in the spiller because the data is both spatially and temporally random access, while the spiller is at heart just a glorified stack. Items can be left in the scratchpad for arbitrarily long times without increasing the latency of later access, whereas items in the spiller eventually migrate to memory and get memory latency.

        Instead we want the scratchpad to have uniform latency and simple random access, without the expensive mux crossbar needed for spiller access even to limited depth. So really scratch acts like, and is mostly implemented like, a register file in a conventional machine. The differences include the metadata, the self-defining data widths, and the packing at byte granularity.
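
        As a toy model of that (not the real layout; the one-byte width tag below just stands in for whatever metadata the hardware actually keeps), a packed, byte-addressed scratchpad might look like this in C:

            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            #define SCRATCH_BYTES 512
            static uint8_t scratch[SCRATCH_BYTES];

            /* spill: store an operand at byte offset `pos`, prefixed by a width
               tag so the value is self-defining when it is read back.          */
            static void spill(uint32_t pos, const void *value, uint8_t width) {
                scratch[pos] = width;
                memcpy(&scratch[pos + 1], value, width);
            }

            /* fill: the stored tag says how wide the operand is. */
            static uint8_t fill(uint32_t pos, void *out) {
                uint8_t width = scratch[pos];
                memcpy(out, &scratch[pos + 1], width);
                return width;
            }

            int main(void) {
                uint64_t x = 42, y = 0;
                spill(0, &x, sizeof x);   /* referenced by start byte 0 */
                fill(0, &y);
                printf("%llu\n", (unsigned long long)y);
                return 0;
            }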

  • NoDot
    Participant
    Post count: 5

    (We’re slipping away from the topic, but it’s been answered, so…)

    I think this is a communication issue. I assume there simply aren’t enough output latches in a single ALU to hold a belt’s worth of outputs. Therefore either those results disappear before they would drop off the belt, or there’s a place they get sent to wait until the belt advances them away.

    (This would likely require a pathological or artificial case to cause, but I think it deserves an answer.)

    I would assume that such a place would be part of the Spiller system, its top layer perhaps. And that an entry-based Scratchpad would be a logical extension of such a structure.

    • Ivan Godard
      Keymaster
      Post count: 518

      It’s a lifetime issue. The scratchpad is not a simple extension of the belt, it’s a repository for values with long or indeterminate lifetimes. The spill op copies a value from the belt to scratch, and that same value may move into the spiller if there were a call while it’s still live. But a value computed before a loop and used after the loop (and maybe in the loop) has an unknown lifetime, so we need to save it for the duration of the loop. Mill execution makes values with great abandon, and we can’t save them all as if there were an infinite belt. So we need a way for the compiler to tell the hardware that a particular value is of continuing interest, and be able to request it again later. That’s the spill and fill ops.
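
      A concrete instance of that lifetime pattern, in ordinary C; the comments mark where a compiler for the Mill might plausibly place the spill and fill, and are illustrative rather than generated code:

          long scaled_sum(const long *a, int n) {
              long scale = a[0] * 3;   /* computed before the loop, needed after
                                          it: the loop will push this value off
                                          the belt, so spill it to scratch here */
              long sum = 0;
              for (int i = 0; i < n; i++)
                  sum += a[i];         /* the loop keeps making new belt values */

              return sum * scale;      /* fill brings `scale` back onto the belt
                                          for this final use                    */
          }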

      In contrast the spiller saves everything in-flight and on the belt, but that’s a tiny population compared to everything that has ever been on the belt, which is the potential population for the scratchpad. Different expected lifetimes, different reference patterns, different latency, complexity, and power constraints -> different mechanisms.

  • NoDot
    Participant
    Post count: 5

    You… aren’t answering the question I’m asking, but I think I see where you’re getting confused.

    I get that the Scratchpad is a different place than the belt. I do. It isn’t an extension of the belt, it’s a different physical location.

    I said “an entry-based Scratchpad would be a logical extension of such a structure”. Allow me to expand and unbox:

    I am only talking about storing and addressing an entry-based list of items: that the mechanism that stores and addresses those still-live-but-older values would be the same as, or similar to, the mechanism that stores and addresses the items in an entry-based Scratchpad.

    This entry-based Scratchpad would still be populated by items sent there specifically. It would just share part of its mechanism with where older-but-still-live belt items stay until falling off.

    And physically locating those places in the Spiller system is what I assumed. (In retrospect, the Scratchpad as a whole likely can’t be fit in the Spiller without pointlessly ballooning the size and scope of that subsystem.)

    edit: By “sharing the same mechanism” I meant they would share the same design, not the same physical one. That the design would already need to exist and giving the Scratchpad its own copy would be “simple.” Your first post says it’s not worth the effort, though.

    • Ivan Godard
      Keymaster
      Post count: 518

      I’ll try again. The belt uses temporal addressing; the scratchpad uses spatial addressing. There are two addresses involved in a spill/fill: the belt address of the value to be spilled somewhere, and the “somewhere” address needed to choose what value to fill. The present Mill uses temporal for the first, and like any reference the spill must execute before its target drops off the belt. If scratch were part of the spiller then fill would need an (arbitrarily large) address to look into the spiller history to find the value.

      You can’t use temporal addressing for long- or indefinite-lived values because the temporal address range is unbounded. Hardware doesn’t do unbounded. With spatial addressing the address range is bounded by the configured size of the scratchpad. Hardware does that, although the tool chain must deal with running out of the bounds.
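
      One way to picture the difference, as a toy C illustration rather than anything about the hardware: a temporal name keeps growing as later results drop, while a spatial name is a fixed offset into a bounded array.

          #include <stdio.h>

          int main(void) {
              /* Temporal: "how many drops ago" -- the same value gets a new
                 name every cycle, and the name is unbounded over time.       */
              unsigned belt_pos = 0;                 /* just dropped: b0      */
              for (int later_drops = 1; later_drops <= 5; later_drops++) {
                  belt_pos++;                        /* same value, new name  */
                  printf("after %d more drops it is b%u\n", later_drops, belt_pos);
              }

              /* Spatial: a byte offset fixed at spill time, bounded by the
                 configured scratchpad size no matter how long the value lives. */
              const unsigned scratch_offset = 16;
              printf("its scratch offset is still %u\n", scratch_offset);
              return 0;
          }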

      Perhaps you are thinking of a scheme whereby the spill op would push the value into a side stack and the fill would use a stack offset rather than a temporal reference to address it. That’s possible, but the stack management hardware is more than is needed for a simple regfile-like array of values. And, returning to the first question, one would need either maximal sized entries, or a map from entry number to packed byte offset, or make the stack byte addressable.

      I’m not saying that one couldn’t put the scratchpad in the belt space so that scratch entries could sit in the same latches as belt operands. But the addressing logic to get such a scratch entry back into the space where adds and such could address it is too expensive because it would push up the size of the crossbar. So we keep the address spaces separate.

  • NoDot
    Participant
    Post count: 5

    I’m sorry that I sound far more ignorant than I am. Apparently I’m bad at communication. Sorry.

    So this:

    The belt uses temporal addressing; the scratchpad uses spatial addressing. There are two addresses involved in a spill/fill: the belt address of the value to be spilled somewhere, and the “somewhere” address needed to choose what value to fill. The present Mill uses temporal for the first, and like any reference the spill must execute before its target drops off the belt.

    and this:

    But the addressing logic to get such a scratch entry back into the space where adds and such could address it is too expensive because it would push up the size of the crossbar. So we keep the address spaces separate.

    These are all things I know, have known, or figured were the case. I have never been confused on them, but it seems I sounded that way. Again, sorry.

    Perhaps you are thinking of a scheme whereby the spill op would push the value into a side stack and the fill would use a stack offset rather than a temporal reference to address it.

    While I have considered a stack or second belt for the Scratchpad (as I’m sure you have), no, I was thinking of a spatially-addressed, disjoint Scratchpad addressed by entry rather than one of those. I figured something similar was already present in the hardware for another purpose (holding old-but-still-present items on the belt).

    Therefore either the team never thought of it or there were complications that prevented its use. And if the latter, I wondered what those were. And I now have an answer to that.

    The only question left is whether this holding place for “out of the output latches but not yet fallen off” belt entries exists or not. And if so, whether its mechanism would actually be similar to what a spatially-addressed, entry-based Scratchpad would require for mapping entries to its storage.

    (I thought the Spiller would be a sensible place for this storage space, saving transfer time on function-call spill/fill. Hence our entire Spiller digression.)

  • Ivan Godard
    Keymaster
    Post count: 518

    The spiller holds ongoing program state as of a stall, mispredict, call or return event. That state includes much more than just the then-current belt. Belt values, in their latches, are moved lazily into the spiller and its internal SRAM and eventually migrate to the spillets in DRAM. However, these are relatively simple to handle. The most difficult part of the spiller deals with in-flights, which do not yet exist at event time, but will be produced eventually and must then be captured for subsequent replay. That requires temporal ordering information that is not an address, but may be thought of as a stream or pipe.

    So there is a part of the spiller that does indeed hold full operands (possibly compressed at hardware option), but this is not addressable in the sense that DRAM or scratchpad is. Instead the operands (not necessarily contiguous) are organized for ordered replay. As the “address” changes continuously during replay and the operands will have random and varying other state intermixed, it does not seem practical to try to use spiller hardware for the functionality that is the present scratchpad.

  • Thomas D
    Participant
    Post count: 8

    I’ve been plotting to write a belt virtual machine and thinking about its consequences. I think that for virtual machines, the stack machine will always rule due to ease of programming (that is, it is an easy target because the compiler doesn’t care how deep a computation goes; it just keeps dropping things on the stack and the machine takes care of it).

    The questions I have are probably proprietary, but here goes:
    How did you decide on a 32-entry belt (for Gold, and an 8-entry belt for Tin)?
    Why a scratchpad instead of a second (slower) belt?
    How was the size of the scratchpad decided on?

    I’ve been tossing around two alternative ideas to a scratchpad. One is to have four belts, with the opcode deciding which belt the result drops onto (and all four belts addressable for inputs). (That sounds horribly complicated for hardware, but easy for a VM.) The second is adding a second (and maybe third) belt that results are moved onto, forming a hierarchy of gradually slower belts. As you’ve probably thought of these ideas, what gotchas am I not seeing?

    • jessta
      Participant
      Post count: 1

      This isn’t an official answer, but I think at least some of your questions are answered in the Belt talk.

      * How did you decide on a 32-entry belt (for Gold, and an 8-entry belt for Tin)?
      The scratchpad has a three-cycle spill-to-fill latency: if you spill a value, you won’t be able to get it back for 3 cycles. Because of this, the length of the belt is set so that nearly everything lives for three cycles on the belt. So the length of the belt needs to be 3 times the number of results that can be produced by functional units in one instruction for that family member.

      * Why a scratchpad instead of a second (slower) belt?
      The belt is quite different from the scratchpad: the scratchpad only has two operations, “spill a value from the belt and put it in the scratchpad” (spill) and “take a value from the scratchpad and put it on the belt” (fill).

      * How was the size of the scratchpad decided on?
      The videos mention that the scratchpad is on-chip memory, which can be spilled out to the caches and eventually out to DRAM if necessary. The size of the scratchpad available to a given function is allocated by the specializer for that function; the function says how much scratchpad it needs up front. The size of the available on-chip memory is the same cost/speed trade-off you make when buying DRAM: buy as much as you can afford so that typical program use won’t need to swap memory out to disk.

      I’m not sure I understand specifically what you mean by a ‘slower belt’. The belt described in the videos is per function call. Every function call gets its own empty belt and scratchpad; its caller’s belt (and scratchpad) is still around, and the caller’s caller’s belt is also still around. You could say that the belts of callers are ‘slower belts’ than the belt of the current function call, as they don’t change or move until the current function call completes and returns its result.

  • Thomas D
    Participant
    Post count: 8

    The scratchpad has a three-cycle spill-to-fill latency: if you spill a value, you won’t be able to get it back for 3 cycles. Because of this, the length of the belt is set so that nearly everything lives for three cycles on the belt. So the length of the belt needs to be 3 times the number of results that can be produced by functional units in one instruction for that family member.

    That makes sense, but I can’t imagine that a Tin can only retire three values a cycle. Then again, maybe I just suck at understanding real hardware.
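
    Taking the 3x rule of thumb literally does give numbers in that range (a quick check using the belt lengths mentioned in this thread; the rule is only approximate and actual peak retire rates are member-specific):

        #include <stdio.h>

        int main(void) {
            /* belt length ~= spill-to-fill latency * peak results per instruction */
            int latency = 3;
            int tin_belt = 8, gold_belt = 32;   /* belt lengths from this thread */

            printf("Tin:  ~%d results per instruction\n", tin_belt / latency);  /* ~2-3 */
            printf("Gold: ~%d results per instruction\n", gold_belt / latency); /* ~10  */
            return 0;
        }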

    The belt is quite different from the scratchpad

    I’m not sure I understand specifically what you mean by a ‘slower belt’.

    If you think of the belt abstraction: you’ve got this conveyor belt that values go onto and you pull some off, operate on them, and put the result on the belt. The newest results go on the front of the belt and the oldest results fall off the back of the belt. Now, imagine two of these belts. A spill operation moves a value onto the slower belt. It is the only reason the slower belt moves. The fill operation takes a value off the slow belt and puts it back onto the fast belt. The ALU (etc) operates off the fast belt. Values cycle on that belt quickly: it is fast. The slow belt only changes when we need to rename something as being slow.
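
    For what it’s worth, the two-belt idea might be sketched like this in a toy C VM; the names and sizes are invented here, and nothing below describes how the Mill actually works:

        #include <stdint.h>

        #define FAST_LEN  8
        #define SLOW_LEN 32

        typedef struct { uint64_t slot[FAST_LEN]; unsigned head; } FastBelt;
        typedef struct { uint64_t slot[SLOW_LEN]; unsigned head; } SlowBelt;

        /* Every result drops onto the fast belt, so it advances on every
           operation and its oldest value falls off the back.               */
        void drop_fast(FastBelt *f, uint64_t v) {
            f->head = (f->head + 1) % FAST_LEN;
            f->slot[f->head] = v;
        }

        /* "spill": move a fast-belt value onto the slow belt, which only
           advances when something is explicitly spilled to it.             */
        void spill_slow(SlowBelt *s, const FastBelt *f, unsigned fast_pos) {
            s->head = (s->head + 1) % SLOW_LEN;
            s->slot[s->head] = f->slot[(f->head + FAST_LEN - fast_pos) % FAST_LEN];
        }

        /* "fill": copy a slow-belt value back onto the fast belt, where the
           ALUs address their operands.                                     */
        void fill_fast(FastBelt *f, const SlowBelt *s, unsigned slow_pos) {
            drop_fast(f, s->slot[(s->head + SLOW_LEN - slow_pos) % SLOW_LEN]);
        }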

    The only thing I see with this is that people will find pathological algorithms which require an insane amount of working set to run.

    The size of the available on-chip memory is the same cost/speed trade-off you make when buying DRAM.

    Tin has only 128 bytes of scratchpad, and Gold has 512. Why so small? I realize that the scratchpad isn’t expected to be used frequently. Then again, maybe the Tin should have more Scratchpad to make up for its lack of Belt.

    • Veedrac
      Participant
      Post count: 19

      I can’t imagine that a Tin can only retire three values a cycle

      According to the Wiki, Tin peaks at five: two constant loads (flow slots 0/1), one operation (exu slot 0), one condition code (exu slot 1), and a pick.
