Mill Computing, Inc. Forums The Mill Architecture spiller work optimisation

Tagged: ,

  • Author
    Posts
  • mhkool
    Participant
    Post count: 7
    #2111 |

    Since function calls and loops happen often, the spiller has a lot of work moving belt values around.
    The work of the spiller can be reduced if the CPU discards belt positions that will not be reused, so that the spiller does not have to move/copy them.

    Considering that 80% of the belt values are used once, the instructions are very wide and that there are likely slots with noop instructions, is it worth the effort to introduce a forget <belt-position> instruction which can replace noops? So after a forget b3, the spiller does not look at b3 when a new call or inner is executed. For long belts and many unused belt positions, an alternative might be sweep <belt-position> which forgets all belt positions from <belt-position> to the end of the belt.
    An alternative to the forget instruction may be that each instruction gets an operand flag saying “this is the last use of the operand” doing an implicit “forget”. So addusfrgt1 b2,b4 does the usual addus and also a forget b4.

    One can go one step further and remove belt positions with a remove <beltposition> instruction. So a remove b3 literally removes a belt position and the last item on the belt becomes a None (only in the unlikely event that there are no other ops that put values on the belt). The remove instruction has the additional effect that some values can stay on the belt longer before they fall off the belt and may reduce the use of the scratchpad. I know, maybe I am asking too much and the “remove” instruction may be too diffcult to implement.

    • This topic was modified 8 years, 10 months ago by  mhkool.
  • Ivan Godard
    Keymaster
    Post count: 689

    Recall that there is no physical belt as such; the belt is a semantic part of the programmer’s model, and family members are free to implement anything that provides that model. Also recall that the physical realization of the spiller has essentially three different parts: one that is tightly integrated with the functional units and deals with stalls; an intermediate layer that is essentially an SRAM buffer pool; and a back end that connects the buffers to the memory hierarchy and eventually DRAM for when spiller traffic exceeds the buffer capacity in one direction or the other. Each of these three parts has its own job and organization.

    The back end is simplest: it collects pool entries into whole cache lines and injects them into the hierarchy, or vice versa. It has no knowledge of what the entries mean, only that entries must be moved to keep the pool roughly half full.

    The front end, connected with the FUs, has the hardest job because timing and power are at their most critical. The problem is stalls. A modern CPU cannot stop on a dime; it can take several cycles to get a stall signal from one side of the core to the other – across just the core, mind, not across the whole chip; that takes much longer still. Consequently, when we issue an operation to an FU there will be a result coming out X cycles later, whether the core is stalled or not. An maybe a couple more after that, that were in flight inside the FU. The Mill uses result replay after a stall, not the reissue replay that is commonly used, so those results have to be caught and made available when the core starts up again. The spiller front end does that catch and replay. Each of those results winds up as an operand entry, initially in the FU output latch but soon (if it stays live) moved to one of the entries in the spiller in front of the buffer pool that the replay mechanism has access to.

    But moving out of a latch happens only if the operand is still live. There are several different ways to handle the mapping from logical belt numbers to physical operand numbers/locations; each has its own advantages. However, all have some mechanism by which an operand can tell if it is still live. Note that there may be operands in a latch or buffer that are not yet “on the belt”, having been produced by an FU that is working through a stall. Liveness is updated every cycle, and loss of liveness is what “falling off the belt” means in the hardware.’

    A call leaves the caller’s live operands still live, and the execution of the callee creates new live operands, and promptly the hardware runs out of latches/skid buffers for them all, so operands belonging to frames other than the issue frame are moved into the buffer pool in the middle end of the spiller. Notice that this is not just a matter of taking all the operands that exist at the point of call and moving theme; it is possible for an FU to produce a result operand that does not belong to the current issue frame. For example, if I issue a long-running operation like a FP multiply and then immediately do a call, the program sees the multiply result as coming out after the call returns, but in reality it comes out as soon as the multiplier is done, in the middle of the callee even though the operand belongs to the caller. Likewise, if you start a multiply and then do a return op, there is no way to stop that multiply and you will get a result, which belongs to a callee that no longer exists and is dead on creation.

    So much for background. You ask whether it could be of benefit to know that an operand, still on the belt, will not be referenced in the future; this could be indicated in the machine code in several ways, including those you suggest. However, all the ways refer to the operand by its logical belt number. So the first task would be to find the named operand and tell it that it is no longer live. The hardware that maps logical belt numbers to physical positions exists, and there’s a way to indicate that an arbitrary existing location is no longer live (used by return and branches to renumber the belt). So the hardware to do what you suggest exists. Would another way to use it pay?

    The benefit boils down to the cost of having a redundant operand around, and how long before the operand would die anyway. The cost is pressure on the front end storage capacity and the buffer pool; by the time we get to the back end traffic has smoothed out enough that a few extra pool entries are largely irrelevant. The lifetime ends when there are enough later drops to push the redundant operand off the belt, or we transfer through a branch which discards the entry early. In general a redundant operand will only make it to the middle end pool if there is a call (or interrupt), so there is a saving in pool pressure iff the operand goes dead before a call without an intervening branch. That happens of course, but its frequency is heavily dependent on how call-heavy the code is.

    Pressure in the front end is potentially a larger issue, but how big an issue will depend on the details of the hardware implementation, which will be member dependent across the line and will no doubt evolve with time. The hardware is not up enough to get a good sense of whether front-end pressure will a constraining factor for some, or any, Mills. The software sim is no use for this; we’d need Verilog sim of the hardware to know, and that’s not there yet.

    What would the cost be? Primarily entropy in the encoding. If we take the common two-argument operations and add a “last use” bit, that would add two bits to each operation’s encoding. If we take the marginal cost of an exu op as 15 bits then the feature would increase code size by 13%, with impact in the caches and memory traffic. On the flow side, adding a last-use bit to things like function argument lists would cost one bit per morsel; with a typical 4-5 bit morsel size then the cost is over 20%.

    The alternative would be to use an op that specifies a live or dead list. There is such an op, called rescue, already; it is intended to keep live operands from having to be spilled because they are about to be pushed off the belt, but could be used for a kill operation too. It’s a flow op, and I’d guess might have a typical entropy cost around 30 bits or so. In its intended use it saves at least one spill/fill pair, which together will have roughly the same entropy, while avoiding the scratchpad pressure and latency that a spill involves.

    However, if there is nothing to be rescued and rescue is just used as a operand killer then the cost is those 30 bits and the flow slot that the op would occupy, against the lowered spiller pressure. This is certainly not an obvious win, so we have deferred any decision until we can see how big an issue spiller pressure turns out to be in hardware practice.

    I suspect this is a bit more of an answer than you expected 🙂

    • mhkool
      Participant
      Post count: 7

      I understand that the Mill has a notion of a value in altch being live, i.e. “on the belt”. I like to call this “hardware live”. There is also a “software live” notion: the compiler knows for example that after a sequence of instructions, belt positions b10, b11, b14 and b15 are dead. I understand that if the next instruction contains a call or inner, the spiller will save these four and a number of other latch values in its buffers and that the assumption is that this will not put too much pressure on the spiller’s performance (thanks for the extra explanation about the skid buffers).

      There is not much information in the talks about what exactly happens on a retn. I assume that the spiller puts all hardware live values of the “frame that will become current” back in its buffers. But since the spillers buffer on a Gold CPU has only room for 16 values, it seems that having 2 nested loops could easily make the CPU stall because the spiller is getting values back from L2$. Question is also how many belt positions are hardware live and hence saved/restored. Based on the statistic that belt values are used only once in 80% of the cases, I speculate that the number of hardware live values can be on average two times the number of software live values, which implies that the number of values to restore could be reduced significantly.

      Simply stalling on a retn due to putting many values back from L2$ into the spiller buffers seems too naive, so I assume that a piece of the spiller puzzle is missing which explains that performance is okay here. Ivan, can you explain?

      • This reply was modified 8 years, 7 months ago by  mhkool. Reason: fix grammar
      • Ivan Godard
        Keymaster
        Post count: 689

        The expectation is that for “reasonable” nesting depths the spiller will keep spilled operands in its internal SRAM space, so spills will never go into the memory hierarchy. The spiller does use high-water marks to anticipate the need to restore values from the hierarchy, but an abrupt sequence of a lot of function returns can overwhelm that prefetching and produce a stall, just as an abrupt sequence of calls can overwhelm the spiller’s bandwidth flushing operands to the hierarchy.

        The software does have the ability to discard values from the belt; every branch transfer can do so. I agree that it is tempting to add that capability to the call operation, but we have not done so because of problems with encoding such an operation. The call can specify a belt-full of arguments, and having it also specify a potential belt-full of preserved operands would overwhelm the space available in the encoding.

        There are alternative encodings however, but these would significantly complicate the decoders. So we are leaving things as they are, until we have enough gate-level sim of large programs to see if there really is a problem with spiller bandwidth or to size of spiller SRAM.

        • PeterH
          Participant
          Post count: 41

          Sounds like a fast deep recursion, like the classic Fibonacci algorithm, would stress the Mill. Then again, I’m not sure classic superscalar OOO processors would do better.

          • Ivan Godard
            Keymaster
            Post count: 689

            Things like Fibonacci are simple enough that the compilers recognize tail recursion and turn them into loops on any machine – try it and see. What causes trouble is two-way recursion, such as depth-first search in binary trees; tail conversion gets rid of one of the calls, but the second remains. Research compilers replace the recursion with a parallel state stack, but I don’t know of any commercial compilers that do, and it’s not clear that the state stack is any better than the regular stack.

            Whether the Mill is better or worse than other architectures depends on the amount of state saved in a recursive call and what the save bandwidth is. On a genReg machine the state will be at least the PC and frame registers plus any live data state. The same is true on a Mill, plus any additional belt positions that are zombie; the Mill situation is equivalent to a callee-saves protocol in a genreg machine.

            The number of zombies depends on the amount of work between the calls. The Mill call instruction marks all belt positions not occupied by arguments as invalid, so they won’t zombie in an immediately following call. The same occurs with branches. As other operations execute the invalids get pushed along the belt and eventually fall off, so the belt becomes populated with operands that might be zombie. Hence zombie spill increases with distance from the last control-flow transfer, either branch or the entry of the called function.

            While it is easy to invent cases in which the whole belt is zombie, in practice recursive calls usually seem to be quite close to other control flow and the actual zombie count is ignorable.

          • mhkool
            Participant
            Post count: 7

            I did not have a fast deep recursion in mind. I raised the question about the 16 buffers that the Mill Silver and Gold have. Which seems a small number that may not be sufficient to cope with a small number of loops (using inner) and calls.

            Ivan told that the spiller has SRAM that is used between the buffers and the L2$ so the Mill works fine if the SRAM is sized properly and has sufficient bandwidth. There will be less pressure on the buffers and SRAM of the spiller if (software) dead operands are declared dead for the hardware (for example using the rescue or suggested forget operation), but it is not yet known if this is necessary. I think it is wise to look at gate-level sim runs to decide if something must be done or not.

            If it would pay off to implement a forget operation and if the encoding bits may be an issue, one may use a forgeth1 for the first half of the belt positions and a forgeth2 for the second half of the belt positions to reduce the number of encoding bits. A forget operation does not have to be executed in the instruction immediately before a call or inner, it can be executed earlier: when an operand becomes dead.

          • Ivan Godard
            Keymaster
            Post count: 689

            Yet another possibility is a call variant that implicitly discards all the belt. If there were any belt operands actually live over the call then they would have to be explicitly spilled. While loops generally have loop-carried live operands, it’s fairly common in open code for a call to consume everything live. Worth doing? No idea; measurement needed.

  • mhkool
    Participant
    Post count: 7

    Thanks for the reply, yes longer than I expected but useful – maybe I had to say that I saw all the videos to make it a little shorter.
    Yes, the rescue instruction is a good killer.

    I have the impression that the spiller with 16 buffers on a Gold with 32 belt positions suffers from a lot of pressure, because also the inner instruction produces “a new belt” and causes work for the spiller. Considering that 80% of belt values are used immediately and only once, it seems that there are a lot of belt positions with values that can be killed to prevent work for the spiller. Oh well, you are right and we have to wait for the hardware to see if the pressure on the spiller is an issue or not. My patience is just a fraction larger than yours 🙂

    • PeterH
      Participant
      Post count: 41

      When a subroutine call is made it shouldn’t be necessary for the spiller to save the state of the entirety of the caller’s belt at once. Give that functional units are tagged with both frame and belt position, I don’t expect the belt would need to be spilled any faster than new values are being produced and old values would drop off the end of the belt had there not been a subroutine call. If the subroutine is short, or uses a different subset of available functional units, it’s possible much of the caller’s belt would not need spilling.

      • Ivan Godard
        Keymaster
        Post count: 689

        Exactly. However, while we describe the belt and the spiller as two different physical units for clarity, in reality they are just two different ways of looking at the same collection of hardware. The rarely-mentioned guts is actually the skid buffers and their replay mechanism. It’s not possible to stall a function unit in mid-pipe, so either you throw away what it does and re-issue, or you catch what it does and replay the result, in effect re-retire.

        Replay is hard if an operation can fault or if there are ordering hazards. In a conventional this is handled by OOO renaming, at some cost, but without OOO the hazards pretty much dictate use of re-issue, and even OOO uses re-issue, I suppose as much out of tradition as anything. The Mill’s NaRs and SSA belt eliminates the fault and hazard issue and we have static ordering, so we can use replay. Replay is much cheaper than re-issue, especially with long pipes. However, replay means we have to catch results produced during stall, and for that we need skid buffers, and the ability to feed buffer contents to function units as we come back up from stalls.

        Given that machinery, it’s only a small step to providing the paths by which the skid buffer contents can be moved to SRAM, in addition to the paths to the FUs. That’s the top of the spiller. The bottom of the spiller, which moves content from SRAM to the memory hierarchy, is largely independent.

        Skid buffer replay must work at core-clock and full machine width, which is a huge amount of data with very tight timing. Consequently it doesn’t have time to decide if something is really needed as it comes out the back of the FU; a buffer has to catch it whether it is there or not. This leads to holes in the buffers of course. Those holes get squeezed out later in the bucket-brigade that leads to eventually to DRAM. Where “later” is is up to the implementation.

You must be logged in to reply to this topic.