Reply To: spiller work optimisation

Ivan Godard
Keymaster

Recall that there is no physical belt as such; the belt is a semantic part of the programmer’s model, and family members are free to implement anything that provides that model. Also recall that the physical realization of the spiller has essentially three different parts: one that is tightly integrated with the functional units and deals with stalls; an intermediate layer that is essentially an SRAM buffer pool; and a back end that connects the buffers to the memory hierarchy and eventually DRAM for when spiller traffic exceeds the buffer capacity in one direction or the other. Each of these three parts has its own job and organization.

The back end is simplest: it collects pool entries into whole cache lines and injects them into the hierarchy, or vice versa. It has no knowledge of what the entries mean, only that entries must be moved to keep the pool roughly half full.
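
For concreteness, here is a minimal C sketch of that rebalancing policy. Every name and size in it (the pool capacity, the line granularity, the transfer functions) is invented for illustration; the real back end is hardware, not software.

```c
#define POOL_CAPACITY 256   /* buffer-pool entries (made-up size) */
#define LINE_ENTRIES  8     /* pool entries per cache line (made-up) */

extern int  pool_count;     /* current pool occupancy */
extern void push_line(void); /* evict LINE_ENTRIES into the hierarchy; decrements pool_count */
extern void pull_line(void); /* refill LINE_ENTRIES from the hierarchy; increments pool_count */

/* The back end knows nothing about what the entries mean; it only moves
   whole cache lines to keep the pool roughly half full. */
void backend_rebalance(void) {
    while (pool_count > POOL_CAPACITY / 2 + LINE_ENTRIES)
        push_line();
    while (pool_count < POOL_CAPACITY / 2 - LINE_ENTRIES)
        pull_line();
}
```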

The front end, connected with the FUs, has the hardest job because timing and power are at their most critical. The problem is stalls. A modern CPU cannot stop on a dime; it can take several cycles to get a stall signal from one side of the core to the other – across just the core, mind, not across the whole chip; that takes much longer still. Consequently, when we issue an operation to an FU there will be a result coming out X cycles later, whether the core is stalled or not. And maybe a couple more after that, which were in flight inside the FU. The Mill uses result replay after a stall, not the reissue replay that is commonly used, so those results have to be caught and made available when the core starts up again. The spiller front end does that catch and replay. Each of those results winds up as an operand entry, initially in the FU output latch but soon (if it stays live) moved to one of the spiller entries in front of the buffer pool that the replay mechanism has access to.
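
As a rough software analogue of that catch-and-replay, consider the sketch below. The skid depth and structures are assumptions; the point is only that results keep arriving for a few cycles after the stall signal and are later replayed, while the operations that produced them are never reissued.

```c
#include <assert.h>

typedef struct { unsigned frame; unsigned data; } result_t;

#define SKID_DEPTH 4                /* in-flight results that can emerge during a stall (assumed) */
static result_t skid[SKID_DEPTH];   /* skid buffers in front of the pool */
static int skid_count;

/* Results keep popping out of the FU pipelines for a few cycles after the
   stall signal arrives; the front end must catch every one of them. */
void catch_result(result_t r) {
    assert(skid_count < SKID_DEPTH);
    skid[skid_count++] = r;
}

/* When the stall clears, the caught results are replayed to the consumers;
   this is result replay, not reissue replay. */
void replay_results(void (*deliver)(result_t)) {
    for (int i = 0; i < skid_count; i++)
        deliver(skid[i]);
    skid_count = 0;
}
```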

But moving out of a latch happens only if the operand is still live. There are several different ways to handle the mapping from logical belt numbers to physical operand numbers/locations; each has its own advantages. However, all have some mechanism by which an operand can tell if it is still live. Note that there may be operands in a latch or buffer that are not yet “on the belt”, having been produced by an FU that is working through a stall. Liveness is updated every cycle, and loss of liveness is what “falling off the belt” means in the hardware.
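
A toy model of one possible mapping scheme may make “falling off the belt” concrete. All sizes and structures here are assumptions, and real family members may do this quite differently.

```c
#define BELT_LEN 16   /* logical belt positions (assumed) */
#define PHYS_OPS 64   /* physical operand slots (assumed) */

static int belt_map[BELT_LEN];  /* logical position -> physical slot, -1 if empty */
static int live[PHYS_OPS];      /* per-slot liveness, updated every cycle */

void belt_init(void) {
    for (int i = 0; i < BELT_LEN; i++)
        belt_map[i] = -1;
}

/* Dropping a new result shifts the logical numbering; whatever is pushed
   past the last position loses liveness -- that loss is the hardware
   meaning of "falling off the belt". */
void drop(int phys_slot) {
    int evicted = belt_map[BELT_LEN - 1];
    if (evicted >= 0)
        live[evicted] = 0;            /* fell off: no longer live */
    for (int i = BELT_LEN - 1; i > 0; i--)
        belt_map[i] = belt_map[i - 1];
    belt_map[0] = phys_slot;
    live[phys_slot] = 1;
}
```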

A call leaves the caller’s live operands still live, and the execution of the callee creates new live operands; the hardware promptly runs out of latches/skid buffers for them all, so operands belonging to frames other than the issue frame are moved into the buffer pool in the middle end of the spiller. Notice that this is not just a matter of taking all the operands that exist at the point of call and moving them; it is possible for an FU to produce a result operand that does not belong to the current issue frame. For example, if I issue a long-running operation like an FP multiply and then immediately do a call, the program sees the multiply result as coming out after the call returns, but in reality it comes out as soon as the multiplier is done, in the middle of the callee, even though the operand belongs to the caller. Likewise, if you start a multiply and then do a return op, there is no way to stop that multiply and you will get a result, which belongs to a callee that no longer exists and is dead on creation.
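
One way to picture this: every in-flight result carries the frame it belongs to, and retirement routes or discards it by that tag regardless of which frame is currently issuing. The sketch below invents all of its names; it is not the actual hardware interface.

```c
typedef struct { unsigned frame; unsigned data; } result_t;

extern unsigned current_frame;                 /* the frame now issuing */
extern int      frame_exists(unsigned frame);  /* has this frame returned yet? */
extern void     keep_in_front_end(result_t r); /* latches/skid buffers */
extern void     stash_in_pool(result_t r);     /* middle-end buffer pool */

void retire_result(result_t r) {
    if (!frame_exists(r.frame)) {
        /* A multiply issued just before a return: the producing frame is
           gone, so the result is dead on creation and simply dropped. */
        return;
    }
    if (r.frame != current_frame)
        stash_in_pool(r);   /* e.g. the caller's multiply completing mid-callee */
    else
        keep_in_front_end(r);
}
```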

So much for background. You ask whether it could be of benefit to know that an operand, still on the belt, will not be referenced in the future; this could be indicated in the machine code in several ways, including those you suggest. However, all the ways refer to the operand by its logical belt number. So the first task would be to find the named operand and tell it that it is no longer live. The hardware that maps logical belt numbers to physical positions exists, and there’s a way to indicate that an arbitrary existing location is no longer live (used by return and branches to renumber the belt). So the hardware to do what you suggest exists. Would another way to use it pay?
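
In terms of the toy mapping above, the suggested kill would be one lookup and a cleared bit; the renumbering machinery that return and branch already use does essentially this.

```c
extern int belt_map[];  /* logical belt position -> physical slot (toy model above) */
extern int live[];      /* per-slot liveness */

/* Hypothetical early-kill: find the operand named by its logical belt
   number and mark it dead, just as belt renumbering already can. */
void kill_belt_operand(int belt_pos) {
    int slot = belt_map[belt_pos];
    if (slot >= 0)
        live[slot] = 0;
}
```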

The benefit boils down to the cost of having a redundant operand around, and how long before the operand would die anyway. The cost is pressure on the front end storage capacity and the buffer pool; by the time we get to the back end traffic has smoothed out enough that a few extra pool entries are largely irrelevant. The lifetime ends when there are enough later drops to push the redundant operand off the belt, or we transfer through a branch which discards the entry early. In general a redundant operand will only make it to the middle end pool if there is a call (or interrupt), so there is a saving in pool pressure iff the operand goes dead before a call without an intervening branch. That happens of course, but its frequency is heavily dependent on how call-heavy the code is.
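
Stated as a predicate over hypothetical cycle numbers, the pool-pressure saving looks like this:

```c
/* Early-kill saves middle-end pool pressure only if the operand dies
   before the next call (or interrupt), and no branch intervenes between
   the death and the call to discard the entry anyway. Cycle numbers are
   purely illustrative. */
int kill_saves_pool_pressure(int death_cycle, int next_call_cycle,
                             int next_branch_cycle) {
    return death_cycle < next_call_cycle          /* dead before the call... */
        && next_branch_cycle > next_call_cycle;   /* ...with no branch in between */
}
```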

Pressure in the front end is potentially a larger issue, but how big an issue will depend on the details of the hardware implementation, which will be member-dependent across the line and will no doubt evolve with time. The hardware is not far enough along to give a good sense of whether front-end pressure will be a constraining factor for some, or any, Mills. The software sim is no use for this; we’d need Verilog sim of the hardware to know, and that’s not there yet.

What would the cost be? Primarily entropy in the encoding. If we take the common two-argument operations and add a “last use” bit per argument, that would add two bits to each operation’s encoding. If we take the marginal cost of an exu op as 15 bits, then the feature would increase code size by 13%, with impact in the caches and memory traffic. On the flow side, adding a last-use bit to things like function argument lists would cost one bit per morsel; with a typical 4–5 bit morsel size the cost is over 20%.
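
The percentages fall straight out of the arithmetic; a trivial check:

```c
#include <stdio.h>

/* The arithmetic behind the estimates above: two last-use bits on a
   15-bit exu op, and one extra bit per 4-5 bit flow morsel. */
int main(void) {
    printf("exu:  +%.0f%%\n", 100.0 * 2 / 15);            /* ~13% */
    printf("flow: +%.0f%% to +%.0f%%\n",
           100.0 * 1 / 5, 100.0 * 1 / 4);                 /* 20% to 25% */
    return 0;
}
```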

The alternative would be to use an op that specifies a live or dead list. There is such an op, called rescue, already; it is intended to keep live operands from having to be spilled because they are about to be pushed off the belt, but it could be used as a kill operation too. It’s a flow op, and I’d guess it might have a typical entropy cost of around 30 bits or so. In its intended use it saves at least one spill/fill pair, which together have roughly the same entropy, while avoiding the scratchpad pressure and latency that a spill involves.
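
Semantically, rescue can be pictured as renumbering the belt down to an explicit keep-list, with everything not named going dead; an empty keep-list would make it a pure killer. Again a toy sketch over the assumed structures above, not the real mechanism:

```c
#define BELT_LEN 16
extern int belt_map[];  /* logical position -> physical slot, -1 if empty */
extern int live[];      /* per-slot liveness */

/* rescue, modeled: the named positions become the new front of the belt;
   everything not named loses liveness. */
void rescue(const int *keep, int nkeep) {
    int new_map[BELT_LEN];
    for (int i = 0; i < BELT_LEN; i++)
        new_map[i] = (i < nkeep) ? belt_map[keep[i]] : -1;

    for (int i = 0; i < BELT_LEN; i++) {
        int slot = belt_map[i];
        int kept = 0;
        for (int j = 0; j < nkeep; j++)
            kept |= (slot >= 0 && new_map[j] == slot);
        if (slot >= 0 && !kept)
            live[slot] = 0;   /* not rescued: goes dead */
    }
    for (int i = 0; i < BELT_LEN; i++)
        belt_map[i] = new_map[i];
}
```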

However, if there is nothing to be rescued and rescue is just used as an operand killer, then the cost is those 30 bits and the flow slot that the op would occupy, set against the lowered spiller pressure. This is certainly not an obvious win, so we have deferred any decision until we can see how big an issue spiller pressure turns out to be in hardware practice.

I suspect this is a bit more of an answer than you expected 🙂