There are essentially three layers in the spiller. At the top layer, data is stored in flipflops, effectively registers except not addressable by the program. These are buffers, used as parking places for values that are still live on the belt and sit in the result register of a functional unit when the FU has another result coming out and needs its result register back. These live operands are first daisy-chained up the latency line within the containing FU pipeline, but eventually the pipeline runs out and the operand, if still live, gets moved to a spiller buffer. The cost of the move is the same as a register-to-register copy on a conventional machine. A running program that doesn’t do a function call or return will live entirely in these registers, with no spiller traffic below that.
The second layer is a block of SRAM, connected to those spiller registers. When you do a call operation, we save the belt, which means the belt state in pipe and spiller registers is saved. The save is lazy, but gradually live-but-inactive operands are copied from the needed registers into the spiller SRAM. The SRAM is big enough to hold several belts (and scratchpads and other non-memory frame-related state), so you can nest several deep in calls with everything fitting in the spiller SRAM. Most programs spend most of their time within a frame working set of five or so, calling in a few, returning out a few, calling in a few, over and over. Such behavior fits entirely inside the spiller.
However, if a program suddenly switches to a deep run of calls, a run of returns, or if there’s a task switch, the state of all those nested calls (or the new thread) will exceed the spiller’s SRAM and the spiller then uses the third level, which is the regular memory hierarchy. The spiller does not go direct to DRAM; it talks to the L2 cache instead, which provides still more buffering.
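To make the overflow behavior concrete, here is a toy model of the lower two layers. It is only a sketch under stated assumptions: the `ToySpiller` class, the `run` helper, the capacity of five frames, and the whole-frame granularity are all hypothetical, chosen for illustration. The real spiller is lazy and works on individual operands, and the top layer of spiller registers is left out entirely.

```python
from collections import deque

class ToySpiller:
    """Toy model: saved frames live in SRAM and overflow to the memory hierarchy."""

    def __init__(self, sram_frames=5):
        self.sram_frames = sram_frames  # hypothetical SRAM capacity, in frames
        self.sram = deque()             # saved frames, newest on the right
        self.memory = []                # frames pushed out to L2 and beyond
        self.memory_traffic = 0         # frames moved to or from memory

    def call(self, frame):
        # Save the caller's state; if the SRAM is full, evict the oldest frame.
        self.sram.append(frame)
        if len(self.sram) > self.sram_frames:
            self.memory.append(self.sram.popleft())
            self.memory_traffic += 1

    def ret(self):
        # Restore the newest saved frame, refilling from memory if SRAM is empty.
        if not self.sram:
            self.sram.append(self.memory.pop())
            self.memory_traffic += 1
        return self.sram.pop()

def run(depth, repeats, sram_frames=5):
    """Call `depth` deep and return all the way out, `repeats` times."""
    s = ToySpiller(sram_frames)
    for _ in range(repeats):
        for d in range(depth):
            s.call(("frame", d))
        for d in reversed(range(depth)):
            assert s.ret() == ("frame", d)  # frames come back in LIFO order
    return s.memory_traffic

print(run(3, 100))  # 0: a shallow call/return pattern never leaves the SRAM
print(run(8, 1))    # 6: three frames evicted on the way in, three refilled on the way out
```

In this model the common in-a-few, out-a-few pattern generates no traffic below the SRAM at all, and only the run of eight nested calls touches the memory hierarchy.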
If you compare the spiller with the explicit state save used by a conventional legacy register machine, you will see that the spiller top level is akin to the register machine’s rename and architected registers; if a function fits in the registers then there’s no traffic, just as if it fits in the belt and scratchpad there’s no traffic.
If there are nested calls on a conventional then the register state gets saved to the hierarchy using normal store operations. These go to the D$1 cache and are buffered there. This is akin to the spiller’s SRAM, but the spiller has three advantages: it uses no program code to do the save, so the power, instruction entropy, and store buffer contention cost of the stores is avoided; it uses a private repository, so saves are not cluttering the D$1 (which is a very scarce resource); and it’s not program visible, removing many possibilities of program bugs and exploits.
Lastly, if you have deeply nested calls on a conventional, the saved state exceeds the capacity of the D$1 and will overflow into the D$2 and eventually DRAM. The spiller does the same when it runs out of private SRAM. Put all this together, and you see that the spiller is in effect a bunch of registers hooked to its own cache, and the overall benefit is to shift state save/restore out of the top level D$ cache and into the spiller SRAM which is in effect a private spiller cache, freeing up space for real data.
One last point: the total Mill state traffic is less than that on a conventional. A conventional callee-save protocol winds up saving registers that are in fact dead, but the callee doesn’t know that and saves them anyway. And the existence of the scratchpad on a Mill means that many function locals that would be kept in memory are in the scratchpad and so do not contribute to cache load and memory bandwidth. Combine these effects, and our sims suggest that the Mill save/restore and locals traffic is about a factor of two less than that of a conventional. This saves not only bandwidth but also power.
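The dead-register point can be shown with a small count. This is an illustration only: the `callee_save_traffic` function and the register sets are made up for the example (the names are borrowed from x86-64 purely for familiarity), and the numbers do not reproduce the sim results mentioned above.

```python
def callee_save_traffic(callee_clobbers, caller_live):
    """Return (saved, needed) register counts for one call under callee-save.

    callee_clobbers: callee-saved registers the callee is going to overwrite.
    caller_live:     registers that actually hold live values in the caller.
    """
    # The callee cannot see the caller's liveness, so it saves every
    # callee-saved register it intends to overwrite.
    saved = len(callee_clobbers)
    # Only registers that are both overwritten and live really needed saving.
    needed = len(callee_clobbers & caller_live)
    return saved, needed

saved, needed = callee_save_traffic(
    callee_clobbers={"r12", "r13", "r14", "r15"},  # hypothetical callee
    caller_live={"r12", "rbx"},                    # hypothetical caller state
)
print(saved, needed)  # 4 1
```

Here the callee stores four registers although only one held a live value. A scheme that saves only live operands, as described for the spiller, would move just that one.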
We do not have large-scale sims yet so the overall results are guesstimates, but it does appear that actual DRAM traffic on a Mill will be overwhelmingly composed of I/O and very large external data sets, which have the same traffic load as on a conventional; save/restore, locals, and the working set of globals will never see DRAM. That’s why the Mill has the Virtual Zero that lets it use “memory” that has no backing DRAM.
All this needs pictures, and we will get to the spiller in the talks eventually.