The spiller is lazy and, being a hunk of hardware, works at buffer granularity rather than frame (or other language-construct) granularity. For a horrible example of the difference, consider a hypothetical Mill with a 14-cycle hardware divide operation, whose first instruction contains:
That is, it starts a divide and recursively calls itself conditionally.
By the time the very first divide result comes out of the function unit, there are already thirteen other divides in flight in the FU, and we are 14 deep in nested function calls. The spiller is going to spill that result (the FU output latch will be getting a new value next cycle), but the frame (and belt) that the value belongs to is far away, and the retires for that frame may be temporally interleaved with the retires of many other frames. Remember: a call on the Mill looks (on the belt) like any other operation, like an add for example, and has (or rather appears to have) zero latency.
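The arithmetic above can be checked with a toy cycle counter. This is only a sketch under simplifying assumptions that are mine, not a model of real Mill hardware: one instruction issues per cycle, the call has zero latency so the callee's first instruction issues on the next cycle, and a divide issued at cycle c retires at cycle c + 14. The function name and the dictionary keys are hypothetical.

```python
from collections import deque

DIV_LATENCY = 14  # cycles, taken from the hypothetical Mill in the text


def state_at_first_retire(latency=DIV_LATENCY):
    """Issue one divide per cycle, each in a new frame, until the first divide retires."""
    in_flight = deque()  # entries are (owning_frame_depth, retire_cycle)
    depth = 0            # current call nesting depth
    cycle = 0
    while True:
        # Does the oldest divide come out of the function unit this cycle?
        if in_flight and in_flight[0][1] == cycle:
            owner, _ = in_flight.popleft()
            return {
                "cycle": cycle,
                "owner_frame": owner,            # frame the result belongs to
                "current_depth": depth,          # frame executing right now
                "other_divides_in_flight": len(in_flight),
            }
        # Otherwise the instruction runs again one call deeper (assumed
        # zero-latency call) and starts another divide.
        depth += 1
        in_flight.append((depth, cycle + latency))
        cycle += 1


print(state_at_first_retire())
```

Under these assumptions the first result belongs to frame 1 while execution is in frame 14, with thirteen other divides still in the pipeline.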
Consequently, as in most things, spill/fill on the Mill doesn’t work like a conventional machine. We don’t spill frames, we spill values, and only when we need the space for something else. And this is true between the belt latches and the spiller buffers, between the buffers and the SRAM, and between the SRAM and the memory hierarchy.
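To make "we spill values, and only when we need the space" concrete, here is a minimal sketch of a chain of storage tiers in which a value moves down one level only when the level above it is full. The `Tier` class, the capacities, and the oldest-first eviction choice are all assumptions made for illustration; the text does not say how the real spiller picks a victim or how large its buffers are.

```python
from collections import deque


class Tier:
    """One level of a hypothetical spill hierarchy (buffers, SRAM, memory)."""

    def __init__(self, name, capacity, next_tier=None):
        self.name = name
        self.capacity = capacity    # None means effectively unbounded
        self.next_tier = next_tier  # where evicted values go
        self.values = deque()
        self.evictions = 0

    def push(self, value):
        # Lazy: nothing moves down unless this tier has no room left.
        if self.capacity is not None and len(self.values) >= self.capacity:
            oldest = self.values.popleft()  # assumed oldest-first eviction
            self.evictions += 1
            self.next_tier.push(oldest)
        self.values.append(value)


# Assumed sizes, chosen only to keep the example small.
memory = Tier("memory", None)
sram = Tier("sram", 4, memory)
buffers = Tier("buffers", 2, sram)

for v in range(5):  # five individual values, not a frame
    buffers.push(v)

print(list(buffers.values), list(sram.values), list(memory.values))
```

With these sizes, five values fill the buffers and part of the SRAM, and nothing reaches memory because nothing needed the SRAM's space yet.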
Also, “frame” is an overloaded term, meaning on the one hand the stack-allocated locals of a function activation, and on the other hand the activation itself; this can be confusing. The spiller has nothing to do with the former sense; the program-declared local data is explicitly written to memory by the program, just as on a conventional machine. The spiller is concerned with internal and transient state. On a conventional machine this too is explicitly written to memory by compiler-inserted preamble/postamble code, or equivalent asm code for those writing in assembler. Not so on a Mill; the internal state is in general not explicitly accessible by the program, and save/restore is done by the spiller.
Consequently, spiller performance is dissociated from programming language constructs, and is constrained only by bandwidth. Certain program actions, if sustained for long enough, can generate internal and transient values at a rate large enough to overwhelm the spiller bandwidth capacity; you will stall. Granted, you will have to work to make it happen, but you can do it.
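The bandwidth constraint can be illustrated with a toy occupancy model. Everything here is assumed for the sake of the example: a single buffer of fixed capacity, a fixed number of values drained per cycle, and a producer that either deposits all of its values in a cycle or stalls for that cycle. The function name and parameters are hypothetical and say nothing about actual Mill spiller sizes or rates.

```python
def count_stall_cycles(produced_per_cycle, bandwidth, capacity, cycles):
    """Count cycles in which the producer must wait for buffer space."""
    occupancy = 0
    stalls = 0
    for _ in range(cycles):
        # The spiller drains up to `bandwidth` values toward the next level.
        occupancy = max(0, occupancy - bandwidth)
        if occupancy + produced_per_cycle > capacity:
            stalls += 1  # no room: the producer stalls this cycle
        else:
            occupancy += produced_per_cycle
    return stalls


# Production below bandwidth: never stalls.
print(count_stall_cycles(produced_per_cycle=1, bandwidth=2, capacity=8, cycles=100))  # 0
# A short burst above bandwidth is absorbed by the buffer.
print(count_stall_cycles(produced_per_cycle=4, bandwidth=2, capacity=8, cycles=3))    # 0
# The same rate sustained for longer eventually stalls.
print(count_stall_cycles(produced_per_cycle=4, bandwidth=2, capacity=8, cycles=10))   # 4
```

The three calls show the pattern described above: the buffer hides bursts, and only a production rate held above the drain rate for long enough produces stalls.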
However, the stalls induced by the spiller will still leave your program running faster than on a conventional machine (which of course has to save the same information too) using explicit code to save and restore the general registers with every one of those recursive calls.