I did not have fast, deep recursion in mind. I raised the question about the 16 buffers that the Mill Silver and Gold have, which seems a small number that may not be sufficient to cope with even a small number of loops (using inner) and calls.
Ivan explained that the spiller has SRAM that sits between the buffers and the L2$, so the Mill works fine if the SRAM is sized properly and has sufficient bandwidth. There will be less pressure on the buffers and on the spiller's SRAM if operands that are dead in software are also declared dead to the hardware (for example using rescue or the suggested forget operation), but it is not yet known whether this is necessary. I think it is wise to look at gate-level sim runs to decide whether something must be done.
If it pays off to implement a forget operation and the encoding bits turn out to be an issue, one could use a forgeth1 for the first half of the belt positions and a forgeth2 for the second half to reduce the number of encoding bits. A forget operation does not have to be executed in the instruction immediately before a call or inner; it can be executed earlier, as soon as an operand becomes dead.
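To make the possible saving concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the belt length of 16 is just an example value, and I do not know how a forget would really be encoded, so the sketch counts operand bits for two hypothetical forms, a bit mask with one bit per belt position and a single belt index. The function names are made up.

```python
def forget_operand_bits(belt_length, split_halves):
    """Operand bits for a hypothetical forget under two assumed encodings."""
    # With forgeth1/forgeth2 each opcode only has to name half of the belt.
    positions = belt_length // 2 if split_halves else belt_length
    return {
        "mask": positions,                      # one bit per nameable position
        "index": (positions - 1).bit_length(),  # one belt position per operation
    }


def encode_forget_masks(dead_positions, belt_length):
    """Split a set of dead belt positions into (forgeth1 mask, forgeth2 mask)."""
    half = belt_length // 2
    h1_mask = 0
    h2_mask = 0
    for pos in dead_positions:
        if not 0 <= pos < belt_length:
            raise ValueError(f"belt position {pos} out of range")
        if pos < half:
            h1_mask |= 1 << pos            # first half: bit number = position
        else:
            h2_mask |= 1 << (pos - half)   # second half: rebased to bit 0
    return h1_mask, h2_mask


if __name__ == "__main__":
    print(forget_operand_bits(16, split_halves=False))
    print(forget_operand_bits(16, split_halves=True))
    print(encode_forget_masks({1, 9, 15}, 16))
```

Under these assumptions the mask form drops from 16 to 8 operand bits, while the index form only drops from 4 to 3, because the half selection simply moves into the opcode. The cost is that dead operands in both halves of the belt need two operations instead of one.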