Short answer: it’s all done in the naming.
Longer answer: the decoders map logical belt numbers in the code to physical belt numbers in the execution engine. There are twice as many physical numbers as the logical length of the belt, to accommodate drops that occur during phasing.
In your example of G cascading to F, say that G has a single result. Consequently the pre-existing belt value that would be b3 as an argument to G will be b4 as an argument to F (you see why hand-coding a Mill is not recommended 🙂 ). The decode hardware knows that G will drop one result (it’s in the encoding of the call op), so it adds one to the running logical-to-physical bias that it uses for the call of G to get the bias for the mapping for F. For a 16-long belt there are 32 physical numbers, so that’s only a 5-bit add and speed is not an issue.
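To make the bias arithmetic concrete, here is a minimal sketch of the renaming. The function names and the direction of the mapping (physical = bias minus logical, modulo the number of physical names) are my own assumptions for illustration; real hardware may arrange it differently. The only things taken from the description above are the 16-long belt, the doubled physical name space, and the add-one-per-result bias update.

```python
BELT_LEN = 16            # logical belt length from the example above
PHYS = 2 * BELT_LEN      # twice as many physical numbers: 32, i.e. 5 bits

def to_physical(bias, logical):
    # Assumed mapping: b0 is the newest drop and sits at physical number `bias`;
    # older values sit at successively lower physical numbers (mod PHYS).
    return (bias - logical) % PHYS

def bias_after_call(bias, n_results):
    # The decoder knows the result count from the call encoding,
    # so it can advance the bias before the call has executed.
    return (bias + n_results) % PHYS

bias_g = 10                              # arbitrary starting bias (hypothetical)
old_value = to_physical(bias_g, 3)       # the value named b3 in the call of G
bias_f = bias_after_call(bias_g, 1)      # G drops one result
assert to_physical(bias_f, 4) == old_value   # same physical value, now named b4
```

The value never moves; only the name the decoder computes for it changes.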
When a call returns, the physical belt is the same as it was before the call, courtesy of the spiller; a previously unoccupied physical position number is now filled with the result, but no existing values change physical numbers. The arguments of the second call are then picked up by their physical numbers (already available due to the mapping in the decoder), and copies are made in free buffers. The copies are assigned physical numbers with the new frame number. Bump the frame number, and you’re good to go.
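As a rough software model of that argument copy, assuming each belt value is tagged with a frame number and a physical belt number: the `Tagged` record, the `call` function, and the choice to number the callee's copies 0, 1, 2... in argument order are all hypothetical, picked only to keep the sketch small.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    frame: int   # frame number of the belt this value belongs to
    phys: int    # physical belt number within that frame
    value: int

def call(buffers, caller_frame, arg_phys):
    """Copy the named caller values into free buffers tagged with a new frame."""
    callee_frame = caller_frame + 1          # bump the frame number
    lookup = {(t.frame, t.phys): t.value for t in buffers}
    copies = [Tagged(callee_frame, i, lookup[(caller_frame, p)])
              for i, p in enumerate(arg_phys)]
    # The caller's values are left untouched; only copies are added.
    return buffers + copies, callee_frame

buffers = [Tagged(1, 7, 100), Tagged(1, 8, 200)]
after, frame = call(buffers, 1, [8, 7])
assert frame == 2 and after[:2] == buffers
```

A call with an empty argument list adds no copies and simply returns the bumped frame number, which is the new empty belt.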
That’s what happens without cascading. It should be clear that it’s not actually necessary to reify the whole belt. All that is really needed is to assign the right physical number to the result of the first call, and to create copies of second-call arguments in the spiller with suitable numbers.
The spiller doesn’t really have time to do that if there’s no cycle between the return and the next call and it doesn’t get the I-need-this list for the arguments until after the return. But the decoder already has the I-need-this list for the second call even before the instruction is issued, so it just gives the list to the spiller to save along with the rest at the first call. The spiller can then do the copies as soon as it knows the return is happening.
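A sketch of that hand-off, under my own assumptions: the caller's belt is modeled as a dict from physical number to value, the saved state is a dict holding the bias and the second call's I-need-this list (as physical numbers), and the first call has a single result that lands at the next physical number. None of these names or layouts come from the actual hardware.

```python
PHYS = 32  # physical names for a 16-long belt

def cascade_return(caller_belt, saved, result):
    """Place the first call's result and build the second call's belt in one step.

    saved["bias"] is the caller's bias at the first call;
    saved["next_args"] is the need-list the decoder supplied in advance.
    """
    belt = dict(caller_belt)
    bias = (saved["bias"] + 1) % PHYS    # the result takes a previously unused number
    belt[bias] = result
    # The need-list may name the slot the result just filled.
    callee_belt = [belt[p] for p in saved["next_args"]]
    return belt, bias, callee_belt

caller = {9: "x", 10: "y"}               # b0 = y (phys 10), b1 = x (phys 9)
saved = {"bias": 10, "next_args": [11, 9]}   # F wants G's result and x
belt, bias, f_args = cascade_return(caller, saved, "g")
assert f_args == ["g", "x"]
```

Because the need-list was stored at the first call, nothing has to be decoded or looked up between the return and the second call.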
There is actually a problem even with the advance notice if the return is conditional, because predicates are not evaluated until writer phase in the next cycle, which with cascading is actually the first cycle of the second-called function. We can go with the prediction and unwind if it is wrong, or simply not cascade conditional returns; it’s an implementation choice in the hardware, and members will likely differ.
We also have a way to do call arguments without copying, but it’s NYF (not yet filed).
Now for the rest of your question:
1) The return in a cascade goes direct to the next called function, apart from the conditional-return case above.
2) Called functions start with no data stack frame; they must explicitly allocate one with stackf if they need it. Return does automatic cutback though. The called belt is created as described above – a call with no arguments need only bump the frame number and presto – new empty belt.
The return op itself has no understanding of what it is returning to. It does have an argument list, encoded just like the list of a call, that gives the belt positions of the result(s). Necessarily these will become b0… in the caller (returns drop at the front as usual), so all it needs is the mapping bias and the frame number of the returned-to frame. Change the physical belt number and frame of those values and the result handling is done. The bias and frame have been saved in the spiller. Hardware implementations will vary, but I expect that the result handling is done by the part of the spiller that does cascading, and the return FU deals only with getting the address and predicate right and dealing with First Winner Rule.
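The result handling described above amounts to retagging. In this sketch the function name, the tuple layout, and the rule that the first listed result becomes b0 at the highest new physical number are assumptions of mine; the only inputs taken from the description are the return's result list and the bias and frame saved in the spiller.

```python
PHYS = 32  # physical names for a 16-long belt

def do_return(callee_values, result_phys, saved_bias, saved_frame):
    """Retag the returned values with the caller's frame and new physical numbers.

    callee_values maps physical number -> value in the returning frame.
    """
    n = len(result_phys)
    new_bias = (saved_bias + n) % PHYS           # caller's bias after the drops
    retagged = [(saved_frame, (new_bias - i) % PHYS, callee_values[p])
                for i, p in enumerate(result_phys)]
    return retagged, new_bias

callee = {3: "a", 4: "b"}
results, bias = do_return(callee, [4, 3], saved_bias=10, saved_frame=1)
assert results == [(1, 12, "b"), (1, 11, "a")]   # caller sees b0 = "b", b1 = "a"
```

Note that the function never looks at what it is returning to beyond the two saved numbers, which matches the point that the return op has no understanding of its caller.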
- This reply was modified 8 years, 1 month ago by Ivan Godard.