As pointed out by lpsmith in #1552 above, the Mill does simplify tail-call considerably. At the same time it introduces other issues that other machines don’t have. Here’s a rough-out of what comes to mind; we won’t actually do any of this until we have a complete tool chain to practice on.
1) Mill makes no difference between calls within and calls across compilation unit boundaries.
2) Mill does distinguish between calls within turf and portal calls across turfs. The latter must be secure in both directions, and argument passing of memory-resident arguments is complex. It will be impossible to support portal tail calls because the spiller needs a spill frame to hold the info to return to the calling context. It might be possible to skip an intermediate spill frame if there were immediately nested portal calls, but this case is too specialized to be worth doing anything about. However, the calling code will not have to know whether it is making a portal call or not; the only consequence will be that the calling frame will not be cut back until the portal call returns.
3) Operands on the local belt are identified by a nesting number, similar to a frame number except Mill functions do not necessarily have a frame. A call creates a new nesting number, which is how the belt is emptied in one cycle but is still there after the return. A tail call cannot reuse the nesting number of its caller because that would give it the caller’s belt rather than an empty belt, so a tail cal must also get a new nesting number. However, if tail calls are statically identified by an op distinct from normal calls, the hardware can save the caller’s nest number and return to that rather than to one less than the callee’s number. The nest number is small and wraps around, so there would have to be some arm-waving in the spiller to do the cut-back, but there seems nothing insurmountable.
4) The scratchpad and stack frame (if present) can be cut back as part of the special op. The called function would have to do a normal scratch/frame allocation if it needed to, but it would always be doing that anyway.
5) I don’t see any way to support in-memory arguments (VARARGS or large objects passed by value). In a true tail call these would be on top of the caller’s frame, and hence likely on top of the actual argument that is trying to be passed. A compiler could in principal move everything out of the way, cut the frame, and copy back the actual arguments, but I don’t see any actual compiler ever going to the trouble. Nor am I aware of any tail-call implementation that handles this case; please reply with citation if you know of one.
5) All this addresses true calls. Compilers can turn tail calls into loops in the generated code as well on the Mill as any machine.
6) All the implementation issues appear to be in the spiller in the Mill; it has to know that there is a missing spill frame. In turn, that means that it has to be told, and on a Mill that means a distinct operation. Because the internals of the spiller are per-member and not visible to the application, the implementation of tail calls is likely also to be per-member. In particular the spiller in a multicore Mill is quite different from that of a monocore, to avoid race conditions and coherence overhead.