p.s. after some more thought
The variable returns won’t work on the belt, and there are no shared registers or scratchpad. In fact, the only shared thing is memory. So what happens if the results (after the first) are communicated in memory?
If the shared-register approach works on a conventional, then sharing pseudo-registers at static addresses (w/r/t the program load image) should work too, albeit not as clean as something like a Forth double-ended stack. The cache line containing the pseudo-registers (PR hereafter) will always be hot and in the top-level cache. If ~14 (32-bit?) registers is enough, then the 16 32-bit values in a line should also be enough, although a 64-bit Lisp might want two lines. If the reserved addresses were placed at the bottom of the dp space then the address offsets would always encode in one byte for the accessing load/store operations. As the l/s opcode and other info are going to be around 16 bits, it means an access will be 24 or so bits of entropy in the encoding, favorable compared to 32-bit RISC codes and probably comparable to x86 (I am not an x86 expert – how many bytes in a load/store if the offset is one byte?) However, the l/s ops occupy flow slots, retire stations, and d$1 bandwidth. This is not free, compared to returning results on the belt.
The stores, inside the called function, are fire-and-forget. However, there is load latency to pick the values up from the PRs. The code doesn’t want to wait for a D$1 cycle before looking at the call result, although I suppose it could overlap the testing for the presence of the extra results, and probably some use of the prime result.
However, there’s another way: issue the load of the result before making the call, with the retire delay set so the load will retire in the instruction after the call. That load can be hoisted arbitrarily high, but if LISP overhead guarantees that any call will last longer than d$1 then hoisting is not necessary.
Unfortunately, this doesn’t buy anything when there actually is a new result to load. The load will allocate a retire station and go to cache for the value and then wait out the call. At the end of the call the store will be detected by the retire station, which will then re-issue the load request. But as the load is supposed to retire at once, and the second request to the d$1 is not instantaneous, the retire station must stall the machine until it gets the just-stored data, which is the same time that would have been if the load were issued after the call rather than before it 🙁
All in all, with the present Mill definition, it looks like extra results would pass in PRs in memory, and not be available until d$1 latency after the call. This works, but is unattractive.
I’ll keep the issue in mind and will report if illumination strikes.