A “slow belt” is an interesting idea. The problem is congruence over control flow. When the execution diverges at a branch, the various subsequent paths may drop different numbers of operands, so when the paths rejoins later the older belt items may have been pushed further along on one path than another. The two (or more) possible belt layouts at the join point must be brought into congruence so that in subsequent references each operand has a unique belt offset for its temporal address. The “rescue” and branch operations make things congruent. On the fast belt, the lifetimes are brief enough that there are relatively few operands that need to be shuffled to establish congruence, so the instruction bit space for arguments in rescue and branches is relatively small.
However, the scratchpad, or a slow belt, is for long-lived values by definition, and the lifetime of its working set would extend over much control flow. A slow belt would have to be kept congruent, just as the fast belt must be. However, the working set at the scratchpad lifetime level is much larger than the fast belt working set, so slow rescues or other slow reshuffling ops would need to name operands from a much larger name space, and would need many more bits to encode. A maximal fast rescue on a Silver (fast belt 16) is 16*4 = 64 bits, which is easily encodable. Meanwhile a slow rescue on a 256-position space would need 256*8 = 2048 bits, which we can’t encode. In addition the logical-to-physical mapping hardware for the slow namespace would need to be much bigger than that of the fast belt, likely even bigger than the logical-to-rename-register mapping hardware of a legacy OOO machine.
By the way, a maximal rescue on a 32-belt member needs 160 bits, which needs four flow slots to encode; on a 64-belt a maximal needs 384 bits which is past anything we have configured or have thought to configure; and more than that is right out in the present encoding. It would be nice to be able to encode such ultra-wide constants, not only for wide rescue but also for things like bit-matrix multiply, but we don’t support that at present.
In exchange for the expensive congruence, a slow belt would have a more compact “spill” operation than scratchpad spills do, because slow-belt spills would not need to express the destination address. However, because the slow belt is populated by long-lived values that are going to be referenced many times during their lifetime, a slow belt would have a larger fill/spill (read/write) ratio than the fast belt, which reduces the value of the compact slow spill.
Our scratchpad alternative uses spatial addressing, which moves the logical-to-physical mapping to the compiler. As a result the scratchpad is always congruent and control flow can be ignored regardless of the scratchpad size. The spill op needs to carry an address, but scratch addresses are small (likely 8 to 16 bits in different configurations), and spill is as relatively rare in scratch as it would be in a slow belt.
All in all, conceptually a “slow belt” is possible, but I doubt it would be practical. Definitely worth thinking about though; thank you.