Some questions about the belt.
When handling branches, loops, etc., the code needs to use the “conform” instruction to rearrange the belt.
How efficient is the “conform” instruction?
A CPU with normal registers would never need a “conform” instruction, so isn’t the belt causing a processing overhead here?
I guess the “conform” could be a hardware shuffle of the belt providing very little overhead.
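To make the question concrete, here is a toy model of what I imagine “conform” does. This is only my mental picture, not the actual hardware: the class `ToyBelt`, its methods, and the belt length of 8 are all made up for illustration.

```python
from collections import deque


class ToyBelt:
    """Toy belt: new results enter at position 0, the oldest falls off the end.

    Hypothetical model for discussion only, not the real implementation.
    """

    def __init__(self, length=8):  # length 8 is an arbitrary assumption
        self.length = length
        self.slots = deque([None] * length, maxlen=length)

    def drop(self, value):
        # A new result pushes everything one position further back;
        # deque(maxlen=...) discards the value at the far end.
        self.slots.appendleft(value)

    def conform(self, positions):
        # Rearrange the belt: new position i receives the value that is
        # currently at positions[i]. Anything not listed is discarded.
        if len(positions) > self.length:
            raise ValueError("more positions than belt slots")
        old = list(self.slots)
        new = [old[p] for p in positions]
        new += [None] * (self.length - len(new))
        self.slots = deque(new, maxlen=self.length)


belt = ToyBelt(length=4)
for v in ("a", "b", "c"):
    belt.drop(v)
# Belt is now c, b, a, (empty). Put "a" in front, then "c".
belt.conform([2, 0])
print(list(belt.slots))
```

In this model a conform is a pure renumbering of existing values, which is why I suspect it could be cheap in hardware.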
Another problematic aspect of the belt might be micro-parallelism.
Within a single group of instructions to be executed, some could run in parallel if they do not conflict. But the belt adds a conflict/race when their results are saved on the belt.
A possible solution to this might be: Multiple belts. The instructions that work in parallel could be saving to different belts, thus removing the conflict.
One would then have to implement some stall/synchronisation mechanism whereby an instruction stalls until the earlier parallel instruction it depends on has finished.
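A rough sketch of the kind of stall logic I mean, assuming in-order issue where several instructions may issue in the same cycle. The function `issue_cycles`, the instruction names, and the latencies are invented for the example and say nothing about how the real machine schedules.

```python
def issue_cycles(program):
    """Return the cycle in which each instruction issues.

    program is a list of (name, latency, deps) tuples in program order.
    An instruction stalls until every instruction it depends on has
    finished, and it never issues before an earlier instruction.
    """
    finish = {}  # name -> cycle in which its result is available
    issue = {}   # name -> cycle in which it issues
    cycle = 0
    for name, latency, deps in program:
        # Earliest cycle at which all inputs are available.
        ready = max((finish[d] for d in deps), default=0)
        # In-order issue: wait for inputs, never go back in time.
        cycle = max(cycle, ready)
        issue[name] = cycle
        finish[name] = cycle + latency
    return issue


program = [
    ("mul", 3, []),        # independent, could write to belt A
    ("add", 1, []),        # independent, could write to belt B
    ("sub", 1, ["mul"]),   # must stall until mul finishes
]
print(issue_cycles(program))
```

Here `mul` and `add` issue together in cycle 0, and `sub` stalls until cycle 3 when the result of `mul` is available.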
How many bytes is the belt? How much needs to be copied on a context switch?