I had not thought about queue machines for years; this is a great discussion.
Memories started to come back. We had the Parallel Queue Processor (PQP), the implementation with the ring buffer cited above. We also had the Queue Register Processor (QRP), which was similar in spirit to the Belt + Scratchpad. In the QRP, the scratchpad was analogous to a random-access register bank where we stored loop induction variables and other frequently accessed operands. The instruction set was a hybrid in which operands could come from both the queue (addressed with relative offsets) and the register bank (addressed with absolute addresses).
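To make the hybrid addressing concrete, here is a toy Python sketch of how such an operand read might look. This is not the actual QRP encoding: the names `Operand` and `read_operand`, and the "q"/"r" tagging, are made up for illustration only.

```python
from collections import namedtuple

# Hypothetical operand encoding: kind "q" = queue-relative, "r" = register-absolute.
Operand = namedtuple("Operand", ["kind", "index"])


def read_operand(op, queue, head, regs):
    """Fetch one source operand from the queue ring buffer or the register bank."""
    if op.kind == "q":
        # The offset is relative to the current head and wraps around the ring.
        return queue[(head + op.index) % len(queue)]
    if op.kind == "r":
        # Register operands use an absolute index that does not move with the head.
        return regs[op.index]
    raise ValueError(f"unknown operand kind: {op.kind!r}")


queue = [5, 7, 9, 11]   # ring buffer contents
head = 2                # the queue head currently points at the value 9
regs = [100, 200]       # e.g. loop induction variables

# Something like "add q+1, r1": one operand one slot past the head, one from register 1.
lhs = read_operand(Operand("q", 1), queue, head, regs)
rhs = read_operand(Operand("r", 1), queue, head, regs)
total = lhs + rhs
```

The point of the split is that the queue offset names a different physical slot every time the head advances, while the register index always names the same slot, which is what makes the bank convenient for long-lived values.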
In a queue/belt processor without a scratchpad, the spill-fill pairs can be replaced by single rotation or duplicate instructions. These can also be used to recycle the operands that fall from the tail.
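A rough Python sketch of the idea, using a belt-style model where new values enter at the front and the oldest one falls off the end. The class `ToyBelt` and its `drop`, `spill`, `fill`, and `dup` methods are hypothetical names for illustration, not the instruction set of any real queue or belt machine, and only the duplicate case is shown.

```python
from collections import deque


class ToyBelt:
    """Fixed-length belt: new values enter at position 0, the oldest falls off the end."""

    def __init__(self, length):
        self.slots = deque(maxlen=length)
        self.scratchpad = {}

    def drop(self, value):
        # appendleft on a full bounded deque discards the item at the opposite end.
        self.slots.appendleft(value)

    def spill(self, pos, addr):
        # Copy a belt value into the scratchpad (first half of a spill-fill pair).
        self.scratchpad[addr] = self.slots[pos]

    def fill(self, addr):
        # Bring a scratchpad value back to the front (second half of the pair).
        self.drop(self.scratchpad[addr])

    def dup(self, pos):
        # Re-drop an existing belt value at the front in a single operation.
        self.drop(self.slots[pos])


def keep_alive_with_spill_fill():
    belt = ToyBelt(4)
    for v in (10, 20, 30, 40):
        belt.drop(v)
    belt.spill(3, addr=0)  # 10 sits in the oldest position; save it
    belt.drop(50)          # 10 falls off the belt
    belt.fill(0)           # restore 10 from the scratchpad
    return list(belt.slots)


def keep_alive_with_dup():
    belt = ToyBelt(4)
    for v in (10, 20, 30, 40):
        belt.drop(v)
    belt.dup(3)            # recycle 10 before it falls off; no scratchpad needed
    belt.drop(50)
    return list(belt.slots)
```

In both versions the value 10 is still on the belt at the end; the duplicate version gets there with one bookkeeping operation instead of two and without touching the scratchpad, although the surviving values end up in different positions.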
Ivan, I am wondering how large a belt you guys need for heavily software-pipelined code. Unfortunately, my queue compiler was constrained to the dataflow information that GCC 4.X would give. I wish I had had access to a more aggressive compiler at that time.
An idea that I never had the time to explore was predicated execution in a queue model. I’ll try to watch your branch prediction videos over the weekend to see how you handle it in the belt machines.