I finally watched the belt talk all the way through. I agree that the hardware implementations of queue machines and belt machines are probably different. However, their programming models are similar. I also think that belt, queue, and stack machines are very good at decoupling the programming model from the hardware implementation.
As Ivan points out, “pure” queue machines need perfect trees. Real code with DAGs needs a relaxation of the queue model. We thought about two variants: allowing random access to operands in the queue with an offset relative to the head of the queue (producer-order model), or relative to the tail (consumer-order model). The queue machine’s offsets are a cousin of the belt’s temporal references to operands.
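To make the two offset variants concrete, here is a minimal sketch in Python of a queue machine whose instructions name their operands by an offset from the head or from the tail. The instruction encoding `(op, off1, off2, consume)`, the `fetch` and `run` helpers, and the explicit `consume` count are all assumptions made up for illustration; they are not the encoding of our queue processor or of the belt. The example evaluates the DAG (a + b) * (a + b), where the sum is used twice.

```python
from collections import deque
import operator

OPS = {"add": operator.add, "mul": operator.mul}

def fetch(queue, offset, base):
    """Read an operand without removing it from the queue."""
    if base == "head":
        return queue[offset]                  # offset counted from the oldest entry
    return queue[len(queue) - 1 - offset]     # offset counted from the newest entry

def run(program, env, base="head"):
    """Interpret a hypothetical offset-addressed queue program."""
    queue = deque()
    for instr in program:
        if instr[0] == "ld":
            queue.append(env[instr[1]])       # loads simply enqueue a value
        else:
            name, off1, off2, consume = instr
            a = fetch(queue, off1, base)
            b = fetch(queue, off2, base)
            for _ in range(consume):          # retire dead operands from the head
                queue.popleft()
            queue.append(OPS[name](a, b))     # results always go to the tail
    return queue[-1]

env = {"a": 2, "b": 3}

# Offsets relative to the head of the queue.
head_program = [("ld", "a"), ("ld", "b"), ("add", 0, 1, 2), ("mul", 0, 0, 1)]
# Same computation with offsets relative to the tail.
tail_program = [("ld", "a"), ("ld", "b"), ("add", 1, 0, 2), ("mul", 0, 0, 1)]

print(run(head_program, env, base="head"))   # 25
print(run(tail_program, env, base="tail"))   # 25
```

The `mul` reads the same queue entry twice, which a pure queue machine cannot express because every read is also a dequeue.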
An interesting observation about offsets or temporal references is that, once they are introduced, they make the generated code dependent on a particular hardware architecture. This is because the queues and belts have a fixed length. As with VLIWs, the code needs to be recompiled every time it is migrated to a new hardware implementation. A ring buffer creates the illusion of an infinite queue and, theoretically, allows the same code to run on different hardware. This was one of the motivations for having the ring buffer as our main implementation, but we also had architectural variants (e.g. a “multi-dimensional” queue processor where functional units were fed from different queues; the code is not pretty).
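The ring buffer idea can be sketched as follows. This is only an illustration under my own assumptions: the `RingQueue` class and the `running_sum` program are hypothetical, and real hardware would track the head and tail in registers rather than in Python integers. The point is that the logical positions grow without bound while the physical slot is the position modulo the capacity, so the same program runs on any capacity that is large enough to hold the live operands.

```python
class RingQueue:
    """Fixed-capacity storage that behaves like an unbounded FIFO queue."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0   # logical index of the oldest live entry
        self.tail = 0   # logical index of the next free entry

    def enqueue(self, value):
        if self.tail - self.head == len(self.slots):
            raise OverflowError("more live operands than physical slots")
        self.slots[self.tail % len(self.slots)] = value
        self.tail += 1

    def dequeue(self):
        if self.head == self.tail:
            raise IndexError("queue is empty")
        value = self.slots[self.head % len(self.slots)]
        self.head += 1
        return value

def running_sum(queue, n):
    """Same 'program' regardless of queue capacity: sum of 1..n."""
    queue.enqueue(0)
    for i in range(1, n + 1):
        queue.enqueue(i)
        a = queue.dequeue()
        b = queue.dequeue()
        queue.enqueue(a + b)
    return queue.dequeue()

small = RingQueue(capacity=2)
large = RingQueue(capacity=8)
print(running_sum(small, 100))   # 5050
print(running_sum(large, 100))   # 5050
print(small.tail)                # 201 logical positions used with only 2 slots
```

The portability only holds while the number of live operands fits in the physical slots; once offsets reach further back than the smallest implementation can hold, the code becomes tied to a particular size again.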
The only explanation of the conform operation I found is in this discussion thread, and I am not sure what it really does. In the queue processor’s programming model we had a “rotation” and a “duplicate” instruction that were used to reorder operands in the queue. The compiler can find the places where they are needed in a straightforward manner. Is this what the conform and rescue operations are used for?
Queue machines can’t execute RPN code. We relied on level-order scheduling to expose the ILP, and we relied on the hardware to pick up the stream of instructions and issue several at a time. I have not yet seen the videos on the belt’s instruction scheduling, but I suspect you pack the parallel instructions in bundles as in VLIWs? At the time we moved on from queue machines this was still on my to-do list.
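As a small illustration of level-order scheduling, the sketch below walks an expression tree level by level, deepest level first and left to right within a level, and then runs the resulting instruction stream on a plain FIFO queue. The `Node` class and the function names are hypothetical and only handle a tree (no shared subexpressions); this is not our compiler. All instructions on one level are independent of each other, which is where the ILP comes from.

```python
from collections import deque
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

class Node:
    def __init__(self, op, left=None, right=None):
        self.op = op          # operator symbol, or a variable name for a leaf
        self.left = left
        self.right = right

def level_order_schedule(root):
    """Return the nodes deepest level first, left to right within a level."""
    levels = []
    frontier = [root]
    while frontier:
        levels.append(frontier)
        frontier = [child for node in frontier
                    for child in (node.left, node.right) if child is not None]
    return [node for level in reversed(levels) for node in level]

def run_queue(program, env):
    """Execute on a pure queue: operands leave the head, results join the tail."""
    queue = deque()
    for node in program:
        if node.op in OPS:
            a = queue.popleft()
            b = queue.popleft()
            queue.append(OPS[node.op](a, b))
        else:
            queue.append(env[node.op])        # a leaf is a load
    return queue.popleft()

# (a + b) * (c - d), a perfect binary tree
tree = Node("*",
            Node("+", Node("a"), Node("b")),
            Node("-", Node("c"), Node("d")))

program = level_order_schedule(tree)
print([node.op for node in program])   # ['a', 'b', 'c', 'd', '+', '-', '*']
print(run_queue(program, {"a": 1, "b": 2, "c": 7, "d": 3}))   # 12
```

The RPN order for the same expression would be `a b + c d - *`, which works on a stack but produces the wrong operand pairing on a FIFO queue.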
The discussion on the varying latency of the exposed pipeline is interesting. It makes me wonder what the queue compiler’s scheduling would look like. Probably similar to what you propose with the multipliers that span across cycle levels. I think playing with the vertical (depth) position of these operations in the DAG could solve the problem.