Forum Replies Created

Viewing 3 posts - 1 through 3 (of 3 total)
  • canedo
    Member
    Post count: 3
    in reply to: The Belt #1279

    I did not think about queue machines for years; this is a great discussion.

Memories started to come back. We had the Parallel Queue Processor (PQP) – the implementation with the ring buffer cited above. We also had the Queue Register Processor (QRP) that was, in spirit, similar to the Belt + Scratchpad. In the QRP, the scratchpad was analogous to a random-access register bank where we stored loop induction variables and other frequently accessed operands. The instruction set was a hybrid that could take operands from both the queue (with relative offsets) and the register bank (with absolute addresses).
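A minimal sketch of that hybrid operand model, with invented encodings (the `"q"`/`"r"` operand tags and the `fetch` helper are illustrative assumptions, not the QRP's actual ISA):

```python
from collections import deque

# Hypothetical sketch of a hybrid queue + register-bank operand fetch:
# an operand comes either from the operand queue, addressed by an offset
# relative to the head, or from a random-access register bank.

def fetch(queue: deque, regs: list, operand):
    kind, idx = operand
    if kind == "q":          # queue-relative: offset 0 is the head
        return queue[idx]
    if kind == "r":          # register bank: absolute address
        return regs[idx]
    raise ValueError(kind)

q = deque([10, 20, 30])
r = [0, 7]                   # e.g. a loop induction variable lives in r[1]

# add q[0] (the head) to r[1] and enqueue the result at the tail
result = fetch(q, r, ("q", 0)) + fetch(q, r, ("r", 1))
q.append(result)
assert result == 17
```

The point of the mix is that hot, long-lived values (induction variables) sit at stable register addresses while transient dataflow values flow through the queue.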

    In a queue/belt processor without a scratchpad, the spill-fill pairs can be replaced by single rotation or duplicate instructions. These can also be used to recycle the operands that fall from the tail.

Ivan, I am wondering how large a belt you guys need for heavily software-pipelined code? Unfortunately, my queue compiler was constrained to the dataflow information that GCC 4.X would give. I wish I had had access to a more aggressive compiler at that time.

An idea that I never had the time to explore was predicated execution in a queue model. I’ll try to watch your branch prediction videos over the weekend to see how you handle it in the belt machines.

  • canedo
    Member
    Post count: 3
    in reply to: The Belt #1277

I finally watched the talk on the belt to the end. I agree that the hardware implementations of queue machines and belt machines are probably different. However, their programming models are similar. I also think that belt, queue, and stack machines are very good at decoupling the programming model from the hardware implementation.

    As Ivan points out, “pure” queue machines need perfect trees. Real code with DAGs needs a relaxation of the queue model. We thought about two variants: allowing random access to operands in the queue with an offset relative to the head of the queue (producer-order model), or relative to the tail (consumer-order model). The queue machine’s offsets are a cousin of the belt’s temporal references to operands.
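The need for that relaxation is easy to see on a small DAG. A sketch of the producer-order variant (semantics assumed: a non-destructive head-relative read, which I call `qread` here for illustration) lets one produced value feed two consumers:

```python
from collections import deque

# Sketch of head-relative operand access (producer-order model, semantics
# assumed). A pure queue machine consumes strictly from the head, which
# only fits perfect trees; a non-destructive read at offset k from the
# head lets a shared DAG node feed several consumers.

q = deque()
def enq(v): q.append(v)
def qread(k): return q[k]        # head-relative offset, does not consume

# compute (a + b) * (a - b): a and b are each produced once, used twice
a, b = 6, 4
enq(a); enq(b)                   # queue: [a, b]
enq(qread(0) + qread(1))         # queue: [a, b, a+b]
enq(qread(0) - qread(1))         # queue: [a, b, a+b, a-b]
q.popleft(); q.popleft()         # a and b are dead; retire them
product = qread(0) * qread(1)    # (a+b) * (a-b)
assert product == 20
```

The consumer-order variant is the mirror image: offsets are taken relative to the tail, i.e. relative to the most recent producer, which is essentially the belt's temporal addressing.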

An interesting observation about offsets or temporal references is that once they are introduced, they make the generated code dependent on a particular hardware architecture. This is because the queues and belts have a fixed length. Like VLIWs, the code needs to be recompiled every time it is migrated to a new hardware implementation. A ring buffer creates the illusion of an infinite queue and, in theory, allows the same code to run on different hardware. This was one of the motivations for having the ring buffer as our main implementation, but we also had architectural variants (e.g. a “multi-dimensional” queue processor where functional units were fed from different queues – the code is not pretty).
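The ring-buffer trick can be sketched in a few lines (sizes and names are mine, purely for illustration): logical head and tail positions grow without bound while physical indices wrap modulo the buffer size, so the same code sees an "infinite" queue as long as the live window fits:

```python
# Sketch of a ring buffer backing an operand queue (details assumed).
# Logical head/tail indices grow monotonically; physical slots wrap
# modulo the buffer size, creating the illusion of an unbounded queue.

class RingQueue:
    def __init__(self, size=8):
        self.buf = [None] * size
        self.head = 0            # logical index of the front operand
        self.tail = 0            # logical index one past the last operand

    def enq(self, v):
        assert self.tail - self.head < len(self.buf), "live window overflow"
        self.buf[self.tail % len(self.buf)] = v
        self.tail += 1

    def deq(self):
        v = self.buf[self.head % len(self.buf)]
        self.head += 1
        return v

rq = RingQueue(size=4)
for v in range(10):              # logical positions far exceed the 4 slots
    rq.enq(v)
    assert rq.deq() == v         # the live window never exceeds one slot
```

The catch, as noted above, is the overflow assert: code whose live window exceeds the physical buffer still fails, so portability across implementations is only as good as the worst-case window the compiler produces.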

The only explanation of the conform operation I have found is in this discussion thread, and I am not sure what it really does. In the queue processor’s programming model we had a “rotation” and a “duplicate” instruction that were used to reorder operands in the queue, and the compiler could find the locations for them in a straightforward manner. Is this what the conform and rescue operations are used for?

Queue machines can’t execute RPN code. We relied on level-order scheduling to expose the ILP, and on the hardware to pick up the stream of instructions and issue multiple at a time. I have not yet seen the videos on the belt’s instruction scheduling, but I suspect you pack the parallel instructions into bundles as in a VLIW? At the time we retired from the queue machines this was on my todo list.
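The contrast with RPN is the traversal order: RPN is a post-order (depth-first) linearization for a stack, while a queue machine wants a level-order (breadth-first) one. A sketch with an invented expression (the tuple encoding is mine, just for illustration):

```python
# Sketch of level-order scheduling for a queue machine. Operations on
# the same level of the expression tree have no dependences among them,
# so the hardware can issue a whole level in parallel; RPN, by contrast,
# is a post-order (depth-first) linearization for a stack machine.

tree = ("*", ("+", "a", "b"), ("-", "c", "d"))   # (a + b) * (c - d)

def level_order(root):
    levels, frontier = [], [root]
    while frontier:
        levels.append([n[0] if isinstance(n, tuple) else n for n in frontier])
        frontier = [c for n in frontier if isinstance(n, tuple) for c in n[1:]]
    return levels[::-1]          # issue leaves first, the root last

schedule = level_order(tree)
assert schedule == [["a", "b", "c", "d"], ["+", "-"], ["*"]]
```

Here the four leaf loads can issue together, then `+` and `-` together, then `*` alone: the ILP is visible directly in the level structure, which is what the hardware exploits when issuing multiple instructions at a time.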

The discussion on the varying latencies of the exposed pipeline is interesting. It makes me wonder what the queue compiler’s scheduling would look like. Probably similar to what you propose with the multipliers that span across cycle levels. I think playing with the vertical (depth) position of these operations in the DAG could solve the problem.

  • canedo
    Member
    Post count: 3
    in reply to: The Belt #1218

    This is a great initiative, the best of luck to the founders.

    Has anyone made the connection yet between belt machines and queue machines?

    From 2004 to 2008 I was part of Prof. Sowa’s lab where we developed the Parallel Queue Processor. We had a similar idea (FIFOs holding operands) with similar objectives: high parallelism, no register renaming, and short instruction set. http://dl.acm.org/citation.cfm?id=1052359

    For those interested in the compilation aspects: http://ir.lib.uec.ac.jp/infolib/user_contents/9000000330/9000000330.pdf

    Some additional material about queue machines has been compiled by Prof. Abderazek:

    http://web-ext.u-aizu.ac.jp/~benab/research/projects/QueueCore/
