  • milljma
    Participant
    Post count: 3
    #743 |

    It seems like the compiler is responsible for coming up with bundles of operations which
    can be executed concurrently. How do you know the compiler can always come up with bundles
    which have 30 parallel operations?

  • Will_Edwards
    Moderator
    Post count: 98

    the compiler is responsible for coming up with bundles of operations which
    can be executed concurrently.

    Correct.

    How do you know the compiler can always come up with bundles
    which have 30 parallel operations?

    “Always” is a strong word; obviously we can’t. The conventional wisdom was that there is an ILP of only 2 or so in open code. That is not true. Our Execution talk describes phasing, which is one of the ways we improve on this.

  • milljma
    Participant
    Post count: 3

    I’ve looked at your execution slides. Is slide 26 the kind of situation you’re talking
    about when you say ILP is more than 2? Isn’t it also very common to have a lot of
    logical branch/decision making in a program? Are slides 45 & 48 your answer to
    branching? Do you have a compiler? If so, have you tested it to verify an average ILP of
    more than 4?

    • Ivan Godard
      Keymaster
      Post count: 689

      To expand a bit on what Will said: the instruction (for that Gold member) may carry up to 30 operations. The compiler does not have to find 30 for every instruction; you would never see that level of ILP in open code except in contrived examples. However, in loops, with pipelining and/or unrolling, there is no limit to the amount of available ILP. One-instruction loops are common on a Mill, with the instruction containing large numbers of ops, approaching the encoding limit of the particular family member.
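
      As a rough illustration of why loops differ from open code, here is a minimal sketch under stated assumptions: every iteration is independent (no loop-carried dependency), every op has unit latency, and the loop body is a three-op chain like the constant/add/store pattern discussed for slide 26. The function names and the model are hypothetical, not anything from the Mill toolchain.

```python
# Illustrative model only: names and numbers are assumptions, not Mill specifics.

def available_ilp(ops_per_iteration: int, chain_depth: int, unroll: int) -> float:
    """Average ops per cycle when `unroll` independent iterations overlap.

    Assumes unit latency and no loop-carried dependencies, so the critical
    path stays at `chain_depth` cycles however many iterations overlap.
    """
    total_ops = ops_per_iteration * unroll
    return total_ops / chain_depth


def ops_per_instruction(ilp: float, width: int = 30) -> float:
    """Cap the available ILP at the instruction width (30 ops, per the post)."""
    return min(ilp, width)


if __name__ == "__main__":
    # Loop body assumed to be a dependent chain: constant -> add -> store.
    for unroll in (1, 2, 4, 10, 20, 40):
        ilp = available_ilp(ops_per_iteration=3, chain_depth=3, unroll=unroll)
        print(unroll, ilp, ops_per_instruction(ilp))
```

      In this toy model the available ILP grows linearly with the unroll factor, so the instruction width, not the program, becomes the limit.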

  • milljma
    Participant
    Post count: 3

    If you give me a sequence of 30 Sparc instructions like that shown in slide 26, I can
    execute them all in one cycle. There is nothing about a belt or phasing that cannot be
    done in an implementation of an existing ISA, such as Sparc. Your compiler is the key
    to your performance advantage.

    My unsolicited (& probably unwelcome) suggestion is:
    1. Use the open source Sparc compiler & RTL as a base to implement your key ideas. You can even extend the ISA with your favorite instructions.

    2. Demonstrate performance advantage with some existing Sparc code. Limitations are acceptable.

    If you can do this, Oracle will fund you & also buy you out once you finish the whole
    thing. So your funding & exit considerations are all taken care of.

    Nobody cares about an extra 100K or even 500K gates. It’s a non-issue. If your compiler
    can extract large ILP from loops, slide 26, then concurrent execution can be done. There
    is no gate count limit & gates are blazingly fast nowadays.

    • Ivan Godard
      Keymaster
      Post count: 689

      If you work for Oracle, your employer is going to be very unhappy about you spilling the beans on their revolutionary new CPU.

      Otherwise, you might look a bit more closely at the design of the Sparc you mention. To be able to execute 30 operations of the slide-26 kind (a constant, an add, and a store) in one cycle, you would need ten load-store units and twenty ALUs – actually, 30 ALUs, because the Sparc takes two instructions to build a 32-bit constant. I’m not up on the most recent Sparc offerings, but I think the biggest they have has two load-store units and two ALUs, which is a little short of what you need. 🙂
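
      The unit counts above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 30 operations split evenly into ten (constant, add, store) groups, that each store needs its own load-store unit, and that every constant-building instruction and every add needs its own ALU. The function and parameter names are made up for illustration.

```python
def units_needed(total_ops: int = 30, instrs_per_constant: int = 1) -> tuple:
    """Return (load_store_units, alus) needed to issue everything in one cycle.

    Assumes the ops come in groups of three: one constant, one add, one store.
    `instrs_per_constant` is how many ALU instructions it takes to build the
    constant (the post says two on Sparc for a 32-bit value).
    """
    groups = total_ops // 3
    load_store_units = groups                    # one store per group
    alus = groups * (instrs_per_constant + 1)    # constant instrs plus the add
    return load_store_units, alus


if __name__ == "__main__":
    print(units_needed(30, instrs_per_constant=1))  # single-instruction constants
    print(units_needed(30, instrs_per_constant=2))  # two-instruction constants
```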

      So why not build a Sparc with more functional units? Well, knowing why not is why hardware engineers get paid, but the short answer is that the units have to be able to talk with each other, and the cost of the connections increases as the square of the number of units. Rather quickly you reach a point at which the power required will melt the chip. You might Google “dark silicon problem” for more.
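
      To put a rough number on the quadratic growth, here is a toy count that assumes one forwarding path from every unit's result to every unit's input (including its own). Real bypass networks are more complicated than this, so treat the function as a hypothetical scaling model rather than a description of any actual chip.

```python
def forwarding_paths(units: int) -> int:
    """Producer-to-consumer paths if every unit can feed every unit.

    Assumption: one path per (producer, consumer) pair, so the count is
    the square of the number of units.
    """
    return units * units


if __name__ == "__main__":
    small = forwarding_paths(2 + 2)     # two load-store units plus two ALUs
    large = forwarding_paths(10 + 30)   # ten load-store units plus thirty ALUs
    print(small, large, large // small)
```

      Under that assumption, going from 4 units to 40 multiplies the unit count by ten but the connection count by a hundred.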

      The Mill avoids this barrier using a method long used in the embedded world: static scheduling with an exposed pipeline. That solves the melting problem, but unfortunately such designs give very bad performance on general-purpose programs. The issues are run-time ones (cache misses and the like), so compiler improvements don’t help. The Mill has solved those issues, and is able to bring DSP power-performance numbers to general-purpose code.

      I wish it were as easy as you believe; I could have spent the last decade on a beach. 🙂
