Forum Replies Created

Viewing 15 posts - 61 through 75 (of 94 total)
  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: Compiler #744

    > the compiler is responsible for coming up with bundles of operations which
    > can be executed concurrently.

    Correct.

    > How do you know the compiler can always come up with bundles
    > which have 30 parallel operations?

    “Always” is a strong word; we obviously can’t. The conventional wisdom was that there’s only an ILP of 2 or so in open code. This is not true. Our Execution talk describes phasing, which is one of the ways we improve on this.
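    To make the ILP point concrete, here is a sketch (plain C, nothing Mill-specific) of the kind of loop where open code looks serial but phasing recovers parallelism:

        /* Each iteration is a serial chain: load -> multiply -> add.
           Looked at one iteration at a time, the ILP is about 2. */
        float dot(const float *a, const float *b, int n) {
            float sum = 0.0f;
            for (int i = 0; i < n; i++)
                sum += a[i] * b[i];
            return sum;
        }
        /* With phasing, a single Mill instruction can carry ops from
           several iterations at once -- e.g. the loads for iteration
           i+2, the multiplies for i+1 and the accumulate for i -- so
           the steady state issues far more than 2 operations per cycle. */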

  • Will_Edwards
    Moderator
    Post count: 98

    You answer your own question very well 🙂 I think your reasoning closely approximates the OOTB team’s.

    > Or maybe you all have thought of some clever means of partitioning your belt network by depth that isn’t obvious to me (and so is probably NYF), or I’m wrong to think that all inputs and outputs have to have the same size?

    Well, holistically, the Mill is an ABI at the load-module level. But it’s very much grounded in a hardware architecture, of course. Yet how the belt is implemented is always described in the talks as an implementation detail.

  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: mill + parallella #715

    Do you have any insights into how to do this or how you imagine it working?

  • Will_Edwards
    Moderator
    Post count: 98

    Can mods (like myself) host images on ootbcomp? WP normally has add-media dialogs, but I cannot find them for ootbcomp.com.

  • Will_Edwards
    Moderator
    Post count: 98

    The multiply op belongs in the execute phase, so it issues in the second cycle of the instruction.

    The number of cycles it takes depends on the member, on the operand width, and on the type of multiply (integer, fixed point, floating point, etc.). Multiplying bytes is quicker than multiplying longs, and so on. But the specializer knows the latencies and schedules appropriately.

    Let’s imagine it takes 3 cycles, which includes the issue cycle. The instruction issues on cycle N, but the multiply operation issues on cycle N+1 and retires – puts its results on the belt – before cycle N+4.
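    Spelled out as a timeline, with the hypothetical 3-cycle multiply above:

        cycle N   : the instruction issues; reader-phase ops execute
        cycle N+1 : the multiply issues (execute phase)
        cycle N+2 : multiply still in flight
        cycle N+3 : multiply completes and retires at end of cycle
        cycle N+4 : the result is on the belt for consumers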

    The CPU likely has many pipelines that can do multiplication, as it’s a common enough thing to want to do. The Gold, for example, has eight pipelines that can do integer multiplication and four that can do floating point (four of the integer pipelines are the same pipelines as the four that can do floating point).

    So on the Gold, you can have eight multiply ops in the same instruction, and they all execute in parallel. Furthermore, even if a pipeline is still executing an op issued on a previous cycle, it can be issued an op on this cycle. And each multiply can be SIMD, meaning that taken altogether the Mill is massively MIMD and you can be multiplying together a staggeringly large number of values at any one time, if that’s what your problem needs.
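    For illustration only, here is a block of eight independent multiplies that a Gold-sized member could in principle bundle into a single instruction (how they are actually scheduled is up to the specializer):

        void mul8(const int a[8], const int b[8], int r[8]) {
            /* No dependencies between these, so all eight can issue
               in the same cycle on eight integer multiply pipelines;
               each could itself be a SIMD vector multiply. */
            r[0] = a[0] * b[0];
            r[1] = a[1] * b[1];
            r[2] = a[2] * b[2];
            r[3] = a[3] * b[3];
            r[4] = a[4] * b[4];
            r[5] = a[5] * b[5];
            r[6] = a[6] * b[6];
            r[7] = a[7] * b[7];
        }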

  • Will_Edwards
    Moderator
    Post count: 98

    Do the 4th and 5th level replies have the same indent in your browser?

    We seem to quickly get more than 5 deep in a thread, and then it becomes really hard to follow. I just answered a couple of questions only to work out later that Ivan had answered them in his usual depth and detail.

  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: Security #791

    Yes, you can use the stack to pass arguments between frames.

    You may also use the heap. And how you do it depends a lot on whether you make a distinction between objects and primitives, and how you do garbage collection.

    Normal calls within a compilation unit are very conventional and all the normal approaches work; it’s only if you want to take advantage of the Mill protection mechanism, or if it’s convenient for dynamic linking, that you use portals.

  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: Execution #725

    Answering particular parts of your question where I can:

    > How much of the selection of the current phasing setup was a result of the way the hardware works, and how much was profiling actual code to see what would be useful in practice?

    Well, definitely informed by code analysis. There’s an upcoming talk on Configuration, and it’s been described in the talks and in the Hackaday interview.

    A central tenet of the Mill is that people building products can prototype custom processor specifications in the simulator very quickly, and choose the configuration with the best power/performance tradeoff, informed by benchmarking representative code.

    > With regard to calls, how many cycles do they normally take in overhead?

    One. Correctly predicted calls and branches transfer in the very next cycle. In the case of a misprediction where the destination is in the instruction cache, the Mill is again unusually fast; the penalty is just five or so cycles.

    Additionally, there are none of the conventional pre- and post-ambles.

  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: mill + parallella #718

    Yes, I understand you.

    We do want to produce an early “make” model for enthusiasts, as there are lots of enthusiasts who have asked for Mill dev boards. The funding mechanics, priority and timescales for this are still being discussed.

  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: Execution #646

    > Does the mill support (arbitrary) vector element swizzling?

    Yes. There is a shuffle op for arbitrary rearrangements of a vector.

    > I’m just wondering if the same functionality that enables free pick might also allow free swizzles.

    I believe it’s in the op phase.

    > I could see how it might be machine dependent due to different vector sizes.

    Well, you can always use the remainder op to create a mask that you then pick against with 0s or Nones to create a partially filled vector? This was covered in the strcpy example in the Metadata talk and the Introduction to the Mill CPU Programming Model post.
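    As a rough illustration, here is scalar C emulating that remainder-then-pick idea; the lane count and the explicit valid[] flags are stand-ins for the hardware vector width and None metadata, not any real Mill API:

        #define LANES 8
        /* Load up to n elements into a LANES-wide "vector": lanes
           below n get data (a remainder-style mask selects them),
           the rest get None -- emulated here as valid[i] == 0. */
        void load_partial(const float *src, int n,
                          float lane[LANES], int valid[LANES]) {
            for (int i = 0; i < LANES; i++) {
                int in_range = i < n;                /* the mask      */
                lane[i]  = in_range ? src[i] : 0.0f; /* pick data or 0 */
                valid[i] = in_range;                 /* None in padding */
            }
        }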

  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: Execution #637

    True, the analogy doesn’t stand up to such scrutiny 🙂

    Your compiler is still responsible for Tail-Call-Optimisation to turn recursion into loops. The CPU does what it’s told.
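    The classic shape, in C; the compiler rewrites the first form into the second:

        /* Tail-recursive: the recursive call is the last thing done. */
        long fact(long n, long acc) {
            return n <= 1 ? acc : fact(n - 1, acc * n);
        }
        /* After tail-call optimisation: a plain loop, no calls at all. */
        long fact_iter(long n) {
            long acc = 1;
            for (; n > 1; n--)
                acc *= n;
            return acc;
        }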

  • Will_Edwards
    Moderator
    Post count: 98

    I was careful to say wider belt, meaning more elements in a vector, rather than longer belt, because I imagine it’s diminishing returns and stresses the instruction cache and so on.

    The key thing is that it is straightforward to simulate variations and evaluate them on representative target code. I’m sure that the current configurations haven’t been plucked from thin air, but rather represent what is considered the most advantageous mix for the first cut.

    I do want a Platinum Mill on my desktop and to hell with cooling! When we have a monster for gaming rigs, compiler rigs and for the fun of it, then we can dream of an Unobtainium Mill.

  • Will_Edwards
    Moderator
    Post count: 98

    Sorry, I could have been over-enthusiastic. I can imagine a Mill with more L2 and L3, wider belt, more FP, higher clock rate and so on 🙂

    But yes, the Mill is high end compared to today’s CPU cores, perhaps? 😉

    I will fix the post when I am next on a laptop.

  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: Instruction Encoding #578

    Well, it would be staggeringly unlikely to be a meaningful program, because the two streams diverge in memory from the entry point; entering at another address, e.g. EBB+n, would mean that the bit-stream fragment between EBB and EBB+n that is valid instructions for one stream must also decode as valid instructions for the other stream when read backwards…

    Of course, trying to generate two legal, overlapping EBBs with this property may be a fun exercise for determined readers 🙂

  • Will_Edwards
    Moderator
    Post count: 98
    in reply to: Metadata #571

    Add-reductions keep coming up in my mind when doing 3D (the kind of thing game and graphics engines will be doing buckets of). In 3D graphics there are lots of vectors which are 3 or 4 elements long.

    I imagine that, whilst belt vectors are powers-of-2 in length, you can load a non-power-of-2 vector, and the load automatically pads it with Nones? So if you load(addr,float32,3) you actually get xyzNone.

    And you’d want an add reduction to treat Nones as 0 rather than propagate them.

    The shuffle sounds useful for computing cross-product.

    Generally in games/graphics you want sqrt, inverse sqrt and dot product. You also likely want to sum a vector again when you do matrix multiplication.
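    A scalar sketch of that scheme in C, with an explicit valid flag standing in for None (the width of 4 and the flag are illustrative, not Mill API):

        #define LANES 4
        /* Dot product of two 3-vectors, done as if by
           load(addr,float32,3): three data lanes plus a None pad lane,
           multiplied lanewise, then add-reduced with None treated as 0. */
        float dot3(const float a[3], const float b[3]) {
            float va[LANES], vb[LANES];
            int valid[LANES];
            for (int i = 0; i < LANES; i++) {
                valid[i] = i < 3;              /* lane 3 is the None pad */
                va[i] = valid[i] ? a[i] : 0.0f;
                vb[i] = valid[i] ? b[i] : 0.0f;
            }
            float sum = 0.0f;
            for (int i = 0; i < LANES; i++)
                if (valid[i]) sum += va[i] * vb[i];  /* None -> 0 */
            return sum;
        }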

    My thinking would be that in the Mill IR sqrt, inverse sqrt, sum reduction, fork/exec/vfork, memcpy, memmove etc. are built-in functions, and the specialiser on each target turns each into a single operation or a sequence of operations, as the target supports. So that’s like microcode (or standard function inlining), but in the specialising compiler rather than in the outer compiler or on-CPU. It would be a hassle for a specialiser to have to unravel some old IR that codes its own sqrt loop out of lower-level operations, if there is ever hardware with a better built-in sqrt, for example.
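    To sketch the shape of that idea (all names here are hypothetical, not any real specialiser interface): the specialiser would carry a per-member lowering table, mapping each IR built-in either to a native op or to an expansion sequence:

        typedef enum { LOWER_NATIVE, LOWER_EXPAND } Lowering;
        typedef struct { const char *builtin; Lowering how; } Rule;

        /* For an imaginary member with hardware sqrt but no
           hardware inverse-sqrt: */
        static const Rule member_rules[] = {
            { "sqrt",   LOWER_NATIVE },  /* one target op              */
            { "rsqrt",  LOWER_EXPAND },  /* e.g. sqrt then reciprocal,
                                            or a Newton-Raphson loop   */
            { "memcpy", LOWER_EXPAND },  /* a vector copy loop         */
        };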

    And as for hazards, we all want to avoid them, but pragmatically, if it’s the specialiser that has to know about them and it has to know about one of them, it might as well open the floodgates and have a few more 😉
