Forum Replies Created

Viewing 15 posts - 421 through 435 (of 674 total)
  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Simulation #1547

    Not yet for public release. The architecture has been running in sim for 5+ years now, but only from conAsm, apart from our partial compiler, now abandoned. When genAsm and the specializer are ready (Real Soon Now) we expect to provide either online access via our servers or a downloadable version to run on your own machine, or both. The full LLVM-based tool chain will take longer.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Prediction #1491

    Exactly right.

    When the prediction facility was being designed, we decided that any unstable transfer point such as method calls (what Cliff Click calls “megamorphic”) would be of interest only if control passed through it frequently enough to keep the various target EBBs in the L2. Consequently we did not concern ourselves with issues of DRAM latency, and were primarily concerned with the miss-predict penalty (yes, the Mill has a penalty that is a third that of other architectures, but a stall is still a stall) and the L1/L2 latencies.

    Next, we observed that in the codes we looked at, and in our personal experience, megamorphic transfers tended not to be calls on leaf functions. Instead, a megamorphic transfer usually led to a sequence of transfers that were not megamorphic, until control returned back through the megamorphic point. This reflects a program model in which a test point figures out the type of a datum and then does stuff with the known type. Mill run-ahead prediction works well with this model, because once through the megamorph the subsequent control flow will predict well and will be prefetched ahead of need. Incidentally, the original source may not have been written to this model, but the compiler's value-set analysis and inlining will tend to remove the type-checks internal to the known-type cluster, and even if it does not, the type will in fact be fixed and the transfers will not be megamorphic.

    The alternative model works less well, where the same object is subjected to a succession of leaf-function calls. Each of the calls will be megamorphic if the object is (and so will miss-predict frequently), and the leaf functions don’t have long enough transfer chains for run-ahead to win much.

    This suggests an obvious optimization for the compiler: transform the latter structure into the former. This dimensionality inversion or transpose is the control space analogy to the re-tiling optimizations that are used to improve array traversal in the data space, and for all I know some compilers already do it. It should benefit any prediction mechanism, not just ours.
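
    For concreteness, here is a minimal C++ sketch of that inversion; the Shape/Circle names and the dynamic_cast guard are my own illustration of the idea, not anything the Mill tool chain emits.

        struct Shape {
            virtual double area() const = 0;
            virtual double perimeter() const = 0;
            virtual ~Shape() = default;
        };

        struct Circle : Shape {
            double r;
            explicit Circle(double r) : r(r) {}
            double area() const override      { return 3.14159265358979 * r * r; }
            double perimeter() const override { return 2.0 * 3.14159265358979 * r; }
        };

        // "Leaf call" model: every call site dispatches (and may mispredict) separately.
        double measureLeafCalls(const Shape& s) {
            return s.area() + s.perimeter();
        }

        // "Transposed" model: one unstable test, then a cluster of known-type work
        // whose internal transfers are no longer megamorphic.
        double measureTransposed(const Shape& s) {
            if (const Circle* c = dynamic_cast<const Circle*>(&s))
                return c->area() + c->perimeter();   // devirtualized; predicts well
            return s.area() + s.perimeter();         // general fallback
        }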

  • Ivan Godard
    Keymaster
    Post count: 689

    The Mill will do very well for DSP work up to a limit of data parallelism. If the necessary SIMD width is greater than 8 or so then a GPU-like architecture will do better, albeit at the cost of having to do GPU programming.

    Of course, DSP covers many quite disparate loads. For example, if the load requires double or quad floating point (GPUs are notoriously bad at DP and don't have quad at all) then a Mill will have better performance than a GPU and lower power than an OOO CPU. In contrast, if the DSP is really not much more than a microcontroller (for example in audio work) then a Mill is overkill.

    No simple single answer, sorry.

  • Ivan Godard
    Keymaster
    Post count: 689

    The Wiki is not yet open to the public. Yes, you may intermittently be able to reach it as we work on it, but it will move/break/go away without notice. There will be an announcement :-)

  • Ivan Godard
    Keymaster
    Post count: 689

    This isn’t quite what I was hoping for from you. These codes are already hand-tuned for a particular x86. A Mill can of course emulate an x86, but it won’t be fast. On some Mill members there may even be (an approximation to) AVX2 operations; you have to pick a member with the same vector height as a Haswell. You could then code your critical function in conAsm assembler for that member, and you’d get roughly the performance that a Haswell on the same clock and process would give you, or better depending on how many shuffle units were on the member you chose. I’m sure some people will use Mills in this way because it’s how they are used to coding, but we don’t recommend it.

    What I was hoping to get from you was the actual algorithm that your code is a machine-dependent implementation of. Express it in scalar if need be. There won’t be any shuffle operations in the algorithm of course, nor any register sizes, nor registers for that matter. It’s impractical for me to try and deduce what your code is supposed to do by looking at what it actually does; decompilation is provably impossible in general.

    The Mill does have an arbitrary shuffle. It's not clear that any of your codes would actually use a shuffle, unless that were the only hammer you had.

    Ivan

  • Ivan Godard
    Keymaster
    Post count: 689

    In general all count fields described in the talks are not byte counts but shifter tap counts; the taps themselves can be anywhere, and can be separated by any (static) number of bits. The bit layouts in the Wiki reflect this.

    So your reasoning is right 🙂

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Pipelining #1551

    Perfect scheduling for any machine (or more generally for any set of resources and consumers) is known to be NP. You can detect when you have found a perfect schedule, but unless the best possible schedule is also a perfect schedule you cannot tell whether you have the best schedule without exploring all other schedules. A solver changes the representation of the problem but not the difficulty. Good heuristics are the goal, and in practice they work fine.

    Feasibility is guaranteed because it is evident in the degenerate case and the algorithm will reduce to the degenerate case if it does not find a feasible solution earlier. This does not mean that it always produces the degenerate case; in fact it never does. One of the problems with CS results is that they are good at determining the limits of an algorithm, but determining and proving average behavior is much harder 🙂

    We don’t cache the scratchpad because deterministic latency is critical to its function. If varying latency were useful then we would use the memory cache and eliminate the scratchpad.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Simulation #1549

    The interesting part of doing Forth on a Mill is not the ops, which are straightforward, but in figuring out how to map the Forth dual-stack program model onto the belt. If you decide to pursue this, get acquainted with the Mill ISA (on the Wiki) and standard Forth, rough out the implementation strategy, and get back to us. If we are ready at that point, you can be a beta tester of the public sim; NDA will be required.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Pipelining #1545

    One must distinguish pipelining from vectorization; the two are often conflated but present very different issues.

    Essentially any closed loop can be scalar pipelined on a Mill, although it may not be worthwhile. Because the compiler does not have all the code available (due to specialize-time emulation substitution and inlining of external functions and templates) we must determine the applicability of SPL in the specializer; this is done after all the control-flow folding has been done and the code is ready to schedule. If-conversion (including pick) is done at this stage, as well as conventional control-flow simplification like jump-to-jump elimination. We also flop the sense of branches so that the fall-through case is the likely one (if profile info is available) or the longest path (which gives better schedules). We also recognize loops at this point, and replace exiting branches with leave ops and put an inner op in the loop head (the dominator that is the entry to the loop).

    The heuristic for SPL is straightforward: we count the latency of one full iteration (following the dataflow of execution) as if it were straight-line code, and count the number of ops in that code in the various resource categories. Because we know how many resources (of each category) are present on the target, we know the minimal number of issue cycles that would be needed if every op were completely packed into the available width. If the linear latency (following the dataflow) is twice or more the packed latency (maximal entropy) then we SPL. The "twice" is a heuristic; we could pack three linear into two packed, for example, but we don't have enough experience to know if it's worth it.
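
    To pin down the arithmetic, here is a toy C++ rendering of that heuristic; the names and the flat resource model are illustrative, not the specializer's actual code.

        #include <algorithm>
        #include <map>
        #include <string>

        struct LoopBody {
            int dataflowLatency;                  // latency of one iteration as straight-line code
            std::map<std::string, int> opCount;   // ops per resource category (ALU, mul, load, ...)
        };

        // Minimal issue cycles if every op were completely packed into the target's width.
        int packedLatency(const LoopBody& body, const std::map<std::string, int>& slotsPerCycle) {
            int cycles = 1;
            for (const auto& [category, count] : body.opCount) {
                int width = slotsPerCycle.at(category);
                cycles = std::max(cycles, (count + width - 1) / width);   // ceiling division
            }
            return cycles;
        }

        // SPL when the linear (dataflow) latency is at least twice the packed latency.
        bool shouldPipeline(const LoopBody& body, const std::map<std::string, int>& slotsPerCycle) {
            return body.dataflowLatency >= 2 * packedLatency(body, slotsPerCycle);
        }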

    Scheduling is pretty much the same as for open code, except that open code uses an extensible linear tableau to allocate slots to ops, whereas SPL uses a slice of the tableau that is packed-latency high and schedules it as a torus, with wraparound as it looks for free slots. The schedule is guaranteed to be feasible by the definition of the packed latency, but the placement is greedy and does not necessarily lead to a feasible placement (the bin-packing problem is NP). If it fails then the packed latency is increased by one and the whole mess is rescheduled. This is guaranteed to lead to a feasible placement, because eventually the increase in packed-latency will be enough for an open-code placement, which is always feasible.
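
    The retry loop itself is simple enough to sketch in C++; Schedule, Target, and placeGreedy below are placeholders for the real tableau machinery, assumed rather than shown.

        #include <optional>

        struct LoopBody;                    // ops and dependences of one iteration, as above
        struct Target   { /* per-member widths and latencies */ };
        struct Schedule { /* slot assignments on the torus */ };

        // Assumed helper: greedy bin-packing of the ops into a tableau slice that is
        // `interval` cycles high, wrapping around (torus) while it looks for free slots.
        std::optional<Schedule> placeGreedy(const LoopBody&, const Target&, int interval);

        Schedule scheduleLoop(const LoopBody& body, const Target& target, int packedLatency) {
            for (int interval = packedLatency; ; ++interval) {
                if (std::optional<Schedule> s = placeGreedy(body, target, interval))
                    return *s;   // feasible greedy placement found
                // Otherwise bump the interval and redo; a large enough interval
                // degenerates to the open-code schedule, which is always feasible.
            }
        }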

    Once placement is done, we calculate belt lifetimes of intermediates. This is the same as for open code, although we must distinguish the intermediates from different iterations that are simultaneously live. As with open code, those lifetimes are measured in terms of relative position on an infinite belt, rather than in terms of cycles, and if the relative position between producer and consumer exceeds the actual belt size then we insert spill and fill ops at the producer and the first consumer that needs a lost operand. There can be several of these of course. If this is the first spill/fill we add a rotate op at the top of the loop and put a scratchf op (scratchpad allocation) in the loop head; these get fixed up later when we know how much scratchpad will be used. If we added any spill/fill then we throw away the schedule and redo everything from the top with the modified code. This too is guaranteed to reach a feasible schedule, because the limit is for every produced value to be spilled and the belt reduced to a single position. In practice, one reschedule is needed ~20% of the time, two in single-digit percents, and we’ve not found any code that needs more.
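
    As a rough C++ picture of the spill/fill decision only (the Value and SpillFill types are invented for the illustration):

        #include <cstddef>
        #include <vector>

        struct Value {
            int dropPos;                      // position on the infinite belt where it is produced
            std::vector<int> consumerPos;     // belt position at each consuming op, in order
        };

        struct SpillFill { std::size_t value; std::size_t firstLostConsumer; };

        // A value whose producer-to-consumer distance exceeds the real belt length gets
        // a spill op at the producer and a fill op before the first consumer that lost it.
        std::vector<SpillFill> planSpills(const std::vector<Value>& values, int beltLength) {
            std::vector<SpillFill> plan;
            for (std::size_t v = 0; v < values.size(); ++v) {
                for (std::size_t c = 0; c < values[v].consumerPos.size(); ++c) {
                    if (values[v].consumerPos[c] - values[v].dropPos >= beltLength) {
                        plan.push_back({v, c});   // later consumers reuse the filled copy
                        break;
                    }
                }
            }
            return plan;
        }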

    The torus schedule is a normal schedule and may embed control flow, and represents the steady state. Any dataflow on the back arc to the entry point represents either a recurrence (for example the loop control variable) or a loop-carried temporary. The recurrences are those with a dataflow from the loop head (or earlier) into the loop; those with flows in from the head but not on a back arc are initialization, not recurrences. Initialization flows are handled like any other dataflow across a join point, and may require injection of a conforms op to get all the arcs into the top of the loop to have a congruent belt. Back-arc-only flows require creation of a None value in the loop head, which acts as an initializer just like explicit initialization. And true recurrences need both the initial value and the injection of a recurs op at the join. We could be smarter about recurs injection; it’s only needed if a None can reach the recurrence. Truth be told, there’s probably other cases where we could be smarter, but by and large the resulting code looks decent.

    While SPL needs very little from the compiler and is mostly specializer-only, vectorization needs a lot from the compiler. We need the compiler to tell us the iteration interval, the expected iteration count (if it can tell), and the strides of all recurrences. These are hard for the specializer to figure out, so we don't try to recognize vector loops in the specializer, so far anyway. From the target description we know what the vector size will be for each reference, and from the iteration interval we can tell whether that size is feasible or whether we have intra-vector recurrences. There are ways to handle such recurrences in the literature, but don't hold your breath. If it's feasible we transform the loop into one that is still a scalar loop but where each "scalar" is a vector of the appropriate size; this involves scaling the control-variable calculations. The smear operations for while-loops are inserted at this point, as are the trailing reductions when the loop yields a vector that must be reduced to a scalar. The result is just a normal loop, but with some funny operations in the code, and is scheduled (including SPL) normally.
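
    A hand-written C++ illustration of the shape of that transform; the lane count and the strip-mined remainder loop are mine, and the real pass of course works on the specializer's internal form, not on source.

        #include <cstddef>

        void scaleScalar(float* a, float k, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i)
                a[i] = k * a[i];
        }

        constexpr std::size_t kLanes = 8;   // stand-in for the member's vector width

        void scaleVectorized(float* a, float k, std::size_t n) {
            std::size_t i = 0;
            for (; i + kLanes <= n; i += kLanes)                   // control variable scaled by lane count
                for (std::size_t lane = 0; lane < kLanes; ++lane)  // conceptually one vector op
                    a[i + lane] = k * a[i + lane];
            for (; i < n; ++i)                                     // remainder handled in scalar
                a[i] = k * a[i];
        }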

    Only some of all this is working as I write this; in particular none of what the specializer needs from the compiler is yet available. If you have compiler experience (especially LLVM) and are willing and able to put in significant time on a sweat-equity basis then send me a resume.

  • Ivan Godard
    Keymaster
    Post count: 689

    Whether power-shorting on NaR (and None) arguments pays depends on how common they are and whether the case can be detected early enough to take effect – it can take more than a cycle for a signal to cross a multiplier! Power shorting might pay on a native divide, but a multiplier would likely have already committed the power before it could figure out that it didn’t need to do so. And of course we don’t expect to be implementing native divide.

  • Ivan Godard
    Keymaster
    Post count: 689

    Control-flow join points must normalize the belt so that each live operand is in the same place regardless of how the join was reached. The best normalization code depends on the number of live values and their respective positions.

    The completely general solution is to choose the belt layout of one inbound path as canonical, and add a conform operation to each other path to force the belt on that path to match (conform with) the canonical. If you have profile info then the canonical path should be the one most commonly taken.

    If the values to be normalized are in the same relative order on all paths (although not the same actual positions) then putting a rescue operation on each path gives a cheaper encoding than conform.

    If the code between the dominator and the join on each path is speculable (as in your example) then if-conversion will give by far the best code, replacing the branch with one or more pick operations. In modern fabrication processes it costs little more to do a wasted arithmetic operation than to leave the functional unit idle for that cycle. However, if there is more code than available FU cycles this speculation strategy will increase latency, so using pick is most advantageous on wide family members. Even so, overall program performance may still be better with pick if the branch does not have consistently predictable behavior; the speculation latency of the pick will be exceeded by the mispredict penalties of the branch.

    Lastly, in some cases (including your example) it is possible to do a poor-man’s conform using a null arithmetic operation, typically ORing with an immediate zero. Then if the code takes the true path there will be a result dropped by the add, and if it takes the false path then a copy of the value will be dropped by the OR, and in either case the correct value will be at the front of the belt for further processing.

    The specializer algorithm currently if-converts everything if the operations are speculable, and it inserts conform ops if not. A later peep-hole step then changes the conforms to rescues if that is possible.
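
    In source-level C++ terms (just an illustration, not specializer output), the speculable diamond and its if-converted form look like this:

        // Original: a short diamond whose two sides are both speculable.
        int branchy(bool p, int a, int b) {
            int r;
            if (p) r = a + b;    // true path drops a new result
            else   r = a;        // false path just forwards an existing value
            return r;
        }

        // If-converted: both sides run unconditionally and a pick/select keeps one.
        // The `a | 0` is the poor-man's conform: a null OR that re-drops the value
        // so the wanted operand ends up in a known place on either path.
        int ifConverted(bool p, int a, int b) {
            int sum  = a + b;    // speculable, so safe to execute regardless of p
            int copy = a | 0;
            return p ? sum : copy;
        }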

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Specification #1516

    The Mill already defines signed division, but not in the sense you mean. DIVU accepts unsigned arguments and so the ambiguity for signed does not exist. DIVS accepts signed arguments and produces the C-language result, as is standard in the programming languages that the hardware must accept. We expect that division will be an emulated op (rather than hardware native) on nearly all Mill members. We define a reciprocal-approximation helper op to get the emulation started, and the helper will be native. In practice, the helper plus the software Newton-Raphson of the emulation is nearly as fast as a hardware version would be.
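
    Roughly, the emulation has this shape (C++ sketch; approxRecip deliberately truncates a library reciprocal to stand in for the native helper op, and the precision numbers are illustrative):

        #include <cmath>

        // Stand-in for the native reciprocal-approximation helper: keep only ~8
        // mantissa bits so the refinement steps below have work to do.
        float approxRecip(float d) {
            int exp;
            float m = std::frexp(1.0f / d, &exp);                     // m in [0.5, 1)
            return std::ldexp(std::floor(m * 256.0f) / 256.0f, exp);  // truncate to ~8 bits
        }

        // Software Newton-Raphson refinement: each step roughly doubles the good
        // bits, so two steps reach full float precision, then one multiply divides.
        float nrDivide(float n, float d) {
            float x = approxRecip(d);         // x0 ~= 1/d
            x = x * (2.0f - d * x);
            x = x * (2.0f - d * x);
            return n * x;                     // quotient = n * (1/d)
        }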

    It is easy to specify another operation, but it still must be implemented, tested, verified, and documented, and that is not without cost. Consequently we only add ops when their use can improve performance of widely used languages, or where they are needed by critical applications that will access them through intrinsics.

    So far we know of no language or app that would justify another division. It is quite rare for an app to have division as a performance-critical step, but if you are aware of any then please bring them to our attention.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Security #1485

    There is considerable member-dependent variation possible in the formats of these things, and there are probably variations that we haven’t thought of yet that will be used some day. Consequently you should see this answer as representative rather than graven.

    ID sizes (turf, task) are member dependent. The most important constraint on their sizes is that the combination of a task id and a turf id is used as the implicit address to locate the root stacklet in a portal call, and hence the two must fit in a virtual address (60 bits), together with a stacklet size and enough bits for non-stacklet memory as well. All these bit fields are per-member fixed, and the member designer can increase one only by shrinking another. A representative number might be 23 bits each for turf and task ids.
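
    Purely to illustrate the arithmetic, using the representative 23-bit sizes (the real per-member field widths and layout are whatever the member designer chose):

        #include <cstdint>

        constexpr unsigned kAddrBits     = 60;   // virtual address width
        constexpr unsigned kTurfBits     = 23;   // representative; per-member in reality
        constexpr unsigned kTaskBits     = 23;
        constexpr unsigned kStackletBits = 13;   // illustrative stacklet-size field
        static_assert(kTurfBits + kTaskBits + kStackletBits < kAddrBits,
                      "must leave room for non-stacklet memory as well");

        // The (turf, task) pairing forms the implicit address of its root stacklet.
        std::uint64_t rootStackletBase(std::uint32_t turfId, std::uint32_t taskId) {
            std::uint64_t addr = (std::uint64_t)turfId << (kTaskBits + kStackletBits);
            addr |= (std::uint64_t)taskId << kStackletBits;
            return addr;   // the low kStackletBits bits index within the stacklet
        }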

    We don't seem to be constrained by the size of a portal data structure, although I suspect that when we implement that part of an OS we'll find we want to put more, or more kinds of, data in the portal than we have thought of so far. There is some need for the access to be power-of-two sized and aligned, and for hardware reasons it will probably be a multiple of the vector size even if we don't need the bits.

    In general, such member-specific details are of interest only to the OS and will sit behind a member-independent API.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Pipelining #1467

    Well, now that you ask, I’ll go with “pedantic” :-)

    Yes, the total number of operations performed by a loop is the product of the number of operations per iteration and the number of iterations, which is finite if the loop ever terminates. Termination is in general undecidable but is assured for many loops of interest.

    I prefer to avoid a 20-minute digression into the halting problem in a public general-interest talk. The point of the “unbounded” remark is to refute the widely bandied-about claim that MIMD (and width in general) is pointless because there is only 2X ILP in code. The data of the studies showing 2x is reasonable, but the interpretation of that data is misleading and the design conclusions are flat-out wrong.

    Yes, there are loops with iteration counts so low (like, one) that the total operations count cannot fill a MIMD design. But by definition such loops are not where the program spends its time, and are irrelevant. The loops that matter all last long enough to be piped. More, they last long enough that no practical MIMD design can run out of iterations, so from the designer’s point of view, yes, they are unbounded.

    The 2x claim has been used to hamstring designs in open, non-loop code also. To go from saying "there are on average only two operations that are independent of each other and hence can issue together" to saying "more than two pipes are wasted" is to assume that issue must be sequential. Open code on the Mill gets 6 operations in an instruction from 2x-ILP code because phasing lets us have three steps of a dependency chain going at once. The 2x data is not wrong, but the conclusions drawn from that data show a woeful lack of imagination.

  • Ivan Godard
    Keymaster
    Post count: 689

    Thanks for the details.

    I’m working by eyeball here, but it appears that both the interpolate and copy vectorize and pipeline. Ditto SAD. The dct is interesting. You’d get pretty good results in scalar on a wider Mill; there’s no great reason to use vector at all, except for encode density. It’s got 48 scalar ALU ops and 32 multiplies on its face; on a Gold that would be 10 cycles, plus a cycle for function overhead, per call, with no compiler optimization at all. However, there’s quite a bit of CSE in the code, and the shift suggests that you’re actually working in fixed-point, not integer, so that might shrink when the compiler is done with it.

    The interesting question is whether a SIMD code (like you are using on Haswell) is actually any better than a straightforward MIMD implementation, and if it is, whether a tool chain could find it. Looking at the 0_7 column, the constants are clearly a permute of each other. In Mill SIMD with more than one multiplier you would not want to implement them as a permute though; better to load them, unless you are hurting for L1 for other reasons. They are all small values, so the tool chain might figure out to load them as a byte vector and widen them for use on members with less than 32-byte vectors.

    Then there's building the add/sub_0_7 vector. That's two splats, a mask, and a pick on a Mill, but what's really going on is an interleave. The opset has a de-interleave ("alternate"), but no interleave; this is a case where one would be useful. However, it's doubtful that the tools would figure out to use it. The interleave could be emulated by a double-vector shuffle, but that's a two-slot gang op and has a longer latency than a specific interleave would (an interleave op would be just a dyadic splat, one cycle). A dyadic permute will also work in this case, and is a better choice than the general shuffle.
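
    For readers who haven't met the terms, in plain scalar C++ (no Mill ops) interleave and de-interleave are just:

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // Interleave: merge two lane sets as a0,b0,a1,b1,... (what building the
        // add/sub_0_7 vector amounts to). Assumes a.size() == b.size().
        std::vector<std::int16_t> interleave(const std::vector<std::int16_t>& a,
                                             const std::vector<std::int16_t>& b) {
            std::vector<std::int16_t> out(a.size() * 2);
            for (std::size_t i = 0; i < a.size(); ++i) {
                out[2 * i]     = a[i];
                out[2 * i + 1] = b[i];
            }
            return out;
        }

        // De-interleave ("alternate"): split the even and odd lanes back apart.
        void deinterleave(const std::vector<std::int16_t>& v,
                          std::vector<std::int16_t>& even, std::vector<std::int16_t>& odd) {
            even.clear(); odd.clear();
            for (std::size_t i = 0; i + 1 < v.size(); i += 2) {
                even.push_back(v[i]);
                odd.push_back(v[i + 1]);
            }
        }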

    The multiply is then a straight vector op. The +/- between the columns is a pair of ALUs, a mask and a pick. So if I have this right, the critical path using vectors is: splat, pick, mul, ALU, pick, ALU, pick, ALU, shift. That’s 8 clocks because the picks hide in the phasing. However, we won’t have 32 multipliers, so those have to be piped, which will add three more clocks likely.

    Given the slop in these guesstimates, it’s not clear whether a SIMD or MIMD would win out, although for sure the MIMD would warm up the iCache :-) The MIMD is trivial for the tool chain; recognizing the SIMD would be an interesting exercise. Another exercise is what to do if the member vector is >32 bytes; again, trivial in MIMD, not so much in SIMD.
