Forum Replies Created

  • Ivan Godard
    Keymaster
    Post count: 689

    Yes, many-core Mills are as possible as many-core anything, with the same advantages and drawbacks if done naively.

    We do not expect to be naive, but the differences are still NYF.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Russian CPU Elbrus #2943

    There are no ISA details (for the native ISA, not the x86 translation) available in English that I know of.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: code examples? #2931

    Somehow your several recent posts didn’t get put on my feed until this one sprang something loose. I’ll answer those others separately; sorry for the delay.

    Both of these codes involve loops, and getting good code for them requires that the specializer do software pipelining at a minimum, and auto-vectorization where relevant. The tool chain isn’t up to that yet. When it is we’ll do another mini-talk, and your codes might be the subject. No promises though 🙂

  • Ivan Godard
    Keymaster
    Post count: 689

    This is a nice little test; thank you for suggesting it. We might work it up into a talk like the switch talk. However, it really needs both pipelining and auto-vectorization, both of which are torn apart in the tool chain right now; plain scalar code isn’t really that informative. So let us beat the tools back into shape and I’ll post the result here when it’s ready.

  • Ivan Godard
    Keymaster
    Post count: 689

    We plan to record and post the video with our other talks. Video gremlins permitting…

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: Prediction #2868

    Reposted from comp.arch with permission, as likely of general interest:
    ------------------------------------------------------------

    On 6/28/2017 1:47 PM, Stephen Fuld wrote:

    > The discussion in another thread about branch prediction versus exit prediction got me thinking, and that led me to realize that I didn’t fully understand Mill exit prediction. So let me ask some questions as a means of increasing my understanding.

    > What exactly are you keeping? The address of the instruction that is predicted to cause the exit? The address of the EBB that is predicted to be executed next? The number of cycles until the exit? I can see advantages for each, and perhaps even multiple things.

    The latter two; see below.

    > I presume the “index” for the predictor is the starting address of the EBB. Is that correct?

    More likely the entry index into the SRAM prediction table. The entry address of the executing ebb is in a specReg (whimsically called “entryReg”). That plus the target offset from the prediction gives the entry address you will exit to. Rinse, repeat.

    > If the EBB contains a (perhaps) conditional subroutine call, is that temporary exit from the EBB handled by the exit predictor? I see problems with both yes and no answers. Or is there some other mechanism to handle this case?

    Yes, every transfer of control involves a prediction. Eventually there will be a return prediction which takes the target from a conventional return-address stack.

    > What about conditional branches within the EBB? Are they not predicted at all?

    There is no such thing on a Mill. You can branch to the beginning of yourself (one-ebb loops are *very* common on the Mill), but you can’t branch into the middle of any ebb including yourself.

    > Thanks in advance for helping my understanding.

    This is more fully explained in http://millcomputing.com/technology/docs/#prediction, but here’s an overview:

    Each logical entry contains some prediction history state, the details depending on the prediction algorithm(s) used; those are not program-visible and may vary by member. Possible keys include the entry address of the ebb that is going to exit; some history of such addresses; some perceptron or TAGE logic derived from such addresses; and so on. In general the Mill can use the same algorithms as any other ISA, though there are fewer entries (one per ebb rather than one per branch) and each entry carries more prediction state (as described below, vs. one taken/untaken bit). However it is done, the prediction logic must produce a prediction-table entry index and from that a prediction.

    Each prediction contains two critical pieces, plus assorted niceties. One is the target, conceptually the address of the next ebb to execute. In practice it is the offset from the entry of the executing ebb, because all Mill branch-like things are entry-relative. As a result we don’t have to have a 60-bit address for each. In effect entry offsets form a linked list in the prediction table, so the prefetch logic can run arbitrarily ahead of fetch and execution. Other than representation, the target is similar to what is in a BTB.

    The other crucial part of a prediction is a duration: the number of cycles between when the ebb is entered and when it exits. Duration is statically known; Mill is a statically scheduled exposed pipeline machine. If the duration in the prediction is three (for example) then the fetch and decode logic knows to fetch and decode exactly three (very wide) instructions before starting to fetch from the next ebb in the target chain.
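    In rough C, the chain-walk might look like this (all names, types, and helper functions here are invented for illustration, not the actual hardware interface):

    #include <stdint.h>

    /* Hypothetical prediction-table entry: entry-relative target plus duration. */
    struct Prediction {
        int32_t  target_offset;  /* offset from this ebb's entry to the next */
        uint32_t duration;       /* instructions to fetch before the exit */
    };

    /* Assumed helpers: table lookup and instruction fetch. */
    extern struct Prediction *lookup(uint64_t entry_addr);
    extern void fetch(uint64_t entry_addr, uint32_t duration);

    /* Walk the predicted chain, arbitrarily far ahead of decode and execute. */
    void prefetch_ahead(uint64_t entry, int depth) {
        for (int i = 0; i < depth; i++) {
            struct Prediction *p = lookup(entry);  /* index the SRAM table */
            if (!p) break;                         /* no prediction: stop here */
            fetch(entry, p->duration);             /* exactly that many instructions */
            entry += p->target_offset;             /* offsets form a linked list */
        }
    }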

    Some of the niceties: rather than the duration being in cycles from the beginning of the ebb, the duration is in cache lines plus instructions within the last cache line. This tells the prefetcher how many lines to munge, and also reduces the number of bits needed for duration in large ebbs. Duration can overflow any fixed bit assignment of course; we have not yet determined whether injecting a dummy branch or simply permitting a mispredict at the end of large ebbs is better. Currently our working sims use a dedicated set of logN representations for long durations in lines. This will always mispredict (5 cycles) and may be over-eager in fetching from DRAM, but so far it looks like over-eager is better than under-eager. Pending measurement.

    Another nicety: Mill code can have several (possibly conditional) calls in a single instruction; taken calls are executed in slot order and each can see the results of its predecessors. Each unconditional or taken call will have its own prediction, so an instruction-count duration is ambiguous if there is more than one. Consequently the duration is not just cache-lines+instructions but cache-lines+instructions+taken-calls.
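    A minimal sketch of such a duration, with invented field widths (the real allocation is per-member and NYF):

    /* cache-lines + instructions-in-last-line + taken-calls; widths invented */
    struct Duration {
        unsigned lines : 5;  /* whole cache lines the prefetcher must munge */
        unsigned insts : 3;  /* instructions within the last line */
        unsigned calls : 2;  /* taken calls inside the ebb, to disambiguate */
    };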

    There is a pretty conventional return predictor; a table entry for a return uses a dedicated target value to indicate that it is a return.
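    Purely for illustration (depth and names are invented), a return predictor of this sort is just a circular stack of return addresses:

    #include <stdint.h>

    #define RAS_DEPTH 16u  /* power of two so wrap is a mask; depth invented */
    static uint64_t ras[RAS_DEPTH];
    static unsigned ras_top;

    /* Push on a predicted call, pop on a predicted return; overflow simply
       wraps, costing a mispredict when the lost entry is finally needed. */
    static void ras_push(uint64_t ret_addr) { ras[ras_top++ & (RAS_DEPTH - 1u)] = ret_addr; }
    static uint64_t ras_pop(void)           { return ras[--ras_top & (RAS_DEPTH - 1u)]; }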

    There is a transfer-kind code in the entry. This is used both to distinguish returns and to let the transfer hardware get a head start on the side-effects of the transfer: calls create stack and spiller frames and returns unwind them, the leave op discards in-flights, etc. Mill is very CISC, in the sense that single ops do a lot, not in the VAX-like sense in which ops are composed out of subops that must be parsed and interpreted in the context of the rest of the instruction; unlike x86 there is at most one memory access per operation. There’s good and bad in that. On the one hand a call (say) is conceptually simple, encodes densely, and is free of special cases. On the other hand the ISA gives you exactly one way to make a call, and you can’t tailor your own call protocol.
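    The kinds themselves might be modeled like this (names invented; the real encoding is NYF):

    /* Transfer-kind code in a prediction entry, letting hardware start the
       side-effects of the transfer early; names are illustrative only. */
    enum XferKind {
        XFER_BRANCH,  /* plain transfer to another ebb */
        XFER_CALL,    /* create stack and spiller frames */
        XFER_RETURN,  /* unwind frames; target from the return predictor */
        XFER_LEAVE    /* discard in-flights */
    };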

    A great many loops on a Mill are one or only a few instructions in one or a few ebbs. As in many architectures there is a loop buffer on a Mill, or rather two of them. One of them holds the hardware signals coming from the decoder; the decoder is a 3-cycle process so we buffer up to three signal sets, and a special prediction tells the decode-and-issue to feed from the signals buffers rather than decoding around the loop. This doesn’t speed anything up (we issue every non-stall cycle regardless), but it does save power. The second loop buffer is the i$0 microcache that holds some small (per member) number of cache lines fully associative. If the lines of a loop, whether one or several ebbs, fit in i$0 then we avoid banging on the main instruction cache, again saving power.
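    A toy model of the i$0 lookup (size and names invented; real hardware compares all the tags in parallel rather than looping):

    #include <stdint.h>
    #include <stddef.h>

    #define I0_LINES 8  /* "some small (per member) number" of lines */

    struct I0Line { uint64_t tag; int valid; uint8_t bytes[64]; };
    static struct I0Line i0[I0_LINES];

    /* Fully associative: any line can hold any address, so check them all. */
    const uint8_t *i0_lookup(uint64_t line_addr) {
        for (int i = 0; i < I0_LINES; i++)
            if (i0[i].valid && i0[i].tag == line_addr)
                return i0[i].bytes;  /* hit: the main instruction cache stays idle */
        return NULL;                 /* miss: go bang on the main i$ */
    }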

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: The Belt #2969

    A multi-accumulator (MA), like the belt, has the advantage that the encoding does not need to specify a destination. It can also often avoid specifying one source, but pays for that with move operations to load the accumulator. You suggest using each FU as an accumulator machine, which is fine for ALUs, but some FUs have no result to accumulate. In addition, encoding would like a power-of-two number of addresses, while FUs are rarely provisioned in power-of-two numbers. Lastly, even a use-once value may still need to be saved: (a OP1 b) OP2 (c OP3 d), where all ops use the same FU (see the sketch below). All this can be worked out, but which approach would have the better entropy in real codes would depend on details of the encodings and the workload.
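    A minimal sketch of that use-once case, with +, -, and * standing in for OP1, OP3, and OP2, all executing on a single-accumulator FU:

    /* (a OP1 b) OP2 (c OP3 d) when one FU's accumulator carries every result */
    int eval(int a, int b, int c, int d) {
        int acc, tmp;
        acc = a + b;      /* first subexpression lands in the accumulator */
        tmp = acc;        /* the extra move: save it before reusing the FU */
        acc = c - d;      /* second subexpression overwrites the accumulator */
        acc = tmp * acc;  /* final op consumes the saved copy */
        return acc;       /* on the belt, each result just gets a fresh name */
    }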

    A larger issue is hazards: the belt is SSA and there are no W-W, R-W, or W-R hazards, while an accumulator is updatable and so someone must deal with hazards.

    Because the Mill transiently holds results at the FU outputs, one can see it as a kind of MA in terms of routing. It is not MA-encoded; we are willing to pay the cost of the address remapping to avoid hazards and have efficient spill.

  • Ivan Godard
    Keymaster
    Post count: 689

    You should ask on the comp.arch newsgroup; there are regular posters there who are knowledgeable in such things.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: switches #2941

    The machine has no physical ordering required here, so the choice of pick vs. call order is arbitrary. We chose pick-after-call several years ago, before there were predicated calls, based on whether we thought f()?x:y was more or less common than b?f():x. Perhaps we should re-examine that. The way our devtech works, it would be three one-line changes in the code base to switch, and all the tools would be using the other order.

    Or we might leave it as is until we have enough of a testbase for measurement both ways 🙂
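    For concreteness, the two source shapes being weighed (hedged C; f, b, x, and y stand for arbitrary expressions):

    extern int f(void);

    /* pick-after-call favors this: the call's result feeds a later pick */
    int shape1(int x, int y) { return f() ? x : y; }

    /* pick-before-call would favor this: a predicate gates the call itself */
    int shape2(int b, int x) { return b ? f() : x; }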

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: switches #2938

    Pick phase is after call phase, and you got it right from the doc. For some reason I was thinking of branches, for which the predicate is evaluated in writerPhase, after the pick. Sorry for the brain freeze 🙁

    Hence your pick-and-call needs an extra instruction. As for ranges of calls, currently the compiler transforms your example into (in effect):

    if (x < 0) return 0;
    else if (x < 3) return bar();
    else if (x < 6) return baz();
    else return 0;
    

    Your pick-based solution is better than this, even with the extra instruction, but it’s unclear that the case is common enough to look for.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: switches #2933

    The basic problem with using the vector hardware for scalar operations is what you do when you want a vector betw(). We have gone to considerable trouble to make sure that every scalar operation can be applied element-wise to a vector. The reverse is not true: there are reducing vector operations (like any() and all()) that are not meaningful in scalar.
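    A sketch of the asymmetry (lane count and names invented; the hardware applies the op per element in parallel rather than looping):

    #include <stdbool.h>
    #define LANES 8

    /* the scalar betw() applied element-wise across a vector: always possible */
    void vbetw(bool out[LANES], const int x[LANES], int lo, int hi) {
        for (int i = 0; i < LANES; i++)
            out[i] = (lo <= x[i]) && (x[i] < hi);
    }

    /* a reducing op: many lanes in, one scalar out -- no scalar analogue */
    bool any(const bool v[LANES]) {
        bool r = false;
        for (int i = 0; i < LANES; i++) r |= v[i];
        return r;
    }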

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: switches #2932

    Pick is a good way to implement the within() and without() functions (or their inverses, includes() and excludes()). For a range check that branches rather than producing a boolean result, one can just use two branches. For calltr1 you need a single bool, and the pick works. Yes, the predicates on predicated ops like calltr1 can see a pick; ain’t phasing wonderful?
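    One branch-free rendering in C (illustrative only; on the Mill the two compares issue in the same instruction and the combine is a pick or logical op):

    #include <stdbool.h>

    /* boolean range checks; & rather than && keeps them branch-free */
    bool within(int x, int lo, int hi)  { return (lo <= x) & (x < hi); }
    bool without(int x, int lo, int hi) { return (x < lo) | (hi <= x); }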

    We don’t generate range checks in switches; just a chain of less-than compares. Each less-than peels off the next range in sorted order.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: switches #2918

    You’re welcome. It was a nice tidy test case that pointed up different aspects of the architecture, which is actually hard to do – most tests are too big or too complex to grok in the context of a talk. Feel free to cite the credit in your CV 🙂 Veedrac got a credit too, did you notice?

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: switches #2888

    It usually takes a couple of weeks to do the video editing. There will be an announcement mail when it goes up.

  • Ivan Godard
    Keymaster
    Post count: 689
    in reply to: switches #2859

    There’s an -Os setting to specialize for space. At present it shuts off cloning speculable operations up into inbound control-flow arcs, except in the special case where the EBB has only one op (commonly a retn), so there is no size bloat because the cloned op replaces the otherwise necessary branch.

    We don’t unroll loops at present, and actually have had quite a bit of argument with the LLVM code that really, really wants to unroll. There is one case where we know we want to unroll: loop pipelines in which there is enough hardware functionality to do more than one iteration concurrently and there are no loop-carried dependencies. However, the specializer isn’t smart enough yet to do that.
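    The known-good case, illustrated in C (the example is invented; on the Mill it is the specializer, not the programmer, that would do this):

    /* two independent iterations per trip: no loop-carried dependencies,
       so both can run concurrently if the hardware functionality is there */
    void scale(float *a, const float *b, int n, float k) {
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            a[i]     = b[i]     * k;
            a[i + 1] = b[i + 1] * k;
        }
        if (i < n)        /* odd element left over */
            a[i] = b[i] * k;
    }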
