Forum Replies Created

  • rolandpj
    in reply to: Prediction #3045

    An idea for debunking – this seems as good a place as any.

    A fundamental issue that branch prediction attempts to solve is that there is no explicit clue as to a branch (or computed jump, call, or return) target until the actual PC-mutating instruction itself is decoded.

    This is directly analogous to the memory load problem – you don’t see the memory address until the very instruction in which the load result is supposed to be available.

    So… why not solve it with explicit support for preload, rather than prediction?

    In more (deliberately vague) detail:

    Add a second register/buffer in the decode stage. Include an explicit ISA instruction to fetch the alternative instruction stream. As with ‘load from memory’, you can do this as soon as you know what the next (possible) PC/IP target is, which for most conditional and absolute branches/jumps/calls is known at compile time. C++/JVM etc. ‘virtual’ call destinations, while dynamic, are also typically available at least several instructions before the actual PC/IP switch point (from linear execution).

    You would, of course, want to implement this in hardware at the level of i-cache lines, probably 32- or 64-byte units. At all times your primary source of instructions is the linear execution path, but you keep a prepared buffer holding the line at the ‘alternative’ (possibly conditional, virtual, or return) address.

    The actual branch/jump/call/return instruction then reduces to a ‘switch-instruction-buffer’.

    Even better, call/return for leaf call-points falls out naturally – the return point is immediately available as the alternative instruction buffer after the ‘call’.
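    To make the mechanism concrete, here is a minimal C sketch of the decode-stage state: two line-sized buffers, a preload that fills the alternative one, and a transfer that just swaps them. Everything here (LINE_BYTES, op_preload, op_switch_if) is an invented name for illustration, not anything the Mill defines.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define LINE_BYTES 64  /* hypothetical i-cache line size */

        /* One decode-stage instruction buffer: a cache line plus its address. */
        typedef struct {
            uint64_t base;              /* address the line was fetched from */
            uint8_t  bytes[LINE_BYTES]; /* the instruction bytes themselves */
            int      valid;
        } ibuf_t;

        static ibuf_t primary, alternate;

        /* Stand-in for the i-cache: fetch the line containing addr. */
        static void fetch_line(ibuf_t *b, uint64_t addr) {
            b->base = addr & ~(uint64_t)(LINE_BYTES - 1);
            memset(b->bytes, 0, sizeof b->bytes); /* real hardware reads the cache */
            b->valid = 1;
        }

        /* Hypothetical PRELOAD: issued as soon as the (possible) target is
         * known, typically several instructions before the transfer itself. */
        static void op_preload(uint64_t target) {
            fetch_line(&alternate, target);
        }

        /* The branch/call/return reduces to 'switch instruction buffer': if
         * taken, the alternate buffer becomes the primary, with no fetch
         * bubble; the old fall-through line becomes the new alternative. */
        static void op_switch_if(int taken) {
            if (taken && alternate.valid) {
                ibuf_t t = primary;
                primary = alternate;
                alternate = t;
            }
        }

        int main(void) {
            fetch_line(&primary, 0x1000); /* linear execution path */
            op_preload(0x2040);           /* compile-time-known target */
            /* ... several linear instructions execute here ... */
            op_switch_if(1);              /* the actual transfer */
            printf("now decoding from line 0x%llx\n",
                   (unsigned long long)primary.base);
            return 0;
        }

    The point of the sketch is that the entire cost of a correctly preloaded transfer is the swap in op_switch_if – no predictor, no fetch bubble.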

    I seem to recall some history in this approach, but I don’t have any references at hand.

    At a high level, it is entirely analogous to ‘prefetch… load’, which the Mill does explicitly as I recall.

    So, is branch prediction actually necessary, or is it a reaction to pipelining on ISAs that didn’t originally anticipate being pipelined?

    😀

  • rolandpj
    in reply to: The Belt #2963

    The Belt vs Multi-Accumulator

    What is the advantage of the belt abstraction over a multi-accumulator abstraction? In particular, why not treat each functional unit as an independent accumulator machine?

    As noted in some of the talks, only 14% of results are used more than once.

    The belt ISA (register) abstraction provides a bit-wise efficient encoding, since it eliminates the destination register. However, a multi-accumulator model typically requires even less – just one explicit argument per operation, which can source values from other functional units. Basically, the compressed instruction is a bitmap of active functional units, and the second (and any further) operand for each operation is explicitly provided in a variable-length instruction trailer.
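    A toy decoder makes the encoding concrete. Everything here is invented for illustration – an 8-FU machine, a one-byte FU bitmap, and one operand byte per active FU in the trailer:

        #include <stdint.h>
        #include <stdio.h>

        #define NUM_FU 8 /* invented machine width */

        /* Decode one instruction: a one-byte bitmap of active functional
         * units, then one operand byte per set bit in the trailer. Each FU
         * is its own accumulator, so no destination field is needed. */
        static int decode(const uint8_t *instr) {
            uint8_t mask = instr[0]; /* which FUs fire this cycle */
            int trailer = 1;         /* operands follow the bitmap */
            for (int fu = 0; fu < NUM_FU; fu++) {
                if (mask & (1u << fu)) {
                    uint8_t src = instr[trailer++]; /* 2nd operand: another
                                                       FU's accumulator */
                    printf("FU%d: acc%d = acc%d OP acc%d\n", fu, fu, fu, src);
                }
            }
            return trailer; /* total instruction length in bytes */
        }

        int main(void) {
            /* FUs 0 and 3 active; their second operands come from acc3 and
             * acc0 respectively. Total length: 3 bytes. */
            const uint8_t instr[] = { 0x09, 3, 0 };
            printf("length = %d bytes\n", decode(instr));
            return 0;
        }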

    This architectural design seems orthogonal to other aspects – memory management, load/store stations, scratchpad, spill mechanism, etc.

    Admittedly the belt provides a neat function-call abstraction. On the other hand, a callee-save instruction could efficiently push accumulators towards the spiller, and a ‘call’ instruction could efficiently marshal parameters from the accumulators.

    A natural and maybe pragmatic extension is to provide multiple accumulators per functional pipeline – somewhere between a full register bank and a true multi-accumulator model.

    Per the comment in talk #3 about extending the backless memory model to the heap: this might be critical for JVM middleware – typically the stack contains only pointers into the heap, and thread-local Eden space is effectively treated like the traditional stack in C(++).

  • rolandpj
    in reply to: The Belt #2976

    I don’t seem to be able to edit, but more blah.

    My suspicion, as an ex-compiler guy, now middle-ware corporate, and occasional wannabe hardware dude, is that single-threaded performance requires a few things. Obviously massive ops/instruction. But more importantly, collapsing of local branches to enable that – without it, the ILP of general-purpose code is branch-limited. So, why not have predicated execution of the operations in an instruction, via a bit-mask, as an extension of single-instruction predication (per 32-bit ARM)? The masked-out operations produce None, or are NOPs. For wider instructions, allow multiple masks.
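    As a software model of what I mean – a hypothetical 4-slot instruction where the condition is expanded into a slot mask, and masked-out slots simply produce nothing (all names and widths invented):

        #include <stdint.h>
        #include <stdio.h>

        #define SLOTS 4 /* invented instruction width */

        typedef struct { int dst, a, b; } op_t; /* every slot an 'add', for brevity */

        /* Issue one wide instruction under a predicate mask: masked-out
         * slots are NOPs and produce no result. */
        static void issue(const op_t ops[SLOTS], uint8_t mask, int regs[]) {
            for (int s = 0; s < SLOTS; s++) {
                if (mask & (1u << s))
                    regs[ops[s].dst] = regs[ops[s].a] + regs[ops[s].b];
                /* else: this slot produces None / is a NOP */
            }
        }

        int main(void) {
            int regs[8] = { 0, 1, 2, 3 }; /* r4..r7 start at 0 */
            /* if (cond) { r4 = r1+r2; r5 = r2+r3; }
             * else      { r6 = r1+r3; r7 = r1+r1; }
             * collapses into one instruction plus a mask: */
            op_t ops[SLOTS] = {
                { 4, 1, 2 }, { 5, 2, 3 }, /* then-side slots */
                { 6, 1, 3 }, { 7, 1, 1 }, /* else-side slots */
            };
            int cond = 1;
            issue(ops, cond ? 0x3 : 0xC, regs); /* 0b0011 or 0b1100 */
            printf("r4=%d r5=%d r6=%d r7=%d\n", regs[4], regs[5], regs[6], regs[7]);
            return 0;
        }

    The branch disappears from the instruction stream; only the mask computation remains.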

    I share the cynicism that some of the VLIW audience showed at one of your talks – namely, that there just isn’t enough apparent ILP available. We have to waste speculative energy to change that.

    For JVM middleware, most compiler speculation is around virtual method calls versus concrete run-time types. The JIT can unwind that, much as a C++ compiler can, to predict virtual call targets, and/or it can specialize for known concrete run-time types – it ends up looking like branches in the JIT-ed code. In addition, most commercial (managed-environment) software is now strictly pointer-bound. Unlike carefully-crafted C(++), there is no language support for stack-local structures, so apparent load/store activity is huge. Garbage collection is important, and it’s unclear exactly what hardware facilities would be useful. Local allocation areas (Eden space) are organised as stacks, independent of the conceptual call-stack, but promotion out of Eden needs to be handled.
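    Concretely, the virtual-call speculation above lowers to something like this guarded, speculatively inlined call site – a cheap type check in front of the inlined body, with real dispatch as the fallback. Class and method names are made up:

        #include <stdio.h>

        typedef struct obj obj_t;
        typedef void (*method_t)(obj_t *);
        struct obj { const void *klass; method_t virt; };

        /* The speculatively inlined body for the predicted type. */
        static void inlined_add(obj_t *o) { (void)o; printf("inlined ArrayList.add\n"); }

        /* Fallback: the true virtual dispatch through the vtable slot. */
        static void virtual_add(obj_t *o) { o->virt(o); }

        static const char expected_klass[] = "ArrayList"; /* predicted type */

        /* One JIT-compiled call site: a cheap pointer compare guards the
         * inlined fast path; a miss falls back to real dispatch (or, in a
         * real JIT, deoptimizes). */
        static void jit_call_site(obj_t *o) {
            if (o->klass == expected_klass)
                inlined_add(o);  /* guard hit: just a predictable branch */
            else
                virtual_add(o);  /* guard miss: indirect call */
        }

        static void other_add(obj_t *o) { (void)o; printf("LinkedList.add via vtable\n"); }

        int main(void) {
            obj_t hit  = { expected_klass, inlined_add };
            obj_t miss = { "LinkedList",   other_add };
            jit_call_site(&hit);
            jit_call_site(&miss);
            return 0;
        }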

    I guess I’m saying that SpecInt is a thing, g++ performance is a thing, but what we really need is support for what is running most big corporate workloads: managed environments (JVM/Microsoft). I have written some JITs in my distant past, but I am rusty as all hell.

    I suspect, though, that the ideal target architecture is different from what SpecInt might suggest. (And yes, efficient multi-threading is part of that, but it doesn’t capture the dynamic CPU demand.)

  • rolandpj
    in reply to: The Belt #2974

    My concern was not so much encoding entropy as hardware efficiency, but both are interesting.

    The belt on a high-end configuration (30+ ops/cycle) operates as a massively multi-ported register file – particularly if all operations are allowed to access all belt positions. This is true (I think) regardless of whether it’s really implemented as a CAM, as shadow register files, etc. The talks allude to some combination of the above, but I am reading between a whole lot of lines, no pun intended. From the talks, for example, you do actually intend to have FU-local shift register banks, and the mapping to the belt abstraction is a hardware problem(!).

    The belt abstraction is useful, indeed, for steady-state implementation of software pipelining, and you have extended the concept of intrinsically circular rotating buffers into your ‘scratchpad’ – which can be seen as a local/remote (or internal/external) register file concept. In short, the belt abstraction is awesome for compiler writers, which is a nice reaction to RISC, and it incorporates a lot of the advantages of a stack abstraction – encoding entropy in particular.
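    For reference, the abstraction itself (not the hardware) behaves like a fixed-length circular buffer: every result drops onto the front, and the oldest value falls off the back. A minimal software model, with an invented belt length:

        #include <stdint.h>
        #include <stdio.h>

        #define BELT_LEN 8 /* invented belt length */

        static int64_t belt[BELT_LEN];
        static int front; /* index of belt position b0 */

        /* Every result drops onto the front; the oldest value silently
         * falls off the back – use-it-or-lose-it. */
        static void belt_drop(int64_t v) {
            front = (front + BELT_LEN - 1) % BELT_LEN; /* rotate, don't copy */
            belt[front] = v;
        }

        /* Operands are named purely by belt position (a temporal address). */
        static int64_t belt_get(int pos) {
            return belt[(front + pos) % BELT_LEN];
        }

        int main(void) {
            for (int i = 1; i <= 10; i++)
                belt_drop(i);                     /* 1 and 2 have fallen off */
            belt_drop(belt_get(0) + belt_get(3)); /* add b0, b3 -> 10 + 7 = 17 */
            printf("b0=%lld b1=%lld\n",
                   (long long)belt_get(0), (long long)belt_get(1));
            return 0;
        }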

    I don’t really know what I’m talking about, but there are so many interesting aspects of the design, most of which are old ideas, maybe yours (tagged typing of local registers, entropy encoding of intent, virtual memory translation behind the cache, blah blah). I am not aware of a prior hardware abstraction that is a use-it-or-lose-it register file (the belt) – it’s certainly a standard software device. The other aspect that I haven’t seen before (though I say that without much conviction) is single-instruction phasing – i.e. instruction encoding that pragmatically crosses hardware pipeline boundaries. I’m not sure how generally useful that is, though, beyond efficient encoding of short subroutines (which the compiler should inline, no?).

    Regarding a general-purpose belt vs. local specialisation: most floating-point computations are logically distinct from integer computations. Most branch predicate results are independent of real integer results (even though they are often produced by flag-setting variants of the same silicon). Most vector operations are very different from integer operations, particularly when you get ridiculously wide – 512 bits. Why would you carry them all on the same belt (particularly the unusually bit-wide values)? The answer, I guess, is that the belt is an abstraction, but I think there is an entropy opportunity there too.

    I am fascinated. When do we see silicon?
