The Belt

rolandpj

The Belt vs Multi-Accumulator

What is the advantage of the belt abstraction over a a multi-accumulator abstraction? In particular, why not treat each functional unit as an independent accumulator machine.

As noted in some of the talks, only 14% of results are used more than once.

The belt ISA (register) abstraction provides a bit-wise efficient encoding, since it eliminates the destination register. However, a multi-accumulator model requires (typically) even less - just one argument per operation, which can source values from other functional units. Basically the compressed instruction is a bitmap of active functional units, and the 2nd (+?) value for each operation is explicitly provided in the variable-length instruction trailer.

This architectural design seems orthogonal to other aspects - memory management, load/store stations, scratchpad, spill mechanism, etc.

Admittedly the belt abstraction provides a neat fn call abstraction. On the other hand, a callee-save instruction could efficiently push accumulators towards the spiller, and a 'call' instruction could efficiently marshal parameters from the accumulators.

A natural and maybe pragmatic extension is to provide multiple accumulators per functional pipeline - somewhere between a full register bank and a true multi-accumulator model.

Per the comment in talk #3 about extending the backless memory model to the heap: this might be critical for JVM middleware - typically the stack only contains pointers into the heap, and thread-local eden space is treated effectively like the traditional stack in C(++).

ivan

A multi-accumulator (MA), like the belt, has the advantage that the encoding does not need to specify a destination. It can also often avoid specifying one source, but pays for that with move operations to load the accumulator. You suggest using each FU as an accumulator machine, which is fine for ALUs but some FUs have no result to accumulate. In addition encoding would like a power-of-two number of addresses, while FUs are rarely provisioned in power-of-two numbers. Lastly, even a use-once may still need to be saved: (a OP1 b) OP2 (c OP3 d) where all ops use the same FU. All this can be worked out, but which approach would have the better entropy in real codes would depend on details of the encodings and the workload.

A larger issue is hazards: the belt is SSA and there are no W-W, R-W, or W-R hazards, while an accumulator is updatable and so someone must deal with hazards.

Because the Mill transiently hold results at the FU outputs one can see it as a kind of MA in terms of routing. It is not MA encoded; we are willing to pay the cost of the address remapping to avoid hazards and have efficient spill.

rolandpj

My concern was not so much entropy of encoding, as hardware efficiency, but both are interesting.

The belt on high-end (30+ ops/cycle) operates as a massively multi-port register file - particularly if all operations are allowed to access all belt positions. This is true (I think) no matter whether it's really implemented as a CAM, as shadow register files, etc. The talks allude to some combination of the above but I am reading between a whole lot of lines, no pun intended. From the talks, for example, you do actually intend to have FU-local shift register banks, and the mapping to the belt abstraction is a hardware problem(!).

The belt abstraction is useful, indeed, for steady-state implementation of software pipelining, and you have extended the concept of intrinsically circular rotating buffers into your 'scratch-pad' - which can be seen as a local/remote register file concept (or internal/external register). In short, the belt abstraction is awesome for compiler writers, which is a nice reaction to RISC, and incorporates a lot of the advantages of a stack abstraction - encoding entropy in particular.

I don't really know what I'm talking about, but I there are so many interesting aspects of the design, most of which are old ideas, maybe yours (tagged typing of local registers, entropy encoding of intent, virtual memory translation behind the cache, blah blah). I am not aware of a hardware abstraction that is a use-it-or-lose-it register file (the belt) - it's certainly a standard software device. The other aspect that I haven't seen before, with little conviction, is single-instruction phasing - i.e. instruction encoding that pragmatically crosses hardware pipeline boundaries - however, I'm not sure how generally useful that is, beyond efficient encoding of short subroutines (which the compiler should inline, no?).

Regarding a general-purpose belt, vs. local specialisation. Most floating-point computations are really logically distinct from integer computations. Most branch predicate results are independent from real integer results (even though they are often flag-setting variants of the same silicon area). Most vector operations are very different from integer operations, particularly when you get ridiculously wide - 512 bits. Why would you carry them on the same belt (particularly unusually bit-wide operations)? The answer, I guess, is that the belt is an abstraction, but I think there is entropy opportunity there too.

I am fascinated. When do we see silicon?

rolandpj

I don't seem to be able to edit, but more blah.

My suspicion, as an ex-compiler guy and now middle-ware corporate, and occasional wannabe hardware dude, is that single-threaded performance requires a few things. Obviously massive op/instruction. But more importantly, collapsing of local branches to enable that. Without that, ILP of general-purpose code is branch-limited. So, why not have predicated execution of operations in an instruction, as a bit-mask, as an extension of single-instruction predication (per 32-bit ARM)? The masked-out instructions produce None, or are NOPs. For wider ops/instruction, allow multiple masks.

I share the same cynicism that some of the VLIW audience did at one of your talks - namely, there just isn't enough apparent ILP available. We have to waste speculative energy to change that.

For JVM middleware, most compiler speculation is around virtual method calling versus concrete run-time types. You can unwind that, just like C++ in the JIT, to predict virtual call targets, and/or you can specialize for known concrete run-time types. It kinda looks like branches in the JIT code. In addition, most commercial (managed environment) software is now strictly pointer-bound. Unlike carefully-crafted C(++), there is no language support for stack-local structures. Apparent load/store activity is huge. Garbage collection is important, and it's unclear exactly what hardware facilities would be useful. Local allocation areas (Eden space) are organised as stacks, independent of the conceptual call-stack, but promotion out of Eden needs to be handled.

I guess I'm saying that SpecInt is a thing, g++ performance is a thing, but what we really need is support for what is running most big corporate workloads. It's managed environments (JVM/Microsoft). I have written some JIT's in my distant past, but I am rusty as all hell.

I suspect, though, that the ideal target architecture is different from what might be expected from SpecInt. (And yes, efficient multi-threading is part of that, but doesn't capture the dynamic CPU demand).

ivan

You wander a bit, but I'll try to address some of your points.

The belt is not equivalent to a multi-port register file because there's no update; everything is SSA, so if anything it acts like a RF with only read ports and no write ports. There is an all-source-to-all-sink distribution problem, conceptually akin to a crossbar matrix, but one of the patents explains how we can do 30X40 and have it time like 8x20. However, the distribution is still the scaling limit for the belt.

Phasing is valuable in open code because of the shape of the typical dataflow graph: a few long chains, with local short business at each node of the long chain. We're not often executing two long chains simultaneously, but we are often doing several bush pieces at once and feeding that to the long chain. The bush fragments tend to be short (1-3 ops) and to fall naturally into the three cycles given by phasing, so we can overlap the later bush phases of this long-chain node with the earlier phases of the bush of the next node. It sounds a bit like breadth-first search, but the presence of the long chains means that the placement algorithm is different.

We get long open code because we do a lot of speculative execution. It's a little like trace scheduling.

Some years ago we had vectors ultra-wide operands (UWOs). UWOs make sense on a machine with a dedicated UW RF, but not on the belt where scalars are mixed in, so we replaced UWOs with our skyline vectors.

About seeing silicon: go to https://millcomputing.com/investor-list/

Predication: we have predicated forms of (nearly) all the operations that cannot be speculated, notably all control flow and stores. There's no point in wasting entropy and compiler monkey-puzzle on the rest.

Managed systems: we have some gc-centric facilities, We also expect that a Mill JIT would take advantage of the regularity and simplicity of the ISA to generate to the machine rather than to a memory-to-memory legacy abstraction. The big Mill wins in a managed environment will not come from the execution core however; it will come from the reduced memory traffic, cheap coherence, and ultra-fast task switch.

NXTangl

Wait. If you issue a FMA like that and immediately call, the code on the other side of the call knows nothing about the in-flight op. How do you handle hazards like that? Do you stall? And if you do stall, when do you decide to stall?

NXTangl

A couple of questions/thoughts.

What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?

Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn't you just have a FU with multiple output registers feeding into the latency-1 tree? I'm not able to visualize what makes it difficult.

For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me like either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping, because that's the only thing I can think of that the second-level mux tree would have over the one-cycle tree.

ivan

Calls (including the body of the called function) have zero latency. The FMA drops after the call returns.

The Mill spiller not only saves the furrent belt and scratchpad, but also everything that is in-flight. The in-flights are replayed when control returns to the caller, timed and belted as if the call hadn't haoppened.

That how we can have traps and interrupts be just involuntary calls. The trapped/interrupted code is none the wiser.

ivan

Try and use one question per post. It's easier for the reply and the readers.

What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?

What's an output register?

Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn’t you just have a FU with multiple output registers feeding into the latency-1 tree? I’m not able to visualize what makes it difficult.

Hardware fanout and clock constraints. Lat-1 is clock critical and the number of sources (and the muxes for them) add latency to the lat-1 FUs. Lettng lat-1s drop two results would double the number of sources for the belt crossbar, and it's not worth it. Lat-2 and up have much more clock room.

For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me like either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping, because that’s the only thing I can think of that the second-level mux tree would have over the one-cycle tree.

There's no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we'd need another pipe cycle in there to keep the clock rate up.

NXTangl

Sorry. Anyway, I meant like instead of an entire belt, the code specifies the sources based on the latency 1/2/3/4 results of each FU, plus the actual spiller registers. Since a specializer can know statically where the spiller puts overwritten results, that's not a problem in conASM. Is belt renaming that important?

NXTangl

That makes sense for most FUs--outputs that are only sometimes used are unnecessary paths. But what about FUs where every possible op always has two outputs? Is there a use case for that?

NXTangl

There’s no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we’d need another pipe cycle in there to keep the clock rate up.

So, to clarify: the problem is not that the routing information isn't known far enough in advance, but that the results for the one-cycle latency ops don't exist that far in advance? And anything specified as latency 2 or more will exist a full cycle in advance?

ivan

The specializer doesn't know where the spiller puts things, because it cannot predict the dynamic control flow and hence the belt demand. As soon as there is a trap/interrupt or the main program goes through a function or label pointer, or even a conditional branch, well, that's all she wrote.

ivan

None we know. If we find one, or any in which the two-result op is on the program's critical path even if there are other one-res ops, then it's easy for us to put the second into the fastpath too. And pay the price.

ivan

Not quite. Everything is known well in advance, and the time for the actual computation may be the same for one and two res ops. However, after a result is computed it must be routed from the producer to all the consumers it will have in the next cycle. That route time is independent of which op produced the result. The route time is largely dependent on the number of sources, i.e. the number of distinct operands that can be dropped in each cycle, which is the fan-in to the routing crossbar.

We want to keep the routing time low, because every operation pays that cost. So we want few belt sources, and need to move low-value ops out of the way. We do that with the two-stage crossbar, and relegate all multi-result ops and all non-lat-1 ops to the second stage. There may be a two-res op that is fast enough that its result is available as fast as a simple add. But putting those two results on the fastpath would add a cost to every other fastpath op, including critical things like add.

jcdutton

Hi.

Some questions about the belt.
When doing branches, loops etc. The belt needs to use the "conform" instruction to rearrange the belt.
How efficient is the "conform" instruction?
A CPU with normal registers would never need a "conform" instruction, so isn't the belt causing a processing overhead here?
I guess the "conform" could be a hardware shuffle of the belt providing very little overhead.
Another problem aspect of the belt might be micro parallelism.
From a single group of instructions to be executed, one might be able to execute some in parallel if they do not conflict. But the belt adds a conflict/race when the result is saved on the belt.
A possible solution to this might be: Multiple belts. The instructions that work in parallel could be saving to different belts, thus removing the conflict.
One would then have to implement some stall/syncronisation mechanism whereby the next instruction stalls, until the previous parallel instruction finishes that the stalled instruction depends on.
How many bytes is the belt? How much needs to be copied on a context switch?

ivan

There isn't any conform op any more; the functionality was moved to branch ops so a taken branch can rearrange the belt to match the destination. However, the same question applies: the rearranging takes zero cycles and has essentially no cost. Nothing is actually moved; all that happens is that the mapping from belt number (to decode) to operand location (dynamic) is shuffled.

There's no real conflict/race. The "drop to the belt" doesn't involve any real movement. The result stays in the producer's output latches until the space is needed for something else (when a real move will happen). "Drop" just enters the new value's physical location into the mapping from belt number to location. It tells the rest of the machine "when you want b3 it's over there". You can have multiple drops from different sources; each goes into its own output location, and each gets its own mapping. Yes, there are necessarily multiple updates to that mapping happening, but the values are 3-bit belt numbers, not 128-bit data, and the hardware can handle it.

The belt has no byte size; it's a logical list of locations. The size is the number of locations it tracks; current configs exist with 8, 16, and 32 positions. It doesn't have to be saved, because each operand self-identifies with a timestamp which is its modulo index in an infinite sequence of drops; the leading numbers of that are the current belt.

It's a bit to get your head around, I know :-)

We talk about the belt as if it were a shift register, and dropping as if they were copies, but that's all "as if" metaphor.

tbickerstaff

Whilst the conform instruction is an extra instruction, its arguably doing what a conventional register machine has to do anyway, but spread out across the instructions of the loop. Ideally, such as when there is only a single place in the code that can branch to a block, you don't need it as can setup the receiving block based on the incoming belt layout. But when you do need it you are just bringing back the overhead that a conventional register machine has in the encoding of the output registers of instructions.

As for the idea of a conflict, I'm not sure where you got that from. It might be worth going back to the execution talk. The mill is a statically scheduled VLIW arcitecture. So microparallism is handled by an explicit ordering of instructions and statically defined latency (for a specific family member, see the specification talk for how this is handled) which allows the specializer to dispatch multiple opcodes per instruction, and know exactly when each will drop its result onto the belt. Thus very few instructions can actually stall. pickup being one of the few, and there's very little you can do if you are stalled on a pickup.

And for context switches the amount of data would depend on family member, but is handled in the background by the spiller. And anyway the register file/belt on a modern processor is a tiny part of the cost of a context switch as opposed to cache flushing/TLB flushing and these days flushing the speculative execution pipeline. So the differences in the mills memory system will hide any impact the belt has on context switches.

ivan

You are right that the congruence arguments to branch instructions provide a replacement for the result specifiers of legacy ISAs. You are wrong when you assume that the costs are equivalent.

Legacy result specifiers are needed on any instruction that has a result. A congruence argument is needed only for values that are still live across a branch at the end of a basic block, and then only if the branch is going to a control flow join point. The counts of specifiers and of congruence arguments can and do diverge dramatically in real code. I see codes like sha1 where there are hundreds of results dropped in a block, and only two are passed to the next block. I've never actually instrumented the tools to measure, but by eyeball I'd guess that there is one congruence argument for every eight drops on average.

Just to give you a better guess I took a look at the code I happened to have been working on before writing this note: it's five functions from a buddy-allocator package and I had been chasing an case of less than optimal asm coming from the specializer. In the whole package there were exactly three congruence arguments, one each on three branches; as it happened all three were control variables of loops. The code of the five functions dropped 52 times total.

This ratio is so high for a mix of three reasons. First is that many values die before they get to a branch; only the live ones need congruence. Second is that the specializer doesn't use congruence for values that are live across loops but not used within the loop: instead it spills at point of creation and fills at point of use. That allocator code has five spills and six fills on the Silver config. A legacy machine with lots of registers could have just left the values in register. Of course those would have to be spilled across calls; as Heinlein said, TANSTAAFL. Third, the specializer goes to a lot of work to eliminate branches. That's to reduce the cost of prediction and mispredicts, but as a by product increases the benefif of congruence vs. specifiers.

tbickerstaff

Thanks for that. I was speculatively hoping that conform would be less costly for the liveness reasons, but without actual code examples in front of me, I assumed the worst case of legacy blocks containing only enough instructions to equal the cost of conform. I didn't know how much the specializer avoids branches though.

« Previous Page Next Page »