The Belt

staff

Talk by Ivan Godard – 2013-07-11 at Google

Slides: 2013-07-11_mill_cpu_belt (.pptx)

Belt Machines

Data interchange without general registers

A large fraction of the power budget of modern superscalar CPUs is devoted to renaming registers: the CPU must track the dataflow of the executing program, assign physical registers and map them to the logical registers of the program, schedule operations when arguments are available, restore visible state in the event of an exception—all while avoiding register update hazards.

Not all CPU architectures are subject to hazards that require register renaming. Unfortunately, earlier hazard-free designs either require one-at-a-time instruction execution (stack and accumulator machines) or push hazard avoidance off onto the compiler or programmer (VLIWs). The Mill is a new machine architecture that eliminates these problems by adopting a new machine model, the “belt”.

The belt machine model is inherently free of update hazards because all operation results go onto the belt by Single Assignment; in other words, once created they never change their value. Belt machines have no general registers and thus no rename registers that physically embody them. Result addressing is implicit, which produces compact code and easily accommodates operations like integer divide that logically produce multiple results. The machine model integrates naturally with function call, eliminating caller/callee save conventions and complex call preamble and postamble code.

A belt machine has short pipelines because it lacks the extra pipe stages associated with rename; typical misprediction penalty is five cycles (if decode is also fast). Area and power consumption in a belt core is a third that of an equivalent superscalar in large part because a belt lacks the large number of physical rename registers and the interconnect needed to supply register values to the functional units.

The talk explains the belt model as seen by the programmer and the physical internals of a typical implementation.

Will_Edwards

A call pushes parameters onto the callee's belt and creates a new frame id etc, which seems straightforward; but what happens to the belt when you branch rather than call?

The loop execution has advanced the belt some number of positions, so the passed-in parameters are no longer at the top of the belt. There must be some kind of rewind mechanism? Which then makes me wonder what happens to internal values you don't want to lose?

JonathanThompson

Hi Will,

Unless I've misunderstood something meaningful because the videos weren't too clear, how is it really any different from how a typical load/store CPU works? That is: when you do a branch in a load/store CPU, you know (as the compiler writer) what's stored where for registers as of the time of the branch: it's no different with the Mill Belt, and unless I'm mistaken, the Belt doesn't move merely because of the branch.

That's not to say that there aren't instructions in-flight that need to be accounted for: but that's not really any different from no branch, because the compiler knows exactly how much is in-flight as well, as it's 100% deterministic, even if there are pipeline stalls from cache misses.

Will_Edwards

Hi Jonathan,

I'm beginning to form my own (uninformed) mental model of the belt in loops.

The loop body must be in an EBB, and the parameters to an EBB are passed at the front of the belt. The call instruction has a list of belt positions and order to make the new belt for the callee.

If you jump back to the start of the EBB, you have to put the parameters for the next iteration back at the front of the belt.

My intuition would be that the branch ops can all specify a list of belt positions and order to put these back at the front of the belt for the next iteration, just like a call can.

I might have missed if this has been explained in any of the talks so far.

ivan

Check out the scratchpad, described in the Belt talk. It's where you put stuff that you don't want to lose.

ivan

In reply to Jonathan #397: Yes, regular branches have to belt consequences. Of course, the predicate for a conditional branch must be on the belt, but that's an argument, not the branch itself.

ivan

The operation that renumbers the belt is called "conform" and is a separate op from branches. It takes a belt number list, and those are moved to the front of the belt, in order, and the rest of the belt is invalidated. There is also a "rescue" operation that takes a similar list, but the rescued values are left on the belt in original belt order, not in list order. Obviously you can use conform to get the effect of rescue; rescue exists because it has a more compact encoding, for use when you don't care what the order is so long as they don't fall off.

Any time you have a control-flow join point in which the in-bound arcs do not have congruent belts (such as at the top of a loop) then you pick one of the arcs, define its belt as canonical, and use conform on the other arcs to make everything congruent. For loops you generally want to define the most frequently taken backwards branch as the canonical one; that eliminates the conform that would be most frequently executed.

Will_Edwards

Is it possible to issue an op even if an instance of that same op is already in flight in that same pipeline?

(This is hinted at in this talk but I want to confirm my understanding - or misunderstanding - of this)

ivan

In general all ops are fully pipelined, so each slot can issue one op per cycle. Consecutive ops in a pipeline can be the same or different, it doesn't matter. Ops do not overtake one another in the pipeline, but they do differ in latency so a shorter latency op can retire before a longer-latency op that was issued earlier.

There are a few cases of intra-pipeline hazards that code gen must be aware of and statically avoid. An example is the FMA (fused multiply-add, sometimes called MAC for multiply-accumulate). If, for example, a regular multiply takes four cycles and a regular add takes two for some operand width, it is possible to do a FMA in five, not six, and with better precision than with separate operations. Hence, as the intermediate result is coming out of the multiplier in cycle 4 it needs to be fed directly into the adder. However, the code could be issuing an add operation in cycle 4 that is also goung to try to use the adder, and they can't both use it.

As a result, while the FPU is pipelined and you can issue one FMA per cycle, or one add per cycle, with no problem, the right combination of FMA and add causes a pipeline hazard. The compiler should avoid this, and the hardware will throw a fault if it detects a hazard violation.

Some implementations of FMA have all their own hardware and don't share the regular adder. These have no hazard, but do have more costly hardware; it's all in the specification.

Symmetry

How do operations that have multiple (always 2?) return values interact with the output registers for the pipelines? Does the extra get bumped to a higher latency result register? Do you use the extra buffer registers that were mentioned? Or are there special pipelines for dual-result operaitons that have pairs of registers?

ivan

There's no restriction to two results, although we have none more than two except the function return op itself, which can completely fill the belt if it wants.

FUs that have ops that can return more than one result have an output result register for each result. These bump up the latency chain normally if there are the same number of registers at the next higher latency. If there are not (which is common) it's up to the hardware guys whether to add dummy result registers at the higher latencies to bump into (simpler, but more space and more belt sources) or bump the excess direct into the spiller buffer pool (more spiller sources). The software can't tell.

Tino

Out of curiosity, does anyone (Ivan?) happen to have the title of the Yale Patt paper/work referenced in this presentation? Seems like an interesting read.

Thanks in advance :)

yogaman

@Tino: I'm not certain what source Ivan was recalling, but this PhD thesis from one of Professor Patt's students reports some of the value fanout data, e.g., 4% of results are produced but never used (p. 9).

https://www.lib.utexas.edu/etd/d/2007/tsengf71786/tsengf71786.pdf

I hope this helps.

Tino

Exactly what I was looking for. Thanks yogaman ;)

ksrm

Say you have two values on the belt, a and b, and you conditionally add one to a. Then either your belt looks like (a b) or (a+1 a b). How does the compiler then address value b in the code that follows? It is either b1 or b2, but how can the compiler know?

edit: I see this may have been answered in an earlier comment. If I understand right, at the end of the conditionally executed branch you would have a conform operation which reshuffles the belt. Is that a single-cycle operation?

Will_Edwards

Yes you can rearrange the belt with the conform op.

You can also speculatively execute operations and pick whether to use the result later. The Mill has a lot of pipelines and there's often slots available, and speculation doesn't fault, so its routine to unbranch cheap code in this way. This is described in the Metadata talk.

PeterH

KSRM, I believe the way the operation would usually be done on a Mill, first you have an unconditional increment of a, then a conditional select, the select being a "zero time" operation.
(a b)
(a+1 a b)
(a a+1 a b) or (a+1 a+1 a b)
A given operation always adds the same count of results to the front of the belt. If you use branching you need to make sure the belt is always set up correctly before your branches merge.

canedo

This is a great initiative, the best of luck to the founders.

Has anyone made the connection yet between belt machines and queue machines?

From 2004 to 2008 I was part of Prof. Sowa’s lab where we developed the Parallel Queue Processor. We had a similar idea (FIFOs holding operands) with similar objectives: high parallelism, no register renaming, and short instruction set. http://dl.acm.org/citation.cfm?id=1052359

For those interested in the compilation aspects: http://ir.lib.uec.ac.jp/infolib/user_contents/9000000330/9000000330.pdf

Some additional material about queue machines has been compiled by Prof. Abderazek:

http://web-ext.u-aizu.ac.jp/~benab/research/projects/QueueCore/

ivan

I was unfamiliar with your work and greatly appreciate you bringing it to my attention. Your thesis will take a bit for me to digest; please email me at ivan@millcomputing.com and I’ll comment when I have gotten through it. Unfortunately, research in Japan is not as well known here in the US as it should be; for that matter, the same is true of research from anywhere outside the US, due to some mix of language barrier, NIH, old boy networks, and perhaps arrogance.

As for your Queue Machines, it’s clear that stack machines, queue machines, and belt machines like the Mill are all members of a super-family of architectures in which transient results are single-assignment; the differences among stack, queue, and belt lies in the permitted/encouraged access pattern to use those transients. Perhaps systolic array machines may be thought of as belonging to this group too. The other great super-family, relaxing the single-assignment constraint, includes accumulator and general-register machines.

Thank you.

LarryP

Canedo wrote:

Has anyone made the connection yet between belt machines and queue machines?

From 2004 to 2008 I was part of Prof. Sowa’s lab where we developed the Parallel Queue Processor. We had a similar idea (FIFOs holding operands) with similar objectives: high parallelism, no register renaming, and short instruction set. http://dl.acm.org/citation.cfm?id=1052359

…

That’s very interesting; another architecture using a form of FIFO for operands! Thanks for sharing those links.

One thing I admire about the Mill architecture, as described, is that the behavior of the Belt is prescribed in terms of what it does (programmer’s model) independent of how it’s implemented (in general, or on any given family member.) If I’ve understood from a quick skim, the above-mentioned Parallel Queue Processor’s queues are tied to a particular hardware implementation, namely a ring buffer. A ring buffer would not appear to lend itself to Belt operations like conform() — which permits arbitrary reordering of the belt’s contents.