Difference between revisions of "Belt"

From Mill Computing Wiki
Jump to: navigation, search
m
Line 32:Line 32:
 
You also don't need rename registers for temporal values, that still need to be addressable by their canonical register name. Rename registers that become necessary in instruction sets with very limited registers are a big source of data shuffling and since they all must be able to serve as source or destination for values the cost of interconnecting all those registers among each other and with the different functional units is an exponential explosion of complexity and one of the biggest energy and die space demands in modern conventional CPUs.
 
You also don't need rename registers for temporal values, that still need to be addressable by their canonical register name. Rename registers that become necessary in instruction sets with very limited registers are a big source of data shuffling and since they all must be able to serve as source or destination for values the cost of interconnecting all those registers among each other and with the different functional units is an exponential explosion of complexity and one of the biggest energy and die space demands in modern conventional CPUs.
  
The belt drastically reduces the amount of data sources and drains, and the interconnectivity is even more dramatically reduced, all this without sacrifcing the amount of values available to execution units.
+
The belt drastically reduces the amount of data sources and drains, and the interconnectivity is even more dramatically reduced, all this without sacrificing the number of values available to execution units.
  
 
== Media ==
 
== Media ==
 
[http://www.youtube.com/watch?v=QGw-cy0ylCc Presentation on the Belt by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/2013-07-11_mill_cpu_belt.pptx Slides]
 
[http://www.youtube.com/watch?v=QGw-cy0ylCc Presentation on the Belt by Ivan Godard] - [http://millcomputing.com/blog/wp-content/uploads/2013/12/2013-07-11_mill_cpu_belt.pptx Slides]

Revision as of 21:49, 22 July 2014

The belt provides the functionality general purpose registers provide in traditional processors as source of operands and destination of results for hardware operations as well as a very short term data store. Depending on Mill family member each belt has 8, 16 or 32 slots, each of which can be seen as a general purpose register.
There are of course important differences, else there wouldn't be different terminology for it.

Semantics

First of all, belt slots are read-only. Once a new value is created it cannot be changed. This makes data dependencies trivial to track and as a result Hazards are impossible.
With a limited amount of slots and immutable slot values this wouldn't be very useful if it wasn't for the fact that when a new value is created and dropped at the front of the belt, the oldest value at the end is discarded, and all other slots are advanced by one position on the belt. This resembles a conveyor belt, hence the name.
Those belt position indices are how input operands for hardware operations are referred to. This index is advanced for each value as more values are pushed onto the front. Values are thus not referred to by fixed slot locations, but relative to the time slot they were inserted at the front. This is called temporal addressing and it is not very friendly to manually keep track of in machine language, but trivial for compilers to do. In fact it can be viewed as a 1:1 mapping of the ubiquitous SSA data flow representation used in most modern compilers, making complicated and sub-optimal register mappings that discard a lot of the information the compiler has obsolete. This means you actually never have to use slot numbers/register names even in the lowest level assembly equivalent languages. You can always use symbolic names in SSA form. No register renaming, no shuffling values between registers. In fact even if the programming model is of values being moved along a belt, in the implementaiton no values are actually ever moved, only slot indices are advanced.

Using and Retaining Values

While new values and operation results are always pushed to the front of the belt, operands can come from any belt slot. The compiler takes care to place all produced values close to the consumers of those values, and as such any values needed as operand can almost always be found on the belt. Only almost always though. It obvioulsy happens more often on shorter belts in less powerful cores. To deal with the rare cases that you need a value for longer than it is on the belt, a Scratchpad with the spill and fill instructions is provided that store and load those longer needed values to and from a fast hardware buffer managed by the Spiller.

Multiple Results

Since new values from operation results are always implicitly dropped at the front of the belt, operations don't need to encode destinations for them. It also makes operations that yield multiple results straightforward to implement, most importantly the division operations that on all register machines are a well of ugly special case treatment. But it is equally true for the normal call instruction, making multiple result functions native to hardware with no penalty.

Slot Data Format

The required word size of the Mill is 64bits, and that's the minimum size of each slot. High-end machines can hold 128bit values there. But 32bit 16bit and 8bit values are just as possible. Only the relevant bits are used, moved and updated for each type. A belt slot isn't limited to a scalar value either. Each slot can just as well be a 4 element vector SIMD of any byte width base elements.

Metadata

What kind of value a belt slot holds, byte width of the base and vector type, is determined by a few bits of metadata attached to each belt slot. This metadata is set on value creation, may that be via a load instruction or by any other operation. It is preserved in the Scratchpad but discarded again on store.

Hardware operations using slot operands determine the operand type by the metadata and become operations of the specified width or vector operations accordingly without having to encode this information in the instruction stream.

Rationale

Why not just use registers?

Hazards have been mentioned already, which are made impossible by immutable values, i.e. the processor doesn't need to keep track of whether a value is still what it expects it to be for any reason, whether it is just a temporal value, etc. etc. This takes huge effort and comes at great cost, especially on out-of-order machines.

You also don't need rename registers for temporal values, that still need to be addressable by their canonical register name. Rename registers that become necessary in instruction sets with very limited registers are a big source of data shuffling and since they all must be able to serve as source or destination for values the cost of interconnecting all those registers among each other and with the different functional units is an exponential explosion of complexity and one of the biggest energy and die space demands in modern conventional CPUs.

The belt drastically reduces the amount of data sources and drains, and the interconnectivity is even more dramatically reduced, all this without sacrificing the number of values available to execution units.

Media

Presentation on the Belt by Ivan Godard - Slides