Difference between revisions of "Encoding"
(→General Instruction Format) | (→Rationale) | ||
Line 90: | Line 90: | ||
And of course you want different processors for different workloads, even if it is general purpose code. Office desktops and communication servers and scientific computing servers and mobile devices and game consoles and industrial controllers, they all have vastly different requirements. They need different kinds and numbers of functional units, of cache sizes, of bandwidths etc. It is impossible for one architecture and one instruction set, much less one processor, to serve all those roles equally well. Software compatibility between all them is still highly beneficial, often even necessary. | And of course you want different processors for different workloads, even if it is general purpose code. Office desktops and communication servers and scientific computing servers and mobile devices and game consoles and industrial controllers, they all have vastly different requirements. They need different kinds and numbers of functional units, of cache sizes, of bandwidths etc. It is impossible for one architecture and one instruction set, much less one processor, to serve all those roles equally well. Software compatibility between all them is still highly beneficial, often even necessary. | ||
+ | |||
+ | There are other advantages to be gained from split instruction streams and phased/shifted instruction subdivisions. What it says is that certain memory regions in relative position to each other can only contain certain content. This is a very effective way to increase entropy and code density. | ||
+ | |||
+ | Furthermore, this specialization allows you to specialize and dedicate caches. This is only relevant for the Flow and Exu streams, not for the insturction phase shifts, which are already in the decoder. What dedicated and split caches allow you, apart from packing them much more tightly, it also allows you to make them smaller and faster by vastly reducing the wire lengths from the bottom of the cache to the decoder, with comparable capacity. | ||
+ | |||
+ | This is actually a trend that can be seen all over the Mill architecture: more and more specialized caches are more efficient and faster and can be placed better on the chip than fewer large unspecialized ones. Division of labour works for machines too. | ||
== See Also == | == See Also == |
Revision as of 04:55, 28 July 2014
The Mill architecure employ a unique split stream instruction encoding that though being wide issue and very dense enables sustained decoding rates of over 30 operations per cycle. As it provides those unparalleled numbers, it does so at the fraction of the energy and transistor cost of mainstream variable instruction length instruction sets like x86.
Contents
Semantics
Extended Basic Block
Code on the Mill is organised into EBBs, i.e. batches of code with one entry point and one or more exit points. There is no implicit fallthrough in EBBs, the instruction flow can only leave them with an explicit branch, which means at least the last operation in every EBB is an unconditional branch. It can contain more conditional branches that either go to the top of other EBBs or to the top of itself. And in contrast to other architectures there can even be calls in the EBB, they don't leave it, as long as they return normally. The image to the right is a purely logical view. The right block is even a normal canonical basic block with one entry point and one exit. In reality things are a little more complicated, as will be seen later here.
Instructions and Operations and Bundles
The unusual encoding makes it necessary to make clear distincitons between instructions and operations and bundles that are not really necessary on traditional machines. In the earliest RISC architectures an instruction and an operation and a bundle are usually the same thing: a word size bundle of bits is retrieved from the instruction cache and dropped into the decoder. There one instruction is retrieved and issued resulting in one operation being performed. On wide issue machines one instruction can contain several operations that are all issued together. Modern machines drop a bundle containing several instructions at once into the decoder.
So, a bundle is the batch of memory that gets fetched from memory and dropped into the decoder together.
An instruction is all the operations that get issued to the functional units together.
And an operation is the most basic piece of processing in a functional unit, an add or xor for example.
Split Instruction Streams
Conventionally there are two apporaches to instruction encoding: fixed length instructions and variable length instructions. Fixed length instruction are cheap and easy to decode, but don't offer very good code density. Variable lengths instructions can offer good code density, but decoding them tends to have polynomial cost. The rate of decode in the former tends to be limited by instruction cache size and instruction cache throughput then. For the latter the limiting factor is the cost of of processing and interpreting the the actual bits, of recognizing the instructions in the stream. The best fixed length instruction decoders can scrape low double digits per cycle. And Intel's heroics can get up to 4 instructions per cycle on x86. How to overcome this bottleneck?
It has to be variable length encoding for simple code density reasons. No way around that. But since decoding one stream of instructions n2 in cost, decoding 2 streams equivalent to the one is a quarter of the cost. And when those two streams are split in their functionality, you get even more gains in code density and simplicity from being able to have different meanings of the bit patterns on each side [1].
The functional split on the Mill for the two different streams is into:
- an Execution core for all arithmetic and logic, actual computation
- a Flow core, for flow control and memory accesses
This gives a fairly even split in how much both sides have to do on average.
Implementation
There are of course a few technical hurdles to overcome for split stream encoding. For one, you need two pcs. In practice there are even 3, but more on that under Decode. How do you specify branch targets with more than one pc? Encoding 2 two addresses in whichever way is not very efficient. You could have implicit base addresses in dedicated memory regions for the two streams and then encode only one offset. Then you run into significant address space waste whenever the two stream are of different length for the same flow of control.
The approach the Mill takes is to only have one address as a branch target, where both program counters end up. But from there one stream walks up the mememory and the other stream walks down. Both sides can be differently long, as needed. So while this branch target logically is the top of an EBB, in memory it actually jumps somewhere in the middle of it.
The increasingly dark blue bars to each side of the entry points to the EBB are the instructions. The same shade of blue for two instructions on each side means they are pulled from the cache and dropped in their respective decoder together, the Flow instructions from the decreasing program counter side, the Execution instructions from the increasing side.
Each instruction is a half-bundle, which together with the instruction on the other side makes a full bundle. Both instructions belong logically together and are issued together.
It should be noted that because of the amount of operations you can pack into one instruction and the Phased execution, the vast majority of EBBs on the Mill only consist of one execute and one flow instruction.
You can also see that each instruction has a different size. While those full instruction sizes are always full byte sizes, there is also always a minimum size of how many bytes an instruction has. This depends on the processor Specification, in particular how many Slots and functional units and Belt positions this specific Mill processor has.
But it is still a variable length VLIW instruction set, and those normally are very hard to parse.
General Instruction Format
The reason this is not the case on the Mill is in the instruction format. This format is different on both of the instruction streams on each side, but both also follow a general pattern and idea mirrored in orientation on each side, which is described here.
Each instruction is decoded in several Phases and each phase has dedicated hardware. As soon as one phase is finished for one instruction the corresponding phase of the next instruction immediately follows. But how does the decoder know where to take the next instruction from when all instructions are a different length? Simple: each instruction has a fixed length header, that contains a shift count that tells the decoder how long this specific instruction is, and where the next instruction starts. And in the case of a branch it is the Precoder that sets the instruction pointer in Prefetch without delay.
This fixed header also contains the amount of instructions/size information for each phase, and each phase has a dedicated region within the instruction:
- Phase 0 immediately follows the fixed header. The decoder always knows where it is, because it is a fixed location within the instruction and immediately starts decoding it assuming maximum size of the phase region. In the next cycle it knows how big the phase actually is, because the header has been parsed too and can cancel all the excess.
- Phase 1 is aligned to the end of the instuction. Decoding phase 1 starts one cycle after phase 0, because then the shifter knows how long the instruction is, and how big the Phase 1 region is from the header.
- Phase 2 immediately follows the Phase 0 region, and also starts decoding in the second cycle, because then the position is known.
Each phase region can only contain operations that belong into that phase and has its own instruction format and dedicated decoding hardware. Because of this specialization in each phase, each of those instruction formats is very simple and dense and the specialized decoders can be fast and cheap.
This process of decoding isn't limited to 3 Phases. The general pattern of decoding 1 Phase on the first cycle and 2 more on each consecutive one can be expanded to arbitrary sizes, but it turns out 3 is enough. It could be considered another doubling of the instruction streams.
The phase regions are also not limited to byte sizes, indeed they rarely are. They contain a number of fixed size operations for that phase and usually this doesn't add up. This means almost always there is an alignment hole in the middle of an instruction where the phase1 and phase2 regions meet. This space isn't wasted. It contains a delay count that serves as a NOP of the given length. But not for the instruction itself, but for the instruction of the opposing side, i.e. the alignment holes on flow encode the delays on execute and vice versa. There are also normal NOPs, but you rarely ever need to use them.
This kind of efficient NOP is very important on wide issue machines, and it is also very important in statically scheduled exposed pipeline architectures. The operations in phase 1 and phase 2 can have different latencies, and this delay serves to statically sync with the other stream.
In an interesting side note: in contrast to conventional architectures, espcially RISC architecuturs, on the Mill the program counters or instruction pointers are always explicitly set/altered with values from the instruction stream or from operands. There is no implicit increment. On other processors this only happens with branch instructions.
Rationale
The need to be statically scheduled wide issue and to be able to decode a lot more instructions/operations per cycle has become apparent for quite a while. In the DSP world this is standard practice for decades already. The problem is, the many and unpredictable branches in general purpose workloads make static scheduling very hard and punish it severely with stalls whenever static scheduling fails.
Additionally, wide issue machines with many parallel operations in one instruction are very inefficient in code size when instruction level parallelism is limited, as it tends to be in general purpose workloads with small current data flow windows.
There are a two factors that offer a way out of this though:
- Compilers have unlimited data flow windows and can statically and optimally schedule everything if the instruction set provides those interfaces.
- ~80% of all code is in loops, and statically scheduled loops have unbounded instruction level parallelism.
So, when the problem of statically scheduling most loops has been solved, and it has been a gnarly problem since the start of computing, but for the Mill it is solved, performance mainly becomes a problem of how many functional units you have, and how fast you can feed them with instructions and data.
You still don't want a fixed width wide issue instruction set, for that general purpose workloads are just far too unregularly, but you want the option to be as parallel as you have functional units, when the control and data flow allows it.
And of course you want different processors for different workloads, even if it is general purpose code. Office desktops and communication servers and scientific computing servers and mobile devices and game consoles and industrial controllers, they all have vastly different requirements. They need different kinds and numbers of functional units, of cache sizes, of bandwidths etc. It is impossible for one architecture and one instruction set, much less one processor, to serve all those roles equally well. Software compatibility between all them is still highly beneficial, often even necessary.
There are other advantages to be gained from split instruction streams and phased/shifted instruction subdivisions. What it says is that certain memory regions in relative position to each other can only contain certain content. This is a very effective way to increase entropy and code density.
Furthermore, this specialization allows you to specialize and dedicate caches. This is only relevant for the Flow and Exu streams, not for the insturction phase shifts, which are already in the decoder. What dedicated and split caches allow you, apart from packing them much more tightly, it also allows you to make them smaller and faster by vastly reducing the wire lengths from the bottom of the cache to the decoder, with comparable capacity.
This is actually a trend that can be seen all over the Mill architecture: more and more specialized caches are more efficient and faster and can be placed better on the chip than fewer large unspecialized ones. Division of labour works for machines too.
See Also
Decode, Phasing, Instruction Set
Media
Presentation on the Encoding by Ivan Godard - Slides