The decode process turns the binary instruction streams into requests to the functional units.
Slots and Pipelines and Functional Units
Each instruction is divided into blocks. Within those variable-length blocks, the operations are arranged in arrays. Each position in those arrays is called a slot. It corresponds directly to a hardware slot, which has a dedicated decoder that can only decode the operations that can occur at this position in the instruction. The slot then sends a request to perform the decoded operation to the proper pipeline or, in some cases like the pick operation, to the crossbar circuits.
A pipeline is a collection of functional units that share a common data path, the same inputs and outputs. It is in the functional units where the actual work gets done, where the data is manipulated and shoveled around by operations.
Since there are two separate instruction streams, each is decoded by its own specialized decoder. Not only does each stream have its own decoder; within each instruction, each block has its own specialized decoder, and each slot within each block has its own specialized binary operation format. This format depends on which functional units are available on the hardware pipeline this slot feeds into. Different hardware slots provide different functionality.
Streams and Decoders
The decoders are distinct, specialized modules, each with its own data paths, caches and processing units to accommodate its specific workload. The general format of the instruction streams is described under Encoding. Here we go into the details and see how those streams and instruction formats differ.
A morsel is the basic unit for encoding values within the instruction stream. A morsel takes as many bits as are needed to address all belt locations on a specific core, i.e. 3-5 bits.
In most instructions that take immediate values, those immediates are also morsel-sized.
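The relation between belt length and morsel width can be sketched as follows. This is only an illustration: the function name `morsel_bits` and the belt lengths 8, 16 and 32 are assumptions chosen to match the 3-5 bit range given above, not values taken from any particular core specification.

```python
def morsel_bits(belt_length: int) -> int:
    """Smallest bit width that can address every belt position."""
    if belt_length < 1:
        raise ValueError("belt_length must be positive")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 2; use at least 1 bit.
    return max(1, (belt_length - 1).bit_length())


# Assumed belt lengths, for illustration only.
for belt_length in (8, 16, 32):
    print(belt_length, "belt positions ->", morsel_bits(belt_length), "bits")
```

With these assumed belt lengths the widths come out as 3, 4 and 5 bits.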
The instruction headers in the exu stream contain a shift count to the next instruction. They also contain the count of the encoded slots in each block, except for the exu block, or block 3, since that count can simply be inferred from the other slot counts.
The slots in each block have their own format.
Exu reader operations have one hardcoded parameter source selector, and this source selector actually defines the whole operation. Which means you don't even need an opcode. Reader operations are encoded just as a sequential id to identify all the available reader sources, and any unused id values in the bit width are just filled up with Popular Constants. There is no internal structure in the operation and no wasted entropy.
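A reader decoder along these lines amounts to a plain table lookup. The sketch below assumes made-up source names and a made-up list of popular constants; the real lists depend on the core and are not given here.

```python
# Hypothetical reader sources and popular constants, for illustration only.
READER_SOURCES = ["source_a", "source_b", "source_c"]
POPULAR_CONSTANTS = [0, 1, -1, 2, 3]


def build_reader_table(id_bits: int) -> list:
    """Fill the id space: real sources first, popular constants in the rest."""
    size = 1 << id_bits
    if len(READER_SOURCES) > size:
        raise ValueError("id field too narrow for all reader sources")
    table = [("source", name) for name in READER_SOURCES]
    spare = size - len(table)
    table += [("constant", value) for value in POPULAR_CONSTANTS[:spare]]
    return table


def decode_reader(op_id: int, table: list) -> tuple:
    """The whole operation is just the id; there is no opcode field to split."""
    if not 0 <= op_id < len(table):
        raise ValueError("unassigned reader id")
    return table[op_id]


table = build_reader_table(3)
print(decode_reader(0, table))  # a reader source
print(decode_reader(3, table))  # a popular constant
```

The point of the sketch is that no id value is left without a meaning as long as enough popular constants are available to fill the gap.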
Exu slots encode all the operations with 2 operands. Those are structured. They have an opcode, the size of which is dependent on the operation population in that slot. It might be different for every slot in the exu block.
Then there are 2 morsels for arguments, each of which may be a belt location or an immediate value. Usually, if there is an immediate value, it is the 2nd argument. Operations with an immediate argument often get shorter opcodes to leave more bits for the immediate value.
There are also the operations that take implicit arguments from neighboring slots or from condition codes from neighboring slots. In those cases there is of course an opcode prefix and the bits used for the arguments extend the opcode.
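The basic two-operand format (an opcode followed by two morsels) can be unpacked with a few shifts and masks. The field order used here, opcode in the high bits and the second argument in the lowest bits, is an assumption for the sake of the example, as are the names `ExuOp` and `decode_exu_op`. The opcode width is passed in because it differs from slot to slot.

```python
from typing import NamedTuple


class ExuOp(NamedTuple):
    opcode: int
    arg1: int  # belt location or immediate, depending on the opcode
    arg2: int  # belt location or immediate, depending on the opcode


def decode_exu_op(bits: int, opcode_bits: int, morsel_bits: int) -> ExuOp:
    """Split one operation into fields (assumed layout: opcode | arg1 | arg2)."""
    morsel_mask = (1 << morsel_bits) - 1
    arg2 = bits & morsel_mask
    arg1 = (bits >> morsel_bits) & morsel_mask
    opcode = (bits >> (2 * morsel_bits)) & ((1 << opcode_bits) - 1)
    return ExuOp(opcode, arg1, arg2)


# A slot with a 5-bit opcode on a core with 4-bit morsels (assumed widths).
encoded = (0b10110 << 8) | (3 << 4) | 9
print(decode_exu_op(encoded, opcode_bits=5, morsel_bits=4))
```

Whether an argument is a belt position or an immediate is not visible in the bits themselves; in this sketch that decision is left to whatever interprets the opcode.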
There are only two pick slot and pick phase operations, so there is only one bit of opcode. There are 3 belt operands, so 3 morsels are needed. There is an even shorter encoding with only two arguments for the picks that produce Nones in one path.
And again, like in the reader slot, there is not really a need for opcodes here, since the destination fully describes the operation. It needs an additional morsel for the belt operand though.
The operations in the flow stream have completely different requirements. Instead of many operations with a few small arguments, there are few operations with potentially many and large arguments. So there is really only one logical flow block, which encodes all the flow operations by combining 3 physical instruction blocks.
There is of course the normal shift count for the size of the whole instruction. But there is only one operation count.
The operation heads are the most complex part of the flow operation encoding. There is one head for each available flow slot on the core. And each head contains:
- an opcode, the size of which depends on the slot population
- 2 bits of extension count
- 2 bits of manifest size
- 1 bit of manifest complement
Each operation head has 0-3 morsel-sized extensions. Those extensions can serve as extended opcodes, as belt operands, as register selectors, as small immediate values or as whatever else the operation needs.
Manifests are 0, 1, 2 or 4 bytes in size. How they are interpreted depends on the operation. They can be addresses, constants or operand lists. They can even be combined with the extension bits to form larger bit patterns.
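The fields of an operation head can be pulled apart as sketched below. The bit order inside the head and the mapping of the 2-bit manifest size code to 0, 1, 2 and 4 bytes are assumptions made for this example; only the field widths come from the description above. The names `FlowHead`, `decode_flow_head` and `operation_bits` are likewise hypothetical.

```python
from typing import NamedTuple

# Assumed mapping from the 2-bit size code to the manifest size in bytes.
MANIFEST_BYTES = (0, 1, 2, 4)


class FlowHead(NamedTuple):
    opcode: int
    extension_count: int  # 0-3 morsel-sized extensions
    manifest_bytes: int   # 0, 1, 2 or 4
    complement: bool


def decode_flow_head(bits: int, opcode_bits: int) -> FlowHead:
    """Assumed layout, high to low: opcode | ext count | manifest size | complement."""
    complement = bool(bits & 1)
    manifest_code = (bits >> 1) & 0b11
    extension_count = (bits >> 3) & 0b11
    opcode = (bits >> 5) & ((1 << opcode_bits) - 1)
    return FlowHead(opcode, extension_count, MANIFEST_BYTES[manifest_code], complement)


def operation_bits(head: FlowHead, opcode_bits: int, morsel_bits: int) -> int:
    """Total bits one operation occupies: head, extensions and manifest."""
    head_bits = opcode_bits + 2 + 2 + 1
    return head_bits + head.extension_count * morsel_bits + head.manifest_bytes * 8


head = decode_flow_head((0b101 << 5) | (2 << 3) | (3 << 1) | 1, opcode_bits=3)
print(head, operation_bits(head, opcode_bits=3, morsel_bits=4))
```

The second function shows why the head matters for decoding: it alone determines how many extension morsels and manifest bytes belong to the operation.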
A manifest value of 0 takes no additional size, since it is just a zero sized constant.
If the complement bit is set in the head, the bit pattern is inverted to form the manifest value. For example, a zero-length manifest with the complement bit set becomes -1 (0xFFFFFFFF), and a manifest of 1 becomes 0xFFFFFFFE. This results in a very compact encoding for the most commonly used 32-bit address offset bit patterns.
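The complement rule can be checked with a few lines of code. The sketch assumes that the manifest bytes are zero-extended to 32 bits before the inversion and that they are stored little-endian; both are assumptions of this example, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def manifest_value(raw: bytes, complement: bool) -> int:
    """Expand a 0/1/2/4-byte manifest to a 32-bit pattern, inverting if asked."""
    if len(raw) not in (0, 1, 2, 4):
        raise ValueError("manifests are 0, 1, 2 or 4 bytes long")
    value = int.from_bytes(raw, "little")  # assumed byte order, zero-extended
    if complement:
        value = ~value & MASK32
    return value


def as_signed32(value: int) -> int:
    """Read a 32-bit pattern as a two's complement number."""
    return value - (1 << 32) if value & 0x80000000 else value


print(hex(manifest_value(b"", False)))             # 0x0
print(hex(manifest_value(b"", True)))              # 0xffffffff
print(as_signed32(manifest_value(b"", True)))      # -1
print(hex(manifest_value(b"\x01", True)))          # 0xfffffffe
```

Small negative offsets thus cost no more manifest bytes than small positive ones, which is where the compactness comes from.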
The slot counts in the instruction heads always have a few values or value combinations that are not valid slot counts. This would all be wasted entropy if it weren't for the skinny block mechanism. Operations that take no arguments or only implicit arguments can be encoded in those unused value combinations without taking any additional space. The best examples of such operations are NOPs or returns with no return value.