Instruction Encoding

staff

Talk by Ivan Godard – 2013-05-29 at Stanford

Slides: 2013-07-11_mill_cpu_encoding_08 (.pptx)
This talk at Stanford EE380 Computer Systems Colloquium

Instruction Encoding

Instructions can be wide, fast to decode and compact

The military maxim, “Amateurs study tactics, professionals study logistics” applies to CPU architecture as well as to armies. Less than 10% of the area and power budget of modern high-end cores is devoted to real work by the functional units such as adders; the other 90% marshals instructions and data for those units and figures out what to do next.

A large fraction of this logistic overhead comes from instruction fetch and decode. Instruction encoding has subtle and far reaching effects on performance and efficiency throughout a core; for example, the intractable encoding used by x86 instructions is why the x86 will never provide the performance/power of other architectures having friendlier encoding.

Some 80% of executed operations are in loops. A software-pipelined loop has instruction-level parallelism (ILP) bounded only by the number of functional units available and the ability to feed them. The limiting factor is often decode; few modern cores can decode more than four instructions per cycle, and none more than 10. The Mill is a new general-purpose CPU architecture that breaks this barrier; high-end Mill family members can fetch, decode, issue and execute over 30 instructions per cycle.

This talk explains the fetch and decode parts of the Mill architecture.

Will_Edwards

Can you describe how jumps within a frame - e.g. a loop - work, and what kind of latency they have?

ivan

There are no jumps "within" an ebb; by definition ebbs can only be entered at the entry point.

Branch and call operations are flow side and (ignoring code pointers) contain a manifest signed 32-bit offset that uses normal flow-side manifest encoding (and so can occupy from zero to 32 bits in the encoding, depending on value). The offset is relative to the entry point address of the ebb containing the transfer operation.

Transfers using a code pointer take the pointer from the belt and transfer to the indicated address; there is no offset.

Ignoring misprediction, latency of all transfer operations is effectively zero; the cycle after the instruction that contained the transfer will issue the instruction at the target address. Thus a one-instruction loop (a common situation on wider Mills) executes an iteration every cycle.

infogulch

So the transfer op executes in the same cycle as other ops to do something in a loop...

How do you break out of the loop?

What happens when two transfer ops are executed in the same cycle?

akahlich

A consequence of branch offset encoding that Ivan did not point out is that a branch to the entry of the current EBB always has an offset of zero, and therefore requires zero extra offset bits in the encoding. This compact encoding size is independent of the size of the EBB, and makes the encoding of the branch at the bottom of the loop very small for single EBB loops.

ivan

Branch operations can contain an explicit delay, the way loads can; a delay of zero may be omitted, indicating an immediate branch, which takes effect in the following cycle, as described. A delay of one takes effect in the cycle after that, and so on. As a result, there may be several branches in flight at the same time. These all will occur, in their respective cycle.

Instructions may contain several different branch operations. For sensible code, these will normally use different predicates and there will normally be at most one unconditional branch. All predicates are evaluated, and any branches with unsatisfied predicates are ignored.

For any given cycle, there will be a (possibly empty) set of branches which are due to retire in that cycle; some with zero delay from the current instruction, and some from prior instructions that are at last timing out. One of these retiring branches wins, according to the First Winner Rule. FWR says that shorter delay beats longer delay, and for equal delay higher slot number beats lower slot number. Operations are in asm are packed into slots in reverse textual order, so slot zero is on the right of the instruction as written. Consequently you can read the code and the textually first branch that is satisfied will be the one taken; hence First Winner Rule.

As an example:

if (a == 0) goto A0;

else if (a == 1) goto A1;

else goto Arest;

codes as

eql(<a>, 0), eql(<a>, 1);

brtr(b0, "A0"), brtr(b1, "A1"), br("Arest");

With phasing (due in the 2/5 talk) the code can be reduced to a single instruction rather than the two shown here. I use goto for simplicity here, but the same structure is used for any branches.

Loops are no different:
while (--i != 0) { a += a; }
can be encoded as:

L("loop");

sub(<i>, 1);

eql(<sub>, 0);

brtr(<eql>, "xit");

add(<a>, <a>), br("loop")

but would be better coded as:

L("loop");

sub(<i>, 1);

eql(<sub>, 0);

add(<a>, <a>), brfl(<eql>, "loop");

//falls through on exit

Yes, the second form does the add an extra time on the last iteration, but the former value is still on the belt and is the correct value for "a" at the end. The belt permits optimizations like this that are not possible if the a+=a was updating a register. Combine this optimization with phasing and a NYF operation and the whole loop is one instruction.

I'm not sure what a flat-out OOO superscalar would do with this code, but it clearly would not be better than the Mill :-)

ivan

If you need to keep a belt value over a loop then you should run it out to scratchpad before the loop ("spill" op), and bring it back again when you need to use it next ("fill" op). Described in the belt talk.

ivan

How do you break out of a loop? Execute a branch op. (I think I'm not understanding the question).

What happens if two transfers would retire in the same cycle? First Winner Rule applies. FWR will be covered in the 2/5 talk (I think), but short story: there is a canonical ordering that both hardware and code generators (and asm programmers) use.

Symmetry

Is the Mill ISA non-self-synchronizing? That is, is it possible that you could jump to memory address foo and get one valid stream of execution, but also be able to jump to address foo+1 and get an alternative but equally valid stream of execution? I imagine it works this way in the Mill, as it does with x86.

Will_Edwards

Well, it would be staggeringly unlikely to be a meaningful program because the two streams diverge in memory from the entry point; another address e.g. EBB+n would mean that the bit stream fragment between EBB...EBB+n that is one side is also valid instructions to the other side when decoded backwards...

Of course, trying to generate two legal, overlapping EBBs with this property may be a fun exercise for determined readers :)

Symmetry

Finding these isn't neccesarily easy, but the worry is that they allow certain classes of security vulnerabilities like so:
http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html

But again, this is at worst an area where Mill fails to improve on x86.

ivan

In the recommended OS model for Mill these sorts of exploits are impossible. Security on the Mill is the subject of a future talk, but briefly: the Mill is intended to use a nano-kernel style OS structure. There is no "kernel mode" or "user mode", no supervisor, and no privileged operations; "stack cracking" is impossible. Where architecture comes in: there is no performance penalty for a secure system on the Mill.

IMO, putting a user-reachable JIT where it can access memory that it is not intended to use, all in the name of performance, is appalling. The inevitable result is well deserved.

There is no complete technological solution to security; all precautions fall before blackmail, cute agents, and other social penetrations. However, admitting that is not the same as posting a large "enter here" sign on an unlocked door.

LarryP

In the specification talk, Ivan mentioned that the encoding was optimal-entropy encoding, with otherwise-unused bit patterns used to encode common/useful constants. On the Mill, there are separate encodings for the two instruction streams. Does optimal-entropy encoding therefore imply that there are constant-supplying operations in both instruction streams?

Will_Edwards

Yes, we can put immediate values with the ops that want them, on both sides.

ivan

Expanding a bit on Will's answer:

There are several mechanisms to get an immediate into an operation or onto the belt. The reader-phase flow-side con operation can supply any non-NaR scalar value of any width (including 16-byte on Mill family members supporting native quad), but not vector values. The reader-phase exu-side operation rd can supply any of a hardware-defined preselected set of values known as popCons (popular constants), both scalar and vector. Both con and rd were used in the little demo program. Lastly, many op-phase exu-side operations such as add, sub, and the relational ops, both scalar and vector, have immediate forms in which the second argument is a small literal encoded in the operation.

Of these, the exu-side ops, both popCons and immediates, are used as entropy-soaks; popCons for entropy in the readerBlock decode block and immediates for the exuBlock decode block. There is also an entropy-soaker for the exu-side writerBlock decode block, but it is NYF. There is no flow-side entropy-soaker, although one can think of the conBlock block and extBlock block as being entropy-soakers for the ops in the flowBlock block. There is nothing to prevent an entropy-soaker operation in flowBlock itself, but we've never found an operation that is naturally flow-side that can use a (varying) few bits of immediate.

The take-away on this is that Mill encoding is not perfect maximal entropy, but it is pretty close and much denser than any other CPU encoding.

grandinj

I’m curious.

Since instruction decoding is obviously a performance bottleneck, and since it is also obviously hard to parallelise, why does no architecture (to my knowledge) not performance instruction decoding when it loads data into the icache?

It seems reasonable to me that if you are designing a greenfield architecture, you should be able to design in sufficients constraints to make it possible to decode to micro-ops when the instructions are fetched from main memory, and cache those decoded micro-ops in the icache instead of directly caching the original instructions.

Now, this will obviously increase the size of the icache, but it seems to me that there is more opportunity to throw gate-level parallelism at this if it is performed at RAM access time, rather then just before execution.

Symmetry

The Pentium 4 did, but that didn’t end up working very well for them. The problem is that if you have a variable length encoding then the mapping of addresses between the address you’re jumping to and the location of the instruction in the cache becomes complicated. They ended up using a trace cache rather than a traditional cache to solve this.

http://en.wikipedia.org/wiki/Trace_Cache#Trace_cache

Modern Intel processors actually have something like this too, but only as an optional L0 cache for loops.

In contrast, if you have a fixed length encoding then decoding is usually simple enough that the the cache latency or hit rate you lose by turning your instructions from 32 bit pre-decode instructions to the 60 or whatever bit post-decode instruction isn’t worth it at that stage.

LarryP

Must fields within a block of a Mill instruction all be the same width?

The lecture on encoding indicates that all ops encoded in a single block are all of equal width. However, if the op-count for some block is N (!= 0), we know that the block contains operations for slots 0 through N-1 (of the slots for that stream (either exu or flow-side.) So long as the bit-widths for operations in the block are fixed for a given Mill model, the total bit width of this block (and thus the location of the next block) should be trivially calculable from that count and the widths, even if the fields within the block are not of uniform width. And similarly for the existence and location of the alignment hole, used (if present) to encode a compact no-op for the other bundle stream.

Even in the uniform-width case, the decoders must calculate/look up the starting point of block N+1, based on the starting position and op counts of this and previously parsed blocks -- even if that calc is a simple multiply by a constant. So it seems to me that handling non-uniform widths within a block wouldn't significantly complicate block decoding over the uniform-width case. Is my reasoning wrong, and/or are there other reasons why the fields in a block must have uniform widths?

Note: since I posed this question to Ivan (off list, so as not to speculate in a forum post), I now see details in the Wiki (for example http://millcomputing.com/wiki/Cores/Gold that describe different bit widths for both exu and flow slot operation encodings. But, as is often said in the videos, some of the existing docs/videos are simplifications, so I'll be curious what the Millcomputing folks say on this issue.

ivan

In general all count fields described in the talks are not byte counts but shifter tap counts; the taps themselves can be anywhere, and can be separated by any (static) number of bits. The bit layouts in the Wiki reflect this.

So your reasoning is right :-)

ivan

From comp.arch, on how the Mill operation specification machinery works:
On 5/20/2016 8:04 PM, Ivan Godard wrote:

(in the "instruction decoding perfected" thread, to deprecate manual instruction bit-layout):

> In our spec machinery you supply traits for each encodable attribute.
> One of those traits tells the bit-allocator when the decoder will need
> the value. An attribute can be "pinned", in which case it gets its own
> bitfield (which need not be contiguous) in every operation in the slot,
> needed or not; or "direct", in which case it gets its own field, but
> that field is only in the ops that use that attribute and may be in
> different location in different ops; or "merged", in which case it is
> cross-producted with other attribute to make a single larger value-set
> that is encoded as if it were an attribute itself; or "uncoded" in which
> case it doesn't appear in the binary at all but is carried metainfo for
> the specification machinery.

People have asked for me to post Mill details occasionally. In case anyone is interested, as a p.s. on how specification driven layout works here's a cut from the actual source code that declares the operation attributes for Mills. The enumerations for those attributes that are enums are declared elsewhere. An example is:
enum directionCode {
leftward,
rightward
};
which attribute is used in shifts and a few other operations. As you see below, the value of this attribute is given (in conAsm) by the operation mnemonic rather than by an argument ("byMnemonic"); is encoded as part of a cross-product with other attributes ("merged"); and has the same meaning and potential valueset across all Mills ("universal").

To add a new attribute the architect must declare the enum (if there is one), add a line for the attribute to the declares below, and add a traits specification for the possible values; here's the traits for directionCode:

const
attrTraits
attrTraits::insts[intTraits::count] =
{
attrTraits(leftward, "l",
"toward greater significance"),
attrTraits(rightward, "r",
"toward lesser significance")
};

This gives the text that will be used in the mnemonic in conAsm to indicate the desired value ("l"/"r"), plus a short description that is used in online help and other documentation. For a simple attribute like this, the whole process takes around 15 minutes, including adding an operation that uses the attribute and rebuilding the world to generate the new binary encodings.

Here's the specs for all instruction attributes used in the Mill as specified 2016/05/21.

declareEnumeratedAttrTraits(accessCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(awarenessCode, byMnemonic, merged, universal);
declareNumericAttrTraits(base0Code, byParam, uncoded, byMember);
declareEnumeratedAttrTraits(basedCode, byMnemonic, merged, universal);
declareNumericAttrTraits(bit0Code, byParam, direct, byMember);
declareNumericAttrTraits(bit1Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(blockCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(ccGenCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(conBytesCode, byDerivation, pinned, universal);
declareNumericAttrTraits(con0Code, byParam, uncoded, universal);
declareEnumeratedAttrTraits(condSenseCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(conSet, byMnemonic, uncoded, universal);
declareNumericAttrTraits(count0Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(directionCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(domainCode, byMnemonic, merged, universal);
declareNumericAttrTraits(elem0Code, byParam, direct, byMember);
declareNumericAttrTraits(field0Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(exactitudeCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(exclusivityCode, byMnemonic, merged, universal);
declareNumericAttrTraits(extCount, byDerivation, pinned, universal);
declareNumericAttrTraits(fault0Code, byParam, merged, universal);
declareNumericAttrTraits(imm0Code, byParam, direct, bySlot);
declareNumericAttrTraits(lit0Code, byParam, pinned, byMember);
declareNumericAttrTraits(lit1Code, byParam, pinned, byMember);
declareNumericAttrTraits(lit2Code, byParam, pinned, byMember);
declareEnumeratedAttrTraits(memCode, byMnemonic, uncoded, universal);
declareNumericAttrTraits(memAttrCode, byParam, merged, universal);
declareEnumeratedAttrTraits(memorySubOpCode, byMnemonic, merged, universal);
declareNumericAttrTraits(NaRcode, byParam, direct, byMember);
declareEnumeratedAttrTraits(off0Code, byParam, direct, universal);
declareNumericAttrTraits(opand0Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand1Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand2Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand3Code, byParam, pinned, byMember);
declareEnumeratedAttrTraits(opCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(overflowCode, byMnemonic, merged, declareEnumeratedAttrTraits(scaleCode, byParam, direct, byMember);
declareNumericAttrTraits(scratchCode, byParam, direct, byMember);
declareNumericAttrTraits(specr0Code, byParam, direct, byMember);
declareNumericAttrTraits(specw0Code, byParam, direct, byMember);
declareNumericAttrTraits(streamr0Code, byParam, direct, byMember);
declareNumericAttrTraits(streamw0Code, byParam, direct, byMember);
declareNumericAttrTraits(tag0Code, byParam, direct, byMember);
declareNumericAttrTraits(trap0Code, byParam, merged, universal);
declareNumericAttrTraits(widthCode, byParam, direct, byMember);
universal);
declareNumericAttrTraits(pop0Code, byParam, direct, bySlot);
declareEnumeratedAttrTraits(resCount, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(resultCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(roundingCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(shrinkCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(vectorFill, byMnemonic, merged, universal);

Feel free to ask questions.

Obligatory plug: if you'd like to be part of the Mill effort then see http://millcomputing.com/join-us/