Instruction Encoding
Tagged: encoding, instructions, width
Talk by Ivan Godard – 2013-05-29 at Stanford
Slides: 2013-07-11_mill_cpu_encoding_08 (.pptx)
This talk was given at the Stanford EE380 Computer Systems Colloquium.

Instruction Encoding
Instructions can be wide, fast to decode, and compact
The military maxim, “Amateurs study tactics, professionals study logistics” applies to CPU architecture as well as to armies. Less than 10% of the area and power budget of modern high-end cores is devoted to real work by the functional units such as adders; the other 90% marshals instructions and data for those units and figures out what to do next.
A large fraction of this logistic overhead comes from instruction fetch and decode. Instruction encoding has subtle and far-reaching effects on performance and efficiency throughout a core; for example, the intractable encoding used by x86 instructions is why the x86 will never provide the performance/power of other architectures having friendlier encoding.
Some 80% of executed operations are in loops. A software-pipelined loop has instruction-level parallelism (ILP) bounded only by the number of functional units available and the ability to feed them. The limiting factor is often decode; few modern cores can decode more than four instructions per cycle, and none more than 10. The Mill is a new general-purpose CPU architecture that breaks this barrier; high-end Mill family members can fetch, decode, issue and execute over 30 instructions per cycle.
This talk explains the fetch and decode parts of the Mill architecture.
Can you describe how jumps within a frame – e.g. a loop – work, and what kind of latency they have?
There are no jumps “within” an ebb; by definition ebbs can only be entered at the entry point.
Branch and call operations are flow-side and (ignoring code pointers) contain a manifest signed 32-bit offset that uses normal flow-side manifest encoding (and so can occupy from zero to 32 bits in the encoding, depending on value). The offset is relative to the entry point address of the ebb containing the transfer operation.
Transfers using a code pointer take the pointer from the belt and transfer to the indicated address; there is no offset.
Ignoring misprediction, latency of all transfer operations is effectively zero; the cycle after the instruction that contained the transfer will issue the instruction at the target address. Thus a one-instruction loop (a common situation on wider Mills) executes an iteration every cycle.
So the transfer op executes in the same cycle as other ops to do something in a loop…
How do you break out of the loop?
What happens when two transfer ops are executed in the same cycle?
Branch operations can contain an explicit delay, the way loads can; a delay of zero may be omitted, indicating an immediate branch, which takes effect in the following cycle, as described. A delay of one takes effect in the cycle after that, and so on. As a result, there may be several branches in flight at the same time. These will all occur, each in its respective cycle.
Instructions may contain several different branch operations. For sensible code, these will normally use different predicates and there will normally be at most one unconditional branch. All predicates are evaluated, and any branches with unsatisfied predicates are ignored.
For any given cycle, there will be a (possibly empty) set of branches which are due to retire in that cycle; some with zero delay from the current instruction, and some from prior instructions that are at last timing out. One of these retiring branches wins, according to the First Winner Rule. FWR says that shorter delay beats longer delay, and for equal delay higher slot number beats lower slot number. Operations in asm are packed into slots in reverse textual order, so slot zero is on the right of the instruction as written. Consequently you can read the code and the textually first branch that is satisfied will be the one taken; hence First Winner Rule.
As an example:
if (a == 0) goto A0; else if (a == 1) goto A1; else goto Arest;
codes as
eql(<a>, 0), eql(<a>, 1); brtr(b0, "A0"), brtr(b1, "A1"), br("Arest");
With phasing (due in the 2/5 talk) the code can be reduced to a single instruction rather than the two shown here. I use goto for simplicity here, but the same structure is used for any branches.
Loops are no different:
while (--i != 0) { a += a; }
can be encoded as:
L("loop"); sub(<i>, 1); eql(<sub>, 0); brtr(<eql>, "xit"); add(<a>, <a>), br("loop")
but would be better coded as:
L("loop"); sub(<i>, 1); eql(<sub>, 0); add(<a>, <a>), brfl(<eql>, "loop"); //falls through on exit
Yes, the second form does the add an extra time on the last iteration, but the former value is still on the belt and is the correct value for “a” at the end. The belt permits optimizations like this that are not possible if the a+=a was updating a register. Combine this optimization with phasing and a NYF operation and the whole loop is one instruction.
I’m not sure what a flat-out OOO superscalar would do with this code, but it clearly would not be better than the Mill 🙂
How do you break out of a loop? Execute a branch op. (I think I’m not understanding the question).
What happens if two transfers would retire in the same cycle? First Winner Rule applies. FWR will be covered in the 2/5 talk (I think), but short story: there is a canonical ordering that both hardware and code generators (and asm programmers) use.
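For concreteness, here is a toy C++ model of that canonical ordering, i.e. the First Winner Rule as described earlier in the thread (shorter delay beats longer delay; for equal delay, higher slot beats lower). The struct and names are my own illustration, not hardware or tool source:

#include <cstdint>
#include <optional>
#include <vector>

// One branch whose predicate was satisfied and whose delay expires this cycle.
struct RetiringBranch {
    uint32_t delay;   // the delay the branch was encoded with (0 = immediate)
    uint32_t slot;    // slot number; slot 0 is textually rightmost in asm
    uint64_t target;  // address of the target ebb entry point
};

// First Winner Rule: shorter delay beats longer delay; for equal delay,
// higher slot number beats lower (i.e. the textually earlier branch wins).
std::optional<uint64_t> firstWinner(const std::vector<RetiringBranch>& retiring) {
    std::optional<RetiringBranch> best;
    for (const auto& b : retiring) {
        if (!best ||
            b.delay < best->delay ||
            (b.delay == best->delay && b.slot > best->slot))
            best = b;
    }
    if (!best) return std::nullopt;  // empty set: no transfer, fall through
    return best->target;
}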
A consequence of branch offset encoding that Ivan did not point out is that a branch to the entry of the current EBB always has an offset of zero, and therefore requires zero extra offset bits in the encoding. This compact encoding size is independent of the size of the EBB, and makes the encoding of the branch at the bottom of the loop very small for single EBB loops.
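To make that concrete, here is a small sketch of the arithmetic (my own illustration; the real manifest granularity and limits are member-specific and not given in the talk):

#include <cstdint>

// Offsets are relative to the entry point of the ebb containing the branch,
// not to the branch operation itself.
int64_t branchOffset(uint64_t targetEbbEntry, uint64_t currentEbbEntry) {
    return static_cast<int64_t>(targetEbbEntry) -
           static_cast<int64_t>(currentEbbEntry);
}

// Assumed, simplified model of a variable-length signed manifest: the field
// occupies only as many bits as the value needs, so a value of zero needs none.
unsigned manifestBitsNeeded(int64_t offset) {
    if (offset == 0) return 0;            // the loop-back-to-own-entry case
    unsigned bits = 1;                    // at least a sign bit
    while (offset != 0 && offset != -1) { // stop when only sign extension remains
        offset >>= 1;
        ++bits;
    }
    return bits;
}

So the branch at the bottom of a single-EBB loop targets its own entry, gets offset 0, and pays nothing for the offset regardless of how large the EBB body is.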
If you need to keep a belt value over a loop then you should run it out to scratchpad before the loop (“spill” op), and bring it back again when you need to use it next (“fill” op). Described in the belt talk.
Is the Mill ISA non-self-synchronizing? That is, is it possible that you could jump to memory address foo and get one valid stream of execution, but also be able to jump to address foo+1 and get an alternative but equally valid stream of execution? I imagine it works this way in the Mill, as it does with x86.
Well, it would be staggeringly unlikely to be a meaningful program, because the two streams diverge in memory from the entry point; entering at another address, say EBB+n, would mean that the bit-stream fragment between EBB and EBB+n that is valid instructions for one side must also be valid instructions for the other side when decoded in the opposite direction…
Of course, trying to generate two legal, overlapping EBBs with this property may be a fun exercise for determined readers 🙂
Finding these isn't necessarily easy, but the worry is that they allow certain classes of security vulnerabilities, like so:
http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
But again, this is at worst an area where the Mill fails to improve on x86.
In the recommended OS model for Mill these sorts of exploits are impossible. Security on the Mill is the subject of a future talk, but briefly: the Mill is intended to use a nano-kernel style OS structure. There is no “kernel mode” or “user mode”, no supervisor, and no privileged operations; “stack cracking” is impossible. Where architecture comes in: there is no performance penalty for a secure system on the Mill.
IMO, putting a user-reachable JIT where it can access memory that it is not intended to use, all in the name of performance, is appalling. The inevitable result is well deserved.
There is no complete technological solution to security; all precautions fall before blackmail, cute agents, and other social penetrations. However, admitting that is not the same as posting a large “enter here” sign on an unlocked door.
In the specification talk, Ivan mentioned that the encoding was optimal-entropy encoding, with otherwise-unused bit patterns used to encode common/useful constants. On the Mill, there are separate encodings for the two instruction streams. Does optimal-entropy encoding therefore imply that there are constant-supplying operations in both instruction streams?
Yes, we can put immediate values with the ops that want them, on both sides.
Expanding a bit on Will’s answer:
There are several mechanisms to get an immediate into an operation or onto the belt. The reader-phase flow-side con operation can supply any non-NaR scalar value of any width (including 16-byte on Mill family members supporting native quad), but not vector values. The reader-phase exu-side operation rd can supply any of a hardware-defined preselected set of values known as popCons (popular constants), both scalar and vector. Both con and rd were used in the little demo program. Lastly, many op-phase exu-side operations such as add, sub, and the relational ops, both scalar and vector, have immediate forms in which the second argument is a small literal encoded in the operation.
Of these, the exu-side ops, both popCons and immediates, are used as entropy-soaks; popCons for entropy in the readerBlock decode block and immediates for the exuBlock decode block. There is also an entropy-soaker for the exu-side writerBlock decode block, but it is NYF. There is no flow-side entropy-soaker, although one can think of the conBlock block and extBlock block as being entropy-soakers for the ops in the flowBlock block. There is nothing to prevent an entropy-soaker operation in flowBlock itself, but we’ve never found an operation that is naturally flow-side that can use a (varying) few bits of immediate.
The take-away on this is that Mill encoding is not perfect maximal entropy, but it is pretty close and much denser than any other CPU encoding.
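As a purely illustrative sketch, a code generator choosing among the three mechanisms above might look something like this (the immediate range and the popCon set are invented here; the real specializer logic is certainly more involved):

#include <cstdint>
#include <unordered_set>

enum class ConstForm { exuImmediate, exuRdPopCon, flowCon };

// value:              the constant to materialize
// isSecondArgOfExuOp: whether it feeds the immediate-capable second argument
//                     of an exu-side op such as add, sub, or a relational
// popCons:            the member's hardware-defined set of popular constants
ConstForm chooseConstForm(int64_t value,
                          bool isSecondArgOfExuOp,
                          const std::unordered_set<int64_t>& popCons) {
    // Many exu-side ops have immediate forms with a small literal second arg.
    if (isSecondArgOfExuOp && value >= 0 && value < 16)  // range is assumed
        return ConstForm::exuImmediate;
    // rd supplies one of the preselected popular constants (popCons).
    if (popCons.count(value) != 0)
        return ConstForm::exuRdPopCon;
    // Otherwise the flow-side con op can supply any non-NaR scalar value.
    return ConstForm::flowCon;
}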
I’m curious.
Since instruction decoding is obviously a performance bottleneck, and since it is also obviously hard to parallelise, why does no architecture (to my knowledge) perform instruction decoding when it loads data into the icache?
It seems reasonable to me that if you are designing a greenfield architecture, you should be able to design in sufficient constraints to make it possible to decode to micro-ops when the instructions are fetched from main memory, and cache those decoded micro-ops in the icache instead of directly caching the original instructions.
Now, this will obviously increase the size of the icache, but it seems to me that there is more opportunity to throw gate-level parallelism at this if it is performed at RAM access time, rather than just before execution.
The Pentium 4 did, but that didn’t end up working very well for them. The problem is that if you have a variable length encoding then the mapping of addresses between the address you’re jumping to and the location of the instruction in the cache becomes complicated. They ended up using a trace cache rather than a traditional cache to solve this.
http://en.wikipedia.org/wiki/Trace_Cache#Trace_cache
Modern Intel processors actually have something like this too, but only as an optional L0 cache for loops.
In contrast, if you have a fixed-length encoding then decoding is usually simple enough that the cache latency or hit rate you lose by turning your 32-bit pre-decode instructions into 60-or-whatever-bit post-decode instructions isn't worth it at that stage.
Must fields within a block of a Mill instruction all be the same width?
The lecture on encoding indicates that all ops encoded in a single block are of equal width. However, if the op-count for some block is N (!= 0), we know that the block contains operations for slots 0 through N-1 (of the slots for that stream, either exu- or flow-side). So long as the bit-widths for operations in the block are fixed for a given Mill model, the total bit width of this block (and thus the location of the next block) should be trivially calculable from that count and the widths, even if the fields within the block are not of uniform width. And similarly for the existence and location of the alignment hole, used (if present) to encode a compact no-op for the other bundle stream.
Even in the uniform-width case, the decoders must calculate/look up the starting point of block N+1, based on the starting position and op counts of this and previously parsed blocks — even if that calc is a simple multiply by a constant. So it seems to me that handling non-uniform widths within a block wouldn’t significantly complicate block decoding over the uniform-width case. Is my reasoning wrong, and/or are there other reasons why the fields in a block must have uniform widths?
Note: since I posed this question to Ivan (off list, so as not to speculate in a forum post), I now see details in the Wiki (for example, http://millcomputing.com/wiki/Cores/Gold) that describe different bit widths for both exu and flow slot operation encodings. But, as is often said in the videos, some of the existing docs/videos are simplifications, so I'll be curious what the Millcomputing folks say on this issue.
In general all count fields described in the talks are not byte counts but shifter tap counts; the taps themselves can be anywhere, and can be separated by any (static) number of bits. The bit layouts in the Wiki reflect this.
So your reasoning is right 🙂
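In code, "counts are shifter tap counts" might look roughly like this (illustrative only; the widths below are invented, and real members fix them per slot at specification time):

#include <cstddef>
#include <vector>

// A block with op count N holds ops for slots 0..N-1, so its total width is
// the prefix sum of those (possibly non-uniform) slot widths. The prefix sums
// are exactly the shifter taps: the decoder selects tap[N] to find where the
// next block starts, with no per-op multiply needed.
std::vector<unsigned> shifterTaps(const std::vector<unsigned>& slotBitWidths) {
    std::vector<unsigned> taps(slotBitWidths.size() + 1, 0);
    for (std::size_t i = 0; i < slotBitWidths.size(); ++i)
        taps[i + 1] = taps[i] + slotBitWidths[i];  // tap[N] = bits in a block of N ops
    return taps;
}

// Example: slot widths {17, 17, 23, 23} give taps {0, 17, 34, 57, 80};
// an op count of 3 means the block is 57 bits wide.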
From comp.arch, on how the Mill operation specification machinery works:
On 5/20/2016 8:04 PM, Ivan Godard wrote (in the "instruction decoding perfected" thread, to deprecate manual instruction bit-layout):
> In our spec machinery you supply traits for each encodable attribute.
> One of those traits tells the bit-allocator when the decoder will need
> the value. An attribute can be “pinned”, in which case it gets its own
> bitfield (which need not be contiguous) in every operation in the slot,
> needed or not; or “direct”, in which case it gets its own field, but
> that field is only in the ops that use that attribute and may be in a
> different location in different ops; or “merged”, in which case it is
> cross-producted with other attributes to make a single larger value-set
> that is encoded as if it were an attribute itself; or “uncoded” in which
> case it doesn't appear in the binary at all but is carried as metainfo for
> the specification machinery.

People have asked me to post Mill details occasionally. In case anyone is interested, as a p.s. on how specification-driven layout works, here's a cut from the actual source code that declares the operation attributes for Mills. The enumerations for those attributes that are enums are declared elsewhere. An example is:
enum directionCode {
leftward,
rightward
};
This attribute is used in shifts and a few other operations. As you see below, the value of this attribute is given (in conAsm) by the operation mnemonic rather than by an argument ("byMnemonic"); it is encoded as part of a cross-product with other attributes ("merged"); and it has the same meaning and potential value set across all Mills ("universal").

To add a new attribute, the architect must declare the enum (if there is one), add a line for the attribute to the declares below, and add a traits specification for the possible values; here's the traits for directionCode:
const
attrTraits
attrTraits::insts[intTraits::count] =
    {
    attrTraits(leftward, "l",
        "toward greater significance"),
    attrTraits(rightward, "r",
        "toward lesser significance")
    };

This gives the text that will be used in the mnemonic in conAsm to indicate the desired value ("l"/"r"), plus a short description that is used in online help and other documentation. For a simple attribute like this, the whole process takes around 15 minutes, including adding an operation that uses the attribute and rebuilding the world to generate the new binary encodings.
Here are the specs for all instruction attributes used in the Mill as specified 2016/05/21.
declareEnumeratedAttrTraits(accessCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(awarenessCode, byMnemonic, merged, universal);
declareNumericAttrTraits(base0Code, byParam, uncoded, byMember);
declareEnumeratedAttrTraits(basedCode, byMnemonic, merged, universal);
declareNumericAttrTraits(bit0Code, byParam, direct, byMember);
declareNumericAttrTraits(bit1Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(blockCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(ccGenCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(conBytesCode, byDerivation, pinned, universal);
declareNumericAttrTraits(con0Code, byParam, uncoded, universal);
declareEnumeratedAttrTraits(condSenseCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(conSet, byMnemonic, uncoded, universal);
declareNumericAttrTraits(count0Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(directionCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(domainCode, byMnemonic, merged, universal);
declareNumericAttrTraits(elem0Code, byParam, direct, byMember);
declareNumericAttrTraits(field0Code, byParam, direct, byMember);
declareEnumeratedAttrTraits(exactitudeCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(exclusivityCode, byMnemonic, merged, universal);
declareNumericAttrTraits(extCount, byDerivation, pinned, universal);
declareNumericAttrTraits(fault0Code, byParam, merged, universal);
declareNumericAttrTraits(imm0Code, byParam, direct, bySlot);
declareNumericAttrTraits(lit0Code, byParam, pinned, byMember);
declareNumericAttrTraits(lit1Code, byParam, pinned, byMember);
declareNumericAttrTraits(lit2Code, byParam, pinned, byMember);
declareEnumeratedAttrTraits(memCode, byMnemonic, uncoded, universal);
declareNumericAttrTraits(memAttrCode, byParam, merged, universal);
declareEnumeratedAttrTraits(memorySubOpCode, byMnemonic, merged, universal);
declareNumericAttrTraits(NaRcode, byParam, direct, byMember);
declareEnumeratedAttrTraits(off0Code, byParam, direct, universal);
declareNumericAttrTraits(opand0Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand1Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand2Code, byParam, pinned, byMember);
declareNumericAttrTraits(opand3Code, byParam, pinned, byMember);
declareEnumeratedAttrTraits(opCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(overflowCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(scaleCode, byParam, direct, byMember);
declareNumericAttrTraits(scratchCode, byParam, direct, byMember);
declareNumericAttrTraits(specr0Code, byParam, direct, byMember);
declareNumericAttrTraits(specw0Code, byParam, direct, byMember);
declareNumericAttrTraits(streamr0Code, byParam, direct, byMember);
declareNumericAttrTraits(streamw0Code, byParam, direct, byMember);
declareNumericAttrTraits(tag0Code, byParam, direct, byMember);
declareNumericAttrTraits(trap0Code, byParam, merged, universal);
declareNumericAttrTraits(widthCode, byParam, direct, byMember);
declareNumericAttrTraits(pop0Code, byParam, direct, bySlot);
declareEnumeratedAttrTraits(resCount, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(resultCode, byMnemonic, uncoded, universal);
declareEnumeratedAttrTraits(roundingCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(shrinkCode, byMnemonic, merged, universal);
declareEnumeratedAttrTraits(vectorFill, byMnemonic, merged, universal);

Feel free to ask questions.
Obligatory plug: if you’d like to be part of the Mill effort then see http://millcomputing.com/join-us/
At 33:24 in the video, it seems like both decoders will see non-zero lag counts at the same time: 1 on the left side, and 2 on the right side. Since there are no explicit no-ops here, each of these non-zero-lag instructions must immediately follow the previous instructions in memory.
So how does the machine figure out that the right side NOPs first, and the left side NOPs second? Is there some canonical order? Is this encoded? Or is it something clever? 🙂
The running lag on each side is maintained in an internal register on that side; the two registers are independently counted down each issue clock and are reset up as added lag is decoded. The difference in the contents of the lag registers tells which side restarts first, or rather that is the effect, because each side is independent and restarts when its private lag runs out. Of course, if the registers have the same value they restart together. In your example, because both are non-zero they both stall; the left counts down to zero and the right to one. The next cycle the left (at zero) issues and the right counts down to zero. The cycle after that they both issue.
This of course leaves the question of how to know the running lags at some random point of the code. The call operation saves the lags and return restores them, which takes care of calls. However, the debugger is more of a problem. The compiler statically knows when each half-instruction issues relative to the half-instruction on the other side and can put the relative lag in the debug info. However, if both sides are lagging, there's no way (at present) to tell the debugger/sim to start up (or examine) one, two, … cycles before lag-out and issue resumption; instead, if the point you select falls in the middle of bilateral lagging, you see the program as of the first actual issue after that point. We could add a command to the UI to let the user set the lag point, but it doesn't seem to be actually useful and we haven't.
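For the software-minded, here is a toy model of the two per-side lag registers described above (purely illustrative; the hardware is of course not C++, and the save/restore of lags around calls is omitted):

#include <cstdint>

struct SideLag {
    uint32_t lag = 0;                   // remaining stall cycles for this side

    void addDecodedLag(uint32_t n) {    // "reset up as added lag is decoded"
        lag += n;
    }

    bool tick() {                       // called once per issue clock
        if (lag == 0) return true;      // no pending lag: this side issues
        --lag;                          // otherwise count down and stay silent
        return false;
    }
};

// With left.lag == 1 and right.lag == 2: both sides stall one cycle, the left
// issues on the next cycle while the right counts to zero, and the cycle after
// that both sides issue together, matching the description above.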
So I have been taking a class on computer architecture (I am a software guy). The more I learn, the more in awe I am of the beauty of the Mill instruction encoding and other features.
CISC sucks. It needs millions of transistors just to decode 6 instructions.
RISC is a clear improvement, but the superscalar OoO design is ridiculously complicated. As I learn about the Tomasulo algorithm, wide issue/decode, and speculative execution, I can't help but think "this is insane, there has to be a better way". It feels like the wrong path.
VLIW seems like a more reasonable approach. I know binary compatibility problems and stalls have been a challenge for VLIW architectures.

The Mill is just beautiful: it has the sane encoding and simplicity of a VLIW. But phasing and the double instruction stream really take it to the next level.
The separate load issue and retire is, in hindsight, the obvious way to solve the stalls due to memory latency that are so common in VLIW.
The branch predictor is so cool too: you can predict several EBBs in advance, even before you start execution. Mainstream predictors have to wait until they get to the branch instruction.
The specializer is a neat solution to binary compatibility. I really hope to see this CPU make it to silicon.
So do we 🙂
Would your instructor welcome a class presentation by you about the Mill?
As a programmer of many decades the parts I like most about what I’ve learned about the mill are the innovations around memory. The implicit zero for stack frames is a thing of beauty. You get a strong guarantee of initialization that’s actually faster for the hardware.
Pushing the TLB to the periphery is also genius. A 64-bit address space really is a tremendous amount. We all “know” that statements like “640k is enough for anyone” are laughably short-lived, but that’s only a joke until it isn’t. If “enough is as good as a feast”, then an exponential number of feasts must truly be enough for anyone. That one restriction of living in a single 64-bit address space yields so many benefits if you take advantage of it in the right way. You just have to have a wider perspective and a willingness to rethink past decisions (i.e. separate address spaces for each process).
That's just a few of my favorites (NaR bit? Implicitly loading zero? Static scheduling with the performance of OoO?). There have been so many times when learning about the Mill architecture that I've had that a-ha moment. Once a problem is framed correctly the solutions seem obvious. It reminds me of the joke about the two mathematicians who meet at a conference and are both working on the same problem. They get to talking in the bar at night and start working through it together. They work until the bar closes and then continue to work in the lobby until well into the next day. Finally, they make a breakthrough and one says to the other "Oh! I see! It's obvious!"