staff
Slides:
PowerPoint (.pptx)
Metadata in the Mill CPU
Smarter data for performance and power
The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope. It achieves DSP-like power/performance on general purpose codes, without reprogramming. The Mill is a wide-issue, statically scheduled design with exposed pipeline. High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, sustained. The pipeline is very short, with a mispredict penalty of only four cycles.
To support such sustained performance, the Mill conveys some of the semantics of execution in the form of operand metadata. For example, size metadata bits attached to each operand eliminate the need for redundant opcodes that serve only to encode size metadata. Another example is the NaR bit, for “not a result”, which among other uses allows improved smart exception handling and allows the Mill to encode the novel “None” singleton operand, which enables smaller and faster generated code. Metadata propagates through execution, following rules specified by the architecture.
Use of metadata provides a number of advantages to the architecture:
- Metadata reduces the number of distinct opcodes by a factor of seven.
- Metadata enables speculative execution without fix-up code.
- Metadata eliminates flag-control overheads in floating point.
- Metadata permits vectorizing of while-loops.
The talk describes these and other technical aspects of metadata and speculation in the Mill design.
[deleted]
Very interesting! I have been playing with a processor design where ALU status flags are pre-register (i.e. registers are 33 bits, including the carry bit, plus an overflow flag), but I forgot to think how it could be applied to floating point and vectors!
Intel has their KNI instruction set with vector mask registers, but the idea of just integrating them with the registers themselves seems to work very well. Especially as pick itself is actally a 4-way choice. Each selection boolean can be 0, 1, None, or NaR.
I presume you have 6 status bits per 32 bits, which either gives you one NaR bit per byte (with "None" encoded as a type of NaR), or one NaR bit and 5 IEEE flags per 32-bit float.
(Um... or do ypu have 6 bits per half-precision?)
The question someone asked (youtube@35:20) about how you manage vector pick in the bypass system is definitely something I'd like a more detailed explanation of. For scalars, it's obviously simple renaming. For vectors, do you rename each byte lane separately? Even if you did, the sources would eventally fill up the bypass registers abd break your "the are always enough output latches" assumption) and then what? Do you bypass in 0 cycles and realize in 1 cycle in parallel?
The other thing I wonder about is what happens if you feed incompatible data to an ALU, like a vector of bytes in one side and a vector of 32-bit words in the other? Fault, interpret as per operand 1, or what?
What happens if you widen a maximum-width scalar? Narrow a minimum-width?
Another question is whether scalar and vector widen share the same opcode? The talk stresses (@20:25) "there's only one narrow op, and only one widen op" but having the number of results dropped on the belt (1 for scalar widen, 2 for vector) depend on the input metadata seems perverse.
Can you do vector/scalar ops directly, or is there a required "broadcast" instruction? Pick seems to support vector/sclar operation; do others?
The 2-output smear seems like an unnecessary extra write port. I realize you don't have regular register file write ports, but it's still an extra bypass path.
Did you consider either of these two options:
1. doing away with smearx, and instead having a pickx operation which shifts the control word. The most significant part of the control word is not used by pickx and can be branched on to control the surrounding loop.
2. Having a smearx which rotates rather than shifts the vector, and a pick0 which ignores the lsbit of its control vector input. A branch can look at the lsbit of the control vector, which is ignored by pick0. (This is a little uglier, but might make 0-cycle pick easier to implement.)
I also note that the strcpy() is using unaligned loads. Have you considered having an "aligned partial load" that fills the unwanted leading bytes with None?
(BTW, bragging about how the Mill's 5-cycle mispredict penalty is better than other chips' 20-30 is specious. Those chips execute 2-3 instructions per cycle, so the penalty is ~60 instructions. The Mill's slower clock means that the time delay is not much less, and it may easily be more instructions.)
igodard
Note: Easier to reply if you put the questons in separate postings :-)
Metadata bit count: implementation dependent. While stealing NaR bits (one per byte) for other use in wider data (such as the FP flags) would save bits, it would complicate the hardware logic; generally bits are cheap.
Four-case pick: None or NaR selectors simply pass through, using the data width rather than the selector width. Implementation defined whether the None/NaR payload passes through or a new payload is created - the Belt crossbar is clock-critical, and pick must be very fast.
Vector-wise pick: passing the data through directly to the consumer is straightforward; just muxing. Creating the additional operand on the belt is more complicated, but Not Filed Yet.
Mixed-width arguments: each operation has a set of width signatures specified for what it will accept (see the next talk, Specification). If it gets something it doesn't want then it used to produce a NaR, but we changed it a year ago so it faults immediately; wrong width indicates a compiler bug, not a data problem like overflow.
Excess widen/narrow: same as width-error above.
"only one opcode": "only one opcode per signature" is more correct, but that might take us too far afield in a talk. For widening there are two ops, one for all scalar widths and one for all vector widths. For narrowing there is only one op but with two signatures, one with one argument (scalar) and one with two (vector). The same is true for other widening operations, such as add with the "widen" overflow attribute.
"broadcast" operation: there isn't one. There is a "splat" operation that replicates a scalar into a vector, but that's not the same. Just do "add" and it's happy with either vector or scalar data, whichever it gets. There's only one adder; the width just sets breaks in the carry tree.
"smearx": this does have two results, which impacts the operation latency. The second port exists anyway; the Mill has a lot of two-result operations (such as vector widen). The talk doesn't address latency because it couldn't cover pipelining (May? maybe), but a loop written as shown, without pipelining and phasing, would need a no-op after the smearx. With phasing and pipelining the whole loop is only one cycle anyway, but that's two more talks worth of explanation :-(
"pickx": not advantageous. It would add another mux to the belt crossbar, slowing the clock rate. In addition, while not used in the example, the bool vector is useful in other loops for more than pick.
rotating smear: Both the smear and the pick0 should be easy, and I don't think that the pick0 would have the clock cost that pickx does. However, the loop control bool is at the wrong end of the bool vector from where smeari puts it, so there would need to be an extract operation to get it out to a scalar before it can be branched on, so that's the same second cycle and belt position that the present definition of smearx uses, so no gain.
aligned loads: we have looked at quite a few possibilities in this area, with a goal of getting good performance on funnel-shifting a data stream. It's quite hard to do that without branches, and we are not entirely happy with the present approach (Not Filed Yet)
clock rate: While for business and risk-reduction reasons the initial Mills will have a low clock, there is nothing in the architecture that precludes getting the same clock rate as any other chip.
mispredict penalty: While you are right that a high-end Mill mispredict stall can lead to loss of as many operation issues as a superscalar with a longer recovery would have, on the lower-end Mill family members (with peak issue widths no greater than conventional machines) the Mill advantage is real. We don't claim that the Mill is always better at everything; we claim that it is always no worse and very frequently much better :-)
An aside: I'm impressed with how far you have taken the overview given in the talk. The Mill boggles quite a few people :-) I hope you will continue to hang around the forum here.
[deleted]
I've been following the talks (from a long way away, unfortunately) quite carefully, waiting for the really interesting stuff. Finally things are starting to come together, although there are still quite a few holes. In particular the spiller and how it captures and replays in-flight operations at the time of a call.
> only one opcode per signature
That's what I thought, I was just making sure.
> “broadcast” operation: there isn’t one. There is a “splat” operation that replicates a scalar into a vector, but that’s not the same. Just do “add” and it’s happy with either vector or scalar data, whichever it gets. There’s only one adder; the width just sets breaks in the carry tree.
No, "splat" sounds exactly what I meant "broadacst" to be. A functional unit could also do this implicitly if one operand was a scalar of the appropriate side, but it's a lot of wiring and presumably not worth it.
> However, the loop control bool is at the wrong end of the bool vector from where smeari puts it,
Um, huh? So you're big-endian? I was assuming little-endian, in which case the unused "first" byte result would also be the overall lsbit (you did say that boolean was the lsbit of the field, right?), which would be in the same place as a scalar, and thus presumably in the right place to branch on directly, with no extract.
thebellster
Something interesting that I don't think you've yet addressed yet it's how to deal with variable latency instructions (aside from loads). For example, floating point ops can be orders of magnitude slower than usual when the operands are subnormals. How does the mill deal with these sorts of corner cases, besides stalling? Does it use the same method as loads, or something else?
ivan
Nancy Reagan was right: "Just say no!".
There are no variable-latency operations except load. Even on a conventional ooo MACHINE, early-out doesn't pay - it mucks up the bypass and scheduler.
Ivan
ivan
Implicit splat: We used to have implicit splat; makes the compiler easier. But when the hardware crew started on the implementation it turned out to be a big hit on the clock rate, paid by all operations whether used or not. So it was taken back out.
Endianness: Internally little-endian, your choice for memory.
Bool vector from rotating smear: The bool vector is not a bit vector, it's a vector of ordinary operand which happen to be either zero or one; the width of each bool element is whatever the width metadata says it is, just like for ints. As a special case, a vector argument to a conditional branch tests the element that corresponds to the highest memory address if the vector is loaded from memory (or computed from something that was loaded). This saves an extract operation in the common case of memory-upwards loops and smeari. For memory-downwards loops (less common) the explicit extract would be necessary to get a testable bool. Instead, smearil (left inclusive smear) produces an already-extracted exit condition bool, the way that smearxr (smear right exclusive) does, as shown in the video.
Smearx requires either a second, scalar, result (as shown), or a rotating smear followed by an extract (at no gain over the existing), or branch variants that test each end of the bool vector (doable but branches are already pretty cluttered). Sort of by definition smear is only useful in loops, and in a pipelined loop latency is irrelevant, so the extra cycle doesn't bother us
[deleted]
Thebellster: The Mill really requires fixed latency. The load technique could be used, but would be a huge PITA.
Although many FPUs have slow special-case handling of denormals, it is possible to handle them at full speed. The Am29027 FPU did it, as do some recent IBM POWER processors. And the nVidia Fermi GPU, for reasons very similar to the Mill: variable latency would add more complexity than it saves. It just takes a bit more implementation effort.
rmmh
None/NaR are neat ideas. The demonstration of them working together with the vector operations to make high-performance strcpy simple is very impressive. Using NaRs to delay faulting until it becomes an externally observable effect is particularly clever, and None closely mirrors a compiler's reasoning about undefined behavior.
Allowing arithmetic operations to have truncate/fault/saturate/widen semantics in combination with type-tagged data should make vector assembly more readable, assuming it includes type comments. Not having penalties for moving vectors between ALU and FPU units will be useful.
1. How are FPU rounding modes handled? A single control register? Encoded in each instruction? (I'm guessing control register)
2. Is casting a belt item a 0-cycle flow-side operation? If I start with a vector of 4-byte elements {0, 1, NaR, 2} and try to cast it to a vector of 8-byte elements, do I get {0x100000000, NaR}, or a fault?
NaRs might cause strange behavior if there's a long-running loop that generates a NaR in an early iteration and carries it forward but doesn't "retire" it until after exiting the loop. Compilers should detect this case and insert an explicit NaR check when faults might retire a long time after occurring.
I hope the belt timings work out when the hardware starts being finalized. There's a lot of flow-side operations going on that border on magical (vector pick??).
cypherpunks: Intel CPUs since Sandy Bridge have also had constant-latency FPU operations.
ivan
Ryan:
FP rounding modes: modes are in the operation, indicated by the mnemonic (one of the choices does pick up from the PCSW). This is useful when you have FP ops with different rounding in the same instruction, for example in interval arithmetic.
vector casts: there is no such operation as your example; all widen and narrow ops are cardinality-preserving (N->N). We could narrow your four words to four shorts, but what value would you expect in the other four shorts of the result vector? However, there is a vector narrow that narrows two vectors to one with half-size elements. Thus 2X4Xword->8Xshort (i.e. 8->8). You can widen or narrow Nones and NaRs like any other data. A narrow that overflows gives you the same truncate/except/saturate choice as any other overflow (the fourth choice, double width result, doesn't make much sense when narrowing and doesn't exist).
Belt timings: all picks, including vector pick, are exu-side encoded. After the decoding there's really no such thing as a "side" any more; execution itself is a collection of FU pipe with no particular "side".
David
Is there a carry flag, perhaps like the floating point metadata flags?
I didn't see one mentioned in the addu variations, and none of them directly showed "wrapped value with indication that a wrap happened" (bit-banging an adduw result notwithstanding).
Thanks!
ivan
There are no condition codes as such; it's kind of hard to do when you can have half-a-dozen CC-setting operations in each instruction, although some architectures try, using multiple CC registers.
To have the effect of a carry bit, if you know that no argument was a Nar and you do an addux and get a NaR out, then you must have gotten an overflow, but that's pretty kludgy. Or you could do an adduw and test the first bit in the extended data, but that's pretty kludgy too.
However, there is an alternative: the current operation set includes addc and subc, with two results: a wrapped value and a carry/borrow. It is not clear whether these ops will stay around though; the only use is in an arbitrary-precision library, and for quad arithmetic on members that don't support native quad; neither of these seem that common, and it's not clear that the ops (which would have two-cycle latency) are worth it.
We expect a general triage of the operations set later in the process of an FPGA implementation.
[deleted]
Well, you have floating-point condition codes; is there a reason those bits can't be output by the integer ALU as well? At least for "large-enough" word sizes, so you don't need a carry bit per byte. Doing it using 2-input adds is a bit awkward (think about -2+1+carry and -1+1+carry), but an "add operand 2's carries to operand 1's data, and OR (or add) the resultant carries with operand 1's carries" instruction would do the trick, I think.
I'd output carry and signed overflow bits, and handle overflow saturation in a separate instruction. You could also do a separate "divide by 2" instruction to compute averages without overflow if that was easier than another add opcode.
ivan
The FP error flags are still live until realized by a non-speculative operation (or fall off the belt), so it is not possible to reuse them for integer. Consider:
float f = 10.0;
int i = int(f/3.0) + 0x7fffffff;
The float divide will set the inexact flag, while the add will integer overflow. Neither is real until out of speculative context, so both must be preserved. There are many (including me) who feel that the inexact flag is bogus, should never have been introduced, and should be deprecated now, but as a commercial product we have to live with the stands we have, and in any case code could have set the other flags instead.
As for "large enough", a major focus of the Mill design is to avoid corner cases, especially those that get fobbed off on the compiler. We have tried hard to make everything in the operation set uniform. You have identified one case where we failed: the FP flags only exist for FP-sized values, so code like:
float f = ..;
uint8_t b = f < 0.0;
is a problem, because the float flags propagate through the eql operation and subsequent narrowing to a one-byte value. This is not a problem in scalar, because there is a flag-set in the operand even though it is small. But it is a problem with vector narrowing, because we couldn't see paying for five bits of flags in every byte of a vector. Hence we gritted our teeth and defined vector narrow to be a realizing operation that cannot be speculated. :-(
It would certainly be possible to have a carry metabit in each byte, and we considered it. After all, there is already a NaR bit in each byte. However, we could not find enough usage for such a design to justify the cost. Perhaps we overlooked some important use for carry - did you have a use-case in mind?
David
The majority of my assembly language background is in 8-bit CPUs (6502) where Carry is widely used, so at first it's jarring to see it missing. However, much of it is based in mathematical optimization and comparison tricks because of the narrow register width. With a reasonable register width and richer mathematical instructions, that becomes much less of an issue.
It's also used for bit-packing with shifts, but of course is limited to 1-bit sizes when relying on Carry.
Marshaling and [de]serialization is one aspect that's always a thorn in the side of execution speed, so being able to directly shift various N-bit-wide values in & out of wider numbers with under/overflow detection would be nice. Of course, if the individual steps in such an operation can be automatically executed in parallel, then the need for more powerful instructions goes away.
There are quite a few ABIs that use Carry as an in/out boolean flag, or side-band error return indicator. This is, of course, a result of registers being "expensive", due to fixed register files. If belt slots are "cheaper", then those can become normal parameters. However, there still is pressure on spilling that comes with this.
One interesting scenario that comes to mind is the SBCL compiler for Common Lisp. When calling a function in Lisp, it is unknown how many values it will return. In SBCL x86-64, a single valued return returns with carry clear. For multi-valued return, carry is set and the count of returned values is held in RCX (which is a scratch register otherwise).
While it's probably a more general question than metadata, how would you address older values on the belt after a function call which returns a variable/unknown number of return values?
ivan
"It’s also used for bit-packing with shifts, but of course is limited to 1-bit sizes when relying on Carry."
There are bit test/set/clear/flip ops for both dynamic and immediate bit numbers. You no longer need to use shifts.
"Marshaling and [de]serialization is one aspect that’s always a thorn in the side of execution speed, so being able to directly shift various N-bit-wide values in & out of wider numbers with under/overflow detection would be nice. Of course, if the individual steps in such an operation can be automatically executed in parallel, then the need for more powerful instructions goes away."
A full funnel shift is complicated on any machine; we have worked on it, but are not yet satisfied with the general case of parsing a stream of varying-length items. For static fields (such as bit-fields in C) there is a merge operation that is a bitwise "pick" (take a bit from src0 or src1 based on the corresponding bit in a mask src2), which replaces (a&m)|(b& ~m). However, it's not clear whether a compiler will recognize that idiom. We have gone back and forth on whether there should be ops for field extract (otherwise it's two shifts) and field insert (otherwise rotate/mask/or/rotate). The problem is that there can be at most two belt inputs to an operation per slot.
"There are quite a few ABIs that use Carry as an in/out boolean flag, or side-band error return indicator. This is, of course, a result of registers being “expensive”, due to fixed register files. If belt slots are “cheaper”, then those can become normal parameters. However, there still is pressure on spilling that comes with this."
Those ABIs assume one operation at a time execution. When there could be half-a-dozen ops trying to use that same carry flag... :-)
"One interesting scenario that comes to mind is the SBCL compiler for Common Lisp. When calling a function in Lisp, it is unknown how many values it will return. In SBCL x86-64, a single valued return returns with carry clear. For multi-valued return, carry is set and the count of returned values is held in RCX (which is a scratch register otherwise)."
For the Mill I'd guess you would simply return the count as a second result in all cases. Is the value returned (when more than one) a list, so the flag is really saying whether thre resulting list is a primitive of a collection?
"While it’s probably a more general question than metadata, how would you address older values on the belt after a function call which returns a variable/unknown number of return values?"
In general you cannot do so; you would have to run everything you wanted to save off to scratchpad, and even that wouldn't work because you would have no way to know how many results you got if you wanted to save them.
I confess my ignorance of Lisp implementations; how does the received of multiple results discover that happened, and what would it want to do with them when it does? I suspect that the Mill would handle this at a higher level, but I need to understand the use-case rather than the conventional implementation for that case.
David
(Sorry for my late reply, I wasn't aware that the email notification option didn't include activity on the whole thread.)
I also did think of one other software situation where the carry flag is important: Emulators, especially when the target architecture is of the same register width as the host.
Anyway, in the Lisp situation, a common case is where a function returns multiple values, but the user of that function only bothers with the first (and idiomatically the most important) return value. Since we can freely pass lambdas around, if we as the caller are only interested in 1 return value, it's unknown whether the function we're eventually calling will return more or not.
A common example is the hash table accessor 'gethash'. It returns 2 values; the value looked up (or NIL if not found), and a boolean explicitly stating whether or not it was found. The 2nd return value is needed for disambiguation if NILs are stored as values in the hash table. This second return value is always generated, but ignored in the majority of cases.
The default calling syntax only passes through the first return value, but you can specifically capture the multiple return values if you're interested in them.
The returned values can each be anything, primitive (immediate register value) or compound/boxed (tagged pointer). Returning multiple values is not the same as returning a list, which is returning one value.
---
In the x86-64 compiler implementation, there are 4 registers unsaved between calls. If there is only 1 return value, the first of these registers is used, with carry clear. If there are more return values, carry is set, the first 3 registers are used for return values, with the count in the 4th. If there are more than 3 return values, they are spilled to the stack.
The nice thing is that if the caller only cares about 1 value, they just read that first output register, ignoring carry & the others. The calling convention keeps everything tidy so stack spill storage is not lost or trampled.
If the caller wants 2 or 3 return values, they're immediately available as registers. The count & carry check can be elided when running with "safety level 0" optimizations, or if the type propagation can guarantee the number of expected return values. It's uncommon, but safe & supported to have more than 3; it just has to go out to memory.
---
Regarding the Mill, I agree that it looks like the calling convention there would likely be two return values per call (1st value and count), with the >1 return values stored externally.
Looking back at the belt video again, right, it doesn't look like the scratchpad can be used to pass data across function boundaries. Since Lisp multiple value returns are effectively extra side-band data to use optionally, it would be a bit unfortunate to have it always manage system memory for writing this data that is often ignored.
However, I'm sure many things could be mitigated, like passing the number of expected return values to a function, or having different function entry points for single- or N- valued return. I've only dived deep into the optimized assembly output for 1 architecture (x86-64), so my view on what goes on inside might not encompass all the tricks used on other platforms.
Just like the Mill is a complete rethink of what goes on in a CPU, it is a natural conclusion that optimizing compilers targeting the family would require a complete rethink of strategies to best take advantage of it. I'm sure your C compilers reflect this already, and the optimization opportunities are especially wide open to be solved for languages that do complex things well beyond "portable assembly code".
ivan
Does the calling code always know how many results it is expecting, or does it look at one result and then figure out whether it wants any more?
The Mill call operation explicitly states how many belt results it is expecting; the wrong number returned is a fault. The need for this is a consequence of the belt actually being a renaming mechanism rather than a shift register; the hardware needs to know how many drops will happen (i.e. how much to advance the names) before the actual drops happen if it is to be able to advance the belt without a delay cycle.
So if the caller always knows, and the callee always does what is expected, then there's no problem if different calls to the same function expect (and receive) different counts. But if it is genuinely dynamic then I see no alternative in the present design to always returning a result count as a second result (which is no problem) and returning the excess in some agreed location.
If varying returns are frequently used then this implementation would be unfortunate and we should consider extending the call protocol to deal with it. If however they are only as common as (for example) VARARGS calls in C then the overhead of returning the (mostly unused) count is minor. In what percentage of all calls does the caller care but not know how many results it is expecting?
David
The calling code can either bind a specific number of return values (implicitly 1 unless you manually bind more), or capture all returned values into a list with no expectation of count. The latter is usually used in interactive & debugging tools more than anything else, so the former is the common case.
However, it is never an error to have a mismatch when expecting a specific number. If there are more than expected, then the rest are ignored; if there are fewer than expected, the remaining bindings are set to NIL. Neither case causes an error. So technically, the caller always "cares" but doesn't know, and freely accepts whatever it gets at runtime.
It's hard to say how often a mismatch occurs in calling user-written & library functions, but it's relatively common in calling the standard functions: Hashtable accesses, mathematical floor/ceiling/truncate/round, file reading functions, parsing strings, and many others return multiple values which are commonly used only for their first return value. I don't have the infrastructure to directly measure percentages.
I do think it counts as "genuinely dynamic", as you describe above, at least in the compiler assumptions for current platforms. What is the hit on an 8-slot belt for having 25% of the belt taken by every return, instead of 12.5%? Do values swapped to the scratchpad or spiller for short term use effectively cost little to nothing?
The Mill has some "global" registers like the thread-local pointer. Are there some that are available for user code as well? They'd be handy here to extend the calling convention, or just in general to extend the ABI that dynamic & GC'd programming languages can build their infrastructure from (slab allocation pointers, multiple stacks, current closure, dynamic environments, local/scoped constant tables, etc). On SBCL PowerPC, a whopping ~14 registers are globally reserved for such infrastructure (though some just hold constant addresses for speed & compactness). Smaller register architectures have to trade off which of these very commonly used pointers will be offloaded to RAM.
ivan
Thanks for the explanation. As currently defined, the Mill doesn't handle this use-case well. Needs thought.
Your suggestion of a few user-accessible registers does not work well, because then there is the problem of what happens if more than one operation writes them, and write->read latencies, and so on; all the hazards and everything else about general registers that the belt avoids. The last thing we want to do is to re-introduce renaming into a Mill just to support Lisp :-)
The existing specRegs don't have this problem, because they are all read-only w/r/t the application. Yes, they change, but only as a side effect of operations that the hardware knows about, like call and return.
Values swapped to the spiller effectively cost nothing unless you swap the spiller's bandwidth; that's easy to do in a test case (a function with a very large number of arguments that recurses as soon as called) but is unseen in practice. Scratchpad is not free - the spill and fill operations cost slots and entropy, and the size of scratchpad has a hardware maximum.
We have considered having a set of globals, or something like a global scratchpad. The problem is the save/restore cost in turf switch, which can be very frequent on a Mill. We decided to defer the idea until we had hard simulation numbers showing what the cost of not having them was.