Metadata
Talk by Ivan Godard – 2013-12-11 at the SFBay Association of C/C++ Users
Slides: PowerPoint (.pptx)

Metadata in the Mill CPU
Smarter data for performance and power
The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope. It achieves DSP-like power/performance on general-purpose codes, without reprogramming. The Mill is a wide-issue, statically scheduled design with an exposed pipeline. High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, sustained. The pipeline is very short, with a mispredict penalty of only four cycles.
To support such sustained performance, the Mill conveys some of the semantics of execution in the form of operand metadata. For example, size metadata bits attached to each operand eliminate the need for separate opcodes for each operand width. Another example is the NaR bit, for “not a result”, which among other uses enables improved exception handling and allows the Mill to encode the novel “None” singleton operand, which in turn enables smaller and faster generated code. Metadata propagates through execution, following rules specified by the architecture.
Use of metadata provides a number of advantages to the architecture:
- Metadata reduces the number of distinct opcodes by a factor of seven.
- Metadata enables speculative execution without fix-up code.
- Metadata eliminates flag-control overheads in floating point.
- Metadata permits vectorizing of while-loops.
The talk describes these and other technical aspects of metadata and speculation in the Mill design.
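(As a rough mental model of the metadata described above, here is a purely illustrative C sketch; the field names and layout are invented here and are not the Mill's actual, member-specific encoding. The point is only that width and NaR/None state ride along with the operand, speculable operations propagate that state instead of trapping, and only non-speculable operations such as store finally act on it.)

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative model only; field names and layout are invented. */
    typedef enum { W8, W16, W32, W64 } width_t;

    typedef struct {
        uint64_t bits;   /* payload                                          */
        width_t  width;  /* width metadata: one add opcode serves all widths */
        bool     nar;    /* "Not a Result": a deferred fault                 */
        bool     none;   /* the None singleton: "no value", not an error     */
    } operand;

    /* Speculable op: metadata propagates, nothing traps here.
     * (Precedence between None and NaR is glossed over in this sketch.) */
    static operand add(operand a, operand b)
    {
        operand r = { 0, a.width, a.nar || b.nar, a.none || b.none };
        if (!r.nar && !r.none)
            r.bits = a.bits + b.bits;  /* width would select the carry-chain cut */
        return r;
    }

    /* Non-speculable op: only here does a NaR turn into a visible fault,
     * and a None quietly writes nothing. */
    static void store(operand v, uint64_t *mem)
    {
        if (v.nar)        { /* take the fault; payload identifies the origin */ }
        else if (!v.none) *mem = v.bits;
    }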
- Anonymous, January 1, 2014 at 2:36 pm
Very interesting! I have been playing with a processor design where ALU status flags are per-register (i.e. registers are 33 bits, including the carry bit, plus an overflow flag), but I forgot to think about how it could be applied to floating point and vectors!
Intel has their KNI instruction set with vector mask registers, but the idea of just integrating them with the registers themselves seems to work very well. Especially as pick itself is actually a 4-way choice. Each selection boolean can be 0, 1, None, or NaR.
I presume you have 6 status bits per 32 bits, which either gives you one NaR bit per byte (with “None” encoded as a type of NaR), or one NaR bit and 5 IEEE flags per 32-bit float.
(Um… or do you have 6 bits per half-precision?)
The question someone asked (youtube@35:20) about how you manage vector pick in the bypass system is definitely something I’d like a more detailed explanation of. For scalars, it’s obviously simple renaming. For vectors, do you rename each byte lane separately? Even if you did, the sources would eventually fill up the bypass registers and break your “there are always enough output latches” assumption, and then what? Do you bypass in 0 cycles and realize in 1 cycle in parallel?
The other thing I wonder about is what happens if you feed incompatible data to an ALU, like a vector of bytes on one side and a vector of 32-bit words on the other? Fault, interpret as per operand 1, or what?
What happens if you widen a maximum-width scalar? Narrow a minimum-width?
Another question is whether scalar and vector widen share the same opcode? The talk stresses (@20:25) “there’s only one narrow op, and only one widen op” but having the number of results dropped on the belt (1 for scalar widen, 2 for vector) depend on the input metadata seems perverse.
Can you do vector/scalar ops directly, or is there a required “broadcast” instruction? Pick seems to support vector/scalar operation; do others?
The 2-output smear seems like an unnecessary extra write port. I realize you don’t have regular register file write ports, but it’s still an extra bypass path.
Did you consider either of these two options:
1. doing away with smearx, and instead having a pickx operation which shifts the control word. The most significant part of the control word is not used by pickx and can be branched on to control the surrounding loop.
2. Having a smearx which rotates rather than shifts the vector, and a pick0 which ignores the lsbit of its control vector input. A branch can look at the lsbit of the control vector, which is ignored by pick0. (This is a little uglier, but might make 0-cycle pick easier to implement.)
I also note that the strcpy() is using unaligned loads. Have you considered having an “aligned partial load” that fills the unwanted leading bytes with None?
(BTW, bragging about how the Mill’s 5-cycle mispredict penalty is better than other chips’ 20-30 is specious. Those chips execute 2-3 instructions per cycle, so the penalty is ~60 instructions. The Mill’s slower clock means that the time delay is not much less, and it may easily be more instructions.)
Note: Easier to reply if you put the questions in separate postings 🙂
Metadata bit count: implementation dependent. While stealing NaR bits (one per byte) for other use in wider data (such as the FP flags) would save bits, it would complicate the hardware logic; generally bits are cheap.
Four-case pick: None or NaR selectors simply pass through, using the data width rather than the selector width. Implementation defined whether the None/NaR payload passes through or a new payload is created – the Belt crossbar is clock-critical, and pick must be very fast.
Vector-wise pick: passing the data through directly to the consumer is straightforward; just muxing. Creating the additional operand on the belt is more complicated, but Not Filed Yet.
Mixed-width arguments: each operation has a set of width signatures specified for what it will accept (see the next talk, Specification). If it gets something it doesn’t want then it used to produce a NaR, but we changed it a year ago so it faults immediately; wrong width indicates a compiler bug, not a data problem like overflow.
Excess widen/narrow: same as width-error above.
“only one opcode”: “only one opcode per signature” is more correct, but that might take us too far afield in a talk. For widening there are two ops, one for all scalar widths and one for all vector widths. For narrowing there is only one op but with two signatures, one with one argument (scalar) and one with two (vector). The same is true for other widening operations, such as add with the “widen” overflow attribute.
“broadcast” operation: there isn’t one. There is a “splat” operation that replicates a scalar into a vector, but that’s not the same. Just do “add” and it’s happy with either vector or scalar data, whichever it gets. There’s only one adder; the width just sets breaks in the carry tree.
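(A toy illustration of that carry-tree remark, in plain C rather than hardware; this is an editorial sketch, not Mill circuitry. One adder serves both scalar and vector adds simply because carries are blocked at the element boundaries chosen by the width metadata.)

    #include <stdint.h>

    /* One 32-bit adder doing either a scalar 32-bit add or a 4 x 8-bit
     * vector add, depending only on where the carry chain may cross. */
    static uint32_t add_with_breaks(uint32_t a, uint32_t b, int elem_bytes)
    {
        uint32_t result = 0;
        uint32_t carry  = 0;
        for (int byte = 0; byte < 4; byte++) {
            if (byte % elem_bytes == 0)
                carry = 0;                     /* break the carry chain here */
            uint32_t sum = ((a >> (8 * byte)) & 0xFF)
                         + ((b >> (8 * byte)) & 0xFF)
                         + carry;
            result |= (sum & 0xFF) << (8 * byte);
            carry   = sum >> 8;                /* carry into the next byte lane */
        }
        return result;
    }

    /* add_with_breaks(a, b, 4) behaves as one 32-bit add;
     * add_with_breaks(a, b, 1) behaves as four independent byte adds. */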
“smearx”: this does have two results, which impacts the operation latency. The second port exists anyway; the Mill has a lot of two-result operations (such as vector widen). The talk doesn’t address latency because it couldn’t cover pipelining (May? maybe), but a loop written as shown, without pipelining and phasing, would need a no-op after the smearx. With phasing and pipelining the whole loop is only one cycle anyway, but that’s two more talks worth of explanation 🙁
“pickx”: not advantageous. It would add another mux to the belt crossbar, slowing the clock rate. In addition, while not used in the example, the bool vector is useful in other loops for more than pick.
rotating smear: Both the smear and the pick0 should be easy, and I don’t think that the pick0 would have the clock cost that pickx does. However, the loop control bool is at the wrong end of the bool vector from where smeari puts it, so an extract operation would be needed to get it out to a scalar before it can be branched on. That costs the same second cycle and belt position that the present definition of smearx uses, so there is no gain.
aligned loads: we have looked at quite a few possibilities in this area, with a goal of getting good performance on funnel-shifting a data stream. It’s quite hard to do that without branches, and we are not entirely happy with the present approach (Not Filed Yet)
clock rate: While for business and risk-reduction reasons the initial Mills will have a low clock, there is nothing in the architecture that precludes getting the same clock rate as any other chip.
mispredict penalty: While you are right that a high-end Mill mispredict stall can lead to loss of as many operation issues as a superscalar with a longer recovery would have, on the lower-end Mill family members (with peak issue widths no greater than conventional machines) the Mill advantage is real. We don’t claim that the Mill is always better at everything; we claim that it is always no worse and very frequently much better 🙂
An aside: I’m impressed with how far you have taken the overview given in the talk. The Mill boggles quite a few people 🙂 I hope you will continue to hang around the forum here.
- Anonymous, January 1, 2014 at 10:56 pm
I’ve been following the talks (from a long way away, unfortunately) quite carefully, waiting for the really interesting stuff. Finally things are starting to come together, although there are still quite a few holes. In particular the spiller and how it captures and replays in-flight operations at the time of a call.
> only one opcode per signature
That’s what I thought, I was just making sure.
> “broadcast” operation: there isn’t one. There is a “splat” operation that replicates a scalar into a vector, but that’s not the same. Just do “add” and it’s happy with either vector or scalar data, whichever it gets. There’s only one adder; the width just sets breaks in the carry tree.
No, “splat” sounds like exactly what I meant “broadcast” to be. A functional unit could also do this implicitly if one operand was a scalar of the appropriate size, but it’s a lot of wiring and presumably not worth it.
> However, the loop control bool is at the wrong end of the bool vector from where smeari puts it,
Um, huh? So you’re big-endian? I was assuming little-endian, in which case the unused “first” byte result would also be the overall lsbit (you did say that boolean was the lsbit of the field, right?), which would be in the same place as a scalar, and thus presumably in the right place to branch on directly, with no extract.
Implicit splat: We used to have implicit splat; makes the compiler easier. But when the hardware crew started on the implementation it turned out to be a big hit on the clock rate, paid by all operations whether used or not. So it was taken back out.
Endianness: Internally little-endian, your choice for memory.
Bool vector from rotating smear: The bool vector is not a bit vector, it’s a vector of ordinary operands which happen to be either zero or one; the width of each bool element is whatever the width metadata says it is, just like for ints. As a special case, a vector argument to a conditional branch tests the element that corresponds to the highest memory address if the vector is loaded from memory (or computed from something that was loaded). This saves an extract operation in the common case of memory-upwards loops and smeari. For memory-downwards loops (less common) the explicit extract would be necessary to get a testable bool. Instead, smearil (left inclusive smear) produces an already-extracted exit condition bool, the way that smearxr (smear right exclusive) does, as shown in the video.
Smearx requires either a second, scalar, result (as shown), or a rotating smear followed by an extract (at no gain over the existing), or branch variants that test each end of the bool vector (doable, but branches are already pretty cluttered). Sort of by definition smear is only useful in loops, and in a pipelined loop latency is irrelevant, so the extra cycle doesn’t bother us.
Something interesting that I don’t think you’ve addressed yet is how to deal with variable-latency instructions (aside from loads). For example, floating point ops can be orders of magnitude slower than usual when the operands are subnormals. How does the Mill deal with these sorts of corner cases, besides stalling? Does it use the same method as loads, or something else?
Nancy Reagan was right: “Just say no!”.
There are no variable-latency operations except load. Even on a conventional OoO machine, early-out doesn’t pay – it mucks up the bypass and scheduler.
Ivan
- Anonymous, January 2, 2014 at 10:43 pm
Thebellster: The Mill really requires fixed latency. The load technique could be used, but would be a huge PITA.
Although many FPUs have slow special-case handling of denormals, it is possible to handle them at full speed. The Am29027 FPU did it, as do some recent IBM POWER processors. And the nVidia Fermi GPU, for reasons very similar to the Mill: variable latency would add more complexity than it saves. It just takes a bit more implementation effort.
None/NaR are neat ideas. The demonstration of them working together with the vector operations to make high-performance strcpy simple is very impressive. Using NaRs to delay faulting until it becomes an externally observable effect is particularly clever, and None closely mirrors a compiler’s reasoning about undefined behavior.
Allowing arithmetic operations to have truncate/fault/saturate/widen semantics in combination with type-tagged data should make vector assembly more readable, assuming it includes type comments. Not having penalties for moving vectors between ALU and FPU units will be useful.
1. How are FPU rounding modes handled? A single control register? Encoded in each instruction? (I’m guessing control register)
2. Is casting a belt item a 0-cycle flow-side operation? If I start with a vector of 4-byte elements {0, 1, NaR, 2} and try to cast it to a vector of 8-byte elements, do I get {0x100000000, NaR}, or a fault?
NaRs might cause strange behavior if there’s a long-running loop that generates a NaR in an early iteration and carries it forward but doesn’t “retire” it until after exiting the loop. Compilers should detect this case and insert an explicit NaR check when faults might retire a long time after occurring.
I hope the belt timings work out when the hardware starts being finalized. There’s a lot of flow-side operations going on that border on magical (vector pick??).
cypherpunks: Intel CPUs since Sandy Bridge have also had constant-latency FPU operations.
Ryan:
FP rounding modes: modes are in the operation, indicated by the mnemonic (one of the choices does pick up from the PCSW). This is useful when you have FP ops with different rounding in the same instruction, for example in interval arithmetic.
vector casts: there is no such operation as your example; all widen and narrow ops are cardinality-preserving (N->N). We could narrow your four words to four shorts, but what value would you expect in the other four shorts of the result vector? However, there is a vector narrow that narrows two vectors to one with half-size elements. Thus 2X4Xword->8Xshort (i.e. 8->8). You can widen or narrow Nones and NaRs like any other data. A narrow that overflows gives you the same truncate/except/saturate choice as any other overflow (the fourth choice, double width result, doesn’t make much sense when narrowing and doesn’t exist).
Belt timings: all picks, including vector pick, are exu-side encoded. After decoding there’s really no such thing as a “side” any more; execution itself is a collection of FU pipes with no particular “side”.
Is there a carry flag, perhaps like the floating point metadata flags?
I didn’t see one mentioned in the addu variations, and none of them directly showed “wrapped value with indication that a wrap happened” (bit-banging an adduw result notwithstanding).
Thanks!
There are no condition codes as such; it’s kind of hard to do when you can have half-a-dozen CC-setting operations in each instruction, although some architectures try, using multiple CC registers.
To have the effect of a carry bit, if you know that no argument was a Nar and you do an addux and get a NaR out, then you must have gotten an overflow, but that’s pretty kludgy. Or you could do an adduw and test the first bit in the extended data, but that’s pretty kludgy too.
However, there is an alternative: the current operation set includes addc and subc, with two results: a wrapped value and a carry/borrow. It is not clear whether these ops will stay around though; the only use is in an arbitrary-precision library, and for quad arithmetic on members that don’t support native quad; neither of these seem that common, and it’s not clear that the ops (which would have two-cycle latency) are worth it.
We expect a general triage of the operations set later in the process of an FPGA implementation.
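(For illustration only: the arbitrary-precision use-case mentioned above, written as a C sketch in which each limb add produces a wrapped sum plus a carry-out, the two results an addc-style op would supply. The helper names are invented, not Mill mnemonics.)

    #include <stdint.h>
    #include <stddef.h>

    /* One limb: wrapped sum plus carry-out (addc's two results). */
    static uint64_t addc64(uint64_t a, uint64_t b, uint64_t cin, uint64_t *cout)
    {
        uint64_t sum = a + b + cin;                /* wrapped result             */
        *cout = (sum < a) || (sum == a && cin);    /* carry-out (second result)  */
        return sum;
    }

    /* r = x + y over n 64-bit limbs, least-significant limb first. */
    static void bignum_add(uint64_t *r, const uint64_t *x, const uint64_t *y, size_t n)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++)
            r[i] = addc64(x[i], y[i], carry, &carry);
    }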
- Anonymous, January 10, 2014 at 12:10 pm
Well, you have floating-point condition codes; is there a reason those bits can’t be output by the integer ALU as well? At least for “large-enough” word sizes, so you don’t need a carry bit per byte. Doing it using 2-input adds is a bit awkward (think about -2+1+carry and -1+1+carry), but an “add operand 2’s carries to operand 1’s data, and OR (or add) the resultant carries with operand 1’s carries” instruction would do the trick, I think.
I’d output carry and signed overflow bits, and handle overflow saturation in a separate instruction. You could also do a separate “divide by 2” instruction to compute averages without overflow if that was easier than another add opcode.
The FP error flags are still live until realized by a non-speculative operation (or fall off the belt), so it is not possible to reuse them for integer. Consider:
float f = 10.0; int i = int(f/3.0) + 0x7fffffff;
The float divide will set the inexact flag, while the add will raise integer overflow. Neither is real until out of speculative context, so both must be preserved. There are many (including me) who feel that the inexact flag is bogus, should never have been introduced, and should be deprecated now, but as a commercial product we have to live with the standards we have, and in any case code could have set the other flags instead.
As for “large enough”, a major focus of the Mill design is to avoid corner cases, especially those that get fobbed off on the compiler. We have tried hard to make everything in the operation set uniform. You have identified one case where we failed: the FP flags only exist for FP-sized values, so code like:
float f = ..; uint8_t b = f < 0.0;
is a problem, because the float flags propagate through the eql operation and subsequent narrowing to a one-byte value. This is not a problem in scalar, because there is a flag-set in the operand even though it is small. But it is a problem with vector narrowing, because we couldn’t see paying for five bits of flags in every byte of a vector. Hence we gritted our teeth and defined vector narrow to be a realizing operation that cannot be speculated. 🙁
It would certainly be possible to have a carry metabit in each byte, and we considered it. After all, there is already a NaR bit in each byte. However, we could not find enough usage for such a design to justify the cost. Perhaps we overlooked some important use for carry – did you have a use-case in mind?
The majority of my assembly language background is in 8-bit CPUs (6502) where Carry is widely used, so at first it’s jarring to see it missing. However, much of it is based in mathematical optimization and comparison tricks because of the narrow register width. With a reasonable register width and richer mathematical instructions, that becomes much less of an issue.
It’s also used for bit-packing with shifts, but of course is limited to 1-bit sizes when relying on Carry.
Marshaling and [de]serialization is one aspect that’s always a thorn in the side of execution speed, so being able to directly shift various N-bit-wide values in & out of wider numbers with under/overflow detection would be nice. Of course, if the individual steps in such an operation can be automatically executed in parallel, then the need for more powerful instructions goes away.
There are quite a few ABIs that use Carry as an in/out boolean flag, or side-band error return indicator. This is, of course, a result of registers being “expensive”, due to fixed register files. If belt slots are “cheaper”, then those can become normal parameters. However, there still is pressure on spilling that comes with this.
One interesting scenario that comes to mind is the SBCL compiler for Common Lisp. When calling a function in Lisp, it is unknown how many values it will return. In SBCL x86-64, a single valued return returns with carry clear. For multi-valued return, carry is set and the count of returned values is held in RCX (which is a scratch register otherwise).
While it’s probably a more general question than metadata, how would you address older values on the belt after a function call which returns a variable/unknown number of return values?
“It’s also used for bit-packing with shifts, but of course is limited to 1-bit sizes when relying on Carry.”
There are bit test/set/clear/flip ops for both dynamic and immediate bit numbers. You no longer need to use shifts.
“Marshaling and [de]serialization is one aspect that’s always a thorn in the side of execution speed, so being able to directly shift various N-bit-wide values in & out of wider numbers with under/overflow detection would be nice. Of course, if the individual steps in such an operation can be automatically executed in parallel, then the need for more powerful instructions goes away.”
A full funnel shift is complicated on any machine; we have worked on it, but are not yet satisfied with the general case of parsing a stream of varying-length items. For static fields (such as bit-fields in C) there is a merge operation that is a bitwise “pick” (take a bit from src0 or src1 based on the corresponding bit in a mask src2), which replaces (a&m)|(b& ~m). However, it’s not clear whether a compiler will recognize that idiom. We have gone back and forth on whether there should be ops for field extract (otherwise it’s two shifts) and field insert (otherwise rotate/mask/or/rotate). The problem is that there can be at most two belt inputs to an operation per slot.
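(For reference, the bitwise-pick idiom being described, as a minimal C sketch; merge_bits and insert_field are invented names for illustration, and the field position is arbitrary. This is exactly the pattern a compiler would have to recognize to emit the hardware merge.)

    #include <stdint.h>

    /* Bit from a where the mask bit is 1, from b where it is 0:
     * the (a & m) | (b & ~m) idiom the merge operation replaces. */
    static inline uint32_t merge_bits(uint32_t a, uint32_t b, uint32_t m)
    {
        return (a & m) | (b & ~m);
    }

    /* Typical use: insert a 5-bit field at bit 11 of word w. */
    static inline uint32_t insert_field(uint32_t w, uint32_t field)
    {
        const uint32_t mask = 0x1F << 11;
        return merge_bits((field << 11) & mask, w, mask);
    }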
“There are quite a few ABIs that use Carry as an in/out boolean flag, or side-band error return indicator. This is, of course, a result of registers being “expensive”, due to fixed register files. If belt slots are “cheaper”, then those can become normal parameters. However, there still is pressure on spilling that comes with this.”
Those ABIs assume one operation at a time execution. When there could be half-a-dozen ops trying to use that same carry flag… 🙂
“One interesting scenario that comes to mind is the SBCL compiler for Common Lisp. When calling a function in Lisp, it is unknown how many values it will return. In SBCL x86-64, a single valued return returns with carry clear. For multi-valued return, carry is set and the count of returned values is held in RCX (which is a scratch register otherwise).”
For the Mill I’d guess you would simply return the count as a second result in all cases. Is the value returned (when more than one) a list, so the flag is really saying whether the resulting list is a primitive or a collection?
“While it’s probably a more general question than metadata, how would you address older values on the belt after a function call which returns a variable/unknown number of return values?”
In general you cannot do so; you would have to run everything you wanted to save off to scratchpad, and even that wouldn’t work because you would have no way to know how many results you got if you wanted to save them.
I confess my ignorance of Lisp implementations; how does the receiver of multiple results discover that this happened, and what would it want to do with them when it does? I suspect that the Mill would handle this at a higher level, but I need to understand the use-case rather than the conventional implementation for that case.
(Sorry for my late reply, I wasn’t aware that the email notification option didn’t include activity on the whole thread.)
I also did think of one other software situation where the carry flag is important: Emulators, especially when the target architecture is of the same register width as the host.
Anyway, in the Lisp situation, a common case is where a function returns multiple values, but the user of that function only bothers with the first (and idiomatically the most important) return value. Since we can freely pass lambdas around, if we as the caller are only interested in 1 return value, it’s unknown whether the function we’re eventually calling will return more or not.
A common example is the hash table accessor ‘gethash’. It returns 2 values; the value looked up (or NIL if not found), and a boolean explicitly stating whether or not it was found. The 2nd return value is needed for disambiguation if NILs are stored as values in the hash table. This second return value is always generated, but ignored in the majority of cases.
The default calling syntax only passes through the first return value, but you can specifically capture the multiple return values if you’re interested in them.
The returned values can each be anything, primitive (immediate register value) or compound/boxed (tagged pointer). Returning multiple values is not the same as returning a list, which is returning one value.
—
In the x86-64 compiler implementation, there are 4 registers unsaved between calls. If there is only 1 return value, the first of these registers is used, with carry clear. If there are more return values, carry is set, the first 3 registers are used for return values, with the count in the 4th. If there are more than 3 return values, they are spilled to the stack.
The nice thing is that if the caller only cares about 1 value, they just read that first output register, ignoring carry & the others. The calling convention keeps everything tidy so stack spill storage is not lost or trampled.
If the caller wants 2 or 3 return values, they’re immediately available as registers. The count & carry check can be elided when running with “safety level 0” optimizations, or if the type propagation can guarantee the number of expected return values. It’s uncommon, but safe & supported to have more than 3; it just has to go out to memory.
—
Regarding the Mill, I agree that it looks like the calling convention there would likely be two return values per call (1st value and count), with the >1 return values stored externally.
Looking back at the belt video again, right, it doesn’t look like the scratchpad can be used to pass data across function boundaries. Since Lisp multiple value returns are effectively extra side-band data to use optionally, it would be a bit unfortunate to have it always manage system memory for writing this data that is often ignored.
However, I’m sure many things could be mitigated, like passing the number of expected return values to a function, or having different function entry points for single- or N- valued return. I’ve only dived deep into the optimized assembly output for 1 architecture (x86-64), so my view on what goes on inside might not encompass all the tricks used on other platforms.
Just like the Mill is a complete rethink of what goes on in a CPU, it is a natural conclusion that optimizing compilers targeting the family would require a complete rethink of strategies to best take advantage of it. I’m sure your C compilers reflect this already, and the optimization opportunities are especially wide open to be solved for languages that do complex things well beyond “portable assembly code”.
Does the calling code always know how many results it is expecting, or does it look at one result and then figure out whether it wants any more?
The Mill call operation explicitly states how many belt results it is expecting; the wrong number returned is a fault. The need for this is a consequence of the belt actually being a renaming mechanism rather than a shift register; the hardware needs to know how many drops will happen (i.e. how much to advance the names) before the actual drops happen if it is to be able to advance the belt without a delay cycle.
So if the caller always knows, and the callee always does what is expected, then there’s no problem if different calls to the same function expect (and receive) different counts. But if it is genuinely dynamic then I see no alternative in the present design to always returning a result count as a second result (which is no problem) and returning the excess in some agreed location.
If varying returns are frequently used then this implementation would be unfortunate and we should consider extending the call protocol to deal with it. If however they are only as common as (for example) VARARGS calls in C then the overhead of returning the (mostly unused) count is minor. In what percentage of all calls does the caller care but not know how many results it is expecting?
The calling code can either bind a specific number of return values (implicitly 1 unless you manually bind more), or capture all returned values into a list with no expectation of count. The latter is usually used in interactive & debugging tools more than anything else, so the former is the common case.
However, it is never an error to have a mismatch when expecting a specific number. If there are more than expected, then the rest are ignored; if there are fewer than expected, the remaining bindings are set to NIL. Neither case causes an error. So technically, the caller always “cares” but doesn’t know, and freely accepts whatever it gets at runtime.
It’s hard to say how often a mismatch occurs in calling user-written & library functions, but it’s relatively common in calling the standard functions: Hashtable accesses, mathematical floor/ceiling/truncate/round, file reading functions, parsing strings, and many others return multiple values which are commonly used only for their first return value. I don’t have the infrastructure to directly measure percentages.
I do think it counts as “genuinely dynamic”, as you describe above, at least in the compiler assumptions for current platforms. What is the hit on an 8-slot belt for having 25% of the belt taken by every return, instead of 12.5%? Do values swapped to the scratchpad or spiller for short term use effectively cost little to nothing?
The Mill has some “global” registers like the thread-local pointer. Are there some that are available for user code as well? They’d be handy here to extend the calling convention, or just in general to extend the ABI that dynamic & GC’d programming languages can build their infrastructure from (slab allocation pointers, multiple stacks, current closure, dynamic environments, local/scoped constant tables, etc). On SBCL PowerPC, a whopping ~14 registers are globally reserved for such infrastructure (though some just hold constant addresses for speed & compactness). Smaller register architectures have to trade off which of these very commonly used pointers will be offloaded to RAM.
Thanks for the explanation. As currently defined, the Mill doesn’t handle this use-case well. Needs thought.
Your suggestion of a few user-accessible registers does not work well, because then there is the problem of what happens if more than one operation writes them, and write->read latencies, and so on; all the hazards and everything else about general registers that the belt avoids. The last thing we want to do is to re-introduce renaming into a Mill just to support Lisp 🙂
The existing specRegs don’t have this problem, because they are all read-only w/r/t the application. Yes, they change, but only as a side effect of operations that the hardware knows about, like call and return.
Values swapped to the spiller effectively cost nothing unless you swamp the spiller’s bandwidth; that’s easy to do in a test case (a function with a very large number of arguments that recurses as soon as called) but is unseen in practice. Scratchpad is not free – the spill and fill operations cost slots and entropy, and the size of scratchpad has a hardware maximum.
We have considered having a set of globals, or something like a global scratchpad. The problem is the save/restore cost in turf switch, which can be very frequent on a Mill. We decided to defer the idea until we had hard simulation numbers showing what the cost of not having them was.
Yes, I agree with the tradeoffs you’ve chosen. It does have some impedance mismatch with what Lisp would ideally want, but that’s already the case in x86 (even the slab allocation pointer is in thread-local storage, not a register), and it’s still speed-competitive there.
Like I said, it’s an interesting issue to wipe the slate of optimization assumptions clean and look at how the top-level goals for this class of language compilers will be accomplished on a new architecture.
p.s. after some more thought
The variable returns won’t work on the belt, and there are no shared registers or scratchpad. In fact, the only shared thing is memory. So what happens if the results (after the first) are communicated in memory?
If the shared-register approach works on a conventional, then sharing pseudo-registers at static addresses (w/r/t the program load image) should work too, albeit not as clean as something like a Forth double-ended stack. The cache line containing the pseudo-registers (PR hereafter) will always be hot and in the top-level cache. If ~14 (32-bit?) registers is enough, then the 16 32-bit values in a line should also be enough, although a 64-bit Lisp might want two lines. If the reserved addresses were placed at the bottom of the dp space then the address offsets would always encode in one byte for the accessing load/store operations. As the l/s opcode and other info are going to be around 16 bits, it means an access will be 24 or so bits of entropy in the encoding, favorable compared to 32-bit RISC codes and probably comparable to x86 (I am not an x86 expert – how many bytes in a load/store if the offset is one byte?) However, the l/s ops occupy flow slots, retire stations, and d$1 bandwidth. This is not free, compared to returning results on the belt.
The stores, inside the called function, are fire-and-forget. However, there is load latency to pick the values up from the PRs. The code doesn’t want to wait for a D$1 cycle before looking at the call result, although I suppose it could overlap the testing for the presence of the extra results, and probably some use of the prime result.
However, there’s another way: issue the load of the result before making the call, with the retire delay set so the load will retire in the instruction after the call. That load can be hoisted arbitrarily high, but if LISP overhead guarantees that any call will last longer than d$1 then hoisting is not necessary.
Unfortunately, this doesn’t buy anything when there actually is a new result to load. The load will allocate a retire station and go to cache for the value and then wait out the call. At the end of the call the store will be detected by the retire station, which will then re-issue the load request. But as the load is supposed to retire at once, and the second request to the d$1 is not instantaneous, the retire station must stall the machine until it gets the just-stored data, which is the same time that would have been if the load were issued after the call rather than before it 🙁
All in all, with the present Mill definition, it looks like extra results would pass in PRs in memory, and not be available until d$1 latency after the call. This works, but is unattractive.
I’ll keep the issue in mind and will report if illumination strikes.
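(A purely hypothetical C sketch of the pseudo-register idea above, using thread-local storage as comes up just below; none of the names or sizes are an SBCL or Mill ABI. The primary value is modelled as the ordinary C return value, standing in for the belt result.)

    #include <stdint.h>

    #define EXTRA_SLOTS 16

    typedef struct {
        uint64_t count;               /* how many values were returned            */
        uint64_t extra[EXTRA_SLOTS];  /* values 2..N; beyond that, spill to stack */
    } multi_value_block;

    static _Thread_local multi_value_block mv;

    /* Callee side: first value on the belt (here: the C return value),
     * extra values in the pseudo-registers. The body is a placeholder. */
    static uint64_t gethash_like(uint64_t key)
    {
        mv.count = 2;
        mv.extra[0] = 1;              /* second value: "was it found?"       */
        return key * 2654435761u;     /* first value: the lookup result      */
    }

    /* A caller that wants only the first value ignores mv entirely; a caller
     * that wants more reads mv.count and mv.extra after the call, paying the
     * D$1 load latency described above. */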
Knowing the expectant number of return values via hardware would help, as far as what I know now. It would at least eliminate having to keep an extra count parameter live from the function entry all the way through to where it’s used at the very end.
I presume that all return values are in a single return instruction, that you can’t stack values up separately in a loop and then have a single shared return? It sounds like it would have to switch/case between “return val”, “return val,nil”, “return val,nil,nil” etc if it’s all set up by the return instruction itself. I’m not sure if there are other better ways of expressing this, but if it’s easy to toss in a specReg, it would definitely open up a decent option.
> So what happens if the results (after the first) are communicated in memory? If the shared-register approach works on a conventional, then sharing pseudo-registers at static addresses (w/r/t the program load image) should work too
They’d have to be thread-local, or potentially stack-allocated, not statically addressed. But yes, memory would be the default go-to for passing parameters around.
> (I am not an x86 expert – how many bytes in a load/store if the offset is one byte?)
Looking at some disasms on x86-64, all the ones I’m seeing are 5 bytes long. Opinions about the x86 architecture are not likely to change given this information. 😉
> That load can be hoisted arbitrarily high, but if LISP overhead guarantees that any call will last longer than d$1 then hoisting is not necessary.
In existing implementations, the only Lisp overhead in returns is the callee either clearing carry, or setting it & the count of returned values. The caller has no overhead before accessing at least the first 3 returned values.
Now, I haven’t even gotten into passing parameters _into_ a call, which is much more complex. 😉 There are optional parameters, order-independent optional keyword parameters, raising the trailing set of parameters into a list object, referencing the entire parameter list as well as its parts, etc. There is an ‘apply’ function for dynamic buildup or pass-through of parameter lists. Current implementations set a register to list the count of parameters, similar to multi-value returns. A lot of the same issues all come up here, too, and having a specReg for the count of incoming parameters would avert some overhead as well. At least there are no older values on the function’s belt in this case.
Return takes the same belt-argument list that call (and a few other operations like conform and rescue) do. You can’t build it up piecemeal. The arguments don’t have to be adjacent on the belt, and the same belt position can be used for more than one argument. There’s no “nil” argument to return (or call etc.); you have to put a nil (whatever that is in the app) on the belt and then pass it as many times as you want.
Thread-local is possible. It is not in general possible for a callee to address the caller’s stack without action on the part of the caller, which would be more trouble than TLS.
Sounds like a null function would be in and out before a d$1 latency, arguing again for putting the loads after the call rather than in-flight over the call.
Varargs calls in the belt are a problem; by the time you have figured out how many arguments you have they may well have fallen off the belt. In C only the fixed arguments pass in the belt and the variable part is passed in the stack, but C VARARGS is rare enough that it doesn’t matter. Sounds like every call in Lisp is facing the issue, which would matter.
I think the in-memory solution, while it might be better than other CPUs, is fundamentally inapt for the Mill function model. Needs thought.
If the caller knows whether it is interested in more than the first result, Lisps can have a calling convention where they tell the callee when they call, perhaps?
So single interesting return values, presumably the common case, are fast pathed.
And debuggers can patch this as they step through code, or they can just go the route of native debuggers and see what’s happening live even if there is only ever one result.
The Mill call operation knows how many results it expects, but this value is not currently available to the callee. It would be easy enough to stick it in a call-saved specReg with rd() access – would that be useful?
Returning a vector would certainly work, and the Nones are very Mill-ish. But smaller members have only 8-byte vectors. It would also cost some code to pack the vector and unpack it again, which might be worse than going via memory. The optimal might be to return one scalar and the vector; in the common case the vector would be all Nones, available as a popCon from rd() cheaply and ignored by the caller. In the uncommon case you’d pay for the pack/unpack. If exactly two results were common then return two scalars for that case and the caller can test whether it has a scalar or a vector second result.
However, because the code depends on the size of a vector then your Lisp becomes member-dependent, which we really want to avoid. 🙁
Well then that idea is out.
How big are lisp objects, anyways? If they’re too big you might have to deal with memory anyways. E.g. ruby is moving from 5 word objects to 6 or 8 word objects (because they fit more evenly into cache lines). Yes, you read that correctly, that’s 8 words of 8 bytes each or 64 bytes per object, which pushes the total belt size on small mills already, let alone returning half a dozen of them.
Joe: Lisp heap objects are not of fixed size. Pointers to objects contain tag bits, with some of the common types completely contained within the tag. So cons cells (the basic 2-tuple linked list element) are literally 2 bare words in memory with zero overhead. The fact that it exists as a cell is purely held in the tag bits of pointers reaching that memory location, not on the object itself. More complicated objects obviously have some actualized overhead, generally one word. Of course, objects themselves have no inherent mutexes, hashes, etc., that languages like Java do. Those would be user-managed fields.
Ivan: While this has been focused on Lisp, I think it is a reasonable concern in today’s market to ensure that Javascript, Python, and Java can achieve great performance as well, without hackish workarounds inside their JITs. Javascript and Python especially have a lot of Lisp-like features, with lots of similarities on their input parameter passing. Inline slab allocators, multiple stacks, etc, are common on all these types of systems. Having the hardware track dirty old-generation writes for the garbage collector without needing to trap & interrupt can save a ton of time.
Just something to toss in with the “needs thought” pile. Some of these likely do not affect the core functionality, but might be orthogonal add-ons of varying intensity.
Sorry, I’ve been using the wrong term. Instead of “slab allocator”, I’ve been actually meaning “tlab allocator”. Nursery allocation with simple inline pointer bumping, each thread having its own preallocated buffer. Many systems try to keep such a pointer in a global register.
Transcribed from comp.arch
===================================================================
On 1/18/2014 12:19 PM, Stephen Fuld wrote:
> On 1/7/2014 10:37 AM, Stephen Fuld wrote:
>> I watched the talk, and after some time thinking about it, I have a few questions.
>
> Another one.
>
> If this would be better after the talks on software pipelining and/or “phasing”, just say so.
>
> I have been thinking about your loop parallelization mechanism. It seems from what you have presented so far, the degree of parallelization possible is the number of elements in a vector for the particular version of the Mill you are running on. But since you have so many FUs available, is there a way to “link” multiple vectors together in order to gain increased parallelism? E.g. if you could somehow process two vectors’ worth of bytes in parallel (with the associated controls to prevent stores on the second vector), you would double the speed of the strcpy example you presented. I am not sure what I am asking for here, but I could see some possible ways to do it.

If there are enough FUs then the compiler can unroll the vectorized loop sufficiently to soak them up. In the strcpy example, the ops would simply be issued twice, plus you need a bit extra to compute the None mask for the second store and for the branch condition. Roughly:
load1 load2 eql1 eql2 smear1 smear2 pick1 pick2a(done1, smear2_mask, Nones) pick2b(pick2a, load2, Nones) store1 store2 or(smear1_done, smear2_done) branch
The extras are straightforward. The extra “or” is oring the “are we done” results of the two smears so that the loop exits if either thought it was finished.
The pick2a and pick2b build a mask for the second store. Pick2a uses the “are we done” bool from the first smear to produce either the mask from smear2 or all None. That is then used as the control to mask None into the loaded value; if there was a null in the first load then the pick2b will have an all-None control and will yield all None, whereas if there was no null then pick2b has the usual smear mask for control and works like in the talk.
The result is that the unrolled loop is twice as long plus two more ops. It could be fully pipelined for a throughput of 2x vector size per cycle. The pipeline startup time to first store would be the same as for the not-unrolled loop; load->eql->smear->pick->store. Unrolling needs five flow slots (2x(load, store), branch); three exu slots (2xeql, or); and three writer slots (3xpick). Four load/store function units means a pretty high-end Mill.
Some Fortran compilers can do this kind of unrolled vectorization, so I don’t see any reason why a Mill compiler could not find this code. Not high priority work though because of the limited utility in medium and smaller family members.
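(To make the above concrete, here is an editorial scalar-C emulation of one iteration of the vectorized strcpy the talk uses: load a vector of bytes, compare against NUL, exclusive-smear the bools, and use the mask to suppress the stores of the lanes after the NUL, which the real code turns into Nones. The real loop is Mill ops, not C; VEC and the helper names are invented for the sketch, and the over-read past the string end is inherent to the vector approach.)

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define VEC 8   /* stand-in for the member's vector width */

    /* One vector's worth of strcpy: returns true once the NUL has been copied. */
    static bool strcpy_step(uint8_t *dst, const uint8_t *src)
    {
        uint8_t v[VEC];
        bool    nul_seen = false;        /* becomes smearx's scalar "done" result */

        memcpy(v, src, VEC);             /* vector load (alignment ignored here)  */

        for (int i = 0; i < VEC; i++) {
            bool after_nul = nul_seen;           /* exclusive smear of eql(v, 0)  */
            nul_seen = nul_seen || (v[i] == 0);
            if (!after_nul)                      /* lanes after the NUL are None; */
                dst[i] = v[i];                   /* a None store writes nothing   */
        }
        return nul_seen;
    }

    /* Driver: keep issuing vector steps until the NUL was copied. */
    static void strcpy_vec(uint8_t *dst, const uint8_t *src)
    {
        while (!strcpy_step(dst, src)) {
            dst += VEC;
            src += VEC;
        }
    }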
How is the metadata preserved when the belts are spilled to memory? Is there generally an extra byte for each belt entry or is there some sort of coalescing that happens?
Also, is it just me or do the values on the belt with their metadata function as monads? In fact, it looks just like Maybe in Haskell, except instead of
Maybe a = Just a | Nothing
it’s
Belt a = Just a | Nothing | Error
and binding gives Error precedence over Nothing. When I was watching the talk I thought to myself “Oh, it’s just like Haskell”, in the same way the Belt talk made me think “Oh, it’s just like Static Single Assignment form”.

The scratch and spill preserves metadata.
They are dealing with belt items, and not naked bytes, so they just take the extra bits needed to maintain all this item state.
The belt width is model specific, but completely known to the hardware and any software that interacts with it, obviously, so its easy to take care of.
And yes, IMO these parallels with SSA and monads are appropriate 🙂
“Nothing new under the sun”? That what you get with a compiler guy doing architecture 🙂
Will’s reply is correct. The actual memory format is member-specific (software gets an API) and will be the same as is used by extended-scratchpad. We need to have a scratchpad talk 🙁
> Maybe a = Just a | Nothing
> it’s Belt a = Just a | Nothing | Error
> and binding gives Error precedence over Nothing

Small clarification, Ivan to correct me if I’m wrong, but my understanding is that None has precedence over NaR; a NaR and None is None.

Actually NaR and None together are impossible. NaRs carry a “kind” indicating the nature of the initial problem, and a None is encoded as a flavor of NaR. That way the None/NaR propagation logic in the speculable FUs doesn’t have to test, and only the non-speculable ops (like store) need to care. As often happens, at the program level these are different but at the machine level they are blended to optimize the hardware.
Yes, I can see how None is a kind of not-a-result, but what I think we meant by None taking precedence over NaR was None taking precedence over other kinds of NaR.

If two different NaR kinds are operands to an arithmetic operation, what is the output NaR type? For example, if you multiply the two vectors {2, NaR, NaR, None} and {None, None, 2, NaR}, do you get {None, None, NaR, None}?

You’re good 🙂
Half of the kind-field values are devoted to None, so a None can be distinguished from any other kind of NaR by looking at one bit in the payload, and the non-speculable operators need only to look at two bits: the NaR bit in the metadata and the None bit in the payload. There is time to do that in the hardware, even for complicated cases like widen(), where the result format is different and the inbound None/NaRs need to have their payload size changed.
And yes, None takes precedence over NaR.
I’d like to request you guys add an operation (if you haven’t already). I want to explain the operation, and then explain why I think it’d be useful. The operation is smear-sum. It takes a vector of small (8-bit, maybe 16-bit) integers and produces a running sum, and in addition produces a second value which is the final sum. So, if given the vector [ 1, 3, 10, 7, 4, 0, 2, 9 ] it produces the vector [ 1, 4, 14, 21, 25, 25, 27, 36 ] and the value 36. An alternative would be smear-sumx, which produces the sum excluding the current value, so the vector [ 1, 3, 10, 7, 4, 0, 2, 9 ] would return the vector [ 0, 1, 4, 14, 21, 25, 25, 27 ] and the value 36. NaRs and Nones are propagated, but count as 0 for the purposes of the sum, so the vector [ 1, 3, 10, NaR, None, 0, 2, 9 ] returns the vector [ 1, 4, 14, NaR, None, 14, 16, 25 ] and the value 25.

Why this instruction would be useful: a lot of languages now use copying
GC. One of the advantages of copying GC is that the heap is always contiguous, so allocation is bumping a pointer. The allocation routine then looks like this:

extern size_t * heap_top;
extern size_t * heap_limit;

size_t * alloc(size_t num_words) {
    register size_t * rval = heap_top;
    heap_top += num_words;
    if (heap_top > heap_limit) {
        /* This is a rarely taken branch- less than 1% of the time */
        do_a_gc();
        /* do_a_gc() resets heap_top and heap_limit so we can now
         * succeed at the allocation.
         */
        rval = heap_top;
        heap_top += num_words;
    }
    return rval;
}

Similar pointer-bump allocations also show up in languages without
garbage collection; for example, when adding items to a vector. Where smear-sum comes into play is when you want to vectorize a loop that contains a pointer-bump allocation. You’d simply do a smear-sum to calculate what the offsets of each iteration’s allocation are from the current top pointer, then do a vector add of the offsets to the current top pointer to get their real addresses, and then off you go. A vector compare and a normal smear catch the (presumptively rare) case where an allocation would trigger a GC (or vector resize).

The problem with this instruction is, of course, the deep dependency chain between the adds. You can’t produce the nth value until you know the (n-1)th value. Even limiting it to small 8-bit adds, I doubt that this instruction would be a 1-clock “fast” instruction. It’ll probably be a 2- or 3-clock instruction. And that should be sufficient.

You may be, likely are, well ahead of me here, and have already added the capability to do this. But just in case you haven’t, I thought I’d put the request in.

Wow. Color me impressed.
The operation is of course possible; it’s a flavor of reduction sum, and will be NlogN/2 like other reductions. We’ve gone back and forth about reductions in the Mill, and currently there are no reduction operations except those for bool vectors, for which we look only at a single bit per element: any() (reduction OR), all() (reduction AND), and the various smear()s.

The problem with sigma and pi (reduction sum and product) is that these are essentially subroutines in the hardware. For a microcoded machine you can make them be actual subroutines, but the Mill strongly avoids microcode. If done as direct hardware, reductions completely plug up a compute pipe for an extended time, which causes schedule hazards that the operation scheduling algorithm has to deal with. Questions arise like what happens if you start a reduction, then take a branch, and the target code tries to use the adder in that pipe 🙁
The general strategy the Mill follows is to expose subroutine-like operations to the software, and let the code treat them as open-coded functions rather than operations. For example, the Mill has no sqrt() operation. Instead, it provides a sqrtApproximation() operation that can be used as the seed of a Newton-Raphson sqrt. The individual stages of the NR are ordinary ops and can be scheduled normally.

We have taken the same approach with reductions, except for the bool cases that can be done directly in logic. A sigma can be done by logN stages of an ordinary vector sum and a shuffle. In machines with narrow issue or pipeline hazards, especially when there are only a few supported widths (looking at you, x86), there are advantages to letting this be a single op that expands to internal microcode that knows where the hazards are. On a hazard-free wide-issue machine with generalized widths, like the Mill, there’s no real advantage to such an operation over the explicit shuffle/add/shuffle/add… sequence.
Our current sum-reduction code sequence does not produce the intermediate vector you are looking for; it does pair-wise sums starting with A[0]+A[1], and at the end the reduction is in A[N-1] (pulled out as a scalar extract() at the end) and the rest of the vector is garbage for your need. You might take a look at trying to code your case to see what shuffles will get the vector you want. A Mill shuffle produces a same-count vector with arbitrary mapping from source to destination element positions, including duplicates. It would be especially attractive if the same shuffle pattern were used for all stages, so the code could be generated for arbitrary (member and element width dependent) N. Please post what you come up with.

Thank you – it’s an interesting use-case and made me think hard about reductions.
I will sheepishly admit that I hadn’t thought of the log(N) reduction idea. If you can implement this with some combination of shuffles and vectorized adds in, say, four clocks, then the need for this instruction goes away. Or even six clocks; and I’d be surprised if it took six clocks. It’s simply not a big enough win anymore to make it worth the instruction encoding hit and increased complexity of the CPU. I withdraw the suggestion. :-}
In general, I think I would advocate for a simpler instruction set, all other things being equal. I’ve yet to see an instruction set get simpler as it ages. There is always time to add complexity later, if it’s needed. Once added, however, complexity never seems to get removed. And it has a bad habit of becoming an albatross hung around the neck of the instruction set, unused but impossible to get rid of (see, for example, AAD on the x86, or the Vax instruction set).
Time overall will be determined by the element size (set by the program) and vector size (fixed by the member specification) which will set the value of N for the logN stages. A vector add is one clock, for integers <= 32 bits. A shuffle is one clock too, so you are looking at two clocks per stage. However, I have a sneaking suspicion that it may be more complicated to deal with None/NaR the way you want (a normal reduction would return None/NaR if any element was). I encourage you (and any other interested readers) to play with the code and see what you can do. You may well find that you would really like some kind of specialized shuffle or other specialize operation, and we can and would add such an operation.
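(An editorial sketch in plain C of the logN shuffle/add pattern described above, each stage costing a shuffle plus a vector add on the real machine. The actual Mill sequence leaves the reduction in A[N-1], whereas this toy version accumulates into element 0; None/NaR handling is omitted.)

    #include <stdint.h>

    #define N 8   /* element count of the vector on some hypothetical member */

    static uint32_t reduce_sum(const uint32_t *in)
    {
        uint32_t v[N];
        for (int i = 0; i < N; i++) v[i] = in[i];

        for (int stride = 1; stride < N; stride *= 2)      /* log2(N) stages   */
            for (int i = 0; i + stride < N; i += 2 * stride)
                v[i] += v[i + stride];                     /* pair-wise sums   */

        return v[0];   /* the reduction; other lanes hold partial garbage */
    }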
Add-reductions keep coming up in my mind when doing 3D (e.g. as game and graphics engines will be doing buckets of). In 3D graphics there are lots of vectors which are 3 or 4 long.
I imagine that, whilst belt vectors are powers-of-2 in length, you can load a non-power-of-2 vector, and the load automatically pads it with Nones? So if you load(addr,float32,3) you actually get {x, y, z, None}. And you’d want an add reduction to treat Nones as 0 rather than propagate them.

The shuffle sounds useful for computing cross-product.

Generally in games/graphics you want sqrt, inverse sqrt and dot product. You also likely want to sum a vector again when you do matrix multiplication.
My thinking would be that in the Mill IR sqrt, inverse sqrt, sum reduction, fork/exec/vfork, memcpy and memmove etc are built-in functions, and the specialiser on each target turns that into single or multiple operations as the target supports. So that’s like microcode (or standard function inlining), but in the specialising compiler rather than in the outer compiler or on-CPU. It would be a hassle for a specialiser to have to unravel some old IR that is coding its own sqrt loop using a lower-level operation if there is ever hardware with better built-in sqrt, for example?
And as for hazards, we all want to avoid them, but pragmatically if its the specialiser that has to know about them and it has to know about one of them, it might as well open the floodgates and have a few more 😉
There is currently no way to load a partial vector. The interface with the cache could handle it (we send a byte-mask anyway), but it would be difficult to encode: we already need morsels for base, index and width, and up to four bytes of manifest for an offset. If (for example) the target is 32 bytes high and it’s a byte vector to load, that’s another 32 bits in the encoding for the mask. That doesn’t fit in one slot.
The “normal” reduction expansion produces None if any element is None and NaR if any is NaR. To get the behavior you suggest (treating None as the identity element of the reduction operation) you would do the following (assuming <data> is a byte vector):

rd(bv(0)),                       // get a vector of zero-bytes
isNone(<data>),                  // test the data for None
pick(<isNone>, <rd>, <data>);    // replace the Nones with zero
<reduce sequence using <pick>>   // reduce
Your suggestion that there should be a sqrt op that the specializer replaces with emulation is in fact what happens now; we don’t expect to ever do a sqrt (or friends) in native, but it’s easy to co-opt the specializer’s emulation mechanism for intrinsic routines. I don’t think of them as microcode, because they are visible to the user and get scheduled mixed in with other macrocode, but YMMV.
Hazards are a pain. The specializer can do them (and does for FMA), but they louse up the schedule for the rest of the code. When a hazardful op offers no real advantage over hazard-free emulation then we’ll leave it out.
You can indeed do a cumulative sum of N elements with log(N) sums of vectors of length N/2. The problem is all the reshuffling, storing, retrieving and duplicating of intermediate values that you have to do to get there. For small vectors like the ones we are talking about, there is probably little to no advantage over the naive method.
Now that it’s filed as part of the filing for the Execution talk, let me introduce you to the “alternate” op:

alternate(a, b) -> {c, d}

where a/b/c/d are vectors. For example, using 4-element vectors where a is {a0, a1, a2, a3} and b is {b0, b1, b2, b3}, c is {a0, b0, a1, b1} and d is {a2, b2, a3, b3}.

Reduction is logN stages with an alternate between stages.
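(Transcribing the stated semantics into a trivial C helper, for 4-element vectors; plain arrays stand in for belt vectors.)

    /* c = even-indexed interleave of a and b, d = odd-indexed interleave. */
    static void alternate4(const int a[4], const int b[4], int c[4], int d[4])
    {
        c[0] = a[0]; c[1] = b[0]; c[2] = a[1]; c[3] = b[1];
        d[0] = a[2]; d[1] = b[2]; d[2] = a[3]; d[3] = b[3];
    }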
I’m not sure that’s exactly what we want. To get maximum performance, we should probably use this algorithm:
for d = 1 to log2 n do
  for all k in parallel do
    if k >= 2^d then
      x[k] = x[k - 2^(d-1)] + x[k]
(from NVIDIA’s GPU Gems page)
Alternate would be useful for d = 3, but the other steps require a shuffle.
By the way, I would be personally excited for faster cumulative sum because it is useful for implementing fast gaussian blurs.
Your sample code is a progressive sum rather than a reduction, although of course the final element of a progressive sum is the overall reduction value. Alternate is for reductions, and as a result is independent of the value of d. It’s not clear to me how to have a d-independent stage for a progressive sum, or even if it is possible. You got an algorithm, or a proof that it is impossible?
Currently the shuffle op can provide d-dependent staging as used by your code snippet; the drawback to shuffle is that it is expensive in entropy, hence the introduction of alternate as a special case. Clearly we could add a set of d-dependent ops, covering all possible vector widths on a given member, to replace the shuffles. It’s not clear that progressive sum (or progressive anything in general) is common enough to be worth the clutter in the instruction set. If you can come up with a d-independent stage then I’d be much more inclined to put it in the Mill op set.
Ivan
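(For reference while reading the exchange above and below: a minimal C sketch of the inclusive progressive/prefix sum being discussed, done in log2(N) passes, each adding in the element 2^(d-1) positions back. Plain scalar C; the inner loop over k is what would be “in parallel” on vector hardware, and the guard here is 2^(d-1) so that every element accumulates its full partial sum.)

    #include <stdint.h>

    #define N 8   /* vector length; a power of two for this sketch */

    static void prefix_sum(uint32_t x[N])
    {
        for (int offset = 1; offset < N; offset *= 2) {    /* d = 1 .. log2 N */
            uint32_t prev[N];
            for (int k = 0; k < N; k++) prev[k] = x[k];    /* double-buffer   */
            for (int k = offset; k < N; k++)               /* "in parallel"   */
                x[k] = prev[k - offset] + prev[k];
        }
    }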
I think it’s worth mentioning that progressive sum (or scan / prefix sum, as it’s often called) is a very useful building block for a number of parallel algorithms, often ones that involve mapping each input element into 0 to N outputs, with the output amount varying for each input.
One interesting case is using it to do radix sorting, where you first loop through a large array of integers, counting histograms for each of your radix “bins”, then compute base offsets for each bin using prefix sum, and finally scatter the array elements into their new places using those base offsets.
Having a native instruction (or an IR instruction that is specialized to machine code) to compute native vector-wide prefix sums would be very nice, since longer prefix sums can easily be built in terms of smaller ones.
Of course, prefix sum can always be implemented manually with shuffles, but I could imagine that it might be inconvenient if you don’t know the hardware vector size at compile time (you need log N steps, but you need to know what N is, and I understood the hardware vector size varies between Mill family members).
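(As a reference point for the radix-sort use mentioned above, an editorial scalar sketch of one 8-bit pass: histogram the digit, exclusive-prefix-sum the counts into base offsets, then do a stable scatter. The prefix sum over the counts is the step a native scan would accelerate.)

    #include <stdint.h>
    #include <stddef.h>

    static void radix_pass(const uint32_t *src, uint32_t *dst, size_t n, int shift)
    {
        size_t count[256] = {0};

        for (size_t i = 0; i < n; i++)                 /* histogram            */
            count[(src[i] >> shift) & 0xFF]++;

        size_t base = 0;                               /* exclusive prefix sum */
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = base;
            base += c;
        }

        for (size_t i = 0; i < n; i++) {               /* stable scatter       */
            int b = (src[i] >> shift) & 0xFF;
            dst[count[b]++] = src[i];
        }
    }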