Forum Replies Created
- in reply to: Introduction to the Mill CPU Programming Model #622
Yes, you can have a wider path to DRAM, although there are board-level hardware issues; the costs and benefits are independent of the ISA. And you can reduce the DRAM latency by stacking and similar approaches; this too is independent of the ISA.
Where the ISA can help is twofold: reducing the total DRAM traffic; and changing the timing. The Memory talk showed ways that the Mill cuts the traffic bandwidth; not a lot (perhaps 25%), but every bit helps. And the Mill has ways to alter the timing, but the patents aren’t in on those yet.
As for trading, there's not much left to trade. On many conventionals there are serious overheads setting the machine up for vectors, but the Mill doesn't have that problem (and there's no fundamental reason others should either). You might as well vectorize a short loop on a Mill; even if it turns out that there's only one element in the array, the cost of the vector code and the scalar code will be very close, and vector wins big once the element count exceeds the vector height. The only real advantage of a scalar loop for very short arrays is in power usage – they are equally fast, but dealing with the Nones in the unused elements does cost power.
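A rough C sketch of that trade-off, not Mill code: the loop below processes the array in vector-height chunks and simply masks off the tail lanes, the way unused Mill lanes would carry None; the height of 8 is a hypothetical member parameter.

#include <stddef.h>

#define VEC_HEIGHT 8   /* hypothetical vector height, in elements */

void scale_vectorized(int *a, size_t n, int k)
{
    for (size_t i = 0; i < n; i += VEC_HEIGHT) {
        /* One "vector" iteration: all lanes issue; out-of-range lanes are
         * masked, the way None-filled lanes would be discarded on a store. */
        for (size_t lane = 0; lane < VEC_HEIGHT; lane++) {
            if (i + lane < n)
                a[i + lane] *= k;
        }
    }
}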
Not sure I’ve answered your question; feel free to try again.
Your sample code is a progressive sum rather than a reduction, although of course the final element of a progressive sum is the overall reduction value. Alternate is for reductions, and as a result is independent of the value of d. It’s not clear to me how to have a d-independent stage for a progressive sum, or even if it is possible. You got an algorithm, or a proof that it is impossible?
Currently the shuffle op can provide d-dependent staging as used by your code snippet; the drawback to shuffle is that it is expensive in entropy, hence the introduction of alternate as a special case. Clearly we could add a set of d-dependent ops, covering all possible vector widths on a given member, to replace the shuffles. It’s not clear that progressive sum (or progressive anything in general) is common enough to be worth the clutter in the instruction set. If you can come up with a d-independent stage then I’d be much more inclined to put it in the Mill op set.
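To make the distinction concrete, here is an illustrative C sketch (not Mill code, and the width of 8 is hypothetical): every stage of a pairwise reduction applies the same combine-adjacent-pairs pattern, while each stage of a progressive (prefix) sum, done Hillis–Steele style, needs a shift whose distance depends on the stage number d.

#include <stdio.h>

#define N 8

static void reduce_stage(int *v, int elems)
{
    /* Same pattern at every stage: element i gets v[2i] + v[2i+1]. */
    for (int i = 0; i < elems / 2; i++)
        v[i] = v[2 * i] + v[2 * i + 1];
}

static void scan_stage(int *v, int d)
{
    /* Prefix-sum stage: the shift distance (1 << d) changes with d,
     * which is what makes the staging d-dependent. */
    int tmp[N];
    int dist = 1 << d;
    for (int i = 0; i < N; i++)
        tmp[i] = (i >= dist) ? v[i] + v[i - dist] : v[i];
    for (int i = 0; i < N; i++)
        v[i] = tmp[i];
}

int main(void)
{
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    for (int elems = N; elems > 1; elems /= 2)   /* logN identical stages */
        reduce_stage(a, elems);
    printf("reduction: %d\n", a[0]);             /* 36 */

    for (int d = 0; (1 << d) < N; d++)           /* logN d-dependent stages */
        scan_stage(b, d);
    printf("prefix sum last element: %d\n", b[N - 1]);   /* 36 */
    return 0;
}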
Ivan
p.s. Forgot to respond to your protection question:
The spiller has its own region of the address space. Spiller space is used for all processes, and the saved state of the different processes is not in the process's address space and so cannot be directly addressed by the program. In effect you can think of the spiller as having its own private PLB, although it need not be (and so far is not) implemented that way.
Utilities like debuggers and stack unwinders get access to saved state through an API that runs with PLB rights to spiller space; in effect the API is another process. The API is restricted; you cannot arbitrarily change the downstack links for example. As a result, the usual stack-smash exploit is impossible on the Mill.
Because the application cannot address spiller space, there is no synchronization needed between app use of memory and spiller use of memory; they are necessarily disjoint.
There are essentially three layers in the spiller. At the top layer, data is stored in flipflops, effectively registers except not addressable by the program. These are buffers, used as parking places for values that are still live on the belt and sitting in the result register of a functional unit when the FU has another result coming out and needs its result register back. These live operands are first daisy-chained up the latency line within the containing FU pipeline, but eventually the pipeline runs out and the operand, if still live, gets moved to a spiller buffer. The cost of the move is the same as a register-to-register copy on a conventional machine. A running program that doesn't do a function call or return will live entirely in these registers, with no spiller traffic below that.
The second layer is a block of SRAM, connected to those spiller registers. When you do a call operation, we save the belt, which means the belt state in the pipe and spiller registers is saved. The save is lazy, but gradually live-but-inactive operands are copied out of the needed registers into the spiller SRAM. The SRAM is big enough to hold several belts (and scratchpads and other non-memory frame-related state), so you can nest several deep in calls with everything fitting in the spiller SRAM. Most programs spend most of their time within a frame working set of five or so, calling in a few, returning out a few, calling in a few, over and over. Such behavior fits entirely internal to the spiller.
However, if a program suddenly switches to a deep run of calls, a run of returns, or if there’s a task switch, the state of all those nested calls (or the new thread) will exceed the spiller’s SRAM and the spiller then uses the third level, which is the regular memory hierarchy. The spiller does not go direct to DRAM; it talks to the L2 cache instead, which provides still more buffering.
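Here is a toy C model of those three levels; the sizes and spill policy are my own assumptions, purely to illustrate that nested calls overflow outward one level at a time and that the common shallow call/return pattern never leaves the SRAM.

#include <stdio.h>

#define SRAM_FRAMES 8            /* assumed: SRAM holds several belts' worth */

static int frames_in_sram = 0;   /* saved frames currently in spiller SRAM   */
static int frames_in_mem  = 0;   /* frames that overflowed to the L2/memory  */

static void call(void)
{
    if (frames_in_sram < SRAM_FRAMES)
        frames_in_sram++;        /* lazy save of belt state into the SRAM    */
    else
        frames_in_mem++;         /* deep run of calls: spill to the hierarchy */
}

static void ret(void)
{
    if (frames_in_sram > 0)
        frames_in_sram--;        /* restore from SRAM, no memory traffic      */
    else if (frames_in_mem > 0)
        frames_in_mem--;         /* refill from the memory hierarchy; a real
                                    spiller would refill ahead of demand      */
}

int main(void)
{
    for (int i = 0; i < 12; i++) call();   /* deep nesting overflows the SRAM */
    printf("in SRAM: %d, in memory: %d\n", frames_in_sram, frames_in_mem);
    for (int i = 0; i < 12; i++) ret();
    return 0;
}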
If you compare the spiller with the explicit state save used by a conventional legacy register machine, you will see that the spiller top level is akin to the register machine’s rename and architected registers; if a function fits in the registers then there’s no traffic, just as if it fits in the belt and scratchpad there’s no traffic.
If there are nested calls on a conventional then the register state gets saved to the hierarchy using normal store operations. These go to the D$1 cache and are buffered there. This is akin to the spiller's SRAM, but the spiller has three advantages: it uses no program code to do the save, so the power, instruction entropy, and store-buffer contention costs of the stores are avoided; it uses a private repository, so saves do not clutter the D$1 (which is a very scarce resource); and it's not program-visible, removing many possibilities of program bugs and exploits.
Lastly, if you have deeply nested calls on a conventional, the saved state exceeds the capacity of the D$1 and will overflow into the D$2 and eventually DRAM. The spiller does the same when it runs out of private SRAM. Put all this together, and you see that the spiller is in effect a bunch of registers hooked to its own cache, and the overall benefit is to shift state save/restore out of the top level D$ cache and into the spiller SRAM which is in effect a private spiller cache, freeing up space for real data.
One last point: the total Mill state traffic is less than that on a conventional. A conventional callee-save protocol winds up saving registers that are in fact dead, but the callee doesn’t know that and saves them anyway. And the existence of the scratchpad on a Mill means that many function locals that would be kept in memory are in the scratchpad and so do not contribute to cache load and memory bandwidth. Combine these effects, and our sims suggest that the Mill save/restore and locals traffic is about a factor of two less than that of a conventional. This saves not only bandwidth but also power.
We do not have large-scale sims yet so the overall results are guesstimates, but it does appear that actual DRAM traffic on a Mill will be overwhelmingly composed of I/O and very large external data sets, which have the same traffic load as on a conventional; save/restore, locals, and the working set of globals will never see DRAM. That's why the Mill has the Virtual Zero, which lets it use “memory” that has no backing DRAM.
All this needs pictures, and we will get to the spiller in the talks eventually.
Now that it’s filed as part of the filing for the Execution talk, let me introduce you to the “alternate” op:
alternate(a, b)->{c,d}
where a/b/c/d are vectors. For example, using 4-element vectors, where a is {a0, a1, a2, a3} and b is {b0, b1, b2, b3}, c is {a0, b0, a1, b1} and d is {a2, b2, a3, b3}. Reduction is logN stages with an alternate between stages.
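For concreteness, here is a small C model of alternate() and one plausible way the logN staging could go for a sum reduction; it is an illustration only, with a hypothetical 4-element vector, not actual Mill code.

#include <stdio.h>

#define W 4

/* alternate(a, b) -> {c, d} as described above:
 * c = {a0, b0, a1, b1}, d = {a2, b2, a3, b3}. */
static void alternate(const int *a, const int *b, int *c, int *d)
{
    for (int i = 0; i < W / 2; i++) {
        c[2 * i]     = a[i];
        c[2 * i + 1] = b[i];
        d[2 * i]     = a[W / 2 + i];
        d[2 * i + 1] = b[W / 2 + i];
    }
}

int main(void)
{
    int s[W] = {1, 2, 3, 4};
    int c[W], d[W];

    /* logN stages: interleave the vector with itself, then add the halves. */
    for (int stage = 0; (1 << stage) < W; stage++) {
        alternate(s, s, c, d);
        for (int i = 0; i < W; i++)
            s[i] = c[i] + d[i];
    }
    printf("reduction = %d\n", s[0]);   /* 10 */
    return 0;
}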
- in reply to: Introduction to the Mill CPU Programming Model #617
Unobtainium yet? Coming right up.
Note on terminology: we refer to the MIMD dimension as the width of the machine, and the vector size as the height. Width is thus an encoding matter, which is related to but not the same as the number of Belt positions, which we call the length.
- in reply to: Introduction to the Mill CPU Programming Model #616
Recall that there are two dimensions of parallelism on a Mill: MIMD width (number of slots, == number of concurrent operations), and vector height (in bytes). It's hard to find code that can use more than a Silver in open code; there's just not that much ILP in the code, even with phasing. In pipelined loops there's all the ILP you want, but once you have every op in the loop running pipelined then all you can do is unroll and do several iterations at once. Even HPC loops are usually only one or two instructions on a Gold, so it's not clear that there is much use for greater width. While wider is certainly possible, we spec'd Gold to be at what we expected to be the point of diminishing returns for width.
However, even on a narrower configuration there is some use for more height, i.e. more SIMD even if not so much MIMD. Those pipelined loops can often be vectorized, and if the iteration count is big enough to use the height then that's probably the best way to get raw horsepower.
Of course, all the horsepower in the world will not give you anything if you are memory-bandwidth limited, and, as you suggest, bandwidth is likely to be the practical limit.
- in reply to: Instruction Encoding #580
In the recommended OS model for Mill these sorts of exploits are impossible. Security on the Mill is the subject of a future talk, but briefly: the Mill is intended to use a nano-kernel style OS structure. There is no “kernel mode” or “user mode”, no supervisor, and no privileged operations; “stack cracking” is impossible. Where architecture comes in: there is no performance penalty for a secure system on the Mill.
IMO, putting a user-reachable JIT where it can access memory that it is not intended to use, all in the name of performance, is appalling. The inevitable result is well deserved.
There is no complete technological solution to security; all precautions fall before blackmail, cute agents, and other social penetrations. However, admitting that is not the same as posting a large “enter here” sign on an unlocked door.
There is currently no way to load a partial vector. The interface with the cache could handle it (we send a byte-mask anyway), but it would be difficult to encode: we already need morsels for base, index and width, and up to four bytes of manifest for an offset. If (for example) the target is 32 bytes high and it's a byte vector to load, that's another 32 bits in the encoding for the mask. That doesn't fit in one slot.
The “normal” reduction expansion produces None if any element is None and NaR if any is NaR. To get the behavior you suggest (treating None as the identity element of the reduction operation) you would do something like this (assume the data is a byte vector):
rd(bv(0)),                       // get a vector of zero-bytes
isNone(<data>),                  // test the data for None
pick(<isNone>, <rd>, <data>);    // replace the Nones with zero
<reduce sequence using <pick>>   // reduce
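The same idea in plain C, as an illustration only (a separate boolean mask stands in for the Mill's per-element None metadata):

#include <stdbool.h>
#include <stdio.h>

#define W 8

int main(void)
{
    int  data[W]    = {3, 1, 4, 1, 5, 9, 2, 6};
    bool is_none[W] = {false, false, true, false, true, false, false, false};

    int picked[W];
    for (int i = 0; i < W; i++)
        picked[i] = is_none[i] ? 0 : data[i];   /* pick(isNone, zeros, data) */

    int sum = 0;                                /* reduce sequence */
    for (int i = 0; i < W; i++)
        sum += picked[i];
    printf("sum = %d\n", sum);                  /* 22: the None lanes (4, 5) dropped out */
    return 0;
}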
Your suggestion that there should be a sqrt op that the specializer replaces with emulation is in fact what happens now; we don’t expect to ever do a sqrt (or friends) in native, but it’s easy to co-opt the specializer’s emulation mechanism for intrinsic routines. I don’t think of them as microcode, because they are visible to the user and get scheduled mixed in with other macrocode, but YMMV.
Hazards are a pain. The specializer can do them (and does for FMA), but they louse up the schedule for the rest of the code. When a hazardful op offers no real advantage over hazard-free emulation then we’ll leave it out.
California. Pacific time.
Time overall will be determined by the element size (set by the program) and vector size (fixed by the member specification), which will set the value of N for the logN stages. A vector add is one clock, for integers <= 32 bits. A shuffle is one clock too, so you are looking at two clocks per stage. However, I have a sneaking suspicion that it may be more complicated to deal with None/NaR the way you want (a normal reduction would return None/NaR if any element was). I encourage you (and any other interested readers) to play with the code and see what you can do. You may well find that you would really like some kind of specialized shuffle or other specialized operation, and we can and would add such an operation.
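For a concrete (and purely hypothetical) sizing: with 32-byte vectors and 4-byte integer elements, N = 8, so a sum reduction would run log2(8) = 3 stages at two clocks each, roughly 6 clocks in total.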
Wow. Color me impressed.
The operation is of course possible; it’s a flavor of reduction sum, and will be NlogN/2 like other reductions. We’ve gone back and forth about reductions in the Mill, and currently there are no reduction operations except those for bool vectors for which we look only at a single bit per element:
any() (reduction OR), all() (reduction AND), and the various smear()s.
The problem with sigma and pi (reduction sum and product) is that these are essentially subroutines in the hardware. For a microcoded machine you can make them be actual subroutines, but the Mill strongly avoids microcode. If done as direct hardware, reductions completely plug up a compute pipe for an extended time, which causes schedule hazards that the operation scheduling algorithm has to deal with. Questions arise like: what happens if you start a reduction, then take a branch, and the target code tries to use the adder in that pipe? 🙁
The general strategy the Mill follows is to expose subroutine-like operations to the software, and let the code treat them as open-coded functions rather than operations. For example, the Mill has no sqrt() operation. Instead, it provides a sqrtApproximation() operation that can be used as the seed of a Newton-Raphson sqrt. The individual stages of the NR are ordinary ops and can be scheduled normally.
We have taken the same approach with reductions, except for the bool cases that can be done directly in logic. A sigma can be done by logN stages of an ordinary vector sum and a shuffle. In machines with narrow issue or pipeline hazards, especially when there are only a few supported widths (looking at you, x86), there are advantages to letting this be a single op that expands to internal microcode that knows where the hazards are. On a hazard-free wide-issue machine with generalized widths, like the Mill, there's no real advantage to such an operation over the explicit shuffle/add/shuffle/add… sequence.
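To illustrate the open-coded style with that sqrt example: below is a rough C sketch in which a cheap seed approximation (the well-known bit-trick guess, standing in here for whatever sqrtApproximation() would return) is refined by ordinary Newton-Raphson steps that a scheduler could mix freely with other work. Nothing here is Mill-specific.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

static float rsqrt_seed(float a)            /* stand-in for the seed op */
{
    uint32_t i;
    memcpy(&i, &a, sizeof i);
    i = 0x5f3759dfu - (i >> 1);             /* rough 1/sqrt(a) estimate */
    float y;
    memcpy(&y, &i, sizeof y);
    return y;
}

static float sqrt_open_coded(float a)
{
    float y = rsqrt_seed(a);
    /* Each Newton-Raphson step is just mul/sub ops, schedulable normally. */
    y = y * (1.5f - 0.5f * a * y * y);
    y = y * (1.5f - 0.5f * a * y * y);
    y = y * (1.5f - 0.5f * a * y * y);
    return a * y;                           /* sqrt(a) = a * (1/sqrt(a)) */
}

int main(void)
{
    printf("%f\n", sqrt_open_coded(2.0f));  /* ~1.41421 */
    return 0;
}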
Our current sum-reduction code sequence does not produce the intermediate vector you are looking for; it does pair-wise sums starting with A[0]+A[1], and at the end the reduction is in A[N-1] (pulled out as a scalar extract() at the end) and the rest of the vector is garbage for your need. You might take a look at trying to code your case to see what shuffles will get the vector you want. A Mill shuffle produces a same-count vector with arbitrary mapping from source to destination element positions, including duplicates. It would be especially attractive if the same shuffle pattern were used for all stages, so the code could be generated for arbitrary (member and element width dependent) N. Please post what you come up with.
Thank you – it's an interesting use-case and made me think hard about reductions.
You’re good 🙂
Half of the kind-field values are devoted to None, so a None can be distinguished from any other kind of NaR by looking at one bit in the payload, and the non-speculable operators need only to look at two bits: the NaR bit in the metadata and the None bit in the payload. There is time to do that in the hardware, even for complicated cases like widen(), where the result format is different and the inbound None/NaRs need to have their payload size changed.
And yes, None takes precedence over NaR.
Actually NaR and None together are impossible. NaRs carry a “kind” indicating the nature of the initial problem, and a None is encoded as a flavor of NaR. That way the None/NaR propagation logic in the speculable FUs doesn’t have to test, and only the non-speculable ops (like store) need to care. As often happens, at the program level these are different but at the machine level they are blended to optimize the hardware.
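As a rough illustration of that two-bit test (the field widths and layout below are my own assumptions, not the Mill's, and the skip-on-None store behavior is assumed for the example): speculable ops just pass the metadata along without looking at the kind, while non-speculable ops look at the NaR flag in the metadata and the None bit of the kind in the payload.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool    nar;     /* metadata: is this a NaR (None is a flavor of NaR)? */
    uint8_t kind;    /* payload when nar: high bit set means None          */
    int32_t value;   /* payload when not nar                               */
} operand;

#define KIND_NONE_BIT 0x80

/* Speculable op: no testing of the kind, the metadata just propagates
 * (None/NaR precedence is elided in this toy). */
static operand spec_add(operand a, operand b)
{
    if (a.nar) return a;
    if (b.nar) return b;
    return (operand){ .nar = false, .kind = 0, .value = a.value + b.value };
}

/* Non-speculable op: looks at exactly the two bits before acting. */
static void store(operand x, int32_t *addr)
{
    if (!x.nar) {
        *addr = x.value;                    /* ordinary store */
    } else if (x.kind & KIND_NONE_BIT) {
        /* None: assumed behavior is that the store is silently skipped */
    } else {
        printf("fault: NaR of kind %u\n", (unsigned)x.kind);
    }
}

int main(void)
{
    int32_t cell = 0;
    operand a    = { .nar = false, .value = 3 };
    operand none = { .nar = true,  .kind = KIND_NONE_BIT };

    store(spec_add(a, a), &cell);        /* stores 6 */
    store(spec_add(a, none), &cell);     /* None propagated; store skipped */
    printf("cell = %d\n", (int)cell);    /* still 6 */
    return 0;
}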