Metadata

ivan

You're good :-)

Half of the kind-field values are devoted to None, so a None can be distinguished from any other kind of NaR by looking at one bit in the payload, and the non-speculable operators need only to look at two bits: the NaR bit in the metadata and the None bit in the payload. There is time to do that in the hardware, even for complicated cases like widen(), where the result format is different and the inbound None/NaRs need to have their payload size changed.

And yes, None takes precedence over NaR.

ivan

Wow. Color me impressed.

The operation is of course possible; it's a flavor of reduction sum, and will be NlogN/2 like other reductions. We've gone back and forth about reductions in the Mill, and currently there are no reduction operations except those for bool vectors for which we look only at a single bit per element: any() (reduction OR), all() (reduction and), and the various smear()s.

The problem with sigma and pi (reduction sum and product) is that these are essentially subroutines in the hardware. For a microcoded machine you can make them be actual subroutines, but the Mill strongly avoids microcode. If done as direct hardware, reductions completely plug up a compute pipe for an extended time, which causes schedule hazards that the operation scheduling algorithm has to deal with. Questions like what happens if you start a reduction, then take a branch and the target code tries to use the adder in that pipe :-(

The general strategy the Mill follow is to expose subroutine-like operations to the software, and let the code treat them as open-coded functions rather than operations. For example, the Mill has no sqrt() operation. Instead, it provides an sqrtApproximation() operation that can be used as the seed of a Newton-Rapheson sqrt. The individual stages of the NR are ordinary ops and can be scheduled normally.

We have taken the same approach with reductions, except for the bool cases that can be done directly in logic. A sigma can be done by logN stages of an ordinary vector sum and a shuffle. In machines with narrow issue or pipeline hazards, especially when the are only a few supported widths (looking at you, x86) there are advantages to letting this be a single op that expands to internal microcode that knows where the hazards are. On a hazard-free wide-issue machine with generalized widths, like the Mill, there's no real advantage to such an operation over the explicit shuffle/add/shuffle/add... sequence.

Our current sum-reduction code sequence does not produce the intermediate vector you are looking for; it does pair-wise sums starting with A[0]+A[1], and at the end the reduction is in A[N-1] (pulled out as a scalar extract() at the end) and the rest of the vector is garbage for your need. You might take a look at trying to code your case to see what shuffles will get the vector you want. A Mill shuffle produces a same-count vector with arbitrary mapping from source to destination element positions, including duplicates. It would be especially attractive if the same shuffle pattern were used for all stages, so the code could be generated for arbitrary (member and element width dependent) N. Please post what you come up with.

Thank you - it's an interesting use-case and made me think hard about reductions.

bhurt

I will sheepishly admit that I hadn't thought of the log(N) reduction idea. If you can you can implement this with some combination of shuffles and vectorized adds in, say, four clocks, then the need for this instruction goes away. Or even six clocks- and I'd be surprised if it took six clocks. It's simply not a big enough win anymore to make it worth the instruction encoding hit and increased complexity of the CPU. I withdraw the suggestion. :-}

In general, I think I would advocate for a simpler instruction set, all other things being equal. I've yet to see an instruction set get simpler as it ages. There is always time to add complexity later, if it's needed. Once added, however, complexity never seems to get removed. And it has a bad habit of becoming an albatross hung around the neck of the instruction set, unused but impossible to get rid of (see, for example, AAD on the x86, or the Vax instruction set).

ivan

Time overall will be determined by the element size (set by the program) and vector size (fixed by the member specification) which will set the value of N for the logN stages. A vector add is one clock, for integers <= 32 bits. A shuffle is one clock too, so you are looking at two clocks per stage. However, I have a sneaking suspicion that it may be more complicated to deal with None/NaR the way you want (a normal reduction would return None/NaR if any element was). I encourage you (and any other interested readers) to play with the code and see what you can do. You may well find that you would really like some kind of specialized shuffle or other specialize operation, and we can and would add such an operation.

Will_Edwards

Add-reductions keep coming up in my mind when doing 3D (e.g. as game and graphics engines will be doing buckets of). In 3D graphics there are lots of vectors which are 3 or 4 long.

I imagine that, whilst belt vectors are powers-of-2 in length, you can load a non-power-of-2 vector, and the load automatically pads it with Nones? So if you load(addr,float32,3) you actually get xyzNone.

And you'd want an add reduction to treat Nones as 0 rather than propogate them.

The shuffle sounds useful for computing cross-product.

Generally in games/graphics you want sqrt, inverse sqrt and dot product. You also likely want to sum a vector again when you do matrix multiplication.

My thinking would be that in the Mill IR sqrt, inverse sqrt, sum reduction, fork/exec/vfork, memcpy and memmove etc are built-in functions, and the specialiser on each target turns that into single or multiple operations as the target supports. So that's like microcode (or standard function inlining), but in the specialising compiler rather than in the outer compiler or on-CPU. It would be a hassle for a specialiser to have to unravel some old IR that is coding its own sqrt loop using a lower-level operation if there is ever hardware with better built-in sqrt, for example?

And as for hazards, we all want to avoid them, but pragmatically if its the specialiser that has to know about them and it has to know about one of them, it might as well open the floodgates and have a few more ;)

ivan

There is currently no way to load a partial vector. The interface with the cache could handle it (we send a byte-mask anyway), but it would be difficult to encode: we already need morsels for base, index amd width, and up to the four bytes of manifest for an offset. If (for example) the target is 32 bytes high and it's a byte vector to load, that's another 32 bits in the encoding for the mask. That doesn't fit in one slot.

The "normal" reduction expansion produces None if any element is None and NaR if any is NaR. To get the behavior you suggest (to treat None as the identity element of the reduction operation) then you would: (assume is a byte vector)

rd(bv(0)),                         // get a vector of zero-bytes

isNone(<data>),                    // test the data for none

pick(<isNone>, <rd>, <data>);      // replace the Nones with zero

<reduce sequence using <pick> >    // reduce

Your suggestion that there should be a sqrt op that the specializer replaces with emulation is in fact what happens now; we don't expect to ever do a sqrt (or friends) in native, but it's easy to co-opt the specializer's emulation mechanism for intrinsic routines. I don't think of them as microcode, because they are visible to the user and get scheduled mixed in with other macrocode, but YMMV.

Hazards are a pain. The specializer can do them (and does for FMA), but they louse up the schedule for the rest of the code. When a hazardful op offers no real advantage over hazard-free emulation then we'll leave it out.

mermerico

You can indeed do a cumulative sum of N elements with log(N) sums of vectors of length N/2. The problem is all the reshuffling, storing, retrieving and duplicating of intermediate values that you have to do to get there. For small vectors like the ones we are talking about, there is probably little to no advantage over the naive method.

ivan

Now that it's filed as part of the filing for the Execution talk, let me introduce you to the "alternate" op:
alternate(a, b)->{c,d}
where a/b/c/d are vectors. For example using 4 element vectors, where a is {a0, a1, a2, a3} and b is {b0, b1, b2, b3}, c is {a0, b0, a1, b1} and d is {a2, b2, a3, b3}.

Reduction is logN stages with an alternate between stages.

mermerico

I'm not sure that's exactly what we want. To get maximum performance, we should probably use this algorithm: Naive algorithm taken from Nvidia GPU Gems

Naive algorithm taken from Nvidia GPU Gems

1: for d = 1 to log2 n do

2:     for all k in parallel do

3:          if k >= 2^d  then

4:              x[k] = x[k – 2^(d-1)] + x[k]

(from NVIDIA'as GPU Gems page)

Alternate would be useful for d = 3, but the other steps require a shuffle.

By the way, I would be personally excited for faster cumulative sum because it is useful for implementing fast gaussian blurs.

ivan

Your sample code is a progressive sum rather than a reduction, although of course the final element of a progressive sum is the overall reduction value. Alternate is for reductions, and as a result is independent of the value of d. It's not clear to me how to have a d-independent stage for a progressive sum, or even if it is possible. You got an algorithm, or a proof that it is impossible?

Currently the shuffle op can provide d-dependent staging as used by your code snippet; the drawback to shuffle is that it is expensive in entropy, hence the introduction of alternate as a special case. Clearly we could add a set of d-dependent ops, covering all possible vector widths on a given member, to replace the shuffles. It's not clear that progressive sum (or progressive anything in general) is common enough to be worth the clutter in the instruction set. If you can come up with a d-independent stage then I'd be much more inclined to put it in the Mill op set.

Ivan

JussiK

I think it's worth to mention that progressive sum (or scan / prefix sum as it's often called) is a very useful building block for a number of parallel algorithms, often ones that involve mapping each input element into 0 to N outputs, with the output amount varying for each input.

One interesting case is using it to do radix sorting, where you first loop through a large array of integers, counting histograms for each of your radix "bins", then compute base offsets for each bin using prefix sum, and finally scatter the array elements into their new places using those base offsets.

Having a native instruction (or an IR instruction that is specialized to machine code) to compute native vector-wide prefix sums would be very nice, since longer prefix sums can easily be built in terms of smaller ones.

Of course, prefix sum can always be implemented manually with shuffles, but I could imagine that it might be inconvenient if you don't know the hardware vector size at compile time (you need log N steps, but you need to know what N is, and I understood the hardware vector size varies between Mill family members).

« Previous Page