Wow. Color me impressed.
The operation is of course possible; it’s a flavor of reduction sum, and will be NlogN/2 like other reductions. We’ve gone back and forth about reductions in the Mill, and currently there are no reduction operations except those for bool vectors for which we look only at a single bit per element:
any() (reduction OR),
all() (reduction and), and the various
The problem with sigma and pi (reduction sum and product) is that these are essentially subroutines in the hardware. For a microcoded machine you can make them be actual subroutines, but the Mill strongly avoids microcode. If done as direct hardware, reductions completely plug up a compute pipe for an extended time, which causes schedule hazards that the operation scheduling algorithm has to deal with. Questions like what happens if you start a reduction, then take a branch and the target code tries to use the adder in that pipe 🙁
The general strategy the Mill follow is to expose subroutine-like operations to the software, and let the code treat them as open-coded functions rather than operations. For example, the Mill has no
sqrt() operation. Instead, it provides an
sqrtApproximation() operation that can be used as the seed of a Newton-Rapheson sqrt. The individual stages of the NR are ordinary ops and can be scheduled normally.
We have taken the same approach with reductions, except for the bool cases that can be done directly in logic. A sigma can be done by logN stages of an ordinary vector sum and a shuffle. In machines with narrow issue or pipeline hazards, especially when the are only a few supported widths (looking at you, x86) there are advantages to letting this be a single op that expands to internal microcode that knows where the hazards are. On a hazard-free wide-issue machine with generalized widths, like the Mill, there’s no real advantage to such an operation over the explicit shuffle/add/shuffle/add… sequence.
Our current sum-reduction code sequence does not produce the intermediate vector you are looking for; it does pair-wise sums starting with A+A, and at the end the reduction is in A[N-1] (pulled out as a scalar
extract() at the end) and the rest of the vector is garbage for your need. You might take a look at trying to code your case to see what shuffles will get the vector you want. A Mill shuffle produces a same-count vector with arbitrary mapping from source to destination element positions, including duplicates. It would be especially attractive if the same shuffle pattern were used for all stages, so the code could be generated for arbitrary (member and element width dependent) N. Please post what you come up with.
Thank you – it’s an interesting use-case and made me think hard about reductions.