Division: Algorithms and user-submitted genAsm code

From Mill Computing Wiki
Jump to: navigation, search

Integer Division on the Mill: User-contributions.

This page is for collecting ideas, pseudo-code and hopefully genAsm emulation sequences to do (integer) division on the Mill. (The floating-point division emulation code is apparently either done or well along.)

As in the parent page, please do NOT speculate on the Mill, per se. But alternative algorithms are most welcome. So, please feel free to add links to different ways to implement division.

N-R is shorthand for the Newton-Raphson methodfor finding the roots of a function f(x), given an initial guess for x and a way to compute the derivative, df/dx. Wikipedia's description of the method is at [1]


Assumptions:

The Mill's rdiv* operator family (e.g. rdivu [[2]] and rdivs) provides an approximation to the reciprocal of its input argument. Scaling and precision are still TBD. Note that the Wikipedia/Newton-Raphson algorithm, as given, does not use any special instructions to estimate reciprocals, since most CPUs don't have instructions/operations to estimate 1/x, given x. Since the Mill does have such an operation, the hope is that we can use it to converge some flavor of iterative division in fewer operations than the conventional N-R algorithm.

Having rdiv as a (fairly fast, I hope) Mill operation suggests a divide algorithm, in which we use rdiv(trialQuotient_i) * trialRemainder_i as a correction to trialQuotient_i, to get a new starting value, trialQuotient_i+1, for the next iteration. (Terje was already thinking along these lines, before I did.) What's not yet clear to me is whether there's better convergence to be had, either in combination with classic N-R or instead of N-R. (And whether it'll be faster than textbook N-R.

We can use integer operations (other than division), such as add, subtract, multiply, shift and count leading (and/or trailing) zeros.

Ivan says to unroll any loops -- that is, simply do the required number of iterations to converge. Unrolled code, while less elegant, should permit the specializer to schedule operations for other data flows in parallel with divisions.

It's not yet clear to me what matters most in coding for speed. While there are some special cases -- where we could return results faster than the general case. Those cases are usually data-dependent, so the specializer cannot usually handle them at specialization time. (I'm assuming that LLVM will have already found and handled the special cases where the compiler can avoid division, e.g. by shifting when the divisor is a constant and a power of two.)

The kicker is that trying to short-circuit those special cases (that LLVM cannot) means extra operations (to detect at runtime whether we have one of the special cases), and those extra operations cost cycles on all division ops. So it's not yet clear to me whether we want to code to minimize the worst-case speed of division vs. code for fastest division on a statistical basis. One of the things that makes this tricky to understand is that, on a wide Mill, we can usually do those checks fairly cheaply. Whereas on a low-end Mill, we often won't have "spare" ALU ops that can be calculating and checking predicates. This same issue will also matter to whether/not doing "are we done yet?" tests will help or hurt performance.

??What can we assume about the speeds of integer operations??

If short (e.g. shorter than 32-bit) integer math is no faster than 32 bit, then it will probably make sense to promote shorter divisions to 32 bit. However, writing width-aware genAsm code is not something we have examples of, that I'm aware of to date. (Hint, hint.)

I (LarryP) suspect that 128-bit division will be trickier than the shorter widths, since we have no support for 256-bit intermediates.


Links: Wikipedia article on division algorithms: [3]

Wikipedia subsection on fast division algorithms: [4]

Note that the above link seems to make an implicit assumption that floating-point math will be used. That's almost certainly not the case on the Mill, since we need to emulate division on family members that lack native floating point. Similarly, the pre-scaling outlined is optimized for (normalized) floating point values. So -- among other things -- the suggested constants may well not be correct (let alone optimal) for integer math to emulate division.


LarryP, 21April2015: I originally thought that doing Newton-Raphson root finding would run fastest if we had both a reciprocal approximation as well as a slope of the reciprocal function. However, due to properties of the "reciprofying" hyperbola, it appears (see the Wikipedia subsection linked above on fast division) that all we need is a "good," suitably-scaled reciprocal approximation, namely rdiv*. How good an approximation to 1/x we need (vs. the speed of the emulated division is an open question, right now, so far as I know.

On the interval x memberof( [1 -- 2^n-1]), the reciprocal itself, 1/x, is always positive, and the slope is always negative.



Some rough pseudocode is now in:

Larry's PseudoCode for Emulating Division

Q: How do we get the Wiki to do sane CODE formatting??

A: Bracket the code with <pre> and </pre> tags.


Timing and scheduling implications:

Mill operations not implemented as a subroutine retire results after a time known at specializer time, independent of data width(?). Division algorithms depend on a number of passes that varies with data width, presenting a complication for unrolled loops to be inter-spaced with other operations. Data size is often not knowable at specializer time.


Goldschmidt division

--Mkbosmans (talk) 15:11, 27 April 2015 (UTC):
On the Wikipedia article on division algorithms: [5] there is an alternative to NR presented, Goldschmidt division. In my limited testing a fixed point version of the algorithm it appeared to require 1 or 2 more iterations than NR to converge to the correct value. However, it has the following advantages:

  • Only 2 mul and 1 add instead of 2 mul, 2 add per iteration.
  • A total of 2*N_iter - 1 multiplies needed, instead of 2*N_iter + 2 for NR.
  • Only 1 (or possibly neither, I'm not sure) of the 2 fixed point multiplies needs to be at double width.
  • The 2 muls for an iteration can be scheduled together, where the 2 NR muls have a data dependency between them.

There is a good paper on software integer division by someone at Microsoft Research [6]