Division: Algorithms and user-submitted genAsm code
Integer Division on the Mill: User-contributions.
This page is for collecting ideas, pseudo-code and hopefully genAsm emulation sequences.
As in the parent page, please do NOT speculate on the Mill, per se. But alternative algorithms are most welcome. So, please feel free to add links to different ways to implement division.
Assumptions:
The rdiv* operator family (e.g. rdivu [[1]] and rdivs) provides an approximation to the reciprocal of its input argument. Scaling and precision are still TBD.
We can use integer operations (other than division), such as add, subtract, multiply, shift and count leading (and/or trailing) zeros.
??What can we assume about the speeds of integer operations??
If short (e.g. shorter than 32-bit) integer math is no faster than 32 bit, then it will probably make sense to promote shorter divisions to 32 bit. However, writing width-aware genAsm code is not something we have examples of, that I'm aware of to date. (Hint, hint.)
I (LarryP) suspect that 128-bit division will be tricky, since we have no support for 256-bit intermediates.
Links: Wikipedia article on division algorithms: [2]
Wikipedia subsection on fast division algorithms: [3]
Note that the above link seems to make an implicit assumtion that floating-point math will be used. That's almost certainly not the case on the Mill, since we need to emulate division on family members that lack native floating point. Similarly, the pre-scaling outlined is optimized for (normalized) floating point values. So -- among other things -- the suggested constants may well not be correct (let alone optimal) for integer math to emulate division.
LarryP, 21April2015: I originally thought that doing Newton-Raphson root finding would run fastest if we had both a reciprocal approximation as well as a slope of the reciprocal function. However, due to properties of the "reciprofying" hyperbola, it appears (see the Wikipedia subsection linked above on fast division) that all we need is a "good," suitably-scaled reciprocal approximaton, namely rdiv*. How good an approximation to 1/x we need (vs. the speed of the emulated division is an open question, right now, so far as I know.
On the interval x memberof( [1 -- 2^n-1]), the reciprocal itself, 1/x, is always positive, and the slope is always negative.
Some rough pseudocode follows.
How do we get the !@#$% Wiki to do sane CODE formatting??
Note, I'm defaulting to the variable names used in the Wikipedia Newton-Raphson division algorithm, but lower-cased wherever possible.
Function (OK, really more of a macro for expansion)
divu(n,d) --> q, r
// For now, assume width of both n and d are (a) the same and (b) are less than 128 bits.
if (isNaR(n) || isNar(d)) {return NaR, NaR} // Handle NaR inputs
if (isNone(n) || isNone(d)) {return None, None} // Handle NaR inputs
if (0 == d) {return NaR, NaR} // Handle zero divisor
/* How do we determine what width the arguments are?
* * The width matters, especially when either of the inputs * is already at max width (128 bits!!) * * For now, I'm assuming BOTH input args are a width were we can apply widen, * and get a result that's * the same number of elements as the input. This is bogus, but is a starting point. */
if (MAX_INT_BITS == width(d)|| MAX_INT_BITS == width(n)) GOTO another algorithm
d = widen(d);
n = widen(n);
lzd = countlz(d);
d = (d << (lzd + 1)); // the extra one is to get "floating-point-like"scaling
// WRT our implicit binary point at the midpoint of the widened // num and denom.
n = (n << (lzd + 1));
// Now have an implicit binary point at the midpoint of our width // And d is in the interval [1 -- 2) (can be 1, can't be 2 // with respect to our implicit binary point
// Why? Because, for a first cut, I want to try following the Wikipedia N-R algorithm, // including its suggested scaling. // Still looking for genAsm examples of width-aware code.
x = rdivu(d) * n; // Initialize via rdiv*. Assumes that rdivu is better than
// approximating X0 as = (48/17) - (32/17)*d, per Wikipedia
// I don't think we want a widening multiply; must check
//********************************************************************
// X := X + X × (1 - D' × X), done without fused multiply-adds :-(
// we want NON-WIDENING multiplied here, I believe.
t1 = d * x;
t2 = (1 << (half_our_width)) - t1; // How do we determine our width?
// The shift is a scaling of the constant one, // to match our implicit binary point
t3 = x * t2;
x = x + t3;
// Repeat above 4 calcs a TDB (and width-dependent!) number of times. //*********************************************************************
q = n * x;
q = q >> 1; // undo the "floating point style" scaling to be in the lower half word.
q = narrow(q); // force result back to same width as starting args
return q;
// OPTIONALLY calc and return remainder, INCLUDING scaling