Division: Algorithms and user-submitted genAsm code

From Mill Computing Wiki
Latest revision as of 04:11, 16 August 2015

Integer Division on the Mill: User-contributions.

This page is for collecting ideas, pseudo-code and hopefully genAsm emulation sequences to do (integer) division on the Mill. (The floating-point division emulation code is apparently either done or well along.)

As in the parent page, please do NOT speculate on the Mill, per se. But alternative algorithms are most welcome. So, please feel free to add links to different ways to implement division.

N-R is shorthand for the Newton-Raphson method for finding the roots of a function f(x), given an initial guess for x and a way to compute the derivative, df/dx. Wikipedia's description of the method is at [http://en.wikipedia.org/w/index.php?title=Newton%27s_method]
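For readers new to the method, here is a minimal sketch of the generic N-R update in C, applied to f(x) = x² − 2 (whose positive root is √2). The function name and the choice of f are illustrative only; nothing here is Mill-specific.

```c
#include <assert.h>

/* Generic Newton-Raphson: improve a guess x for a root of f via
   x := x - f(x)/f'(x).  Demonstrated on f(x) = x*x - 2, f'(x) = 2x. */
double nr_sqrt2(double x0, int iters) {
    double x = x0;
    for (int i = 0; i < iters; i++) {
        double f  = x * x - 2.0;   /* function value at current guess */
        double df = 2.0 * x;       /* derivative at current guess */
        x = x - f / df;            /* the N-R update */
    }
    return x;                      /* converges quadratically to sqrt(2) */
}
```

The quadratic convergence (error roughly squares each step) is why a good initial estimate, such as one from rdiv*, cuts the iteration count.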


Assumptions:

The Mill's rdiv* operator family (e.g. rdivu [http://millcomputing.com/wiki/Instruction_Set/rdivu] and rdivs) provides an approximation to the reciprocal of its input argument. Scaling and precision are still TBD. Note that the Wikipedia/Newton-Raphson algorithm, as given, does not use any special instructions to estimate reciprocals, since most CPUs don't have instructions/operations to estimate 1/x, given x. Since the Mill does have such an operation, the hope is that we can use it to converge some flavor of iterative division in fewer operations than the conventional N-R algorithm.

Having rdiv as a (fairly fast, I hope) Mill operation suggests a divide algorithm, in which we use rdiv(trialQuotient_i) * trialRemainder_i as a correction to trialQuotient_i, to get a new starting value, trialQuotient_i+1, for the next iteration. (Terje was already thinking along these lines, before I did.) What's not yet clear to me is whether there's better convergence to be had, either in combination with classic N-R or instead of N-R. (And whether it'll be faster than textbook N-R.)
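One reading of that refinement idea, sketched in C under loudly-labeled assumptions: `recip_q16()` is a stand-in for rdivu (faked here with a hardware divide, since rdiv*'s real scaling and precision are TBD), it applies the reciprocal estimate to the divisor, and Q16 fixed point is an arbitrary demo choice.

```c
#include <stdint.h>

/* Quotient refinement via a reciprocal estimate.  recip_q16() stands in
   for the Mill's rdivu; because its estimate is truncated (never too
   large), the trial quotient only ever climbs toward the true one. */
static uint32_t recip_q16(uint32_t d) {
    return (1u << 16) / d;          /* ~ 2^16/d; a crude stand-in estimate */
}

uint32_t divu_refine(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t r16 = recip_q16(d);
    uint32_t q = (uint32_t)(((uint64_t)n * r16) >> 16);  /* trial quotient */
    uint64_t r = n - (uint64_t)q * d;                    /* trial remainder */
    while (r >= d) {
        uint64_t step = ((uint64_t)r * r16) >> 16;       /* ~ r/d, truncated */
        q += (uint32_t)(step ? step : 1);                /* always progress */
        r = n - (uint64_t)q * d;
    }
    *rem = (uint32_t)r;
    return q;
}
```

How fast this converges depends entirely on the quality of the reciprocal estimate, which is exactly the open question about rdiv* above.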

We can use integer operations (other than division), such as add, subtract, multiply, shift and count leading (and/or trailing) zeros.

Ivan says to unroll any loops -- that is, simply do the required number of iterations to converge. Unrolled code, while less elegant, should permit the specializer to schedule operations for other data flows in parallel with divisions.

It's not yet clear to me what matters most in coding for speed. There are some special cases where we could return results faster than the general case, but those cases are usually data-dependent, so the specializer cannot usually handle them at specialization time. (I'm assuming that LLVM will have already found and handled the special cases where the compiler can avoid division, e.g. by shifting when the divisor is a constant and a power of two.)

The kicker is that trying to short-circuit those special cases (that LLVM cannot) means extra operations (to detect at runtime whether we have one of the special cases), and those extra operations cost cycles on all division ops. So it's not yet clear to me whether we want to code to minimize division's worst-case latency vs. code for the fastest division on a statistical basis. One of the things that makes this tricky to understand is that, on a wide Mill, we can usually do those checks fairly cheaply, whereas on a low-end Mill we often won't have "spare" ALU ops that can be calculating and checking predicates. This same issue will also matter to whether doing "are we done yet?" tests will help or hurt performance.

??What can we assume about the speeds of integer operations??

If short (e.g. shorter than 32-bit) integer math is no faster than 32-bit, then it will probably make sense to promote shorter divisions to 32 bits. However, we have no examples yet of width-aware genAsm code, that I'm aware of. (Hint, hint.)

I (LarryP) suspect that 128-bit division will be trickier than the shorter widths, since we have no support for 256-bit intermediates.
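The missing double-width intermediates can, in principle, be synthesized from half-width pieces. Here is the schoolbook scheme at a smaller scale that plain C can express: the high half of a 64×64-bit product built from 32-bit halves. The same decomposition would let 128-bit division build its 256-bit intermediates out of 128-bit multiplies, at the cost of several extra ops per product.

```c
#include <stdint.h>

/* High half of a 64x64 -> 128 product using only 64-bit arithmetic,
   by splitting each operand into 32-bit halves (schoolbook method). */
uint64_t umulhi64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;   /* four partial products */
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* carry propagating out of the middle column */
    uint64_t mid = (lo_lo >> 32) + (uint32_t)hi_lo + (uint32_t)lo_hi;
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}
```

This is a sketch of the technique, not a claim about what the specializer would actually emit for 128-bit operands.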


Links: Wikipedia article on division algorithms: [http://en.wikipedia.org/wiki/Division_algorithm]

Wikipedia subsection on fast division algorithms: [http://en.wikipedia.org/wiki/Division_algorithm#Fast_division_methods]

Note that the above link seems to make an implicit assumption that floating-point math will be used. That's almost certainly not the case on the Mill, since we need to emulate division on family members that lack native floating point. Similarly, the pre-scaling outlined is optimized for (normalized) floating point values. So -- among other things -- the suggested constants may well not be correct (let alone optimal) for integer math to emulate division.


LarryP, 21April2015: I originally thought that doing Newton-Raphson root finding would run fastest if we had both a reciprocal approximation as well as a slope of the reciprocal function. However, due to properties of the "reciprofying" hyperbola, it appears (see the Wikipedia subsection linked above on fast division) that all we need is a "good," suitably-scaled reciprocal approximation, namely rdiv*. How good an approximation to 1/x we need (vs. the speed of the emulated division) is an open question right now, so far as I know.

On the interval x ∈ [1, 2^n − 1], the reciprocal itself, 1/x, is always positive, and the slope is always negative.
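The reason no separate slope estimate is needed falls out of the algebra: applying N-R to f(X) = 1/X − D (whose root is X = 1/D), the X² contributed by the derivative cancels the 1/X, leaving only multiplies and a subtract.

```latex
f(X) = \frac{1}{X} - D, \qquad f'(X) = -\frac{1}{X^{2}}
```

```latex
X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}
        = X_i + X_i^{2}\!\left(\frac{1}{X_i} - D\right)
        = X_i \,(2 - D X_i)
```

No division by f'(X), and no evaluation of 1/X itself, appears anywhere in the update.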



Some rough pseudocode is now in:

Larry's PseudoCode for Emulating Division
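That pseudocode isn't reproduced on this page, so here is a hedged C sketch of the Wikipedia-style N-R divide it follows: widen (16-bit args into 32 bits of working room), normalize the divisor into [0.5, 1) in Q16 fixed point via count-leading-zeros, seed with Wikipedia's 48/17 − 32/17·D estimate (a real Mill version would presumably seed from rdivu instead), and iterate X := X·(2 − D·X). Widths, scalings, and the iteration count are demo assumptions, not Mill-specified values.

```c
#include <stdint.h>
#include <assert.h>

/* N-R division sketch for 16-bit unsigned operands, Q16 fixed point.
   The final compare-and-adjust mops up fixed-point truncation error. */
uint16_t divu16_nr(uint16_t n, uint16_t d, uint16_t *rem) {
    assert(d != 0);                        /* the real code returns NaR */
    int lzd = __builtin_clz(d) - 16;       /* leading zeros within 16 bits */
    uint32_t D = (uint32_t)d << lzd;       /* Q16, in [0.5, 1) */

    /* seed: 48/17 - (32/17)*D, per the Wikipedia fast-division article */
    uint32_t X = (48u << 16) / 17 - (32 * D) / 17;
    for (int i = 0; i < 3; i++) {          /* unrolled; 3 steps > 16 bits */
        uint32_t DX = (uint32_t)(((uint64_t)D * X) >> 16);
        X = (uint32_t)(((uint64_t)X * (2 * 65536 - DX)) >> 16);
    }

    /* un-scale: X ~ 2^32 / (d << lzd), so n*X >> (32 - lzd) ~ n/d */
    uint32_t q = (uint32_t)(((uint64_t)n * X) >> (32 - lzd));
    int32_t  r = (int32_t)n - (int32_t)(q * d);
    while (r < 0)  { q--; r += d; }        /* off-by-one fixups from */
    while (r >= d) { q++; r -= d; }        /* truncated fixed point  */
    *rem = (uint16_t)r;
    return (uint16_t)q;
}
```

The fixup loops at the end illustrate the earlier point about special-case checks: they cost a multiply and compares on every division, in exchange for an exact result from an approximate reciprocal.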

Q: How do we get the Wiki to do sane CODE formatting??

A: Bracket the code with <pre> and </pre> tags.


Timing and scheduling implications:

Mill operations not implemented as subroutines retire their results after a delay known at specialization time, independent of data width(?). Division algorithms need a number of iterations that varies with data width, which complicates interleaving unrolled loops with other operations. Data width is often not knowable at specialization time.


Goldschmidt division

--Mkbosmans (talk) 15:11, 27 April 2015 (UTC):
In the Wikipedia article on division algorithms [http://en.wikipedia.org/wiki/Division_algorithm] there is an alternative to NR presented, Goldschmidt division. In my limited testing of a fixed-point version of the algorithm, it appeared to require 1 or 2 more iterations than NR to converge to the correct value. However, it has the following advantages:

  • Only 2 mul and 1 add instead of 2 mul, 2 add per iteration.
  • A total of 2*N_iter - 1 multiplies needed, instead of 2*N_iter + 2 for NR.
  • Only 1 (or possibly neither, I'm not sure) of the 2 fixed point multiplies needs to be at double width.
  • The 2 muls for an iteration can be scheduled together, where the 2 NR muls have a data dependency between them.
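A fixed-point sketch of Goldschmidt division as described above, for 16-bit operands: scale numerator and divisor together so D lands in [0.5, 1), then repeatedly multiply both by the correction F = 2 − D; as D converges to 1, N converges to the quotient. The Q16 scaling and 4 iterations are demo assumptions, not Mkbosmans's tested parameters.

```c
#include <stdint.h>
#include <assert.h>

/* Goldschmidt division sketch, Q16 fixed point, 16-bit operands. */
uint16_t divu16_gold(uint16_t n, uint16_t d, uint16_t *rem) {
    assert(d != 0);
    int lzd = __builtin_clz(d) - 16;       /* leading zeros within 16 bits */
    uint64_t D = (uint64_t)d << lzd;       /* Q16, in [0.5, 1) */
    uint64_t N = (uint64_t)n << lzd;       /* scaled identically: N/D = n/d */

    for (int i = 0; i < 4; i++) {
        uint64_t F = (2ull << 16) - D;     /* F = 2 - D, Q16 */
        N = (N * F) >> 16;                 /* these two multiplies are     */
        D = (D * F) >> 16;                 /* independent of each other,   */
    }                                      /* so they can issue together   */

    uint32_t q = (uint32_t)(N >> 16);      /* N -> n/d as D -> 1 */
    int32_t  r = (int32_t)n - (int32_t)(q * d);
    while (r < 0)  { q--; r += d; }        /* truncation fixups */
    while (r >= d) { q++; r -= d; }
    *rem = (uint16_t)r;
    return (uint16_t)q;
}
```

The per-iteration shape matches the bullets above: the N·F and D·F multiplies have no data dependency between them, unlike the chained multiplies inside an N-R step, so a wide machine can schedule them in the same instruction.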

There is a good paper on software integer division by someone at Microsoft Research [http://research.microsoft.com/pubs/70645/tr-2008-141.pdf]