Forum Replies Created
- AuthorPosts
- in reply to: 2016 closeout thoughts #2741
Eagerly awaiting!
- in reply to: Single register #2171
Each morsel is big enough to hold a belt operand number, so (depending on member) morsels are 3/4/5/6 bits long
A six-bit morsel suggests a Mill with a belt length of 64! (Thus wider than the gold model.) Is such a design in the works? What’s the anticipated market, high-performance/scientific computing?
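To make the arithmetic behind that inference explicit, here is a minimal sketch. It assumes (my assumption, not something stated above) that a morsel names belt positions directly with no encodings reserved for other uses; `belt_positions` is a made-up helper name.

```python
def belt_positions(morsel_bits: int) -> int:
    """Distinct belt operand numbers a morsel of this width can name,
    assuming every encoding names a belt position (an assumption)."""
    return 1 << morsel_bits

# Morsel widths quoted above: 3/4/5/6 bits.
for bits in (3, 4, 5, 6):
    print(f"{bits}-bit morsel -> up to {belt_positions(bits)} belt positions")
```

Under that assumption a 6-bit morsel is the first width that can address a 64-entry belt.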
- in reply to: Prediction #2049
Greetings all,
For some perspective on the economic value of improved prediction, you might enjoy reading about a recent, large ($862 million US) patent-infringement award against Apple. Here’s a news article:
http://arstechnica.com/tech-policy/2015/10/apple-faces-862m-patent-damage-claim-from-university-of-wisconsin/
And the underlying patent (US5781752):
https://www.google.com/patents/US5781752
To me, this says that:
* Improvements to prediction meaningfully improve performance of real CPUs.
* Even incremental improvements in prediction are worth hundreds of millions of dollars!
It will be most interesting to see how the Mill’s prediction (and anti-false-aliasing deferred load) mechanisms work in practice to improve single-thread performance.
Oops; missed that. Apologies.
Create/link compiler talk to a forum topic?
Greetings all,
Would whoever can do so please link the compiler talk to a (new, I presume) forum topic, so that we may comment on it and see other folks’ comments, all in the same place?
Thanks,
Larry
Greetings all,
I’ve started a Wiki page for user-contributed division suggestions, algorithms, pseudocode — and eventually genAsm code, at:
http://millcomputing.com/wiki/Division:_Algorithms_and_user-submitted_genAsm_code
I don’t see a “code” formatting button on the Wiki, as I do here in the forums.
Would somebody who knows the Wiki look into adding a “code” formatting button on the Wiki?
Ivan,
If you folks haven’t defined rdivu, may I please have permission to suggest (that is, speculate publicly about) what rdivu might calculate? Or would you prefer that I email that to you first, for your approval?
I really doubt that the rudiments of numerical analysis, Taylor series and the Newton-Raphson algorithm{1} would have anything patentable, but I’d like to help — and thus try my best not to cause Millcomputing any IP headaches.
Unless I’m mis-remembering, the convergence of any Newton-Raphson-like algorithm will depend strongly on the accuracy of what rdivu produces. So there’ll probably be a trade-off between the hardware/speed cost of implementing rdivu and the precision — and thus how many iterations are needed to converge (for a given width of input args.)
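As a rough illustration of that trade-off, here is a sketch of the textbook Newton-Raphson reciprocal iteration, x <- x*(2 - d*x), in exact rational arithmetic. It is not the Mill's algorithm, and the function names are mine. In exact arithmetic the relative error squares on every step, so the number of correct bits roughly doubles; real fixed-point hardware would add truncation error on top.

```python
from fractions import Fraction

def newton_reciprocal(d: int, seed: Fraction, steps: int) -> Fraction:
    """Refine an approximation of 1/d with x <- x * (2 - d*x)."""
    x = Fraction(seed)
    for _ in range(steps):
        x = x * (2 - d * x)
    return x

def iterations_needed(seed_bits: int, target_bits: int) -> int:
    """Steps required if each step doubles the number of correct bits
    (idealized: ignores truncation in fixed-point arithmetic)."""
    steps, bits = 0, seed_bits
    while bits < target_bits:
        bits *= 2
        steps += 1
    return steps

# Relative error of the seed 1/4 for d = 3 is 1/4; it squares each step.
for steps in range(3):
    err = 1 - 3 * newton_reciprocal(3, Fraction(1, 4), steps)
    print(steps, err)

print(iterations_needed(8, 64), iterations_needed(4, 64))
```

In this idealized model, halving the seed's precision costs one extra iteration, which is the hardware-versus-iterations trade-off described above.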
Is there an existing forum category you’d like to see this under?
Actually, my instinct is to put it on the Wiki (once I have something coherent), under community-contributed software. That way we have history and multiple people can try their hand at emulating division.
---
{1} According to a lecture I heard some years ago at NIST (wish I could recall the speaker’s name), Newton’s original approximation method, in his Principia Mathematica is rather different from the modern Newton-Raphson root-finding algorithm. (But Raphson still got second billing. So it goes….)
Ivan et al. at Millcomputing,
No promises, but I can take a stab at emulating integer division, probably unsigned to start out. In order to do that, I’ll need a clear definition of exactly what the rdivu operation calculates, including width(s) and any constant of proportionality.
I find examples very helpful to understand key details. For simplicity, let’s start with uint8 as the size of the input numerator and denominator. Would you kindly post what rdivu(x) returns for, respectively: 0, 1, 63, 64, 127, 128, and 255?
(I’ve got an idea what I’d like rdiv to return, but I don’t want to mis-assume — nor do I know ALUish hardware well enough to guess what’s fast and cheap to implement as a native op.) Far better to start with what rdivu really does.
Also, what should integer division (OK divrem) on the Mill return for quotient and remainder when given a zero divisor? NaRs?
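To show the kind of emulation I have in mind, here is a sketch of uint8 divrem built on a reciprocal seed. Everything here is an assumption: `rdivu_seed` is a stand-in I made up (a power-of-two underestimate taken from the divisor's bit length), not what the real rdivu returns, and raising an exception on a zero divisor is only a placeholder for whatever the Mill actually does.

```python
FRAC_BITS = 16  # fixed-point scale: x stands for x / 2**16

def rdivu_seed(d: int) -> int:
    """HYPOTHETICAL seed, not the Mill's rdivu: a power-of-two
    underestimate of 2**16 / d based only on d's bit length."""
    return 1 << (FRAC_BITS - d.bit_length())

def divrem_u8(n: int, d: int, newton_steps: int = 3) -> tuple[int, int]:
    """Unsigned 8-bit quotient and remainder without a divide op."""
    if not (0 <= n <= 255 and 0 <= d <= 255):
        raise ValueError("operands must fit in uint8")
    if d == 0:
        # Placeholder only; the real zero-divisor result is the open question.
        raise ZeroDivisionError("zero divisor")
    x = rdivu_seed(d)
    for _ in range(newton_steps):
        # x <- x * (2 - d*x) in fixed point; truncation keeps x <= 2**16 / d
        x = (x * ((2 << FRAC_BITS) - d * x)) >> FRAC_BITS
    q = (n * x) >> FRAC_BITS  # never exceeds the true quotient
    r = n - q * d
    while r >= d:  # fix-up for the underestimate
        q += 1
        r -= d
    return q, r

print(divrem_u8(255, 7))
```

Because the reciprocal is always an underestimate, the final fix-up loop makes the result exact however coarse the seed is; a better seed just means fewer Newton steps and fewer fix-ups.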
Thanks,
Effective IPC during divide emulation?
What happens to a Mill’s effective IPC while it’s emulating a divide operation?
An iterated-in-software approach to divide (and relatives) would appear to require, in the general case, both multiple cycles and a number of cycles that is unknown at compile time. If a divide emulation takes on average N cycles and there’s no way to get anything else done (because the number of iterations is unknown at compile time), that situation seems to imply that the effective ops/clock would drop to something proportional to 1/N, unless there’s a way to make use of the rest of the Mill’s width (or at least a good portion thereof) during the emulated divide. If there’s no way to schedule other ops during the emulation, that sounds like it would be a substantial performance hit!
I won’t speculate publicly (I hate that constraint, but understand the need!), but I’m very curious about:
a) How can the Mill CPU and toolchain get anything else useful done during an emulated divide?
b) What does the Mill’s approach to divide do to its effective IPC in codes that need to do a substantial number of division or remainder ops?
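To put numbers on the worry, here is a toy model. It assumes the worst case described above, where nothing overlaps with an emulated divide, and counts each divide as one useful op. The function name, parameters, and sample figures are all invented for illustration and are not Mill measurements.

```python
def effective_ipc(other_ops: int, base_ipc: float,
                  divides: int, cycles_per_divide: float) -> float:
    """Ops per cycle when every emulated divide stalls all other work
    for cycles_per_divide cycles (a pessimistic toy assumption)."""
    cycles = other_ops / base_ipc + divides * cycles_per_divide
    return (other_ops + divides) / cycles

# Invented workload: 1000 other ops at IPC 4, with a growing number of
# divides that each take 20 cycles.
for divides in (0, 10, 50, 200):
    print(divides, round(effective_ipc(1000, 4.0, divides, 20.0), 2))
```

In the limit of a divide-only workload this model gives exactly 1/N ops per cycle, which is the proportionality I was guessing at.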
Apologies if this is a sensitive issue, but eventually people are going to want to know. (On the off chance that you’re not satisfied with your solution, I’d be happy to share my speculations via email.)
As always, thanks in advance for anything you can share publicly.
Re long constants,
Looking at the variations for the con operation (one variant of which takes five operands!!), I’d bet a pitcher of good beer that flow-side ganging is involved.
@david, re 68-bit literals,
Quite possibly. If you look at the flow-side decoding, outlined in:
http://millcomputing.com/wiki/Decode#Flow_Stream, you’ll see that it’s very different from the exu-side encode/decode that Ivan described in the encoding talk.
Flow-side has both Manifests (up to 32 bits) and extensions (up to 3x morsel width, member dependent, 5 bits on gold), which suggests that each const operation could drop a 32-bit (or on gold potentially 32 + 3*5 = 47 bit) constant. Mill members with enough flow-side slots can thus drop a whole lot of bits of constant per instruction.
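Working that bit count through as a quick check: this assumes my reading of the Wiki is right (one manifest of up to 32 bits plus up to three morsel-wide extensions per operation), and the function names are mine.

```python
def max_constant_bits(manifest_bits: int = 32, extensions: int = 3,
                      morsel_bits: int = 5) -> int:
    """Upper bound on constant bits one flow-side op could carry,
    under the assumed manifest-plus-extensions reading."""
    return manifest_bits + extensions * morsel_bits

def ops_for_constant(total_bits: int, bits_per_op: int) -> int:
    """Ops needed to cover total_bits (ceiling division)."""
    return -(-total_bits // bits_per_op)

gold = max_constant_bits()           # 5-bit morsels, as quoted for gold
print(gold, ops_for_constant(64, gold))
```

On that reading a single op falls short of 64 bits, so a full-width constant would need at least two ganged ops.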
What I’m less clear on — and hope to learn — is how the results of multiple const operations can be combined into a 64-bit (or evidently slightly longer) constant, not only in the same instruction, but in time to be used in opPhase, for the (ganged subtract/less-than) comparison in Ivan’s single-instruction example above. Ivan’s rather busy today, I’d imagine! 😉
But maybe one of the other Millcomputing gurus will answer in the near future.
- This reply was modified 9 years, 8 months ago by LarryP.
Please, please post the slides as soon after the talk as possible! I know video editing takes time, but I hate to wait, and the slides have been first rate.
- in reply to: Scratchpad design decision #3482
@ivan,
Your explanation of rescue’s need for a reach-back of twice the belt length clarifies a cryptic (to me, at least) comment you made (in the belt lecture, I think) about needing (operand/metadata) buffers for up to twice the belt length.