Forum Replies Created

Viewing 15 posts - 1 through 15 (of 74 total)
  • LarryP
    Participant
    Post count: 78
    in reply to: news? #3481

    What’s a daxpy?

    one daxpy per cycle

    And when you write “cycle,” do you mean a clock cycle or a Mill instruction?

    Thanks,

  • LarryP
    Participant
    Post count: 78

    Eagerly awaiting!

  • LarryP
    Participant
    Post count: 78
    in reply to: Single register #2171

    Each morsel is big enough to hold a belt operand number, so (depending on member) morsels are 3/4/5/6 bits long

    A six-bit morsel suggests a Mill with a belt length of 64 (thus a longer belt than the gold model’s)! Is such a design in the works? What’s the anticipated market: high-performance/scientific computing?
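A quick sketch of the arithmetic behind that inference, assuming a morsel only has to name one belt position. The function name and the list of belt lengths are my own illustration, chosen to match the quoted 3/4/5/6 bits; they are not from Millcomputing.

```python
def morsel_bits(belt_length: int) -> int:
    """Bits needed to name any one of `belt_length` belt positions."""
    if belt_length < 2:
        raise ValueError("belt_length must be at least 2")
    # Positions are numbered 0 .. belt_length - 1, so size the field
    # for the largest position number.
    return (belt_length - 1).bit_length()


# Assumed belt lengths (illustrative only).
for belt_length in (8, 16, 32, 64):
    print(belt_length, morsel_bits(belt_length))
```

Under that assumption a six-bit morsel is only needed once the belt is longer than 32 positions.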

  • LarryP
    Participant
    Post count: 78
    in reply to: Prediction #2049

    Greetings all,

    For some perspective on the economic value of improved prediction, you might enjoy reading about a recent, large ($862 million US) patent-infringement award against Apple. Here’s a news article:
    http://arstechnica.com/tech-policy/2015/10/apple-faces-862m-patent-damage-claim-from-university-of-wisconsin/

    And the underlying patent (US5781752):
    https://www.google.com/patents/US5781752

    To me, this says that:

    * Improvements to prediction meaningfully improve performance of real CPUs.

    * Even incremental improvements in prediction are worth hundreds of millions of dollars!

    It will be most interesting to see how the Mill’s prediction (and anti-false-aliasing deferred load) mechanisms work in practice to improve single-thread performance.

  • LarryP
    Participant
    Post count: 78

    Oops; missed that. Apologies.

  • LarryP
    Participant
    Post count: 78

    Create/link compiler talk to a forum topic?

    Greetings all,

    Would whoever can do so please link the compiler talk to a (new, I presume) forum topic, so that we may comment on it and see other folks’ comments, all in the same place?

    Thanks,

    Larry

  • LarryP
    Participant
    Post count: 78
    in reply to: Execution #1752

    Veedrac,

    Thank you much!

    Larry

  • LarryP
    Participant
    Post count: 78
    in reply to: Execution #1750

    Greetings all,

    I’ve started a Wiki page for user-contributed division suggestions, algorithms, pseudocode — and eventually genAsm code, at:

    http://millcomputing.com/wiki/Division:_Algorithms_and_user-submitted_genAsm_code

    I don’t see a “code” formatting button on the Wiki, as I do here in the forums.

    Would somebody who knows the Wiki look into adding a “code” formatting button on the Wiki?

  • LarryP
    Participant
    Post count: 78
    in reply to: Execution #1746

    Ivan,

    If you folks haven’t defined rdivu, may I please have permission to speculate publicly on what rdivu might calculate? Or would you prefer that I email my suggestion to you first, for your approval?

    I really doubt that the rudiments of numerical analysis, Taylor series and the Newton-Raphson algorithm{1} would have anything patentable, but I’d like to help — and thus try my best not to cause Millcomputing any IP headaches.

    Unless I’m mis-remembering, the convergence of any Newton-Raphson-like algorithm will depend strongly on the accuracy of what rdivu produces. So there’ll probably be a trade-off between the hardware/speed cost of implementing rdivu and the precision — and thus how many iterations are needed to converge (for a given width of input args.)
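To make that trade-off concrete, here is a minimal sketch of the textbook Newton-Raphson reciprocal refinement, x_next = x * (2 - d * x), in floating point. It says nothing about what rdivu actually computes; the function names and the sample numbers are my own. The relative error squares on each step, so the number of correct bits roughly doubles, and a more accurate starting estimate saves iterations.

```python
from typing import List


def refine_reciprocal(d: float, x0: float, iterations: int) -> List[float]:
    """Return the relative error |1 - d*x| before and after each step."""
    x = x0
    errors = [abs(1.0 - d * x)]
    for _ in range(iterations):
        x = x * (2.0 - d * x)  # Newton-Raphson step for f(x) = 1/x - d
        errors.append(abs(1.0 - d * x))
    return errors


def iterations_needed(initial_bits: int, target_bits: int) -> int:
    """Steps required if each step doubles the number of correct bits."""
    steps = 0
    bits = initial_bits
    while bits < target_bits:
        bits *= 2
        steps += 1
    return steps


# Starting from 0.3 as an estimate of 1/3, the error squares each step.
print(refine_reciprocal(3.0, 0.3, 3))

# Idealized iteration counts for a 32-bit result from 4- and 8-bit estimates.
print(iterations_needed(4, 32), iterations_needed(8, 32))
```

The doubling model is idealized: it ignores rounding in fixed-width integer arithmetic, which a real emulation would have to account for.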

    Is there an existing forum category you’d like to see this under?
    Actually, my instinct is to put it on the Wiki (once I have something coherent), under community-contributed software. That way we have history and multiple people can try their hand at emulating division.

    —-

    {1} According to a lecture I heard some years ago at NIST (wish I could recall the speaker’s name), Newton’s original approximation method, in his Principia Mathematica, is rather different from the modern Newton-Raphson root-finding algorithm. (But Raphson still got second billing. So it goes….)

  • LarryP
    Participant
    Post count: 78
    in reply to: Execution #1744

    Ivan et al. at Millcomputing,

    No promises, but I can take a stab at emulating integer division, probably unsigned to start out. In order to do that, I’ll need a clear definition of exactly what the rdivu operation calculates, including width(s) and any constant of proportionality.

    I find examples very helpful to understand key details. For simplicity, let’s start with uint8 as the size of the input numerator and denominator. Would you kindly post what rdivu(x) returns for, respectively: 0, 1, 63, 64, 127, 128, and 255?

    (I’ve got an idea what I’d like rdiv to return, but I don’t want to mis-assume — nor do I know ALUish hardware well enough to guess what’s fast and cheap to implement as a native op.) Far better to start with what rdivu really does.

    Also, what should integer division (OK divrem) on the Mill return for quotient and remainder when given a zero divisor? NaRs?
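For concreteness, here is the general shape of what I have in mind for uint8, as a sketch only. Since rdivu has not been defined publicly, `reciprocal_estimate` below is a hypothetical stand-in that returns a scaled reciprocal which never overestimates; the 16-bit scale, the function names and the zero-divisor behavior are all my own assumptions.

```python
from typing import Tuple

SHIFT = 16  # fixed-point scale; an arbitrary choice for this sketch


def reciprocal_estimate(d: int) -> int:
    """Hypothetical stand-in for a hardware reciprocal op.

    Returns floor(2**SHIFT / d). This is NOT the Mill's rdivu; it only
    models an op whose scaled reciprocal never overestimates.
    """
    return (1 << SHIFT) // d


def divrem_u8(n: int, d: int) -> Tuple[int, int]:
    """Unsigned 8-bit quotient and remainder via multiply plus fix-up."""
    if not (0 <= n <= 255 and 0 <= d <= 255):
        raise ValueError("operands must fit in uint8")
    if d == 0:
        # What the Mill should return here is an open question;
        # this sketch simply raises.
        raise ZeroDivisionError("zero divisor")
    q = (n * reciprocal_estimate(d)) >> SHIFT  # may undershoot the quotient
    r = n - q * d
    while r >= d:  # fix-up step for an undershooting estimate
        q += 1
        r -= d
    return q, r


print(divrem_u8(255, 7))
```

With a less precise estimate the fix-up loop would run more often, which is where the definition of rdivu matters.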

    Thanks,

  • LarryP
    Participant
    Post count: 78
    in reply to: Execution #1736

    Effective IPC during divide emulation?

    What happens to a Mill’s effective IPC while it’s emulating a divide operation?
    An iterated-in-software approach to divide (and its relatives) would appear to require, in the general case, multiple cycles, with the number of cycles unknown at compile time. If a divide emulation takes N cycles on average and nothing else can be scheduled alongside it, because the number of iterations is unknown at compile time, then the effective ops/clock would seem to drop to something proportional to 1/N, unless there’s a way to use the rest of the Mill’s width (or at least a good portion of it) during the emulated divide. If no other ops can be scheduled during the emulation, that sounds like a substantial performance hit!
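A back-of-the-envelope model of that worry, under the pessimistic assumption that nothing overlaps with an emulated divide. Every number here (width, cycle counts, op mix) is invented for illustration and is not a measurement of any Mill member.

```python
def effective_ipc(other_ops: int, width: int,
                  divides: int, divide_cycles: int) -> float:
    """Ops per cycle if divides block all other issue.

    Assumes non-divide ops issue at a full `width` per cycle, each divide
    counts as one op, and each divide occupies `divide_cycles` cycles
    during which nothing else is scheduled.
    """
    cycles = other_ops / width + divides * divide_cycles
    return (other_ops + divides) / cycles


# A lone divide taking 20 cycles: IPC is 1/N.
print(effective_ipc(other_ops=0, width=8, divides=1, divide_cycles=20))

# 800 other ops at width 8, plus 10 divides of 20 cycles each.
print(effective_ipc(other_ops=800, width=8, divides=10, divide_cycles=20))
```

In the second case the ten divides account for two thirds of the cycles, which is why being able to overlap other work with the emulation matters so much.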

    I won’t speculate publicly (I hate that constraint, but understand the need!), but I’m very curious about:

    a) How can the Mill CPU and toolchain get anything else useful done during an emulated divide?

    b) What does the Mill’s approach to divide do to its effective IPC in codes that need to do a substantial number of division or remainder ops?

    Apologies if this is a sensitive issue, but eventually people are going to want to know. (On the off chance that you’re not satisfied with your solution, I’d be happy to share my speculations via email.)

    As always, thanks in advance for anything you can share publicly.

  • LarryP
    Participant
    Post count: 78

    Re long constants,
    Looking at the variations for the con operation (one variant of which takes five operands!!), I’d bet a pitcher of good beer that flow-side ganging is involved.

  • LarryP
    Participant
    Post count: 78

    @david, re 68-bit literals,

    Quite possibly. If you look at the flow-side decoding, outlined in:
    http://millcomputing.com/wiki/Decode#Flow_Stream, you’ll see that it’s very different from the exu side encode/decode that Ivan described in the encoding talk.

    Flow-side has both manifests (up to 32 bits) and extensions (up to 3x morsel width; morsel width is member dependent, 5 bits on gold), which suggests that each const operation could drop a 32-bit (or on gold potentially 32 + 3*5 = 47-bit) constant. Mill members with enough flow-side slots can thus drop a whole lot of bits of constant per instruction.

    What I’m less clear on — and hope to learn — is how the results of multiple const operations can be combined into a 64-bit (or evidently slightly longer) constant, not only in the same instruction, but in time to be used in opPhase, for the (ganged subtract/less-than) comparison in Ivan’s single-instruction example above. Ivan’s rather busy today, I’d imagine! 😉
    But maybe one of the other Millcomputing gurus will answer in the near future.

  • LarryP
    Participant
    Post count: 78

    Please, please post the slides as soon after the talk as possible! I know video editing takes time, but I hate to wait, and the slides have been first rate.

  • LarryP
    Participant
    Post count: 78

    @ivan,
    Your explanation of rescue’s need for a reach-back of twice the belt length clarifies a cryptic (to me, at least) comment you made (in the belt lecture, I think) about needing (operand/metadata) buffers for up to twice the belt length.
