Forum Replies Created

Viewing 3 posts - 1 through 3 (of 3 total)
  • peceed
    Participant
    Post count: 3

    nihil novi
    I had exactly the same idea; my obvious case was quicksort, which switches to a more regular quadratic algorithm for short arrays.
    That way I rediscovered Loop Termination Prediction, the Count Register, the Counted Loop Branch, etc.
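    The hybrid the post refers to can be sketched as follows; this is a minimal illustration, and the cutoff of 16 elements is an arbitrary assumption, not a tuned value:

    ```c
    #include <stdio.h>

    /* Hybrid quicksort: partitions below CUTOFF elements are handed to
     * insertion sort, whose short counted inner loops are exactly the kind
     * the loop-prediction discussion is about. CUTOFF = 16 is illustrative. */
    #define CUTOFF 16

    static void insertion_sort(int *a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int key = a[i], j = i - 1;
            while (j >= lo && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    static void quicksort(int *a, int lo, int hi) {
        if (hi - lo + 1 <= CUTOFF) {          /* short array: go quadratic */
            insertion_sort(a, lo, hi);
            return;
        }
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {                      /* Hoare partition */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        quicksort(a, lo, j);
        quicksort(a, i, hi);
    }

    int main(void) {
        int a[40];
        for (int i = 0; i < 40; i++)
            a[i] = (i * 37) % 41;             /* pseudo-shuffled input */
        quicksort(a, 0, 39);
        for (int i = 1; i < 40; i++)
            if (a[i - 1] > a[i]) { puts("NOT SORTED"); return 1; }
        puts("sorted");
        return 0;
    }
    ```

    Once the recursion bottoms out in insertion sort, most of the dynamic branch traffic comes from those short, fixed-trip-count inner loops.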

    Prediction offers little value in lengthy loops

    Yes, but there is a surprisingly high number of relatively short loops.
    And don’t forget that vectorization can make loops shorter by up to a factor of 32!
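    A back-of-the-envelope check of that factor-of-32 figure, assuming a 32-byte vector register over 1-byte elements (the best case the claim implies):

    ```c
    #include <stdio.h>

    int main(void) {
        /* Assumption for illustration: 32-byte vectors, 1-byte elements. */
        const int n = 1000;           /* elements to process              */
        const int vec_bytes = 32;     /* vector width in bytes            */
        const int elem_bytes = 1;     /* byte elements: the best case     */
        int lanes = vec_bytes / elem_bytes;

        int scalar_trips = n;                        /* 1 element per trip   */
        int vector_trips = (n + lanes - 1) / lanes;  /* 'lanes' per trip     */

        printf("scalar: %d iterations\n", scalar_trips);
        printf("vector: %d iterations\n", vector_trips);
        return 0;
    }
    ```

    A 1000-trip scalar loop becomes a 32-trip vector loop, i.e. exactly the kind of "relatively short loop" where exit prediction starts to matter.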

    When we execute the branch and discover that the prediction was wrong, or if we take an exit that wasn’t predicted, the hardware has already issued two cycles down the wrong path. Miss recovery has to unwind that, fetch the right target, and decode it, which takes 5 cycles if the target is in cache.

    Five cycles on the Mill are like 15 cycles on a conventional processor – quite a big loss!

    In essence you propose a deferred branch whose transfer is detached from the evaluation of the predicate, similar to the way Mill separates load issue from load retire.

    IIRC the Elbrus E2k uses this technique: it has Prepare Branch and Execute Branch instructions.

    One is the time required to reset the fetch. If the target is not in the I0 microcache then reset would take roughly as long as mispredict recovery, i.e. five cycles in our test configs.

    It looks like you should split speculative fetch from speculative code prefetch.

    How often can we eval a predicate five cycles before the transfer? Not often I’d guess, but I have no numbers.

    But you can start resetting the fetch earlier! Prepare Branch and the instructions that follow it before Execute Branch are still valid…

    A semantic issue is how the DB interacts with other branches. Suppose a regular branch is executed between DB issue/eval and transfer? Who wins?

    It doesn’t matter as long as it is consistent! We could treat it the same way as delay slots, so the most “powerful” answer is: both of them. Unfortunately, I suppose that won’t work with the Belt.

    • This reply was modified 1 year, 3 months ago by  peceed.
    • peceed
      Participant
      Post count: 3

      One is the time required to reset the fetch. If the target is not in the I0 microcache then reset would take roughly as long as mispredict recovery, i.e. five cycles in our test configs. Even an I0 hit would likely need three cycles to reset.

      It is hard to believe that a reset at such an early stage of instruction execution (no register/belt/memory operations yet) is anything other than a stall of the subsequent “correct” stages. So there should be no lower limit on its duration when the information is available earlier: just issue the correct instruction and wait for its “execution front”.

  • peceed
    Participant
    Post count: 3
    in reply to: news? #3687

    There are tuning issues in the smaller members too, Tin and Copper. There the issue is belt size. Even an 8-position belt is enough for the tests’ transient data, but the codes also have long-lived data, and the working sets of that data don’t fit on the smaller belts, and some not even on the 16-position belt of a Silver. As a result the working sets get spilled to scratch and the core essentially runs out of scratch, with tons of fill ops; much like codes on the original 8086 or DG Nova. This is especially noticeable on FP codes like the numerics library, for which the working set is full of big FP constants. Working out of scratch doesn’t impact small-member performance much, but it has scratch bandwidth consequences, and power consequences too that we can only guess at. We may need to config the smaller members with bigger belts, but that too has consequences. In other words, the usual tuning tradeoffs.

    I think you could use a small register set that is a logical extension of the belt, addressed by an additional bit in the argument address.
    The encoding cost is acceptable, and it solves the problem of “frequently used arguments”.
    It can be entropy-optimized by restricting register arguments to one per operation, or by limiting the number of functional units that can take register arguments. Smaller members can use more bits for the register specifier.
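    The proposed encoding can be sketched like this; all field widths here are assumptions for illustration (a 5-bit operand specifier, a 16-position belt, and a 16-entry register file), not actual Mill encoding:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Sketch of the proposed operand encoding: the top bit of a 5-bit
     * operand specifier selects between a belt position and a register
     * in a small file that logically extends the belt. */

    #define REG_BIT  0x10u  /* bit 4: 0 = belt position, 1 = register */
    #define IDX_MASK 0x0Fu  /* bits 0-3: position/register number     */

    static unsigned encode(int is_reg, unsigned idx) {
        assert(idx <= IDX_MASK);
        return (is_reg ? REG_BIT : 0u) | idx;
    }

    static void decode(unsigned spec, int *is_reg, unsigned *idx) {
        *is_reg = (spec & REG_BIT) != 0;
        *idx    = spec & IDX_MASK;
    }

    int main(void) {
        int is_reg; unsigned idx;

        decode(encode(0, 7), &is_reg, &idx);   /* belt position 7 */
        printf("%s %u\n", is_reg ? "reg" : "belt", idx);

        decode(encode(1, 3), &is_reg, &idx);   /* register 3 */
        printf("%s %u\n", is_reg ? "reg" : "belt", idx);
        return 0;
    }
    ```

    With the extra bit the specifier grows from 4 to 5 bits, which is the "acceptable encoding cost" referred to above; restricting how many operands per operation may set REG_BIT is one way to keep the entropy down.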

    The C++ library is coming up because we are doing the OS kernel in C++

    I am under the strong impression that you are trying to innovate too much at once.
    Your initial goal should be a “software stack accelerator”: a processor that needs minimal OS modifications and is fully compatible with existing applications (Linux/Java/Android).
    Forget the single address space: it doesn’t save a lot of power (the TLB uses ~15% IIRC), but it is the biggest blocker to quick adoption. You could easily make it optional.
    You can win the market by offering “only” double the performance-to-power and performance-to-cost ratios, as long as you are software-compatible/sane. “Datacenters and smartphones” are sensitive enough to a 2-3x power advantage, but they are not able to rewrite their software!
    Time is running out – the volume of computation is moving into the visual/AI domain.
