Forum Replies Created

  • bhurt
    Participant
    Post count: 5
    in reply to: Specification #1127

    I think this is the wrong venue in which to bring this up, but it’s the most appropriate venue I can think of. In other venues, you’ve asked what instructions you could add to low-end Mills to make writing a software floating-point library easier. I figured I’d chime in with my thoughts on the matter, for what they’re worth.

    The biggest problem, and the most error-prone aspect, of writing a software floating-point library is unpacking and repacking the floating-point numbers themselves and dealing with the edge cases (denormalized numbers, infinities, NaNs). Unpacking a floating-point number into its components is a longish series of shifts, ands, and adds. The idea is to have just two instructions. One takes a floating-point number and unpacks it, producing the mantissa, exponent, and type (is it a NaN, an infinity, etc.?). As a side note, the belt is really nice for designing instructions. 🙂 The other takes a mantissa, exponent, and type and reconstitutes the original floating-point number.
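
    To make the cost concrete, here is roughly what the unpack side looks like today in portable C. This is a sketch only- it assumes IEEE-754 binary64, and the names and type tags are my own invention, not anything from the Mill spec:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative type tags- not from any actual Mill spec. */
    enum fp_kind { FP_ZERO, FP_NORMAL, FP_SUBNORMAL, FP_INFINITY, FP_NAN };

    /* The "longish series of shifts, ands, and adds" that a hardware
     * unpack instruction would replace. */
    enum fp_kind fp_unpack(double d, int *sign, int *exp, uint64_t *mant) {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);     /* type-pun without UB */
        *sign = (int)(bits >> 63);
        *exp  = (int)((bits >> 52) & 0x7FF);
        *mant = bits & 0xFFFFFFFFFFFFFull;  /* low 52 bits */
        if (*exp == 0x7FF)
            return *mant ? FP_NAN : FP_INFINITY;
        if (*exp == 0)
            return *mant ? FP_SUBNORMAL : FP_ZERO;
        *mant |= 1ull << 52;                /* restore the implicit 1 */
        *exp  -= 1023;                      /* remove the bias */
        return FP_NORMAL;
    }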

    Note that these instructions are useful in other circumstances, not just in software FP libraries, but in functions like frexp, ldexp, and isNaN. They would actually be useful instructions to have even on a Mill Platinum architecture, aimed at supercomputers and given boatloads of floating-point units. Their utility in writing a software FP library should be obvious.

    Two variants might warrant consideration. One is that instead of a single type value, the unpack drops multiple flags on the belt- isNaN, isInfinity, etc. The advantage of this is that with a single instruction you could unpack a floating-point number and multiway branch on the edge cases to special handling code (“Oh, it’s a NaN- in that case, jump here.”). The other is to drop the mantissa as two values instead of one. If you don’t have a 64-bit x 64-bit -> 128-bit multiply (likely on a low-end Mill), this makes it easier to do the obvious four 32x32->64-bit multiplies, sketched below. The problem with both of these variants is that they drop a lot more values on the belt- if you only have 8 belt positions, having a single operation drop 6 or more values can be a problem.
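
    For reference, here is the four-multiply decomposition I mean, again as a portable C sketch assuming only 32x32->64 multiplies are available:

    #include <stdint.h>

    /* 64x64 -> 128-bit multiply built from the four obvious
     * 32x32 -> 64-bit multiplies, the way a software FP library on a
     * low-end Mill would have to do it. */
    void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t p0 = a_lo * b_lo;          /* low  x low  */
        uint64_t p1 = a_lo * b_hi;          /* low  x high */
        uint64_t p2 = a_hi * b_lo;          /* high x low  */
        uint64_t p3 = a_hi * b_hi;          /* high x high */

        uint64_t mid = p1 + (p0 >> 32);     /* cannot overflow 64 bits */
        uint64_t carry = 0;
        mid += p2;
        if (mid < p2)                       /* wrapped: carry into bit 96 */
            carry = 1ull << 32;

        *lo = (mid << 32) | (uint32_t)p0;
        *hi = p3 + (mid >> 32) + carry;
    }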

    Hope this helps.

  • bhurt
    Participant
    Post count: 5
    in reply to: Metadata #557

    I’d like to request you guys add an operation (if you haven’t already).
    I want to explain the operation, and then explain why I think it’d be
    useful. The operation is smear-sum. It takes a vector of small (8-bit,
    maybe 16-bit) integers and produces a running sum, and in addition
    produces a second value which is the final sum. So, given the vector
    [ 1, 3, 10, 7, 4, 0, 2, 9 ], it produces the vector [ 1, 4, 14, 21, 25,
    25, 27, 36 ] and the value 36. An alternative would be smear-sumx,
    which produces the sum excluding the current value, so the vector
    [ 1, 3, 10, 7, 4, 0, 2, 9 ] would return the vector
    [ 0, 1, 4, 14, 21, 25, 25, 27 ] and the value 36. NaRs and Nones are
    propagated, but count as 0 for the purposes of the sum, so the vector
    [ 1, 3, 10, NaR, None, 0, 2, 9 ] returns the vector
    [ 1, 4, 14, NaR, None, 14, 16, 25 ] and the value 25.
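
    To pin the semantics down, here is a scalar C model of smear-sum. NaR
    and None have no C equivalent, so the model carries the metadata in a
    separate flag array- illustrative only, of course:

    #include <stdint.h>

    #define VEC_LEN 8

    /* Scalar model of the proposed smear-sum. Belt NaR/None metadata is
     * modeled as a parallel invalid[] array: invalid lanes propagate to
     * the output but count as 0 in the running sum. */
    uint8_t smear_sum(const uint8_t in[VEC_LEN], const int invalid[VEC_LEN],
                      uint8_t out[VEC_LEN]) {
        uint8_t sum = 0;
        for (int i = 0; i < VEC_LEN; i++) {
            if (!invalid[i])
                sum += in[i];
            out[i] = sum;  /* an invalid lane carries its NaR/None instead */
        }
        return sum;        /* the second result: the final sum */
    }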

    Why this instruction would be useful: a lot of languages now use
    copying GC. One of the advantages of copying GC is that free space in
    the heap is always contiguous, so allocation is just bumping a
    pointer. The allocation routine then looks like this:

    #include <stddef.h>

    extern size_t * heap_top;
    extern size_t * heap_limit;
    extern void do_a_gc(void);

    size_t * alloc(size_t num_words) {
        register size_t * rval = heap_top;
        heap_top += num_words;
        if (heap_top > heap_limit) {
            /* This is a rarely taken branch- less than 1% of the time */
            do_a_gc();
            /* do_a_gc() resets heap_top and heap_limit so we can now
             * succeed at the allocation.
             */
            rval = heap_top;
            heap_top += num_words;
        }
        return rval;
    }

    Similar pointer-bump allocations also show up in languages without
    garbage collection- for example, when adding items to a vector. Where
    smear-sum comes into play is when you want to vectorize a loop that
    contains a pointer-bump allocation. You’d simply do a smear-sumx to
    calculate each iteration’s allocation offset from the current top
    pointer, then do a vector add of the offsets to the current top
    pointer to get the real addresses, and then off you go. A vector
    compare and a normal smear catch the (presumably rare) case where an
    allocation would trigger a GC (or vector resize).
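
    In the same scalar-model style, the vectorized allocation would look
    something like this, reusing heap_top, heap_limit, and the smear_sum
    model from above (each commented loop or test would be a single
    vector operation on the real machine):

    /* Vectorized pointer-bump allocation, modeled in scalar C. The
     * exclusive running sum (smear-sumx) gives each lane's offset from
     * heap_top; the final sum bumps heap_top once for all lanes. */
    int alloc_vector(const uint8_t sizes[VEC_LEN], size_t *ptrs[VEC_LEN]) {
        size_t offset = 0;
        for (int i = 0; i < VEC_LEN; i++) { /* smear-sumx + vector add */
            ptrs[i] = heap_top + offset;
            offset += sizes[i];
        }
        if (heap_top + offset > heap_limit) /* vector compare + smear */
            return 0;                       /* rare path: do a GC, retry */
        heap_top += offset;
        return 1;
    }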

    The problem with this instruction is, of course, the deep dependency
    chain between the adds. You can’t produce the nth value until you know
    the (n-1)th value. Even limiting it to small 8-bit adds, I doubt that
    this instruction would be a 1-clock “fast” instruction. It’ll probably
    be a 2- or 3-clock instruction. And that should be sufficient.

    You may be, likely are, well ahead of me here, and have already added
    the capability to do this. But just in case you haven’t, I thought I’d
    put the request in.

  • bhurt
    Participant
    Post count: 5
    in reply to: Many core mill (GPU) #464

    I don’t think the Mill would be significantly *better* than a GPU at what a GPU is good for, but with a goodly number of FP functional units, I think the Mill could be more or less *equal* to a GPU at what a GPU does. At least a mid-range GPU.

    Where this becomes interesting is in situations where you don’t want to spend the power and cost budget for a specialized GPU- think tablets, smart phones, and netbooks. Here, if you have a CPU that can do “triple duty”- the power/cost of an embedded CPU, desktop CPU performance on general-purpose workloads, *and* a decent GPU as needed- you’ve got something *very* interesting. NVidia might not be afraid of the Mill, but ARM should be freaking paranoid.

  • bhurt
    Participant
    Post count: 5
    in reply to: Metadata #560

    I will sheepishly admit that I hadn’t thought of the log(N) reduction idea. If you can implement this with some combination of shuffles and vectorized adds in, say, four clocks, then the need for this instruction goes away. Or even six clocks- and I’d be surprised if it took six. It’s simply not a big enough win anymore to be worth the instruction-encoding hit and the increased complexity of the CPU. I withdraw the suggestion. :-}
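
    For anyone following along, here is the log(N) idea I’m conceding to, as a scalar C model: three shuffle-and-add steps for eight lanes instead of a seven-deep serial add chain. Each pass of the outer loop below would be one vector shift/shuffle plus one vector add on the real machine:

    #include <stdint.h>

    #define VEC_LEN 8

    /* log2(N)-step inclusive prefix sum (a Hillis-Steele scan). For 8
     * lanes, 3 steps replace the 7-add serial chain of the proposed
     * smear-sum. Iterating i downward emulates all lanes updating in
     * parallel from the previous step's values. */
    void scan_log(uint8_t v[VEC_LEN]) {
        for (int shift = 1; shift < VEC_LEN; shift <<= 1)
            for (int i = VEC_LEN - 1; i >= shift; i--)
                v[i] += v[i - shift];
    }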

    In general, I think I would advocate for a simpler instruction set, all other things being equal. I’ve yet to see an instruction set get simpler as it ages. There is always time to add complexity later, if it’s needed. Once added, however, complexity never seems to get removed. And it has a bad habit of becoming an albatross hung around the neck of the instruction set, unused but impossible to get rid of (see, for example, AAD on the x86, or the VAX instruction set).

  • bhurt
    Participant
    Post count: 5
    in reply to: Many core mill (GPU) #475

    Reply to Ivan Godard #466

    Also, there’s a reason why graphics cards use different memory chips (gDDR) from regular CPUs. The biggest difference (as I understand it) is that gDDR chips have higher throughput but also higher latency.

    I think I would advocate the Mill having single-cycle FP add/compare, for one reason: Javascript. Javascript uses FP for its numbers, and in addition to being the browser-side language, it is increasingly used on the server side (for reasons passing my understanding). So lots of computations that would be integer ops in a C or Java program are done as FP computations in Javascript. Javascript compilers do, I think, convert some (many? most?) of these to integer ops, but high-speed simple FP computations would still be an enormous benefit.
