Forum Replies Created

  • Laurent_Birtz
    Member
    Post count: 10

    For the intra case.

    Get the content of this block.

    
    [XXXX]
    [XXXX]
    [XXXX]
    [XXXX]
    

    From a vector of pixels.
    [XXXXXXXX]

    
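    // Reference C for the projection: 'ref' holds the linearized neighbor
    // pixels, 'dst' receives the bs x bs block, and 'angle' is the per-row
    // displacement in 1/32-pixel units.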
    // Pass every row.
    int angle_sum = 0;
    for (int y = 0; y < bs; y++)
    {
        angle_sum += angle;
        int off = angle_sum>>5;
        int frac = angle_sum&31;
    
        // Interpolate.
        if (frac)
            for (int x = 0; x < bs; x++)
                dst[y*bs+x] = ((32-frac)*ref[off+x] + frac*ref[off+x+1] + 16)>>5;
    
        // Copy.
        else
            for (int x = 0; x < bs; x++)
                dst[y*bs+x] = ref[off+x];
    }
    

    For the SAD case.

    
    int sad = 0;
    for (int y = 0; y < height; y++, src0 += stride0, src1 += stride1)
        for (int x = 0; x < width; x++)
            sad += F265_ABS(src0[x] - src1[x]);
    return sad;
    

    For the DCT case.

    
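    // One 8-point 1-D DCT pass: it reads the rows of 'src' and writes the
    // columns of 'dst' (implicit transpose), so applying it twice below
    // yields the full 2-D transform.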
    void fenc_dct_8_1d(int16_t *dst, int16_t *src, int shift)
    {
        int add = 1 << (shift - 1);
    
        for (int i = 0; i < 8; i++, dst++, src += 8)
        {
            int add_0_7 = src[0]+src[7], add_1_6 = src[1]+src[6], add_2_5 = src[2]+src[5], add_3_4 = src[3]+src[4];
            int sub_0_7 = src[0]-src[7], sub_1_6 = src[1]-src[6], sub_2_5 = src[2]-src[5], sub_3_4 = src[3]-src[4];
    
            dst[0]  = (64*add_0_7 + 64*add_1_6 + 64*add_2_5 + 64*add_3_4 + add) >> shift;
            dst[8]  = (89*sub_0_7 + 75*sub_1_6 + 50*sub_2_5 + 18*sub_3_4 + add) >> shift;
            dst[16] = (83*add_0_7 + 36*add_1_6 - 36*add_2_5 - 83*add_3_4 + add) >> shift;
            dst[24] = (75*sub_0_7 - 18*sub_1_6 - 89*sub_2_5 - 50*sub_3_4 + add) >> shift;
            dst[32] = (64*add_0_7 - 64*add_1_6 - 64*add_2_5 + 64*add_3_4 + add) >> shift;
            dst[40] = (50*sub_0_7 - 89*sub_1_6 + 18*sub_2_5 + 75*sub_3_4 + add) >> shift;
            dst[48] = (36*add_0_7 - 83*add_1_6 + 83*add_2_5 - 36*add_3_4 + add) >> shift;
            dst[56] = (18*sub_0_7 - 50*sub_1_6 + 75*sub_2_5 - 89*sub_3_4 + add) >> shift;
        }
    }
    
    // C reference for the corresponding assembly function.
    void fenc_dct_8_c(int16_t *dst, f265_pix *src, int src_stride, f265_pix *pred, int pred_stride)
    {
        int lg_bs = 3, bd = 8;
        int bs = 1<<lg_bs, bs2 = 1<<(lg_bs<<1);
        int shift1 = lg_bs + bd - 9, shift2 = lg_bs + 6;
        int16_t diff[bs2], tmp[bs2];
        fenc_get_block_residual(diff, src, src_stride, pred, pred_stride, bs);
        fenc_dct_8_1d(tmp, diff, shift1);
        fenc_dct_8_1d(dst, tmp, shift2);
    }
    
  • Laurent_Birtz
    Member
    Post count: 10

    Example 3: DCTs/IDCTs.

    This operation corresponds to matrix multiplications, with some exploitable symmetry in the matrix. The supported block sizes are 4×4, 8×8, 16×16, 32×32.

    
    [....]   [....]   [....]   [....]
    [....] * [....] * [....] = [....]
    [....]   [....]   [....]   [....]
    [....]   [....]   [....]   [....]
    

    The core algorithm uses shuffles to align the pixels, and multiply-add with 16-bit and 32-bit factors. The shuffle operation issue rate is a limiting factor.

    Although we avoid doing full matrix transposes in the implementation because that’s too slow, we still perform the (optimized) equivalent. A naive transpose is done with a 100% shuffle workload. Suppose we have 4 registers (A-D), each containing a row of 4 pixels. We’d proceed with unpack operations (1->2, 2->4).

    
    A: abcd    aebf    aeim
    B: efgh => cgdh => bfjn
    C: ijkl    imjn    cgko
    D: mnop    kolp    dhlp
    

    That type of reordering happens often.

    Actual code.

    http://f265.org/browse/f265/tree/f265/asm/avx2/dct.asm?h=develop

    Conclusions.

    In my application, shuffle operations are very, very common. I would say that 50% of the algorithms have a workload where more than 33% of the operations are shuffles. The data needs to be shuffled before, during, and after the processing (shuffle-load, shuffle-process, shuffle-store).

    With Haswell, there are 3 vector ports, and each port is very specialized. Only 1 shuffle uop can issue per cycle (on port 5) and this is the main bottleneck. One can sometimes cheat a little by using vector shifts and blend operations in lieu of a shuffle operation, but that’s not sufficient to avoid the bottleneck.

    As a rule of thumb, my application would be best served if the hardware could sustain a workload with 50% shuffle operations. The most useful instruction on AVX2 is PSHUFB, which allows the arbitrary reordering of a 16-byte lane. Many other shuffle operations are useful. In general, making more shuffle modes available leads to a substantial decrease in the total operation count.

    Type expansion and narrowing are common enough. 8-bit-to-16-bit multiply-add is (ab)used very frequently for a variety of purposes.
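
    As an illustration of that multiply-add pattern, here is a minimal sketch of how PMADDUBSW can implement the (32-frac)*a + frac*b interpolation from example 1 (SSSE3 intrinsics; the helper name and data layout are mine, not the actual code).

    #include <tmmintrin.h>  // _mm_maddubs_epi16 is PMADDUBSW

    // 'pairs' holds interleaved (a, b) pixel bytes, 'weights' holds repeated
    // (32-frac, frac) byte pairs; adjacent 8-bit products are summed to 16-bit.
    static inline __m128i interp_pairs(__m128i pairs, __m128i weights)
    {
        __m128i sum = _mm_maddubs_epi16(pairs, weights);
        return _mm_srli_epi16(_mm_add_epi16(sum, _mm_set1_epi16(16)), 5); // (sum + 16) >> 5
    }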

    Shuffling and vector type association are orthogonal concepts. For the Mill, you could support arbitrary shuffle operations that do not affect the underlying vector data type. That’s better than the Intel way of defining the same instruction twice for the integer and floating-point domains. I don’t know if that’s compatible with the 2-stage bypass network used by the Mill.

    Now, to address your main point: that the Mill uses a model-dependent register size and that it must be programmed without knowing the effective register size. I do not know how to program such a machine effectively. As I discussed above, the optimal algorithm directly depends on the vector register size, due to the implicit shuffle range allowed within that register.

    Now, if you define a virtual register size, say 32 bytes, and you define your shuffle/gather operations based on that virtual size, then I can work with that. If on some models you implement the specified operations with 2×16-byte or 4×8-byte vector units, I don’t care; that is transparent to me, the programmer.

    Your goal of “write once, run anywhere” is a desirable property. That said, when one is willing to do assembly programming to get more performance, it should be clear that performance is the #1 priority. On Intel, we forgo intrinsics because we lose 10-15% performance to poor compiler optimization. If you tell me, “your code will be dependent on model X, but you will gain 20% performance by assuming reg_size = 32”, I’ll choose that trade-off. Ultimately it’s the difference between $1 000 000 of hardware and $1 250 000.

    Regards,
    Laurent

  • Laurent_Birtz
    Member
    Post count: 10

    Example 2: motion estimation using the sum of absolute differences (SAD).

    The motion estimation is done by repeatedly comparing a block of pixels in a reference frame with a source block of pixels, e.g. a block of 16×16 pixels. For each reference (‘a’) and source pixel (‘b’) of the block, we compute |a-b|, then sum the values (e.g. 256 values for a 16×16 block), yielding the SAD.

    On AVX2, there is a SAD instruction that computes |a-b| between two registers and then sums 8 contiguous values, yielding one 8-byte result per group of 8 bytes in the register. Without that instruction, we’d have to compute the absolute differences in 8-bit with saturating subtraction, as MAX(a-b, b-a).

    The natural processing is to loop per row. For each loop iteration, load the two rows, compute the SAD “vertically” (i.e. per pixel or pixel group), and sum in an accumulator register. This part of the computation follows the natural alignment and can be done efficiently.
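
    A minimal sketch of one such iteration for a 16-pixel-wide block, using the SSE2 form of the SAD instruction (the helper name is mine, not the actual code):

    #include <stdint.h>
    #include <emmintrin.h>  // _mm_sad_epu8 is PSADBW

    // Accumulate |ref[x] - src[x]| for one row of 16 pixels. PSADBW leaves two
    // 64-bit partial sums per register; the horizontal fold happens at the end.
    static inline __m128i sad_row_16(const uint8_t *ref, const uint8_t *src, __m128i acc)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)ref);
        __m128i b = _mm_loadu_si128((const __m128i *)src);
        return _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }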

    In the last part, we must “fold” the results in a register horizontally into a single value. With X values in the register, there are lg(X) folding operations (e.g. 8 => 4 => 2 => 1). This can be done in two ways. We can add the values horizontally within the register (e.g. PHADD on AVX2). Or, we can shuffle to align the values, then sum. On Intel, it is actually faster to shuffle because the PHADD instruction internally issues 2 slow shuffle uops.
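
    A minimal sketch of the shuffle-and-add fold, assuming four 32-bit partial sums in the accumulator (the helper name is mine; the PSADBW path above folds 64-bit sums instead, but the pattern is the same):

    #include <emmintrin.h>

    // 4 => 2 => 1: shuffle to align the values, then add, instead of PHADD.
    static inline int fold_4x32(__m128i acc)
    {
        acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2))); // 4 => 2
        acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1))); // 2 => 1
        return _mm_cvtsi128_si32(acc);
    }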

    This type of processing is encountered often. Fast vertical processing, followed by none-too-fast horizontal processing.

    Actual code.

    http://f265.org/browse/f265/tree/f265/asm/avx2/pixel.asm?h=develop

  • Laurent_Birtz
    Member
    Post count: 10

    All vector examples come from the algorithms in an H.265 video encoder using the AVX2 instruction set. I’m using several posts because there’s a lot of text.

    Example 1: intra prediction on a block of 4×4 8-bit pixels.

    
    ijklmnopq
    h....
    g....
    f....
    e....
    d
    c
    b
    a
    

    The pixels shown as ‘.’ are predicted from 17 neighbor pixels ‘a-q’ and a projection angle. There are 33 projection angles defined, ranging from bottom-left to top-right.

    Operation 1: linearize the neighbors.

    Given the chosen prediction angle, linearize the neighbors for the projection. For example, if the pixels are projected toward the bottom and a little to the right, the neighbor pixels could be linearized as follows.
    abcdefghijklmnopq => fhijklm

    This operation is best done by a shuffle operation. Load the neighbors in a register, compute the offsets in another register (per byte), and reorder (PSHUFB).
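
    A minimal sketch for the fhijklm example above, using SSSE3 intrinsics. Here the offset table is spelled out as constants for clarity, whereas the real code computes the per-byte offsets from the angle; the name and the padding of the unused lanes are mine.

    #include <tmmintrin.h>  // _mm_shuffle_epi8 is PSHUFB

    // Neighbors a..q are loaded in order in 'neighbors'; each byte of the
    // offset table selects which neighbor lands at that output position.
    // Offsets 5,7,8,9,10,11,12 pick f,h,i,j,k,l,m; the remaining lanes are padded.
    static inline __m128i linearize_fhijklm(__m128i neighbors)
    {
        const __m128i offsets = _mm_setr_epi8(5, 7, 8, 9, 10, 11, 12, 12,
                                              12, 12, 12, 12, 12, 12, 12, 12);
        return _mm_shuffle_epi8(neighbors, offsets);
    }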

    Operation 2: filter the neighbors.

    For some angles, the neighbors are filtered to smooth the projection. Essentially, for 3 contiguous neighbors ‘abc’, the operation is
    b’=(a+2*b+c)/4.

    Note that the pixels are temporarily expanded to 16-bit during this computation to avoid overflows.

    This can be done by shuffling the neighbors in a register to align them, or by loading the neighbors from memory at an offset of +-1.
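
    A minimal scalar sketch of that filter, in the style of the C reference functions from the other post (the function name and the +2 rounding term are my additions):

    #include <stdint.h>

    // b' = (a + 2*b + c) / 4 over the linearized neighbors; the sums are done
    // in int, i.e. wider than 8-bit, to avoid overflow.
    static void filter_neighbors(uint8_t *dst, const uint8_t *src, int count)
    {
        dst[0] = src[0];
        for (int i = 1; i < count - 1; i++)
            dst[i] = (uint8_t)((src[i-1] + 2*src[i] + src[i+1] + 2) >> 2);
        dst[count-1] = src[count-1];
    }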

    Operation 3: project the neighbors.

    For each pixel of the 4×4 block, compute the offset(s) of the corresponding neighbor pixel(s). If the projection does not align cleanly on a neighbor position, the two closest neighbors are weighted with the fractions i/32 and (32-i)/32, computed as 16-bit.

    The projection algorithm depends on the angle direction. If the direction is mostly vertical, then the neighbors can be shuffled within a register or loaded as a group from memory, with an offset specific to each row. Shuffling is more efficient and avoids failed store forwarding issues.

    If the direction is horizontal, one way to do it is to pretend the angle is vertical, then transpose the matrix at the end (I discuss this in the other examples). This can also be done more efficiently by shuffling at each row.

    In practice, the pixel blocks range from 4×4 to 32×32 pixels. The optimal algorithm differs considerably by block size and depends on the vector register size since that determines the effective shuffle range.

    Actual code.

    http://f265.org/browse/f265/tree/f265/asm/avx2/intra.asm?h=develop

  • Laurent_Birtz
    Member
    Post count: 10

    Thanks for checking.

    I’ll trust you for the instruction sequence. I can’t follow all the reasoning without the ISA spec. Note that the SIMD code would require a scatter operation if you implemented this in the naive way (combining the two 1D functions gets you a better algorithm overall).

    There are some things that the compiler cannot know — for example, the first 1D DCT can be done with 16-bit multiplies due to restrictions on the input range. Also, it’s often possible to “batch” operations together, e.g. a super function that does 33 intra predictions at a time. So, in practice, hand-crafted (SIMD) code can be better than it initially looks when you consider just one operation.

    Hopefully, it will be possible to write a tool that takes some intrinsics from the programmer and reorders/pads them to respect the timing. It may well be that MIMD Mill can perform as well as SIMD Haswell (that sentence sounds wrong), but of course I’d like a 5x-10x improvement over Haswell :-)

    With that said, good luck with your CPU.

  • Laurent_Birtz
    Member
    Post count: 10

    Very clear, thanks.

    The only drawback I see to pipelining more than required to mask the L1 latency is that you incur an overhead, which may or may not be significant depending on the number of loop iterations.

  • Laurent_Birtz
    Member
    Post count: 10

    I saw all your presentations. Very interesting, even though I didn’t understand it all.

    My understanding is that you’ll get a stall for the first pickup that is ready in the steady state of the non-unrolled loop. By that time, you might have a few other loads in flight, e.g. 4 in my example, but not 10.

  • Laurent_Birtz
    Member
    Post count: 10

    Thanks for the replies Ivan.

    I agree with your comment that ultimately it’s the (reorder) buffer size that limits the number of cache misses discovered before a stall in an OoO. Still, there is a fundamental difference between an OoO and a Mill: the Mill is much wider!

    On one hand, it means that a typical OoO would fill its “retire” buffer much more slowly than a Mill would for the same buffer size, thus potentially hiding latency for more instructions before running out of buffer space. On the other hand, if the compute/stall ratio is 70/30 for an app on an OoO, and if the Mill executes 10x faster, then the new ratio becomes 7/30. So, same stall time overall, but proportionally much more problematic on a Mill.

    Unlike an OoO, if you do speculative execution side-effect-free, then you are not limited by buffer size, but only by encountering a branch involving a NaR (AFAICT).

    For the vector operations, I will check my assembly code Monday and come up with a few examples (http://f265.org/browse/f265/tree/f265/asm/avx2/dct.asm?h=develop).

    Thanks,
    Laurent

  • Laurent_Birtz
    Member
    Post count: 10

    Thanks for the reply.

    Static scheduling will only get you so far. If you have one load per loop iteration with a medium loop body, you can’t really schedule 10 loads up-front without unrolling a lot (i.e. each load must somehow be associated with instructions to process it). There are also more complex scenarios where discoverability would be useful (e.g. loads-on-loads).

    For the load/store aliases, things are indeed interesting. I especially like the mechanism that Mill uses to prevent false sharing within a cache line.

    As requested, I won’t further discuss what the Mill could do. Note, though, that enthusiasts typically like to discuss new architectures openly, in this forum or elsewhere. The idea proposed by png (IP on the forums) to include a standard blurb in a post to prevent patent issues might be worth investigating. To me, that seems preferable to asking people to voluntarily censor themselves.

    Regards,
    Laurent Birtz
