Mill Computing, Inc. Forums › The Mill Architecture › Speculative execution

  • Laurent_Birtz
    Member
    Post count: 10
    #1430

    Hi

    I find the Mill CPU to be a long-needed breath of fresh air in the CPU architecture field. I am a software programmer (with no hardware design experience) interested in getting the best performance out of low-level optimized code. I think the Mill has real potential in that regard.

    There are two aspects of the Mill that bother me as a developer. First, I worry about the Mill’s ability to efficiently process a sequence of instructions where each instruction involves a cache miss. Suppose, for example, that the application has a scattered memory access pattern or a large working set that overflows the L1 cache.

    In that case, the in-order pipeline will stall as soon as a load “pickup” operation fails (AFAICT). An OoO CPU would continue executing the following instructions and trigger more cache misses. Both the Mill and the OoO CPU would eventually stall waiting for the DRAM, but the OoO might stall with 10 loads in flight while the Mill stalls with only 4.

    I’ve read on the forums that you don’t like people discussing improvements to the Mill due to patent issues, so I’ll just hand wave a potential solution to this problem. I claim no copyright on this text. Essentially, “snapshot”, “speculative execution”, “not available = nop”, “speculative = not externally visible / store inhibition”, “prefetch all discovered accesses while stalling”. Hopefully that’s enough to get my point across. Are you planning something like that eventually?

    The second aspect that bothers me is the metadata for the vector operations. I do a lot of integer vector programming and I find that often there is no such thing as a fixed data type for a vector. One instruction operates on bytes, the next on 16-bit values, the third on 64-bit values for a shuffle, etc. If one needs explicit type-change operations to do that sort of thing, then code density and/or throughput will suffer.

    With that said, I wish you the best of luck with the Mill and I look forward to your first implementation.

    Regards,
    Laurent Birtz

  • Will_Edwards
    Moderator
    Post count: 98

    Hi Laurent,

    Thanks for the encouragement!

    On the first point, if the OoO can do all 10 loads in parallel then all 10 must be independent. If all 10 are independent then the compiler can statically schedule them in parallel too.

    Things actually get interesting if the loads alias stores, as on the Mill the loads snoop/snarf on the stores.

    Re patents: it’s best to hint nothing at all! Please don’t go down this route of disowning ideas publicly.

    The metadata aspect is interesting. Of course Mill SIMD is very familiar to people used to thinking about SIMD on other platforms, but it’s broader too.

    The widening and narrowing ops are very fast, and the range of ops that work on vectors is so much wider than in other ISAs.

    The tracking of error and None NaRs allows vectorization of so much more (see the auto-vectorization examples in the Metadata and Pipelining talks, and think about how you can use masking in general open code too), so it’s swings and roundabouts, but generally a really big win.

    • Ivan Godard
      Keymaster
      Post count: 689

      You have colliding repliers :-)

      Let me expand on Will’s comments re vectors. The present Mill opset assumes that single data elements, either scalar or vector element, are accessed with an inherent width represented in the metadata. Operations can produce new drops with a different width, but there’s no way to view a single value as having simultaneously different widths. That is, there’s no hardware equivalent of a reinterpret_cast<...>.
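
      To make the distinction concrete, here is a minimal scalar sketch (plain C, nothing Mill-specific; the function and names are invented for illustration) of the kind of width re-interpretation at issue: the same 16 bytes viewed first as 8-bit lanes and then as 16-bit lanes, with no conversion in between.

      #include <stdint.h>
      #include <string.h>

      // View the same 16 bytes first as 8-bit lanes, then as 16-bit lanes,
      // without any conversion: the width-punning that the metadata model
      // does not express directly.
      uint32_t sum_both_views(const uint8_t bytes[16])
      {
          uint16_t halves[8];
          memcpy(halves, bytes, sizeof(halves));   /* reinterpret, not widen */

          uint32_t s = 0;
          for (int i = 0; i < 16; i++) s += bytes[i];   /* viewed as 8-bit  */
          for (int i = 0; i < 8;  i++) s += halves[i];  /* viewed as 16-bit */
          return s;
      }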

      The advantage of this assumption is that ordinary code from HLLs does not need to represent widths in the operation encoding, and will run unchanged on Mill members with radically different data encodings. The drawback is that code that is playing machine-level games at the representation level cannot do some of the things it might want to do.

      The problem with the code you describe is that it is explicitly machine dependent. If you are interpreting the same bucket of bits with completely different formats, then your code is assuming what that bucket looks like. That won’t run on a machine with a different size or shape bucket. For example, try to write your vector code in such a way that it works correctly on a 386, MMX, and SSE – without testing the hardware availability of those facilities. Any algorithm for which you have to ask “How big is a vector” is machine-dependent. The Mill doesn’t do machine-dependent, it does portable :-) The Mill claims that it will run language-standard code correctly on all members, without rewrite or member-dependent testing. The Mill does not claim that it will run machine dependent assembler (or equivalent intrinsics) from other machines.

      That said, as we get more real code examples through the tool chain we may find holes for which the present opset is insufficient. We have a backlog of candidate ops, many of them for vectors, waiting on more experience. If there is a way to express these candidates that makes sense on every Mill member then we are likely to put them in; the Mill is still a work in progress. You can help, if you would: look at some of your vector code of concern, and think about what you are actually doing (mentally) to the data to give you the different view for the next operation. Feel free to post examples here.

      • Laurent_Birtz
        Member
        Post count: 10

        Thanks for the replies Ivan.

        I agree with your comment that ultimately it’s the (reorder) buffer size that limits the number of cache misses discovered before a stall in an OoO. Still, there is a fundamental difference between an OoO and a Mill: the Mill is much wider!

        On the one hand, this means that a typical OoO would fill its “retire” buffer much more slowly than a Mill would for the same buffer size, thus potentially hiding latency for more instructions before running out of buffer space. On the other hand, if the compute/stall ratio is 70/30 for an app on an OoO, and the Mill executes the compute 10x faster, then the new ratio becomes 7/30. So, same stall time overall, but proportionally much more problematic on a Mill.

        Unlike an OoO, if you do side-effect-free speculative execution, then you are not limited by buffer size, but only by encountering a branch involving a NaR (AFAICT).

        For the vector operations, I will check my assembly code Monday and come up with a few examples (http://f265.org/browse/f265/tree/f265/asm/avx2/dct.asm?h=develop).

        Thanks,
        Laurent

  • Ivan Godard
    Keymaster
    Post count: 689

    In a sense nearly all Mill operations are speculative; only those (like store) that change globally visible state cannot be speculated. So you can issue as many loads as there are functional units to handle the issue and retire stations to hold the values. An OOO machine has a similar constraint: it cannot have more loads in flight than it has issue and buffer capacity to handle. Consequently, in this the Mill and an OOO are equally constrained for a given amount of hardware resources.

    Now to your example. You assume that the Mill will issue four loads and stall picking up the first one, while the OOO can issue six more. The problem with the example is: why didn’t the Mill issue 10 loads before it tried the first pickup? If there is code after the load that doesn’t depend on the load, the compiler is supposed to hoist that code to either before the load issue or between the load issue and the load pickup. If the compiler did not hoist that code then there is a compiler bug, not an architecture problem. We’ll have compiler bugs for sure (I write compilers), but we fix them.
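
    As a rough sketch of what that hoisting buys (ordinary C rather than Mill code; the function is invented for illustration): ten independent loads can all be started before any of the arithmetic that consumes them.

    #include <stdint.h>

    // Ten independent loads: a compiler can schedule all ten issues back to
    // back and place the pickups (and the arithmetic) afterward, so every
    // miss is already outstanding before the first stall.
    int64_t sum_diff(const int32_t *p, const int32_t *q)
    {
        int32_t a0 = p[0], a1 = p[1], a2 = p[2], a3 = p[3], a4 = p[4];
        int32_t b0 = q[0], b1 = q[1], b2 = q[2], b3 = q[3], b4 = q[4];

        return (int64_t)(a0 + a1 + a2 + a3 + a4)
             - (int64_t)(b0 + b1 + b2 + b3 + b4);
    }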

    Incidentally, the purpose of the Mill deferred load facility is to mask the L1 and L2 latencies, not to mask the DRAM latency; masking the DRAM would require being able to hoist the issue an impractical distance before the pickup. Instead there are other (NYF) mechanisms to mask the DRAM latency.

    • Laurent_Birtz
      Member
      Post count: 10

      Thanks for the reply.

      Static scheduling will only get you so far. If you have one load per loop iteration with a medium loop body, you can’t really schedule 10 loads up-front without unrolling a lot (i.e. each load must somehow be associated with instructions to process it). There are also more complex scenarios where discoverability would be useful (e.g. loads-on-loads).
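
      To illustrate the pattern I have in mind (scalar C, invented example): one indirect load per iteration, with the body depending on the loaded value, so keeping ten such loads in flight statically would seem to need unrolling or pipelining across ten iterations.

      // One load chain per iteration; the body consumes the loaded value.
      int walk(const int *idx, const int *table, int n)
      {
          int acc = 0;
          for (int i = 0; i < n; i++) {
              int v = table[idx[i]];   /* load-on-load, likely cache misses */
              acc += v * v + 3;        /* body depends on the load */
          }
          return acc;
      }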

      For the load/store aliases, things are indeed interesting. I especially like the mechanism that Mill uses to prevent false sharing within a cache line.

      As requested, I won’t further discuss what Mill could do. Note though that enthusiasts typically like to discuss new architectures openly, in this forum or elsewhere. The idea proposed by png (IP on the forums) to include a standard blurb in a post to prevent patent issues might be worth investigating. To me, that seems preferable to asking people to voluntarily censor themselves.

      Regards,
      Laurent Birtz

      • Ivan Godard
        Keymaster
        Post count: 689

        Unrolling is unnecessary. See the Pipelining talk for how this works.

        • Laurent_Birtz
          Member
          Post count: 10

          I saw all your presentations. Very interesting, even though I didn’t understand it all.

          My understanding is that you’ll stall at the first pickup that comes due in the steady state of the non-unrolled loop. By that time, you might have a few other loads in flight, e.g. 4 in my example, but not 10 loads in flight.

          • Ivan Godard
            Keymaster
            Post count: 689

            You can pipeline the loads as far ahead as you want, up to the maximum deferral expressible in the encoding, or equivalently the maximum number of distinct pick-tokens expressible.

            OOO potentially does do a better job than Mill at hiding DRAM latency for simple loads; that is not the purpose of deferral, although it does help somewhat. There are even some cases where OOO can do a better job of hiding cache latency; those cases mostly involve cascaded loads. We believe such cases to be insignificant, but as yet we have no numbers to back that belief.

            The Mill does extensive prefetch on the code side; the mechanism was explained in the Prediction talk. What it does on the data side must await a future talk.

          • Laurent_Birtz
            Member
            Post count: 10

            Very clear, thanks.

            The only drawback I see to pipelining more than required to mask the L1 latency is that you incur an overhead, which may or may not be significant depending on the number of loop iterations.

  • Laurent_Birtz
    Member
    Post count: 10

    All vector examples come from the algorithms in an H.265 video encoder using the AVX2 instruction set. I’m using several posts because there’s a lot of text.

    Example 1: intra prediction on a block of 4×4 8-bit pixels.

    
    ijklmnopq
    h....
    g....
    f....
    e....
    d
    c
    b
    a
    

    The pixels shown as ‘.’ are predicted from 17 neighbor pixels ‘a-q’ and a projection angle. There are 33 projection angles defined, ranging from bottom-left to top-right.

    Operation 1: linearize the neighbors.

    Given the chosen prediction angle, linearize the neighbors for the projection. For example, if the pixels are projected toward the bottom and a little to the right, the neighbor pixels could be linearized as follows.
    abcdefghijklmnopq => fhijklm

    This operation is best done by a shuffle operation. Load the neighbors in a register, compute the offsets in another register (per byte), and reorder (PSHUFB).
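
    In scalar form (illustration only, not the production code), the linearization is just a per-byte table lookup, which is what PSHUFB does in one instruction:

    #include <stdint.h>

    // Gather the neighbors selected for the chosen angle into a contiguous
    // buffer; the offsets are precomputed per angle.
    void linearize(uint8_t *dst, const uint8_t *neigh,
                   const uint8_t *off, int count)
    {
        for (int i = 0; i < count; i++)
            dst[i] = neigh[off[i]];
    }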

    Operation 2: filter the neighbors.

    For some angles, the neighbors are filtered to smooth the projection. Essentially, for 3 contiguous neighbors ‘abc’, the operation is
    b’=(a+2*b+c)/4.

    Note that the pixels are temporarily expanded to 16-bit during this computation to avoid overflows.

    This can be done by shuffling the neighbors in a register to align them, or by loading the neighbors from memory at an offset of +-1.
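
    A scalar sketch of the filter (illustration only; rounding and edge handling elided), with the 16-bit widening made explicit:

    #include <stdint.h>

    // b' = (a + 2*b + c) / 4, computed in 16-bit to avoid 8-bit overflow.
    void filter_neighbors(uint8_t *dst, const uint8_t *src, int n)
    {
        for (int i = 1; i < n - 1; i++) {
            uint16_t a = src[i - 1], b = src[i], c = src[i + 1];
            dst[i] = (uint8_t)((a + 2 * b + c) >> 2);
        }
    }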

    Operation 3: project the neighbors.

    For each pixel of the 4×4 block, compute the offset(s) of the corresponding neighbor pixel(s). If the projection does not align cleanly on a neighbor position, the two closest neighbors are weighted with the fractions i/32 and (32-i)/32, computed as 16-bit.

    The projection algorithm depends on the angle direction. If the direction is mostly vertical, then the neighbors can be shuffled within a register or loaded as a group from memory, with an offset specific to each row. Shuffling is more efficient and avoids failed store forwarding issues.

    If the direction is horizontal, one way to do it is to pretend the angle is vertical, then transpose the matrix at the end (I discuss this in the other examples). This can also be done more efficiently by shuffling at each row.

    In practice, the pixel blocks range from 4×4 to 32×32 pixels. The optimal algorithm differs considerably by block size and depends on the vector register size since that determines the effective shuffle range.

    Actual code.

    http://f265.org/browse/f265/tree/f265/asm/avx2/intra.asm?h=develop

  • Laurent_Birtz
    Member
    Post count: 10

    Example 2: motion estimation using the sum of absolute differences (SAD).

    The motion estimation is done by repeatedly comparing a block of pixels in a reference frame with a source block of pixels, e.g. a block of 16×16 pixels. For each reference (‘a’) and source pixel (‘b’) of the block, we compute |a-b|, then sum the values (e.g. 256 values for a 16×16 block), yielding the SAD.

    On AVX2, there is a SAD instruction that computes |a-b| between two registers and then sums 8 contiguous values, yielding one 8-byte value per group of 8 bytes in the register. Without that instruction, we’d have to subtract with saturation as 8-bit (MAX(|a-b|, |b-a|)).
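
    In scalar terms, that saturating-subtract fallback looks roughly like this (illustration only):

    #include <stdint.h>

    // For unsigned bytes, one of (a -sat b) and (b -sat a) is zero, so their
    // maximum (or sum) is |a - b| without widening to 16-bit.
    static inline uint8_t subsat_u8(uint8_t x, uint8_t y)
    {
        return x > y ? (uint8_t)(x - y) : 0;
    }

    static inline uint8_t absdiff_u8(uint8_t a, uint8_t b)
    {
        uint8_t d0 = subsat_u8(a, b), d1 = subsat_u8(b, a);
        return d0 > d1 ? d0 : d1;
    }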

    The natural processing is to loop per row. For each loop iteration, load the two rows, compute the SAD “vertically” (i.e. per pixel or pixel group), and sum in an accumulator register. This part of the computation follows the natural alignment and can be done efficiently.

    In the last part, we must “fold” the results in a register horizontally into a single value. With X values in the register, there are lg(X) folding operations (e.g. 8 => 4 => 2 => 1). This can be done in two ways. We can add the values horizontally within the register (e.g. PHADD on AVX2). Or, we can shuffle to align the values, then sum. On Intel, it is actually faster to shuffle because the PHADD instruction does 2 slow shuffle uops internally.
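
    Sketched in scalar form (illustration only), the fold is lg(X) rounds of pairwise additions; in the SIMD version each round is a shuffle to align the halves plus an add.

    #include <stdint.h>

    // Reduce x partial sums to one: 8 => 4 => 2 => 1 (x is a power of two).
    uint32_t fold(uint32_t *acc, int x)
    {
        for (int half = x / 2; half >= 1; half /= 2)
            for (int i = 0; i < half; i++)
                acc[i] += acc[i + half];
        return acc[0];
    }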

    This type of processing is encountered often. Fast vertical processing, followed by none-too-fast horizontal processing.

    Actual code.

    http://f265.org/browse/f265/tree/f265/asm/avx2/pixel.asm?h=develop

  • Laurent_Birtz
    Member
    Post count: 10

    Example 3: DCTs/IDCTs.

    This operation corresponds to matrix multiplications, with some exploitable symmetry in the matrix. The supported block sizes are 4×4, 8×8, 16×16, 32×32.

    
    [....]   [....]   [....]   [....]
    [....] * [....] * [....] = [....]
    [....]   [....]   [....]   [....]
    [....]   [....]   [....]   [....]
    

    The core algorithm uses shuffles to align the pixels, and multiply-add with 16-bit and 32-bit factors. The shuffle operation issue rate is a limiting factor.

    Although we avoid doing full matrix transposes in the implementation because that’s too slow, we still perform the (optimized) equivalent. A naive transpose is done with a 100% shuffle workload. Suppose we have 4 registers (A-D), each containing a row of 4 pixels. We’d proceed with unpack operations (1->2, 2->4).

    
    A: abcd    aebf    aeim
    B: efgh => cgdh => bfjn
    C: ijkl    imjn    cgko
    D: mnop    kolp    dhlp
    

    That type of reordering happens often.
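
    A scalar rendering of those two unpack rounds (illustration only): the first round interleaves at byte granularity, the second at 2-byte granularity, which amounts to the 4×4 transpose.

    #include <stdint.h>
    #include <string.h>

    // Interleave 'elems' elements of 'elem_size' bytes from x and y into dst.
    static void interleave(uint8_t *dst, const uint8_t *x, const uint8_t *y,
                           int elems, int elem_size)
    {
        for (int i = 0; i < elems; i++) {
            memcpy(dst + (2 * i)     * elem_size, x + i * elem_size, elem_size);
            memcpy(dst + (2 * i + 1) * elem_size, y + i * elem_size, elem_size);
        }
    }

    // abcd/efgh -> aebfcgdh, ijkl/mnop -> imjnkolp, then the 2-byte interleave
    // yields aeim, bfjn, cgko, dhlp (the transposed rows).
    void transpose_4x4(uint8_t out[16], const uint8_t in[4][4])
    {
        uint8_t ab[8], cd[8];
        interleave(ab, in[0], in[1], 4, 1);
        interleave(cd, in[2], in[3], 4, 1);
        interleave(out, ab, cd, 4, 2);
    }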

    Actual code.

    http://f265.org/browse/f265/tree/f265/asm/avx2/dct.asm?h=develop

    Conclusions.

    In my application, shuffle operations are very, very common. I would say that 50% of the algorithms have a workload where more than 33% of the operations are shuffles. The data needs to be shuffled before, during, and after the processing (shuffle-load, shuffle-process, shuffle-store).

    With Haswell, there are 3 vector ports, and each port is very specialized. Only 1 shuffle uop can issue per cycle (on port 5) and this is the main bottleneck. One can sometimes cheat a little by using vector shifts and blend operations in lieu of a shuffle operation, but that’s not sufficient to avoid the bottleneck.

    As a rule of thumb, my application would be best served if the hardware could sustain a workload with 50% shuffle operations. The most useful instruction on AVX2 is PSHUFB, which allows the arbitrary reordering of a 16-byte lane. Many other shuffle operations are useful. In general, making more shuffle modes available leads to a significant decrease in the total operation count.
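
    For reference, a scalar model of what PSHUFB does within one 16-byte lane: each output byte selects an arbitrary source byte, or zero when the control byte’s high bit is set.

    #include <stdint.h>

    void pshufb_lane(uint8_t dst[16], const uint8_t src[16], const uint8_t ctl[16])
    {
        for (int i = 0; i < 16; i++)
            dst[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0F];
    }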

    Type expansion and narrowing are common enough. The 8-bit-to-16-bit multiply-add is (ab)used very frequently for a variety of purposes.

    Shuffling and vector type association are orthogonal concepts. For the Mill, you could support arbitrary shuffle operations that do not affect the underlying vector data type. That’s better than the Intel way of defining the same instruction twice for the integer and floating-point domains. I don’t know if that’s compatible with the 2-stage bypass network used by the Mill.

    Now, to address your main point: that the Mill uses a model-dependent register size and that it must be programmed without knowing the effective register size. I do not know how to program such a machine effectively. As I discussed above, the optimal algorithm directly depends on the vector register size, due to the implicit shuffle range allowed within that register.

    Now, if you define a virtual register size, say 32-bytes, and you define your shuffle/gather operations based on that virtual size, then I can work with that. If on some models you implement the specified operations with 2×16-byte or 4×8-byte vector units, I don’t care, this is transparent to me, the programmer.

    Your goal of “write once, run anywhere” is a desirable property. That said, when one is willing to do assembly programming to get more performance, it should be clear that performance is the #1 priority. On Intel, we forgo using intrinsics because we lose 10-15% performance due to bad compiler optimizations. If you tell me, “your code will be dependent on model X, but you will gain 20% performance by assuming reg_size = 32”, I’ll choose that trade-off. Ultimately it’s the difference between $1 000 000 of hardware and $1 250 000.

    Regards,
    Laurent

  • Ivan Godard
    Keymaster
    Post count: 689

    This isn’t quite what I was hoping for from you. These codes are already hand-tuned for a particular x86. A Mill can of course emulate an x86, but it won’t be fast. On some Mill members there may even be (an approximation to) AVX2 operations; you have to pick a member with the same vector height as a Haswell. You could then code your critical function in conAsm assembler for that member, and you’d get roughly the performance that a Haswell on the same clock and process would give you, or better depending on how many shuffle units were on the member you chose. I’m sure some people will use Mills in this way because it’s how they are used to coding, but we don’t recommend it.

    What I was hoping to get from you was the actual algorithm that your code is a machine-dependent implementation of. Express it in scalar if need be. There won’t be any shuffle operations in the algorithm of course, nor any register sizes, nor registers for that matter. It’s impractical for me to try and deduce what your code is supposed to do by looking at what it actually does; decompilation is provably impossible in general.

    The Mill does have an arbitrary shuffle. It’s not clear that any of your codes would actually use a shuffle, unless that were the only hammer you had.

    Ivan

  • Laurent_Birtz
    Member
    Post count: 10

    For the intra case.

    Get the content of this block.

    
    [XXXX]
    [XXXX]
    [XXXX]
    [XXXX]
    

    From a vector of pixels.
    [XXXXXXXX]

    
    // Pass every row.
    int angle_sum = 0;
    for (int y = 0; y < bs; y++)
    {
        angle_sum += angle;
        int off = angle_sum>>5;
        int frac = angle_sum&31;
    
        // Interpolate.
        if (frac)
            for (int x = 0; x < bs; x++)
                dst[y*bs+x] = ((32-frac)*ref[off+x] + frac*ref[off+x+1] + 16)>>5;
    
        // Copy.
        else
            for (int x = 0; x < bs; x++)
                dst[y*bs+x] = ref[off+x];
    }
    

    For the SAD case.

    
    int sad = 0;
    for (int y = 0; y < height; y++, src0 += stride0, src1 += stride1)
        for (int x = 0; x < width; x++)
            sad += F265_ABS(src0[x] - src1[x]);
    return sad;
    

    For the DCT case.

    
    void fenc_dct_8_1d(int16_t *dst, int16_t *src, int shift)
    {
        int add = 1 << (shift - 1);
    
        for (int i = 0; i < 8; i++, dst++, src += 8)
        {
            int add_0_7 = src[0]+src[7], add_1_6 = src[1]+src[6], add_2_5 = src[2]+src[5], add_3_4 = src[3]+src[4];
            int sub_0_7 = src[0]-src[7], sub_1_6 = src[1]-src[6], sub_2_5 = src[2]-src[5], sub_3_4 = src[3]-src[4];
    
            dst[0]  = (64*add_0_7 + 64*add_1_6 + 64*add_2_5 + 64*add_3_4 + add) >> shift;
            dst[8]  = (89*sub_0_7 + 75*sub_1_6 + 50*sub_2_5 + 18*sub_3_4 + add) >> shift;
            dst[16] = (83*add_0_7 + 36*add_1_6 - 36*add_2_5 - 83*add_3_4 + add) >> shift;
            dst[24] = (75*sub_0_7 - 18*sub_1_6 - 89*sub_2_5 - 50*sub_3_4 + add) >> shift;
            dst[32] = (64*add_0_7 - 64*add_1_6 - 64*add_2_5 + 64*add_3_4 + add) >> shift;
            dst[40] = (50*sub_0_7 - 89*sub_1_6 + 18*sub_2_5 + 75*sub_3_4 + add) >> shift;
            dst[48] = (36*add_0_7 - 83*add_1_6 + 83*add_2_5 - 36*add_3_4 + add) >> shift;
            dst[56] = (18*sub_0_7 - 50*sub_1_6 + 75*sub_2_5 - 89*sub_3_4 + add) >> shift;
        }
    }
    
    // This is the C equivalent of the assembly function.
    void fenc_dct_8_c(int16_t *dst, f265_pix *src, int src_stride, f265_pix *pred, int pred_stride)
    {
        int lg_bs = 3, bd = 8;
        int bs = 1<<lg_bs, bs2 = 1<<(lg_bs<<1);
        int shift1 = lg_bs + bd - 9, shift2 = lg_bs + 6;
        int16_t diff[bs2], tmp[bs2];
        fenc_get_block_residual(diff, src, src_stride, pred, pred_stride, bs);
        fenc_dct_8_1d(tmp, diff, shift1);
        fenc_dct_8_1d(dst, tmp, shift2);
    }
    
    • Ivan Godard
      Keymaster
      Post count: 689

      Thanks for the details.

      I’m working by eyeball here, but it appears that both the interpolate and copy vectorize and pipeline. Ditto SAD. The dct is interesting. You’d get pretty good results in scalar on a wider Mill; there’s no great reason to use vector at all, except for encode density. It’s got 48 scalar ALU ops and 32 multiplies on its face; on a Gold that would be 10 cycles, plus a cycle for function overhead, per call, with no compiler optimization at all. However, there’s quite a bit of CSE in the code, and the shift suggests that you’re actually working in fixed-point, not integer, so that might shrink when the compiler is done with it.

      The interesting question is whether a SIMD code (like you are using on Haswell) is actually any better than a straightforward MIMD implementation, and if it is, whether a tool chain could find it. Looking at the 0_7 column, the constants are clearly a permute of each other. In Mill SIMD with more than one multiplier you would not want to implement them as a permute though; better to load them, unless you are hurting for L1 for other reasons. They are all small values, so it might figure out to load them as a byte vector and widen for use on members with less than 32-byte vectors.

      Then there’s building the add/sub_0_7 vector. That’s two splats, a mask, and a pick on a Mill, but what’s really going on is an interleave. The opset has a de-interleave (“alternate”), but no interleave; this is a case where one would be useful. However, it’s doubtful that the tools would figure out to use it. The interleave could be emulated by a double-vector shuffle, but that’s a two-slot gang op and has a longer latency than a specific interleave would have (an interleave would be just a dyadic splat, one cycle). A dyadic permute will also work in this case, and is a better choice than the general shuffle.

      The multiply is then a straight vector op. The +/- between the columns is a pair of ALUs, a mask and a pick. So if I have this right, the critical path using vectors is: splat, pick, mul, ALU, pick, ALU, pick, ALU, shift. That’s 8 clocks because the picks hide in the phasing. However, we won’t have 32 multipliers, so those have to be piped, which will add three more clocks likely.

      Given the slop in these guesstimates, it’s not clear whether a SIMD or MIMD would win out, although for sure the MIMD would warm up the iCache :-) The MIMD is trivial for the tool chain; recognizing the SIMD would be an interesting exercise. Another exercise is what to do if the member vector is >32 bytes; again, trivial in MIMD, not so much in SIMD.

      • Laurent_Birtz
        Member
        Post count: 10

        Thanks for checking.

        I’ll trust you for the instruction sequence. I can’t follow all the reasoning without the ISA spec. Note that the SIMD code would require a scatter operation if you implemented this in the naive way (combining the two 1D functions gets you a better algorithm overall).

        There are some things that the compiler cannot know — for example, the first 1D DCT can be done with 16-bit multiplies due to restrictions on the input range. Also, it’s often possible to “batch” operations together, e.g. a super function that does 33 intra predictions at a time. So, in practice, hand-crafted (SIMD) code can be better than it initially looks when you consider just one operation.

        Hopefully, it will be possible to write a tool that takes some intrinsics from the programmer and reorders/pads them to respect the timing. It may well be that a MIMD Mill can perform as well as a SIMD Haswell (that sentence sounds wrong), but of course I’d like a 5x-10x improvement over Haswell :-)

        With that said, good luck with your CPU.
