Example 3: DCTs/IDCTs.
This operation corresponds to matrix multiplications, with some exploitable symmetry in the matrix. The supported block sizes are 4×4, 8×8, 16×16, 32×32.
[....]   [....]   [....]   [....]
[....] * [....] * [....] = [....]
[....]   [....]   [....]   [....]
[....]   [....]   [....]   [....]
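In scalar form, the 4×4 case Y = T · X · Tᵀ can be sketched as two matrix products. The factors below are the HEVC-style 4-point integer DCT matrix, chosen purely for illustration; the real code vectorizes this and inserts scaling/rounding stages.

```c
#include <assert.h>
#include <stdint.h>

/* 4-point integer DCT factors (HEVC-style), for illustration only. */
static const int32_t T[4][4] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

/* Y = T * X * T^T: the separable 2-D transform as two matrix products.
   No rounding or scaling here; a real codec adds those between passes. */
void dct4x4(const int32_t X[4][4], int32_t Y[4][4])
{
    int32_t tmp[4][4];
    /* Row pass: tmp = T * X */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            int32_t s = 0;
            for (int k = 0; k < 4; k++)
                s += T[i][k] * X[k][j];
            tmp[i][j] = s;
        }
    /* Column pass: Y = tmp * T^T */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            int32_t s = 0;
            for (int k = 0; k < 4; k++)
                s += tmp[i][k] * T[j][k];
            Y[i][j] = s;
        }
}
```

A flat (DC-only) input block lands entirely in Y[0][0], since every basis row except the first sums to zero.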
The core algorithm uses shuffles to align the pixels, and multiply-add with 16-bit and 32-bit factors. The shuffle operation issue rate is a limiting factor.
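On x86 the 16-bit multiply-add is PMADDWD (`_mm_madd_epi16`): it multiplies eight 16-bit pairs and sums adjacent products into four 32-bit lanes, which is exactly the shape of a DCT butterfly once the shuffles have paired up the pixels. A minimal sketch (the coefficient values are illustrative):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* One 16-bit multiply-add pass: r[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1].
   This pairwise multiply-accumulate shape is a DCT butterfly step. */
void madd16(const int16_t a[8], const int16_t b[8], int32_t r[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_madd_epi16(va, vb);  /* PMADDWD */
    _mm_storeu_si128((__m128i *)r, vr);
}
```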
Although we avoid full matrix transposes in the implementation because they are too slow, we still perform the (optimized) equivalent. A naive transpose is a 100% shuffle workload. Suppose we have 4 registers (A-D), each containing a row of 4 pixels. We proceed with unpack operations (1->2, then 2->4).
A: abcd    aebf    aeim
B: efgh => cgdh => bfjn
C: ijkl    imjn    cgko
D: mnop    kolp    dhlp
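With SSE2 intrinsics, the same two-stage unpack pattern (1->2, then 2->4) on 32-bit lanes looks like this; every instruction is shuffle-port work:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* 4x4 transpose of 32-bit lanes in two unpack stages, mirroring the
   byte diagram above. All eight instructions are shuffles. */
void transpose4x4(__m128i r[4])
{
    __m128i t0 = _mm_unpacklo_epi32(r[0], r[1]); /* a e b f */
    __m128i t1 = _mm_unpackhi_epi32(r[0], r[1]); /* c g d h */
    __m128i t2 = _mm_unpacklo_epi32(r[2], r[3]); /* i m j n */
    __m128i t3 = _mm_unpackhi_epi32(r[2], r[3]); /* k o l p */
    r[0] = _mm_unpacklo_epi64(t0, t2);           /* a e i m */
    r[1] = _mm_unpackhi_epi64(t0, t2);           /* b f j n */
    r[2] = _mm_unpacklo_epi64(t1, t3);           /* c g k o */
    r[3] = _mm_unpackhi_epi64(t1, t3);           /* d h l p */
}
```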
That type of reordering happens often.
In my application, shuffle operations are very, very common. I would say that 50% of the algorithms have a workload where more than 33% of the operations are shuffles. The data needs to be shuffled before, during, and after the processing (shuffle-load, shuffle-process, shuffle-store).
On Haswell, there are 3 vector ports, and each port is very specialized. Only 1 shuffle uop can issue per cycle (on port 5), and this is the main bottleneck. One can sometimes cheat a little by using vector shifts and blend operations in lieu of a shuffle operation, but that is not sufficient to avoid the bottleneck.
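One example of that kind of cheat: swapping the two 16-bit halves of every 32-bit lane can be written as two shifts and an OR, which issue on the vector ALU/shift ports rather than occupying the single shuffle port. A sketch:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Swap the 16-bit halves of each 32-bit lane without a shuffle:
   (x << 16) | (x >> 16) per lane. Shifts and OR stay off the
   shuffle port, leaving it free for other reordering work. */
__m128i swap_halves32(__m128i x)
{
    return _mm_or_si128(_mm_slli_epi32(x, 16), _mm_srli_epi32(x, 16));
}
```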
As a rule of thumb, my application would be best served if the hardware could sustain a workload with 50% shuffle operations. The most useful instruction on AVX2 is PSHUFB, which allows the arbitrary reordering of a 16-byte lane. Many other shuffle operations are useful. In general, making more shuffle modes available leads to a significant decrease in the total operation count.
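To make the PSHUFB semantics concrete: each output byte selects any source byte within the 16-byte lane via a control mask (a mask byte with its high bit set zeroes the output byte). Reversing a lane, for instance:

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* PSHUFB: dst byte i = src[mask[i] & 0x0F] within the 16-byte lane
   (mask high bit set => zero). Here the mask reverses the lane. */
__attribute__((target("ssse3")))
void reverse_bytes16(const uint8_t src[16], uint8_t dst[16])
{
    __m128i v    = _mm_loadu_si128((const __m128i *)src);
    __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                 7, 6, 5, 4, 3, 2, 1, 0);
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, mask));
}
```

On AVX2 the 256-bit VPSHUFB applies the same operation independently to each 16-byte lane, which is why cross-lane reordering still needs extra work.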
Type expansion and narrowing are also common. The 8-bit-to-16-bit multiply-add is (ab)used very frequently for a variety of purposes.
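One example of the (ab)use: PMADDUBSW (`_mm_maddubs_epi16`) multiplies unsigned bytes by signed bytes and sums adjacent pairs into 16-bit lanes, so with an all-ones factor vector it degenerates into a widening pairwise add:

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 (PMADDUBSW) */

/* PMADDUBSW abused as a widening pairwise add: with every signed
   factor set to 1, r[i] = px[2i] + px[2i+1], expanded 8 -> 16 bits.
   (Worst case 255+255 = 510, well under the 16-bit saturation limit.) */
__attribute__((target("ssse3")))
void pairwise_widen_add(const uint8_t px[16], int16_t r[8])
{
    __m128i v    = _mm_loadu_si128((const __m128i *)px);
    __m128i ones = _mm_set1_epi8(1);
    _mm_storeu_si128((__m128i *)r, _mm_maddubs_epi16(v, ones));
}
```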
Shuffling and vector type association are orthogonal concepts. On the Mill, you could support arbitrary shuffle operations that do not affect the underlying vector data type. That’s better than the Intel way of defining the same instruction twice for the integer and floating-point domains. I don’t know whether that’s compatible with the 2-stage bypass network used by the Mill.
Now, to address your main point: the Mill uses a model-dependent register size, and it must be programmed without knowing the effective register size. I do not know how to program such a machine effectively. As I discussed above, the optimal algorithm depends directly on the vector register size, because of the implicit shuffle range allowed within that register.
Now, if you define a virtual register size, say 32 bytes, and you define your shuffle/gather operations in terms of that virtual size, then I can work with that. If on some models you implement the specified operations with 2×16-byte or 4×8-byte vector units, I don’t care; that is transparent to me, the programmer.
Your goal of “write once, run anywhere” is a desirable property. That said, when one is willing to do assembly programming to get more performance, it should be clear that performance is the #1 priority. On Intel, we forgo intrinsics because we lose 10-15% performance to bad compiler optimizations. If you tell me, “your code will be dependent on model X, but you will gain 20% performance by assuming reg_size = 32”, I’ll take that trade-off. Ultimately it’s the difference between $1 000 000 of hardware and $1 250 000.