Thanks for the details.
I’m working by eyeball here, but it appears that both the interpolate and copy vectorize and pipeline. Ditto SAD. The dct is interesting. You’d get pretty good results in scalar on a wider Mill; there’s no great reason to use vector at all, except for encode density. It’s got 48 scalar ALU ops and 32 multiplies on its face; on a Gold that would be 10 cycles, plus a cycle for function overhead, per call, with no compiler optimization at all. However, there’s quite a bit of CSE in the code, and the shift suggests that you’re actually working in fixed-point, not integer, so that might shrink when the compiler is done with it.
The interesting question is whether a SIMD code (like you are using on Haswell) is actually any better than a straightforward MIMD implementation, and if it is, whether a tool chain could find it. Looking at the 0_7 column, the constants are clearly a permute of each other. In Mill SIMD with more than one multiplier you would not want to implement them as a permute though; better to load them, unless you are hurting for L1 for other reasons. They are all small values, so it might figure out to load them as a byte vector and widen for use on members with less than 32-byte vectors.
Then there’s building the add/sub_0_7 vector. That’s two splats, a mask, and a pick on a Mill, but what’s really going on is an interleave. The opset has a de-interleave (“alternate”), but no interleave; this is a case where one would be useful. However, it’s doubtful that the tools would figure out to use it. The interleave could be emulated by a double-vector shuffle, but that’s a two-slot gang op and has a longer latency than a specific interleave would be (interleave would be just a dyadic splat, one cycle). A dyadic permute will also work in this case, and is a better choice than the general shuffle.
The multiply is then a straight vector op. The +/- between the columns is a pair of ALUs, a mask and a pick. So if I have this right, the critical path using vectors is: splat, pick, mul, ALU, pick, ALU, pick, ALU, shift. That’s 8 clocks because the picks hide in the phasing. However, we won’t have 32 multipliers, so those have to be piped, which will add three more clocks likely.
Given the slop in these guesstimates, it’s not clear whether a SIMD or MIMD would win out, although for sure the MIMD would warm up the iCache The MIMD is trivial for the tool chain; recognizing the SIMD would be an interesting exercise. Another exercise is what to do if the member vector is >32 bytes; again, trivial in MIMD, not so much in SIMD.