Thanks for checking.
I’ll take your word on the instruction sequence; I can’t follow all the reasoning without the ISA spec. Note that the SIMD code would need a scatter operation if you implemented this the naive way. Combining the two 1D functions gives a better algorithm overall.
There are some things that the compiler cannot know — for example, the first 1D DCT can be done with 16-bit multiplies due to restrictions on the input range. Also, it’s often possible to “batch” operations together, e.g. a super function that does 33 intra predictions at a time. So, in practice, hand-crafted (SIMD) code can be better than it initially looks when you consider just one operation.
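To make the 16-bit point concrete, here is a rough C sketch, not code from any real codec. The names `kCoef`, `dct4_generic` and `dct4_narrow` are made up for illustration. The 4-point coefficient matrix is an HEVC-style one chosen as an example, and the input range of [-255, 255] is an assumption about the residual data. Without that range knowledge, a compiler has to treat the inputs as full-width integers. With it, every operand fits in 16 bits, so each product can be a 16x16 multiply accumulated into 32 bits, which is the shape a `pmaddwd`-style instruction handles.

```c
#include <stdint.h>

/* Illustrative 4-point integer transform matrix (HEVC-style values,
 * assumed here only as an example). */
static const int16_t kCoef[4][4] = {
    {64,  64,  64,  64},
    {83,  36, -36, -83},
    {64, -64, -64,  64},
    {36, -83,  83, -36},
};

/* Generic version: inputs are full 32-bit values, so every product is a
 * 32x32 multiply. This is what has to be assumed when nothing is known
 * about the input range. */
void dct4_generic(const int32_t in[4], int32_t out[4]) {
    for (int k = 0; k < 4; k++) {
        int32_t acc = 0;
        for (int n = 0; n < 4; n++) {
            acc += (int32_t)kCoef[k][n] * in[n];
        }
        out[k] = acc;
    }
}

/* Narrow version: ASSUMES every input lies in [-255, 255] (9-bit residual).
 * Both operands then fit in int16_t, each product is an exact 16x16
 * multiply, and only the running sum needs 32 bits (worst case is
 * 4 * 64 * 255 = 65280). */
void dct4_narrow(const int16_t in[4], int32_t out[4]) {
    for (int k = 0; k < 4; k++) {
        int32_t acc = 0;
        for (int n = 0; n < 4; n++) {
            acc += (int32_t)kCoef[k][n] * (int32_t)in[n];
        }
        out[k] = acc;
    }
}
```

Both functions give the same results for in-range inputs. The difference is that the narrow one packs twice as many lanes into a vector register, and that only pays off because the programmer knows something about the data that the compiler doesn't.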
Hopefully, it will be possible to write a tool that takes some intrinsics from the programmer and reorders/pads them to respect the timing. It may well be that MIMD Mill can perform as well as SIMD Haswell (that sentence sounds wrong), but of course I’d like a 5x-10x improvement over Haswell.
With that said, good luck with your CPU.