Thanks for the replies Ivan.
I agree with your comment that ultimately it’s the (reorder) buffer size that limits the number of cache misses discovered before a stall in an OoO. Still, there is a fundamental difference between an OoO and a Mill: the Mill is much wider!
On one hand, it means that a typical OoO would fill its “retire” buffer much more slowly than a Mill would for the same buffer size, thus potentially hiding latency for more instructions before running out of buffer space. On the other hand, if the compute/stall ratio is 70/30 for an app on an OoO, and if the Mill executes the compute part 10x faster, then the new ratio becomes 7/30. So, same stall time overall, but proportionally much more problematic on a Mill.
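To make that arithmetic concrete, here's a quick sketch (the 70/30 split and the 10x speedup are just my illustrative numbers, not measurements of any real workload):

```python
# Illustrative arithmetic only: the 70/30 split and the 10x speedup
# are example numbers, not measurements.

def stall_fraction(compute, stall, speedup):
    """Fraction of total time spent stalled after speeding up the compute part."""
    new_compute = compute / speedup
    return stall / (new_compute + stall)

# OoO baseline: 70 units compute, 30 units stall -> stalled 30% of the time.
base = stall_fraction(70, 30, 1)

# Mill with 10x faster compute: 7 units compute, same 30 units stall
# -> stalled ~81% of the time, even though the absolute stall time is unchanged.
mill = stall_fraction(70, 30, 10)

print(f"baseline stall share: {base:.0%}")
print(f"mill stall share:     {mill:.0%}")
```

Same 30 units of stall either way; it just dominates the total once the compute shrinks.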
Unlike an OoO, if your speculative execution is side-effect-free, then you are not limited by buffer size, but only by encountering a branch involving a NaR (AFAICT).
For the vector operations, I will check my assembly code Monday and come up with a few examples (http://f265.org/browse/f265/tree/f265/asm/avx2/dct.asm?h=develop).