Recall that there are two dimensions of parallelism on a Mill: MIMD width (the number of slots, i.e. the number of concurrent operations) and vector height (in bytes). It’s hard to find code that can use more width than a Silver has in open code; there’s just not that much ILP there, even with phasing. In pipelined loops there’s all the ILP you want, but once every op in the loop is running pipelined, all you can do is unroll and do several iterations at once. Even HPC loops are usually only one or two instructions on a Gold, so it’s not clear that there is much use for greater width. While wider is certainly possible, we spec’d Gold at what we expected to be the point of diminishing returns for width.
However, even on a narrower configuration there is some use for greater height, i.e. more SIMD even if not so much more MIMD. Those pipelined loops can often be vectorized, and if the iteration count is big enough to fill the height then that’s probably the best way to get raw horsepower.
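To make the width-versus-height trade concrete, here is a back-of-envelope sketch in Python. It is only an idealized model, not a description of any real Mill member: the function name, the parameters, and every number used with it are assumptions chosen for illustration, and it ignores slot types, latencies, and loop overhead.

```python
def elements_per_cycle(slots, ops_per_iteration, vector_bytes, element_bytes):
    """Idealized steady-state throughput of a pipelined, vectorized loop.

    All parameters are hypothetical inputs, not figures for a real configuration.
    """
    # Width (MIMD): how many loop iterations fit in one instruction's slots.
    # A value below 1 means one iteration needs more than one instruction.
    iterations_per_cycle = slots / ops_per_iteration
    # Height (SIMD): how many elements each vector operation covers.
    lanes = vector_bytes // element_bytes
    return iterations_per_cycle * lanes
```

With made-up inputs of 8 slots, a 4-op loop body, and 4-byte elements, going from 16-byte to 64-byte vectors moves the estimate from 8 to 32 elements per cycle without adding a single slot, which is the sense in which height can pay off on a narrow member.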
Of course, all the horsepower in the world will not give you anything if you are memory bandwidth limited, and, as you suggest, bandwidth is likely to be the practical limit.
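The bandwidth point can be put in the same rough terms. The sketch below is a simple roofline-style estimate in Python, with hypothetical names and numbers throughout: it just takes the smaller of what the execution units could retire and what memory could feed.

```python
def attainable_elements_per_cycle(compute_peak, bytes_per_cycle, bytes_per_element):
    """Roofline-style bound: throughput is capped by compute or by memory traffic.

    compute_peak      -- elements per cycle the core could retire (assumed)
    bytes_per_cycle   -- sustained memory bandwidth per cycle (assumed)
    bytes_per_element -- bytes read plus written per element of work
    """
    memory_bound = bytes_per_cycle / bytes_per_element
    return min(compute_peak, memory_bound)
```

For example, a float32 `y = a*x + y` loop moves 12 bytes per element (two 4-byte loads and one 4-byte store). If memory sustained an assumed 48 bytes per cycle, the loop would top out at 4 elements per cycle no matter whether the core could in principle retire 8 or 32.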