I will sheepishly admit that I hadn’t thought of the log(N) reduction idea. If you can you can implement this with some combination of shuffles and vectorized adds in, say, four clocks, then the need for this instruction goes away. Or even six clocks- and I’d be surprised if it took six clocks. It’s simply not a big enough win anymore to make it worth the instruction encoding hit and increased complexity of the CPU. I withdraw the suggestion. :-}
In general, I think I would advocate for a simpler instruction set, all other things being equal. I’ve yet to see an instruction set get simpler as it ages. There is always time to add complexity later, if it’s needed. Once added, however, complexity never seems to get removed. And it has a bad habit of becoming an albatross hung around the neck of the instruction set, unused but impossible to get rid of (see, for example, AAD on the x86, or the Vax instruction set).