If you work for Oracle, your employer is going to be very unhappy about you spilling the beans on their revolutionary new CPU.
Otherwise, you might look a bit more closely at the design of the Sparc you mention. To be able to execute 30 operations of the slide-26 kind (a constant, an add, and a store) in one cycle, you would need ten load-store units and twenty ALUs – actually, 30 ALUs, because the Sparc takes two instructions to build a 32-bit constant. I’m not up on the most recent Sparc offerings, but I think the biggest they have has two load-store units and two ALUs, which is a little short of what you need. 🙂
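To make that unit count concrete, here is a back-of-the-envelope sketch in Python. The function name and parameters are my own illustration, and it assumes each constant-building instruction or add occupies one ALU for the cycle and each store occupies one load-store unit:

```python
def units_needed(ops_per_cycle: int, instrs_per_constant: int) -> tuple[int, int]:
    """Return (load_store_units, alus) for a stream of constant/add/store triples.

    Assumption: one functional unit executes one instruction per cycle.
    """
    groups = ops_per_cycle // 3              # each group is constant + add + store
    load_store_units = groups                # one store per group
    alus = groups * instrs_per_constant      # instructions that build the constants
    alus += groups                           # plus one add per group
    return load_store_units, alus


one_instr_constant = units_needed(30, 1)     # (10, 20)
two_instr_constant = units_needed(30, 2)     # (10, 30), the two-instruction constant case
```

Against a machine with two load-store units and two ALUs, either result is well out of reach.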
So why not build a Sparc with more functional units? Well, knowing why not is why hardware engineers get paid, but the short answer is that the units have to be able to talk with each other, and the cost of the connections increases as the square of the number of units. Rather quickly you reach a point at which the power required will melt the chip. You might Google “dark silicon problem” for more.
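As a rough illustration of that square law, the sketch below counts result-forwarding paths under the simplifying assumption that every unit's output can feed every unit's input (a full bypass network). The function name is hypothetical, and real designs cluster units to cut this down, so treat the numbers as a trend rather than a wire count:

```python
def bypass_paths(units: int) -> int:
    """Paths in a full bypass network: each unit's result to each unit's input."""
    return units * units


small = bypass_paths(4)     # two load-store units + two ALUs: 16 paths
large = bypass_paths(40)    # ten load-store units + thirty ALUs: 1600 paths
growth = large // small     # ten times the units, one hundred times the paths
```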
The Mill avoids this barrier using a method long used in the embedded world: static scheduling with an exposed pipeline. That solves the melting problem, but unfortunately such designs give very bad performance on general purpose programs. The issues arise at run time (cache misses and the like), so compiler improvements don’t help. The Mill has solved those issues, and is able to bring DSP power-performance numbers to general purpose code.
I wish it were as easy as you believe; I could have spent the last decade on a beach. 🙂