With wafer-scale machine learning and quantum computing, we are so far down the road to special-purpose architectures that general-purpose (GP) computing is really treated like orchestration code.
The big market gap in machine learning is extreme sparsity, and extreme sparsity requires indirect memory access because of the data structures that sparse arrays need. So the ML challenge is keeping shared, parallel, random memory access on-die. This gap has existed ever since GPUs were repurposed for ML, since graphics (up until ray tracing) had little use for sparse matrix multiplies. That has, in turn, biased the entire field toward dense models: drunks looking for their keys under the lamp post. The hundreds of billions of parameters in the large models are dense, but it’s been demonstrated that weight distillation can bring that down by at least a factor of 10 without loss of perplexity score. One case achieved a 97% reduction. This is consistent with what we know about neocortical neuron connectivity. Moreover, Algorithmic Information Theory tells us that the gold standard for data-driven induction of models is parameter minimization approximating the Kolmogorov complexity of the data.
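To make the "indirect memory access" point concrete, here is a minimal sketch of a sparse matrix-vector multiply, assuming the common compressed sparse row (CSR) layout. The function names and the toy matrix are illustrative only, not taken from any particular library or accelerator.

```python
def csr_from_dense(dense):
    """Convert a dense row-major matrix (list of lists) into CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)    # nonzero value
                col_idx.append(j)   # the column it came from
        row_ptr.append(len(values)) # where the next row starts in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a CSR matrix A."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access: which element of x gets read depends on
            # data (col_idx[k]), so the address is unknown until run time.
            y[i] += values[k] * x[col_idx[k]]
    return y


# Toy example: 3 nonzeros out of 12 entries.
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 2],
]
x = [1.0, 2.0, 3.0, 4.0]

values, col_idx, row_ptr = csr_from_dense(dense)
y = csr_matvec(values, col_idx, row_ptr, x)
print(values, col_idx, row_ptr)
print(y)
```

The read `x[col_idx[k]]` is the whole problem in one expression: a dense multiply walks memory in a predictable stride, while the sparse one has to chase indices, and with many parallel lanes doing that at once you want the target memory shared and close.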
Quantum computing is a pig in a poke.
At the high end, in cloud servers, the transistor budgets for cores and the watts to operate them make paying for an architecture switch seem much more compelling, but I don’t know if you could scale the Mill meaningfully to dozens of cores on a die.
It’s important to think in terms of 50e9 transistors per die as a way of keeping shared RAM access on-die. That’s where we are for both indirect memory access (sparsity in ML) and GP.