It will be interesting to see how Intel’s new Knights Landing (72 Silvermont-derived x86 cores giving 3 TFlops double-precision(!)) is received. I’ve chatted to someone who played with Knights Corner, but as I recall they struggled to apply it to their problems. Sadly I’ve forgotten any deep insights they may have mentioned.
I guess the big challenge when you have a lot of independent cores flying in close formation is meshing them together? And the granularity of the tasks has to be really quite large, I imagine; if you play with, say, Intel’s Threading Building Blocks or OpenMP (where the parallelism is expressed in-line on a loop, rather than by explicitly crafting a large number of tasks), you’ll be staggered at how many iterations a loop needs before it’s worth spreading them across multiple cores.
Of course Go’s CSP-style goroutines and Erlang’s lightweight processes can perhaps use some more cores in a mainstream way, for server workloads.
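To make the granularity point concrete, here is a minimal sketch in Go of a loop that only fans out to goroutines once there is enough work per chunk. The names `chunkedSum`, `serialSum` and `grainSize` are made up for illustration, and the grain value is a guess, not a measured break-even point; you would have to benchmark on your own machine to find where going parallel actually pays.

```go
package main

import (
	"fmt"
	"sync"
)

// grainSize is a hypothetical tuning knob: the most elements one goroutine
// handles. The value here is an assumption, not a measured threshold.
const grainSize = 1 << 16

// serialSum adds up xs on the calling goroutine.
func serialSum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// chunkedSum splits xs into chunks of at most grain elements and sums each
// chunk in its own goroutine. Inputs that fit in a single grain are summed
// serially, since spawning and joining would cost more than the work.
func chunkedSum(xs []int, grain int) int {
	if grain < 1 {
		grain = 1
	}
	if len(xs) <= grain {
		return serialSum(xs)
	}
	numChunks := (len(xs) + grain - 1) / grain
	partial := make([]int, numChunks) // one slot per chunk, so no locking needed
	var wg sync.WaitGroup
	for c := 0; c < numChunks; c++ {
		lo := c * grain
		hi := lo + grain
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(c, lo, hi int) {
			defer wg.Done()
			partial[c] = serialSum(xs[lo:hi])
		}(c, lo, hi)
	}
	wg.Wait()
	return serialSum(partial)
}

func main() {
	xs := make([]int, 1000000)
	for i := range xs {
		xs[i] = i
	}
	// Both paths should agree; only the running time differs.
	fmt.Println(chunkedSum(xs, grainSize) == serialSum(xs))
}
```

Shrinking `grainSize` towards 1 gives the same answer but spends nearly all its time creating and scheduling goroutines, which is the effect described above.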
The other approach to massively parallel on-chip is GPGPU, which is notoriously non-GP-friendly and hard to apply to many otherwise-parallel problems. I persevered with hybrid CPU (4-core i7) and CUDA (meaty card, Fermi IIRC; I was borrowing it on a remote machine and forget the spec) when I was doing recmath contest entries, and typically the CUDA would give me nearly 2x the total performance of the 4-core i7, which is not to be sneezed at but hardly unleashing all those flops! And conditionals really killed it, since divergent branches within a warp get serialized.
AMD is pushing hard towards the APU and Intel also unified the address space for their integrated GPUs IIRC, so things do come to pass pretty much as John Carmack predicts each QuakeCon. His views on raytracing triangles for games are terribly exciting, and suggest to me a move towards more GP and MIMD GPUs in future too.
So it’ll be exciting to see how people innovate with the Mill.