I don’t seem to be able to edit, but more blah.
My suspicion, as an ex-compiler guy and now middle-ware corporate, and occasional wannabe hardware dude, is that single-threaded performance requires a few things. Obviously massive op/instruction. But more importantly, collapsing of local branches to enable that. Without that, ILP of general-purpose code is branch-limited. So, why not have predicated execution of operations in an instruction, as a bit-mask, as an extension of single-instruction predication (per 32-bit ARM)? The masked-out instructions produce None, or are NOPs. For wider ops/instruction, allow multiple masks.
I share the same cynicism that some of the VLIW audience did at one of your talks – namely, there just isn’t enough apparent ILP available. We have to waste speculative energy to change that.
For JVM middleware, most compiler speculation is around virtual method calling versus concrete run-time types. You can unwind that, just like C++ in the JIT, to predict virtual call targets, and/or you can specialize for known concrete run-time types. It kinda looks like branches in the JIT code. In addition, most commercial (managed environment) software is now strictly pointer-bound. Unlike carefully-crafted C(++), there is no language support for stack-local structures. Apparent load/store activity is huge. Garbage collection is important, and it’s unclear exactly what hardware facilities would be useful. Local allocation areas (Eden space) are organised as stacks, independent of the conceptual call-stack, but promotion out of Eden needs to be handled.
I guess I’m saying that SpecInt is a thing, g++ performance is a thing, but what we really need is support for what is running most big corporate workloads. It’s managed environments (JVM/Microsoft). I have written some JIT’s in my distant past, but I am rusty as all hell.
I suspect, though, that the ideal target architecture is different from what might be expected from SpecInt. (And yes, efficient multi-threading is part of that, but doesn’t capture the dynamic CPU demand).