I was at Hot Chips and saw this presentation. Frankly, I was stunned that Nvidia found this worthwhile.
The chip counts execution of the native ARM code to locate hot spots in the code: typically loops, but other code as well. It traps when it finds something, and then software re-optimizes the code. When the hot spot is again executed, the hardware replaces the original ARM sequence with optimized code. Essentially this is a hardware-accelerated version of what Cliff Click’s HotSpot JIT does. The optimizer can run in another core while the app continues executing the native ARM code. According to the presentation, the software optimizer does:
Reorders Loads and Stores
Improves control flow
Removes unused computation
Hoists redundant computation
Sinks uncommonly executed computation
i.e. what any old compiler does at -O4 or so. The post-optimized code is micro-ops, not native ARM, although in response to a question the presenter said that “many” micro-ops were the same as the corresponding native op.
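The count, trap, and replace cycle described above can be sketched in a few lines of Python. This is only a toy model of the idea, not Nvidia's design: the class name HotSpotProfiler, the threshold value, and the optimize callback are all made up for illustration, and the optimizer here runs synchronously rather than on another core.

```python
from collections import Counter

# Hypothetical threshold; the presentation gave no figure for this.
HOT_THRESHOLD = 3


class HotSpotProfiler:
    """Toy model: count executions per code address, optimize when hot."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()  # executions seen per block address
        self.optimized = {}      # block address -> optimized replacement

    def execute(self, addr, native_block, optimize):
        # Once a replacement exists, run it instead of the native code.
        if addr in self.optimized:
            return self.optimized[addr]()

        self.counts[addr] += 1
        if self.counts[addr] == self.threshold:
            # The "trap": hand the hot block to the software optimizer.
            # (Done inline here; the real optimizer could run elsewhere
            # while the native code keeps executing.)
            self.optimized[addr] = optimize(native_block)

        # Until the replacement is installed, keep running native code.
        return native_block()
```

With a threshold of 3, the first three executions of a block run the native version and every later execution runs whatever the optimize callback returned.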
The stunner: Nvidia claimed 2X improvement.
2X is more than the typical difference between -O0 and -O5 in regular compilers, so Nvidia’s result cannot be just a consequence of a truly appallingly bad compiler producing the native code. The example they showed was the “crafty” benchmark, which uses 64-bit data, so one possible source of the gain is if the native ARM code did everything in 32-bit emulation of 64-bit and the JIT replaced that with the hardware-supported 64-bit ops.
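To make the 32-bit emulation guess concrete, here is a rough sketch of what a single 64-bit add turns into when only 32-bit operations are available. The function names are mine, and nothing here is known to match what the native crafty build actually did; it only shows why the emulated form costs more operations per 64-bit op.

```python
MASK32 = 0xFFFFFFFF


def split64(x):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return x & MASK32, (x >> 32) & MASK32


def join64(lo, hi):
    """Rebuild a 64-bit value from its 32-bit halves."""
    return (hi << 32) | lo


def add64_via_32(a_lo, a_hi, b_lo, b_hi):
    """64-bit add done as two 32-bit adds with an explicit carry."""
    lo = a_lo + b_lo
    carry = lo >> 32                      # carry out of the low half
    lo &= MASK32
    hi = (a_hi + b_hi + carry) & MASK32   # high half absorbs the carry
    return lo, hi
```

On 32-bit ARM this is an ADDS/ADC pair where hardware 64-bit support needs one add, and it ties up twice the registers. Shifts that cross the half boundary are worse, and crafty's bitboard code does a lot of those.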
Another possibility: the hardware has two-way decode and (I think) score-boarding, so decode may be a bottleneck. If the microcode is wider-issue then they may be able to get more ILP working directly in microcode (crafty has a lot of ILP). Lastly, the optimizer may be doing trace scheduling (assuming the microcode engine can handle that) so branch behavior may be better.
But bottom line: IMO 2X on crafty reflects a chosen benchmark and an expensive workaround for defects in the rest of the system. I don’t think that the approach would give a significant gain over a decent compiler on an x86, much less a Mill. So no, I don’t see us ever taking up the approach, nor do I expect it to be adopted by other chip architectures.
My opinion only, subject to revision as the white papers and other lit become available.