Dynamic Code Optimization
NVidia recently brought dynamic code optimization back into the news with their presentation of Project Denver at Hot Chips. Adding this to the basic Mill design would probably be a bad idea, but looking forward, the Mill seems to have some features that would make this sort of optimization or translation much easier.
The Mill already specializes binaries before running them and stores the specialization. It already collects performance information from the branch predictor and saves it when the process exits, so it can be reloaded the next time the executable runs. And of course there are all those x86 or ARM binaries that it would be nice for the Mill to be able to use.
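To make that feedback loop concrete, here is a purely hypothetical sketch of the pattern being described: predictor statistics saved at process exit and reloaded on the next run. The file handling, record layout, and function names are invented for illustration; the Mill's actual specializer and predictor-persistence mechanism is not being described here.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: persist branch-predictor history across runs.
 * Record layout and file format are made up for this sketch. */
struct branch_stats {
    uint64_t site;          /* address or id of the branch site */
    uint64_t taken, total;  /* observed outcome counts          */
};

/* Called at process exit: write out the gathered statistics. */
static void save_profile(const char *path,
                         const struct branch_stats *s, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return;
    fwrite(s, sizeof *s, n, f);
    fclose(f);
}

/* Called at the next launch: seed the predictor before execution. */
static size_t load_profile(const char *path,
                           struct branch_stats *s, size_t max)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    size_t n = fread(s, sizeof *s, max, f);
    fclose(f);
    return n;
}
```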
Now, NVidia apparently has parts of their hardware devoted to making all of this easier for them, which I presume the Mill doesn't, but maybe those are things that could be added in a Mill 2.0? I'm sure the Mill team has put some thought into this, and I'd be interested in hearing about it if you can divulge it now.
I wasn’t at Hot Chips, although Ivan was, but like many I read about the Nvidia transcoding.
It reminded me strongly of Transmeta.
Because the chip must still run untranslated ARM code at a reasonable speed, it must basically be an OoO superscalar chip, with all the inefficiencies that implies. It must still have the full decode logic, etc.
And therefore the microops they cache must be very close to the ARM ops they represent.
This aside, I expect they execute very well. It's a good halfway house and underlines how expensive CISC-to-uop and even RISC-to-uop decode is; one imagines x86 chips getting much the same advantage if they stored their decoded uops too.
I was at Hot Chips and saw this presentation. Frankly I was stunned that Nvidia found this worthwhile.
The chip counts executions of the native ARM code to locate hot spots: typically loops, but other code as well. It traps when it finds something, and then software re-optimizes the code. When the hot spot is executed again, the hardware replaces the original ARM sequence with the optimized code. Essentially this is a hardware-accelerated version of what Cliff Click's HotSpot JIT does. The optimizer can run on another core while the app continues executing the native ARM code. According to the presentation, the software optimizer:
Unrolls loops
Renames registers
Reorders loads and stores
Improves control flow
Removes unused computation
Hoists redundant computation
Sinks uncommonly executed computation
Improves scheduling
i.e., what any old compiler does at -O4 or so. The post-optimized code is micro-ops, not native ARM, although in response to a question the presenter said that “many” micros were the same as the corresponding native op. The stunner: Nvidia claimed a 2X improvement.
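For readers less familiar with the technique, here is a minimal sketch (in C, with made-up names and an arbitrary threshold) of the general hot-spot pattern described above: count executions of a region, and once it crosses a threshold hand it to a software optimizer, then dispatch to the translation from then on. This illustrates the pattern, not Denver's actual hardware mechanism.

```c
#include <stdint.h>

/* Hypothetical sketch of hot-spot detection and re-optimization.
 * Hardware counting is simulated with a per-region counter here. */
typedef void (*region_fn)(void);

struct region {
    region_fn  native;      /* original ARM code for this region   */
    region_fn  optimized;   /* micro-op translation, once produced */
    uint32_t   exec_count;  /* execution counter ("hot" detector)  */
};

#define HOT_THRESHOLD 10000u   /* illustrative value */

/* Stand-in for the software optimizer running on another core:
 * unroll loops, rename registers, reorder loads/stores, etc. */
static region_fn reoptimize(region_fn native)
{
    return native;  /* placeholder: a real translator returns new code */
}

static void execute_region(struct region *r)
{
    if (r->optimized) {          /* hot path: run the translation */
        r->optimized();
        return;
    }
    if (++r->exec_count >= HOT_THRESHOLD)   /* "trap" on a hot region */
        r->optimized = reoptimize(r->native);
    r->native();                 /* keep running native ARM meanwhile */
}
```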
2X is more than the typical difference between -O0 and -O5 in regular compilers, so Nvidia's result cannot be just a consequence of a truly appallingly bad compiler producing the native code. The example they showed was the “crafty” benchmark, which uses 64-bit data, so one possible source of the gain is that the native ARM code did its 64-bit arithmetic in 32-bit emulation and the JIT replaced that with the hardware-supported 64-bit ops.
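To illustrate why that would matter, here is what a 64-bit add looks like when emulated with 32-bit operations versus done natively (the type and function names are just for this sketch). If crafty's hot paths were full of the former and the JIT emitted the latter, a large speedup would be unsurprising.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as a 32-bit-only
 * code generator would have to represent it. */
typedef struct { uint32_t lo, hi; } u64_pair;

/* Emulated: two adds plus explicit carry propagation. */
static u64_pair add64_emulated(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
    return r;
}

/* Native: a single hardware 64-bit add. */
static uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}
```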
Another possibility: the hardware has two-way decode and (I think) scoreboarding, so decode may be a bottleneck. If the microcode is wider-issue, then they may be able to get more ILP working directly in microcode (crafty has a lot of ILP). Lastly, the optimizer may be doing trace scheduling (assuming the microcode engine can handle that), so branch behavior may be better.
But bottom line: IMO 2X on crafty reflects a chosen benchmark and an expensive workaround for defects in the rest of the system. I don’t think that the approach would give a significant gain over a decent compiler on an x86, much less a Mill. So no, I don’t see us ever taking up the approach, nor do I expect it to be adopted by other chip architectures.
My opinion only, subject to revision as the white papers and other literature become available.
Well, it's also converting the code from ARM64, with 32 registers, to a 7-wide VLIW format with 64 registers, which presumably lets them do more in terms of optimization than if they had just done ARM64-to-ARM64 dynamic optimization.
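As a rough illustration of why the larger register file helps, consider a loop the translator unrolls with independent accumulators: each accumulator needs its own register, and with 64 registers none of them has to be spilled. The C below is only a sketch of that idea, not a claim about Denver's actual code generation.

```c
#include <stddef.h>
#include <stdint.h>

/* Unrolled-by-4 reduction: four live accumulators, each of which the
 * translator can rename to its own register in a 64-register file. */
int64_t sum_unrolled(const int64_t *a, size_t n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* main unrolled loop */
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)             /* remainder loop */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```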
NVidia does have all of Transmeta's IP, and from the outside this looks like an Efficeon with some capacity to execute ARM instructions in hardware, letting them optimize only the hot spots, plus a few other changes. I don't see that translating ARM instructions into Mill instructions would be any more difficult than what Transmeta initially did with VLIW, but I suppose that might be the reason NVidia seems to be doing better than Transmeta did.
Good luck licensing your load semantics to NVidia then?
Come to think of it, I don't remember whether they mentioned anything about the different instruction formats in the Hot Chips slides I was able to see. They do talk about it in the whitepaper they put out, however.
Thx for the link. I haven’t read it yet, but do they talk about auto-vectorizing non-vector code? Can they take code that doesn’t have neon or whatever and turn it into vectors and such?
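For concreteness, the question is whether the translator can take plain scalar code like the first function below and emit NEON-style vector code like the second. Both versions are hand-written here with NEON intrinsics purely to illustrate the transformation; this is not a claim about what Denver's optimizer actually does.

```c
#include <arm_neon.h>
#include <stddef.h>

/* Scalar saxpy: the kind of code a non-vectorizing ARM compiler emits. */
void saxpy_scalar(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* The same loop expressed with NEON, four floats per iteration. */
void saxpy_neon(float *y, const float *x, float a, size_t n)
{
    float32x4_t va = vdupq_n_f32(a);       /* broadcast a into a vector */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vy = vmlaq_f32(vy, va, vx);        /* vy += va * vx */
        vst1q_f32(y + i, vy);
    }
    for (; i < n; i++)                     /* scalar tail */
        y[i] += a * x[i];
}
```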