Yes, you’ve got it right: we are comparing a 68k emulator running on Silver against the same emulator running on an (emulated) 68k. The 68k binary for the emulator was just compiled on a normal 68k-target compiler; I think he used the Apple compiler, but didn’t ask. I’m pretty sure that the 68k code didn’t do pipelining, and for sure it didn’t vectorize because 68k didn’t have vectors, but the 68k binary had all the normal compiler optimizations produced by a modern compiler for that target.
The 68k emulator models instruction-by-instruction execution – it’s an emulator, not a simulator – but that would be close to an in-order single issue machine as far as instruction counts; stalls it doesn’t model. That’s why we didn’t compare cycles, just instruction (bundles and ops) counts. Ignoring stalls (which were probably only from memory and similar for both), each 68k instruction and Mill bundle represent one issue cycle; the ratio reflect the Mill IPC resulting from the bundle width. Each 68k instruction and Mill operation represents an action taken by a functional unit; the ratio reflects the more powerful Mill operations. At a guess, the bulk of that difference is in call sequences, where the 68k is using a lengthy multi-instruction sequence and the Mill uses a single op.
The Mill code is running on our Mill simulator modeling the Silver configuration, not on chip or FPGA hardware. The sim in turn is running on one of our x86-based development servers, but that’s invisible behind the simulated machine. The FPGA work won’t be after our next funding round; stay tuned.
The missing inlining etc is all in the compiled Mill code of the emulator; those compiler features are just coming up, and the emulator is a big program and to get it to compile Josh disabled some of the newer and more buggy parts of the compiler. I suspect inlining and pipelining would make little difference to the counts when enabled because they improve cycle time and overall program latency (which are not measured by the emulator) but do little to change total instruction counts. Vectorization would reduce instruction counts (as well as improve latency) but there’s usually little opportunity to vectorize in the kind of control-flow-heavy code typical of emulators.