- Ivan GodardKeymasterJuly 29, 2019 at 11:53 amPost count: 565
Josh got the xv68k emulator (https://www.v68k.org/xv68k/) sorta working on the Silver Mill config. “sorta” == no pipelining, inlining, or vectorization, pre-alpha code gen quality.
Xv68k emulating itself emulating a hand 68k-assembler version of helloWorld takes 55335 68k instructions executed. The same emulation run on Silver takes 16954 Mill instruction bundles executed. Most of that reflects Mill width, but Mill used only 41482 instruction operations, reflecting a less loquacious ISA too.
Granted, the 68k ISA is a pretty easy target, but this seems encouraging.
- joseph.h.garvinParticipantJuly 29, 2019 at 1:47 pmPost count: 15
Ivan could you expand a bit on what is being compared?
You’re saying Josh compiled the xv68k emulator, targeting Mill silver, and then compiled it again targeting the 68k, then ran the emulator on the emulator, then inside that inner emulator ran hello world?
And then we are comparing the number of instructions the Mill executed during the whole run vs the number reported by the outer emulator that a real 68k would have run emulating itself running hello world?
Also is this using the C++-as-mill-assembly approach detailed in the talks or is this FPGA?
Also the lack of inlining, pipelining etc is regarding the Mill, not the emulated 68k, right?
- Ivan GodardKeymasterJuly 29, 2019 at 3:58 pmPost count: 565
Yes, you’ve got it right: we are comparing a 68k emulator running on Silver against the same emulator running on an (emulated) 68k. The 68k binary for the emulator was just compiled on a normal 68k-target compiler; I think he used the Apple compiler, but didn’t ask. I’m pretty sure that the 68k code didn’t do pipelining, and for sure it didn’t vectorize because 68k didn’t have vectors, but the 68k binary had all the normal compiler optimizations produced by a modern compiler for that target.
The 68k emulator models instruction-by-instruction execution (it’s an emulator, not a simulator), but that would be close to an in-order single-issue machine as far as instruction counts go; it doesn’t model stalls. That’s why we didn’t compare cycles, just instruction (bundle and op) counts. Ignoring stalls (which were probably only from memory and similar for both), each 68k instruction and each Mill bundle represents one issue cycle; the ratio reflects the Mill IPC resulting from the bundle width. Each 68k instruction and each Mill operation represents an action taken by a functional unit; the ratio reflects the more powerful Mill operations. At a guess, the bulk of that difference is in call sequences, where the 68k is using a lengthy multi-instruction sequence and the Mill uses a single op.
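For anyone who wants the ratios spelled out, here is a small sketch of the arithmetic using the three counts from the opening post. The function and key names are made up for illustration; only the three counts come from the run.

```python
# Counts quoted in the opening post of this thread.
M68K_INSTRUCTIONS = 55335  # 68k instructions executed
MILL_BUNDLES = 16954       # Mill instruction bundles executed on Silver
MILL_OPS = 41482           # Mill operations executed on Silver


def ratios(m68k_instructions, mill_bundles, mill_ops):
    """Return the three ratios discussed above (names are illustrative)."""
    return {
        # 68k instructions per Mill bundle: one issue cycle each, stalls ignored.
        "issue_ratio": m68k_instructions / mill_bundles,
        # Average number of Mill operations packed into each bundle.
        "ops_per_bundle": mill_ops / mill_bundles,
        # 68k instructions per Mill operation: how much work one op does.
        "op_ratio": m68k_instructions / mill_ops,
    }


result = ratios(M68K_INSTRUCTIONS, MILL_BUNDLES, MILL_OPS)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
# issue_ratio: 3.26
# ops_per_bundle: 2.45
# op_ratio: 1.33
```

So roughly 3.3x fewer issue cycles, of which about 2.4x comes from width (ops per bundle) and about 1.3x from each op doing more work.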
The Mill code is running on our Mill simulator modeling the Silver configuration, not on chip or FPGA hardware. The sim in turn is running on one of our x86-based development servers, but that’s invisible behind the simulated machine. The FPGA work won’t start until after our next funding round; stay tuned.
The missing inlining etc is all in the compiled Mill code of the emulator; those compiler features are just coming up, and the emulator is a big program and to get it to compile Josh disabled some of the newer and more buggy parts of the compiler. I suspect inlining and pipelining would make little difference to the counts when enabled because they improve cycle time and overall program latency (which are not measured by the emulator) but do little to change total instruction counts. Vectorization would reduce instruction counts (as well as improve latency) but there’s usually little opportunity to vectorize in the kind of control-flow-heavy code typical of emulators.
- goldbugParticipantJuly 31, 2019 at 1:51 pmPost count: 45
These are pretty awesome and encouraging numbers, Ivan.
“I suspect inlining and pipelining would make little difference to the counts when enabled because they improve cycle time and overall program latency”
Wouldn’t inlining help a lot with the instruction count? You eliminate the call operation, and if the inlined function is small, you might even be able to squeeze its operations into existing instructions, making the inlined function essentially free.
- Ivan GodardKeymasterJuly 31, 2019 at 3:14 pmPost count: 565
Inlining wouldn’t help with op counts, unless the specializer can eliminate the branch in/branch-out that replaces the call/return. It can usually do that, but the gain should be low except for *very* short functions.
You are right that the inlined function body can often be folded into the width, improving bundle counts. How much improvement depends on whether the calculations inside the function are in the critical dataflow path of the caller. But if caller and callee bundles are already pretty full for the target member then folding may not produce much impact on bundle count, or on latency for that matter.
Another consideration with inlining is whether you can apply post-inline optimizations. Often arguments to the call are constants that control callee control flow, and you can toss a lot of the code as unnecessary to inline.
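To make the constant-argument case concrete, here is a hand-written before/after sketch. It is not output from our tool chain, and Python itself does no such inlining; the function names are invented purely to show what a compiler can discard once the call is inlined.

```python
def scale(value, mode):
    # General-purpose callee: the caller's `mode` argument selects the path.
    if mode == "double":
        return value * 2
    elif mode == "negate":
        return -value
    else:
        raise ValueError(f"unknown mode: {mode}")


def caller_before(x):
    # Before inlining: a call, with `mode` passed as a constant.
    return scale(x, "double") + 1


def caller_after(x):
    # After inlining with mode == "double" folded in: the mode tests, the
    # "negate" branch and the error branch are all dead code and disappear.
    return x * 2 + 1


print(caller_before(5), caller_after(5))  # 11 11
```

Only the multiply survives from the callee, and it is a candidate for folding into a bundle the caller already issues.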
- goldbugParticipantAugust 10, 2019 at 7:31 amPost count: 45
- Ivan GodardKeymasterAugust 10, 2019 at 9:34 amPost count: 565
Code size varies by member. Because we mechanically create a custom entropy-optimal encoding for each member, the bitsize depends both on how many ops are supported in a slot, and the number of bits needed to express architectural constants such as a belt position number. Thus the opcodes on a Tin do not have to be able to represent floating-point ops because those are software (and hence function calls in the binary) on a Tin.
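As a rough illustration of why the bit count moves with the member, the sketch below sizes an op from plain fixed-width fields. This is a simplification, not the generated encoding (which is not restricted to separate whole-bit fields), and the op counts and belt lengths are hypothetical numbers chosen only for the example.

```python
def field_bits(distinct_values):
    """Bits needed for a fixed-width field that distinguishes this many values."""
    if distinct_values < 1:
        raise ValueError("a field must distinguish at least one value")
    return (distinct_values - 1).bit_length()


def op_bits(ops_in_slot, belt_length, belt_operands):
    """Approximate op size: one opcode field plus one belt-position field per operand."""
    return field_bits(ops_in_slot) + belt_operands * field_bits(belt_length)


# Hypothetical small member: 50 ops in the slot, belt of 8, two belt operands.
small = op_bits(ops_in_slot=50, belt_length=8, belt_operands=2)
# Hypothetical large member: 120 ops in the slot, belt of 32, two belt operands.
large = op_bits(ops_in_slot=120, belt_length=32, belt_operands=2)

print(small, large)  # 12 17
```

A slot that supports fewer ops, or a member with a shorter belt, needs fewer bits for the same two-operand op.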
In addition, the encoding does not let us provide all the entropy needed by some ops in a single slot. Those are encoded as a “gang”, using two slots. The bitwise entropy for the op is really that of two slots, which makes valid op sizes a bit complicated to figure. Also, we haven’t tuned for space yet. Those gangs in particular may be able to encode better if we move some of the fields into the second slot, even if we can fit them into the first.
Finally, for inter-ISA comparisons, there are the differences between machines in the number of ops needed to represent a single semantic notion, such as a function call. A small op size doesn’t save much if it takes a dozen ops to do the job that one does with a longer op. x86 compared with RISC encodings suffers from the same apples-to-oranges problem.
That said, for whole-program size when built for space we seem to be marginally more compact than x86 for all members, and significantly more compact than the 64-bit RISC machines. We seem close to the sizes of 16-bit encodings, worse on the larger members.
However, take this with a grain of salt: we haven’t paid much attention to size, neither measurement nor improvement.
- hayestiParticipantAugust 9, 2019 at 2:17 amPost count: 12
Ivan: When can we expect SPEC numbers–simulated or otherwise?
- Ivan GodardKeymasterAugust 9, 2019 at 11:19 amPost count: 565
We will not have SPEC numbers until after the next funding round. SPEC.org is a standards group with strict membership and publication rules that restrict use of the benchmarks to group members. Membership is relatively inexpensive for academic use, but commercial entities such as Mill Computing must pay substantial fees to join or use the benchmarks. That is, the fees are not substantial for the likes of Intel, but they are for us.
SPEC.org does valuable work and needs the fees to provide its services to the community. We don’t object to the fees, but find that in practice they are not in our budget.
- CyberaxParticipantSeptember 27, 2019 at 11:04 pmPost count: 1
Can you try non-SPEC benchmarks like dhrystone or whetstone? Perhaps CoreMark?