Forum Replies Created
- in reply to: Benchmarks #3497
Yes, you’ve got it right: we are comparing a 68k emulator running on Silver against the same emulator running on an (emulated) 68k. The 68k binary for the emulator was just compiled with a normal 68k-target compiler; I think he used the Apple compiler, but didn’t ask. I’m pretty sure that the 68k code didn’t do pipelining, and for sure it didn’t vectorize because the 68k had no vector hardware, but the 68k binary had all the normal compiler optimizations produced by a modern compiler for that target.
The 68k emulator models instruction-by-instruction execution – it’s an emulator, not a simulator – but that is close to an in-order single-issue machine as far as instruction counts go; it doesn’t model stalls. That’s why we didn’t compare cycles, just instruction (bundle and op) counts. Ignoring stalls (which were probably only from memory, and similar for both), each 68k instruction and each Mill bundle represents one issue cycle; that ratio reflects the Mill IPC resulting from the bundle width. Each 68k instruction and each Mill operation represents an action taken by a functional unit; that ratio reflects the more powerful Mill operations. At a guess, the bulk of that difference is in call sequences, where the 68k uses a lengthy multi-instruction sequence and the Mill uses a single op.
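As a hedged illustration (the function is made up for the example), this is the kind of call in question:

```c
/* Purely illustrative. The call below needs a multi-instruction
   sequence on the 68k (push the argument, jsr, pop the stack),
   while per the above the Mill encodes the whole call as one op. */
static int square(int x) { return x * x; }

int use_square(void) { return square(7); }
```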
The Mill code is running on our Mill simulator modeling the Silver configuration, not on chip or FPGA hardware. The sim in turn is running on one of our x86-based development servers, but that’s invisible behind the simulated machine. The FPGA work won’t be until after our next funding round; stay tuned.
The missing inlining etc. is all in the compiled Mill code of the emulator; those compiler features are only just coming up, and the emulator is a big program, so to get it to compile Josh disabled some of the newer and buggier parts of the compiler. I suspect inlining and pipelining would make little difference to the counts when enabled, because they improve cycle time and overall program latency (which these runs don’t measure) but do little to change total instruction counts. Vectorization would reduce instruction counts (as well as improve latency), but there’s usually little opportunity to vectorize in the kind of control-flow-heavy code typical of emulators.
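To make the vectorization point concrete, a small hedged contrast in C (names invented):

```c
/* Vectorizable: independent elementwise arithmetic over an array. */
void scale_all(float *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}

/* Emulator-style dispatch: the data-dependent branch on each opcode
   leaves a vectorizer almost nothing to work with. */
int step(unsigned char opcode, int acc) {
    switch (opcode) {
    case 0:  return acc + 1;
    case 1:  return acc - 1;
    default: return acc;
    }
}
```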
- in reply to: Benchmarks #3495
Josh got the xv68k emulator (https://www.v68k.org/xv68k/) sorta working on the Silver Mill config. “sorta” == no pipelining, inlining, or vectorization, pre-alpha code gen quality.
Xv68k emulating itself emulating a hand-written 68k assembler version of helloWorld takes 55335 68k instructions executed. The same emulation run on Silver takes 16954 Mill instruction bundles executed. Most of that reflects Mill width, but Mill used only 41482 instruction operations, reflecting a less loquacious ISA too.
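For concreteness, the ratios those counts give (my arithmetic, not separately measured figures):

$$\frac{55335}{16954} \approx 3.26 \;\text{68k instructions per Mill bundle}, \qquad \frac{55335}{41482} \approx 1.33 \;\text{68k instructions per Mill op}$$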
Granted, the 68k ISA is a pretty easy target, but this seems encouraging.
An update: conforming is now built into branch ops, so the only use of a separate op is to recover live values from the mix of live and dead, as you suggest. Consequently the former “conform” op is now spelled “rescue”.
Rescue is cheap in execution time but expensive in code space when a value is used only far from where it is defined. A significant part of the specializer’s heuristics is devoted to finding a balance between the use of rescue and of spill-fill.
We looked into auto-kill, but so far it seems to not fit the hardware timing; figuring out how things get reordered in the specializer is faster than trying to do it at runtime.
Rescue is only useful when the currently live values fit on the belt but some are out of reach. If the live-value count exceeds the belt size then there’s nothing for it but spill and refill when belt pressure is lower.
- in reply to: Benchmarks #3507
Code size varies by member. Because we mechanically create a custom entropy-optimal encoding for each member, the bit size depends both on how many ops a slot supports and on the number of bits needed to express architectural constants such as a belt position number. Thus the opcodes on a Tin do not have to be able to represent floating-point ops, because those are software (and hence function calls in the binary) on a Tin.
In addition, the encoding does not let us provide all the entropy needed by some ops in a single slot. Those are encoded as a “gang”, using two slots. The bitwise entropy for such an op is really that of two slots, which makes valid op sizes a bit complicated to figure. Also, we haven’t tuned for space yet; the gangs in particular may encode better if we move some of the fields into the second slot, even when they would fit in the first.
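As a rough sketch of the entropy-optimal idea – hypothetical field layout, not the real encoding – a slot’s width is just the sum of the minimum bits its fields need:

```c
#include <math.h>

/* Hypothetical sketch: minimum bits to encode one of `count` values. */
static unsigned bits_for(unsigned count) {
    return count <= 1 ? 0 : (unsigned)ceil(log2((double)count));
}

/* Slot width varies by member: opcode bits for however many ops the
   slot supports, plus a belt-position field per operand. A member
   that traps FP ops to software (like Tin) needs fewer opcodes and
   hence narrower slots. */
static unsigned slot_bits(unsigned ops_in_slot, unsigned belt_size,
                          unsigned operands) {
    return bits_for(ops_in_slot) + operands * bits_for(belt_size);
}
```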
Finally, for inter-ISA comparisons, there are the differences between machines in the number of ops needed to represent a single semantic notion, such as a function call. A small op size doesn’t save much if it takes a dozen ops to do the job that one longer op does. X86 compared with RISC encodings suffers from the same apples-to-oranges problem.
That said, for whole-program size when built for space we seem to be marginally more compact than x86 for all members, and significantly more compact than the 64-bit RISC machines. We seem close to the sizes of the 16-bit encodings, though worse on the larger members.
However, take this with a grain of salt: we haven’t paid much attention to size, neither measurement nor improvement.
- in reply to: Benchmarks #3504
We will not have SPEC numbers until after the next funding round. SPEC.org is a standards group with strict membership and publication rules that restrict use of the benchmarks to group members. Membership is relatively inexpensive for academic use, but commercial entities such as Mill Computing must pay substantial fees to join or use the benchmarks. That is, the fees are not substantial for the likes of Intel, but they are for us.
SPEC.org does valuable work and needs the fees to provide its services to the community. We don’t object to the fees, but find that in practice they are not in our budget.
- in reply to: Benchmarks #3502
Inlining wouldn’t help with op counts unless the specializer can eliminate the branch-in/branch-out that replaces the call/return. It can usually do that, but the gain should be low except for *very* short functions.
You are right that the inlined function body can often be folded into the width, improving bundle counts. How much improvement depends on whether the calculations inside the function are on the critical dataflow path of the caller. But if caller and callee bundles are already pretty full for the target member, then folding may not have much impact on bundle count, or on latency for that matter.
Another consideration with inlining is whether you can apply post-inline optimizations. Often arguments at the call site are constants that control callee control flow, and after inlining you can toss a lot of the callee code as unnecessary.
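A hedged C illustration of that folding (the function and flag are invented for the example):

```c
/* Hypothetical callee: a flag argument controls its control flow. */
static int scale(int v, int round_up) {
    if (round_up)
        return (v + 7) / 8;
    return v / 8;
}

int caller(int v) {
    /* After inlining scale(v, 0), the round_up path is provably dead,
       so the compiler keeps only the equivalent of "return v / 8". */
    return scale(v, 0);
}
```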
Don’t know anything about “A20” either, nor why you ask. I did mention “L4”, see https://en.wikipedia.org/wiki/L4_microkernel_family
Much less burdensome on a Mill, but still will be a long time and a lot of work, yes.
However, we need to start the OS work so as to have a test and verification code suite for the system aspects of the chip. Even an Arduino needs interrupt handling, atomics, … Standard benchmarks don’t exercise those parts of the design.
Various announcements coming soon (for some value of soon).
Ivan
What’s a daxpy?
Daxpy is a BLAS function, the most commonly used: “double precision A times X plus Y”; see https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_1. It’s why hardware has FMA (a.k.a. MAC) ops.
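For reference, a minimal C sketch of the operation (the real BLAS routine also takes stride arguments, omitted here):

```c
#include <stddef.h>

/* daxpy: y = a*x + y over double-precision vectors. Each iteration is
   one multiply-add, which is why FMA hardware pays off for BLAS-heavy
   code. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  /* one FMA per element where available */
}
```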
And when you write “cycle,” do you mean a clock cycle or a Mill instruction?
Absent stalls, Mill issues all instructions of one bundle (formerly described as “all operations of one instruction”, except people got confused by the terminology) in one clock cycle. So cycle and bundle (instruction) are equivalent.
- in reply to: Scratchpad design decision #3479
We expect that the next talk will be less about how the architecture works and more about how well it works – measured performance numbers, guesstimates for area and power, that kind of thing.
- in reply to: Scratchpad design decision #3475
As a slight variation on the multiple-values-in-one-go theme, would a “touch”-like operation that specifies multiple belt positions (maybe just in the last N positions) that are needed again soon help in any way?
That’s what “rescue” does.
Also, never having worked with an architecture that exposed a scratchpad to me, I am wondering how it’s typically used. If it were for constants that are needed a few times during execution of code, I imagine normal memory addressing and the cache system would work just fine. Is there a typical cutoff point where the scratchpad starts to benefit, and at what sizes?
The compiler doesn’t load anything twice from memory/cache that it knows is unchanged; memory bandwidth is limited and cache is a power hog. Generally literal constants and some LEAs are simply redone if needed a second time, rather than being run off to scratch. We should probably revisit this for quad FP literals, assuming that a spill/fill is cheaper than a 16-byte literal, but we don’t have much real experience with quad yet. Some programs have memory arrays of constants that they load with explicit addressing; LLVM is a little erratic in how it treats those loads (to constexpr or not), but we just take what LLVM gives us.
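An illustrative C fragment (made-up values) of the rematerialization choice described above:

```c
double f(double a, double b) {
    /* The second use of the literal is simply re-encoded rather than
       spilled to scratchpad and filled back: redoing a constant is
       cheaper than a spill/fill round trip (quad FP literals may be
       the exception noted above). */
    double x = a * 2.718281828459045;
    double y = b * 2.718281828459045;
    return x + y;
}
```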
- in reply to: Scratchpad design decision #3474
Rescue does in fact have twice the range of other references. Phases are an abstraction, and there are actually no intra-cycle data movements. Consequently, at the end of the cycle (when rescue takes place) there are physically available the belt’s worth of operands that started the cycle plus however many were dropped during the cycle, which is up to twice belt size. Congratulations; I don’t think anyone has spotted that and asked about it before.
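As a worked example, taking a hypothetical belt of 16 positions:

$$\text{reachable at cycle end} \;=\; \underbrace{16}_{\text{belt at cycle start}} \;+\; \underbrace{\le 16}_{\text{drops during the cycle}} \;\le\; 2 \times 16 = 32$$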
- in reply to: Scratchpad design decision #3461
I’m not sure I’ve understood your suggestions; if my response here seems inapt to you then please post again.
First, as to “sector”. This seems to be a notion where there are multiple distinct address spaces, with one of them designated as “current” somehow, so one can address within a sector using fewer bits than would be required to address across sectors. If that’s right, then in effect we already have such sectoring: “sectors” are created and destroyed automatically by the call/return mechanism, so each called function has its own private sector. This is a little different from your sectors though, because there is no inter-sector addressing; a function is stuck with its sector and cannot switch to another.
Then, as to the “chain of waterfalls” idea, where each space “falls off the end” into the next space. This won’t really help, because everything ever produced at the top will eventually fall over every waterfall – which means that the path to DRAM (the last fall) would have to have the same bandwidth as the rate of result production at the top, in the core; the whole machine would run at memory speed, which in core terms is sloooow. The chain is also pointless, because most operands are dead before they fall off, so there’s no need to preserve them over the subsequent falls.
So there needs to be some way to designate which operands are to be discarded and which passed on to the next level. The Mill belt by default discards everything, with the program explicitly saving any still alive. It could be done the other way, saving everything except what is explicitly discarded, but then one would still need to say where things are to be saved, and yes, an auto-save to a belt-like next level is possible. However, that level in turn would have to be kept congruent over control flow, which costs bits too. Which is the better trade-off depends on the likelihood of control flow over the life of an operand on the second level and the size of that level; looking at the code we are trying to execute, we decided that using spatial addressing on the second (scratchpad) level worked best.
Then, you suggest a “save many” and “restore many” so as to reduce the save/restore entropy, analogous to the register bulk save/restore of legacy ISAs. A “many” operation would be easier on a microcoded machine (the Mill doesn’t use microcode) but could certainly be implemented in a non-microcoded physical architecture. However, to be useful the lifetimes of the operands in the “many” would need to be similar, so they could be saved and restored as a group. While the starts of lives in a group would be similar (everything needing saving that was produced since the last save), the ends of lives turn out to be very different, almost random, which rules out a “restore many”. This suggests that a “save many” with “restore one” design would be possible. It’s not clear to me whether the entropy saving of a “save many” operation would justify its hardware complication; needs thinking about.
Then about sector+offset addressing. The value of this depends on whether multiple references are contiguous enough to pay the cost of switching sectors (and keeping the “current sector” congruent over control flow). Looking into that and doing actual measurement would be a good intern project, but by eyeball based on the code that we have running now I’m inclined to think that the reference patterns are too random to offer enough clustering value. It would also complicate the software, which is why sectoring has dropped out of use in legacy architectures.