Forum Replies Created
- in reply to: Speculative execution #1434
In a sense nearly all Mill operations are speculative; only those (like store) that change globally visible state cannot be speculated. So you can issue as many loads as there are functional units to handle the issue and retire stations to hold the values. An OOO machine has a similar constraint: it cannot have more loads in flight than it has issue and buffer capacity to handle. Consequently, in this respect the Mill and an OOO machine are equally constrained for a given amount of hardware.
Now to your example. You assume that the Mill will issue four loads and stall picking up the first one, while the OOO can issue six more. The problem with the example is: why didn’t the Mill issue 10 loads before it tried the first pickup? If there is code after the load that doesn’t depend on the load, the compiler is supposed to hoist that code to either before the load issue or between the load issue and the load pickup. If the compiler did not hoist that code then there is a compiler bug, not an architecture problem. We’ll have compiler bugs for sure (I write compilers) but we fix them.
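As a rough illustration of that hoisting (plain C; do_independent_work() and use() are hypothetical stand-ins, not anything from the Mill tool chain):

```c
/* Sketch only. Prototypes for the hypothetical helpers. */
int use(int x);
int do_independent_work(void);

/* Naive order: the use of *p immediately follows the load,
   so the whole load latency is exposed. */
int naive(const int *p) {
    int x = *p;                        /* load issues here            */
    int y = use(x);                    /* ...and is needed right away */
    int z = do_independent_work();
    return y + z;
}

/* Hoisted order: the independent work sits between the load and its
   first use, hiding some or all of the load latency. On the Mill the
   deferred-load encoding expresses this separation between issue and
   pickup explicitly. */
int hoisted(const int *p) {
    int x = *p;                        /* load issues here            */
    int z = do_independent_work();     /* fills the deferral window   */
    int y = use(x);                    /* pickup: value needed now    */
    return y + z;
}
```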
Incidentally, the purpose of the Mill deferred-load facility is to mask the L1 and L2 latencies, not to mask the DRAM latency; masking DRAM would require being able to hoist the issue an impractical distance before the pickup. Instead there are other (NYF) mechanisms to mask the DRAM latency.
- in reply to: Pipelining #1416
When we get working chips with the hardware support, I suspect the language enthusiasts will have a field day incorporating Mill-isms into their languages. We’d be happy to support their efforts, with hardware and development assistance. For commercial reasons, our own work will focus on those languages that we ourselves or critical customers need.
- in reply to: Dynamic Code Optimization #1365
I was at Hot Chips and saw this presentation. Frankly I was stunned that Nvidia found this worthwhile.
The chip counts execution of the native ARM code to locate hot spots in the code: typically loops, but other code as well. It traps when it finds something, and then software re-optimizes the code. When the hot spot is again executed, the hardware replaces the original ARM sequence with the optimized code. Essentially this is a hardware-accelerated version of what Cliff Click’s HotSpot JIT does. The optimizer can run in another core while the app continues executing the native ARM code. According to the presentation, the software optimizer:
- Unrolls loops
- Renames registers
- Reorders loads and stores
- Improves control flow
- Removes unused computation
- Hoists redundant computation
- Sinks uncommonly executed computation
- Improves scheduling
i.e., what any old compiler does at -O4 or so. The post-optimized code is micro-ops, not native ARM, although in response to a question the presenter said that “many” micros were the same as the corresponding native op. The stunner: Nvidia claimed a 2X improvement.
2X is more than the typical difference between -O0 and -O5 in regular compilers, so Nvidia’s result cannot be just a consequence of a truly appallingly bad compiler producing the native code. The example they showed was the “crafty” benchmark, which uses 64-bit data, so one possible source of the gain is if the native ARM code did everything in 32-bit emulation of 64-bit and the JIT replaced that with the hardware-supported 64-bit ops.
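For a sense of what that emulation costs, here is a minimal sketch (plain C, not the actual ARM code) of a 64-bit add done with 32-bit halves; if the JIT replaces sequences like this with single hardware 64-bit ops, a large gain on 64-bit-heavy code is plausible:

```c
#include <stdint.h>

/* 64-bit add emulated with 32-bit operations: add the low halves,
   detect the carry, then add the high halves plus the carry.
   Native 64-bit hardware does this in a single op. */
uint64_t add64_emulated(uint32_t a_lo, uint32_t a_hi,
                        uint32_t b_lo, uint32_t b_hi)
{
    uint32_t lo    = a_lo + b_lo;
    uint32_t carry = (lo < a_lo);            /* unsigned overflow test */
    uint32_t hi    = a_hi + b_hi + carry;
    return ((uint64_t)hi << 32) | lo;
}
```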
Another possibility: the hardware has two-way decode and (I think) score-boarding, so decode may be a bottleneck. If the microcode is wider-issue then they may be able to get more ILP working directly in microcode (crafty has a lot of ILP). Lastly, the optimizer may be doing trace scheduling (assuming the microcode engine can handle that) so branch behavior may be better.
But bottom line: IMO 2X on crafty reflects a chosen benchmark and an expensive workaround for defects in the rest of the system. I don’t think that the approach would give a significant gain over a decent compiler on an x86, much less a Mill. So no, I don’t see us ever taking up the approach, nor do I expect it to be adopted by other chip architectures.
My opinion only, subject to revision as the white papers and other lit becomes available.
- in reply to: Speculative execution #1446
You can pipeline the loads as far ahead as you want, up to the maximum deferral expressible in the encoding or, equivalently, the maximum number of distinct pick-tokens expressible.
An OOO machine potentially does do a better job than the Mill at hiding DRAM latency for simple loads; that is not the purpose of deferral, although deferral does help somewhat. There are even some cases where OOO can do a better job of hiding cache latency; those cases mostly involve cascaded loads. We believe such cases to be insignificant, but as yet we have no numbers to back that belief.
The Mill does extensive prefetch on the code side; the mechanism was explained in the Prediction talk. What it does on the data side must await a future talk.
- in reply to: Speculative execution #1442
Unrolling is unnecessary. See the Pipelining talk for how this works.
- in reply to: Speculative execution #1436
You have colliding repliers
Let me expand on Will’s comments re vectors. The present Mill opset assumes that single data elements, whether scalars or vector elements, are accessed with an inherent width represented in the metadata. Operations can produce new drops with a different width, but there’s no way to view a single value as having simultaneously different widths. That is, there’s no hardware equivalent of a reinterpret_cast<...>.
The advantage of this assumption is that ordinary code from HLLs does not need to represent widths in the operation encoding, and will run unchanged on Mill members with radically different data encodings. The drawback is that code that is playing machine-level games at the representation level cannot do some of the things it might want to do.
The problem with the code you describe is that it is explicitly machine dependent. If you are interpreting the same bucket of bits with completely different formats, then your code is assuming what that bucket looks like. That won’t run on a machine with a different size or shape of bucket. For example, try to write your vector code in such a way that it works correctly on a 386, MMX, and SSE, without testing the hardware availability of those facilities. Any algorithm for which you have to ask “How big is a vector?” is machine-dependent. The Mill doesn’t do machine-dependent; it does portable. The Mill claims that it will run language-standard code correctly on all members, without rewrite or member-dependent testing. The Mill does not claim that it will run machine-dependent assembler (or equivalent intrinsics) from other machines.
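A minimal sketch of the distinction (plain C; the SSE loop is just an illustration of width-dependent code, not anything Mill-specific): the first version asks “how big is a vector” and is welded to one machine; the second says only what the data means and leaves the vector width to the target.

```c
#include <stddef.h>

#define N 1024

/* Machine-dependent: baked-in assumption of a 4-lane, SSE-shaped
   "bucket of bits". It will not run, or will run wrongly, on a
   machine whose bucket has a different size or shape. */
#ifdef __SSE__
#include <xmmintrin.h>
void scale_sse(float *a, float s)
{
    __m128 vs = _mm_set1_ps(s);
    for (size_t i = 0; i < N; i += 4)      /* "how big is a vector?" */
        _mm_storeu_ps(a + i, _mm_mul_ps(_mm_loadu_ps(a + i), vs));
}
#endif

/* Portable: states only the intent; the compiler/specializer picks
   the vector width for whatever member it targets. */
void scale_portable(float *a, float s)
{
    for (size_t i = 0; i < N; i++)
        a[i] *= s;
}
```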
That said, as we get more real code examples through the tool chain we may find holes for which the present opset is insufficient. We have a backlog of candidate ops, many of them for vectors, waiting on more experience. If there is a way to express these candidates that makes sense on every Mill member then we are likely to put them in; the Mill is still a work in progress. You can help, if you would: look at some of your vector code of concern, and think about what you are actually doing (mentally) to the data to give you the different view for the next operation. Feel free to post examples here.
A full crossbar connects every source to every sink simultaneously. The Mill contains a full crossbar, and indeed each sink can obtain an operand from any source, even all sinks from a single source. On larger Mills the crossbar is whopping big, although smaller than the similar structure on an OOO machine because there are no rename registers and hence fewer sources. There is no stall.
The time to transit a crossbar depends on the fanout, i.e. the number of possible sources that can feed a sink; this is a natural consequence of the mux tree implementation. This latency has a significant impact on cycle time, which must include a crossbar transit. Uniquely, the Mill splits the crossbar into a small crossbar (the fastpath) which contains only a small fraction of the total sources, and a big one (the slow path) that contains the rest of the sources. The fastpath crossbar is sized so as to have no cycle-time impact. The slowpath does have cycle-time impact, or would have if the slowpath had to fit in a cycle. However, the Mill is organized so that everything using the slowpath is already multi-cycle, and we simply accept the added latency of slowpath for those already-slower operations.
Of course, many multi-cycle operations do not fill their last cycle, and can use the slowpath without adding another cycle of latency. And many don’t quite fit: an op that would be three cycles if it could use the fastpath becomes four cycles when it has to use the slowpath. However, over the entire workload, letting popular ops use the fastpath is a winner.
- in reply to: Pipelining #1422
The Mill pipelining code in the middle-end will be contributed. However, we don’t expect there to be much. As I understand it, LLVM already does pipeline analysis, and we will use that. Our job will be to remove/bypass the LLVM code that actually does the pipelining transformations and simply pass on the fact that pipelining is applicable, along with detected parameters such as the iteration interval, to the specializer, where the needed transformations will be applied.
The specializer is coming along nicely. LLVM is still too unfamiliar to project a schedule.
I thank you for your comments on dynamic language issues. They prompted internal design work that has led to a new approach for VMs on the Mill. NYF, though.
- in reply to: Pipelining #1409
Ah yes – I recall reading that, but never used it and promptly forgot.
I’m allergic to intrinsics, especially when they don’t overload. With five widths, signed and unsigned, that’s 10 functions per operation and 20 or so operations. Squared if you want mixed-mode support. Yuck.
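To make the combinatorics concrete, here is a sketch of what a non-overloaded C intrinsic set looks like for just one operation, saturating add; the names are hypothetical, not actual Mill intrinsics:

```c
#include <stdint.h>

/* One operation, saturating add, without overloading: a separate
   function per width and signedness. Five widths times signed and
   unsigned is ten names; repeat for each of the ~20 operations, and
   square the count again if mixed-width operands are supported. */
int8_t   sadd_s8 (int8_t a,  int8_t b);
uint8_t  sadd_u8 (uint8_t a, uint8_t b);
int16_t  sadd_s16(int16_t a, int16_t b);
uint16_t sadd_u16(uint16_t a, uint16_t b);
int32_t  sadd_s32(int32_t a, int32_t b);
uint32_t sadd_u32(uint32_t a, uint32_t b);
int64_t  sadd_s64(int64_t a, int64_t b);
uint64_t sadd_u64(uint64_t a, uint64_t b);
/* ...and a fifth, wider width would need yet another pair. */
```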
- in reply to: Pipelining #1399
Moreover, one would like a notation for literals, something like “17usl” for an unsigned saturating long, which cannot be done in either language even with overloading, so casts would be necessary.
Currently the highest fixed arity is four, in the fma family. However, many ops are polyadic with unbounded arity, such as call, and some ops have member-dependent maximum arity, such as conform. There are field extract and insert operations, similar to those of other machines.
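For readers who haven’t met them, here is a sketch in plain C of what field extract and insert typically do on machines that have such ops; the lsb/width form is an assumption for illustration, not the Mill’s actual encoding:

```c
#include <stdint.h>

/* Extract 'width' bits of x starting at bit 'lsb'. */
static inline uint64_t field_extract(uint64_t x, unsigned lsb, unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    return (x >> lsb) & mask;
}

/* Insert the low 'width' bits of 'val' into x at bit 'lsb'. */
static inline uint64_t field_insert(uint64_t x, uint64_t val,
                                    unsigned lsb, unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    return (x & ~(mask << lsb)) | ((val & mask) << lsb);
}
```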
- in reply to: Pipelining #1387
Ran out of time; sorry. We hope to get a white paper out at some point.
It’s hard to explain without the animations, but the general solution recognizes that if the exit condition depends on (or can be scheduled to depend on) all non-speculable operations in the body then the leave operation (conditional on the exit condition) replaces the epilog.
lsbFU is a load/store functional unit for binary floating point, which converts to and from the internal representation. There is also an lsdFU, but Gold doesn’t support decimal natively.
Most machines have a distinct internal representation for FP and convert on load and store. We had one that kept the number denormalized, but it turned out not to work. The current simplified form does (apparently) work, but it’s not clear that the gain is worth the complication, and lsbFU may go away later.