Forum Replies Created
- in reply to: Eating Intel's Lunch Too? #3236
It’s plausible that a Mill with the same power and area budget could outrun an x86 on well-formed x86 code; we don’t want to promise, but it’s plausible. Probably not via JIT, but rather by binary-to-binary translation.
- in reply to: Tuning the specializer #3222
There’s no such annotation at the moment. You can select for performance, space, or debuggability now. We could easily add a power selection if there were demand; it would disable some speculative-execution optimizations.
- in reply to: LCV and Vectorisation #3209
I lose track too 🙂
Automatic vectorization doesn’t work yet; the specializer and LLVM are having a hissy fit. For your particular example, the best code is probably just to load twice, with no shuffle involved. The size of the vector loaded is member-dependent, of course.
- in reply to: Floating Point Rounding #3207
Actually, very early on the Mill had a stochastic (dithered) rounding mode. Then I became a member of the IEEE-754 (FP standard) committee and the others convinced me that it was a bad idea. I’m not enough of a numerics guy to explain why to someone else, but we accepted the opinion of the FP mavens on the committee and dropped it.
- in reply to: MILL and OSS #3204
We expect that we will publish the tool chain and the rest of the reference software. The situation with the specification tools is unclear, as they would not be of use to users of Mill chips but might be of considerable use to vendors of competing chips.
- in reply to: Removing Whitespace From A String #3199
That kind of optimization is a middle-end problem, which in our case means LLVM, and LLVM doesn’t seem to know anything about such things; auto-vectorization in general is weak to absent.
You are right that the first step for a vectorized version on the Mill would be to None out the whitespace; that’s easy. There’s no reduction-compaction operation in the ISA at this point. One possibility would be to turn the None-laden vector into a bitmask using the mask() op, and then use the mask as the control in a switch that would execute an appropriate shuffle() op for each mask to do the compaction.
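For concreteness, here is a hedged plain-C model of that mask-plus-shuffle compaction (this is not Mill code; the 8-byte block size, the tables, and names like compact8 are illustrative stand-ins for the mask()/shuffle() ops described above):

```c
#include <stdint.h>

/* shuffle_tab[m] lists, in order, the source indices of the bytes kept by mask m. */
static uint8_t shuffle_tab[256][8];
static uint8_t keep_count[256];

static void init_tables(void) {
    for (int m = 0; m < 256; m++) {
        int k = 0;
        for (int i = 0; i < 8; i++)
            if (m & (1 << i))
                shuffle_tab[m][k++] = (uint8_t)i;
        keep_count[m] = (uint8_t)k;
    }
}

/* Compact one 8-byte block: drop blanks and tabs, return how many bytes survive. */
static int compact8(uint8_t *dst, const uint8_t *src) {
    unsigned m = 0;
    for (int i = 0; i < 8; i++)                 /* the mask() step: one bit per kept byte */
        m |= (unsigned)(src[i] != ' ' && src[i] != '\t') << i;
    const uint8_t *sh = shuffle_tab[m];         /* the per-mask shuffle() pattern */
    for (int i = 0; i < keep_count[m]; i++)
        dst[i] = src[sh[i]];
    return keep_count[m];
}
```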
However, for low-height members (vector heights of 8 or 16 bytes) the cost of the armwaving to do vector compaction (absent a new op) probably exceeds the cost of doing it in the naive scalar loop, which trivially gets one byte per cycle (1 c/b) on a Mill.
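For reference, a minimal plain-C sketch of that naive scalar loop (the name remove_ws and the choice of blank/tab as “whitespace” are just illustrative):

```c
#include <stddef.h>

/* Copy src to dst, dropping blanks and tabs; returns the number of bytes written.
   The store is unconditional and the output index advances only for kept bytes,
   which maps naturally onto a pick rather than a data-dependent branch. */
size_t remove_ws(char *dst, const char *src, size_t n) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        char c = src[i];
        dst[j] = c;
        j += (c != ' ' && c != '\t');
    }
    return j;
}
```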
Another approach might be to use the machine width to do several bytes at a time, MIMD-fashion. Two-way would involve two loads and two stores per cycle, which larger Mills can do. Two compares, two adds, a pick and a branch (needed for the rest of the loop) would also fit in the same instruction on those members, so you’d get two bytes per cycle in scalar MIMD. As this is simple unrolling, the compiler might be able to find it.
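A hedged sketch of that two-way unroll in plain C (again illustrative only; note the second store’s address depends on the first increment, which is where the pick comes in):

```c
#include <stddef.h>

size_t remove_ws2(char *dst, const char *src, size_t n) {
    size_t j = 0, i = 0;
    for (; i + 1 < n; i += 2) {       /* two loads, two stores per iteration */
        char a = src[i], b = src[i + 1];
        dst[j] = a;
        j += (a != ' ' && a != '\t');
        dst[j] = b;
        j += (b != ' ' && b != '\t');
    }
    for (; i < n; i++) {              /* odd tail byte, if any */
        char c = src[i];
        dst[j] = c;
        j += (c != ' ' && c != '\t');
    }
    return j;
}
```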
However, absent a compaction op in the ISA, the right way to do this is streamers, but they are NYF.
- in reply to: A random smattering of questions. #3192
There is no “Null”. Perhaps you are thinking of “None”, which is a distinct meta-state and is not a NaR in the machine semantics, albeit encoded in a similar way. “None” is preserving (None+NaR -> None); None+None is implementation dependent.
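As a rough illustration of that propagation rule only, here is a toy C model (the names Val and meta_add are hypothetical and this is not a semantics spec; in particular the None+None case is implementation dependent on real hardware, and the model simply yields None):

```c
typedef enum { META_DATA, META_NAR, META_NONE } Meta;
typedef struct { Meta meta; long data; } Val;

Val meta_add(Val a, Val b) {
    if (a.meta == META_NONE || b.meta == META_NONE)
        return (Val){ META_NONE, 0 };           /* None is preserving: None + NaR -> None */
    if (a.meta == META_NAR || b.meta == META_NAR)
        return (Val){ META_NAR, 0 };            /* otherwise NaR propagates */
    return (Val){ META_DATA, a.data + b.data }; /* plain data */
}
```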
- in reply to: Mill Computing in 2017 #3190
As a bootstrap startup we long ago gave up making predictions about schedules. Give us a $10m budget and we can give you a reasonably hard schedule.
The Mill architecture is a coherent whole; it would be quite hard to pick off single features to incorporate into conventional designs such as RISC-V, or x86 for that matter.
Try to use one question per post; it’s easier for the reply and for the readers.
> What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?
What’s an output register?
> Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn’t you just have a FU with multiple output registers feeding into the latency-1 tree? I’m not able to visualize what makes it difficult.
Hardware fanout and clock constraints. Lat-1 is clock-critical, and additional sources (and the muxes for them) add latency to the lat-1 FUs. Letting lat-1s drop two results would double the number of sources for the belt crossbar, and it’s not worth it. Lat-2 and up have much more clock room.
> For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me like either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping? That’s the only thing I can think of that the second-level mux tree would have over the one-cycle tree.
There’s no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we’d need another pipe cycle in there to keep the clock rate up.
- in reply to: MILL and OSS #3230
Yep.
Calls (including the body of the called function) have zero latency. The FMA drops after the call returns.
The Mill spiller not only saves the current belt and scratchpad, but also everything that is in-flight. The in-flights are replayed when control returns to the caller, timed and belted as if the call hadn’t happened.
That’s how we can have traps and interrupts be just involuntary calls. The trapped/interrupted code is none the wiser.
- in reply to: Prediction #3224
There are multiple branches but no deferred branches. Deferred branches were included once but we got rid of them ages ago. One can think of phasing as being a single-cycle deferral – however all branches phase the same way, whereas the essence of deferral is to support variable latency.
The basic problem with deferred branches is belt congruency. If the target of a branch is a join then the branch carries belt arguments giving what belt objects are to be passed to the target, and their belt order; multiple incoming branches carry the belt arguments to match the target’s belt signature. If we hoist a branch then an argument it wants to pass may not exist yet when the branch issues; conversely, if we restrict its arguments to those that pre-exist it then those may have fallen off the belt by the time the branch is taken. Of course, we could split the branch op into a (hoisted) branch part and an (unhoisted) argument part, but then the hoisted part is really no more than a prefetch.
It would still be possible to use a deferred branch when the target is not a join; in that case the target just inherits the belt as it is when the branch is taken. But such an op, and a prefetch op, wait on having a big enough code sample (and enough free engineering time) to do serious measuring to see if they would pay.
- in reply to: Meltdown and Spectre #3220
Yes, one could reload a cache image at portal exit, or, even simpler, just evict everything. The Spectre attack depends on getting the victim to speculatively touch a line that it would not have touched in ordinary execution. It’s not clear that it’s very useful for an attacker to know which cache ways a victim touched during the normal execution of the portal.
- in reply to: MILL and OSS #3205
Back doors are more of a potential issue with micro-coded architectures, because it’s a lot easier to embed a door undetectably in software (which at heart is what microcode is) than in hardware. It’s also easier to do in a bizarrely complicated design like an OOO than in one that is far simpler.
I hope we are never approached with a demand to inject something.
This is an interpreter, a class of application that is notorious for bad prediction behavior.
Direct interpreters have two branches per iteration: the switch and the loop-back. The loop-back will predict; the switch won’t.
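A minimal C sketch of such a direct interpreter, showing the two branches (the opcode names and accumulator model are just illustrative):

```c
enum { OP_ADD, OP_SUB, OP_HALT };

int interp(const unsigned char *code, int acc) {
    for (;;) {                  /* loop-back branch: predicts well */
        switch (*code++) {      /* indirect branch on the opcode: the hard one */
        case OP_ADD:  acc += *code++; break;
        case OP_SUB:  acc -= *code++; break;
        case OP_HALT: return acc;
        default:      return acc;   /* unknown opcode: bail out */
        }
    }
}
```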
In answer to your question: the Mill does support indirect branches and indirect calls. Because there is only one branch site in the loop, the Mill’s use of exit prediction works out the same as a conventional branch predictor’s BTB. The mispredict rate is governed by the quality of the predictor and the “lumpiness” of the branching pattern, and should be similar between the Mill and a conventional machine. There may be some advantage to the Mill because the Mill prediction table is expected to be substantially larger than a conventional BTB, and so may be better at picking up history patterns; we have no data yet on that.
There’s nothing up the sleeve to let us predict the unpredictable. However, we do have one big advantage: Mill mispredict restart cost is five cycles, while that of a conventional is treble that or more.