Forum Replies Created
- Ivan Godard (Keymaster) | February 13, 2018 at 6:40 pm | Post count: 495
I lose track too 🙂
Automatic vectorization doesn’t work yet; the specializer and LLVM are having a hissy fit. For your particular example, the best code is probably just to load twice, with no shuffle involved. The size of the vector loaded is member-dependent, of course.
- Ivan Godard (Keymaster) | February 13, 2018 at 5:19 pm | Post count: 495
Actually, very very early the Mill had a stochastic (dithered) rounding mode. Then I became a member of the IEEE-754 (FP standard) committee and the others convinced me that it was a bad idea. I’m not enough of a numerics guy to explain why to someone else, but we accepted the opinion of the FP mavens on the committee and dropped it.
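For readers unfamiliar with the idea: the essence of stochastic (dithered) rounding is that the result rounds up with probability proportional to how close it is to the upper neighbour, so the rounding error is zero in expectation. This is only an illustrative sketch of the general technique, not the early Mill mode or anything from IEEE-754:

```python
import math
import random

def stochastic_round(x, step=1.0):
    """Round x to a multiple of step, picking the upper neighbour with
    probability equal to the fractional distance to it, so the expected
    value of the result equals x."""
    lo = math.floor(x / step) * step
    frac = (x - lo) / step
    return lo + step if random.random() < frac else lo
```

One of the committee's classic objections is that results become non-reproducible: the same program on the same data can give different answers run to run, which complicates testing and numerical analysis.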
- Ivan Godard (Keymaster) | February 26, 2018 at 12:48 pm | Post count: 495
Not quite. Everything is known well in advance, and the time for the actual computation may be the same for one-result and two-result ops. However, after a result is computed it must be routed from the producer to all the consumers it will have in the next cycle. That routing time is independent of which op produced the result; it depends largely on the number of sources, i.e. the number of distinct operands that can be dropped in each cycle, which is the fan-in to the routing crossbar.
We want to keep the routing time low, because every operation pays that cost. So we want few belt sources, and need to move low-value ops out of the way. We do that with the two-stage crossbar, and relegate all multi-result ops and all non-lat-1 ops to the second stage. There may be a two-res op that is fast enough that its result is available as fast as a simple add. But putting those two results on the fastpath would add a cost to every other fastpath op, including critical things like add.
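A rough way to see the cost: a crossbar input mux can be modeled as a tree of 2:1 muxes, whose depth grows with the log of the fan-in, and every level is on the critical path of every fastpath op. The numbers below are invented for illustration; real Mill fan-ins are member-dependent:

```python
import math

def mux_depth(sources):
    """Depth, in 2:1 mux levels, of a tree selecting one of `sources`
    inputs; each level adds delay to the cycle's critical path."""
    return math.ceil(math.log2(sources))

# Hypothetical machine: 8 fastpath (lat-1) sources, plus 8 more from
# multi-result and longer-latency ops relegated to the second stage.
fastpath_only = mux_depth(8)       # levels the add pays with a split crossbar
merged = mux_depth(8 + 8)          # levels it would pay in a single crossbar
```

Even this small model shows why moving the slow and multi-result drops to a second stage pays: the extra level saved is charged to every add in every cycle.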
- Ivan Godard (Keymaster) | February 26, 2018 at 12:35 pm | Post count: 495
The specializer doesn’t know where the spiller puts things, because it cannot predict the dynamic control flow and hence the belt demand. As soon as there is a trap/interrupt or the main program goes through a function or label pointer, or even a conditional branch, well, that’s all she wrote.
- Ivan Godard (Keymaster) | February 25, 2018 at 9:38 pm | Post count: 495
Please try to use one question per post; it's easier for both the replier and the readers.
What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?
What’s an output register?
Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn’t you just have a FU with multiple output registers feeding into the latency-1 tree? I’m not able to visualize what makes it difficult.
Hardware fanout and clock constraints. Lat-1 is clock critical, and the number of sources (and the muxes for them) adds latency to the lat-1 FUs. Letting lat-1 ops drop two results would double the number of sources for the belt crossbar, and it's not worth it. Lat-2 and up have much more clock room.
For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me like either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping, because that’s the only thing I can think of that the second-level mux tree would have over the one-cycle tree.
There’s no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we’d need another pipe cycle in there to keep the clock rate up.
- Ivan Godard (Keymaster) | February 25, 2018 at 9:23 pm | Post count: 495
Calls (including the body of the called function) have zero latency. The FMA drops after the call returns.
The Mill spiller not only saves the current belt and scratchpad, but also everything that is in-flight. The in-flights are replayed when control returns to the caller, timed and belted as if the call hadn't happened.
That's how we can have traps and interrupts be just involuntary calls. The trapped/interrupted code is none the wiser.
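A toy model may make the call transparency concrete. All names and structure here are my invention, not the real spiller; the point is only that a frame's belt *and* its not-yet-dropped results are frozen at a call and resume ticking after return, so a result issued before the call drops on schedule afterward:

```python
class Frame:
    def __init__(self):
        self.belt = []          # newest value at index 0
        self.inflight = []      # [cycles_left, value] pairs

def drop(frame, value, belt_len=8):
    frame.belt.insert(0, value)
    del frame.belt[belt_len:]   # older values fall off the belt

def issue(frame, value, latency):
    frame.inflight.append([latency, value])

def tick(frame):
    for entry in frame.inflight:
        entry[0] -= 1
    for entry in [e for e in frame.inflight if e[0] == 0]:
        frame.inflight.remove(entry)
        drop(frame, entry[1])

def call(spiller, frame):
    spiller.append(frame)       # spiller captures belt AND in-flights
    return Frame()              # callee starts on a fresh, empty frame

def ret(spiller):
    return spiller.pop()        # caller's in-flights resume ticking

# An FMA issued one cycle before a call drops two caller-cycles after
# return, exactly as if the call had zero latency:
spiller = []
caller = Frame()
issue(caller, 'fma_result', latency=3)
tick(caller)                    # 2 cycles of latency remain
callee = call(spiller, caller)  # the in-flight is frozen, not lost
caller = ret(spiller)
tick(caller)
tick(caller)                    # 'fma_result' drops now
```

Because a trap or interrupt in this model is just an extra `call`/`ret` pair, the interrupted frame replays identically, which is the "none the wiser" property described above.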
- Ivan Godard (Keymaster) | February 24, 2018 at 4:07 pm | Post count: 495
There are multiple branches but no deferred branches. Deferred branches were included once, but we got rid of them ages ago. One can think of phasing as a single-cycle deferral; however, all branches phase the same way, whereas the essence of deferral is to support variable latency.
The basic problem with deferred branches is belt congruency. If the target of a branch is a join then the branch carries belt arguments giving what belt objects are to be passed to the target, and their belt order; multiple incoming branches carry the belt arguments to match the target’s belt signature. If we hoist a branch then an argument it wants to pass may not exist yet when the branch issues; conversely, if we restrict its arguments to those that pre-exist it then those may have fallen off the belt by the time the branch is taken. Of course, we could split the branch op into a (hoisted) branch part and an (unhoisted) argument part, but then the hoisted part is really no more than a prefetch.
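Belt arguments at a join work much like SSA block arguments: every inbound branch names the operands it passes, in signature order, so the target sees a congruent belt regardless of path. The following sketch (all names invented) shows two branches with differently arranged belts producing the same layout at the join:

```python
def take_branch(belt, arg_positions):
    """br <target>(b[i], b[j], ...): the target's belt begins with the
    named operands, in the order the target's signature expects."""
    return [belt[i] for i in arg_positions]

# Two paths arrive at the join with differently arranged belts...
belt_a = ['t3', 'x', 'y']   # x at position 1, y at position 2
belt_b = ['y', 'u7', 'x']   # x at position 2, y at position 0
# ...but both pass (x, y), matching the join's two-operand signature:
join_a = take_branch(belt_a, [1, 2])
join_b = take_branch(belt_b, [2, 0])
```

The hoisting problem in the post falls out of this picture: the positions a hoisted branch names must refer to values that already exist when the branch issues, yet are still on the belt when it is finally taken.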
It would still be possible to use a deferred branch when the target is not a join; in that case the target just inherits the belt as it is when the branch is taken. But such an op, and a prefetch op, wait on having a big enough code sample (and enough free engineering time) to do serious measuring to see whether they would pay.
- Ivan Godard (Keymaster) | February 19, 2018 at 5:10 pm | Post count: 495
Yes, one could reload a cache image at portal exit, or even simpler just evict everything. The spectre attack depends on getting the victim to speculatively touch a line that it would not have touched in ordinary execution. It’s not clear that it’s very useful for an attacker to know which cache ways a victim touched during the normal execution of the portal.
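The attacker's channel can be caricatured in a few lines. In this toy timing model (the cycle counts and names are invented), a line the victim touched speculatively probes fast; evicting everything at portal exit leaves the attacker nothing to measure:

```python
HIT_CYCLES, MISS_CYCLES = 1, 10   # invented, illustrative latencies

def probe(cache, line):
    """Attacker times an access: fast means the victim touched the line."""
    return HIT_CYCLES if line in cache else MISS_CYCLES

cache = set()
cache.add('line_indexed_by_secret')     # victim's speculative touch
before = probe(cache, 'line_indexed_by_secret')  # fast: leaks the secret
cache.clear()                           # evict everything at portal exit
after = probe(cache, 'line_indexed_by_secret')   # slow: signal is gone
```

As the post notes, wholesale eviction also hides which lines the portal touched non-speculatively, at the cost of cold caches for whoever runs next.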
- Ivan Godard (Keymaster) | February 3, 2018 at 11:29 am | Post count: 495
Back doors are more of a potential issue with micro-coded architectures, because it's a lot easier to embed a back door undetectably in software (which at heart is what microcode is) than in hardware. It's also easier to do in a bizarrely complicated design like OOO than in one that is far simpler.
I hope we are never approached with a demand to inject something.