Forum Replies Created
- in reply to: What is your roadmap for 2023? #3799
By far the most important thing in 2023 will be our conversion from a bootstrap operation to a conventional company with paid salaries. We have hit the limit of what can be done in our current structure, so we expect to seek our next funding round, or affiliation with another company, as negotiation and the investment market permit. The answers to your other questions depend on how that goes. Naturally, we can’t talk about the details.
- in reply to: What is your roadmap for 2023? #3832
Open sourcing would only happen if we could not find funding otherwise. Crowd-funding $15M+ is a bit of a stretch. We will first try the usual funding paths.
- in reply to: Grab bag of questions #3830
Tail calls are — interesting — for us. Because security, our control stack (return addresses, spilled surrounding state) is not addressable by user code, and would not follow a simple stack model even if it were. We also support calls across protection domains. Consequently an app-mode trampoline cannot diddle the state of the stack; it can only touch actual data.
However, the ISA does have one feature that should simplify simple tail calls: the branch instruction can take arguments that replace the pre-branch belt content. Unfortunately, call arguments may be too big or too many to all fit on the belt (and then there’s VARARGS). These go in the frame – but the frame itself must be explicitly allocated with a specific instruction that sets all the internal (and unaddressable) registers; you can’t just do a push. Because security, again.
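To make the distinction concrete, here is a plain-C illustration (hypothetical functions, nothing Mill-specific). The first tail call passes a handful of scalar values that a branch-with-arguments could in principle carry on the belt; the second passes more arguments than a small member’s belt can hold, so some of them would have to go in an explicitly allocated frame.

/* Hypothetical example, not Mill code: which tail calls are "simple". */
extern int small_callee(int a, int b, int c);
extern int big_callee(int a, int b, int c, int d, int e, int f, int g, int h,
                      int i, int j, int k, int l, int m, int n, int o, int p);

int simple_tail(int x, int y) {
    /* A few scalar arguments: a branch that carries its arguments on the
       belt could in principle express this tail call directly. */
    return small_callee(x, y, x + y);
}

int hard_tail(int x) {
    /* Too many arguments to all fit on a small belt (and VARARGS would be
       worse): some must go in a frame, and a frame needs the explicit
       frame-allocating operation, not a bare branch. */
    return big_callee(x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x);
}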
Another issue is that a callee can assume that caller state is spilled and later restored in full: everything is in effect caller-saved. Thus it will assume that it has a full scratch and retire station complement available. If it doesn’t then it might unexpectedly run out of resources when tail-called by a branch that does not do the resource management that a call does.
When we looked into this, we rather quickly concluded that the expected fix-the-stack-and-branch approach was hopeless. Instead we will need to define a “tailcall” hardware operation that does all the internals like a call but reuses (and extends) the pre-existing frame in a very Mill-specific way. Details TBD.
- in reply to: What is your roadmap for 2023? #3828
+1!
- in reply to: What is your roadmap for 2023? #3824
QEMU is one of the many things on our wish list that are waiting for our conversion from the bootstrap business model. The interpreter will be non-trivial; it’s not set up for wide ISAs and interleaved execution.
- in reply to: Grab bag of questions #3822
The problem is that languages have diverse exception semantics, and people keep making new languages. The best we can do is to provide a general escape that will work with anything (trap to the language runtime system), and primitives that let common idioms be fast and clean.
One-pass unwind is fast and semantically clean, but it is hard to debug because when you get to the debugger at the bottom of the stack, the data showing how you got there is gone. Two-pass mostly solves this, but establishing the execution context for a catch predicate halfway down the stack is non-trivial and very language dependent, i.e. it will involve the language RTS and arbitrary delays, which precludes use in the kernel. We hope to give the debugger a command that snapshots the stack before the search and then restores it for the real one-pass recovery. However, that’s not implemented yet and in any case won’t work in a running kernel.
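As a rough analogy in plain C (using setjmp/longjmp as a stand-in for one-pass unwind, nothing Mill-specific): by the time the recovery code runs, the intervening frames are already gone, which is exactly the debugging problem described above.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf unwind_point;

static void inner(void)  { longjmp(unwind_point, 1); }  /* the "throw" */
static void middle(void) { inner(); }

int main(void) {
    if (setjmp(unwind_point) == 0) {
        middle();               /* normal path */
    } else {
        puts("recovered");      /* one-pass recovery: the frames of inner()
                                   and middle() no longer exist here */
    }
    return 0;
}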
- in reply to: Grab bag of questions #3817
Streamers NYF 🙂
Deferral is limited by the encoding: how many cycles you can encode. The deferral is in a morsel and is biased by the minimum, so a Copper can defer 11 cycles and a Silver 20. This is enough to hide L1 and usually L2 in open code.
The number of retire stations is another limit that sometimes bites in open code but is especially a problem in piped loops. For compute loops a rule of thumb is two loads and a store per iteration to size the FU counts. We try to saturate to a piped one-bundle loop, so with a latency of 3 we have six in flight, which helps size the retire station count. And yes, that gives a miss any time we need to go to the L2 or further. However, this is mitigated by the cache hardware that holds whole cache lines. If the loop is stepping through arrays (typical) then only the first miss of a line will take a hit: later loads to the same line will hit in the L1. So for typical line sizes you are missing on fewer than 10% of the loads regardless of where the line is.
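To ground the rule of thumb, this is the kind of compute loop it describes (plain C, illustrative only): two loads and one store per iteration, so a piped one-bundle loop with three-cycle loads has about six loads in flight, which is what sizes the retire station complement.

/* Illustrative only: two loads (a[i], b[i]) and one store (c[i]) per iteration. */
void vadd(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}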
Of course, that doesn’t help if the loop has no locality of reference: a hash table perhaps. But piping helps there too, because the pipe has several loads in flight. If the first misses, the later ones can miss too (miss under miss) without additional delay. The more loads the code can issue before it needs one to retire, the better. But that is bounded by the number of stations. Member configuration tuning is an art 🙂
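A sketch of the no-locality case (again plain C, illustrative): each gathered load is independent of the others, so a piped schedule can have several in flight and their misses overlap, bounded by the number of retire stations.

/* Gather with little locality: table[idx[i]] can miss on nearly every
   iteration, but consecutive misses overlap (miss under miss). */
long gather_sum(const long *table, const int *idx, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += table[idx[i]];
    return sum;
}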
As for compilers, we don’t try to do much of anything beyond piping and hoisting. Mill can do anything that can be done in other ISAs, such as prefetching in software or hardware, but we have no experience yet that suggests that we should. The big win was the architecture that split issue from retire in loads and made the load latency visible.
- in reply to: Grab bag of questions #3816
Without history your example will miss the paint call every time on any ISA. History will avoid misses so long as it doesn’t saturate. But you don’t need history to do the data prediction and prefetch.
The naive code will first do the load of the indexed widget, then the method select, then the call, and lastly the control variable update and back branch. Without piping that will cause a data delay in the method select. With piping the load can be hoisted to an earlier iteration, extending the deferral. Depending on the width there may be more than one iteration in flight, but even on a narrow member the load can be hoisted over the paint call of the prior iteration. The load then has the full duration of the call to deal with data cache misses.
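For reference, the source-level shape of the loop being discussed is something like this (hypothetical names, plain C with a function pointer standing in for the method select):

struct widget {
    void (*paint)(struct widget *);   /* the method, selected through a pointer */
};

void paint_all(struct widget **widgets, int n) {
    for (int i = 0; i < n; ++i) {
        struct widget *w = widgets[i];   /* LOAD the indexed widget */
        w->paint(w);                     /* SELECT the method, then CALL */
    }
}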
The present specializer hoists loads to the furthest dominator. In a loop the furthest dominator may be after the load in textual order but ahead of the load in iteration order. This is safe, even in the presence of mutation of the widgets array and even without “restrict”, because the load is defined to return the value as of retire, not as of issue, and retire is after the paint call.
That takes care of data misses. It doesn’t prefetch the code of the callee. To do that the compiler can insert a prefetch instruction for the address used by the call, but ahead of the call. That prefetch instruction too can be hoisted, so the entire load/select/prefetch of iteration N is executed before the paint call of iteration N-1. Of course, that squeezes out the data hoisting we just did, so that needs to be hoisted again. The whole sequence is:
LOAD (N) CALL (N-2) SELECT (N) PREFETCH (N) CALL (N-1) CALL (N)
and there are three iterations in flight in the pipe. Disclosure: we don’t do this yet, but the ISA supports it already.
- in reply to: Grab bag of questions #3815
This is getting into internal subtleties as well as definitional ones. Your desired code can be reached by pipelining, scheduling further down the pipe with no unroll. Or maybe that should be called “unrolling” too? I assume that the loop is something like
for (int i = 0; i < N; ++i) a[i] += 5;
so there is a recurrence in the update to the control variable, and the latency of the load must be dealt with. Assuming three-cycle loads, there will be three iterations in flight from load, and an add iteration, before the first store has anything to do. The pipeline logic has a (previously computed) known number of bundles to work with, and does a normal schedule that wraps around at that number. If there are unused resources (like load slots in the bundles) when it places the last instruction of the hyperblock it can just continue modulo placement until it runs out. I suppose it would be fair to call this “unrolling”, although I think of it as more piping.
As for PGO in general: I don’t see much to gain, but full disclosure: PGO is not implemented and measurement often surprises me.
- in reply to: Grab bag of questions #3807
Yes, we’ve thought about hardware valgrind. So far we have nothing that is convincingly useful, cheap, and safe. Most runtime analysis opens exfiltration attack surfaces – what should happen if the app being monitored calls a function to check a password? The cache history can be revealing!
On a legacy ISA you can hide things with security implications behind a process barrier, and monitor per-process. Of course that means you have a process switch, a trip through the OS, and a reschedule just to verify a password. On a Mill you can change the protection environment (turf) without a process switch and at no more cost than an ordinary call.
We looked at having monitoring be per-turf. However, that causes a ton of state switch at turf change, when we want it to be cheap and quick. And you rarely want to monitor everything done by a turf in all contexts: probably only while in this thread and that function and after the other counter has exhausted. That kind of thing means software in the monitoring predicate, which implies trap/predicate/update sequences, which is difficult in a wide machine and will almost certainly screw up some of the data to be collected (the predicate code will alter the cache contents, for example).
Bottom line: yes, there is some potential for such a thing, and we have looked at it, and our looking has shown that it is a more complex problem than it might appear. So we have punted it to later.
- in reply to: Grab bag of questions #3806
State depends on the optimization. For example, a linear (stride) predictor needs some history to recognize a pattern. Is that history state to be preserved over a process switch? Over an inner loop? Over a call? All such choices have gains, and losses, and costs. Probably process switch is rare enough that the state should simply be discarded. A level-ignorant predictor can catch stride patterns at all nesting levels, but has a limit to the number of tracked addresses that may be exceeded by a call. Giving each frame its own predictor avoids that, but then the predictor state needs to be saved/restored over calls. TANSTAAFL!
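For concreteness, here is a toy software model of the kind of stride history being discussed (a generic textbook prefetcher, not a description of Mill hardware); the table below is exactly the state whose lifetime across calls and process switches is at issue.

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 16   /* toy capacity: the "limit to the number of tracked addresses" */

struct stride_entry {
    uintptr_t pc;        /* load instruction that owns this entry */
    uintptr_t last_addr; /* last address it loaded */
    intptr_t  stride;    /* last observed delta */
    bool      confident; /* same stride seen twice in a row */
};

static struct stride_entry table[ENTRIES];

/* Record one executed load; return a predicted address to prefetch, or 0. */
uintptr_t observe_load(uintptr_t pc, uintptr_t addr) {
    struct stride_entry *e = &table[pc % ENTRIES];
    if (e->pc != pc) {                            /* a new load took over this entry */
        *e = (struct stride_entry){ .pc = pc, .last_addr = addr };
        return 0;
    }
    intptr_t delta = (intptr_t)(addr - e->last_addr);
    e->confident = (delta == e->stride);
    e->stride    = delta;
    e->last_addr = addr;
    return e->confident ? addr + delta : 0;
}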
- in reply to: Grab bag of questions #3798
– 18) Any clever ideas you’ve had lately that didn’t pan out?
Lots 🙁
An example in my own code: the specializer must turn the program into something that can execute on the target member, and the generated code must not exceed the physical limitations of the target. Example limitations include the number of simultaneously live objects (limited by the belt size), the number of concurrent active loads (limited by the number of retire stations), and others. Straightforward scheduling of pending instructions (an NP-complete bin-packing problem) can produce a binary that exceeds these limits. The initial specializer implementation did a schedule and then examined the resulting code for physicality violations. If it found some, it modified the program by inserting suitable spill/fill and rescue operations and rescheduled. This iterated, a process we called annealing, until a valid schedule was produced.
Unfortunately we found examples where the annealing did not converge, or converged only after an impractical number of iterations or inserted operation counts. After a great deal of work, the scheduling part of the specializer was abandoned and rewritten (the predictive scheduler) to anticipate situations in which resource exhaustion was possible and avoid them. The problem is that the predictive scheduler will avoid problems that might not actually happen in the actual schedule, producing poorer code than annealing does when annealing works. Someday someone will get a sheepskin for coming up with a better scheduler, but for now we live with it.
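For the flavor of the annealing loop described above, here is a deliberately toy model in C (hypothetical names and a fake resource model, not the specializer itself): each round relieves some pressure but adds operations, which is why convergence is not guaranteed in the real thing.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins: "pressure" abstracts peak live values vs. the belt size. */
struct program  { int ops; int pressure; int belt_size; };
struct schedule { int peak_live; };

static struct schedule do_schedule(const struct program *p) {
    return (struct schedule){ .peak_live = p->pressure };
}
static bool exceeds_limits(const struct program *p, const struct schedule *s) {
    return s->peak_live > p->belt_size;
}
static void insert_spill_fill(struct program *p) {
    p->pressure -= 1;   /* spilling relieves one live value...            */
    p->ops      += 2;   /* ...at the cost of a spill and a fill operation */
}

static bool anneal(struct program *p, int max_rounds) {
    for (int i = 0; i < max_rounds; ++i) {
        struct schedule s = do_schedule(p);
        if (!exceeds_limits(p, &s)) return true;   /* valid schedule: done */
        insert_spill_fill(p);                      /* patch and reschedule */
    }
    return false;                                  /* did not converge */
}

int main(void) {
    struct program p = { .ops = 40, .pressure = 24, .belt_size = 16 };
    printf("converged: %s, ops grew to %d\n", anneal(&p, 100) ? "yes" : "no", p.ops);
    return 0;
}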
- in reply to: Grab bag of questions #3797
– 17) What have you learned from other recent innovations?
Little, because we are not at the point at which such tuning is beneficial. Specific language support will need implementations of those languages and lots of experience with them – the team and I are no better at guessing where the pinch points will be than any other group of good engineers, i.e. lousy. The solution is measurement, measurement, and measurement. Buy me a beer someday and we can talk design philosophy 🙂
- in reply to: Grab bag of questions #3796
– 16) What have you learned from the M1?
We have not looked much at the M1, nor at other contemporary chip-level implementations. We are still working at the ISA level, and the hardware side is still focused on correctness and completeness.
- in reply to: Grab bag of questions #3795
– 15) Formally proving safety
We have so far made no effort toward formal proof of anything, being too concerned with getting it working first. We will be very happy when the definition is frozen enough that proof techniques are applicable.