Grab bag of questions
Hello there!
It’s been a while since somebody asked lots of technical questions on this forum. I thought I would pick it back up, and try to discuss things that haven’t been brought up too much, either in the video talks or on the forums.
Since I have a ton of questions, I’ll be splitting them roughly by subject in posts below.
– 1) Exit prediction with dynamic addresses.
What’s the current story for exit prediction with dynamic function pointers, vtables and such? Obviously, if you have a loop iterating over an array of virtual objects, the predicted address for a single call instruction isn’t going to be too helpful. But it’s probably suboptimal to throw up your hands and say “it can’t be predicted”, since sometimes these addresses will be available tens of cycles ahead of time.
I think you mentioned potentially using deferred branches, but eventually gave up on the idea. If nothing else, deferred branches wouldn’t help you with the “array of objects” case, since they could presumably only predict the first one.
Ideally, you would want to have predictions available for each vtable in the array, so you can call the next method as soon as you returned from the last one. Maybe you could have a prediction queue of some sort? Or an alternate encoding scheme for when the Exit Table predicts that a given exit will always be a dynamic branch?
– 2) Hardware prefetching
How does the Mill handle data-side hardware prefetching? Traditional CPUs mostly rely on detecting stride patterns, eg “If you asked for addresses 12037, 12047 and 12057, you’re probably going to ask for 12067”. Do you expect Mill cores to do stride detection?
Deferred loads can help hide the latency of an L1 miss, but obviously don’t help with an L3 miss. And there are some common patterns (eg tree traversal, arrays of pointers) where stride detection and deferred loads don’t help at all, but the hardware would still have enough information to prefetch lots of data ahead of time. For instance, a foreach loop iterating depth-first over a binary tree might want to take the left branch immediately and prefetch the right branch, thus skipping half the L3 misses. Does the Mill provide anything to facilitate this?
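For concreteness, here is a hedged sketch of that tree pattern, using the portable GCC/Clang `__builtin_prefetch` intrinsic as a stand-in for whatever the Mill might offer; the `Node` type and function names are invented for illustration.
`
struct Node { int value; Node* left; Node* right; };

template <typename Visit>
void visit_depth_first(Node* n, Visit&& visit) {
    if (!n) return;
    // Start pulling the right subtree's root toward the cache now; by the time
    // the left subtree has been walked it has (hopefully) arrived, skipping
    // roughly half of the deep-memory misses.
    __builtin_prefetch(n->right);
    visit(n->value);
    visit_depth_first(n->left, visit);    // take the left branch immediately
    visit_depth_first(n->right, visit);
}
`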
– 3) Inner-loop specific optimizations
The “inner” instruction seems like it would open a lot of potential for optimizations. Have you explored loop-specific optimizations?
I’m thinking about things like exact exit prediction (if you’re looping from “i = 0” to “i < array.size”, you know exactly how many iterations the loop will run), smart prefetching (in a loop with no branches, you know exactly what loads the next iterations will run, so you can prefetch them ahead of time), etc.
I know that a lot of software prefetching is wasted because it prefetches past the end of a loop and things like that, but the hardware would have enough info to be precise.
– 1) Exit prediction with dynamic addresses.
The cause of the exit transfer doesn’t matter for prediction, with the exception of returns. Most ebbs (hyperblocks) have several conditional ways to exit, one of which is taken; the predictor says where the taken one will go, in which cycle, and the kind of transfer (call/return/branch). The prediction itself doesn’t care whether it was something dynamic. Calls and returns use a predictor-maintained stack of return positions rather than an actual address in the prediction; the return prediction entry itself just says “return” and pops the entry.

Your question seems to assume that prediction is per transfer site, as in per vtable. It’s not; it’s per transferred-to location: if you got here somehow, then you will next get there. History prediction can use a chain (actually a hash) of heres. If a region of code contains a bunch of vtable transfers on the same object, the first will predict that the object will be of the same type (and transfer the same way) as the prior object – and miss if wrong. But the history will then kick in and subsequent transfers will predict the transfer sequence of the last time an object of the same type was processed. There’s a per-member tradeoff between the size of the predictor and the fanout of types that the predictor can handle.
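To make the “per transferred-to location, keyed on a chain of heres” idea concrete, here is a generic software sketch of that style of prediction. This is a textbook-style illustration, not the Mill’s actual predictor hardware, and all names are invented.
`
#include <cstdint>
#include <unordered_map>

struct Prediction {
    uint64_t target;   // where the taken exit will transfer to
    uint16_t cycle;    // in which cycle of the ebb the exit is taken
    uint8_t  kind;     // call / return / branch
};

struct ExitPredictor {
    uint64_t history = 0;                            // hashed chain of recent "heres"
    std::unordered_map<uint64_t, Prediction> table;  // keyed by (ebb entry, history)

    uint64_t key(uint64_t ebb_entry) const {
        return ebb_entry ^ (history * 0x9E3779B97F4A7C15ull);
    }
    const Prediction* predict(uint64_t ebb_entry) const {
        auto it = table.find(key(ebb_entry));
        return it == table.end() ? nullptr : &it->second;
    }
    void update(uint64_t ebb_entry, const Prediction& actual) {
        table[key(ebb_entry)] = actual;        // learn what this ebb did this time
        history = (history << 7) ^ ebb_entry;  // fold this "here" into the chain
    }
};
`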
The cause of the exit transfer doesn’t matter for prediction, with the exception of returns. Most ebbs (hyperblocks) have several conditional ways to exit, one of which is taken; the predictor says where the taken one will go, in which cycle, and the kind of transfer (call/return/branch). The prediction itself doesn’t care whether it was something dynamic.
I’m not sure we understand each other here.
Say you have a loop iterating over a list of dynamic objects (say, widgets):
`
for (i in 0 .. this.child_widgets.size()) {
    this.child_widgets[i].paint();
}
`
Given the same EBB (the loop body), each entry might have a different exit. So for each loop iteration, the predictor would be given the same EBB address, and need to predict a different exit address (the paint call of the given vtable).
I guess history prediction, agree prediction, etc. would help. If your widget list is [Button, Label, Checkbox, Button, Label, Checkbox, Button, ...], the predictor can probably guess “if you got into this EBB by returning from `Button::paint`, then the exit is probably `Label::paint`” from history. I’m not sure how much that helps? If you have a lot of method calls on heterogeneous object lists, and/or if you have lists that change regularly, wouldn’t that saturate the predictor? On the other hand, modern predictors are pretty big (around 4096 entries according to a Cloudflare article I found).

Anyway, my point was: given the example code above, the compiler has enough information to predict every call to the `paint` method hundreds of cycles in advance of it happening. Say each paint call takes about 100 cycles on average, and the compiler knows that. The compiler could insert prefetch calls to tell the CPU “prefetch the vtable entry for `child_widgets[i + 3]`” and mask the DRAM load latency. Then “predict that next time this EBB is entered, its exit will be `child_widgets[i + 1].paint()`” and avoid the mispredict even if the history has not caught up yet.

Of course, I don’t know how useful any of this would be in practice. From what I know of the subject, `likely`-style hints aren’t used by the compiler because branch predictors actually outperform compiler hints, to the point that the hints are more of a hindrance than anything else. But I don’t know how much this is accidental vs fundamental to the prediction problem.

Without history your example will miss the paint call every time on any ISA. History will avoid misses so long as it doesn’t saturate. But you don’t need history to do the data prediction and prefetch.
The naive code will first do the load of the indexed widget, then the method select, then the call, and lastly the control variable update and back branch. Without piping that will cause a data delay in the method select. With piping the load can be hoisted to an earlier iteration, extending the deferral. Depending on the width there may be more than one iteration in flight, but even on a narrow member the load can be hoisted over the paint call of the prior iteration. The load then has the full duration of the call to deal with data cache misses.
The present specializer hoists loads to the furthest dominator. In a loop the furthest dominator may be after the load in textual order but ahead of the load in iteration order. This is safe, even in the presence of mutation of the widgets array and even without “restrict”, because the load is defined to return the value as of retire, not as of issue, and retire is after the paint call.
That takes care of data misses. It doesn’t prefetch the code of the callee. To do that the compiler can insert a prefetch instruction for the address used by the call, but ahead of the call. That prefetch instruction too can be hoisted, so the entire load/select/prefetch of iteration N is executed before the paint call of iteration N-1. Of course, that squeezes out the data hoisting we just did, so that needs to be hoisted again. The whole sequence is:
LOAD (N) CALL (N-2) SELECT (N) PREFETCH (N) CALL (N-1) CALL (N)
and there are three iterations in flight in the pipe. Disclosure: we don’t do this yet, but the ISA supports it already.
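In C++ terms, one level of that hoisting looks roughly like the sketch below. This is a hedged illustration, not what the specializer emits today; `__builtin_prefetch` is the portable GCC/Clang intrinsic standing in for whatever prefetch operation would actually be used, and the `Widget` type is taken from the earlier example.
`
#include <cstddef>
#include <vector>

struct Widget { virtual void paint() = 0; virtual ~Widget() = default; };

void paint_all(std::vector<Widget*>& child_widgets) {
    const std::size_t n = child_widgets.size();
    if (n == 0) return;
    Widget* next = child_widgets[0];          // LOAD(0), hoisted ahead of the loop
    for (std::size_t i = 0; i < n; ++i) {
        Widget* cur = next;
        if (i + 1 < n) {
            next = child_widgets[i + 1];      // LOAD(N+1), hoisted over CALL(N)
            __builtin_prefetch(next);         // start pulling in the next object
                                              // (its first word is the vtable pointer)
        }
        cur->paint();                         // CALL(N); the call hides the misses above
    }
}
`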
– 2) Hardware prefetching
Data prefetch is per-member. The present members at Silver and above have stride prefetching defined; others have none. We haven’t poked at anything more complicated, and do not expect to do so pending customer experience with real data and apps.

– 3) Inner-loop specific optimizations

Anything of this sort would be member-specific, and none of the current members do such things. There has been some thought, though; at one point we considered a “do(n)times” branch form. One problem is that the needs and possible optimizations are heavily domain dependent: “inner loop” screams “scientific programming”, whereas more conventional code can actually be detuned by keeping extra state for “optimization”.

One problem is that the needs and possible optimizations are heavily domain dependent: “inner loop” screams “scientific programming”, whereas more conventional code can actually be detuned by keeping extra state for “optimization”.
What kind of extra state are you thinking of that would be a hindrance?
State depends on the optimization. For example, a linear (stride) predictor needs some history to recognize a pattern. Is that history state to be preserved over a process switch? Over an inner loop? Over a call? All such choices have gains, and losses, and costs. Probably process switch is rare enough that the state should simply be discarded. A level-ignorant predictor can catch stride patterns at all nesting levels, but has a limit to the number of tracked addresses that may be exceeded by a call. Giving each frame its own predictor avoids that, but then the predictor state needs to be saved/restored over calls. TANSTAAFL!
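For illustration, the kind of state in question for a stride prefetcher looks roughly like this textbook structure (not any Mill member’s actual hardware; the names are invented). The design questions above are exactly about what to do with this table at a call, an inner loop, or a process switch.
`
#include <cstdint>

struct StrideEntry {
    uint64_t tag;          // identifies the tracked load (e.g. its instruction address)
    uint64_t last_addr;    // most recent address that load accessed
    int64_t  stride;       // last_addr minus the address before it
    uint8_t  confidence;   // bumped when the stride repeats, decayed when it doesn't
};

struct StridePrefetcher {
    // The hard limit on tracked addresses that a deep call chain may exceed.
    StrideEntry entries[64];
};
`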
– 4) Tail calls
Does the Mill support tail-call instructions? I’m told they’re important for some functional programming languages.
You might be able to save instruction entropy by encoding tail calls as “a return ganged with a function call”.
– 5) Exception handling
Do you have first-class C++ style exception handling?
By “first class”, I mean “zero-cost if no exception is thrown”, so no branch-after-every-function-return or things like that.
Also, can you handle setjmp/longjmp?
– 6) Coroutines
Do you support stackless coroutines in hardware? By coroutines, I mean functions you can jump into, yield from, and then jump back into at the yield point.
By stackless, I mean “unlike a thread, they don’t need their own dedicated stack; they add stack frames directly to the caller’s stack”. I guess the big difficulty would be spilling directly to userspace.
– 4) Tail calls
There is no special architectural support for tail calling at present. The LLVM-based compiler and tool chain can convert recursion to tail calling loops. This is actually easier in a Mill than in a register architecture, because the belt means you don’t have to overwrite register contents, just use the new values, as you do in a loop.

There is no special architectural support for tail calling at present. The LLVM-based compiler and tool chain can convert recursion to tail calling loops
I think there are some programs / languages which explicitly rely on tail calls, in ways that can’t be easily elided with TCO. Eg mutually-recursive functions and tail calls to dynamic functions, to give the most obvious examples.
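For concreteness, a minimal example of the mutually-recursive case: the tail calls here bounce between two functions (and in general could go through a function pointer), so simple self-recursion-to-loop conversion does not apply, and without a guaranteed tail-call lowering every call grows the stack.
`
bool is_odd(unsigned n);

bool is_even(unsigned n) {
    if (n == 0) return true;
    return is_odd(n - 1);   // tail call to the *other* function
}

bool is_odd(unsigned n) {
    if (n == 0) return false;
    return is_even(n - 1);  // tail call back again
}
`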
I don’t have enough experience in functional languages to know the details; but there’s been some discussion about this subject on the WebAssembly tail-calls proposal page:
Since we don’t use wasm function parameters at all, we can assume
all wasm type signatures of our functions are i32(). The returned i32
indicates the call target (all calls become tail calls). Then we can
use a trampolined mini-interpreter which given the first function to
enter, repeatedly invoke the function, perform some side-effects,
retrieve the next call target, then use the call_indirect instruction
to call into the next target. This is the current approach of
asterius.

Now, asterius is deprecated and GHC now includes wasm targets; I haven’t checked if it used the same tricks, but I assume it does.
tl;dr: For Haskell and other functional languages, you either need hardware support for tail calls or you accept that the language is going to route every call through a trampoline in the worst case (unless the compiler can provably find and remove every tail call instance ahead of time).
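For readers unfamiliar with the trick in that quote, here’s a hedged sketch of the general trampoline pattern (not asterius itself; all names are invented): each “function” returns the index of its next call target instead of calling it, and a small driver loop performs the indirect calls, so the native stack never grows.
`
#include <cstdio>

struct State { long n; long acc; };
using Step = int (*)(State&);          // returns the next call target, -1 to stop

int halt(State&) { return -1; }
int factorial_step(State& s) {
    if (s.n == 0) return 0;            // "tail call" to slot 0 (halt)
    s.acc *= s.n;
    --s.n;
    return 1;                          // "tail call" to slot 1 (itself)
}

int main() {
    Step table[] = { halt, factorial_step };
    State s{10, 1};
    for (int next = 1; next != -1; )   // the trampolined mini-interpreter
        next = table[next](s);
    std::printf("10! = %ld\n", s.acc); // prints 3628800
}
`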
In the same vein, Wasmtime recently had an RFC for adding tail-calls:
In general, we want Wasmtime to support well-designed, standards-track WebAssembly proposals like the tail calls proposal. The proposal (currently in phase 4) is mature, stable, and implemented in V8 and JSC.
A variety of Wasm producers will and already do target Wasm tail calls. In clang, C++20’s coroutines lower to calls annotated with LLVM’s musttail attribute, which fails to compile to Wasm unless the tail calls proposal is enabled. LLVM’s optimizer will additionally turn regular calls into tail calls when the proposal is enabled. Compiling to Wasm with a continuation passing style where everything is a tail call is a natural fit for some domains, such as lowering a finite state machine to Wasm. Many functional languages guarantee proper tail calls and have to bend over backwards to deliver them without the Wasm tail calls proposal.
Tail calls are — interesting — for us. Because security, our control stack (return addresses, spilled surrounding state) is not addressable by user code and is not in a simple stack model even if it were. We also support calls across protection domains. Consequently an app-mode trampoline is unable to diddle the state of the stack; it can only touch actual data.
However, the ISA does have one feature that should simplify simple tail calls: the branch instruction can take arguments that replace the pre-branch belt content. Unfortunately, call arguments may be too big or too many to fit all on the belt (and then there’s VARARGS). These go in the frame – but the frame itself must be explicitly allocated with a specific instruction that sets all the internal (and unaddressable) registers; you can’t just do a push. Because security, again.
Another issue is that a callee can assume that caller state is spilled and later restored in full: everything is in effect caller-saved. Thus it will assume that it has a full scratch and retire station complement available. If it doesn’t then it might unexpectedly run out of resources when tail-called by a branch that does not do the resource management that a call does.
When we have looked into this, we rather quickly concluded that the expected fix-the-stack-and-branch approach was hopeless. Instead we will need to define a “tailcall” hardware operation that does all the internals like a call but reuses (and extends) the pre-existing frame in a very Mill-specific way. Details TBD.
When we have looked into this, we rather quickly concluded that the expected fix-the-stack-and-branch approach was hopeless. Instead we will need to define a “tailcall” hardware operation that does all the internals like a call but reuses (and extends) the pre-existing frame in a very Mill-specific way. Details TBD.
Right, that’s why I was bringing it up. You need a dedicated tail-call instruction; otherwise compilers can’t implement it at all.
– 5) Exception handling
Mill has first class C++ exceptions in a way that is NYF (Not Yet Filed, sorry). The mechanism is also used for setjmp/longjmp. A consequence of the design is that exceptions work in the kernel code itself, and do not require trapping to recovery code nor use of the file system.

The facility is not entirely zero cost because there are setup and control instructions needed to define the exception environment and these do cost code space. However there is no latency cost if not thrown. The latency of a caught throw is roughly the same as the equivalent sequence of return instructions and destructor calls if directly executed, mostly due to mispredicts in the control flow of the exception path; there are no table searches or file lookups. The net result is that Mill exceptions are a practical control flow technique that can be used, and thrown, in performance critical and kernel code.
IMHO, that sounds much better than the “Sutter method” of checking a flag after each function call, which you had mentioned before.
I hope you have also considered languages other than C++. For instance, languages that push exception policies down the call chain, and languages with “catch” clauses that perform closer inspection of the exception object before deciding to handle the exception.
To handle different cases, some runtimes perform their stack unwinding in two passes: one to find the target handler, and one to perform the actual unwinding.

The problem is that languages have diverse exception semantics, and they keep making new languages. The best we can do is to provide a general escape that will work with anything (trap to the language runtime system), and primitives that let common idioms be fast and clean.
One-pass unwind is fast and semantically clean, but is hard to debug because when you get to the debugger at the bottom of the stack, the data showing how you got there is gone. Two-pass mostly solves this, but establishing the execution context for a catch predicate halfway down the stack is non-trivial and very language dependent, i.e. will involve language RTS and arbitrary delay that precludes use in the kernel. We hope to give the debugger a command that snapshots the stack before search and then restores it for the real one-pass recovery. However, that’s not implemented yet and in any case won’t work in a running kernel.
– 6) Coroutines
This functionality is subsumed under streamers, which are NYF. The semantics is slightly different, but nearly all use of such coroutines is to implement streams and at present we do not expect to separately support stackless coroutines beyond whatever the compiler/language system provides. There has been some talk about what we should do to help microthread systems, but frankly we’re too ignorant of the needs to do much without more experience.
– 7) Benchmarking support
Are there plans to support stable benchmarking on the Mill? Benchmarks often have to go through all sorts of contortions to filter out performance noise: warming up caches, running multiple times to smooth out outliers, maybe even recompiling multiple times to test performance with different code layouts, etc.
The Mill has some natural advantages: the Exit Table can be preloaded, and no runtime reordering means better stability. Do you have plans to include features specifically for stable execution? Things like purging caches, avoiding interrupts, pinning cores, etc?
– 8) Profiling support
Similar to the previous question, do you have plans to support fine-grained profiling? Profilers often give out coarse information like “this function was executed roughly X times”, “this function took roughly Y cycles on average”, etc.
You can use emulation tools like valgrind to get more fine-grained info like the number of L2 misses from a specific instruction, at a cost of massive performance loss. Could the Mill provide tools to help get fine-grained data without the massive overhead?
– 9) Profile-guided optimization
How does Mill plan to benefit from PGO?
Traditional PGO is mostly centered on layout: the profiling phase gets coarse-grained info about which functions are called most often in which order, and from that the compiler knows both which functions to optimize for size vs performance (in other words, which loops to unroll and vectorize) and which functions should be laid out together in memory.
It feels like, since the Mill is a lot more structured than a traditional ISA, it could get a lot more mileage from PGO. It can already get benefits through the Exit Table. Are there plans to facilitate other optimizations, besides general code layout?
– 10) Specializer hints
Because of the whole Specializer workflow, the genASM format is essentially a compiler IR. As such, have you thought about including hints to help the Specializer produce better local code?
Some hints might include (straight from LLVM): aliasing info (saying that a load and a store will never alias), purity info for function calls (saying a function will never load/store/modify FP flags), generally saying that a loop is safe to unroll, etc.
– 7) Benchmarking support
Nearly all of this topic is at the OS and language levels, not the architectural level. There are some admin operations presently defined – bulk purging the cache for example – and there may be more added as needed, but most such things, interrupt control for example, are the property of various system services and most systems don’t make them available for use directly by apps. Of course, you could write your benchmark to run on a bare machine and get control at power up…

I agree that this should be at an OS level, and coarse-grained.
Direct access to fine-grained cycle counters is well known to be used for timing-based attacks (not just Spectre and Meltdown, but also for reducing the time to guess passwords, encryption keys, etc.). A whole field of “constant time algorithms” has arisen to combat it, which is a bit unnecessary IMHO.
– 8) Profiling support
This too is mostly the domain of language and OS. The usual instrumentation methods will work on a Mill of course, but transparent instrumentation has really nasty issues with security that can only be addressed by the OS security model, not the architecture. Do you really want to give this turf the ability to dynamically change (i.e. instrument) the code being run by that turf?

In addition, like all wide machines the Mill presents an execution model that is not the single-program-sequence of a z80. What does it mean to single-step a program when it fires a dozen instructions at once, and retires them at a dozen different points in time? Our tools give us a good view of machine level debugging – but debugging and profiling Mill at the source level is just as hard as doing it for heavily optimized code on any architecture.
The tool chain does have a mode (still buggy) in which the code emulates a one-complete-instruction-at-a-time architecture, z80-ish. That helps with debugging and some profile questions like how many times was something called, but is completely useless for timing questions. These are human-factors questions for which nobody seems to have answers on any architecture.
This too is mostly the domain of language and OS. The usual instrumentation methods will work on a Mill of course, but transparent instrumentation has really nasty issues with security that can only be addressed by the OS security model, not the architecture. Do you really want to give this turf the ability to dynamically change (i.e. instrument) the code being run by that turf?
My question was, have you thought about adding tools that give the benefits of valgrind-style instrumentation without actually intercepting instructions?
Say my program is running much more slowly than I expected, and I’m trying to figure out why. Right now my options are:
– Run the program with perf, and get a coarse-grained profile of my program. I can get a flamegraph of the function calls (though it will be an averaged approximation) and a rough summary of various indicators (instructions retired, cache misses, branch mispredicts, etc).
– Run the program with cachegrind, and get a fine-grained profile of my program. The emulator will tell me exactly how many cache misses each line of code will trigger, with the caveats that they’re misses in an emulated architecture, not on my actual computer; some things like stride prefetching and reorder buffers aren’t actually emulated. Also, obviously the execution is about 10x to 50x slower.
– Use a domain-specific profiler (eg if I’m using a database and it comes with one).
– Implement my own homemade solution to measure the metrics I care about.

Of the choices above, only perf actually uses CPU instrumentation. The problem is that it’s extremely coarse; the CPU will only have counters for some types of events, to give you a summary at the end of execution. (Also, I think it needs some privileges to run.)
What I’d want is something giving me a journal of the entire execution, from which I would get the metrics I need.
For instance, say what I care about is cache misses. I would want to tell the Mill “execute this program, and every time you run a `load` instruction, log the cache level plus the instruction pointer in a journal, and save the journal to disk every 10MB”. This is a super-simplified example (if nothing else, the resulting journal would be huge), but it’s the kind of information I would want. From that journal, I could get a summary that tells me “this function has a lot of L1 misses, this function has some L2 misses, this specific line of code has lots of L3 misses and keeps stalling the entire program”, etc.

The Mill has the advantage of being fully deterministic except for a very small number of instructions; so, from a small journal, you might be able to reconstruct the entire execution trace, and find better insights than you’d get from other CPUs.
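Purely as an illustration of what such a journal might contain (nothing here is a real Mill facility; the record layout is invented):
`
#include <cstdint>

enum class HitLevel : uint8_t { L1, L2, L3, DRAM };

struct LoadTraceRecord {
    uint64_t instr_addr;   // which load instruction fired
    uint64_t data_addr;    // the effective address it loaded
    HitLevel level;        // which level of the hierarchy satisfied it
    uint32_t cycle_delta;  // cycles since the previous record, to keep records small
};
// Roughly 24 bytes per record after padding, so "save the journal every 10MB"
// would mean flushing every ~400,000 loads with this layout.
`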
Yes, we’ve thought about hardware valgrind. So far we have nothing that is convincingly useful, cheap, and safe. Most runtime analysis opens exfiltration attack surfaces – what should happen if the app being monitored calls a function to check a password? The cache history can be revealing!
On a legacy ISA you can hide things with security implications behind a process barrier, and monitor per-process. Of course that means you have a process switch, a trip through the OS, and a reschedule just to verify a password. On a Mill you can change protection environment (turf) without a process switch and no more cost than an ordinary call.
We looked at having monitoring be per-turf. However, that causes a ton of state switch at turf change, when we want it to be cheap and quick. And you rarely want to monitor everything done by a turf in all contexts: probably only while in this thread and that function and after the other counter has exhausted. That kind of thing means software in the monitoring predicate, which implies trap/predicate/update sequences, which is difficult in a wide machine and will almost certainly screw up some of the data to be collected (the predicate code will alter the cache contents, for example).
Bottom line: yes, there is some potential for such a thing, and we have looked at it, and our looking has shown that it is a more complex problem than it might appear. So we have punted it to later.
– 9) Profile-guided optimization
The tool chain does not yet have PGO support, so this answer is speculative. Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. Thus for example unrolling loops prevents software pipelining and forces spill/fill of transients that otherwise would live only on the belt for their lifetime; shut unrolling off for better performance. It is also unclear how much function and block reordering will actually pay; our best estimate now is “not much” because so much Mill control flow is collapsed in the tool chain into much larger blocks and gets executed speculatively. Exit prediction also sharply cuts the fetch overhead that reorder is intended to help.

Lastly, SIMDization (they are not really vectors in the Cray sense) can be done in the tool chain as well for the Mill as for any architecture. Our major advance is the ability to do SIMD with per-element error control. Whether apps will take advantage of that to improve their RAS is as yet unclear.
Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. Thus for example unrolling loops prevents software pipelining and forces spill/fill of transients that otherwise would live only on the belt for their lifetime; shut unrolling off for better performance.
Maybe this is a vocabulary problem, but there’s definitely a kind of unrolling that the Mill benefits from; we could call it “horizontal” unrolling as opposed to “vertical” unrolling which would be the traditional version.
But if you have code that looks like this:
`
LOOP:
CON(…),
ADD(…), LOAD(…),
STORE(…);
`
If your Mill is wide enough, you will definitely benefit from turning it into:
`
LOOP:
CON(…), CON(…), CON(…),
ADD(…), LOAD(…), ADD(…), LOAD(…), ADD(…), LOAD(…),
STORE(…), STORE(…), STORE(…);
`
I don’t know exactly what you’d call that operation, but to me it’s “unrolling” of a sort. And it definitely hits that “efficiency / code size” tradeoff that PGO helps optimize.
Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. […] It is also unclear how much function and block reordering will actually pay;
That doesn’t sound right. As long as your codegen is based on heuristics, you’ll benefit from PGO somewhere.
To give a Mill-specific example, there’s `load` delays. Your current heuristic is “as long as you can get away with”, but a PGO analysis might say “actually, this load almost always hits L1, so we can give it a delay of 3 to start this following load earlier / to better pack instructions”.

This is getting into internal subtleties as well as definitional ones. Your desired code can be reached by pipelining, by scheduling further down the pipe with no unroll. Or maybe that should be called “unrolling” too? I assume that the loop is something like
for (int i = 0; i < N; ++i) a[i] += 5;
so there is a recurrence in the update to the control variable, and the latency of the load must be dealt with. Assuming three-cycle loads, there will be three iterations in flight from load, and an add iteration, before the first store has anything to do. The pipeline logic has a (previously computed) known number of bundles to work with, and does a normal schedule that wraps around at that number. If there are unused resources (like load slots in the bundles) when it places the last instruction of the hyperblock it can just continue modulo placement until it runs out. I suppose it would be fair to call this “unrolling”, although I think of it as more piping.

As for PGO in general: I don’t see much to gain, but full disclosure: PGO is not implemented and measurement often surprises me.
– 10) Specializer hints
The tool chain already responds to some of the hint info that LLVM provides. In addition, our dumper (the IR-to-genAsm shim) extracts from LLVM’s internal structures some info known to LLVM that is not in the IR, and adds it to the genAsm. An example is the ordering dependency oracle in the genAsm for each function body, which encompasses what LLVM knows from the aliasing info and function body analysis. Generally we want to push that sort of info and analysis into the steps before genAsm, and let the specializer use the result of the analysis (such as required ordering) without doing the analysis itself.

For your specific examples (the aliasing case is sketched after the list):
* aliasing is reflected in the oracle; potentially colliding references have an oracle entry giving the textual order
* flags are local in the architecture; ops that change global flags are ordered in the oracle w/r/t ops that use them
* loops should not be unrolled; SSA form (in the IR and genAsm) eliminates most references and the pipelining preserves order across iterations for the rest
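As an illustration of the first bullet, here is the kind of aliasing fact the oracle would carry, expressed in plain C/C++ rather than Mill syntax: with `__restrict` the front end knows the store through `dst` can never collide with later loads through `src`, so no ordering entry is needed and the specializer is free to hoist the loads. The function itself is a made-up example.
`
// __restrict is the GCC/Clang/MSVC spelling of C's restrict qualifier.
void scale(float* __restrict dst, const float* __restrict src, int n, float k) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;   // loads of src[i+1], src[i+2]... may be hoisted over the store
}
`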
– 11) Feature detection
Are there any plans for the general Mill ISA to have instructions of the form “If this feature is available on the target architecture, compile this EBB, otherwise compile this other EBB”? I know something similar is planned for WebAssembly.
– 12) Any plans for over-128-bits SIMD?
In past discussions on the subject you seemed to think it was a future possibility? Now that AVX-512 has been out for a while, have you settled?
Do you think there’s benefit in having 256-bits or 512-bits vector instructions, or do you think having a very wide pipeline always provides the same benefits?
– 13) Economy cores
Recently, mobile CPUs and the Apple M1 have started to use designs based on “economy cores”. The idea being that the CPU includes a large number of small, low-frequency, less-performant cores that the device could use both as additional threads on heavy workloads, and as the only threads for when the device needs to conserve power.
Could the Mill provide a similar architecture? In the past you’ve floated the idea of a chip with both Silver and Tin cores (I think), but those would have incompatible binary formats. Have you considered ways a single device (eg smartphone) could run multiple cores with different specs while still executing the same binaries?
– 11) Feature detection
Member detection exists now; that’s how we handle per-member differences such as whether quad (128-bit) arithmetic is present. The specializer replaces non-existent genAsm operations with synthesized calls on library functions, often inlined. There is not currently any way to do feature (as opposed to member) detection in source that gets passed through; you’d use pre-processor facilities to build targeted genAsm, or select the desired genAsm in the prelinker. This may change as we get further into the OS work.

– 12) Any plans for over-128-bits SIMD?
The problem with wide SIMD is that the belt’s position widths must match the widest possible, which is wasteful for the vast majority of code that won’t use it, or we would need to have multiple belts of differing element widths with the concomitant hassles of multiple ops/transfer move instructions/etc. that plague the legacy ISAs with multiple register widths. We have pretty much settled that SIMD in a member will only be to the widest scalar width, and further parallelism will be provided by wider pipes.

As an aside: auto-SIMD with the same semantics as scalar has been ten years away for a very long time. It is worst when the code does something unexpected, like overflow. At least on a Mill you will get the same answer SIMD or scalar; good luck on legacy ISAs.
– 13) Economy cores
The Mill has economy cores in the form of economy members. What it doesn’t have is different members that execute the same binaries, but it does have different members that execute the same genAsm. This restricts the ability of the kernel to migrate a running app across different members: the code must reach a point at which program semantics remains the same but program representation can change, and then signal that it is ready to be moved to a different core with a different binary. We’re pretty sure we know how to do this, but the kernel work isn’t far enough along to be sure.

Of course, different functionality could be run dedicated on different members. Legacy chips do that now – inside your x86 chip are several completely incompatible administrative cores that run at startup. Likewise the CPU and GPU on combined chips don’t support migration. That works for the Mill too.
– 14) Function signature tags
I know you’ve said in the past that jumping to an attacker-supplied address is very unlikely to be a problem in the Mill, because the bidirectional encoding means arbitrary addresses in executable memory are very unlikely to be valid code; but still, have you considered adding some hardening for dynamic function calls? History has proven that attackers can be a *lot* more imaginative than defenders in building exploits out of seemingly unusable flaws.
Hardening might include checks that the caller function has supplied the right number and shapes of arguments, and that the target of a dynamic function call is indeed a function and not an arbitrary EBB (for instance, it could check that the EBB starts with a stackf instruction).
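As a purely software analogue of the hardening being asked about (similar in spirit to signature-based CFI schemes; all names and the hash value here are invented for illustration), a call site could carry a tag derived from the expected function type and refuse to transfer if the target’s tag differs:
`
#include <cstdint>
#include <cstdio>

struct TaggedFn {
    uint32_t sig_hash;       // hash of the function's type signature
    void (*fn)(int);
};

constexpr uint32_t kVoidIntSig = 0x7D0C0FFE;  // would be derived from the type in a real scheme

void paint(int widget_id) { std::printf("paint %d\n", widget_id); }

void call_checked(const TaggedFn& t, int arg) {
    if (t.sig_hash != kVoidIntSig) __builtin_trap();  // wrong shape: refuse to jump
    t.fn(arg);
}

int main() {
    TaggedFn target{kVoidIntSig, paint};
    call_checked(target, 42);
}
`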
– 15) Formally proving safety
Have you considered looking into formal proofs of safety?
I don’t know about processors, but the idea of formal proofs for compilers has been picking up steam lately, as exploits keep piling up.
In particular, while you’ve talked a great deal about how the Mill made Spectre-style exploits harder/impossible, do you have an actual model for what information is available to a given user through caches and side-channels, and how you might guarantee that these side-channels can’t be used to extract privileged information?
(My background here is from the Rust language, where the compiler provides guarantees that certain types of exploits can’t happen except when certain escape hatches are used. While these guarantees do hold strong in practice, there has been a major drive to figure out the formal rules the guarantees implicitly rely on, to have a better model of what code is sound and why.)
– 14) Function signature tags
By coincidence this is an active area of development for us right now. The original design used portals for every inter-module transfer. This forced portal overhead on modules in the same turf that were separately compiled and runtime-bound. For inter-turf calls it had the advantage that only exported entry points could be called. Signatures involve more than valid entry points, however; argument lists must be as expected, and it must be verified that things like buffer arguments are accessible to the caller before the callee uses the address. The current solution is NYF.

– 15) Formally proving safety
We have so far made no effort toward formal proof of anything, being too concerned with getting it working first. We will be very happy when the definition is frozen enough that proof techniques are applicable.
– 16) What have you learned from the M1?
The Apple M1 has been a bit of a shockwave in the CPU world. Everyone was surprised by what a company like Apple could get in terms of single-core performance for a consumer product, once they had control of the entire supply chain.
Are there any lessons you took from them? Anything you’ve decided to do differently after you saw their results?
– 17) What have you learned from other recent innovations?
The announcement of the Mill was almost 10 years ago, an eternity in computer time. Since then, a lot has happened, both on the hardware side (which I’m not too familiar with) and the software side. I’m thinking of WebAssembly, Fuchsia, Rust and other memory-safe languages, eBPF, io_uring, etc.
Did any of these lead you to change plans or think of new features you could add to the Mill? I know you’ve said that eg the Mill would be particularly well suited to running WebAssembly, but have you considered adding instructions specifically to help WebAssembly support? To help Rust support?
(I’m thinking of the M1 which added instructions to run JS faster, and faster atomic counters since Swift and Objective-C use lots of ref-counted pointers. Mill could probably benefit from a fast bound-checks instruction.)
– 18) Any clever ideas you’ve had lately that didn’t pan out?
I know there are a lot of clever ideas you can’t talk about because you want to patent them first, but are there some clever ideas you’ve had, that you put some research effort into, and that you ultimately discarded because they didn’t pan out in practice?
– 16) What have you learned from the M1?
We have not looked much at the M1, nor at other contemporary chip-level implementations. We are still working at the ISA level, and the hardware side is still focused on correctness and completeness.

– 17) What have you learned from other recent innovations?
Little, because we are not at the point at which such tuning is beneficial. Specific language support will need implementations of those languages and lots of experience with them – the team and I are no better at guessing where the pinch points will be than any other group of good engineers, i.e. lousy. The solution is measurement, measurement, and measurement. Buy me a beer someday and we can talk design philosophy 🙂

The solution is measurement, measurement, and measurement. Buy me a beer someday and we can talk design philosophy 🙂
Well I don’t know where you live, but I’ll be in the Berkeley area in a week, so we could maybe arrange something =D
– 18) Any clever ideas you’ve had lately that didn’t pan out?
Lots 🙁

An example in my own code: the specializer must turn the program into something that can execute on the target member, and the generated code must not exceed the physical limitations of the target. Example limitations include the number of simultaneously live objects (limited by the belt size), the number of concurrent active loads (limited by the number of retire stations), and others. Straightforward scheduling of pending instructions (an NP-complete bin-packing problem) can produce a binary that exceeds these limits. The initial specializer implementation did a schedule and then examined the resulting code for physicality violations. If it found some, it modified the program by inserting suitable spill/fill and rescue operations and rescheduled. This iterated, a process called annealing, until a valid schedule was produced.
Unfortunately we found examples where the annealing did not converge, or converged only after an impractical number of iterations or inserted operation counts. After a great deal of work, the scheduling part of the specializer was abandoned and rewritten (the predictive scheduler) to anticipate situations in which resource exhaustion was possible and avoid them. The problem is that the predictive scheduler will avoid problems that might not actually happen in the actual schedule, producing poorer code than annealing does when annealing works. Someday someone will get a sheepskin for coming up with a better scheduler, but for now we live with it.
Wow, what a pile of excellent questions! Thank you.
(It seems the forum software has been eating some of my posts. Not sure if I should repost them or just wait for the forum to catch up.)
(I think the forum’s anti-spam filter gets easily triggered by links and some kinds of formatting. I was able to repost after I removed a link from one of my answers and I changed the code formatting on the other.)
Last question, and then I should probably stop there:
– 19) Pipelining loads
The Mill’s stated strategy for hiding load latencies is to give them a long delay and fit some work in-between. If you have a pipelined loop, that means the loop reaches its steady state once its first load is retired, and the following loads can reasonably be expected to retire in sequence.
If your steady state is one cycle long (eg the same instruction repeated over and over), that means your loop will have LOAD_COUNT * LOAD_DELAY loads in-flight at any given time. Eg if your loop does three loads per cycle with a delay of 5 cycles, you can expect your loop to have 15 loads in-flight during its steady state.
The Silver Core will supposedly have 16 retire stations; in your talks, you say that you expect L2 hits to have about 10 cycles of latency. That means if you want your pipeline to run smoothly and you expect your data won’t be in the L1 cache, you can’t have more than a single load per cycle in your loop. So you can’t, say, iterate over two arrays and return the sum of their pairs.
There are some obvious mitigations (including, as we’ve discussed, software prefetching and stride prefetching). Generally speaking, how do you expect compilers to address that problem?
(I’m guessing the answer will start with “streamers” and end with “NYF”)
Streamers NYF 🙂
Deferral is limited by the encoding: how many cycles you can encode. The deferral is in a morsel and is biased by the minimum, so a Copper can defer 11 cycles and a Silver 20. This is enough to hide L1 and usually L2 in open code.
The number of retire stations is another limit that sometimes bites in open code but is especially a problem in piped loops. For compute loops a rule of thumb is two loads and a store per iteration to size the FU counts. We try to saturate to a piped one-bundle loop, so with a latency of 3 we have six in flight, which helps size the retire station count. And yes, that gives a miss any time we need to go to the L2 or further. However, this is mitigated by the cache hardware that holds whole cache lines. If the loop is stepping through arrays (typical) then only the first miss of a line will take a hit: later loads to the same line will hit in the L1 (for example, with 64-byte lines and 4-byte elements, only one load in sixteen touches a new line). So for typical line sizes you are missing on fewer than 10% of the loads regardless of where the line is.
Of course, that doesn’t help if the loop has no locality of reference: a hash table perhaps. But piping helps there too, because the pipe has several loads in flight. If the first misses, the later ones can miss too (miss under miss) without additional delay. The more loads the code can issue before one needs one to retire, the better. But that is bounded by the number of stations. Member configuration tuning is an art 🙂
As for compilers, we don’t try to do much of anything beyond piping and hoisting. Mill can do anything that can be done in other ISAs, such as prefetching in software or hardware, but we have no experience yet that suggests that we should. The big win was the architecture that split issue from retire in loads and made the load latency visible.