Forum Replies Created
- in reply to: Grab bag of questions #3814
Last question, and then I should probably stop there:
– 19) Pipelining loads
The Mill’s stated strategy for hiding load latencies is to give them a long delay and fit some work in-between. If you have a pipelined loop, that means the loop reaches its steady state once its first load is retired, and the following loads can reasonably be expected to retire in sequence.
If your steady state is one cycle long (eg the same instruction repeated over and over), that means your loop will have LOAD_COUNT * LOAD_DELAY loads in-flight at any given time. Eg if your loop does three loads per cycle with a delay of 5 cycles, you can expect your loop to have 15 loads in-flight during its steady state.
The Silver core will supposedly have 16 retire stations; in your talks, you say that you expect L2 hits to have about 10 cycles of latency. That means that if you want your pipeline to run smoothly and you expect your data won’t be in the L1 cache, you can’t have more than a single load per cycle in your loop. So you can’t, say, iterate over two arrays and return the sum of their pairs.
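For concreteness, even a trivial loop like this (plain C++, made-up names) already needs two loads per iteration:
`
#include <cstddef>
#include <cstdint>

// Two loads per iteration. Pipelined at one iteration per cycle with a
// 10-cycle load delay, that's ~20 loads in flight in the steady state,
// which is already more than 16 retire stations.
int64_t sum_pairs(const int32_t* a, const int32_t* b, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] + b[i];   // one load from each array
    return sum;
}
`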
There are some obvious mitigations (including, as we’ve discussed, software prefetching and stride prefetching). Generally speaking, how do you expect compilers to address that problem?
(I’m guessing the answer will start with “streamers” and end with “NYF”)
- in reply to: Grab bag of questions #3805
(It seems the forum software has been eating some of my posts. Not sure if I should repost them or just wait for the forum to catch up.)
- in reply to: Grab bag of questions #3811
(I think the forum’s anti-spam filter gets easily triggered by links and some kinds of formatting. I was able to repost after I removed a link from one of my answers and I changed the code formatting on the other.)
- in reply to: Grab bag of questions #3778
– 16) What have you learned from the M1?
The Apple M1 has been a bit of a shockwave in the CPU world. Everyone was surprised by what a company like Apple could get in terms of single-core performance for a consumer product, once they had control of the entire supply chain.
Are there any lessons you took from them? Anything you’ve decided to do differently after you saw their results?
– 17) What have you learned from other recent innovations?
The announcement of the Mill was almost 10 years ago, an eternity in computer time. Since then, a lot has happened, both on the hardware side (which I’m not too familiar with) and the software side. I’m thinking of WebAssembly, Fuchsia, Rust and other memory-safe languages, eBPF, io_uring, etc.
Did any of these lead you to change plans or think of new features you could add to the Mill? I know you’ve said that eg the Mill would be particularly well suited to running WebAssembly, but have you considered adding instructions specifically to help WebAssembly support? To help Rust support?
(I’m thinking of the M1, which added instructions to run JS faster, and faster atomic counters since Swift and Objective-C use lots of ref-counted pointers. The Mill could probably benefit from a fast bounds-check instruction.)
– 18) Any clever ideas you’ve had lately that didn’t pan out?
I know there are a lot of clever ideas you can’t talk about because you want to patent them first, but are there some clever ideas you’ve had, that you put some research effort into, and that you ultimately discarded because they didn’t pan out in practice?
- in reply to: Grab bag of questions #3777
– 14) Function signature tags
I know you’ve said in the past that jumping to an attacker-supplied address is very unlikely to be a problem in the Mill, because the bidirectional encoding means arbitrary addresses in executable memory are very unlikely to be valid code; but still, have you considered adding some hardening for dynamic function calls? History has proven that attackers can be a *lot* more imaginative than defenders in building exploits out of seemingly unusable flaws.
Hardening might include checks that the caller function has supplied the right number and shapes of arguments, and that the target of a dynamic function call is indeed a function and not an arbitrary EBB (for instance, it could check that the EBB starts with a stackf instruction).
– 15) Formally proving safety
Have you considered looking into formal proofs of safety?
I don’t know about processors, but the idea of formal proofs for compilers has been picking up steam lately, as exploits keep piling up.
In particular, while you’ve talked a great deal about how the Mill made Spectre-style exploits harder/impossible, do you have an actual model for what information is available to a given user through caches and side-channels, and how you might guarantee that these side-channels can’t be used to extract privileged information?
(My background here is from the Rust language, where the compiler provides guarantees that certain types of exploits can’t happen except when certain escape hatches are used. While these guarantees do hold strong in practice, there has been a major drive to figure out the formal rules the guarantees implicitly rely on, to have a better model of what code is sound and why.)
- in reply to: Grab bag of questions #3776
– 11) Feature detection
Are there any plans for the general Mill ISA to have instructions of the form “If this feature is available on the target architecture, compile this EBB, otherwise compile this other EBB”? I know something similar is planned for WebAssembly.
– 12) Any plans for over-128-bits SIMD?
In past discussions on the subject, you seemed to think it was a future possibility. Now that AVX-512 has been out for a while, have you settled on an answer?
Do you think there’s a benefit in having 256-bit or 512-bit vector instructions, or do you think a very wide pipeline always provides the same benefits?
– 13) Economy cores
Recently, mobile CPUs and the Apple M1 have started to use designs based on “economy cores”: the CPU includes a large number of small, low-frequency, less-performant cores that the device can use both as additional threads on heavy workloads, and as the only active cores when it needs to conserve power.
Could the Mill provide a similar architecture? In the past you’ve floated the idea of a chip with both Silver and Tin cores (I think), but those would have incompatible binary formats. Have you considered ways a single device (eg smartphone) could run multiple cores with different specs while still executing the same binaries?
- in reply to: Grab bag of questions #3833
When we have looked into this, we rather quickly concluded that the expected fix-the-stack-and-branch approach was hopeless. Instead we will need to define a “tailcall” hardware operation that does all the internals like a call but reuses (and extends) the pre-existing frame in a very Mill-specific way. Details TBD.
Right, that’s why I was bringing it up. You need a dedicated tail-call instruction; otherwise compilers can’t implement it at all.
- in reply to: Grab bag of questions #3829
In the same vein, Wasmtime recently had an RFC for adding tail-calls:
In general, we want Wasmtime to support well-designed, standards-track WebAssembly proposals like the tail calls proposal. The proposal (currently in phase 4) is mature, stable, and implemented in V8 and JSC.
A variety of Wasm producers will and already do target Wasm tail calls. In clang, C++20’s coroutines lower to calls annotated with LLVM’s musttail attribute, which fails to compile to Wasm unless the tail calls proposal is enabled. LLVM’s optimizer will additionally turn regular calls into tail calls when the proposal is enabled. Compiling to Wasm with a continuation passing style where everything is a tail call is a natural fit for some domains, such as lowering a finite state machine to Wasm. Many functional languages guarantee proper tail calls and have to bend over backwards to deliver them without the Wasm tail calls proposal.
- in reply to: Grab bag of questions #3808
There is no special architectural support for tail calling at present. The LLVM-based compiler and tool chain can convert recursion to tail calling loops
I think there are some programs / languages which explicitly rely on tail calls, in ways that can’t easily be converted to loops by TCO. Eg mutually-recursive functions and tail calls to dynamic functions, to give the most obvious examples.
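A minimal illustration (a made-up example, not from any real codebase): a mutually-recursive pair where every call is in tail position, but neither function recurses into itself, so the simple “turn self-recursion into a loop” transformation doesn’t apply directly:
`
#include <cstdint>

// Every call is a tail call, but neither function calls itself, so converting
// self-recursion to a loop doesn't help; without guaranteed tail calls, an
// unoptimized build will blow the stack for large n.
bool is_odd(uint64_t n);

bool is_even(uint64_t n) { return n == 0 ? true  : is_odd(n - 1); }
bool is_odd(uint64_t n)  { return n == 0 ? false : is_even(n - 1); }
`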
I don’t have enough experience in functional languages to know the details; but there’s been some discussion about this subject on the WebAssembly tail-calls proposal page:
Since we don’t use wasm function parameters at all, we can assume all wasm type signatures of our functions are i32(). The returned i32 indicates the call target (all calls become tail calls). Then we can use a trampolined mini-interpreter which given the first function to enter, repeatedly invoke the function, perform some side-effects, retrieve the next call target, then use the call_indirect instruction to call into the next target. This is the current approach of asterius.

Now, asterius is deprecated and GHC now includes wasm targets; I haven’t checked if it uses the same tricks, but I assume it does.
tl;dr: For Haskell and other functional languages, you either need hardware support for tail calls or you accept that the language is going to route every call through a trampoline in the worst case (unless the compiler can provably find and remove every tail call instance ahead of time).
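For what it’s worth, here’s roughly what that trampoline looks like, using the is_even/is_odd pair from my earlier reply (a sketch in C++ rather than wasm, with made-up names):
`
#include <cstdint>

// Mirrors the approach in the quote: every "call" just returns an integer
// naming the next function, and a trampoline loop dispatches on it.
// (Made-up example; a real compiler would use a function table / call_indirect.)
enum Target : int32_t { DONE_TRUE, DONE_FALSE, IS_EVEN, IS_ODD };

struct Step { Target next; uint64_t arg; };

static Step is_even_step(uint64_t n) {
    return n == 0 ? Step{DONE_TRUE, 0} : Step{IS_ODD, n - 1};
}
static Step is_odd_step(uint64_t n) {
    return n == 0 ? Step{DONE_FALSE, 0} : Step{IS_EVEN, n - 1};
}

// The trampoline: constant stack depth no matter how long the call chain is,
// at the cost of one dispatch per "tail call".
static bool trampoline(Target entry, uint64_t arg) {
    Step s{entry, arg};
    for (;;) {
        switch (s.next) {
            case DONE_TRUE:  return true;
            case DONE_FALSE: return false;
            case IS_EVEN:    s = is_even_step(s.arg); break;
            case IS_ODD:     s = is_odd_step(s.arg);  break;
        }
    }
}
// trampoline(IS_EVEN, 1000000) == true, with O(1) stack usage.
`
The asterius approach described in the quote is the same idea, just with i32 function-table indices and call_indirect instead of an enum and a switch.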
- in reply to: Grab bag of questions #3810
The cause of the exit transfer doesn’t matter for prediction, with the exception of returns. Most ebbs (hyperblocks) have several conditional ways to exit, one of which is taken; the predictor says where the taken one will go, in which cycle, and the kind of transfer (call/return/branch). The prediction itself doesn’t care whether it was something dynamic.
I’m not sure we understand each other here.
Say you have a loop iterating over a list of dynamic objects (say, widgets):
`
for (size_t i = 0; i < this->child_widgets.size(); i++) {
    this->child_widgets[i]->paint();
}
`
Given the same EBB (the loop body), each entry might have a different exit. So for each loop iteration, the predictor would be given the same EBB address, and need to predict a different exit address (the paint call of the given vtable).
I guess history prediction, agree prediction, etc would help. If your widget list is [Button, Label, Checkbox, Button, Label, Checkbox, Button, ...], the predictor can probably guess “If you got into this EBB by returning from Button::paint, then the exit is probably Label::paint” from history. I’m not sure how much that helps? If you have a lot of method calls on heterogeneous object lists, and/or if you have lists that change regularly, wouldn’t that saturate the predictor? On the other hand, modern predictors are pretty big (around 4096 entries according to a Cloudflare article I found).

Anyway, my point was, given the example code above, the compiler has enough information to predict every call to the paint method hundreds of cycles in advance of it happening. Say each paint call takes about 100 cycles on average, and the compiler knows that. The compiler could insert prefetch calls to tell the CPU “prefetch the vtable entry for child_widgets[i + 3]” and mask the DRAM load latency. Then “predict that next time this EBB is entered, its exit will be child_widgets[i + 1].paint()” and avoid the mispredict even if the history has not caught up yet.

Of course, I don’t know how useful any of this would be in practice. From what I know of the subject, likely-style hints aren’t used by the compiler because branch predictors actually outperform compiler hints, to the point that the hints are a hindrance more than anything else. But I don’t know how much this is accidental vs fundamental to the prediction problem.
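To make the “insert prefetch calls” part concrete, here’s a rough source-level sketch (GCC/Clang’s __builtin_prefetch standing in for whatever prefetch op the Mill would expose; the lookahead of 3 iterations is an arbitrary assumption):
`
#include <cstddef>
#include <vector>

struct Widget { virtual void paint() = 0; virtual ~Widget() = default; };

// Touch the object (and therefore its vptr) a few iterations ahead, so the
// indirect paint() target can be resolved without stalling on DRAM.
void paint_all(const std::vector<Widget*>& child_widgets) {
    const size_t n = child_widgets.size();
    for (size_t i = 0; i < n; i++) {
        if (i + 3 < n)
            __builtin_prefetch(child_widgets[i + 3]);  // hypothetical lookahead
        child_widgets[i]->paint();
    }
}
`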
- in reply to: Grab bag of questions #3809
Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. Thus for example unrolling loops prevents software pipelining and forces spill/fill of transients that otherwise would live only on the belt for their lifetime; shut unrolling off for better performance.
Maybe this is a vocabulary problem, but there’s definitely a kind of unrolling that the Mill benefits from; we could call it “horizontal” unrolling as opposed to “vertical” unrolling which would be the traditional version.
For instance, if you have code that looks like this:
`
LOOP:
CON(…),
ADD(…), LOAD(…),
STORE(…);
`
If your Mill is wide enough, you will definitely benefit from turning it into:
`
LOOP:
CON(…), CON(…), CON(…),
ADD(…), LOAD(…), ADD(…), LOAD(…), ADD(…), LOAD(…),
STORE(…), STORE(…), STORE(…);
`
I don’t know exactly what you’d call that operation, but to me it’s “unrolling” of a sort. And it definitely hits that “efficiency / code size” tradeoff that PGO helps optimize.
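At the source level this is nothing exotic; it’s just processing three iterations’ worth of independent work per trip, roughly like the sketch below (made-up names). The “horizontal” part is that a wide enough Mill can then schedule the three copies side by side in the same instructions rather than stacking them one after another.
`
#include <cstddef>
#include <cstdint>

// Three independent iterations per trip; on a wide machine their
// con/add/load/store operations can issue side by side.
void add_k(int32_t* dst, const int32_t* src, int32_t k, size_t n) {
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        dst[i]     = src[i]     + k;
        dst[i + 1] = src[i + 1] + k;
        dst[i + 2] = src[i + 2] + k;
    }
    for (; i < n; i++)   // leftover iterations
        dst[i] = src[i] + k;
}
`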
Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. […] It is also unclear how much function and block reordering will actually pay;
That doesn’t sound right. As long as your codegen is based on heuristics, you’ll benefit from PGO somewhere.
To give a Mill-specific example, there are load delays. Your current heuristic is “as long as you can get away with”, but a PGO analysis might say “actually, this load almost always hits L1, so we can give it a delay of 3 to start the following load earlier / to better pack instructions”.
- in reply to: Grab bag of questions #3802
Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. Thus for example unrolling loops prevents software pipelining and forces spill/fill of transients that otherwise would live only on the belt for their lifetime; shut unrolling off for better performance.
Maybe this is a vocabulary problem, but there’s definitely a kind of unrolling that the Mill benefits from; we could call it “horizontal” unrolling as opposed to “vertical” unrolling which would be the traditional version.
For instance, if you have code that looks like this:
LOOP: CON(...), ADD(...), LOAD(...), STORE(...);
If your Mill is wide enough, you will definitely benefit from turning it into:
`
LOOP:
CON(…), CON(…), CON(…),
ADD(…), LOAD(…), ADD(…), LOAD(…), ADD(…), LOAD(…),
STORE(…), STORE(…), STORE(…);
`
I don’t know exactly what you’d call that operation, but to me it’s “unrolling” of a sort. And it definitely hits that “efficiency / code size” tradeoff that PGO helps optimize.
Many, perhaps nearly all PGO optimizations are counter-productive on a Mill. […] It is also unclear how much function and block reordering will actually pay;
Either way, as long as your codegen is based on heuristics, you’ll benefit from PGO somewhere.
To give a Mill-specific example, there are load delays. Your current heuristic is “as long as you can get away with”, but a PGO analysis might say “actually, this load almost always hits L1, so we can give it a delay of 3 to start the following load earlier / to better pack instructions”.
- in reply to: Grab bag of questions #3804
The solution is measurement, measurement, and measurement. Buy me a beer someday and we can talk design philosophy 🙂
Well I don’t know where you live, but I’ll be in the Berkeley area in a week, so we could maybe arrange something =D
- in reply to: Grab bag of questions #3801
This too is mostly the domain of language and OS. The usual instrumentation methods will work on a Mill of course, but transparent instrumentation has really nasty issues with security that can only be addressed by the OS security model, not the architecture. Do you really want to give this turf the ability to dynamically change (i.e. instrument) the code being run by that turf?
My question was, have you thought about adding tools that give the benefits of valgrind-style instrumentation without actually intercepting instructions?
Say my program is running much more slowly than I expected, and I’m trying to figure out why. Right now my options are:
– Run the program with perf, and get a coarse-grained profile of my program. I can get a flamegraph of the function calls (though it will be a sampled approximation) and a rough summary of various indicators (instructions retired, cache misses, branch mispredicts, etc).
– Run the program with cachegrind, and get a fine-grained profile of my program. The emulator will tell me exactly how many cache misses each line of code will trigger, with the caveats that they’re misses in an emulated architecture, not on my actual computer; some things like stride prefetching and reorder buffers aren’t actually emulated. Also, obviously the execution is about 10x to 50x slower.
– Use a domain-specific profiler (eg if I’m using a database and it comes with one).
– Implement my own homemade solution to measure the metrics I care about.

Of the choices above, only perf actually uses CPU instrumentation. The problem is that it’s extremely coarse; the CPU will only have counters for some types of events, to give you a summary at the end of execution. (Also, I think it needs some privileges to run.)
What I’d want is something giving me a journal of the entire execution, from which I would get the metrics I need.
For instance, say what I care about is cache misses. I would want to tell the Mill “execute this program, and every time you run a load instruction, log the cache level plus the instruction pointer in a journal, and save the journal to disk every 10MB”. This is a super-simplified example (if nothing else, the resulting journal would be huge), but it’s the kind of information I would want. From that journal, I could get a summary that tells me “this function has a lot of L1 misses, this function has some L2 misses, this specific line of code has lots of L3 misses and keeps stalling the entire program”, etc.

The Mill has the advantage of being fully deterministic except for a very small number of instructions; so, from a small journal, you might be able to reconstruct the entire execution trace and find better insights than you’d get from other CPUs.
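Purely to illustrate what I mean by a journal (nothing like this exists; the record layout is made up), each entry could be as small as:
`
#include <cstdint>

// Hypothetical per-load journal record: which load it was and where it hit.
// 10MB of these is on the order of a million records, which could be
// post-processed into per-function / per-line cache-miss summaries.
struct LoadRecord {
    uint64_t ip;           // address of the load instruction
    uint8_t  cache_level;  // e.g. 1 = L1 hit, 2 = L2, 3 = L3, 0 = DRAM
};
`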
- in reply to: Grab bag of questions #3800
One problem is that the needs and possible optimizations are heavily domain dependent: “inner loop” screams “scientific programming”, whereas more conventional code can actually be detuned by keeping extra state for “optimization”.
What kind of extra state are you thinking of that would be a hindrance?