Forum Replies Created
- in reply to: Grab bag of questions #3780
Wow, what a pile of excellent questions! Thank you.
- in reply to: Grab bag of questions #3794
– 14) Function signature tags
By coincidence this is an active area of development for us right now. The original design used portals for every inter-module transfer. This forced portal overhead on modules in the same turf that were separately compiled and runtime-bound. For inter-turf calls it had the advantage that only exported entry points could be called. Signatures involve more than valid entry points, however: argument lists must be as expected, and it must be verified that things like buffer arguments are accessible to the caller before the callee uses the address. The current solution is NYF.
- in reply to: Grab bag of questions #3793
– 13) Economy cores
The Mill has economy cores in the form of economy members. What it doesn’t have is different members that execute the same binaries, but it does have different members that execute the same genAsm. This restricts the ability of the kernel to migrate a running app across different members: the code must reach a point at which program semantics remains the same but program representation can change, and then signal that it is ready to be moved to a different core with a different binary. We’re pretty sure we know how to do this, but the kernel work isn’t far enough along to be sure.

Of course, different functionality could be run dedicated on different members. Legacy chips do that now – inside your x86 chip are several completely incompatible administrative cores that run at startup. Likewise the CPU and GPU on combined chips don’t support migration. That works for Mill too.
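To make that concrete, here is a rough sketch of what such a representation-neutral point might look like from the application side. Everything in it is hypothetical: the intrinsic name mill_migration_point and the chunked-loop shape are invented for illustration, not anything that has been published.

```cpp
// Hypothetical sketch only: mill_migration_point() is an invented name for
// whatever "I am at a semantics-stable point, feel free to move me" signal
// the kernel ends up defining. The idea is that at this point all live
// state is meaningful in genAsm terms, so execution could resume on a
// different member running a different binary specialized from the same
// genAsm.
extern "C" void mill_migration_point();   // hypothetical kernel hook

void process_chunk(int* chunk, int len);  // ordinary, member-specific code

void long_running_job(int* data, int n) {
    const int kChunk = 1024;
    for (int i = 0; i < n; i += kChunk) {
        int len = (n - i < kChunk) ? (n - i) : kChunk;
        process_chunk(data + i, len);     // member-specific representation here
        mill_migration_point();           // representation may change here
    }
}
```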
- in reply to: Grab bag of questions #3792
– 12) Any plans for over-128-bit SIMD?
The problem with wide SIMD is that the belt’s position widths must match the widest possible, which is wasteful for the vast majority of code that won’t use it; or we would need multiple belts of differing element widths, with the concomitant hassles of multiple ops/transfer move instructions/etc. that plague the legacy ISAs with multiple register widths. We have pretty much settled that SIMD in a member will only be to the widest scalar width, and further parallelism will be provided by wider pipes.

As an aside: auto-SIMD with the same semantics as scalar has been ten years away for a very long time. It is worst when the code does something unexpected, like overflow. At least on a Mill you will get the same answer SIMD or scalar; good luck on legacy ISAs.
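A minimal example of the kind of loop auto-vectorizers target, to show where the overflow hazard comes from; the loop itself is my own illustration, not Mill code:

```cpp
#include <cstddef>
#include <cstdint>

// A plain loop an auto-vectorizer will happily SIMDize.
// In C++, signed overflow is undefined behaviour, so when in[i] * k actually
// overflows, a compiler that vectorizes (or otherwise transforms) this loop
// is not obliged to reproduce the result the scalar loop would have given.
// The point above is that Mill SIMD lanes use the same overflow behaviour
// as the scalar operations, so SIMD and scalar agree either way.
void scale(int32_t* out, const int32_t* in, int32_t k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}
```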
- in reply to: Grab bag of questions #3791
– 11) Feature detection
Member detection exists now; that’s how we handle per-member differences such as whether quad (128-bit) arithmetic is present. The specializer replaces non-existent genAsm operations with synthesized calls on library functions, often inlined. There is not currently any way to do feature (as opposed to member) detection in source that gets passed through; you’d use pre-processor facilities to build targeted genAsm, or select the desired genAsm in the prelinker. This may change as we get further into the OS work.
- in reply to: Grab bag of questions #3790
– 10) Specializer hints
The tool chain already responds to some of the hint info that LLVM provides. In addition, our dumper (IR to genAsm shim) extracts from LLVM’s internal structures some info that LLVM knows but that is not in the IR, and adds it to the genAsm. An example is the ordering dependency oracle in the genAsm for each function body, which encompasses what LLVM knows from the aliasing info and function body analysis. Generally we want to push that sort of info and analysis into the steps before genAsm, and let the specializer use the result of the analysis (such as required ordering) without doing the analysis itself.

For your specific examples:
* aliasing is reflected in the oracle; potentially colliding references have an oracle entry giving the textual order
* flags are local in the architecture; ops that change global flags are ordered in the oracle w/r/t ops that use them
* loops should not be unrolled; SSA form (in the IR and genAsm) eliminates most references and the pipelining preserves order across iterations for the rest
- in reply to: Grab bag of questions #3789
– 9) Profile-guided optimization
The tool chain does not yet have PGO support, so this answer is speculative. Many, perhaps nearly all, PGO optimizations are counter-productive on a Mill. Thus, for example, unrolling loops prevents software pipelining and forces spill/fill of transients that otherwise would live only on the belt for their lifetime; shut unrolling off for better performance. It is also unclear how much function and block reordering will actually pay; our best estimate now is “not much”, because so much Mill control flow is collapsed in the tool chain into much larger blocks and gets executed speculatively. Exit prediction also sharply cuts the fetch overhead that reordering is intended to help.

Lastly, SIMDization (these are not really vectors in the Cray sense) can be done in the tool chain as well for the Mill as for any architecture. Our major advance is the ability to do SIMD with per-element error control. Whether apps will take advantage of that to improve their RAS is as yet unclear.
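As an illustration of the unrolling point (my own sketch; the belt-residency description paraphrases the answer above rather than measured behaviour):

```cpp
// A loop the Mill tool chain would rather software-pipeline than unroll.
// The transient t is born and dies within one iteration, so it can live
// its whole life on the belt and never needs a spill/fill.
void saxpy(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i) {
        float t = a * x[i];   // belt-resident transient
        y[i] += t;
    }
}
```

A PGO-driven 8x unroll of the same loop keeps eight such transients live at once across one long block; once the live values outgrow what the belt holds, the specializer has to spill them and fill them back, and the unrolled body is also no longer in the shape the pipeliner wants.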
- in reply to: Grab bag of questions #3788
– 8) Profiling support
This too is mostly the domain of language and OS. The usual instrumentation methods will work on a Mill of course, but transparent instrumentation has really nasty issues with security that can only be addressed by the OS security model, not the architecture. Do you really want to give this turf the ability to dynamically change (i.e. instrument) the code being run by that turf?

In addition, like all wide machines the Mill presents an execution model that is not the single-program-sequence of a z80. What does it mean to single-step a program when it fires a dozen instructions at once, and retires them at a dozen different points in time? Our tools give us a good view of machine level debugging – but debugging and profiling Mill at the source level is just as hard as doing it for heavily optimized code on any architecture.
The tool chain does have a mode (still buggy) in which the code emulates a one-complete-instruction-at-a-time architecture, z80-ish. That helps with debugging and some profile questions like how many times was something called, but is completely useless for timing questions. These are human-factors questions for which nobody seems to have answers on any architecture.
- in reply to: Grab bag of questions #3787
– 7) Benchmarking support
Nearly all of this topic is at the OS and language levels, not the architectural level. There are some admin operations presently defined – bulk purging the cache for example – and there may be more added as needed, but most such things, interrupt control for example, are the property of various system services and most systems don’t make them available for use directly by apps. Of course, you could write your benchmark to run on a bare machine and get control at power up…
- in reply to: Grab bag of questions #3786
– 6) Coroutines
This functionality is subsumed under streamers, which are NYF. The semantics is slightly different, but nearly all use of such coroutines is to implement streams, and at present we do not expect to separately support stackless coroutines beyond whatever the compiler/language system provides. There has been some talk about what we should do to help microthread systems, but frankly we’re too ignorant of the needs to do much without more experience.
- in reply to: Grab bag of questions #3785
– 5) Exception handling
Mill has first-class C++ exceptions in a way that is NYF (Not Yet Filed, sorry). The mechanism is also used for setjmp/longjmp. A consequence of the design is that exceptions work in the kernel code itself, and do not require trapping to recovery code nor use of the file system.

The facility is not entirely zero-cost, because there are setup and control instructions needed to define the exception environment, and these do cost code space. However, there is no latency cost if nothing is thrown. The latency of a caught throw is roughly the same as the equivalent sequence of return instructions and destructor calls if directly executed, mostly due to mispredicts in the control flow of the exception path; there are no table searches or file lookups. The net result is that Mill exceptions are a practical control-flow technique that can be used, and thrown, in performance-critical and kernel code.
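For a concrete picture of that cost model, here is ordinary C++ (nothing Mill-specific appears in the source; the function names are invented for the example):

```cpp
#include <stdexcept>
#include <string>

struct Guard {
    ~Guard();                    // destructor that must run during unwinding
};

int parse(const std::string& s); // assume this may throw std::runtime_error

int parse_or_default(const std::string& s, int dflt) {
    Guard g;
    try {
        return parse(s);         // per the answer above: no latency cost on
                                 // this path when parse() does not throw
    } catch (const std::runtime_error&) {
        return dflt;             // a caught throw costs roughly the return
                                 // and ~Guard() calls it replaces, with no
                                 // table searches or file lookups
    }
}
```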
- in reply to: Grab bag of questions #3784
– 4) Tail calls
There is no special architectural support for tail calling at present. The LLVM-based compiler and tool chain can convert recursion to tail-calling loops. This is actually easier on a Mill than on a register architecture, because with the belt you don’t have to overwrite register contents; you just use the new values, as you do in a loop.
- in reply to: Grab bag of questions #3783
– 3) Inner-loop specific optimizations
Anything of this sort would be member-specific, and none of the current members do such things. There has been some thought though; at one point we considered a “do(n)times” branch form. One problem is that the needs and possible optimizations are heavily domain-dependent: “inner loop” screams “scientific programming”, whereas more conventional code can actually be detuned by keeping extra state for “optimization”.
- in reply to: Grab bag of questions #3782
– 2) Hardware prefetching
Data prefetch is per-member. The present members at Silver and above have stride prefetching defined; others have none. We haven’t poked at anything more complicated, and do not expect to do so pending customer experience with real data and apps.
- in reply to: Grab bag of questions #3781
– 1) Exit prediction with dynamic addresses
The cause of the exit transfer doesn’t matter for prediction, with the exception of returns. Most ebbs (hyperblocks) have several conditional ways to exit, one of which is taken; the predictor says where the taken one will go, in which cycle, and the kind of transfer (call/return/branch). The prediction itself doesn’t care whether it was something dynamic. Calls and returns use a predictor-maintained stack of return positions rather than an actual address in the prediction; the return prediction entry itself just says “return” and pops the entry.

Your question seems to assume that prediction is per transfer site, as in per vtable. It’s not; it’s per transferred-to location: if you got here somehow, then you will next get there. History prediction can use a chain (actually a hash) of heres. If a region of code contains a bunch of vtable transfers on the same object, the first will predict that the object will be of the same type (and transfer the same way) as the prior object – and miss if wrong. But the history will then kick in and subsequent transfers will predict the transfer sequence of the last time an object of the same type was processed. There’s a per-member tradeoff between the size of the predictor and the fanout of types that the predictor can handle.
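To ground the vtable case, here is an ordinary C++ polymorphic loop of the kind the answer describes; the prediction behaviour noted in the comments restates the explanation above and is not anything visible in the source:

```cpp
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};
struct Circle : Shape { double r = 0;  double area() const override; };
struct Square : Shape { double s = 0;  double area() const override; };

double total_area(const std::vector<Shape*>& shapes) {
    double sum = 0.0;
    for (const Shape* p : shapes) {
        // The indirect call transfers to whichever area() the vtable selects.
        // As explained above, prediction keys on the transferred-to location,
        // not on this call site: the first virtual call on an object is
        // predicted to go where the previous object's call went (a miss if
        // the dynamic type changed), after which history prediction follows
        // the transfer sequence last seen for objects of that type.
        sum += p->area();
    }
    return sum;
}
```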