Mill Computing, Inc. › Forums › The Mill › Architecture › Memory
Talk by Ivan Godard – 2013-10-16 at Stanford
Slides: PowerPoint (.pptx)
This talk at the Stanford EE380 Computer Systems Colloquium:
Mostly missless memory in the Mill CPU
Avoiding the pain of cache misses in a statically-scheduled architecture
The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope. It achieves DSP-like power/performance on general-purpose codes, without reprogramming. The Mill is a wide-issue, statically scheduled design with exposed pipeline. High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, sustained. The pipeline is very short, with a mispredict penalty of only four cycles.
It is well known that exposed-pipe static scheduling yields near-perfect code with minimal power – except when there is a miss in the cache. In a conventional VLIW, a miss stalls the whole machine, whereas an out-of-order architecture can sometimes find other useful operations to execute while waiting on the memory hierarchy. The Mill uses a novel load instruction that tolerates load misses as well as hardware out-of-order approaches can, while avoiding the need for expensive load buffers and completely avoiding false aliasing. In addition, store misses are impossible on a Mill, and a large fraction of the memory traffic of a conventional processor can be omitted entirely.
The talk covers these and other technical aspects of the memory hierarchy in the Mill design.
- Anonymous, January 3, 2014 at 3:44 pm
(Moved from incorrect thread.)
I would like to mention that I think the Mill’s load semantics are *brilliant*. Absolutely genius work. People have been trying to make software prefetch work for years, and never getting something that works right.
By making it an actual load, just deferring the alias resolution, you avoid the whole problem of having a second address computation, translation, cache lookup, etc.
Three questions about loads:
1. Your example seemed to have a minimum latency; on page 22 of your slides, couldn’t I ask for 0 delay and avoid the nops? Does it actually simplify things to leave out this feature?
2. Have you played with a “load abandon” instruction for pickup loads? That would let me start a likely-to-miss load speculatively, and if it turns out to be unnecessary, reclaim the retire station without having to wait for the load to complete.
3. I know it resolves same-processor aliasing, but what about multiprocessor? Does the technique let me hoist a load above a lock (e.g load-acquire)? That would definitely be a neat feature.
Answers:
1) A 0-deferral load for when there’s no work to overlap: it turns out that it’s expensive to stall issue; modern cores don’t stop on a dime. Hence it’s cheaper to assume a D$1 cache hit and insert the requisite no-ops. However, in nearly all cases the no-ops occupy zero bits in the code stream, so they are free; see the Encoding talk.
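To make the deferral arithmetic concrete, here is a toy Python model of how a deferred load hides miss latency. The latencies and the stall rule are illustrative assumptions, not real Mill ISA specifics:

```python
# Toy model of a Mill-style deferred load (illustrative, not real Mill
# ISA semantics). A load names its retire cycle: issue + deferral.
# If memory answers by then there is no stall; otherwise the core
# stalls only for the uncovered remainder.

def stall_cycles(deferral, actual_latency):
    """Cycles the core must stall waiting for one load."""
    return max(0, actual_latency - deferral)

# Assume a D$1 hit takes 3 cycles and a miss to L2 takes 10 (made-up numbers).
D1_HIT, L2_MISS = 3, 10

# Scheduling for a D$1 hit: the hit costs nothing extra...
assert stall_cycles(deferral=D1_HIT, actual_latency=D1_HIT) == 0
# ...and a miss stalls only for the remaining 7 cycles.
assert stall_cycles(deferral=D1_HIT, actual_latency=L2_MISS) == 7
# With 10 cycles of other work to overlap, even the miss is free.
assert stall_cycles(deferral=10, actual_latency=L2_MISS) == 0
```

The answer above is then just the degenerate case: with no other work to overlap, the compiler schedules for the D$1 hit latency and pads with zero-bit no-ops.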
2) There is an abandon mechanism for pickup loads and other in-flight state. NYF, but will be covered in the Pipelining talk.
3) Locks: the Retire Stations do snoop on the coherency protocol in multicore configurations. Multicore will be a talk, but not soon.
It sounds to me like the snooping by retire stations is the same mechanism that ensures that aliasing is properly resolved on the single core, right?
And if I recall correctly, this was on an arbitrary range of memory, rather than being on a cache-line level? Does this mean that in a multicore situation, there is no “false sharing” risk?
So, at a stroke, another problem the Mill sidesteps and dances around? 🙂
Yes, no false sharing; the valid bits on each byte in cache are used by the coherence protocol so that several cores can own disjoint parts of the same line without having to swap the line back and forth. Invalidation is fire-and-forget; the writing core doesn’t have to get the line to write to it, and the sequential consistency memory model removes write-ordering issues so long as invalidations cannot cross in the mail, which is pretty easy to ensure in hardware.
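A minimal sketch of the per-byte valid-bit idea, in Python. The real coherence protocol is not public, so the structures and the fire-and-forget invalidation below are illustrative assumptions only:

```python
# Toy sketch of per-byte valid bits letting two cores own disjoint parts
# of one cache line without bouncing the line back and forth.

LINE_SIZE = 8  # bytes, for the sketch

class LineCopy:
    """One core's copy of a cache line: data plus a valid bit per byte."""
    def __init__(self):
        self.data = bytearray(LINE_SIZE)
        self.valid = [False] * LINE_SIZE

    def write(self, offset, value, others):
        # Fire-and-forget: write locally, invalidate that byte elsewhere.
        # The writer never has to fetch the line first.
        self.data[offset] = value
        self.valid[offset] = True
        for other in others:
            other.valid[offset] = False

core_a, core_b = LineCopy(), LineCopy()
core_a.write(0, 17, others=[core_b])   # core A owns byte 0
core_b.write(7, 42, others=[core_a])   # core B owns byte 7 of the same line
# Disjoint bytes: neither write disturbed the other's owned byte.
assert core_a.valid[0] and core_b.valid[7]
assert not core_a.valid[7] and not core_b.valid[0]
```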
I suspect the answer is “Not Yet Filed”, but what was the main reason behind deciding to have one global address space for all processes, apart from “because we can” in 64-bit?
I don’t really like this for various reasons and it’s about the only thing I don’t like so far. The reasons mentioned in the videos for it seem to be minor and more incidental and not the real motivating factors.
The overarching theme behind me not liking it is that it forces all code to be relocatable, i.e. all calls and jumps are indirect. Even when those instructions themselves are very efficient, they require separate loads and the use of precious belt slots.
I used to think the main reason was that there is no real prefetching, even for code, and all latency issues are covered by the load instruction. But the prediction talk says differently.
Another reason could be the mentioned but not yet explained mechanism that enables function calls without really leaving an EBB.
But when all processes think they are alone and have the full address space to themselves, and all code and data sharing is done via shared virtual memory pages, all code can be statically linked (as far as the process/program itself is concerned), with all the advantages and optimization opportunities that gives, while still having all the advantages of relocatable code without any of the disadvantages. The specializer enables the creation of perfectly laid out memory images for each program on the specific system it is run on, and the virtual address translations, which always happen anyway, do the indirections for free.
Well, some answers are not NYF 🙂 Though there wasn’t a clear deductive line that led to SAS; it evolved by fits and starts like everything else.
I’m somewhat mystified when you say “Even when those instructions themselves are very efficient, they require separate loads and the use of precious belt slots.” True label- and function-pointers do exist on a Mill, and they must be loaded and then branched or called through, just like for any other machine, but the code would be no different with private address spaces. The great majority of transfers are direct and don’t use pointers, and those have neither load nor belt positions. You can branch to a label or call a function in a single operation with no belt change. Color me confused.
And yes, the predictor machinery does do prefetching. The phasing mechanism used to embed calls in an EBB is NYF, but will be covered in the Execution talk 2/5 at Stanford.
We assumed from the beginning that any architecture needed 64-bit program spaces; a 32-bit wall is just too constraining. We never really considered supporting both 32- and 64-bit program spaces; apps are written for both on x86 solely because of the massive install base, which we don’t have. We were afraid that either 32- or 64-bit mode would become dominant by accident and the other would die due to network effects, and our double work would be wasted. So pointers had to be 64 bits, less a few we could steal if we wanted to (and it turned out we did).
Given a 64-bit address space, static linking is right out: 64-bit offsets would completely hose the encoding and icache. So what gets statically linked in such systems, and what could replace it? Two answers: code, and bss (static data). Turns out that nobody has 4GB worth of code, and all big data is held in dynamic memory (malloc, with mmap behind it), not in static, so global static is under 4GB too. Sure, you can have a program that statically declares a 100GB array – but look at the code the compiler gives you for that – you’ll see something else going on behind the scenes, if the compiler doesn’t err out right away.
So both code and static only need 32-bit offsets, off some address base. That takes care of the encoding issues, but also obviates static linking – there’s no advantage to fixing the base, because the instructions are now position-independent because they carry offsets, not addresses. Sure, you need an address-adder, but you needed that anyway to support indexing, unless you are a RISC Puritan and are willing to do individual shift and add operations, and pay for your purity in icache and issue slots. The Mill has quite conventional address modes: base, index and offset, requiring a specialized three-input adder. No big deal.
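The address mode described here is easy to state as code. A small Python sketch of the effective-address computation (the three-input add, plus the scaled index), with made-up values:

```python
# The Mill-style address mode described above:
#   effective address = base + offset + (index << scale)
# computed by one specialized three-input adder. Values are arbitrary.

def effective_address(base, offset, index=0, scale=0):
    """base: 64-bit pointer; offset: signed 32-bit manifest;
    index: belt value; scale: left shift of 0-4 bits."""
    assert 0 <= scale <= 4
    assert -2**31 <= offset < 2**31
    return (base + offset + (index << scale)) % 2**64

# Indexing element 10 of an array of 8-byte items at base+0x100:
assert effective_address(0x7000_0000_0000, 0x100, index=10, scale=3) \
       == 0x7000_0000_0150
```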
So now we have position-independent code (PIC), and 32-bit code and static data spaces within a 64-bit overall space. Are all those 64 bits going to be used by any application? Within my children’s lifetime? Only on massive multi-processor supercomputers with shared memory. However, it’s increasingly obvious that shared memory at the building scale is a bad idea, and message passing is the way to go when off-chip. At a guess, 48 bits of space is enough for any app that we want to support in the market. And how many apps will there be? Or rather, how many distinct protection environments (called a turf in Mill terminology) will there be? Or rather, how much address space in total will be used by all turfs concurrently (after all, many turfs are small) on one chip? Surely much less than 64 bits’ worth.
So a SAS is possible without running out of bits or needing more hardware. What are the advantages? The obvious one is getting the TLB out of 90+% of memory accesses. TLBs are horribly expensive on a conventional machine. They have their own cache hierarchy to hide some of the miss costs (which still run to 20% or more of cycles), and, to be fast, they are a huge part of the power budget. All that goes away with SAS. Yes, SAS still has protection in front of the cache, but that is vastly cheaper than a full-blown TLB (NYF; there will be a Protection talk). The virtual address translation does not, in fact, always happen anyway, and is very far from free 🙂
Then there are software advantages: with SAS, processes can cheaply share data at any granularity and in any location; they are not restricted to page-sized shared mmap regions that require expensive OS calls to set up. Getting the OS out of the act permits a micro-thread programming model that is essential for painless parallel programming. The OS is simpler too – it doesn’t have to track whose address space a pointer refers to.
Now all this is an intuitive argument; we certainly don’t have the experience with large programs and real OSs to measure the real costs and benefits. But we were persuaded. 🙂
>The great majority of transfers are direct and don’t use pointers, and those have neither load nor belt positions. You can branch to a label or call a function in a single operation with no belt change. Color me confused.
That’s because I was confused, due to lack of information and misinterpretation.
>Given a 64-bit address space, static linking is right out
True. Static linking should always use an offset from some implicit base too, especially in 64-bit. Whether that base is NULL or some other fixed address or the current program counter for relative addressing doesn’t matter. The important part for static linking is that you can compute the final destinations at compile time and hardcode them into the instruction stream relative to that implicit base.
Now having read your reply I get the impression that there is an implicit base for each module a process uses that is handled somewhere else that is not a belt slot, and that those implicit bases likely are initialized at module load time in some separate mapping structure in parallel to the virtual addresses.
> there’s no advantage to fixing the base
I can actually think of a few advantages to fixing the base (which of course is only possible when you don’t have a SAS). But:
>The obvious one is getting the TLB out of 90+ % of memory access.
If this indeed means that in 90% of memory accesses the virtual address is the physical address and a translation is unnecessary and not done, I fully agree that this would obliterate any advantage I have in mind. I just can’t imagine how this could feasibly be implemented; you would need to know whether an address needs translating before you actually bother to look it up. I guess this can be done with the protection zones.
If that is not the case, and you still have to do address translation on every memory access, then I guess I could start listing and explaining all those advantages I figure fixed bases and offsets have to offer, especially on an architecture like the Mill.
> you would need to know whether an address needs translating before you actually bother to look it up
Do you need to?
In the mainstream, the TLB does both mapping and protection; whereas in the Mill, translation is just mapping. Protection (turf reminds me of ‘get off my lawn’) is fast and between load/store FU and the top of the cache, whereas TLB is much slower but between the bottom of the cache and main memory (which is even slower anyway).
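A toy Python model of that split: protection is checked on every access, above a virtually addressed cache, while translation happens only on a miss, below the cache. All structures and numbers are invented for illustration; none of this is the real Mill hardware:

```python
# Toy model: PLB-style protection above a virtually addressed cache,
# TLB-style translation only below it, on a miss.

class Memory:
    def __init__(self, readable_ranges, page_table):
        self.readable = readable_ranges    # [(lo, hi)] permitted ranges
        self.page_table = page_table       # virtual page -> physical page
        self.cache = {}                    # keyed by VIRTUAL address
        self.translations = 0              # count of TLB lookups

    def load(self, vaddr):
        # Protection check on every access (cheap, byte-granular).
        if not any(lo <= vaddr < hi for lo, hi in self.readable):
            raise PermissionError(hex(vaddr))
        if vaddr in self.cache:            # hit: no translation at all
            return self.cache[vaddr]
        self.translations += 1             # miss: translate below the cache
        paddr = self.page_table[vaddr >> 12] << 12 | (vaddr & 0xFFF)
        self.cache[vaddr] = paddr          # stand-in for the fetched data
        return self.cache[vaddr]

m = Memory(readable_ranges := [(0x1000, 0x3000)], page_table={1: 7})
m = Memory([(0x1000, 0x3000)], page_table={1: 7})
m.load(0x1234); m.load(0x1234); m.load(0x1234)
assert m.translations == 1   # only the first (missing) access translated
```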
This is all filed, but was at most alluded to in the talks so far, so here’s a more complete explanation.
Code:
All transfers use one of two modes: indirect through a pointer on the belt (no offset), or (up to 32-bit manifest) signed offset from the entry address of the current function. There is also a LEA form that returns the offset-from-function-entry result as a pointer dropped to the belt.
Data:
All address modes (load, store, and LEA) comprise a base, an up-to-32-bit manifest signed offset, and optionally an index from the belt with a manifest scale factor that left-shifts the index by 0-4 bits. The base may be either a pointer from the belt or one of a small set of specRegs that contain addresses. Currently these are:
cpReg base of code region for load module
cppReg base of constant pool in load module
dpReg base of static data region
fpReg base of stack frame, or NaR if no frame
inpReg base of inbound memory-passed function arguments, or NaR if none
tpReg base of thread-local storage, or NaR if none
This list is expected to change as we discover issues when porting the OSs. In particular, cpReg and cppReg are likely to go away; there may be an outpReg added; and there may be a few levels of display added (for support of languages with closures). With these, all addresses in the code itself are static and no load-time relocation is necessary.
cpReg, cppReg, and dpReg are initialized by the loader when the process is created. fpReg and inpReg are managed directly by the hardware via the call, retn, and stackf operations. The values of all these registers are currently readable by the rd operation, but that too may go away. They are reachable by MMIO, as used by the init ROM in the power-up sequence to set the initial execution environment.
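As an illustrative sketch (register names from the list above, values invented), base-plus-offset addressing keeps code position independent: relocating a region means changing the base register, never the code:

```python
# Sketch of the base-register addressing described above: the code
# carries only a base-register name and a 32-bit offset, so the
# instruction stream itself is position independent. Register values
# below are invented for the example.

spec_regs = {
    "dpReg": 0x0000_7F00_0000_0000,  # base of static data (made up)
    "cpReg": 0x0000_4000_0000_0000,  # base of code region (made up)
}

def resolve(base_name, offset):
    """What the hardware does for a load/store/LEA of [base, offset]."""
    return spec_regs[base_name] + offset

# A static global at offset 0x48 from the data pointer:
addr = resolve("dpReg", 0x48)
# Relocating the whole static region means the loader changes dpReg;
# the instruction encoding "dpReg + 0x48" is untouched:
spec_regs["dpReg"] += 0x1000
assert resolve("dpReg", 0x48) == addr + 0x1000
```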
The TLB issues didn’t quite sit right with me, which usually is a sign I’m wrong or misunderstood something.
What I obviously was wrong about, and which became clear upon rewatching this memory talk, is that the caches are all virtually addressed, and the overhead of going through the TLB is only incurred when it is hidden by a real DRAM access on a cache miss.
That means you could still have per process address spaces and fix all the bases for direct memory access offsets and still get almost all address indirections for free through the virtual memory in the image created by the specializer. And even better than I thought before, since it only happens on cache loads and writebacks, not on every memory access.
And there are more advantages:
You could get rid of most of those per frame special base registers and replace them with one address space selector/process ID register (or integrate this into the frame ID). You would still need to encode the needed base in the load instructions though as if there were registers, so on the encoding side there is no real difference.
Those hard-coded bases/memory regions at the same time serve as the permissions, integrating them into the virtual addresses and making the separate protection tables redundant. And when those base selectors also serve as the permission, the encoding for stores and loads becomes smaller. For that to work you would have to do some cache invalidation on context switches, though, or have the process IDs as part of the key for cache lookup in some way.
Having the permissions in the virtual address itself also enables the specialization of caches and cache behavior. You know certain data is constant or write only, because of its virtual address, allowing you to adjust cache strategies or even to have separate (and smaller and faster) caches for those.
Having fixed virtual bases per process could make NUMA memory architectures (which very likely will only become more widespread in 64-bit computing) visible to the processes and compilers themselves, letting them adjust their algorithms, load strategies, memory layouts, and even process and thread scheduling accordingly.
Very big gains from fixed offset bases are to be had in code, too. In this area this would mean a pretty big change in the architecture though. On the other hand, in a way it is bringing some of the core ideas of the Mill to their full potential:
Having several fixed code pointer bases enables you to have even more instruction streams trivially. In the encoding white paper, more than two instruction streams are said to be possible via interleaving cache lines, which has some problems. A fixed base address for each instruction stream doesn’t have those problems, and it gives you all the advantages of more streams mentioned there: smaller and faster dedicated instruction caches, smaller instruction encoding, specialized decoders an order of magnitude more efficient with each additional one, etc. Although you probably will get diminishing returns when you go beyond 4 streams (likely subdivisions of functionality would be control flow, compare/data route, load/store, arithmetic).
There are some caveats with per-process address spaces with fixed memory regions. The first process that sets up the virtual memory must be treated specially, with a carefully laid out physical memory image. But considering how carefully laid out and crufted the lower memory regions on a PC are, with IO addresses, BIOS code and tables, etc., that isn’t anything that isn’t done routinely.
Also sharing addresses between processes is harder and needs to be arbitrated in some way. And again this is something that is done in most current virtual memory systems. And it is a relatively rare operation, especially in comparison to normal intra-process cross-module function calls that would be much cheaper now.
Another issue could be that actual executable code size is limited per process. But that is just a question of configuring the chip. Even in the biggest applications today, the combined text segments of all modules don’t go beyond a few hundred MB. Actually, I have never seen one where actual code size goes beyond a few dozen MB. In essence, a per-process code segment limit of a few GB is far less limiting than the actual maximum installed memory usable by all processes, especially in an architecture that is geared towards many cores, few context switches, and compact code size.
It could also be beneficial to encode the streams that are actually used after a jump into the control instructions. That spares unnecessary memory accesses, although in practice at least the control flow stream and the compare/data route stream will always be used.
This post is going quite off topic, I guess.
I also admit that I have another rather self-centered and whimsical reason for wanting a full address space of its own for each process: I’m playing around with some language design, and letting each process design its own memory layout allows it to make a lot of safe assumptions about addresses, which in turn enable or make cheaper/free a lot of language features I’m interested in.
You are well on the way to inventing capabilities 🙂 https://en.wikipedia.org/wiki/Capability-based_addressing
I truly wish that the Mill were a cap(ability) architecture. One of the Mill team is Norm Hardy, who did the KeyKos cap operating system, of which Eros and others are descendants. The first compiler I ever did was for the Burroughs B6500, which was a cap machine before caps were invented. One of the things that Andy Glew and I are in passionate agreement about is that computing needs caps.
That said: I know how to build a cap machine, but I don’t know how to sell one. Caps are fundamentally incompatible with C, because caps are secure and C is not. Cap machines like the AS/400 let you use C by giving you a sandbox that is one whole cap, in which you can be as insecure as you want within your play-PDP11. Oh well.
And fair warning: language design is an insidious vice – you have a slippery slope ahead. Hear the words of someone whose name is in the Algol68 Revised Report 🙂
I read papers and documentation on KeyKos and Coyotos. Although that was a few years ago.
And separate address spaces offer most of the advantages of capabilities without being C-incompatible. The traditional problem for separate address spaces is expensive context switches. But on multicore 64-bit processors context switches can be vastly reduced, and the Mill goes in that direction anyway. And with the cache architecture of the Mill and below-cache TLBs, context switches can become a lot cheaper too, even with separate address spaces.
And as you said yourself, it’s better to leave the OS out of as many things as possible and let the hardware take care of them, and capabilities must be OS constructs and cannot be hardware data types the way virtual addresses can. Or am I wrong here?
And yes, language design is a terrible vice. Ever since I started programming I was unhappy and annoyed and frustrated with whatever I was using. And whenever you try to find or think of ways to do things better, what you find and learn usually only reveals new annoyances that make you quickly forget about the old ones you have solved.
Is a cap machine compatible with Oberon?
Yes, as far as I know. However, there are lots of Oberons, and the checking details are messy and not well documented, so it’s not certain.
Sorry, the protection model is NYF. Yes, we do fork(). Yes, it’s a kludge. About as much a kludge as fork() itself is 🙁
Thank you very much for that detailed information.
Those are indeed quite a few base pointer special registers. I would be a bit sad to see the cpReg and cppReg go though. I can think of at least one use case where they would allow for elegant implementation of Haskell thunks/closures, where the closure memory layout/parameter description meta-data and the code thunk should be accessible via the same offset/pointer but must of course reside in different access permissions and thus need different bases.
No idea if those “levels of display” would cover this, because I have no idea what that means in this context.
> Are you thinking about closures created by a JIT, where the code could be anywhere and the code and data may be moved by the GC? Or compiled closures like lambdas in C++?
I mean compiled closures.
But since we’re getting more specific now I went and looked up this old Peyton Jones paper again, which you are most likely familiar with.
In chapter 7 he describes how Haskell core language closures could/should be implemented on stock hardware of the time. It’s very possible that most of that information is very much obsolete today, but as you have said so nicely yourself, the architectures of stock hardware haven’t changed/improved much since then.
In particular the optimization in 7.6 cannot really be done when constant data and code are in different memory protection turfs. In that case it would be beneficial to have implicit bases handy for both cases, so that one offset works for both, although the info table itself is far less referenced than the actual code.
I read through the Peyton Jones paper; I’m not sure I had ever looked at it before, though the Miranda paper I did read.
Some general comments:
1) The Mill design has generally been concerned with commercially important languages, and we haven’t put much thought into support for esoterica, which Haskell still is 🙂 However, we want to provide such support, but don’t really know enough about the issues and details at this point.
2) The scheme of using global state over function calls as a way to do tail calls doesn’t work on a Mill; call has much more complex semantics than just changing the code pointer. However, the Mill supports label variables and indirect branches, which should provide a satisfactory substitute.
3) The data structure in which there is a pointer to a closure which contains a pointer to an info record which contains a pointer to the code would be very painful on any modern machine, Mill included. This implies that a closure should contain both the code and the arguments so that it can simply be entered with a single indirection, which is free if the closure pointer is already on the belt. However, there are issues with mixed code and data, because the Mill uses a Harvard architecture with separate i$ and d$. There are also issues with protection; the PLB has byte granularity, but you *really* don’t want a PLB entry for every closure! As a naive alternative, the closure could contain the bindings and an info pointer and also a code pointer, eliminating one but not both indirections. I have to think about what the predictor consequences of that would be.
4) One possibility is to use the function pointer representation of lexically nested functions as the representation of a closure pointer. That is, a closure pointer is the pair {code pointer, binding pointer}, as is used e.g. for Pascal and other languages with lexical nesting. We expect this form to be used for lambdas in C++. The drawback is that function pointers become 128 bits, which is OK on members with quad but not so good on smaller members.
5) An alternative is to generate a thunk for each lambda or statically nested function, where the thunk loads a static pointer to the binding (or outer frame) and then jumps to the actual code. This approach is used by gcc for nested functions, I think.
6) Both of the above have issues with where to put the data when there’s no GC; not relevant for Haskell and friends, but very relevant for C++.
7) The natural place for the Node register is inpReg. There are interesting issues when the closure is in a different protection domain (“turf”) than the caller, which inpReg helps with (sorry, NYF). In particular, the caller must not be able to muck with the closure data before or during the call. This again suggests a thunk-like approach, but again the PLB can’t tolerate one thunk per closure-instance and even one per bind-clause may have performance issues.
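Option 4 above, a closure pointer as an explicit {code pointer, binding pointer} pair, can be sketched in Python (in hardware this would be a 128-bit fat pointer; all names here are invented for illustration):

```python
# Sketch of a closure represented as an explicit pair:
# {code pointer, binding pointer}, as used for lexically nested
# functions in Pascal-family languages and expected for C++ lambdas.

from typing import Callable, NamedTuple

class Closure(NamedTuple):
    code: Callable      # the function's entry point
    bindings: tuple     # the captured environment (outer frame / Node)

def call_closure(c: Closure, *args):
    """One indirection: enter c.code directly, passing the bindings."""
    return c.code(c.bindings, *args)

# A lambda capturing x, i.e. \y -> x + y:
def adder_code(bindings, y):
    (x,) = bindings
    return x + y

add3 = Closure(code=adder_code, bindings=(3,))
assert call_closure(add3, 4) == 7
```

The cost trade-off discussed in point 4 shows up directly: the pair is twice the width of a plain function pointer, but a call through it needs no info-record indirection.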
I’m probably showing my ignorance here 🙁 We will want to revisit this when the full machine is public.
> The Mill design has generally been concerned with commercially important languages, and we haven’t put much thought into support for esoterica, which Haskell still is
Fully understandable. But I also suspect that the frame contexts of which there isn’t too much revealed yet, provide everything you need.
> However, the Mill supports label variables and indirect branches, which should provide a satisfactory substitute.
I also expect the pick instruction to be very helpful there.
> The data structure in which there is a pointer to a closure which contains a pointer to an info record which contains a pointer to the code would be very painful on any modern machine
Yes, a lot has changed there. Most common uses of the info structure have been optimized away, in particular via tagging the pointers. And that is something the Mill supports better than any current hardware.
> That is, a closure pointer is the pair
You most likely wouldn’t even need two full 64bit pointers, just two smaller offsets and use the vector instructions to split them. At least in languages with a runtime.
And I’m no expert by far either, so it should be given a rest here.
Can an EBB see the stack of its caller?
Can you explain more how inpReg is used?
And, finally, cheap thread-local storage? Excellent news! Although I imagine many languages may repurpose it, if you let them use it as just another data segment.
Sorry, the whole process model is very different, but it’s all NYF. Talk#11 maybe?
But yes, cheap TLS 🙂 And micro-threads.
Using real estate for more cores in preference to threading, a choice resulting from the Mill’s other architectural features, brings to mind a question about on-chip memory architecture that, while of no immediate consequence to the Mill chip, might affect future trade-offs in real estate use.
With 14nm and higher density technologies coming on line, there is a point where it makes sense to prefer on-chip shared memory to some other uses of real estate. This raises the problem of increasing latency to on-chip memory, not only with the size of the memory but with the number of cores sharing it. In particular, it seems that with an increasing number of cores, a critical problem is reducing arbitration time among cores for shared, interleaved, on-chip memory banks. In this case, interleaving isn’t to allow a given core to issue a large number of memory accesses in rapid succession despite high latency; it is to service a large number of cores — all with low latency.
Toward that end I came up with a circuit that does the arbitration in analog which, if it works when scaled down to 14nm and GHz frequencies, might result in an architectural preference for an on-chip crossbar switch between interleaved low-latency memory banks and a relatively large number of cores.
This spans disciplines — a problem well known to folks involved with the Mill architecture, which itself spans the disciplines of software and computer architecture (rather than computer architecture and analog electronics).
I’d appreciate any feedback from folks who understand the difficulty of cross-disciplinary technology and have a better grasp of the issues than I do.
From following the link you give, I see the general outline of the idea. The challenge of implementing such a device will be to make it deterministic for larger numbers of cores and banks, which is where it would potentially have the greatest benefit. I did not see sufficient detail to make any determination regarding that aspect of the idea’s feasibility.
That said, the problem being attacked is determining which core gets access to which bank. There are several sources of latency here: the arbitration mechanism latency, the “bank is busy” latency, routing latency and RAM access latency. This idea only directly addresses the arbitration mechanism latency. By allowing a larger number of banks, it appears to indirectly help the “bank is busy” or “bank access collision” latency. Unfortunately, a larger number of banks or cores will also increase routing latency. So in the end, routing latency may merely replace arbitration latency as being the performance limiting factor.
Thank you for pointing out the challenge of routing latency. Rather than further digression from the Mill memory architecture here, can you suggest a proper forum for discussing this?
Have you tried the comp.arch newsgroup?
So, the Mill has memory accesses that are the direct result of executing instructions, but also implicit loads and stores that happen as a result of function calls, as the belt and outstanding memory accesses overspill their buffers and are stored in the general memory hierarchy. I think this would be infeasible on an architecture where any load or store might generate a fault immediately, but on the Mill it’ll just make its way through the memory hierarchy the same way anything else would, and most interesting things will happen at the TLB stage in the same manner and with the same level of decoupling as any other write or read.
How do you handle memory protection and synchronization with implicit operations? Just the same as with normal operations, possibly with a duplicated PLB and with the addresses broadcast to the retire stations?
There are essentially three layers in the spiller. At the top layer, data is stored in flip-flops, effectively registers except not addressable by the program. These are buffers, used as parking places for values that are still live on the belt and are in the result register of a functional unit, but the FU has another result coming out and needs its result register back. These live operands are first daisy-chained up the latency line within the containing FU pipeline, but eventually the pipeline runs out and the operand, if still live, gets moved to a spiller buffer. The cost of the move is the same as a register-to-register copy on a conventional machine. A running program that doesn’t do a function call or return will live entirely in these registers, with no spiller traffic below that.
The second layer is a block of SRAM, connected to those spiller registers. When you do a call operation, we save the belt, which means the belt state in pipe and spiller registers is saved. The save is lazy, but gradually live-but-inactive operands are copied out of the registers into the spiller SRAM as those registers are needed. The SRAM is big enough to hold several belts (and scratchpads and other non-memory frame-related state), so you can nest several calls deep with everything fitting in the spiller SRAM. Most programs spend most of their time within a frame working set of five or so, calling in a few, returning out a few, calling in a few, over and over. Such behavior fits entirely internal to the spiller.
However, if a program suddenly switches to a deep run of calls, a run of returns, or if there’s a task switch, the state of all those nested calls (or the new thread) will exceed the spiller’s SRAM and the spiller then uses the third level, which is the regular memory hierarchy. The spiller does not go direct to DRAM; it talks to the L2 cache instead, which provides still more buffering.
If you compare the spiller with the explicit state save used by a conventional legacy register machine, you will see that the spiller top level is akin to the register machine’s rename and architected registers; if a function fits in the registers then there’s no traffic, just as if it fits in the belt and scratchpad there’s no traffic.
If there are nested calls on a conventional then the register state gets saved to the hierarchy using normal store operations. These go to the D$1 cache and are buffered there. This is akin to the spiller’s SRAM, but the spiller has three advantages: it uses no program code to do the save, so the power, instruction entropy, and store buffer contention cost of the stores is avoided; it uses a private repository, so saves are not cluttering the D$1 (which is a very scarce resource); and it’s not program visible, removing many possibilities of program bugs and exploits.
Lastly, if you have deeply nested calls on a conventional, the saved state exceeds the capacity of the D$1 and will overflow into the D$2 and eventually DRAM. The spiller does the same when it runs out of private SRAM. Put all this together, and you see that the spiller is in effect a bunch of registers hooked to its own cache, and the overall benefit is to shift state save/restore out of the top level D$ cache and into the spiller SRAM which is in effect a private spiller cache, freeing up space for real data.
One last point: the total Mill state traffic is less than that on a conventional. A conventional callee-save protocol winds up saving registers that are in fact dead, but the callee doesn’t know that and saves them anyway. And the existence of the scratchpad on a Mill means that many function locals that would be kept in memory are in the scratchpad and so do not contribute to cache load and memory bandwidth. Combine these effects, and our sims suggest that the Mill save/restore and locals traffic is about a factor of two less than that of a conventional. This saves not only bandwidth but also power.
We do not have large-scale sims yet, so the overall results are guesstimates, but it does appear that actual DRAM traffic on a Mill will be overwhelmingly composed of I/O and very large external data sets, which have the same traffic load as on a conventional; save/restore, locals, and the working set of globals will never see DRAM. That’s why the Mill has the Virtual Zero that lets it use “memory” that has no backing DRAM.
All this needs pictures, and we will get to the spiller in the talks eventually.
p.s. Forgot to respond to your protection question:
The spiller has its own region of the address space. Spiller space is used for all processes, and the saved state of the different processes is not in the process’s address space and so cannot be directly addressed by the program. In effect you can think of the spiller as having its own private PLB, although it need not be (and so far is not) implemented that way.
Utilities like debuggers and stack unwinders get access to saved state through an API that runs with PLB rights to spiller space; in effect the API is another process. The API is restricted; you cannot arbitrarily change the downstack links for example. As a result, the usual stack-smash exploit is impossible on the Mill.
Because the application cannot address spiller space, there is no synchronization needed between app use of memory and spiller use of memory; they are necessarily disjoint.
I’m curious:
How does/will the Mill implement the semantics of C’s volatile qualifier?
From what I’ve gathered from watching the talks, the Mill always loads data from the cache hierarchy, not directly from main memory. However, the data in volatile variables (for example memory-mapped peripheral registers) can change independently of CPU activity, so cached values of volatile variables may well not be correct.

Is there a mechanism, such as designating certain memory regions as non-cacheable, to ensure that the Mill implements the semantics of volatile variables? A forum search turned up a single forum entry that contains the word volatile. In Mr. Godard’s reply #733 in topic “Side channel timing attacks” he writes:
One of those flags is “volatileData”, which forces the request to bypass private caches and go to the top shared cache.
However, going to the top shared cache (instead of to main memory itself) doesn’t appear to implement the semantics of C-language volatile variables. Since the Mill is targeted at existing code bases, which include C and C++, I assume that there is (or will be) a way to preserve the semantics of volatile. If it can be revealed at this point, I’d very much like a confirmation that the Mill will preserve the semantics of the volatile qualifier — and, if possible, how the Mill handles volatile.
Thanks in advance for any reply.
Being hardware, the Mill is not language-specific, but must support all common languages, common assembler idioms, and our best guess as to where those will evolve.
The best that we, or any design team, can do is to try to isolate primitive notions from the mass of semantic bundles and provide clean support for those primitives, letting each language compose its own semantic atoms from the primitives we supply.

The “volatile” notion from C is used for two purposes, with contradictory hardware requirements on any machine other than the PDP-11 that C is native to. One purpose is to access active memory such as MMIO registers, for which the idempotency of usual memory access is violated and cache must not be used, nor can accesses be elided because of value reuse.
The other common purpose is to ensure that the actions of an asynchronous mutator (such as another core) are visible to the program. Here too it is important that accesses are not elided, but there is no need for an access to go to physical memory; they must only go to the level at which potentially concurrent accesses are visible. Where that level is located depends on whether there are any asynchronous mutators (there may be only one core, so checking for changes would be pointless) and the cache coherency structure (if any) among the mutators.
Currently the Mill distinguishes these two cases using different behavior flags in the load and store operations. One of those is the volatileData flag mentioned in the previous post. This flag ensures that the access is to the top coherent level, and bypasses any caching above that level. On a monocore all caches are coherent by definition, so on a monocore this flag is a no-op. It is also safe for the compiler to optimize out redundant access, because the data cannot change underneath the single core.
On a coherent multicore the flag is also a no-op to the hardware but is not a no-op to the compiler: the compiler must still encode all access operations, and must not elide any under the assumption that the value has not changed, because it might have.
On an incoherent multicore (common in embedded work), a volatileData access must proceed (without caching) to the highest shared level, which may be DRAM or may be a cache that is shared by all cores. Again, the compiler must not elide accesses.
For the other usage of C volatile keyword, namely MMIO, the access is to an active object and must reach whatever level is the residence of such objects. On the Mill that level is the aperture level, below all caches and immediately above the actual memory devices, and the flag noCache ensures that an access will reach that level. Again, the compiler must not elide noCache accesses.
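To make the two cases concrete, here is a minimal C sketch of both uses of volatile as distinguished above. The device address and status-bit layout are invented for illustration; on a Mill, the MMIO read would carry the noCache flag, while the shared flag would use volatileData.

```c
#include <stdint.h>
#include <stdbool.h>

/* Use 1: MMIO -- an active object, so every access must reach the
   device itself. UART_STATUS is a hypothetical register address;
   on a Mill this access would use the noCache flag to reach the
   aperture level below all caches. */
#define UART_STATUS ((volatile uint32_t *)0x4000C018u)

bool uart_ready(void)
{
    return (*UART_STATUS & 0x1u) != 0;   /* each call is a real bus access */
}

/* Use 2: an asynchronous mutator -- another core sets the flag.
   Accesses must not be elided, but need only reach the top coherent
   level (the volatileData case), not physical memory. */
volatile int stop_requested;

void worker_loop(void)
{
    while (!stop_requested) {   /* re-read from memory every iteration */
        /* ... do work ... */
    }
}
```

Note that plain C volatile conflates the two; the Mill’s separate flags let a compiler pick the cheaper semantics per access.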
Besides the noCache flag, the Mill also maps all of the physical address space to a fixed position (address zero) within the (much larger) shared virtual space. An access within that address range is still cached (unless it is noCache) but bypasses the TLB and its virtual address translation mechanism. NoCache addresses outside the physAddr region (and volatileData accesses if they get past the caches) are translated normally.
There are other access control flags, but that’s enough for now 🙂
Does “the Mill also maps all of the physical address space to a fixed position (address zero) within the (much larger) shared virtual space” not mean you have a virtual alias for the whole physical address space?
Yes, it means exactly that. This part of the address space is still protected by the normal Mill memory protection mechanism so an operating system can ensure that nobody unwittingly has an actual alias though.
Is there any plan for the Mill to support some kind of transactional memory, and in general how does it address atomic read-modify-write operations? Perhaps this will be addressed in the future multi-core talk?
There are no RMW ops. The Mill uses optimistic concurrency using the top level cache for the write set, much like the IBM and Intel optimistic facility. It’s been mentioned somewhere, but I don’t remember if it was in the talks or here, or maybe comp.arch.
Using the L1 as the write set makes sense, as the normal coherency protocol informs you of conflicts. But how is this exposed to the SW? AFAIR, OCC needs Begin/Modify/Validate/Rollback operations. Modify is easy – just normal stores. But there seems to be no way around some form of HW support for Validate, which means some sort of explicit Begin. It also seems very hard for SW to safely do a Rollback, preventing another thread from seeing the middle (muddled) state of a transaction. So does the Mill have something similar to Intel’s RTM (explicit begin/end transaction instructions), handling the Begin/Validate/Rollback operations?
There are begin, abort, and commit ops. The only unusual parts of the Mill implementation are that you can have both transactional and non-transactional memory references while in the sequence, and how we handle failure recovery (which is NYF).
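Since the Mill’s actual begin/abort/commit encodings are NYF, the pattern can only be illustrated with a software analog: a C11 atomic compare-and-swap retry loop, which follows the same optimistic discipline of snapshot, private modify, and publish-only-if-nothing-intervened that hardware TM applies to a write set held in the top-level cache.

```c
#include <stdatomic.h>

/* Software analog of an optimistic transaction:
   - "begin":    take a snapshot of the shared value
   - "modify":   compute the new value privately
   - "validate + commit": CAS publishes only if no other thread
     changed the value since the snapshot
   - "rollback": on CAS failure, the snapshot is reloaded and the
     whole sequence retries */
long optimistic_add(_Atomic long *counter, long delta)
{
    long seen = atomic_load(counter);        /* begin: snapshot */
    long desired;
    do {
        desired = seen + delta;              /* modify, privately */
    } while (!atomic_compare_exchange_weak(counter, &seen, desired));
    return desired;
}
```

A failed CAS leaves no partially written state visible to other threads, which is the property the questioner asks about for Rollback.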
The linked article mentions the Mill CPU architecture just once, in a sentence so vague that I cannot honestly tell what the author believes to be true about the Mill:
BTW, this is a major reason I’m skeptical of the Mill architecture. Putting aside arguments about whether or not they’ll live up to their performance claims and that every chip startup I can think of failed to hit their power/performance targets, being technically excellent isn’t, in and of itself, a business model.
The first word of the above quote is such a vague reference (the previous paragraphs were about memory barriers and what is/isn’t guaranteed on multiprocessor systems), that I cannot tell what the author is trying to say about the Mill. Has anyone else read the linked article, and better understood what the author was trying to convey?
I missed the Mill reference when I read the article the other day.
The paragraph before the one you quote says:
This is long enough without my talking about other architectures so I won’t go into detail, but if you’re wondering why anyone would create a spec that allows that kind of crazy behavior, consider that before rising fab costs crushed DEC, their chips were so fast that they could run industry standard x86 benchmarks of real workloads in emulation faster than x86 chips could run the same benchmarks natively.
He then says, as you quoted:
BTW, this is a major reason I’m skeptical of the Mill architecture. Putting aside arguments about whether or not they’ll live up to their performance claims and that every chip startup I can think of failed to hit their power/performance targets, being technically excellent isn’t, in and of itself, a business model.
So I think he’s saying that being technically excellent isn’t going to sell chips. He says it didn’t sell Alpha chips?
Like you, I am a little unsure of my interpretation 🙂
I read it to say that he recognized technical excellence but had doubts about business viability anyway. Unclear though.
It’s possibly ambiguous, but I don’t think he was being skeptical of the Mill memory model. I think he was being skeptical of there being any business for non-x86 chips even if they are better. That was my reading of that part, anyway.
When I read the article the other day (proggit discussion), I cherry picked some of the technical problems he raised and summarised what the Mill was doing in each area:
- TLBs and TLB misses: translation of addresses is after the cache hierarchy; the cache is in virtual addresses and the memory is shared so there’s no translation needed in IPC and context switches.
- locks: we’re 100% hardware transactional memory (HTM) on which you can build conventional locks; but you can jump ahead and write your code to use HTM directly for a big performance gain
- Syscalls and context-switches: there aren’t any context-switches; we’re Single Address Space (SAS). Syscalls are much much faster, and you aren’t restricted to a ring architecture (hello efficient microkernels and sandboxing!)
- SIMD: the Mill is MIMD. Much more of your code is auto-vectorizable and auto-pipelinable on the Mill, and your compiler will be able to find this parallelism
- Branches: we can do much more load hoisting and speculation. We’re not Out-of-Order (OoO), so it’s swings and roundabouts and we often come out ahead
Feel free to follow up with any technical questions related to the article 🙂
Regarding context switches, how much of the permissions buffer is cached? Given how it operates in parallel with L1 cache, I’m thinking the permissions cache would be about the same size. What I’m unclear on is how much of the total that represents and how often a portal call will need to load new permission table elements. Then again, this could be handled along with the call prefetch.
Regarding SIMD, I recently read a report on some benchmarking with a test case resembling
int a,b,c,d;
…
loop_many_times
{
a++; b++; d++;
}

that ran slower than a test case applying the increment to all 4 vars. The case incrementing all 4 could use x86-family SIMD. Applied to the Mill, I can see this case being implemented by loading from all of a through d and then applying a null mask on the var not being altered.
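For concreteness, the masked-increment idea might look like this with x86 SSE2 intrinsics (purely illustrative; the Mill would express the neutral element differently, as the replies below describe):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Increment a, b, and d but not c by adding a step vector whose
   "c" lane holds the additive identity 0 -- the same
   nullifying-element idea, just spelled in x86 SSE2. */
void inc_abd(int32_t v[4])
{
    __m128i vals = _mm_loadu_si128((__m128i *)v);
    __m128i step = _mm_setr_epi32(1, 1, 0, 1);   /* lanes a, b, c, d */
    _mm_storeu_si128((__m128i *)v, _mm_add_epi32(vals, step));
}
```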
Regarding PLB size:
Consider the size of a high-end conventional L1 TLB; it might contain 64 4K page entries, 32 2MB page entries and 4 1GB pages.
The conventional L1 TLB has to do the address translation before the load from L1 cache itself; the translation and lookup are serial.
This is why the L1 TLB is forced to be small to be fast, and hasn’t been growing in recent high-end OoO superscalar microarchitectures. They have actually been adding an L2 TLB and so on because of this problem.
A recent article on conventional CPUs actually counts TLB evictions for various real syscalls:
Some of these syscalls cause 40+ TLB evictions! For a chip with a 64-entry d-TLB, that nearly wipes out the TLB. The cache evictions aren’t free, either.
Now consider the situation for the Mill PLB: the entries are arbitrary ranges (rather than some page count), and it has as many cycles as the actual L1 lookup to do its protection check… it can be large and slow as its work is in parallel to the lookup.
Now this really emphasises the real and practical advantages of a virtual cache and Single Address Space architecture 🙂
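To put a number on the reach of that example L1 TLB, a quick back-of-envelope sum (using the illustrative entry counts above, not any particular chip’s):

```c
#include <stdint.h>

/* Reach of the example high-end L1 TLB: 64 x 4 KB pages,
   32 x 2 MB pages, 4 x 1 GB pages. Almost all of the reach
   comes from the handful of huge pages. */
uint64_t l1_tlb_reach(void)
{
    return 64ull * (4ull << 10)    /* 64 x 4 KB = 256 KB */
         + 32ull * (2ull << 20)    /* 32 x 2 MB =  64 MB */
         +  4ull * (1ull << 30);   /*  4 x 1 GB =   4 GB */
}
```

So a 100-entry TLB covers about 4 GB only because of the four 1 GB entries; with 4 KB pages alone it covers just 256 KB, which is why syscall-driven evictions hurt so much.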
On the second question about SIMD: exactly! 🙂
Excess slots in a vector can be filled with None, 0, 1, 1.0, or whatever value nullifies those elements for the operations to be performed.

Let me add to Will’s response: the great majority of accesses don’t go to the PLB in the first place, but are checked by a Well Known Region. WKRs have their own hardware access control; there is one for the stack, one for the code, and one for all the global space, including .data, .bss, and the heap up to the point where you overflow the huge allocation you get at startup.
As a practical matter, PLB is used for regions that were mmap(MAP_SHARED), for stack segments other than the top segment, and for portal vectors. Important, yes, especially the MAP_SHARED when it is used for direct access to file buffers to avoid copy overhead (a Mill feature), but not a large fraction of all the references a program makes.
I’m a digital logic designer but not a processor designer. Possibly naïve questions follow.
1. At 44:35 in the video: when the retire stations monitor stores and see a match, why do they rerequest the load? Can’t they grab that outbound data instead of requesting what just flew by?
2. At 1:13:00: more of a general cache eviction question. Instead of waiting until an eviction is forced, can some *small* number of LRU lines be proactively pushed downward BUT still kept in cache, such that if a cache line is needed, one of those LRU lines would be instantaneously available? This would likely require an extra bit per line to indicate that it’s mirrored at the next level away.
#1: they could, and perhaps some members might be implemented that way. However the store might be only partly overlapping the load, so the logic to do the grab might have to do a shift and merge, which is non-trivial hardware and there are a lot of retire stations. The L1 already has the shift-and-merge logic (it must, to put stored data into the right place in the cache line), and aliasing interference is rare, so it is cheaper to let the store go to the L1 and reissue the load from the station.
Note that the first try for the load will have caused the loaded line to be hoisted in the cache hierarchy, so the retry will find the data (both the newly stored part and any that had not been overwritten) higher – and faster to get – than it did on the first try.
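A byte-level sketch of what the grab would involve (a hypothetical helper, not Mill hardware) shows why the partial-overlap merge is non-trivial to replicate in every retire station:

```c
#include <stdint.h>
#include <string.h>

/* Merge an in-flight store's bytes into a load's buffer: only the
   bytes where the store range [st_addr, st_addr+st_len) intersects
   the load range [load_addr, load_addr+load_len) are patched in.
   Each retire station would need this shift-and-merge mux; the L1
   already has it, so the Mill reissues the load instead. */
void merge_store(uint64_t load_addr, uint8_t *load_buf, unsigned load_len,
                 uint64_t st_addr, const uint8_t *st_data, unsigned st_len)
{
    uint64_t lo = load_addr > st_addr ? load_addr : st_addr;
    uint64_t load_end = load_addr + load_len;
    uint64_t st_end = st_addr + st_len;
    uint64_t hi = load_end < st_end ? load_end : st_end;
    if (lo >= hi)
        return;                              /* no overlap: nothing to do */
    memcpy(load_buf + (lo - load_addr),      /* shift into place */
           st_data + (lo - st_addr),
           (size_t)(hi - lo));
}
```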
#2: Cache policies are full of knobs, levers and buttons in all architectures, and Mill is no exception. It is quite likely that both the complement of policy controls and their settings will vary among Mill family members. What you suggest is one possible such policy. The current very-alpha-grade policies in the sim try to preemptively push updates down, except if the line has never been read; this distinction is an attempt to avoid pushing partial lines that are output-only, to avoid the power cost of the repeated partial pushes. This and other policies are certain to change as we get more code through and start to move off sim so we get real power numbers. Fortunately, none of this is visible in the machine model so apps can ignore it all.
A couple of questions–sorry if they’re basic:
1. You mention that when the return operation cuts back a stack, it clears the valid bits on the stack frame’s cache lines. Does the clearing of the valid bits have to cascade to all levels of cache?
2. Unless I’m mistaken, the TLB is a cache of PTEs and might not contain all the PTEs in the system (i.e. it’s a cache over operating-system tables, right?). You mention in the talk that during a load miss that reaches the TLB and also misses in the TLB, the TLB directly returns a zero, without having to go to main memory. Wouldn’t the TLB have to go to main memory for PTEs, even if it doesn’t have to go to main memory for the actual value to be returned, at least some of the time? Are you using a data structure that makes this unlikely (i.e. you can answer “not found” queries without having access to the whole set of PTEs in the TLB), or is it just that you have a large TLB and the “well known region” registers cover a lot of what would otherwise be PTEs, which makes it likely that all PTEs are in the TLB?
Thanks for the answers. I’m a software guy and not a hardware guy, so I’m sorry if the questions betray a lack of understanding.
Sound questions.
1) Frame exit invalidation is done in the Well Known Region for the stack. It is not necessary to clear the valid bits, because the exited frame is no longer in the user’s address space; return automatically cuts back the stack permission WKR. However, there is an issue with inter-core references to foreign stacks. Core B has no WKR for core A’s stack, so in normal mode threads cannot reference each other’s stacks and threads cannot reference their own exited frames. However, there is a legacy mode in which inter-stack references are permitted. In legacy mode the entire stack, both used and unused, has a PLB entry for the turf. So if there are two legacy threads running in the same turf then they can browse each other’s stacks, and can browse the rubble in their own stack. We have never found a way to restrict inter-stack references to only the live portion; all practical implementations seem to require exposing rubble too.
2) The implementation of the TLB is per-member; it is transparent to the programmer. As currently in the sim, a miss in the TLB requires a search of the PTE tables. However, unallocated regions of the 60-bit address space do not need PTEs and do not occupy physical memory. If the search leads to an address where a PTE might be but is not, then the Implicit Zero logic will return a synthetic line of zeroes. The PTE format is carefully arranged so that a zero PTE means that the region is unallocated in DRAM – and the load in turn returns implicit zero. Thus the page table is in virtual memory, not physical, and there are only entries for mapped space.
Thus the miss will trigger a search, which may miss, which triggers a search… The Tiny Alice algorithm we use bounds the recursion to the number of distinct range sizes (currently 5), but in practice the intermediate entries are in the LLC and search depth is nearly always <= 2.
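A toy model of the implicit-zero rule during the walk, with invented helper names (not the real Tiny Alice structures): a PTE fetch that lands in unbacked memory yields a synthetic zero line, a zero PTE means “unallocated”, and the load therefore returns zero without DRAM ever being touched.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stub "memory": only one PTE line is actually backed in this toy.
   Returns true and fills *out if addr is backed, else false. */
static bool fetch_line(uint64_t addr, uint64_t *out)
{
    if (addr == 0x1000) { *out = 0x2000; return true; }  /* mapped PTE */
    return false;                                        /* unbacked  */
}

uint64_t load_with_implicit_zero(uint64_t pte_addr)
{
    uint64_t pte;
    if (!fetch_line(pte_addr, &pte))
        pte = 0;      /* unbacked memory reads as a synthetic zero line */
    if (pte == 0)
        return 0;     /* zero PTE: region unallocated, value loads as 0 */
    return pte;       /* mapped: continue translation (elided here)    */
}
```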
I have a question about hoisted deferred loads and function boundaries.
Consider a relatively large function that requires multiple loads. I would expect at the start of the function a flurry of address calculations and deferred loads, as much as possible before getting on with the rest of its functionality, in order to hide cache/DRAM latency as much as possible. I might even call it a ‘deferred load preamble’ – not officially, but I could see it being a common enough pattern to recognize it.
So my first question: Does this scenario sound reasonable? Would you expect it to be that common?
Now let’s extend it. Break up the function into three smaller functions. Let’s assume it’s very simple and you can just group instructions together into their own functions, with outputs flowing to inputs etc. So instead of one big section at the beginning where all the loads are issued, each smaller function has its own ‘deferred load preamble’. This would mean that e.g. the last of the three was not able to defer its loads as far and may suffer more from memory latency issues.
Does this also sound reasonable? Is it just the compiler’s (|| specializer’s) responsibility to inline functions and hoist loads as much as possible, or does Mill hardware offer any mitigation to this issue? It’s not OoO, so I wouldn’t really expect it to “peek ahead” to see those loads, but then again the Mill’s durability to speculation would really help such an implementation.
Thoughts?
Short functions with loads tend to have extra noops or stalls for the reason you give. A sufficiently smart compiler could issue prefetches in the caller, just as can be done for any architecture. But the general Mill approach is inlining.
Most ISAs inline to avoid call overhead, but calls are cheap on a Mill. We inline to give the machine width more to work on. On a wider Mill you can inline surprisingly large functions with zero latency cost in the expanded caller, because the called ops fit in the schedule holes of the caller. Of course there is usually some increase in the code space used, but removing the call and return means less thrashing in the cache lines, so net it’s a win unless profiling shows the call is never taken.
We’re still tuning the inlining heuristics in the tool chain. As a rule of thumb, any function shorter than a cache line is worth inlining on icache grounds alone.
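The caller-side prefetch alternative mentioned above can be sketched in portable GCC/Clang C; process_table here is a hypothetical callee, not a Mill or library function.

```c
/* Stand-in callee whose "deferred load preamble" we want to feed:
   it walks a table that may not be in cache. */
static long process_table(const long *table, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += table[i];
    return sum;
}

/* Caller issues prefetches before the call, so the callee's loads
   mostly hit even though the callee couldn't hoist them itself. */
long call_with_prefetch(const long *table, int n)
{
    for (int i = 0; i < n; i += 8)            /* 8 longs = one 64-byte line */
        __builtin_prefetch(&table[i], 0, 3);  /* read access, keep cached  */
    return process_table(table, n);
}
```

Inlining subsumes this, of course: once the callee is inlined, the specializer can hoist the deferred loads into the caller’s schedule directly.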
SAS means there are no homonyms on the Mill. Fair enough. However, what about address synonyms? Memory with copy-on-write semantics is obviously not an issue, so this has no effect on fork(), but I seem to recall someone mentioning that all of physical RAM is mapped at startup, and there are definitely a few neat tricks involving doubly-mapped pages (such as fast circular buffers and heap compaction without pointer changes).
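For reference, the fast-circular-buffer synonym trick is doable today on Linux by mapping the same pages at two adjacent virtual addresses; a sketch (Linux-specific memfd_create assumed, error handling omitted for brevity):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

/* Map the same physical pages twice, back to back, so a write that
   runs off the end of the buffer lands at its start with no wrap
   arithmetic. size must be a multiple of the page size. */
uint8_t *make_ring(size_t size)
{
    int fd = memfd_create("ring", 0);      /* anonymous backing file */
    ftruncate(fd, (off_t)size);
    /* reserve 2*size of address space, then map the fd into both halves */
    uint8_t *base = mmap(NULL, 2 * size, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mmap(base, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + size, size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    return base;   /* base[i] and base[i + size] alias the same byte */
}
```

On the Mill the same effect presumably falls out of the synonym question above: two virtual ranges translating to the same physical pages, which the protection machinery must account for.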