Memory

staff

Talk by Ivan Godard – 2013-10-16 at Stanford

Slides: PowerPoint (.pptx)
This talk at Stanford EE380 Computer Systems Colloquium

Mostly missless memory in the Mill CPU

Avoiding the pain of cache misses in a statically-scheduled architecture

The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope. It achieves DSP-like power/performance on general purpose codes, without reprogramming. The Mill is a wide-issue, statically scheduled design with exposed pipeline. High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, sustained. The pipeline is very short, with a mispredict penalty of only four cycles.

It is well known that exposed-pipe static scheduling yields near-perfect code with minimal power – except when there is a miss in the cache. In a conventional VLIW, a miss stalls the whole machine, whereas an out-of-order architecture can sometimes find other useful operations to execute while waiting on the memory hierarchy. The Mill uses a novel load instruction that tolerates load misses as well as hardware out-of-order approaches can do, while avoiding the need for expensive load buffers and completely avoiding false aliasing. In addition, store misses are impossible on a Mill, and a large fraction of the memory traffic of a conventional processor can be omitted entirely.

The talk covers these and other technical aspects of the memory hierarchy in the Mill design.

[deleted]

(Moved from incorrect thread.)

I would like to mention that I think the Mill’s load semantics are *brilliant*. Absolutely genius work. People have been trying to make software prefetch work for years, and never getting something that works right.

By making it an actual load, just deferring the alias resolution, avoid the whole problem of haiving a second address computation, translation, cache lookup, etc.

Three questions about loads:

1. Your example seemed to have a minimum latency; on page 22 of your slides, couldn’t I ask for 0 delay and avoid the nops? Does it actually simplify things to leave out this feature?

2. Have you played with a “load abandon” instruction for pickup loads? That would let me start a likely-to-miss load speculatively, and if it turns out to be unnecessary, reclaim the retire station without having to wait for the load to complete.

3. I know it resolves same-processor aliasing, but what about multiprocessor? Does the technique let me hoist a load above a lock (e.g load-acquire)? That would definitely be a neat feature.

ivan

Answers:

1) 0-latency deferred load when there's no work to overlap. Turns out that it's expensive to stall issue; modern cores don't stop on a dime. Hence it's cheaper to assume a D$1 cache hit and insert the requisite no-ops. However, in nearly all cases the no-ops occupy zero bits in the code stream, so they are free; see Encoding.

2) There is an abandon mechanism for pickup loads and other in-flight state. NYF, but will be covered in the Pipelining talk.

3) Locks: the Retire Stations do snoop on the coherency protocol in multicore configurations. Multicore will be a talk, but not soon.

Will_Edwards

It sounds to me like the snooping by retire stations is the same mechanism that ensures that aliasing is properly resolved on the single core, right?

And if I recall correctly, this was on an arbitrary range of memory, rather than being on a cache-line level? Does this mean that in a multicore situation, there is no "false sharing" risk?

So, at a stroke, another problem the Mill sidesteps and dances around? :)

ivan

Yes, no false sharing; the valid bits on each byte in cache are used by the coherence protocol so that several cores can own disjoint parts of the same line without having to swap the line back and forth. Invalidation is fire-and-forget; the writing core doesn't have to get the line to write to it, and the sequential consistency memory model removes write-ordering issues so long as invalidations cannot cross in the mail, which is pretty easy to ensure in hardware.

imbecile

I suspect the answer is "Not Yet Filed", but what was the main reason behind deciding to have one global address space for all processes apart from "Because we can" in 64bit.

I don't really like this for various reasons and it's about the only thing I don't like so far. The reasons mentioned in the videos for it seem to be minor and more incidental and not the real motivating factors.

The overarching theme behind me not liking it is that it forces all code to be relocatable, i.e. all calls and jumps are indirect. Even when those instructions themselves are very effcient, they require separate loads and the use of precious belt slots.

I used to think the main reason was because there is no real prefetching, even for code, and all latency issues are covered by the load instruction. But the prediction talk says differently.

Another reason could be the mentioned but not yet explained mechanism that enables function calls without really leaving an EBB.

But, when all processes think they are alone and have the full address space for themselves and all code and data sharing is done via shared virtual memory pages, all code can be statically linked (as far as the process/program itself is concerned), with all the advantages and optimization opportunities that gives, while still having all the advantages of relocatable code without any of the disadvantages. The specializer enables the creation of perfectly laid out memory images for each program in that specific system it is run on, and the virtual address translations, that always happen anyway, do the indirections for free.

ivan

Well, some answers are not NYF :-) Though there wasn't a clear deductive line that led to SAS; it evolved by fits and starts like everything else.

I'm somewhat mystified when you say "Even when those instructions themselves are very effecient, they require separate loads and the use of precious belt slots." True label- and function-pointers do exist on a Mill, and they must be loaded and then branched or called through, just like for any other machine, but the code would be no different with private address spaces. The great majority of transfers are direct and don't use pointers, and those have neither load nor belt positions. You can branch to a label or call a function in a single operation with no belt change. Color me confused.

And yes, the predictor machinery does do prefetching. The phasing mechanism used to embed calls in an EBB is NFY, but will be covered in the Execution talk 2/5 at Stanford.
We assumed from the beginning that any architecture needed 64-bit program spaces; a 32-bit wall is just too constraining. We never really considered supporting both 32- and 64-bit program spaces; apps are written for both on x86 solely because of the massive install base, which we don't have. We were afraid that either 32- or 64-bit mode would become dominant by accident and the other would die due to network effects, and our double work would be wasted. So pointers had to be 64-bits, less a few we could steal if we wanted to (and it turned out we did).

Given a 64-bit address space, static linking is right out: 64-bit offsets would completely hose the encoding and icache. So what gets statically linked in such systems, and what could replace it? Two answers: code, and bss (static data). Turns out that nobody has 4GB worth of code, and all big data is held in dynamic memory (malloc, with mmap behind it), not in static, so global static is under 4GB too. Sure, you can have a program that statically declares a 100GB array - but look at the code the compiler gives you for that - you'll see something else going on behind the scenes, if the compiler doesn't err-out right away.

So both code and static only need 32-bit offsets, off some address base. That takes care of the encoding issues, but also obviates static linking - there's no advantage to fixing the base, because the instructions are now position-independent because they carry offsets, not addresses. Sure, you need an address-adder, but you needed that anyway to support indexing, unless you are a RISC Puritan and are willing to do individual shift and add operations, and pay for your purity in icache and issue slots. The Mill has quite conventional address modes: base, index and offset, requiring a specialized three-input adder. No big deal.

So now we have position-independent code (PIC), and 32-bit code and static data spaces within 64-bit overall space. Are all those 64-bits going to be used by any application? Within my children's lifetime? Only on massive multi-processor supercomputers with shared memory. However, it's increasingly obvious that shared memory at the building scale is a bad idea, and message passing is the way to go when off-chip. At a guess, 48 bits of space is enough for any app that we want to support in the market. And how many apps will there be? Or rather, how many distinct protection environments (called a turf in Mill terminology) will there be. Or rather, how much address space in total will be used by all turfs concurrently (after all, many turfs are small) on one chip? Surely much less than 64 bits.

So a SAS is possible without running out of bits or needing more hardware. What are the advantages? The obvious one is getting the TLB out of 90+ % of memory access. TLBs are horribly expensive on a conventional. They have their own cache hierarchy to hide some of the miss costs (which still run 20% or more cycles), and to be fast are a huge part of the power budget. All that goes away with SAS. Yes, SAS still has protection in front of the cache, but that is vastly cheaper than a full-blown TLB (NYF; there will be a Protection talk). The virtual address translation does not need to happen anyway, and is very far from free :-)

Then there are software advantages: with SAS, processes can cheaply share data at any granularity and in any location; they are not restricted to page-sized shared mmap regions that require expensive OS calls to set up. Getting the OS out of the act permits a micro-thread programming model that is essential for painless parallel programming. The OS is simpler too - it doesn't have to track whose address space a pointer refers to.

Now all this is an intuitive argument; we certainly don't have the experience with large programs and real OSs to measure the real costs and benefits. But we were persuaded. :-)

imbecile

>The great majority of transfers are direct and don’t use pointers, and those have neither load nor belt positions. You can branch to a label or call a function in a single operation with no belt change. Color me confused.

That's because I was confused due to lack of information and misinterpreting.

>Given a 64-bit address space, static linking is right out

True. Static linking should always be an offset to some implicit base too, especially in 64bit. Whether that base is NULL or some other fixed address or the current program counter for relative addressing doesn't matter. The important part for static linking is that you can compute the final destinations at compile time and hardcode them into the instruction stream relative to that implicit base.

Now having read your reply I get the impression that there is an implicit base for each module a process uses that is handled somewhere else that is not a belt slot, and that those implicit bases likely are initialized at module load time in some separate mapping structure in parallel to the virtual addresses.

> there’s no advantage to fixing the base

I can actually think of a few advantages to fixing the base (which of course is only possible when you don't have a SAS). But:

>The obvious one is getting the TLB out of 90+ % of memory access.

When this indeed means that in 90% of the memory accesses the virtual address is the physical address and a translation is unnecessary and not done, I fully agree that this would obliterate any advantage I have in mind. I just can't imagine how this would be feasibly implemented, you would need to know whether an address needs translating before you actually bother to look it up. I guess this can done with the protection zones.

If that is not the case, and you still have to do address translation on every memory access, then I guess I could start listing and explaining all those advantages I figure fixed bases and offsets have to offer, especially on an architecture like the Mill.

Will_Edwards

> you would need to know whether an address needs translating before you actually bother to look it up

Do you need to?

In the mainstream, the TLB does both mapping and protection; whereas in the Mill, translation is just mapping. Protection (turf reminds me of 'get off my lawn') is fast and between load/store FU and the top of the cache, whereas TLB is much slower but between the bottom of the cache and main memory (which is even slower anyway).

imbecile

> Do you need to?

Yes, but I guess i can be done cheaply with the separate protection mechanisms.

ivan

This is all filed, but was at most alluded to in the talks so far. so here's a more complete explanation.

Code:
All transfers use one of two modes: indirect through a pointer on the belt (no offset), or (up to 32-bit manifest) signed offset from the entry address of the current function. There is also a LEA form that returns the offset-from-function-entry result as a pointer dropped to the belt.

Data:
All address modes (load, store, and LEA) comprise a base, an up-to-32-bit manifest signed offset, and optionally an index from the belt with a manifest scale factor that left-shifts the index by 0-4 bits. The base may be either a pointer from the belt or one of a small set of specRegs that contain addresses. Currently these are:
cpReg base of code region for load module
cppReg base of constant pool in load module
dpReg base of static data region
fpReg base of stack frame, or NaR if no frame
inpReg base of inbound memory-passed function arguments, or NaR if none
tpReg base of thread-local storage, or NaR if none

This list is expected to change as we discover issues when porting the OSs. In particular, cpReg and cppReg are likely to go away; there may be an outpReg added; and there may be a few levels of display added (for support of languages with closures). With these, all addresses in the code itself are static and no load-time relocation is necessary.

cpReg, cppReg, and dpReg are initialized by the loader when the process is created. fpReg and inpReg are managed directly by the hardware via the call, retn, and stackf operations. The values of all these registers are currently readable by the rd operation, but that too may go away. They are reachable by MMIO, as used by the init ROM in the power-up sequence to set the initial execution environment.

imbecile

Thank you very much for that detailed information.

Those are indeed quite a few base pointer special registers. I would be a bit sad to see the cpReg and cppReg go though. I can think of at least one use case where they would allow for elegant implementation of Haskell thunks/closures, where the closure memory layout/parameter description meta-data and the code thunk should be accessible via the same offset/pointer but must of course reside in different access permissions and thus need different bases.
No idea if those "levels of display" would cover this, because I have no idea what that means in this context here.

ivan

A "display" is a mechanism to implement lexically nested functions.

Are you thinking about closures created by a JIT, where the code could be anywhere and the code and data may be moved by the GC? Or compiled closures like lambdas in C++?

imbecile

> Are you thinking about closures created by a JIT, where the code could be anywhere and the code and data may be moved by the GC? Or compiled closures like lambdas in C++?

I mean compiled closures.

But since we're getting more specific now I went and looked up this old Peyton Jones paper again, which you are most likely familiar with.

In chapter 7 he describes how haskell core language closures could/should be implemented in stock hardware of the time. It's very possible that most of that information very much obsolete today, but as you have said so nicely yourself, the architectures of stock hardware haven't changed/improved much since then.

In particular the optimization in 7.6 cannot really be done when constant data and code are in different memory protection turfs. In that case it would be beneficial to have implicit bases handy for both cases, so that one offset works for both, although the info table itself is far less referenced than the actual code.

ivan

I read through the Peyton Jones paper; not sure how long ago I looked at if ever, though the Miranda paper I did read.

Some general comments:
1) The Mill design has generally been concerned with commercially important languages, and we haven't put much thought into support for esoterica, which Haskell still is :-) However, we want to provide such support, but don't really know enough about the issues and details at this point.

2) The scheme of using global state over function calls as a way to do tail calls doesn't work on a Mill; call has much more complex semantics than just changing the code pointer. However, the Mill supports label variables and indirect branches, which should provide a satisfactory substitute.

3) The data structure in which there is a pointer to a closure which contains a pointer to an info record which contains a pointer to the code would be very painful on any modern machine, Mill included. This implies that a closure should contain both the code and the arguments so that it can simply be entered with a single indirection, which is free if the closure pointer is already on the belt. However, there are issues with mixed code and data, because the Mill uses a Harvard architecture with separate i$ and d$. There are also issues with protection; the PLB has byte granularity, but you *really* don't want a PLB entry for every closure! As a naive alternative, the closure could contain the bindings and an info pointer and also a code pointer, eliminating one but not both indirections. I have to think about what the predictor consequences of that would be.

4) One possibility is to use the function pointer representation of lexically nested functions as the representation of a closure pointer. That is, a closure pointer is the pair {code pointer, binding pointer}, as is used e.g. for Pascal and other languages with lexical nesting. We expect this form to be used for lambdas in C++. The drawback is that function pointers become 128 bits, which is OK on members with quad but not so good on smaller members.

5) An alternative is to generate a thunk for each lambda or statically nested function, where the thunk loads a static pointer to the binding (or outer frame) and then jumps to the actual code. This approach is used by gcc for lambdas I think.

6) both of the above have issues with where to put the data when there's no GC; not relevant for Haskell and friends, but very relevant for C++.

7) The natural place for the Node register is inpReg. There are interesting issues when the closure is in a different protection domain ("turf") than the caller, which inpReg helps with (sorry, NYF). In particular, the caller must not be able to muck with the closure data before or during the call. This again suggests a thunk-like approach, but again the PLB can't tolerate one thunk per closure-instance and even one per bind-clause may have performance issues.

I'm probably showing my ignorance here :-( We will want to revisit this when the full machine is public.

imbecile

> The Mill design has generally been concerned with commercially important languages, and we haven’t put much thought into support for esoterica, which Haskell still is

Fully understandable. But I also suspect that the frame contexts of which there isn't too much revealed yet, provide everything you need.

> However, the Mill supports label variables and indirect branches, which should provide a satisfactory substitute.

I also expect the pick instruction to be very helpful there.

> The data structure in which there is a pointer to a closure which contains a pointer to an info record which contains a pointer to the code would be very painful on any modern machine

Yes, a lot has changed there. Most common uses of the info structure have been optimized away, in particular via tagging the pointers. And that is something the Mill supports better than any current hardware.

> That is, a closure pointer is the pair

You most likely wouldn't even need two full 64bit pointers, just two smaller offsets and use the vector instructions to split them. At least in languages with a runtime.

And I'm no expert by far either, so it should be given a rest here.

imbecile

The TLB issues didn't quite sit right with me, which usually is a sign I'm wrong or misunderstood something.

What I obviously was wrong about, and which became clear upon rewatching this memory talk, is that the caches are all virtual addresses, and the overhead of going through the TLB is only done when that overhead is hidden by real DRAM access on a cache miss.

That means you could still have per process address spaces and fix all the bases for direct memory access offsets and still get almost all address indirections for free through the virtual memory in the image created by the specializer. And even better than I thought before, since it only happens on cache loads and writebacks, not on every memory access.

And there are more advantages:

You could get rid of most of those per frame special base registers and replace them with one address space selector/process ID register (or integrate this into the frame ID). You would still need to encode the needed base in the load instructions though as if there were registers, so on the encoding side there is no real difference.

Those hard coded bases/memory regions at the same time serve as the permissions, integrating them into the virtual addresses, making the separate protection tables redundant. Except that when those base selectors serve also as the permission, the encoding for stores and loads becomes smaller. For that to work you would have to make some cache invalidation on context switches though or have the processIDs as part of the key for cache lookup in some way.

Having the permissions in the virtual address itself also enables the specialization of caches and cache behavior. You know certain data is constant or write only, because of its virtual address, allowing you to adjust cache strategies or even to have separate (and smaller and faster) caches for those.

Having fixed virtual bases per process could make NUMA memory architectures (which very likely will only become more wide spread in 64 computing) visible to the processes and compilers themselves, adjusting their algorithms and load strategies and memory layouts and even process and thread scheduling accordingly.

Very big gains from fixed offset bases are to be had in code, too. In this area this would mean a pretty big change in the architecture though. On the other hand, in a way it is bringing some of the core ideas of the Mill to their full potential:

Having several fixed code pointer bases enables you to have even more instruction streams trivially. In the encoding white paper more than two instruction streams are said to be possible via interleaving cache lines, which has some problems. Well, a fixed base address for each instruction stream doesn't have those problems and it gives you all those advantages of more streams mentioned like more smaller and faster dedicated instruction caches, smaller instruction encoding, specialized decoders an order of magnitude more efficient with each addtional one etc. Although you probably will get diminishing returns when you go beyond 4 streams (likely subdivisions of functionality would be control flow, compare/data route, load/store, arithmetic).

There are some caveats with per process address spaces with fixed memory regions. The first process that sets up the virtual memory must be treated specially with a carefully laid out physical memory image. But considering how carefully laid out and crufted the lower memory regions on a pc are with IO addresses, bios code and tables etc. that isn't anything that isn't done routinely.
Also sharing addresses between processes is harder and needs to be arbitrated in some way. And again this is something that is done in most current virtual memory systems. And it is a relatively rare operation, especially in comparison to normal intra-process cross-module function calls that would be much cheaper now.
Another issue could be that actual executable code size is limited per process. But that is just a question of configuring the chip. Even in the biggest applications today the combined text segments of all modules don't go beyond a few hundred MB. Actually I have never seen one one where actual code size goes beyond a few dozen MB. In essence a lower per process code segment of a few GB is far less limiting than the actual maximum installed memory usable by all processes. Especially in an architecture that is geared towards many cores, few context switches and compact code size.
It could also be beneficial to encode the streams that are actually used after a jump into the control instructions. That spares unncesessary memory accesses. Although in practice at least the control flow stream and the compare/data route stream will always be used.

This post is going quite off topic I guess.
I also admit that I have another rather self centered and whimsical reason for wanting an own full address space for each process to use: I'm playing around with some language design, and letting each process design their own memory layout allows it to make a lot of safe assumptions about addresses, which in turn enable or make cheaper/free a lot of language features I'm interested in.

PeterH

If I'm figuring this right, specReg based addresses can be made fork() safe. Data addresses getting their base from the belt, on the other hand, are not so easily rebased. I'm seeing heap memory using the latter in many cases.

Will_Edwards

Can an EBB see the stack of its caller?

Can you explain more how inpReg is used?

And, finally, cheap thread-local-storage? Excellent news! Although I imagine many languages may repurpose it, if you let them use it as just another data segment.

ivan

Sorry, the whole process model is very different, but it's all NYF. Talk#11 maybe?

But yes, cheap TLS :-) And micro-threads.