Mill Computing, Inc. Forums The Mill Architecture Memory Reply To: Memory

Ivan Godard
Keymaster
Post count: 689

Well, some answers are not NYF 🙂 Though there wasn’t a clear deductive line that led to SAS; it evolved by fits and starts like everything else.

I’m somewhat mystified when you say “Even when those instructions themselves are very effecient, they require separate loads and the use of precious belt slots.” True label- and function-pointers do exist on a Mill, and they must be loaded and then branched or called through, just like for any other machine, but the code would be no different with private address spaces. The great majority of transfers are direct and don’t use pointers, and those have neither load nor belt positions. You can branch to a label or call a function in a single operation with no belt change. Color me confused.

And yes, the predictor machinery does do prefetching. The phasing mechanism used to embed calls in an EBB is NFY, but will be covered in the Execution talk 2/5 at Stanford.
We assumed from the beginning that any architecture needed 64-bit program spaces; a 32-bit wall is just too constraining. We never really considered supporting both 32- and 64-bit program spaces; apps are written for both on x86 solely because of the massive install base, which we don’t have. We were afraid that either 32- or 64-bit mode would become dominant by accident and the other would die due to network effects, and our double work would be wasted. So pointers had to be 64-bits, less a few we could steal if we wanted to (and it turned out we did).

Given a 64-bit address space, static linking is right out: 64-bit offsets would completely hose the encoding and icache. So what gets statically linked in such systems, and what could replace it? Two answers: code, and bss (static data). Turns out that nobody has 4GB worth of code, and all big data is held in dynamic memory (malloc, with mmap behind it), not in static, so global static is under 4GB too. Sure, you can have a program that statically declares a 100GB array – but look at the code the compiler gives you for that – you’ll see something else going on behind the scenes, if the compiler doesn’t err-out right away.

So both code and static only need 32-bit offsets, off some address base. That takes care of the encoding issues, but also obviates static linking – there’s no advantage to fixing the base, because the instructions are now position-independent because they carry offsets, not addresses. Sure, you need an address-adder, but you needed that anyway to support indexing, unless you are a RISC Puritan and are willing to do individual shift and add operations, and pay for your purity in icache and issue slots. The Mill has quite conventional address modes: base, index and offset, requiring a specialized three-input adder. No big deal.

So now we have position-independent code (PIC), and 32-bit code and static data spaces within 64-bit overall space. Are all those 64-bits going to be used by any application? Within my children’s lifetime? Only on massive multi-processor supercomputers with shared memory. However, it’s increasingly obvious that shared memory at the building scale is a bad idea, and message passing is the way to go when off-chip. At a guess, 48 bits of space is enough for any app that we want to support in the market. And how many apps will there be? Or rather, how many distinct protection environments (called a turf in Mill terminology) will there be. Or rather, how much address space in total will be used by all turfs concurrently (after all, many turfs are small) on one chip? Surely much less than 64 bits.

So a SAS is possible without running out of bits or needing more hardware. What are the advantages? The obvious one is getting the TLB out of 90+ % of memory access. TLBs are horribly expensive on a conventional. They have their own cache hierarchy to hide some of the miss costs (which still run 20% or more cycles), and to be fast are a huge part of the power budget. All that goes away with SAS. Yes, SAS still has protection in front of the cache, but that is vastly cheaper than a full-blown TLB (NYF; there will be a Protection talk). The virtual address translation does not need to happen anyway, and is very far from free 🙂

Then there are software advantages: with SAS, processes can cheaply share data at any granularity and in any location; they are not restricted to page-sized shared mmap regions that require expensive OS calls to set up. Getting the OS out of the act permits a micro-thread programming model that is essential for painless parallel programming. The OS is simpler too – it doesn’t have to track whose address space a pointer refers to.

Now all this is an intuitive argument; we certainly don’t have the experience with large programs and real OSs to measure the real costs and benefits. But we were persuaded. 🙂