Memory

Will_Edwards

Yes, it means exactly that. This part of the address space is still protected by the normal Mill memory protection mechanism so an operating system can ensure that nobody unwittingly has an actual alias though.

ralphbecket

In this article, under the section on Memory, Dan Luu expresses in passing some skepticism regarding the Mill's memory model. I'm not savvy enough to follow his reasoning and was hoping someone in the know might be able to comment?

LarryP

The linked article mentions the Mill CPU architecture just once, in a sentence so vague that I cannot honestly tell what the author believes to be true about the Mill:

BTW, this is a major reason I’m skeptical of the Mill architecture. Putting aside arguments about whether or not they’ll live up to their performance claims and that every chip startup I can think of failed to hit their power/performance targets, being technically excellent isn’t, in and of itself, a business model.

The first word of the above quote is such a vague reference (the previous paragraphs were about memory barriers and what is/isn't guaranteed on multiprocessor systems), that I cannot tell what the author is trying to say about the Mill. Has anyone else read the linked article, and better understood what the author was trying to convey?

Will_Edwards

I missed the Mill reference when I read the article the other day.

The paragraph before the one you quote says:

This is long enough without my talking about other architectures so I won’t go into detail, but if you’re wondering why anyone would create a spec that allows that kind of crazy behavior, consider that before rising fab costs crushed DEC, their chips were so fast that they could run industry standard x86 benchmarks of real workloads in emulation faster than x86 chips could run the same benchmarks natively.

He then says, as you quoted:

BTW, this is a major reason I’m skeptical of the Mill architecture. Putting aside arguments about whether or not they’ll live up to their performance claims and that every chip startup I can think of failed to hit their power/performance targets, being technically excellent isn’t, in and of itself, a business model.

So I think he's saying that being technically excellent isn't going to sell chips. He says it didn't sell Alpha chips?

Like you, I am a little unsure of my interpretation :)

Will_Edwards

It's possibly ambiguous, but I don't think he was being skeptical of the Mill memory model. I think he was being skeptical of there being any business for non-x86 chips even if they are better? That was my reading of that part, anyway.

When I read the article the other day (proggit discussion), I cherry picked some of the technical problems he raised and summarised what the Mill was doing in each area:

TLBs and TLB misses: translation of addresses is after the cache hierarchy; the cache is in virtual addresses and the memory is shared so there's no translation needed in IPC and context switches.

locks: we're 100% hardware transactional memory (HTM) on which you can build conventional locks; but you can jump ahead and write your code to use HTM directly for a big performance gain

Syscalls and context-switches: there aren't any context-switches; we're Single Address Space (SAS). Syscalls are much much faster, and you aren't restricted to a ring architecture (hello efficient microkernels and sandboxing!)

SIMD: the Mill is MIMD. Much more of your code is auto-vectorizable and auto-pipelinable on the Mill, and your compiler will be able to find this parallelism

Branches: we can do much more load hoisting and speculation. We're not Out-of-Order (OoO), so it's swings and roundabouts and we often come out ahead

Feel free to follow up with any technical questions related to the article :)

ivan

I read it to say that he recognized technical excellence but had doubts about business viability anyway. Unclear though.

PeterH

Regarding context switches, how much of the permissions buffer is cached? Given how it operates in parallel with L1 cache, I'm thinking the permissions cache would be about the same size. What I'm unclear on is how much of the total that represents and how often a portal call will need to load new permission table elements. Then again, this could be handled along with the call prefetch.

Regarding SIMD, I recently read a report on some benchmarking with a test case resembling

int a,b,c,d;
...
loop_many_times
{
a++; b++; d++;
}

that ran slower than a test case applying the increment to all 4 vars. The case incrementing all 4 could use x86 family SIMD. Applied to the Mill, I can see this case being implemented by loading all of a through d and then applying a null mask on the var not being altered.
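A minimal sketch of that idea in plain C, treating the four ints as one vector and adding 0 in the lane that should stay unchanged. The name `masked_increment` and the mask layout are assumptions for illustration; the lane-wise loop only stands in for a single vector add and says nothing about how a Mill or an x86 compiler would actually emit it.

```c
#include <assert.h>

enum { LANES = 4 };

/* Hypothetical helper: add 1 to every lane whose mask entry is 1.
   A lane with mask 0 receives +0 and is therefore left unchanged,
   so all four lanes can go through the same add. */
static void masked_increment(int v[LANES], const int mask[LANES])
{
    for (int i = 0; i < LANES; i++) {
        assert(mask[i] == 0 || mask[i] == 1);
        v[i] += mask[i];
    }
}
```

With `v = {a, b, c, d}` and `mask = {1, 1, 0, 1}`, this reproduces `a++; b++; d++;` while leaving `c` alone.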

Will_Edwards

Regarding PLB size:

Consider the size of a high-end conventional L1 TLB; it might contain 64 4K page entries, 32 2MB page entries and 4 1GB page entries.

The conventional L1 TLB has to do the address translation before the load from L1 cache itself; the translation and lookup are serial.

This is why the L1 TLB is forced to be small to be fast and hasn't been growing in recent high-end OoO superscalar microarchitectures. They have actually been adding L2 TLBs and so on because of this problem.

A recent article on conventional CPUs actually counts TLB evictions for various real syscalls:

Some of these syscalls cause 40+ TLB evictions! For a chip with a 64-entry d-TLB, that nearly wipes out the TLB. The cache evictions aren’t free, either.

Now consider the situation for the Mill PLB: the entries are arbitrary ranges (rather than some page count), and it has as many cycles as the actual L1 lookup to do its protection check... it can be large and slow as its work is in parallel to the lookup.

Now this really emphasises the real and practical advantages of a virtual cache and Single Address Space architecture :)
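To illustrate what "entries are arbitrary ranges" means in practice, here is a small sketch of a range-based permission check in C. Everything in it (the `range_entry` layout, the `RIGHT_*` bits, the linear search) is a hypothetical stand-in, not the actual PLB format or lookup hardware; the point is only that one entry can cover three bytes or many megabytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rights bits and entry layout, for illustration only. */
enum { RIGHT_READ = 1, RIGHT_WRITE = 2, RIGHT_EXEC = 4 };

typedef struct {
    uint64_t lo;      /* first byte covered */
    uint64_t hi;      /* one past the last byte covered */
    unsigned rights;  /* bitwise OR of RIGHT_* */
} range_entry;

/* Returns true if some entry covers all of [addr, addr + len) and
   grants every right in `want`. The linear search is a software
   stand-in; real hardware would not work this way. */
static bool range_check(const range_entry *tab, size_t n,
                        uint64_t addr, uint64_t len, unsigned want)
{
    assert(len > 0);
    for (size_t i = 0; i < n; i++) {
        const range_entry *e = &tab[i];
        if (addr >= e->lo && addr < e->hi && len <= e->hi - addr &&
            (e->rights & want) == want)
            return true;
    }
    return false;
}
```

An access that straddles the end of an entry is rejected, which is the byte-granular behaviour a page-based table cannot express directly.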

On the second question about SIMD: exactly! :)

Excess slots in a vector can be filled with None, 0, 1, 1.0 or whatever value nullifies those elements for the operations to be performed.
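A short C sketch of the "fill with a nullifying value" idea, using ordinary identity elements only (0 for addition, 1 for multiplication). The names `pad`, `sum_all` and `product_all` and the width of 4 are assumptions for illustration; the Mill's None metadata is not modelled here.

```c
#include <assert.h>

enum { WIDTH = 4 };

/* Copy n live elements into a WIDTH-wide vector and fill the
   excess slots with `fill`, the identity of the planned operation. */
static void pad(const int *src, int n, int fill, int out[WIDTH])
{
    assert(n >= 0 && n <= WIDTH);
    for (int i = 0; i < WIDTH; i++)
        out[i] = (i < n) ? src[i] : fill;
}

/* Full-width reductions: they always touch all WIDTH slots. */
static int sum_all(const int v[WIDTH])
{
    int s = 0;
    for (int i = 0; i < WIDTH; i++)
        s += v[i];
    return s;
}

static int product_all(const int v[WIDTH])
{
    int p = 1;
    for (int i = 0; i < WIDTH; i++)
        p *= v[i];
    return p;
}
```

Padding three live values with 0 before `sum_all`, or with 1 before `product_all`, gives the same result as operating on the three values alone.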

ivan

Let me add to Will's response: the great majority of accesses don't go to the PLB in the first place, but are checked by a Well Known Region. WKRs have their own hardware access control; there is one for stack, one for the code, and one for all the global space, including .data, .bss, and the heap up to the point where you overflow the huge allocation you get at startup.

As a practical matter, PLB is used for regions that were mmap(MAP_SHARED), for stack segments other than the top segment, and for portal vectors. Important, yes, especially the MAP_SHARED when it is used for direct access to file buffers to avoid copy overhead (a Mill feature), but not a large fraction of all the references a program makes.
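The division of labour described above can be sketched as a two-step check in C: a few region registers are consulted first, and the table is touched only when none of them matches. The `protection` struct, the field names and the lookup counter are hypothetical, rights bits are left out for brevity, and real hardware may well check both in parallel rather than in sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } region;   /* covers [lo, hi) */

typedef struct {
    region wkr[3];         /* stack, code, globals (illustrative) */
    const region *plb;     /* fallback table for everything else */
    size_t plb_len;
    unsigned plb_lookups;  /* how often the fallback was needed */
} protection;

static bool in_region(region r, uint64_t addr)
{
    return addr >= r.lo && addr < r.hi;
}

static bool access_ok(protection *p, uint64_t addr)
{
    assert(p != NULL);
    for (size_t i = 0; i < 3; i++)
        if (in_region(p->wkr[i], addr))
            return true;               /* common case: table untouched */
    p->plb_lookups++;
    for (size_t i = 0; i < p->plb_len; i++)
        if (in_region(p->plb[i], addr))
            return true;
    return false;
}
```

Counting `plb_lookups` over a trace of addresses would show how small a share of references falls through to the table in this model.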

chrispitude

I'm a digital logic designer but not a processor designer. Possibly naïve questions follow.

1. At 44:35 in the video: when the retire stations monitor stores and see a match, why do they rerequest the load? Can't they grab that outbound data instead of requesting what just flew by?

2. At 1:13:00: more of a general cache eviction question. Instead of waiting until an eviction is forced, can some *small* number of LRU lines be proactively pushed downward BUT still kept in cache, such that if a cache line is needed, one of those LRU lines would be instantaneously available? This would likely require an extra bit per line to indicate that it's mirrored at the next level away.

ivan

#1: they could, and perhaps some members might be implemented that way. However the store might be only partly overlapping the load, so the logic to do the grab might have to do a shift and merge, which is non-trivial hardware and there are a lot of retire stations. The L1 already has the shift-and-merge logic (it must, to put stored data into the right place in the cache line), and aliasing interference is rare, so it is cheaper to let the store go to the L1 and reissue the load from the station.

Note that the first try for the load will have caused the loaded line to be hoisted in the cache hierarchy, so the retry will find the data (both the newly stored part and any that had not been overwritten) higher - and faster to get - than it did on the first try.
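To make the partial-overlap case concrete, here is a small C sketch that classifies how a store's byte range relates to an in-flight load's byte range. The enum and function names are hypothetical and this is not a description of the retire-station logic; it only shows why a simple "grab the store data" path would be insufficient whenever the result is `PARTIAL_OVERLAP`.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { NO_OVERLAP, PARTIAL_OVERLAP, FULL_COVER } overlap_kind;

/* Classify a store to [st, st + st_len) against a load of
   [ld, ld + ld_len).
   FULL_COVER:      every loaded byte comes from the store.
   PARTIAL_OVERLAP: some loaded bytes come from the store and some
                    from older data, so a merge would be needed. */
static overlap_kind classify(uint64_t st, uint64_t st_len,
                             uint64_t ld, uint64_t ld_len)
{
    assert(st_len > 0 && ld_len > 0);
    uint64_t st_end = st + st_len;
    uint64_t ld_end = ld + ld_len;
    if (st_end <= ld || ld_end <= st)
        return NO_OVERLAP;
    if (st <= ld && st_end >= ld_end)
        return FULL_COVER;
    return PARTIAL_OVERLAP;
}
```

Reissuing the load covers all three cases with one mechanism, since the cache already knows how to place stored bytes within a line.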

#2: Cache policies are full of knobs, levers and buttons in all architectures, and Mill is no exception. It is quite likely that both the complement of policy controls and their settings will vary among Mill family members. What you suggest is one possible such policy. The current very-alpha-grade policies in the sim try to preemptively push updates down, except if the line has never been read; this distinction is an attempt to avoid pushing partial lines that are output-only, to avoid the power cost of the repeated partial pushes. This and other policies are certain to change as we get more code through and start to move off sim so we get real power numbers. Fortunately, none of this is visible in the machine model so apps can ignore it all.

davidm

A couple of questions--sorry if they're basic:

1. You mention that when the return operation cuts back a stack that it clears the valid bits on the stack frame's cache lines. Does the clearing of the valid bits have to cascade to all levels of cache?

2. Unless I'm mistaken, the TLB is a cache of PTEs and might not contain all the PTEs in the system (i.e. it's a cache over operating system tables, right?). You mention in the talk that when a load miss gets to the TLB and also misses in the TLB, the TLB directly returns a zero, without having to go to main memory. Wouldn't the TLB have to go to main memory for PTEs, even if it doesn't have to go to main memory for the actual value to be returned, at least some of the time? Are you using a data structure that makes this unlikely (i.e. you can answer "not found" queries without having access to the whole set of PTEs in the TLB) or is it just the fact that you have a large TLB and the "well known region" registers cover a lot of what would otherwise be PTEs and that makes it likely that all PTEs are in the TLB?

Thanks for the answers. I'm a software guy and not a hardware guy, so I'm sorry if the questions betray a lack of understanding.

ivan

Sound questions.

1) Frame exit invalidation is done in the Well Known Region for the stack. It is not necessary to clear the valid bits, because the exited frame is no longer in the user's address space; return automatically cuts back the stack permission WKR. However, there is an issue with inter-core references to foreign stacks. Core B has no WKR for core A's stack, so in normal mode threads cannot reference each other's stacks and threads cannot reference their own exited frames. However, there is a legacy mode in which inter-stack references are permitted. In legacy mode the entire stack, both used and unused, has a PLB entry for the turf. So if there are two legacy threads running in the same turf then they can browse each other's stacks, and can browse the rubble in their own stack. We have never found a way to restrict inter-stack references to only the live portion; all practical implementations seem to require exposing rubble too.

2) The implementation of the TLB is per-member; it is transparent to the programmer. As currently in the sim, a miss in the TLB requires a search of the PTE tables. However, unallocated regions of the 60-bit address space do not need PTEs and do not occupy physical memory. If the search leads to an address where a PTE might be but is not, then the Implicit Zero logic will return a synthetic line of zeroes. The PTE format is carefully arranged so that a zero PTE means that the region is unallocated in DRAM - and the load in turn returns implicit zero. Thus the page table is in virtual memory, not physical, and there are only entries for mapped space.

Thus the miss will trigger a search, which may miss, which triggers a search... The Tiny Alice algorithm we use bounds the recursion to the number of distinct range sizes (currently 5), but in practice the intermediate entries are in the LLC and search depth is nearly always <= 2.
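The "zero PTE means unallocated, and the load returns zero" rule can be shown with a toy model in C. This is a flat, single-level table with made-up sizes and names (`toy_table`, `toy_load`); it does not attempt the multi-level search or the Tiny Alice algorithm mentioned above, and a null pointer stands in for an all-zero PTE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SIZE = 16, NUM_PAGES = 8 };  /* toy sizes, not Mill values */

/* Hypothetical table: a null entry plays the role of a zero PTE,
   meaning the page has no backing storage at all. */
typedef struct {
    const uint8_t *pte[NUM_PAGES];
} toy_table;

/* Load one byte. An unallocated page reads as zero without any
   backing memory being consulted or allocated. */
static uint8_t toy_load(const toy_table *t, size_t addr)
{
    assert(addr < (size_t)PAGE_SIZE * NUM_PAGES);
    const uint8_t *page = t->pte[addr / PAGE_SIZE];
    if (page == NULL)
        return 0;                     /* implicit zero */
    return page[addr % PAGE_SIZE];
}
```

Only mapped pages cost storage in this model; a freshly zero-initialised table describes an address space that reads as all zeroes.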

infogulch

I have a question about hoisted deferred loads and function boundaries.

Consider a relatively large function that requires multiple loads. I would expect the start of the function to be a flurry of address calculations and deferred loads, as many as possible before getting on with the rest of its functionality, in order to hide cache/DRAM latency as much as possible. I might even call it a 'deferred load preamble', not officially, but I could see it being a common enough pattern to recognize it.

So my first question: Does this scenario sound reasonable? Would you expect it to be that common?

Now let's extend it. Break up the function into three smaller functions. Let's assume it's very simple and you can just group instructions together into their own functions, with outputs flowing to inputs etc. So instead of one big section at the beginning where all the loads are issued, each smaller function has its own 'deferred load preamble'. This would mean that e.g. the last of the three was not able to defer its loads as far and may suffer more from memory latency issues.

Does this also sound reasonable? Is it just the compiler's (|| specializer's) responsibility to inline functions and hoist loads as much as possible, or does Mill hardware offer any mitigation to this issue? It's not OoO, so I wouldn't really expect it to "peek ahead" to see those loads, but then again the Mill's durability to speculation would really help such an implementation.

Thoughts?

ivan

Short functions with loads tend to have extra no-ops or stalls for the reason you give. A sufficiently smart compiler could issue prefetches in the caller, just as can be done for any architecture. But the general Mill approach is inlining.

Most ISAs inline to avoid call overhead, but calls are cheap on a Mill. We inline to get more for the machine width to work on. On a wider Mill you can inline surprisingly large functions with zero latency cost in the expanded caller, because the called ops fit in the schedule holes of the caller. Of course there is usually some increase in the code space used, but removing the call and return means less thrashing in the cache lines, so net it's a win unless profiling shows the call is never taken.

We're still tuning the inlining heuristics in the tool chain. As a rule of thumb, any function shorter than a cache line is worth inlining on icache grounds alone.
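As a source-level illustration of the trade-off discussed in this exchange, the sketch below contrasts three small helpers, each loading its operand just before use, with a hand-inlined version that issues all three loads up front. The function names are hypothetical, and what a real compiler or the Mill specializer would schedule from either form is not shown; the two versions are only meant to make the "one preamble versus three" shape visible.

```c
#include <assert.h>

/* Split version: each helper performs its own load right before use,
   so the later loads cannot start until the earlier helpers return. */
static int step1(const int *p)        { return *p + 1; }
static int step2(const int *q, int x) { return x * *q; }
static int step3(const int *r, int y) { return y - *r; }

static int split_version(const int *p, const int *q, const int *r)
{
    assert(p && q && r);
    return step3(r, step2(q, step1(p)));
}

/* Hand-inlined version: all three loads are issued first, giving
   the last one the most time to complete before it is needed. */
static int inlined_version(const int *p, const int *q, const int *r)
{
    assert(p && q && r);
    int a = *p, b = *q, c = *r;   /* hoisted loads */
    return (a + 1) * b - c;
}
```

Both functions compute the same value; only the position of the loads relative to the arithmetic differs.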

NXTangl

SAS means there are no homonyms on the Mill. Fair enough. However, what about address synonyms? Memory with copy-on-write semantics is obviously not an issue, so this has no effect on fork(), but I seem to recall someone mentioning that all of physical RAM is mapped at startup, and there are definitely a few neat tricks involving doubly-mapped pages (such as fast circular buffers and heap compaction without pointer changes).
