Memory

ivan

Sorry, the protection model is NYF. Yes, we do fork(). Yes, it's a kludge. About as much a kludge as fork() itself is :-(

ivan

You are well on the way to inventing capabilities :-) https://en.wikipedia.org/wiki/Capability-based_addressing

I truly wish that the Mill were a cap(ability) architecture. One of the Mill team is Norm Hardy, who did the KeyKos cap operating system, of which Eros and others are descendants. The first compiler I ever did was for the Burroughs B6500, which was a cap machine before caps were invented. One of the things that Andy Glew and I are in passionate agreement about is that computing needs caps.

That said: I know how to build a cap machine, but I don't know how to sell one. Caps are fundamentally incompatible with C, because caps are secure and C is not. Cap machines like the AS-400 let you use C by giving you a sandbox that is one whole cap in which you can be as insecure as you want within your play-PDP11. Oh well.

And fair warning: language design is an insidious vice - you have a slippery slope ahead. Hear the words of someone whose name is in the Algol68 Revised Report :-)

imbecile

I read papers and documentation on KeyKos and Coyotos. Although that was a few years ago.
And separate address spaces offer most of the advantages of capabilities without being C-incompatible. The traditional problem for separate address spaces is expensive context switches. But on multicore 64-bit processors context switches can be vastly reduced, and the Mill goes in that direction anyway. And with the cache architecture of the Mill and below-cache TLBs, context switches can become a lot cheaper too, even with separate address spaces.
And as you said yourself, it's better to leave the OS out of as many things as possible and let the hardware take care of things, and capabilities must be OS constructs and cannot be hardware data types like virtual addresses can. Or am I wrong here?

And yes, language design is a terrible vice. Ever since I started programming I was unhappy and annoyed and frustrated with whatever I was using. And whenever you try to find or to think of ways to do things better, what you find and learn usually only reveals new annoyances that make you quickly forget about the old ones you have solved.

jabowery

Using real estate for more cores in preference to threading, resulting from the Mill's other architectural features, brings to mind a question about on-chip memory architecture that, while of no immediate consequence to the Mill chip, might affect future trade-offs in real estate use.

With 14nm and higher density technologies coming on line, there is a point where it makes sense to prefer on-chip shared memory to some other uses of real estate. This raises the problem of increasing latency to on-chip memory, not only with size of the memory but with the number of cores sharing it. In particular, it seems that with an increasing number of cores, a critical problem is reducing arbitration time among cores for shared, interleaved, on-chip memory banks. In this case, interleaving isn't to allow a given core to issue a large number of memory accesses in rapid succession despite high latency; it is to service a large number of cores -- all with low latency.

Toward that end I came up with a circuit that does the arbitration in analog which, if it works when scaled down to 14nm and GHz frequencies, might result in an architectural preference for an on-chip cross-bar switch between interleaved low-latency memory banks and a relatively large number of cores.

This spans disciplines -- a problem well known to folks involved with the Mill architecture which spans disciplines between software and computer architecture (rather than between computer architecture and analog electronics).

I'd appreciate any feedback from folks who understand the difficulty of cross-disciplinary technology and have a better grasp of the issues than I do.

akahlich

From following the link you give, I see the general outline of the idea. The challenge of implementing such a device will be to make it deterministic for larger numbers of cores and banks, which is where it would potentially have the greatest benefit. I did not see sufficient detail to make any determination regarding that aspect of the idea's feasibility.

That said, the problem being attacked is determining which core gets access to which bank. There are several sources of latency here: the arbitration mechanism latency, the "bank is busy" latency, routing latency and RAM access latency. This idea only directly addresses the arbitration mechanism latency. By allowing a larger number of banks, it appears to indirectly help the "bank is busy" or "bank access collision" latency. Unfortunately, a larger number of banks or cores will also increase routing latency. So in the end, routing latency may merely replace arbitration latency as being the performance limiting factor.

jabowery

Thank you for pointing out the challenge of routing latency. Rather than further digression from the Mill memory architecture here, can you suggest a proper forum for discussing this?

akahlich

Unfortunately, all of the places where I have seen online papers about multi-core interconnect have been on various university sites. I don't see any of them with a proper forum for discussion.

ivan

Have you tried the comp.arch newsgroup?

harrison partch

Is a cap machine compatible with Oberon?

ivan

Yes, as far as I know. However, there are lots of Oberons, and the checking details are messy and not well documented, so it's not certain.

Symmetry

So, the Mill has memory accesses that are the direct result of the execution of instructions, but also implicit loads and stores that happen as a result of function calls, as the belt and outstanding memory accesses overspill their buffers and are stored in the general memory hierarchy. I think this would be infeasible on an architecture where any load or store might generate a fault immediately, but on the Mill it'll just make its way through the memory hierarchy the same way anything else would, and most interesting things will happen at the TLB stage in the same manner and with the same level of decoupling as any other write or read.

How do you handle memory protection and synchronization with implicit operations? Just the same as with normal operations, possibly with a duplicated PLB and with the addresses broadcast to the retire stations?

ivan

There are essentially three layers in the spiller. At the top layer, data is stored in flipflops, effectively registers except not addressable by the program. These are buffers, used as parking places for values that are still live on the belt and are in the result register of a functional unit but the FU has another result coming out and needs its result register back. These live operands are first daisy-chained up the latency line within the containing FU pipeline, but eventually the pipeline runs out and the operand, if still live, gets moved to a spiller buffer. The cost of the move is the same as a register-to-register copy on a conventional machine. A running program that doesn't do a function call or return will live entirely in these registers, with no spiller traffic below that.

The second layer is a block of SRAM, connected to those spiller registers. When you do a call operation, we save the belt, which means the belt state in pipe and spiller registers is saved. The save is lazy, but gradually live-but-inactive operands are copied from the needed registers into the spiller SRAM. The SRAM is big enough to hold several belts (and scratchpads and other non-memory frame-related state), so you can nest several deep in calls with everything fitting in the spiller SRAM. Most programs spend most of their time within a frame working set of five or so, calling in a few, returning out a few, calling in a few, over and over. Such behavior fits entirely internal to the spiller.

However, if a program suddenly switches to a deep run of calls, a run of returns, or if there's a task switch, the state of all those nested calls (or the new thread) will exceed the spiller's SRAM and the spiller then uses the third level, which is the regular memory hierarchy. The spiller does not go direct to DRAM; it talks to the L2 cache instead, which provides still more buffering.
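To make the layering concrete, here is a minimal Python sketch of the behavior described above. It is only a toy model: the class name `SpillerModel`, the `run` helper, and the SRAM capacity of four frames are illustrative assumptions, not Mill parameters, and the real spiller saves lazily rather than one whole frame per call.

```python
from collections import deque


class SpillerModel:
    """Toy model: saved frames live in private SRAM and overflow to the hierarchy."""

    def __init__(self, sram_frames=4):
        self.sram_frames = sram_frames  # assumed capacity, not a Mill figure
        self.sram = deque()             # second layer: most recent saved frames
        self.hierarchy = []             # third layer: frames pushed out toward the L2
        self.hierarchy_traffic = 0      # frames moved to or from the third layer

    def call(self, caller_frame):
        """Save the caller's frame; push the oldest SRAM frame out if SRAM is full."""
        self.sram.append(caller_frame)
        if len(self.sram) > self.sram_frames:
            self.hierarchy.append(self.sram.popleft())
            self.hierarchy_traffic += 1

    def ret(self):
        """Restore the newest saved frame, from SRAM if possible."""
        if self.sram:
            return self.sram.pop()
        self.hierarchy_traffic += 1
        return self.hierarchy.pop()


def run(depths, sram_frames=4):
    """Replay a sequence of target call depths; return third-layer traffic."""
    spiller = SpillerModel(sram_frames)
    depth = 0
    for target in depths:
        while depth < target:
            spiller.call(("frame", depth))
            depth += 1
        while depth > target:
            spiller.ret()
            depth -= 1
    return spiller.hierarchy_traffic
```

In this model a program that oscillates within a few frames, such as `run([3, 1, 4, 2, 3, 0])`, never touches the third layer, while a sudden deep run such as `run([10, 0])` does.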

If you compare the spiller with the explicit state save used by a conventional legacy register machine, you will see that the spiller top level is akin to the register machine's rename and architected registers; if a function fits in the registers then there's no traffic, just as if it fits in the belt and scratchpad there's no traffic.

If there are nested calls on a conventional then the register state gets saved to the hierarchy using normal store operations. These go to the D$1 cache and are buffered there. This is akin to the spiller's SRAM, but the spiller has three advantages: it uses no program code to do the save, so the power, instruction entropy, and store buffer contention cost of the stores is avoided; it uses a private repository, so saves are not cluttering the D$1 (which is a very scarce resource); and it's not program visible, removing many possibilities of program bugs and exploits.

Lastly, if you have deeply nested calls on a conventional, the saved state exceeds the capacity of the D$1 and will overflow into the D$2 and eventually DRAM. The spiller does the same when it runs out of private SRAM. Put all this together, and you see that the spiller is in effect a bunch of registers hooked to its own cache, and the overall benefit is to shift state save/restore out of the top level D$ cache and into the spiller SRAM which is in effect a private spiller cache, freeing up space for real data.

One last point: the total Mill state traffic is less than that on a conventional. A conventional callee-save protocol winds up saving registers that are in fact dead, but the callee doesn't know that and saves them anyway. And the existence of the scratchpad on a Mill means that many function locals that would be kept in memory are in the scratchpad and so do not contribute to cache load and memory bandwidth. Combine these effects, and our sims suggest that the Mill save/restore and locals traffic is about a factor of two less than that of a conventional. This saves not only bandwidth but also power.

We do not have large-scale sims yet so the overall results are guesstimates, but it does appear that actual DRAM traffic on a Mill will be overwhelmingly composed of I/O and very large external data sets, which have the same traffic load as on a conventional; save/restore, locals, and the working set of globals will never see DRAM. That's why the Mill has the Virtual Zero that lets it use "memory" that has no backing DRAM.

All this needs pictures, and we will get to the spiller in the talks eventually.

ivan

p.s. Forgot to respond to your protection question:

The spiller has its own region of the address space. Spiller space is used for all processes, and the saved state of each process is not in that process's address space and so cannot be directly addressed by the program. In effect you can think of the spiller as having its own private PLB, although it need not be (and so far is not) implemented that way.

Utilities like debuggers and stack unwinders get access to saved state through an API that runs with PLB rights to spiller space; in effect the API is another process. The API is restricted; you cannot arbitrarily change the downstack links for example. As a result, the usual stack-smash exploit is impossible on the Mill.

Because the application cannot address spiller space, there is no synchronization needed between app use of memory and spiller use of memory; they are necessarily disjoint.

LarryP

I'm curious:
How does/will the Mill implement the semantics of C's volatile qualifier?
From what I've gathered from watching the talks, the Mill always loads data from the cache hierarchy, not directly from main memory. However, the data in volatile variables (for example memory-mapped peripheral registers), can change independently of CPU activity, so cached values of volatile variables may well not be correct.

Is there a mechanism, such as designating certain memory regions as non-cacheable, to ensure that the Mill implements the semantics of volatile variables? A forum search turned up a single forum entry that contains the word volatile. In Mr. Godard's reply #733 in topic "Side channel timing attacks" he writes:

One of those flags is “volatileData”, which forces the request to bypass private caches and go to the top shared cache.

However, going to the top shared cache (instead of to main memory itself) doesn't appear to implement the semantics of C-language volatile variables. Since the Mill is targeted at existing code bases, which include C and C++, I assume that there is (or will be) a way to preserve the semantics of volatile. If it can be revealed at this point, I'd very much like a confirmation that the Mill will preserve the semantics of the volatile qualifier -- and, if possible, how the Mill handles volatile.

Thanks in advance for any reply.

ivan

Being hardware, the Mill is not language-specific, but must support all common languages, common assembler idioms, and our best guess as to where those will evolve.
The best that we, or any design team, can do is to try to isolate primitive notions from the mass of semantic bundles and provide clean support for those primitives, letting each language compose its own semantic atoms from the primitives we supply.

The "volatile" notion from C is used for two purposes, with contradictory hardware requirements for any machine other than the native PDP-11 that is C. One purpose is to access active memory such as MMIO registers, for which the idempotency of usual memory access is violated and cache must not be used, nor can accesses be elided because of value reuse.

The other common purpose is to ensure that the actions of an asynchronous mutator (such as another core) are visible to the program. Here too it is important that accesses are not elided, but there is no need for an access to go to physical memory; they must only go to the level at which potentially concurrent accesses are visible. Where that level is located depends on whether there are any asynchronous mutators (there may be only one core, so checking for changes would be pointless) and the cache coherency structure (if any) among the mutators.

Currently the Mill distinguishes these two cases using different behavior flags in the load and store operations. One of those is the volatileData flag mentioned in the previous post. This flag ensures that the access is to the top coherent level, and bypasses any caching above that level. On a monocore all caches are coherent by definition, so on a monocore this flag is a no-op. It is also safe for the compiler to optimize out redundant access, because the data cannot change underneath the single core.

On a coherent multicore the flag is also a no-op to the hardware but is not a no-op to the compiler: the compiler must still encode all access operations, and must not elide any under the assumption that the value has not changed, because it might have.

On an incoherent multicore (common in embedded work), a volatileData access must proceed (without caching) to the highest shared level, which may be DRAM or may be a cache that is shared by all cores. Again, the compiler must not elide accesses.

For the other usage of C volatile keyword, namely MMIO, the access is to an active object and must reach whatever level is the residence of such objects. On the Mill that level is the aperture level, below all caches and immediately above the actual memory devices, and the flag noCache ensures that an access will reach that level. Again, the compiler must not elide noCache accesses.
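The cases above can be summarized as a small decision table. The following Python sketch is only a reading aid under stated assumptions: the flag names `volatileData` and `noCache` come from the post, while the `Level` enum, the function names, and the `cores`/`coherent` parameters are hypothetical and do not correspond to any real Mill interface.

```python
from enum import Enum


class Level(Enum):
    NORMAL_CACHED = "ordinary cached access"
    TOP_SHARED = "highest shared level, bypassing caches above it"
    APERTURE = "aperture level, below all caches"


def access_level(flag, cores=1, coherent=True):
    """Level an access must reach; flag is None, "volatileData", or "noCache"."""
    if flag == "noCache":
        return Level.APERTURE  # MMIO: must reach the active object
    if flag == "volatileData" and cores > 1 and not coherent:
        return Level.TOP_SHARED  # incoherent multicore
    return Level.NORMAL_CACHED  # monocore or coherent multicore: hardware no-op


def compiler_may_elide(flag, cores=1):
    """Whether the compiler may optimize out a redundant access."""
    if flag == "noCache":
        return False
    if flag == "volatileData":
        return cores == 1  # only a monocore guarantees the data cannot change
    return True
```

For example, `access_level("volatileData", cores=4, coherent=True)` is an ordinary cached access as far as the hardware is concerned, yet `compiler_may_elide("volatileData", cores=4)` is still false.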

Besides the noCache flag, the Mill also maps all of the physical address space to a fixed position (address zero) within the (much larger) shared virtual space. An access within that address range is still cached (unless it is noCache) but bypasses the TLB and its virtual address translation mechanism. NoCache addresses outside the physAddr region (and volatileData accesses if they get past the caches) are translated normally.

There are other access control flags, but that's enough for now :-)

orenbenkiki

Is there any plan for the Mill to support some kind of transactional memory, and in general how does it address atomic read-modify-write operations? Perhaps this will be addressed in the future multi-core talk?

ivan

There are no RMW ops. The Mill uses optimistic concurrency using the top level cache for the write set, much like the IBM and Intel optimistic facility. It’s been mentioned somewhere, but I don’t remember if it was in the talks or here, or maybe comp.arch.

orenbenkiki

Using the L1 as the write set makes sense as the normal coherency protocol informs you of conflicts. But, how is this exposed to the SW? AFAIR, OCC needs Begin/Modify/Validate/Rollback operations. Modify is easy – just normal stores. But there seems to be no way around some form of HW support for Validate, which means some sort of explicit Begin. It also seems very hard for SW to safely do a Rollback, preventing another thread from seeing the middle (muddled) state of a transaction. So does the Mill have something similar to Intel’s RTM (explicit begin/end transaction instructions), handling the Begin/Validate/Rollback operations?

ivan

There are begin, abort, and commit ops. The only unusual parts of the Mill implementation are that you can have both transactional and non-transactional memory references while in the sequence, and how we handle failure recovery (which is NYF).
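For readers unfamiliar with the begin/abort/commit pattern, here is a generic software sketch of optimistic concurrency in Python. It is not the Mill mechanism, whose failure recovery is NYF and which uses the cache as the write set; the class names `TxMemory` and `Transaction` and the per-address version counters are assumptions chosen purely for illustration. The model is single-threaded, so the atomicity of commit is taken for granted.

```python
class TxMemory:
    """Toy memory with a version counter per address for validation."""

    def __init__(self):
        self.data = {}
        self.version = {}

    def plain_store(self, addr, value):
        """Non-transactional store: visible immediately."""
        self.data[addr] = value
        self.version[addr] = self.version.get(addr, 0) + 1

    def plain_load(self, addr):
        """Non-transactional load; unwritten addresses read as 0."""
        return self.data.get(addr, 0)


class Transaction:
    """Constructing a Transaction plays the role of the begin op."""

    def __init__(self, mem):
        self.mem = mem
        self.read_versions = {}  # address -> version seen at first read
        self.write_set = {}      # buffered stores, invisible until commit

    def load(self, addr):
        if addr in self.write_set:
            return self.write_set[addr]
        self.read_versions.setdefault(addr, self.mem.version.get(addr, 0))
        return self.mem.plain_load(addr)

    def store(self, addr, value):
        self.write_set[addr] = value

    def abort(self):
        """Discard all buffered state; memory is left untouched."""
        self.write_set.clear()
        self.read_versions.clear()

    def commit(self):
        """Validate the read set, then publish the write set. Returns success."""
        for addr, seen in self.read_versions.items():
            if self.mem.version.get(addr, 0) != seen:
                self.abort()
                return False
        for addr, value in self.write_set.items():
            self.mem.plain_store(addr, value)
        return True
```

A caller would typically retry on failure: begin a new `Transaction`, redo the loads and stores, and call `commit()` again until it returns `True`. Calling `plain_store` on the same `TxMemory` while a transaction is open loosely mirrors mixing non-transactional references into the sequence.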

Roger

Does "the Mill also maps all of the physical address space to a fixed position (address zero) within the (much larger) shared virtual space" not mean you have a virtual alias for the whole physical address space?
