The TLB issues didn’t quite sit right with me, which is usually a sign that I’m wrong or have misunderstood something.
What I was obviously wrong about, as became clear upon rewatching the memory talk, is that the caches are all virtually addressed, so the overhead of going through the TLB is only incurred on a cache miss, where it is hidden by the actual DRAM access.
That means you could still have per-process address spaces, fix all the bases for direct memory access offsets, and get almost all address indirections for free through the virtual memory layout in the image created by the specializer. It’s even better than I thought before, since translation only happens on cache line fills and writebacks, not on every memory access.
And there are more advantages:
You could get rid of most of those per-frame special base registers and replace them with a single address space selector/process ID register (or integrate this into the frame ID). You would still need to encode the required base in the load instructions, as if the registers were still there, so on the encoding side there is no real difference.
Those hard-coded bases/memory regions also serve as the permissions, integrating them into the virtual addresses and making the separate protection tables redundant. Better yet, when the base selectors double as the permissions, the encoding for loads and stores becomes smaller. For that to work, though, you would either have to invalidate the caches on context switches or make the process IDs part of the cache lookup key in some way.
Having the permissions in the virtual address itself also enables the specialization of caches and cache behavior. You know from its virtual address alone that certain data is constant or write-only, allowing you to adjust cache strategies or even to have separate (smaller and faster) caches for such regions.
Having fixed virtual bases per process could make NUMA memory architectures (which will very likely only become more widespread in 64-bit computing) visible to the processes and compilers themselves, letting them adjust their algorithms, load strategies, memory layouts, and even process and thread scheduling accordingly.
Very big gains from fixed offset bases are to be had in code, too, although in this area it would mean a pretty big change to the architecture. On the other hand, in a way it brings some of the core ideas of the Mill to their full potential:
Having several fixed code pointer bases makes it trivial to have even more instruction streams. The encoding white paper says more than two instruction streams are possible by interleaving cache lines, which has some problems. A fixed base address for each instruction stream doesn’t have those problems, and it gives you all the advantages of more streams mentioned there: smaller and faster dedicated instruction caches, smaller instruction encoding, specialized decoders that become an order of magnitude more efficient with each additional stream, etc. You will probably get diminishing returns beyond 4 streams, though (the likely functional subdivisions being control flow, compare/data routing, load/store, and arithmetic).
There are some caveats with per-process address spaces over fixed memory regions. The first process, which sets up the virtual memory, must be treated specially, with a carefully laid out physical memory image. But considering how carefully laid out (and crufted) the lower memory regions on a PC are, with I/O addresses, BIOS code, tables, etc., this is something that is already done routinely.
Sharing addresses between processes is also harder and needs to be arbitrated in some way. Again, this is something most current virtual memory systems already do, and it is a relatively rare operation, especially compared to normal intra-process cross-module function calls, which would now be much cheaper.
Another issue could be that actual executable code size is limited per process. But that is just a question of configuring the chip. Even in the biggest applications today, the combined text segments of all modules don’t go beyond a few hundred MB; I have never seen one where the actual code size goes beyond a few dozen MB. In essence, a per-process code segment limit of a few GB is far less limiting than the actual maximum installed memory usable by all processes, especially in an architecture geared towards many cores, few context switches, and compact code size.
It could also be beneficial to encode, in the control flow instructions, which streams are actually used after a jump. That spares unnecessary memory accesses, although in practice at least the control flow stream and the compare/data route stream will always be used.
This post is going quite off topic I guess.
I also admit that I have another rather self-centered and whimsical reason for wanting a full address space of its own for each process: I’m playing around with some language design, and letting each process define its own memory layout allows the language to make a lot of safe assumptions about addresses, which in turn enables, or makes cheap or free, a lot of language features I’m interested in.