Forum Replies Created
Just a quick question about multiple inputs: Are the argument lists of the call operation implemented in terms of the args operation used in ganging, or are they completely different?
My guess would be different, because I suspect the call operation is on the branch/pointer side of the instruction streams and ganging on the computation side. If both sides had both it would be kind of redundant.
- in reply to: Many core mill (GPU) #361
Well, Intel tried Larrabee. I would expect the Mill architecture to be much more suited for something like that than x86.
AMD is trying hUMA too. In my ignorant layperson opinion, once the memory loads and access patterns can be served by one shared memory, it shouldn’t be too much harder to plug two different sets of Mill cores into it: one set for application code, one set for float and graphics code.
Thank you very much for that detailed information.
Those are indeed quite a few base pointer special registers. I would be a bit sad to see the cpReg and cppReg go, though. I can think of at least one use case where they would allow for an elegant implementation of Haskell thunks/closures: the closure memory layout/parameter description metadata and the code thunk should be accessible via the same offset/pointer, but must of course reside in regions with different access permissions and thus need different bases.
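To make that concrete, here is a minimal C sketch of what I mean, with the two bases modeled as plain arrays. The names, the ThunkInfo layout, and the regions are all made up for illustration; this is not GHC’s actual closure ABI:

```c
#include <stdint.h>

/* Two protection regions, modeled as plain arrays for illustration:   */
static const uint8_t code_region[4096];  /* would be execute-only (cpReg)  */
static const uint8_t const_region[4096]; /* would be read-only    (cppReg) */

typedef struct {
    uint8_t  arity;         /* free variables captured by the closure */
    uint8_t  evaluated;     /* nonzero once the thunk has been forced */
    uint16_t payload_words; /* size of the captured environment       */
} ThunkInfo;

typedef uint64_t (*ThunkCode)(void *env);

/* One offset resolves against both bases: metadata from the read-only
 * region, entry code from the execute-only region.                    */
static inline const ThunkInfo *thunk_info(uint64_t off) {
    return (const ThunkInfo *)(const void *)(const_region + off);
}
static inline ThunkCode thunk_code(uint64_t off) {
    /* object-to-function-pointer cast is non-portable; sketch only */
    return (ThunkCode)(uintptr_t)(code_region + off);
}
```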
No idea if those “levels of display” would cover this, because I have no idea what that means in this context.

I suspect the answer is “Not Yet Filed”, but what was the main reason behind deciding to have one global address space for all processes, apart from “because we can” in 64 bits?
I don’t really like this, for various reasons, and it’s about the only thing I don’t like so far. The reasons mentioned in the videos seem minor and incidental rather than the real motivating factors.
The overarching theme behind my dislike is that it forces all code to be relocatable, i.e. all calls and jumps are indirect. Even when those instructions themselves are very efficient, they require separate loads and the use of precious belt slots.
I used to think the main reason was that there is no real prefetching, even for code, and that all latency issues are covered by the load instruction. But the prediction talk says otherwise.
Another reason could be the mentioned but not yet explained mechanism that enables function calls without really leaving an EBB.
But when all processes think they are alone and have the full address space to themselves, and all code and data sharing is done via shared virtual memory pages, then all code can be statically linked (as far as the process/program itself is concerned), with all the advantages and optimization opportunities that brings, while still retaining all the advantages of relocatable code without any of the disadvantages. The specializer enables the creation of perfectly laid out memory images for each program on the specific system it runs on, and the virtual address translations, which always happen anyway, do the indirections for free.
- in reply to: Specialized Address Operations #323
I don’t think it would be a good idea to expose the internal-use metadata to the programmer without any defined interface, which is basically what raw memory access would mean. I think the distinction between hardware-controlled flags and software-controlled flags is a good one to have; starting to mix them would likely be the start of a whole new set of headaches.
As for using more than 3 bits as tags: in GHC, the tags are used to encode the arity of an unevaluated thunk in lazy evaluation and whether it has already been evaluated. If I remember the white paper correctly, 4 bits instead of 3 would push the case coverage from 70% to well over 90%.
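For illustration, a small C sketch of the tagging scheme as I understand it (my own helpers, not GHC’s actual macros): with 8-byte-aligned closures the low 3 bits of every pointer are free to carry the tag, and 16-byte alignment would free a 4th bit, which is the 70% to 90%+ jump mentioned above.

```c
#include <stdint.h>
#include <assert.h>

/* Tag 0 = unevaluated/unknown; tags 1..7 = evaluated, carrying the
 * constructor number or function arity. With TAG_BITS = 4 (16-byte
 * alignment) the coverage would extend to tags 1..15.               */
#define TAG_BITS 3
#define TAG_MASK ((1ull << TAG_BITS) - 1)

static inline void *tag_ptr(void *p, uint64_t tag) {
    assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= TAG_MASK);
    return (void *)((uintptr_t)p | tag);      /* stash tag in the low bits */
}
static inline uint64_t ptr_tag(void *p) {
    return (uintptr_t)p & TAG_MASK;           /* read tag without a load   */
}
static inline void *untag_ptr(void *p) {
    return (void *)((uintptr_t)p & ~(uintptr_t)TAG_MASK);
}
```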
- in reply to: Program Load vs. Program Install #311
Looks like all my imagined technical concerns have been dispelled nicely.
As for how to license/distribute the specializer to OS people, that’s a business decision. But still a decision that can make or break this whole thing.
> except you can have up to 16 of them
Yes, those previous talks are the reason I have this question. If you can already supply an instruction with an arbitrary number of inputs, up to 16, why do you need a special mechanism for other instructions to provide more than 2 inputs?
It could just be an encoding issue. The 2-operand case is by far the most common for all instructions except call. The new belt for the callee is invisible to the caller anyway, so from the caller’s belt perspective it should be irrelevant whether arguments are provided in an argument list or with additional args operations. The caller doesn’t care whether the arguments are combined in a new belt in one case or in internal data paths in the other; it sees neither. All it sees are the eventual results.
I read papers and documentation on KeyKos and Coyotos. Although that was a few years ago.
And separate address spaces offer most of the advantages of capabilities without being C-incompatible. The traditional problem with separate address spaces is expensive context switches. But on multicore 64-bit processors context switches can be vastly reduced, and the Mill goes in that direction anyway. And with the cache architecture of the Mill and its below-cache TLBs, context switches can become a lot cheaper too, even with separate address spaces.
And as you said yourself, it’s better to leave the OS out of as much as possible and let the hardware take care of things, and capabilities must be OS constructs and cannot be hardware data types the way virtual addresses can. Or am I wrong here?

And yes, language design is a terrible vice. Ever since I started programming I have been unhappy and annoyed and frustrated with whatever I was using. And whenever you try to find or think of ways to do things better, what you find and learn usually only reveals new annoyances that quickly make you forget the old ones you have solved.
The TLB issues didn’t quite sit right with me, which usually is a sign I’m wrong or misunderstood something.
What I obviously was wrong about, and what became clear upon rewatching the memory talk, is that the caches are all virtually addressed, and the overhead of going through the TLB is only incurred when it is hidden by a real DRAM access on a cache miss.
That means you could still have per-process address spaces, fix all the bases for direct memory-access offsets, and still get almost all address indirections for free through the virtual memory image created by the specializer. And even better than I thought before, since translation only happens on cache loads and writebacks, not on every memory access.
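A toy model in C of how I now picture it, with a single-entry “cache” and stand-in translation just to make the control flow concrete (this is my mental model of the talk, not the Mill’s actual hardware):

```c
#include <stdint.h>

/* One-entry virtually addressed "cache"; real caches index by sets/ways. */
static uint64_t cached_vaddr = UINT64_MAX;
static uint64_t cached_data;

static uint64_t tlb_translate(uint64_t vaddr) {  /* stand-in translation  */
    return vaddr ^ 0xffff000000000000ull;
}
static uint64_t dram_read(uint64_t paddr) {      /* stand-in DRAM access  */
    return paddr * 2654435761u;
}

/* Translation is paid only on a miss, where it overlaps the DRAM latency. */
static uint64_t load(uint64_t vaddr) {
    if (vaddr == cached_vaddr)                /* hit: virtual tag compare, no TLB */
        return cached_data;
    uint64_t paddr = tlb_translate(vaddr);    /* miss: translate...               */
    cached_data  = dram_read(paddr);          /* ...while the DRAM access runs    */
    cached_vaddr = vaddr;
    return cached_data;
}
```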
And there are more advantages:
You could get rid of most of those per-frame special base registers and replace them with one address space selector/process ID register (or integrate this into the frame ID). You would still need to encode the needed base in the load instructions as if they were registers, though, so on the encoding side there is no real difference.
Those hard-coded bases/memory regions at the same time serve as the permissions, integrating them into the virtual addresses and making the separate protection tables redundant. And since the base selectors also serve as the permission, the encoding for stores and loads becomes smaller. For this to work you would have to do some cache invalidation on context switches, though, or make the process IDs part of the key for cache lookups in some way.
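A hypothetical sketch of that encoding in C; the region names, base values, and permission mapping are all invented for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* A small region selector in the load/store encoding picks a fixed
 * per-process base, and the region itself implies the permission.  */
enum Region { R_CODE, R_CONST, R_DATA, R_STACK, N_REGIONS };

static const uint64_t region_base[N_REGIONS] = {
    0x0000100000000000ull,  /* code:  execute-only */
    0x0000200000000000ull,  /* const: read-only    */
    0x0000300000000000ull,  /* data:  read/write   */
    0x0000400000000000ull,  /* stack: read/write   */
};

static inline uint64_t effective_addr(enum Region r, uint64_t off) {
    return region_base[r] + off;          /* permission implied by r */
}

static inline bool store_allowed(enum Region r) {
    return r == R_DATA || r == R_STACK;   /* decided from the encoding alone,
                                             no protection-table lookup      */
}
```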
Having the permissions in the virtual address itself also enables the specialization of caches and cache behavior. You know certain data is constant or write-only because of its virtual address, allowing you to adjust cache strategies or even to have separate (and smaller and faster) caches for those.
Having fixed virtual bases per process could also make NUMA memory architectures (which will very likely only become more widespread in 64-bit computing) visible to the processes and compilers themselves, letting them adjust their algorithms, load strategies, memory layouts, and even process and thread scheduling accordingly.
Very big gains from fixed offset bases are to be had in code, too. In this area it would mean a pretty big change to the architecture, though. On the other hand, in a way it brings some of the core ideas of the Mill to their full potential:
Having several fixed code-pointer bases lets you have even more instruction streams trivially. In the encoding white paper, more than two instruction streams are said to be possible via interleaving cache lines, which has some problems. A fixed base address for each instruction stream doesn’t have those problems, and it gives you all the advantages of more streams mentioned there: more, smaller, and faster dedicated instruction caches, smaller instruction encoding, specialized decoders that become an order of magnitude more efficient with each additional one, etc. Although you will probably get diminishing returns beyond 4 streams (likely subdivisions of functionality would be control flow, compare/data routing, load/store, and arithmetic).
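Purely speculative, but in C the fan-out I imagine would look something like this, with made-up stream names and base values:

```c
#include <stdint.h>

enum { S_FLOW, S_COMPARE, S_LOADSTORE, S_ARITH, N_STREAMS };

/* Fixed per-process bases for each stream's code region (invented values). */
static const uint64_t stream_base[N_STREAMS] = {
    0x0000500000000000ull, 0x0000600000000000ull,
    0x0000700000000000ull, 0x0000800000000000ull,
};

/* One branch target offset fans out to all streams' program counters,
 * so an EBB entry needs no interleaved cache lines.                    */
static inline uint64_t stream_pc(int stream, uint64_t ebb_offset) {
    return stream_base[stream] + ebb_offset;
}
```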
There are some caveats with per-process address spaces with fixed memory regions. The first process that sets up the virtual memory must be treated specially, with a carefully laid out physical memory image. But considering how carefully laid out and crufty the lower memory regions on a PC are, with IO addresses, BIOS code and tables, etc., this isn’t anything that isn’t done routinely already.
Also, sharing addresses between processes is harder and needs to be arbitrated in some way. Again, this is something that is done in most current virtual memory systems, and it is a relatively rare operation, especially in comparison to normal intra-process cross-module function calls, which would now be much cheaper.
Another issue could be that actual executable code size is limited per process. But that is just a question of configuring the chip. Even in the biggest applications today, the combined text segments of all modules don’t go beyond a few hundred MB; actually, I have never seen one where the actual code size goes beyond a few dozen MB. In essence, a per-process code segment limit of a few GB is far less limiting than the actual maximum installed memory usable by all processes, especially in an architecture that is geared towards many cores, few context switches, and compact code size.
It could also be beneficial to encode into the control instructions which streams are actually used after a jump. That spares unnecessary memory accesses, although in practice at least the control flow stream and the compare/data routing stream will always be used.

This post is going quite off topic, I guess.
I also admit that I have another rather self-centered and whimsical reason for wanting a full address space of its own for each process: I’m playing around with some language design, and letting each process design its own memory layout allows it to make a lot of safe assumptions about addresses, which in turn enable, or make cheaper/free, a lot of language features I’m interested in.

> The Mill design has generally been concerned with commercially important languages, and we haven’t put much thought into support for esoterica, which Haskell still is
Fully understandable. But I also suspect that the frame contexts, about which not much has been revealed yet, provide everything you need.
> However, the Mill supports label variables and indirect branches, which should provide a satisfactory substitute.
I also expect the pick instruction to be very helpful there.
> The data structure in which there is a pointer to a closure which contains a pointer to an info record which contains a pointer to the code would be very painful on any modern machine
Yes, a lot has changed there. Most common uses of the info structure have been optimized away, in particular via tagging the pointers. And that is something the Mill supports better than any current hardware.
> That is, a closure pointer is the pair
You most likely wouldn’t even need two full 64-bit pointers; two smaller offsets would do, using the vector instructions to split them. At least in languages with a runtime.
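Something like this sketch, say (the field widths and helper names are just illustrative assumptions):

```c
#include <stdint.h>

/* A "closure pointer" packing a 32-bit code offset and a 32-bit
 * environment offset into one 64-bit belt value; shifts (or vector
 * lane extraction in hardware) split them again.                  */
static inline uint64_t pack_closure(uint32_t code_off, uint32_t env_off) {
    return ((uint64_t)code_off << 32) | env_off;
}
static inline uint32_t closure_code_off(uint64_t c) { return (uint32_t)(c >> 32); }
static inline uint32_t closure_env_off(uint64_t c)  { return (uint32_t)c; }
```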
And I’m no expert by far either, so it should be given a rest here.
> Are you thinking about closures created by a JIT, where the code could be anywhere and the code and data may be moved by the GC? Or compiled closures like lambdas in C++?
I mean compiled closures.
But since we’re getting more specific now, I went and looked up this old Peyton Jones paper again, which you are most likely familiar with.
In chapter 7 he describes how Haskell core-language closures could/should be implemented on stock hardware of the time. It’s very possible that most of that information is very much obsolete today, but as you have said so nicely yourself, the architectures of stock hardware haven’t changed/improved much since then.
In particular, the optimization in 7.6 cannot really be done when constant data and code are in different memory protection turfs. In that case it would be beneficial to have implicit bases handy for both, so that one offset works for both, although the info table itself is referenced far less than the actual code.
> The great majority of transfers are direct and don’t use pointers, and those have neither load nor belt positions. You can branch to a label or call a function in a single operation with no belt change. Color me confused.
That’s because I was confused, due to lack of information and misinterpretation on my part.
> Given a 64-bit address space, static linking is right out
True. Static linking should always be an offset to some implicit base too, especially in 64-bit. Whether that base is NULL, some other fixed address, or the current program counter for relative addressing doesn’t matter. The important part for static linking is that you can compute the final destinations at compile time and hardcode them into the instruction stream relative to that implicit base.
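In C terms, what I mean is roughly this; the base, the offset value, and the helper are all hypothetical, and the object-to-function-pointer cast is non-portable and for illustration only:

```c
#include <stdint.h>

typedef void (*Fn)(void);

static const uint8_t code_base[0x8000]; /* stand-in for the implicit base        */
#define HELPER_OFF 0x4bd0ull            /* displacement fixed at specialize time */

/* The call target is a constant displacement from the implicit base:
 * no load, no belt slot, no runtime relocation needed.               */
static inline void call_helper(void) {
    ((Fn)(uintptr_t)(code_base + HELPER_OFF))();
}
```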
Now, having read your reply, I get the impression that there is an implicit base for each module a process uses, handled somewhere other than a belt slot, and that those implicit bases are likely initialized at module load time in some separate mapping structure parallel to the virtual addresses.
> there’s no advantage to fixing the base
I can actually think of a few advantages to fixing the base (which of course is only possible when you don’t have a SAS). But:
> The obvious one is getting the TLB out of 90+% of memory access.
If this indeed means that in 90% of memory accesses the virtual address is the physical address and a translation is unnecessary and not done, I fully agree that this would obliterate any advantage I have in mind. I just can’t imagine how this could feasibly be implemented: you would need to know whether an address needs translating before you actually bother to look it up. I guess this can be done with the protection zones.
If that is not the case, and you still have to do address translation on every memory access, then I guess I could start listing and explaining all those advantages I figure fixed bases and offsets have to offer, especially on an architecture like the Mill.
- in reply to: Program Load vs. Program Install #295
All good points. Just writing about something that’s bothering me.
This module load/assembly library is an essential part of the whole architecture. The only ones who have something like this are IBM in their mainframe systems, and they have full vertical integration, vast resources, and a very specialized market.
Getting something like this to work and accepted in a diverse ecosystem with many more or less independent parties will be one of the crucial details.
This is code that will have to be on every system, in a very pivotal position. So it had better be fully open source with a liberal license, and very adaptable. Neither Microsoft and Apple nor the Linux and BSD people would accept anything less, or tolerate a piece of code they have no control over at the center of their systems.