Forum Replies Created
- in reply to: Specification #1146
So just for extreme clarity: the simulator examples given here, and the assembly language shown, are only ever member-specific code, never abstract family code? JITs would generate an LLVM-ish intermediate serialized representation?
(Related to JITs, and slightly off-topic; I don’t recall which talk covered it: with the way EBBs and function semantics work, is it possible for a subroutine to have multiple entry points? I’m just considering optimization possibilities for always-vararg languages.)
When bin-filling the binary instruction encoding, does your software take expected instruction frequency into account, giving common instructions the tighter encodings and relegating uncommon instructions to otherwise needlessly larger encodings to make room for the common ones?
Do you anticipate any temporal quantization or aliasing problems with tracking time as (presumably) integer picoseconds in simulation for systems with multiple clock rates? It seems like there could be edge cases where the sim would consider activities simultaneous that hardware would order, depending on whether that distinction is useful at the sub-picosecond scale.
Also, as a Lisp programmer and heavy DSL creator, I must say that when you were showing C++ metaprogramming and especially the enum issue, I could only think of Greenspun’s Tenth Rule. 😉
You said that the memory security model is intended to be very coarse-grained. Many x86 garbage-collected systems use page-sized protections in the MMU to inject read/write barriers based on page type, and to manage dirty flags on old-generation memory pages. These protection mappings can be modified on every trap, or at least on every GC cycle. Is this sort of thinking compatible with Mill memory security regions?
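To illustrate what I mean, here’s a minimal POSIX C sketch of the page-protection write barrier (toy bookkeeping; the handler-side details of a real GC are elided):

```c
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define OLD_PAGES 1024

static char *old_gen;                    /* old-generation region      */
static unsigned char dirty[OLD_PAGES];   /* one dirty flag per page    */

/* SIGSEGV handler: a write to a protected old-gen page marks that page
 * dirty for the next GC cycle, re-opens it, and lets the faulting
 * store re-execute.                                                   */
static void barrier_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)old_gen;
    if (addr >= base && addr < base + (uintptr_t)OLD_PAGES * PAGE_SIZE) {
        size_t page = (addr - base) / PAGE_SIZE;
        dirty[page] = 1;
        mprotect(old_gen + page * PAGE_SIZE, PAGE_SIZE,
                 PROT_READ | PROT_WRITE);
        return;                          /* kernel retries the store   */
    }
    _exit(1);                            /* a genuine wild access      */
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = barrier_handler;
    sigaction(SIGSEGV, &sa, NULL);

    /* Map the region read-only: every first write to a page traps.   */
    old_gen = mmap(NULL, (size_t)OLD_PAGES * PAGE_SIZE, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    old_gen[123 * PAGE_SIZE] = 42;       /* traps once, marks page 123 */
    return !dirty[123];
}
```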
Systems like the JVM use memory reads from areas made unreadable as a safepointing device. To my understanding, the x86’s speculative processing guarantees that the trap is raised before any side effects from later instructions are committed. In the Mill’s more logically asynchronous memory model, does this guarantee still hold?
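For clarity, the safepoint device I’m describing looks roughly like this (C sketch; the handler that actually parks threads is omitted):

```c
#include <sys/mman.h>

/* One always-readable page that every mutator thread loads from at its
 * safepoints.  To stop the world, the runtime revokes read access; the
 * next poll in each thread faults, and a SIGSEGV handler (not shown)
 * parks the thread for GC.  On x86 the faulting load traps before any
 * younger instruction's side effects commit; my question is whether
 * the Mill preserves that for the handler.                            */
static volatile char *safepoint_page;    /* mmap'd once at startup     */

static void safepoint_init(void) {
    safepoint_page = mmap(NULL, 4096, PROT_READ,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

static inline void safepoint_poll(void) {
    (void)*safepoint_page;               /* traps iff the page is armed */
}

static void arm_safepoints(void) {       /* called by the GC           */
    mprotect((void *)safepoint_page, 4096, PROT_NONE);
}
```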
Not really security related: when JITting, do you need to generate member-specific code, or can you emit the family-wide portable binary format and use the loader to optimize and inject it into your memory space?
Just a super-quick verification of things I presume but that weren’t explicitly stated:
For the inp/outp data passing within a local function call (not a portal), the implicit arguments do not undergo any additional memory copying during the function call?
Implicit arguments are simply framePointer[-N], whereas local stack variables are framePointer[+N], and thus cost the same to access?
I’m pondering the Lisp/JavaScript/etc. style of effectively always-varargs parameter passing, and it seems this would be the mechanism employed.
- in reply to: Specification #1154
Will the abstract form expose an infinite-length SSA belt with serial instruction execution in its virtual semantics?
- in reply to: Specification #1153
Thank you as always for your thorough replies. I’ll split them up next time. I took notes while watching, then compiled my questions not answered by the end.
(Greenspun had only one “rule”, about complex programs containing an ad-hoc subportion of Common Lisp (later amended to point that finger at Common Lisp itself 😉 ). He called it his “tenth rule” but had no other rules. Which one were you referring to with the seventh?)
To my understanding, exokernels expect hardware drivers, filesystems, and other abstractions to be linked directly into user-space programs, so there is no IPC or context switching in those layers. Application optimizations can therefore drill to any abstraction depth, skipping and/or caching more levels of processing and decision-making than normal abstraction layers allow. The kernel’s security role is only granting permission to hit the hardware, or some portion thereof.
However, the organization is fairly similar to microkernels. One could consider exokernels to be a particular (and peculiar) optimization of microkernel architecture.
Speaking of operating systems: while it has gotten on in years, I think AmigaOS would be a great fit. If I recall everything correctly, it already assumes a flat memory model, uses by-reference IPC data passing, and makes OS calls as normal function calls. Memory allocations require protection descriptions of how they’ll be shared.
I don’t know how much of this has changed in AmigaOS 4, but the assumptions made for simplicity and speed back then would align very well with how the Mill accelerates and secures them.
Given that service calls are synchronous, I would presume that the current thread identifiers still reflect the caller. These should be read-only by both the caller and service, and shouldn’t be spoofable by user-level malicious code. From there you should be able to get to the OS-specific or internal security descriptors.
Sorry, I’ve been using the wrong term: instead of “slab allocator”, I actually meant “TLAB allocator”. Nursery allocation with simple inline pointer bumping, each thread having its own preallocated buffer. Many systems try to keep such a pointer in a global register.
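In other words, a fast path along these lines (C sketch with hypothetical names; the refill/GC slow path is left out):

```c
#include <stddef.h>

/* Per-thread allocation buffer: the nursery fast path is one compare
 * and one add, with no locking.  Many runtimes pin these two words in
 * a global register; otherwise they live in thread-local storage.     */
typedef struct {
    char *alloc_ptr;    /* next free byte in this thread's buffer */
    char *alloc_end;    /* end of the buffer                      */
} tlab;

static _Thread_local tlab my_tlab;

void *tlab_refill_and_alloc(size_t n);  /* slow path: new buffer or GC */

static inline void *tlab_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;          /* keep 16-byte alignment     */
    char *p = my_tlab.alloc_ptr;
    if ((size_t)(my_tlab.alloc_end - p) < n)
        return tlab_refill_and_alloc(n); /* rare: buffer exhausted     */
    my_tlab.alloc_ptr = p + n;
    return p;
}
```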
Joe: Lisp heap objects are not of fixed size. Pointers to objects contain tag bits, with some of the common types completely contained within the tag. So cons cells (the basic 2-tuple linked-list element) are literally 2 bare words in memory with zero overhead; the fact that a cell exists there is held purely in the tag bits of pointers reaching that memory location, not on the object itself. More complicated objects obviously have some actualized overhead, generally one word. Of course, objects themselves have no inherent mutexes, hashes, etc., the way languages like Java do. Those would be user-managed fields.
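To make the layout concrete, here’s a sketch of a typical low-tag scheme in C (the tag assignments are hypothetical; real implementations differ in detail):

```c
#include <stdint.h>

/* Low 3 bits of each word are the type tag; heap objects are 8-byte
 * aligned, so the bits above the tag are an exact address.            */
typedef uintptr_t lispobj;

#define TAG_MASK   ((uintptr_t)7)
#define TAG_FIXNUM ((uintptr_t)0)  /* small int lives wholly in the word */
#define TAG_CONS   ((uintptr_t)1)  /* points at two bare words           */

typedef struct { lispobj car, cdr; } cons_cell;   /* no header word    */

static inline int is_cons(lispobj x) { return (x & TAG_MASK) == TAG_CONS; }

static inline cons_cell *untag_cons(lispobj x) {
    return (cons_cell *)(x - TAG_CONS);  /* strip the tag to get memory */
}

/* A fixnum is the integer shifted left past the tag bits; with tag 0,
 * fixnum add/subtract needs no untagging at all.                      */
static inline lispobj make_fixnum(intptr_t n) { return (lispobj)n << 3; }
```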
Ivan: While this has been focused on Lisp, I think it is a reasonable concern in today’s market to ensure that JavaScript, Python, and Java can achieve great performance as well, without hackish workarounds inside their JITs. JavaScript and Python especially have a lot of Lisp-like features, with many similarities in their input parameter passing. Inline TLAB allocators, multiple stacks, etc., are common in all these types of systems. Having the hardware track dirty old-generation writes for the garbage collector, without needing to trap & interrupt, can save a ton of time.
Just something to toss in with the “needs thought” pile. Some of these likely do not affect the core functionality, but might be orthogonal add-ons of varying intensity.
Knowing the expected number of return values via hardware would help, as far as I know right now. It would at least eliminate having to keep an extra count parameter live from the function entry all the way through to where it’s used at the very end.
I presume that all return values are given in a single return instruction, and that you can’t stack values up separately in a loop and then have a single shared return? It sounds like the code would have to switch/case between “return val”, “return val,nil”, “return val,nil,nil”, etc., if everything is set up by the return instruction itself. I’m not sure whether there are better ways of expressing this, but if it’s easy to toss in a specReg, that would definitely open up a decent option.
So what happens if the results (after the first) are communicated in memory? If the shared-register approach works on a conventional machine, then sharing pseudo-registers at static addresses (w/r/t the program load image) should work too.
They’d have to be thread-local, or potentially stack-allocated, not statically addressed. But yes, memory would be the default go-to for passing parameters around.
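Something along these lines, say, with each thread getting a block of pseudo-registers at a fixed offset from its TLS base (names hypothetical):

```c
#include <stdint.h>

/* Thread-local "pseudo-registers" for multiple-value returns: the
 * callee writes values past the first here, plus the total count; the
 * caller reads them only if it actually bound more than one value.    */
enum { MV_LIMIT = 20 };

typedef struct {
    uintptr_t more_values[MV_LIMIT];  /* returned values 2..N          */
    uintptr_t value_count;            /* total number of values        */
} mv_area;

static _Thread_local mv_area mv;     /* fixed offset from the TLS base */
```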
(I am not an x86 expert – how many bytes in a load/store if the offset is one byte?)
Looking at some disassemblies on x86-64, all the ones I’m seeing are 5 bytes long. Opinions about the x86 architecture are not likely to change given this information. 😉
That load can be hoisted arbitrarily high, but if LISP overhead guarantees that any call will last longer than d$1 then hoisting is not necessary.
In existing implementations, the only Lisp overhead in returns is the callee either clearing carry, or setting it & the count of returned values. The caller has no overhead before accessing at least the first 3 returned values.
Now, I haven’t even gotten into passing parameters _into_ a call, which is much more complex. 😉 There are optional parameters, order-independent optional keyword parameters, gathering the trailing parameters into a list object, referencing the entire parameter list as well as its parts, etc. There is an ‘apply’ function for dynamic build-up or pass-through of parameter lists. Current implementations set a register to the count of parameters, similar to multi-value returns. A lot of the same issues come up here too, and a specReg for the count of incoming parameters would avert some overhead as well. At least there are no older values on the function’s belt in this case.
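For instance, under a count-in-register convention, a callee with an optional parameter just tests the count before touching the trailing argument. A hedged C model of (defun f (a &optional (b 10)) ...), reusing the fixnum encoding from the earlier sketch:

```c
#include <stdint.h>

typedef uintptr_t lispobj;

/* The caller passes nargs alongside the argument registers; the callee
 * fills in defaults for any optionals the caller omitted.  A &rest
 * parameter would similarly cons the trailing arguments into a list.  */
lispobj f_entry(uintptr_t nargs, lispobj a, lispobj b_in) {
    /* fixnum 10 under the low-tag scheme sketched earlier */
    lispobj b = (nargs >= 2) ? b_in : (lispobj)(10 << 3);
    (void)a;                  /* body would use a and b from here on   */
    return b;
}
```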
Yes, I agree with the tradeoffs you’ve chosen. It does have some impedance mismatch with what Lisp would ideally want, but that’s already the case on x86 (even the TLAB allocation pointer is in thread-local storage, not a register), and Lisp is still speed-competitive there.
Like I said, it’s an interesting issue to wipe the slate of optimization assumptions clean and look at how the top-level goals for this class of language compilers will be accomplished on a new architecture.
The calling code can either bind a specific number of return values (implicitly 1 unless you manually bind more), or capture all returned values into a list with no expectation of count. The latter is usually used in interactive & debugging tools more than anything else, so the former is the common case.
However, a mismatch when expecting a specific number is never an error: if there are more values than expected, the rest are ignored; if there are fewer, the remaining bindings are set to NIL. So technically the caller always “cares” but doesn’t know, and freely accepts whatever it gets at runtime.
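So the caller-side binding logic reduces to something like this (C sketch, with a hypothetical NIL encoding):

```c
#include <stdint.h>

typedef uintptr_t lispobj;
#define LISP_NIL ((lispobj)0)   /* hypothetical NIL encoding */

/* multiple-value-bind semantics: bind 'expected' variables from the
 * 'count' values actually returned.  Extra values are dropped on the
 * floor; shortfalls are padded with NIL.  Never an error.             */
static void bind_values(lispobj *bound, int expected,
                        const lispobj *values, int count) {
    for (int i = 0; i < expected; i++)
        bound[i] = (i < count) ? values[i] : LISP_NIL;
}
```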
It’s hard to say how often a mismatch occurs when calling user-written & library functions, but it’s relatively common when calling the standard functions: hash-table accesses, mathematical floor/ceiling/truncate/round, file-reading functions, string parsing, and many others return multiple values yet are commonly used only for their first. I don’t have the infrastructure to directly measure percentages.
I do think it counts as “genuinely dynamic”, as you describe above, at least in the compiler assumptions for current platforms. What is the hit on an 8-slot belt of having 25% of the belt taken by every return, instead of 12.5%? Do values swapped to the scratchpad or spiller for short-term use effectively cost little to nothing?
The Mill has some “global” registers like the thread-local pointer. Are there some that are available for user code as well? They’d be handy here to extend the calling convention, or just in general to extend the ABI that dynamic & GC’d programming languages can build their infrastructure from (slab allocation pointers, multiple stacks, current closure, dynamic environments, local/scoped constant tables, etc). On SBCL PowerPC, a whopping ~14 registers are globally reserved for such infrastructure (though some just hold constant addresses for speed & compactness). Smaller register architectures have to trade off which of these very commonly used pointers will be offloaded to RAM.
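On register-poor targets, the usual fallback is to park all of these behind a single base pointer, along these lines (field names hypothetical):

```c
/* One thread-local block holding the runtime's "global registers";
 * every access is then a single base pointer plus constant offset.
 * Roughly the set SBCL reserves whole registers for on PowerPC.       */
typedef struct {
    char *tlab_ptr, *tlab_end;   /* nursery bump allocator             */
    void *control_stack;         /* secondary (non-C) stack            */
    void *current_closure;
    void *dynamic_env;           /* special-variable bindings          */
    void *constants;             /* scoped constant table              */
} runtime_regs;

static _Thread_local runtime_regs rt;
```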
(Sorry for my late reply, I wasn’t aware that the email notification option didn’t include activity on the whole thread.)
I also thought of one other software situation where the carry flag is important: emulators, especially when the target architecture has the same register width as the host.
Anyway, in the Lisp situation, a common case is where a function returns multiple values, but the user of that function only bothers with the first (and idiomatically the most important) return value. Since we can freely pass lambdas around, if we as the caller are only interested in 1 return value, it’s unknown whether the function we’re eventually calling will return more or not.
A common example is the hash-table accessor ‘gethash’. It returns 2 values: the value looked up (or NIL if not found), and a boolean explicitly stating whether or not it was found. The second return value is needed for disambiguation if NILs are stored as values in the hash table. It is always generated, but ignored in the majority of cases.
The default calling syntax only passes through the first return value, but you can specifically capture the multiple return values if you’re interested in them.
The returned values can each be anything, primitive (immediate register value) or compound/boxed (tagged pointer). Returning multiple values is not the same as returning a list, which is returning one value.
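In C clothing, the gethash contract above would look something like this (illustrative only):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t lispobj;

/* (gethash key table): the primary value plus a foundp flag, so a NIL
 * stored *in* the table is distinguishable from a miss.  Most call
 * sites read .value and never look at .foundp.                        */
typedef struct {
    lispobj value;    /* value looked up, or NIL if absent */
    bool    foundp;   /* was the key actually present?     */
} gethash_result;
```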
—
In the x86-64 compiler implementation, there are 4 registers unsaved between calls. If there is only 1 return value, the first of these registers is used, with carry clear. If there are more return values, carry is set, the first 3 registers are used for return values, with the count in the 4th. If there are more than 3 return values, they are spilled to the stack.
The nice thing is that if the caller only cares about 1 value, they just read that first output register, ignoring carry & the others. The calling convention keeps everything tidy so stack spill storage is not lost or trampled.
If the caller wants 2 or 3 return values, they’re immediately available as registers. The count & carry check can be elided when running with “safety level 0” optimizations, or if the type propagation can guarantee the number of expected return values. It’s uncommon, but safe & supported to have more than 3; it just has to go out to memory.
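Modeled as data in C for clarity (the struct stands in for the four unsaved registers plus the carry flag; this is an illustration, not actual ABI code):

```c
#include <stdbool.h>
#include <stdint.h>

/* The convention above, rendered as data: up to three values ride in
 * the unsaved registers, the count in the fourth, and the carry flag
 * distinguishes single- from multiple-value returns.  Values past the
 * third spill to the stack (not modeled here).                        */
typedef struct {
    uintptr_t r0, r1, r2;   /* first three return values               */
    uintptr_t count;        /* number of values; valid when carry set  */
    bool      carry;        /* set => multiple values returned         */
} mv_ret;

/* The cheap common case: a caller wanting one value reads r0 and
 * ignores carry, count, and the rest entirely.                        */
static inline uintptr_t primary_value(mv_ret r) { return r.r0; }
```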
—
Regarding the Mill, I agree that the calling convention there would likely be two return values per call (the 1st value and the count), with return values past the first stored externally.
Looking back at the belt video again, right, it doesn’t look like the scratchpad can be used to pass data across function boundaries. Since Lisp multiple-value returns are effectively optional side-band data, it would be a bit unfortunate to always have to manage system memory for writing data that is so often ignored.
However, I’m sure many things could mitigate this, like passing the number of expected return values into a function, or having different function entry points for single- or N-valued returns. I’ve only dived deep into the optimized assembly output of one architecture (x86-64), so my view of what goes on inside might not encompass all the tricks used on other platforms.
Just like the Mill is a complete rethink of what goes on in a CPU, it is a natural conclusion that optimizing compilers targeting the family would require a complete rethink of strategies to best take advantage of it. I’m sure your C compilers reflect this already, and the optimization opportunities are especially wide open to be solved for languages that do complex things well beyond “portable assembly code”.