Forum Replies Created
- in reply to: multi-cpu memory model #3586
Memory consistency models are subtle and confusing. The exact definitions are too detailed and specialized to present here; start with https://en.wikipedia.org/wiki/Consistency_model.
The Mill presents sequential consistency, *not* global consistency. In practice this means that any single thread works as if the machine had only one core. The x86 is almost sequentially consistent, so any algorithm that works on an x86 will work on a Mill, but a few that work on a Mill won’t work reliably on x86.
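To make the difference concrete, here is the classic store-buffering litmus test, in plain C (volatile is used only to keep the compiler honest; real code would use C11 atomics, and this sketch is mine, not Mill code). Under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible, so Dekker-style mutual exclusion works without fences; x86-TSO permits that outcome unless a fence is inserted.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared flags; volatile only stops the compiler from caching them,
   it does not constrain the hardware, which is the point of the test. */
volatile int x = 0, y = 0;
volatile int r1 = -1, r2 = -1;

void *t1(void *arg) { x = 1; r1 = y; return NULL; }
void *t2(void *arg) { y = 1; r2 = x; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Sequential consistency forbids r1 == 0 && r2 == 0;
       x86-TSO allows it, because a store can pass a later load. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```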
That’s the hardware model, the one you would see when writing in conAsm. The translation from HLL to conAsm is also subject to ordering issues (on any ISA, not just the Mill). The Mill architecture is designed to let the compiler do massive reordering and speculation. Any bog-standard out-of-order architecture does the same; the difference is that the Mill’s static design does it at compile time. For both Mill and OOO, reordering and speculation are not intrinsically harmful; what matters is whether they are visible to program semantics or to a potential attacker.
Unlike on other architectures, the great mass of Mill instructions are idempotent: you can execute them in any order consistent with dataflow, and speculate them with abandon. You will know if a compiler bug violates dataflow because your program won’t work, but otherwise you are good to go. The few non-speculable instructions, which are order-sensitive, require special handling.
The Mill compiler is based on LLVM. Languages like C present a single-threaded model; when they are used in contexts with potential language-opaque asynchronous access, there are well-known examples in the literature of the compiler doing something to the code that a naive programmer would not expect; see Linus Rants(tm). We are subject to the same issues: there are C semantic issues that no ISA can fix, although liberal use of “volatile” will help.
Once we get the genAsm from LLVM, the translation to conAsm must preserve order semantics and be exploit-free (modulo bugs, of course). The general rule is that no non-speculable Mill instruction may alter machine state based on a value read out of program order. The details are many, but the crucial one is that the memory request streams of any single thread are always in program order, and any speculated operation is guarded in such a way that the guard is verified before the instruction alters machine state.
Consequently, the translation may move instructions over branches to speculate them, but only by carrying the branch predicate along as a guard. This gets rid of branch overhead and its attendant risk of misprediction costs.
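A rough before/after sketch of that transformation, written in C for readability; the “after” form is only legal on hardware like the Mill, where a speculated load of a bad address yields a NaR rather than faulting, so take it as shape, not as portable C.

```c
/* Before: the load executes only on the taken path. */
int before(const int *p, int cond) {
    if (cond)
        return *p + 1;
    return 0;
}

/* After: the load is hoisted above the branch, and the branch
   predicate rides along as a guard on the result. */
int after(const int *p, int cond) {
    int t = *p;              /* speculated load: safe on a Mill (NaR on a
                                bad address), not in portable C */
    return cond ? t + 1 : 0; /* the guard is verified before the value
                                can alter visible state */
}
```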
Atomicity: the Mill uses a conventional optimistic model, with no locking. It works essentially the same way as in the IBM Z-series mainframes, and is what Intel tried to do but couldn’t get to work (in fairness to Intel, it’s a lot harder to be optimistic in an OOO). We don’t expect to do a video about it, although there will of course be technical documentation whenever we can get to that.
- in reply to: Ivan do you have any insight why rust is popular? #3576
Sorry, I have no idea. I work well below the language level, and have never used Rust in particular.
This is at a level above the hardware and architecture, a matter of OS and language. The hardware supports user-mode hardware interrupt handlers for code (like your friendly local super-collider) that wants to work at that level, but more typically the code would use libraries like threads and asyncio packages, and would approach the problem through the model those libraries expose.
Sorry, business side details are NDA only. Patience, waiting is.
- in reply to: Does the mill have any hardware for copying bits? #3573
Several questions, with only some answers 🙂 In the future, please put each question in a separate posting.
First, regarding mixed-width access. There are no dynamic-width loads; Mill access widths are static, so a particular load op is bound to a particular width. But the logic that determines what width to use can be used to predicate selectively among the hardware-supported widths, essentially a switch statement where each case is a different width. Any ISA can do the same; what’s different on the Mill is that the (frequently mispredicting) branches that other ISAs need to implement the switch are unnecessary. Instead, the code will fire off all the different-width loads at once, with each guarded by the width predicate so only one actually gets to memory.
How long that takes depends on the provisioning of the Mill member running the code. For a mid-range Mill with two load units it takes two cycles; the predicate generation will overlap with the load instructions for free.
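In source terms the dispatch looks like an ordinary switch over the width, as in the C sketch below; per the above, a Mill would issue all the case loads together, each guarded by the width predicate, rather than branching among them.

```c
#include <stdint.h>

/* Width-dispatched load: a switch in the source. On most ISAs this
   becomes branches; on a Mill all four loads can issue at once, each
   guarded by a predicate so only the selected one reaches memory. */
uint64_t load_width(const void *p, int width) {
    switch (width) {
    case 1: return *(const uint8_t  *)p;
    case 2: return *(const uint16_t *)p;
    case 4: return *(const uint32_t *)p;
    case 8: return *(const uint64_t *)p;
    default: return 0;   /* unsupported width */
    }
}
```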
For your second question, about async DMA. There are two approaches. One can put explicit device-specific hardware in the configuration that accumulates and buffers data and interrupts a CPU when some desired amount is available (or a timeout happens). Such hardware is common in the embedded world, and would work as well with a Mill as with any other ISA. Alternatively, one can use explicit Mill facilities for that kind of access, but those facilities are NYF (Not Yet Filed). Sorry 🙂
Lastly, timers. Yes, there are timers, of several sorts, that can be configured into a particular member chip. These are (in general) accessed through MMIO. Like any memory access, code can only use MMIO to addresses (and hence devices) for which it has permissions in the PLB. A permission manager can ensure that different threads don’t stomp on each other’s countdowns.
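As a sketch of what MMIO timer access looks like from C: the base address and register layout below are entirely invented for illustration; on a Mill the accesses fault unless the PLB grants this thread permission on the device’s address range.

```c
#include <stdint.h>

/* Hypothetical countdown timer: address and layout invented for
   illustration only. PLB permission on this address range determines
   which threads may touch it. */
#define TIMER ((volatile uint32_t *)0x40001000u)
enum { TIMER_COUNT = 0, TIMER_CTRL = 1 };

static inline void timer_start(uint32_t ticks) {
    TIMER[TIMER_COUNT] = ticks;  /* load the countdown value */
    TIMER[TIMER_CTRL]  = 1;      /* hypothetical enable bit */
}
```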
- in reply to: thread limits #3560
It’s unclear whether Go on a Mill should map its threadlets onto Mill hardware threads; they may be too heavy-duty for Go use. Frankly, I don’t know enough about Go and its usage to say. One thing that’s safe to say is that no matter what size we choose for a thread id there will be someone who wants more threads. 🙂
Thread runaway is possible not only due to malice but also due to program bugs. In this respect a runaway spawn is much like a runaway call (i.e. recursion) or a runaway malloc, and it has the same fix: quotas.
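A toy illustration of the quota idea in pthreads terms (the counter and limit are invented; a real kernel would charge spawns against a per-turf quota, just as stack and heap quotas bound runaway recursion and malloc):

```c
#include <errno.h>
#include <pthread.h>

#define THREAD_QUOTA 1024        /* invented per-process limit */
static int live_threads = 0;     /* toy counter; real code needs atomics
                                    and must decrement on thread exit */

int quota_spawn(pthread_t *t, void *(*entry)(void *), void *arg) {
    if (live_threads >= THREAD_QUOTA)
        return EAGAIN;           /* quota exhausted: refuse the spawn */
    live_threads++;
    return pthread_create(t, NULL, entry, arg);
}
```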
As for virtualization, a virtualized Mill presents what appears to be a real Mill, which has a full shared virtual address space both to the app and to the (guest) kernel/OS. The hypervisor runs in another (real) Mill, with a full space disjoint from all the guest spaces, but has means by which it can see and manage the guests. How, you ask? Sorry, that’s NYF (Not Yet Filed).
The paper is a valuable contribution to the analysis of the security defects of legacy ISAs. It’s not clear what x86 and friends can do about the reported vulnerability, other than disabling run-ahead speculation completely and paying the performance cost. However, I’m no x86 expert, and there may in fact be some clever mitigation that will become available in the next ping cycle or two.
As for Mill, the short answer is “inherently immune”. The long answer is that Mill has no kernel state that can read all of physmem at will.
The paper is greatly concerned with the fact that the kernel is mapped into the same address mapping as the user image; by finding cases where a user virtual address matches a kernel virtual address and triggering a speculative kernel reference, the user can cause the secret to be moved into the L1, whence established exploits can exfiltrate it at useful bitrates. The Mill actually makes that part easier, because the Mill uses a Single Shared Address Space model, with the caches in virtual; consequently the attacker doesn’t need to coax the kernel mapping into matching the attacker’s, because they are the same a priori.
But things end there. The Mill does not do speculative state update. Hence even if the kernel (if there were such a thing) were given an attacker (or kernel) address, then while it could initiate a load, no state gets updated, in cache or elsewhere, until the load passes its protection check.
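For reference, the gadget at issue is the well-known Spectre v1 bounds-check bypass from the literature. On an OOO machine the two loads execute speculatively past a mispredicted check and leave a secret-dependent cache footprint, which is exactly the state update a Mill never performs before the protection check passes.

```c
#include <stddef.h>
#include <stdint.h>

uint8_t  array1[16];
size_t   array1_len = 16;
uint8_t  array2[256 * 512];
volatile uint8_t sink;

void victim(size_t i) {
    if (i < array1_len) {          /* mispredicted, speculated past */
        uint8_t v = array1[i];     /* speculative out-of-bounds read */
        sink = array2[v * 512];    /* secret-dependent cache fill */
    }
}
```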
It is true that a user can initiate a load (valid in the user’s permissions) and then call (or get interrupted into) a different process, and that load may complete in the background and its result be made available after the call/interrupt returns. But the load is marked with its initiator, and the called code can’t see it. Of course, assuming the load was valid for the initiator, it will as a side effect update cache and other state in a way that is visible to the callee; this is no different from a caller seeing the presence state of the caches after whatever work the caller had done. But there’s no way for either side to control what the other loads, because there’s no speculation.
Everyone in the security world knows that the only real security would come from avoiding all speculation. The only real difference between Mill and any other ISA is that the Mill avoids speculation in a cost-effective way.
ISA lock-in, to x86 or any other, can be addressed by runtime emulation or by binary-to-binary translation. Emulation tends to be expensive, though if your application gives satisfactory performance when written in JavaScript then the cost of binary emulation should be ignorable.
Binary translation is a more promising solution; the problem is bug-for-bug compatibility, since essentially all x86 code in particular contains Undefined Behavior. However, the industry is moving toward greater acceptance of translation, and we feel that ISA incompatibility will become increasingly less important.
- in reply to: Ivan do you have any insight why rust is popular? #3591
Presumably the language-translation aspects are already handled by LLVM Rust, and the LLVM-IR-to-genAsm step handles all IR already, so the task can assume valid genAsm input and never look inside the compiler. The specializer handles all generic IR already, so that leaves only the intrinsics and whatever runtime is directly called.
That is where the work is: figuring out how a particular intrinsic/runtime behavior should be implemented in a Mill, probably initially by writing a function for it that does not itself use the intrinsics. There’s already specializer logic to replace an arbitrary intrinsic with a native function call, but the function body still has to be written; that needs an understanding of the machine but not of compilers. Some existing intrinsics are replaced by in-line code rather than by function calls, and that requires a greater understanding of how things are represented in the specializer, but there are plenty of examples and that optimization can be deferred anyway.
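As a concrete (if simplistic) example of what such a function body might look like, here is a byte-wise memcpy of the sort that could stand behind an llvm.memcpy intrinsic; the name and the choice of intrinsic are mine for illustration, not from the tool chain.

```c
#include <stddef.h>

/* Plain native replacement for a memcpy-style intrinsic: correct but
   unoptimized, and importantly it does not use the intrinsic itself. */
void *mill_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```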
So the task is not really a compiler task; it’s a runtime system task. And it may be an OS kernel task – remember that the tool chain drops code on the bare machine, not on a Linux loader with an OS behind it. Done right, the kernel itself could be written in Rust.
But only if there’s someone willing to take it on 🙂
- in reply to: Ivan do you have any insight why rust is popular? #3589
The compiler aspect is no problem; I’d expect the present tool chain to handle it with minimal effort. However, Rust needs its own runtime, and this is likely to be a piece of work on a machine with secure stacks; you can’t just muck directly with the stack links on a Mill.
I agree that it would be an interesting and useful project. We’ve had several requests for Rust beyond your own, but no one yet has stepped forward and offered to join the team and port Rust. Offers, anyone?
- in reply to: multi-cpu memory model #3587
Those ops support the Mill’s optimistic concurrency model, essentially a bounded hardware transactional memory (HTM). Google for it 🙂 You can use them to implement pessimistic locking primitives like compare-and-swap (CAS), but you will then lose the gain available from using transactions directly.
I think the instructions you sorta remember were a description of the semantics of transactions as implemented in the hardware. The ones in the ISA in the Wiki provide the bounds of the transaction; they work in conjunction with loads and stores that are marked as transaction participants. The sequence is enter -> some loads and stores -> exit, and the changes between enter and exit happen all-or-none atomically.
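A shape-only sketch of that sequence, with invented names (txn_enter/txn_exit); the real op names, the participant marking, and the failure semantics are in the Wiki and differ from this C fiction.

```c
#include <stdbool.h>

struct node { struct node *next; };

extern bool txn_enter(void);  /* invented: begin a bounded transaction;
                                 false means it aborted and rolled back */
extern void txn_exit(void);   /* invented: commit all participating
                                 loads/stores atomically, all-or-none */

void push(struct node **head, struct node *n) {
    for (;;) {
        if (txn_enter()) {    /* loads/stores below are participants */
            n->next = *head;
            *head   = n;
            txn_exit();       /* both stores become visible together */
            return;
        }
        /* conflict: nothing happened; retry optimistically */
    }
}
```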
Exactly; the user app uses whatever interface its host language and/or libraries provide. Thus a blocking call might portal into the OS, which would start up the IO, attach to a condition variable, and then call the dispatcher. The thread (i.e. the user’s thread, which has OS calls on top of it) then sleeps until the CV gets notified, and then exits back to the app.
An async IO would portal into the async library, which would do exactly the same thing except omitting the call on the dispatcher; the app can poll the CV. A driver library could register a handler function and then visit the dispatcher, and the interrupt would get handled in the app. None of these have anything to do with the ISA, except for the use of portals instead of process switches.
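In portable pthreads terms the blocking shape looks like the sketch below (start_io is an invented stand-in for kicking off the device); on a Mill the OS entry would be a portal call rather than a process switch, but the CV pattern is the same.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool io_done = false;

extern void start_io(void);      /* invented: begin the device IO */

void blocking_read(void) {
    start_io();
    pthread_mutex_lock(&m);
    while (!io_done)             /* sleep until completion notifies */
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
}

void io_completion(void) {       /* runs when the IO finishes */
    pthread_mutex_lock(&m);
    io_done = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}
```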
- in reply to: Meltdown and Spectre #3564
Yes, it lets the code stream be organized as basic blocks. I suppose you can see it as a step toward what the Mill does, but I’m biased and see it more as a device for the special case of single-issue. YMMV 🙂
- in reply to: thread limits #3562
Thread id, turf id, and address spaces are all virtualized. If we have any other kernel-known spaces they will be virtualized too.
The main advantage of the Mill ISA is that there are no privileged operations; all protection is by address using the PLB. Consequently it is immune to the endless corner cases that special ops cause for other ISAs.
- in reply to: Meltdown and Spectre #3555
That’s an interesting paper; thank you for posting it.
We looked at splitting branches into a herald (their “bb” instruction) and a transfer, essentially as described in the paper, though without their attention to Spectre. After all, it is a natural companion to our split loads, which also count down an instruction-supplied cycle count. And we ultimately decided against it.
The authors make a reasonable case that they will find a big enough latency gap between resolution of the predicate (the branch condition) and the transfer to save a few cycles. But their study deals with a single-issue, in-order architecture, which a Mill most emphatically is not. Instead it is wide-issue, and a single instruction bundle can (and often does) contain several branch instructions. The multiple branches would muck up the authors’ scheme, though not irretrievably.
But the wide-issue mix of compute and control flow in the same bundle defeats their approach. In open code, the predicates are commonly computed by the same bundle that contains the branch; even legacy ISAs commonly have compare-and-branch ops that combine determining the predicate and taking the transfer. With no advance notice of the predicate, it’s not possible to gain any static cycles on the transfer. Of course, in loops (“closed code”) you can roll the predicate computation forward, potentially arbitrarily far at the cost of extending the loop prefix, but loop branches already predict extremely well.
What really decided us against a variable delay slot was the Mill’s replacement of branch prediction with exit prediction. Exit prediction lets the predictor run arbitrarily far ahead of execution, prefetching as it goes without needing to examine the fetched code. The real win with a predictor is not in avoiding mispredict rewinds (at least on a Mill, where a miss costs five cycles), which the authors’ scheme helps with; it’s in moving instruction lines up the memory hierarchy.
Of course, there will be blocks with a single exit (avoiding the multi-branch problem) whose predicate is known early (avoiding the missing-latency problem). The question is whether their frequency would justify the feature, and we decided that no, it would not.