- goldbugParticipantJanuary 4, 2018 at 8:03 amPost count: 45
We are all panicking over Meltdown and Spectre. Spectre in particular seems to work on Intel, AMD and ARM processors. The workaround appears to be very expensive for all operating systems.
Here is a very good and simple explanation of how Meltdown works. Basically they play games with speculative execution, out of order execution and caching to read arbitrary memory positions. Very interesting read, and very clever exploit.
I know the Mill is a completely different beast designed by extraterrestrials, but since the mill is big on speculative execution and caching, I was wondering if a similar attack is possible in the mill. The Mill is not out of order, but I could see phasing being exploited in the same way.
- goldbugParticipantJanuary 4, 2018 at 9:14 amPost count: 45
To describe the problem:
An instruction is speculated. The speculated instruction has one side effect: it fills a cache line (with valid data). The specific line that is filled depends on data that is not accessible. After the speculated instruction is discarded, the program can check which cache line got filled and use that to determine what the inaccessible value was.
The mill can speculate a lot of instructions in advance, and the speculation can fill cache lines no problem.
I believe the mill would not be vulnerable to this problem because reading the inaccessible data will simply return a NaR which will be passed to the subsequent speculative operations. The actual data never makes it to the cpu correct? on x86, the data makes it to the CPU before the page fault is generated, so a couple instructions can be speculated using that value.
- goldbugParticipantJanuary 4, 2018 at 10:21 amPost count: 45
- SymmetryParticipantJanuary 4, 2018 at 10:39 amPost count: 28
I think there would be NaRs involved on the Mill, at a guess. When a conventional processor speculates a load to an invalid address it has to suppress the fault until it knows whether the speculation was correct or else it has triggered a fault that it should not have only through its own failure rather than the codes. The Mill doesn’t have to suppress returning a NaR in this case because it would be invisible if the execution is later rolled back as false speculation. And since the speculating thread never has its hands on data from outside what it should be able to access there doesn’t seem to be any danger of data exfiltration. Well, I can imagine attacks along these lines that would let someone figure out which hidden addresses are in caches or main memory but not access the contents of the data at those addresses.
This is speculation, though, so maybe I should just wait for the authorities to answer. 🙂
- goldbugParticipantJanuary 4, 2018 at 10:43 amPost count: 45
Yes, that is what I figured as far as Meltdown goes.
But Spectre is a different beast, much more complicated, no NaR’s are involved at all.
“Well, I can imagine attacks along these lines that would let someone figure out which hidden addresses are in caches”
This is precisely the problem. With spectre, you can trick a program to load different areas of a cache depending on secret information. The attacker can then check which cache line was loaded and deduce the secret information based on that. The contents of the cache are not important and can remain inaccessible, what is important is which line was loaded.
- SymmetryParticipantJanuary 4, 2018 at 11:32 amPost count: 28
No, I’m talking about Spectre. The example given in the paper was
if (x < array1_size) y = array2[array1[x] * 256];
If array1[x] evaluates to NaR then you won’t get different parts of array2 in cache depending on what values were in memory where you ran off the end of array1. But of course there are lots of Spectre-like attacks that might be possible and who knows if the Mill is resistant to all of them.
- goldbugParticipantJanuary 4, 2018 at 12:02 pmPost count: 45
Still, it is certainly possible that array1[x] is withing the process memory space. My understanding is that the mill does not do array bounds check, so array1[-1] would simply read data from another part of memory which could be within the turf address space. If you pass the right value for x you can make it read any byte within the process address space. No NaR would be generated in this case.
If you pass the wrong value for x, that would point to a memory area that the turf does not have access to, then yeah, you would get a NaR, but an attacker would be careful to pass the right value.
- Will_EdwardsModeratorJanuary 4, 2018 at 10:49 amPost count: 98
Meltdown is not a fundamental bug in speculation, its a bug in various Intel implementations of it. AMD, for example, say they are not vulnerable even though they implement the same x86 ISA. I think its safe to say that Meltdown won’t work on future Intel CPUs – speculating page faults is game-over and the mitigation with KAISER so expensive that Intel will disable it.
Needless to say, Meltdown doesn’t work on the Mill either.
Spectre is a whole different ballgame. Its named because its going to haunt the industry for a long time 🙁
At first and second reading, Spectre as narrowly described doesn’t work on the Mill. After a third reading we’ll make a more detailed post explaining exactly why – please give us time.
- goldbugParticipantJanuary 9, 2018 at 7:32 amPost count: 45
I’m really curious on how the mill can defend against spectre.
As I understand it, there are really 2 ways to have speculative execution in the mill:
1) Branch prediction, somewhat similar to your typical OOO, only you predict exits.
2) Compiler scheduled. The compiler flattens 2 or more branches, executes them all simultaneously and then picks the result of the winning branch.
Both mechanisms can have side-channel effects, for example in the cache or the spiller. Especially the compiler scheduled speculation since it can be very large.
An attacking program can then check the status of the cache lines or whether one of his scratch pad areas have been spilled and deduce secret data from the victim across turfs. Other than checking array bounds (tough or impossible in C), how on earth can you defend against that?
Or maybe there is something I am not aware of?
- This reply was modified 2 years, 9 months ago by goldbug.
- Ivan GodardKeymasterJanuary 12, 2018 at 8:07 amPost count: 594
We will publish a white paper with details on Monday 1/15/2018.
- goldbugParticipantJanuary 16, 2018 at 8:02 amPost count: 45
Thank you for taking the time to write that paper. It is very enlightening.
I saw that you did have a Spectre-like bug and you fixed it by using loadtr prefixed by the guard.
Is that a new operation? I don’t recall seeing a loadtr in the wiki before?
By the way, awesome job, such a simple and elegant solution.
- Ivan GodardKeymasterJanuary 22, 2018 at 2:16 pmPost count: 594
Actually there are problems with the predicated forms of load: with them you can hoist over a branch, as shown in the paper, but you can’t hoist over the computation of the predicate. Spectre has prompted much internal discussion about an alternative that doesn’t require the predicate until retire (easy) and is still safe from Spectre-like attacks (real hard). NYF for now.
- Joe TaberParticipantJanuary 4, 2018 at 11:12 amPost count: 24
The scary thing about this is that it’s an attack on how speculation works in the context of caches and prediction in very general terms.
Mill has a lot of things going for it: like returning NaR in loads without permission, not being OOO, and turning traditionally branch-heavy code into linear code with a pick at the end (basically auto-converting tons of code into constant-time algorithms for free… which might turn out to be an important *security* feature now that I think about it).
So it looks like some of these attacks are straightforwardly not possible on the mill. But other attacks seem so general… Spectre is essentially a whole new category of timing attacks. I’d like to hear the mill team’s initial impressions.
- FindecanorParticipantJanuary 7, 2018 at 5:32 amPost count: 16
Memory timing is used by the Meltdown and Spectre attacks only as a side-channel to leak the loaded values but there are other types of attacks that use cache timing (or even DRAM timing!) for side-channels or for snooping on other processes.
The x86 is especially susceptible to cache timing-based attacks because instructions for flushing the cache and fine-grained timing (Flush+Reload) are available for programs in user-mode. On ARM for instance, the similar instructions are privileged thus making it not impossible but at least harder for an attacker to perform timing-based attacks with high accuracy.
Therefore, that the Mill architecture does not have any privileged mode does worry me a bit.
One way to make instructions privileged in an operating system (on both the Mill and x86) could be to enclose a manifest in the executable in which the compiler has described which (groups of) instructions that the executable is using. Then permit a program to run only those instructions that the parent process/runtime has explicitly granted it.
That would require protection of the binary, however: either through strong guarantees in the OS or through a cryptographic signature.
I have been trying to find prior similar work but not found anything – where binaries have been signed by a toolchain instead of just the person/organisation that published it. If that reminds you of something, please let me know because I would like to code something up. (and not just for binaries for The Mill, but a generic framework)
- sydbarrett74ParticipantJanuary 11, 2018 at 7:58 pmPost count: 3
That would require protection of the binary, however: either through strong guarantees in the OS or through a cryptographic signature.
I have been trying to find prior similar work but not found anything – where binaries have been signed by a toolchain instead of just the person/organisation that published it.
Does this link to a discussion on Stack Overflow help any?
- Ivan GodardKeymasterJanuary 12, 2018 at 8:04 amPost count: 594
The Burroughs main frames (the A series – the first compiler I ever write was for the B6500) used this approach – the OS would only load code produced and signed by a known compiler. In the controlled environment typical of mainframes this worked, but in the free-for-all of PCs it would be too restrictive.
- FindecanorParticipantJanuary 13, 2018 at 6:23 amPost count: 16
Please correct me if I’m wrong but I was under the impression that Mill software was supposed to be distributed as GenAsm and compiled to machine code at install time – by a local compiler which would have to be trusted implicitly anyway. That would be an opportunity for code-checking to create a list of which groups of instructions that are used in an object file.
There are centralised compiler systems out there already. Apple encourages iOS developers to submit apps in LLVM bitcode for ARM which Apple optimises differently for each of their CPUs and then those Mach-O binaries are signed. A Linux distribution could accept packages in source form and build and sign them on trusted build servers, and it would not change much because all binary packages are from that source anyway.
Anyway, obviously MMIO would be a better design choice for protecting CPU features than some weird software. I think however that some kind of enclosed compiler log could perhaps become useful as a complement if some bug or vulnerability would need to be mitigated (knock on wood…) or if there is a change in the ABI. For one thing, it could make it easier to find which and what would need to be recompiled.
I also had a wider scope in mind (I started before Meltdown was public): Not just low-level issues but also for safe linking of object files compiled from “type-safe” languages where the rules of the language are relied upon to enforce security policy.
- Ivan GodardKeymasterJanuary 15, 2018 at 3:18 amPost count: 594
You are right: we expect distros to be in genAsm and to be specialized to the actual target at install time. The chip will ship with a specializer in ROM with the BIOS. Nothing in the hardware stops the user from writing his own specializer, for whatever benefit or bugs that brings. For that matter, nothing in the hardware stops the user from writing his own genAsm. Subjects such as safe languages are matters of policy, above the architecture, and must be enforced by software. The Mill with its clean protection model offers an excellent platform for such things, but we as a hardware company do not expect to provide them ourselves.
- NXTanglParticipantFebruary 19, 2018 at 11:36 amPost count: 13
On at least some architectures, you could use the spiller with the prefetching service to mitigate (not eliminate, but reduce the reliability of) cache snooping techniques. Since fast PLB lookups would drop
NaRs on the belt before the cache gets touched at all, the most reliable way to snoop would be to fill the cache, portal out (to the task switcher or the victim code), and then time loads when the portal call returns.
However, there’s nothing stopping you from spilling the cache line base addresses across calls, so you can reload the old cache data in advance of a predicted return. Of course, that reduces other processes’ ability to snoop.
Not sure how viable this is; regardless, it’s offered as an open-source technique, free of charge.
I came up with this while thinking about how to potentially fix Spectre on OOO machines; I was thinking about a scheme to eliminate the side-channel by tagging every cache line with the hardware turf that originated it, so that IPC could avoid a full cache invalidation, but then I realized that the attacker could still snoop to see which of their own lines were evicted so we’d need a full cache invalidation to hide it anyway.
- Ivan GodardKeymasterFebruary 19, 2018 at 5:10 pmPost count: 594
Yes, one could reload a cache image at portal exit, or even simpler just evict everything. The spectre attack depends on getting the victim to speculatively touch a line that it would not have touched in ordinary execution. It’s not clear that it’s very useful for an attacker to know which cache ways a victim touched during the normal execution of the portal.
- NXTanglParticipantFebruary 26, 2018 at 12:37 pmPost count: 13
Well, I guess it’s mostly proof against optimizing compilers that do if-conversion on bounds check code when generating genASM. Since most bounds checks are going to pass, this is a valid optimization IFF the index doesn’t come from untrusted input. Of course, I suppose you can’t really stupid-proof an architecture against a compiler.
- paulmenageParticipantFebruary 26, 2018 at 1:33 pmPost count: 1
Perhaps C++20 (or whenever the next standard is) will include an __attribute__((untrusted)) to indicate that a variable (or some/all of a function’s parameters) may contain values that need additional scrutiny and less optimization.
- GhadiParticipantAugust 3, 2020 at 10:47 amPost count: 4
- Ivan GodardKeymasterAugust 3, 2020 at 3:37 pmPost count: 594
That’s an interesting paper; thank you for posting it.
We looked at splitting branches into a herald (their “bb” instruction) and a transfer, essentially as described in the paper, though without their attention to Spectre. After all, it is a natural companion to our split loads, which also count out an instruction-supplied cycle count. And we ultimately decided against.
The authors make a reasonable case that they will find a big enough latency gap between resolution of the predicate (the branch condition) and the transfer to save a few cycles. But their study deals with a single-issue, in-order architecture, which a Mill most emphatically is not. Instead it is a wide-issue, and a single instruction bundle can (and often does) have several branch instructions. The multiple branches would muck up the author’s scheme, but not irretrievably.
But the wide-issue mix of compute and control flow in the same bundle obviates their approach. In open code, the predicates are commonly computed by the same bundle that contains the branch. Even legacy ISAs commonly have “compare-and-branch” ops that combine determining the predicate and taking the transfer. With no advance notice of the predicate it’s not possible to gain any static cycles on the transfer. Of course, in loops (“closed code”) you can roll forward the predicate, potentially arbitrarily far at the cost of extending the loop prefix, but loop branches predict extremely well.
What really decided us against a variable delay slot was the Mill replacement of branch prediction with exit prediction. Exit prediction lets the predictor run arbitrarily far ahead of execution, prefetching as it goes but without needing to examine the fetched code. The real win with a predictor is not in avoiding miss rewinds (at least on a Mill where a miss is five cycles) which the authors scheme helps with, it’s in moving instruction lines up the memory hierarchy.
Of course, there will be blocks with a single exit (avoiding the multi-branch problem) whose predicate is known early (avoiding the lacking latency problem). The question is whether their frequency would justify the feature, and we decided that no, they would not.
- VeedracParticipantAugust 5, 2020 at 6:19 amPost count: 24
The real win with a predictor is not in avoiding miss rewinds (at least on a Mill where a miss is five cycles) which the authors scheme helps with, it’s in moving instruction lines up the memory hierarchy.
In a top-end OoO it’s perhaps more true to say the real win is filling the reorder buffer. Your predictor doesn’t have to be that good to deal with fetching code in time, given caches are getting so large and looping is so common, but if you want to fill a 560ish instruction ROB, you need to be predicting 560 instructions ahead, and predictor quality and performance is intensely important for that. But yeah, that’s not so relevant for a Mill.
- Thomas DParticipantAugust 17, 2020 at 8:43 pmPost count: 18
I am a little confused by your answer. While you see the
bbinstruction as a herald instruction, hinting to the processor that a branch is coming, I see it more as a hint to the processor that the next block of instructions is a basic block. So, the instruction’s intent, to me, is to recreate a fundamental aspect of the Mill’s operation: that it computes on basic blocks (however, the Mill doesn’t do so for the same reason: it requires the compiler to optimize the data flow a priori, and uses the basic blocks for exit prediction and split-stream encoding). Or, am I missing something basic from my skim over the paper?
You must be logged in to reply to this topic.