Meltdown and Spectre

Author
Posts
goldbug
Participant
January 4, 2018 at 8:03 am
Post count: 53
#3145 |
We are all panicking over Meltdown and Spectre. Spectre in particular seems to work on Intel, AMD and ARM processors. The workaround appears to be very expensive for all operating systems.
Here is a very good and simple explanation of how Meltdown works. Basically they play games with speculative execution, out of order execution and caching to read arbitrary memory positions. Very interesting read, and very clever exploit.
I know the Mill is a completely different beast designed by extraterrestrials, but since the mill is big on speculative execution and caching, I was wondering if a similar attack is possible in the mill. The Mill is not out of order, but I could see phasing being exploited in the same way.
- This topic was modified 6 years, 9 months ago by goldbug.
- This topic was modified 6 years, 9 months ago by goldbug.
goldbug
Participant
January 4, 2018 at 9:14 am
Post count: 53
#3148
To describe the problem:
An instruction is speculated. The speculated instruction has one side effect: it fills a cache line (with valid data). The specific line that is filled depends on data that is not accessible. After the speculated instruction is discarded, the program can check which cache line got filled and use that to determine what the inaccessible value was.
The mill can speculate a lot of instructions in advance, and the speculation can fill cache lines no problem.
I believe the mill would not be vulnerable to this problem because reading the inaccessible data will simply return a NaR which will be passed to the subsequent speculative operations. The actual data never makes it to the cpu correct? on x86, the data makes it to the CPU before the page fault is generated, so a couple instructions can be speculated using that value.
goldbug
Participant
January 4, 2018 at 10:21 am
Post count: 53
#3149
Spectre seems a bit more complicated, as no NaR’s are involved, would it work on the mill?
https://spectreattack.com/spectre.pdf
- Symmetry
  Participant
  January 4, 2018 at 10:39 am
  Post count: 28
  #3150
  I think there would be NaRs involved on the Mill, at a guess. When a conventional processor speculates a load to an invalid address it has to suppress the fault until it knows whether the speculation was correct or else it has triggered a fault that it should not have only through its own failure rather than the codes. The Mill doesn’t have to suppress returning a NaR in this case because it would be invisible if the execution is later rolled back as false speculation. And since the speculating thread never has its hands on data from outside what it should be able to access there doesn’t seem to be any danger of data exfiltration. Well, I can imagine attacks along these lines that would let someone figure out which hidden addresses are in caches or main memory but not access the contents of the data at those addresses.
  This is speculation, though, so maybe I should just wait for the authorities to answer. 🙂
  - goldbug
    Participant
    January 4, 2018 at 10:43 am
    Post count: 53
    #3151
    Yes, that is what I figured as far as Meltdown goes.
    But Spectre is a different beast, much more complicated, no NaR’s are involved at all.
    “Well, I can imagine attacks along these lines that would let someone figure out which hidden addresses are in caches”
    This is precisely the problem. With spectre, you can trick a program to load different areas of a cache depending on secret information. The attacker can then check which cache line was loaded and deduce the secret information based on that. The contents of the cache are not important and can remain inaccessible, what is important is which line was loaded.
    This reply was modified 6 years, 9 months ago by goldbug.
    This reply was modified 6 years, 9 months ago by goldbug.
    - Symmetry
      Participant
      January 4, 2018 at 11:32 am
      Post count: 28
      #3158
      No, I’m talking about Spectre. The example given in the paper was
      if (x < array1_size) y = array2[array1[x] * 256];
      If array1[x] evaluates to NaR then you won’t get different parts of array2 in cache depending on what values were in memory where you ran off the end of array1. But of course there are lots of Spectre-like attacks that might be possible and who knows if the Mill is resistant to all of them.
      - goldbug
        Participant
        January 4, 2018 at 12:02 pm
        Post count: 53
        #3159
        I see.
        Still, it is certainly possible that array1[x] is withing the process memory space. My understanding is that the mill does not do array bounds check, so array1[-1] would simply read data from another part of memory which could be within the turf address space. If you pass the right value for x you can make it read any byte within the process address space. No NaR would be generated in this case.
        If you pass the wrong value for x, that would point to a memory area that the turf does not have access to, then yeah, you would get a NaR, but an attacker would be careful to pass the right value.
Will_Edwards
Moderator
January 4, 2018 at 10:49 am
Post count: 98
#3153
Meltdown is not a fundamental bug in speculation, its a bug in various Intel implementations of it. AMD, for example, say they are not vulnerable even though they implement the same x86 ISA. I think its safe to say that Meltdown won’t work on future Intel CPUs – speculating page faults is game-over and the mitigation with KAISER so expensive that Intel will disable it.
Needless to say, Meltdown doesn’t work on the Mill either.
Spectre is a whole different ballgame. Its named because its going to haunt the industry for a long time 🙁
At first and second reading, Spectre as narrowly described doesn’t work on the Mill. After a third reading we’ll make a more detailed post explaining exactly why – please give us time.
- goldbug
  Participant
  January 9, 2018 at 7:32 am
  Post count: 53
  #3164
  I’m really curious on how the mill can defend against spectre.
  As I understand it, there are really 2 ways to have speculative execution in the mill:
  1) Branch prediction, somewhat similar to your typical OOO, only you predict exits.
  2) Compiler scheduled. The compiler flattens 2 or more branches, executes them all simultaneously and then picks the result of the winning branch.
  Both mechanisms can have side-channel effects, for example in the cache or the spiller. Especially the compiler scheduled speculation since it can be very large.
  An attacking program can then check the status of the cache lines or whether one of his scratch pad areas have been spilled and deduce secret data from the victim across turfs. Other than checking array bounds (tough or impossible in C), how on earth can you defend against that?
  Or maybe there is something I am not aware of?
  This reply was modified 6 years, 9 months ago by goldbug.
  - Ivan Godard
    Keymaster
    January 12, 2018 at 8:07 am
    Post count: 689
    #3168
    We will publish a white paper with details on Monday 1/15/2018.
    - goldbug
      Participant
      January 16, 2018 at 8:02 am
      Post count: 53
      #3183
      Thank you for taking the time to write that paper. It is very enlightening.
      I saw that you did have a Spectre-like bug and you fixed it by using loadtr prefixed by the guard.
      Is that a new operation? I don’t recall seeing a loadtr in the wiki before?
      By the way, awesome job, such a simple and elegant solution.
      - Ivan Godard
        Keymaster
        January 22, 2018 at 2:16 pm
        Post count: 689
        #3188
        Actually there are problems with the predicated forms of load: with them you can hoist over a branch, as shown in the paper, but you can’t hoist over the computation of the predicate. Spectre has prompted much internal discussion about an alternative that doesn’t require the predicate until retire (easy) and is still safe from Spectre-like attacks (real hard). NYF for now.
Joe Taber
Participant
January 4, 2018 at 11:12 am
Post count: 25
#3155
The scary thing about this is that it’s an attack on how speculation works in the context of caches and prediction in very general terms.
Mill has a lot of things going for it: like returning NaR in loads without permission, not being OOO, and turning traditionally branch-heavy code into linear code with a pick at the end (basically auto-converting tons of code into constant-time algorithms for free… which might turn out to be an important *security* feature now that I think about it).
So it looks like some of these attacks are straightforwardly not possible on the mill. But other attacks seem so general… Spectre is essentially a whole new category of timing attacks. I’d like to hear the mill team’s initial impressions.
- Ivan Godard
  Keymaster
  January 22, 2018 at 2:21 pm
  Post count: 689
  #3189
  Our impression is that the whole thing makes our heads hurt.
Findecanor
Participant
January 7, 2018 at 5:32 am
Post count: 34
#3162
Memory timing is used by the Meltdown and Spectre attacks only as a side-channel to leak the loaded values but there are other types of attacks that use cache timing (or even DRAM timing!) for side-channels or for snooping on other processes.
The x86 is especially susceptible to cache timing-based attacks because instructions for flushing the cache and fine-grained timing (Flush+Reload) are available for programs in user-mode. On ARM for instance, the similar instructions are privileged thus making it not impossible but at least harder for an attacker to perform timing-based attacks with high accuracy.
Therefore, that the Mill architecture does not have any privileged mode does worry me a bit.
One way to make instructions privileged in an operating system (on both the Mill and x86) could be to enclose a manifest in the executable in which the compiler has described which (groups of) instructions that the executable is using. Then permit a program to run only those instructions that the parent process/runtime has explicitly granted it.
That would require protection of the binary, however: either through strong guarantees in the OS or through a cryptographic signature.
I have been trying to find prior similar work but not found anything – where binaries have been signed by a toolchain instead of just the person/organisation that published it. If that reminds you of something, please let me know because I would like to code something up. (and not just for binaries for The Mill, but a generic framework)
- Wolf480pl
  Participant
  January 7, 2018 at 6:53 am
  Post count: 1
  #3163
  Well, you could make it so that cache can only be controlled by an MMIO register, and not by CPU instructions. That’d allow easily deciding which turfs are allowed to control cache and which aren’t.
- sydbarrett74
  Participant
  January 11, 2018 at 7:58 pm
  Post count: 3
  #3166
  That would require protection of the binary, however: either through strong guarantees in the OS or through a cryptographic signature.
  I have been trying to find prior similar work but not found anything – where binaries have been signed by a toolchain instead of just the person/organisation that published it.
  Does this link to a discussion on Stack Overflow help any?
  - Findecanor
    Participant
    January 12, 2018 at 1:36 pm
    Post count: 34
    #3169
    Thank you. Not quite what I was looking for but there were a couple of leads to follow and now I have downloaded a few more megabyte of papers to read.
    This reply was modified 6 years, 9 months ago by Findecanor.
- Ivan Godard
  Keymaster
  January 12, 2018 at 8:04 am
  Post count: 689
  #3167
  The Burroughs main frames (the A series – the first compiler I ever write was for the B6500) used this approach – the OS would only load code produced and signed by a known compiler. In the controlled environment typical of mainframes this worked, but in the free-for-all of PCs it would be too restrictive.
  - Findecanor
    Participant
    January 13, 2018 at 6:23 am
    Post count: 34
    #3171
    Please correct me if I’m wrong but I was under the impression that Mill software was supposed to be distributed as GenAsm and compiled to machine code at install time – by a local compiler which would have to be trusted implicitly anyway. That would be an opportunity for code-checking to create a list of which groups of instructions that are used in an object file.
    There are centralised compiler systems out there already. Apple encourages iOS developers to submit apps in LLVM bitcode for ARM which Apple optimises differently for each of their CPUs and then those Mach-O binaries are signed. A Linux distribution could accept packages in source form and build and sign them on trusted build servers, and it would not change much because all binary packages are from that source anyway.
    Anyway, obviously MMIO would be a better design choice for protecting CPU features than some weird software. I think however that some kind of enclosed compiler log could perhaps become useful as a complement if some bug or vulnerability would need to be mitigated (knock on wood…) or if there is a change in the ABI. For one thing, it could make it easier to find which and what would need to be recompiled.
    I also had a wider scope in mind (I started before Meltdown was public): Not just low-level issues but also for safe linking of object files compiled from “type-safe” languages where the rules of the language are relied upon to enforce security policy.
    - Ivan Godard
      Keymaster
      January 15, 2018 at 3:18 am
      Post count: 689
      #3172
      You are right: we expect distros to be in genAsm and to be specialized to the actual target at install time. The chip will ship with a specializer in ROM with the BIOS. Nothing in the hardware stops the user from writing his own specializer, for whatever benefit or bugs that brings. For that matter, nothing in the hardware stops the user from writing his own genAsm. Subjects such as safe languages are matters of policy, above the architecture, and must be enforced by software. The Mill with its clean protection model offers an excellent platform for such things, but we as a hardware company do not expect to provide them ourselves.
NXTangl
Participant
February 19, 2018 at 11:36 am
Post count: 21
#3219
On at least some architectures, you could use the spiller with the prefetching service to mitigate (not eliminate, but reduce the reliability of) cache snooping techniques. Since fast PLB lookups would drop NaRs on the belt before the cache gets touched at all, the most reliable way to snoop would be to fill the cache, portal out (to the task switcher or the victim code), and then time loads when the portal call returns.
However, there’s nothing stopping you from spilling the cache line base addresses across calls, so you can reload the old cache data in advance of a predicted return. Of course, that reduces other processes’ ability to snoop.
Not sure how viable this is; regardless, it’s offered as an open-source technique, free of charge.
I came up with this while thinking about how to potentially fix Spectre on OOO machines; I was thinking about a scheme to eliminate the side-channel by tagging every cache line with the hardware turf that originated it, so that IPC could avoid a full cache invalidation, but then I realized that the attacker could still snoop to see which of their own lines were evicted so we’d need a full cache invalidation to hide it anyway.
- Ivan Godard
  Keymaster
  February 19, 2018 at 5:10 pm
  Post count: 689
  #3220
  Yes, one could reload a cache image at portal exit, or even simpler just evict everything. The spectre attack depends on getting the victim to speculatively touch a line that it would not have touched in ordinary execution. It’s not clear that it’s very useful for an attacker to know which cache ways a victim touched during the normal execution of the portal.
  - NXTangl
    Participant
    February 26, 2018 at 12:37 pm
    Post count: 21
    #3239
    Well, I guess it’s mostly proof against optimizing compilers that do if-conversion on bounds check code when generating genASM. Since most bounds checks are going to pass, this is a valid optimization IFF the index doesn’t come from untrusted input. Of course, I suppose you can’t really stupid-proof an architecture against a compiler.
    - paulmenage
      Participant
      February 26, 2018 at 1:33 pm
      Post count: 1
      #3246
      Perhaps C++20 (or whenever the next standard is) will include an __attribute__((untrusted)) to indicate that a variable (or some/all of a function’s parameters) may contain values that need additional scrutiny and less optimization.
Ghadi
Participant
August 3, 2020 at 10:47 am
Post count: 4
#3554
A new paper from DJB and others, thought it may be interesting to this forum.
BasicBlocker: Redesigning ISAs to Eliminate Speculative-Execution Attacks
- Ivan Godard
  Keymaster
  August 3, 2020 at 3:37 pm
  Post count: 689
  #3555
  That’s an interesting paper; thank you for posting it.
  We looked at splitting branches into a herald (their “bb” instruction) and a transfer, essentially as described in the paper, though without their attention to Spectre. After all, it is a natural companion to our split loads, which also count out an instruction-supplied cycle count. And we ultimately decided against.
  The authors make a reasonable case that they will find a big enough latency gap between resolution of the predicate (the branch condition) and the transfer to save a few cycles. But their study deals with a single-issue, in-order architecture, which a Mill most emphatically is not. Instead it is a wide-issue, and a single instruction bundle can (and often does) have several branch instructions. The multiple branches would muck up the author’s scheme, but not irretrievably.
  But the wide-issue mix of compute and control flow in the same bundle obviates their approach. In open code, the predicates are commonly computed by the same bundle that contains the branch. Even legacy ISAs commonly have “compare-and-branch” ops that combine determining the predicate and taking the transfer. With no advance notice of the predicate it’s not possible to gain any static cycles on the transfer. Of course, in loops (“closed code”) you can roll forward the predicate, potentially arbitrarily far at the cost of extending the loop prefix, but loop branches predict extremely well.
  What really decided us against a variable delay slot was the Mill replacement of branch prediction with exit prediction. Exit prediction lets the predictor run arbitrarily far ahead of execution, prefetching as it goes but without needing to examine the fetched code. The real win with a predictor is not in avoiding miss rewinds (at least on a Mill where a miss is five cycles) which the authors scheme helps with, it’s in moving instruction lines up the memory hierarchy.
  Of course, there will be blocks with a single exit (avoiding the multi-branch problem) whose predicate is known early (avoiding the lacking latency problem). The question is whether their frequency would justify the feature, and we decided that no, they would not.
  - Veedrac
    Participant
    August 5, 2020 at 6:19 am
    Post count: 25
    #3556
    The real win with a predictor is not in avoiding miss rewinds (at least on a Mill where a miss is five cycles) which the authors scheme helps with, it’s in moving instruction lines up the memory hierarchy.
    In a top-end OoO it’s perhaps more true to say the real win is filling the reorder buffer. Your predictor doesn’t have to be that good to deal with fetching code in time, given caches are getting so large and looping is so common, but if you want to fill a 560ish instruction ROB, you need to be predicting 560 instructions ahead, and predictor quality and performance is intensely important for that. But yeah, that’s not so relevant for a Mill.
  - Thomas D
    Participant
    August 17, 2020 at 8:43 pm
    Post count: 24
    #3563
    I am a little confused by your answer. While you see the bb instruction as a herald instruction, hinting to the processor that a branch is coming, I see it more as a hint to the processor that the next block of instructions is a basic block. So, the instruction’s intent, to me, is to recreate a fundamental aspect of the Mill’s operation: that it computes on basic blocks (however, the Mill doesn’t do so for the same reason: it requires the compiler to optimize the data flow a priori, and uses the basic blocks for exit prediction and split-stream encoding). Or, am I missing something basic from my skim over the paper?
    - Ivan Godard
      Keymaster
      August 18, 2020 at 12:08 am
      Post count: 689
      #3564
      Yes, it lets the code stream be organized as basic blocks. I suppose you can see it as a step toward what the Mill does but I’m biased and see it more as a device for the special-case of single-issue. YMMV 🙂
Author
Posts