Mill vs. Spectre: Performance and Security

  • staff
    Keymaster
    Post count: 49

    Talk by Ivan Godard – September 27, 2018, at The Strange Loop

    Slides: 2018-09-27.Spectre (.pptx)

    The Meltdown and Spectre attacks, disclosed last year, have upended the industry. With them an attacker can read any location in memory and extract the secret content at high rates. The attacks are unique because they gain access, not by exploiting some bug in application or kernel code, but through a fundamental architecture design flaw in most modern commercial CPUs. Working around the flaw reliably can cost a third or more of program performance.

    The keyword above is “most”. General purpose CPUs today commonly use Out of Order (OOO) scheduling and speculative execution to obtain higher performance. Unfortunately, Spectre and Meltdown have revealed that the increase in speed provided by OOO comes with an inherent cost: total loss of security. However, not all CPUs use the OOO architecture. Many low-end architectures that are more concerned with power usage than speed use an older approach, In-Order (IO), and eschew speculation. Such chips are inherently immune to Meltdown/Spectre. In fact, the microcode workarounds applied to OOO machines to prevent these attacks in effect convert them into IO machines – that run at In-Order speed while using OOO power to do it.

    There is an exception to this gloomy news. The Mill architecture was designed from the beginning to provide OOO performance on an IO power budget. It does no hardware speculation and so, serendipitously, is immune to Meltdown and Spectre. That’s the easy part – a Z80 does no hardware speculation and is immune too. The hard part is getting the performance of speculation without opening security holes. The talk will explain the security problem, show why the Mill is immune, and will lightly address why Mill performance does not need OOO.
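
    For readers who have not seen it in code, the canonical Spectre variant 1 gadget is sketched below in C; the array names and the 4096-byte stride are the conventional illustration, not anything Mill-specific.

    #include <stddef.h>
    #include <stdint.h>

    /* Classic Spectre v1 bounds-check bypass (illustrative sketch).
     * An OOO machine may predict the branch taken and speculatively
     * execute both loads with an attacker-chosen x, even when
     * x >= array1_size. The second load leaves a cache footprint
     * indexed by the secret byte, which the attacker recovers later
     * by timing probes of array2. */
    extern uint8_t array1[];
    extern size_t  array1_size;
    extern uint8_t array2[256 * 4096];

    void victim(size_t x) {
        if (x < array1_size) {                 /* bounds check          */
            uint8_t secret = array1[x];        /* speculative OOB read  */
            uint8_t t = array2[secret * 4096]; /* cache encodes secret  */
            (void)t;
        }
    }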

  • PeterH
    Participant
    Post count: 41

    Looking at this:
    – traditional architectures respond to an invalid memory access (and some other invalid operations) by faulting.
    – with OOO speculative execution, the fault can’t be issued until the speculation is resolved and we know whether the code needed to run.
    – So speculatively executed code has security deferred and applied retroactively. Which would not be a problem except for side effects, like the cache.

    The Mill, in contrast, uses NaRs to defer faulting as far as possible. We expect to be explicitly executing code with bad input, and then tossing the result based on a comparison performed in parallel. The invalid access is marked BEFORE the result can be used for any operation which might produce side effects. So the exfiltration step gets not a speculative value, but a very real NaR.
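
    To make that concrete, here is a rough C model of the NaR idea. This is purely conceptual, a sketch of the metadata-bit semantics rather than anything resembling the Mill's actual implementation:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Conceptual model: every value carries a NaR flag next to its
     * payload. Speculable operations propagate the flag instead of
     * faulting, so a bad load has no visible side effects. */
    typedef struct { bool nar; uint64_t val; } BeltValue;

    BeltValue spec_load(const uint8_t *base, size_t len, size_t idx) {
        if (idx >= len)                        /* checked at the load */
            return (BeltValue){ .nar = true }; /* no fault, no fill   */
        return (BeltValue){ false, base[idx] };
    }

    BeltValue spec_shift(BeltValue v, unsigned by) {
        if (v.nar) return v;                   /* NaR poisons results */
        return (BeltValue){ false, v.val << by };
    }

    /* Only a non-speculable consumer (a store, a branch, retiring a
     * result) faults when handed a NaR. */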

    I’m thinking that a mostly traditional architecture could avoid Spectre if it also used NaR tagging on invalid memory reads.

    Going back to the example, what if A1[x] was a legal read for the program, but outside the range of the array A1? No fault or NAR would be generated. So the execution would have to be ordered to perform the range check before the lookup in A2 was performed. An easy enough check for the Mill specializer, not such an easy patch for existing OOO hardware.

    • Ivan Godard
      Keymaster
      Post count: 689

      Protection granularity will bite you in the butt. There is no one size – er, granularity – fits all. The way we organize programs these days seems to lead to three broad granule ranges of interest in different contexts.

      There are atomic granules, in which the item is indivisibly unitary; you deal with it as a whole, and it’s not meaningful to treat it as a composite with substructure. An int or char, say.

      Then there’s collection granules, in which the item is composite and you expect to access substructure and move around the inside, but it has a hull that you are not supposed to wander beyond. Think array, struct, class.

      Then there’s arena granules, random assemblages of other items, which are united in that a contained item may explicitly refer to another contained item. Think processes and protection spaces.

      These different granularities need different implementations, different controls, and different programming practices and language constructs. Which may need different cost models and different points at which you say “don’t do that; if you do you deserve it” and leave it unprotected.

      At the atomic granularity most machines, including Mill, encourage but do not enforce granule protection. There’s nothing to prevent you from taking an int apart into pieces; it’s just annoying to do.

      At the arena granule level the Mill has turfs, which have been covered in the talks. They give better granule and access control than the processes offered by other machines.

      The middle size, collection granularity, is poorly supported except by segmented architectures and capability machines. The problem is that putting every array or malloc result in a separate process or turf is way overkill and will hose machine throughput, because the turnover rate is too much to track. And you can’t use an atomic granularity mechanism because these things have substructure. The industry has followed the lowest common approach, and there’s no collection-granularity protection at all in mass market machines.

      Surprise! The Mill offers collection-level granularity. A granule can be anything from a byte to all of memory. It works with existing code and ordinary languages without reprogramming. You can navigate around the substructure safely, and it keeps you from navigating outside. It supports the C/C++ one-past-the-end rule. To support it we lose a couple of bits in the physical address space, and require that the granule be a power-of-two size; if you navigate into padding within the granule, well, you deserve it.
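
      To illustrate why the power-of-two requirement makes the check cheap, here is a sketch of the address arithmetic. The bit layout is my own invention for illustration; this is not a description of the Mill's actual pointer format:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical layout: the top 6 bits of a pointer hold
     * log2(granule size); the rest is the address. With aligned
     * power-of-two granules, "is this derived pointer still inside
     * the hull?" is a mask and a compare. */
    #define LG_SHIFT  58
    #define ADDR_MASK ((UINT64_C(1) << LG_SHIFT) - 1)

    static bool in_granule(uint64_t ptr, uint64_t derived) {
        uint64_t lg    = ptr >> LG_SHIFT;            /* log2(size)    */
        uint64_t gmask = ~((UINT64_C(1) << lg) - 1); /* clear offsets */
        uint64_t a = ptr & ADDR_MASK;
        uint64_t d = derived & ADDR_MASK;
        return (a & gmask) == (d & gmask);           /* same granule? */
    }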

      We are very unsure how popular this Mill feature will be. Nearly all existing codes violate the language access boundary rules. Run times and OSs too. Think about memmove(…) – would you be surprised to know that it uses C undefined behavior when it decides whether to copy up or down? So the standard Mill runtime will offer a –UNSAFE-MALLOC option that lets you treat the whole process as a single collection-level granule. And you will deserve it.
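
      The memmove point in concrete form: a typical implementation picks the copy direction by comparing the two pointers, and ISO C defines that comparison only when both pointers point into the same object, which is precisely not the case memmove exists for. A sketch of the ubiquitous pattern:

    #include <stddef.h>

    /* Typical memmove shape. The d < s comparison is undefined
     * behavior in ISO C when dst and src point into different
     * objects, yet every real implementation relies on it to pick
     * the copy direction. */
    void *my_memmove(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (d < s) {
            while (n--) *d++ = *s++;   /* copy forward  */
        } else {
            d += n; s += n;
            while (n--) *--d = *--s;   /* copy backward */
        }
        return dst;
    }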

  • Ghadi
    Participant
    Post count: 4

    More Spectre-like exploits — this one seems particularly serious:

    https://eprint.iacr.org/2018/1060

    Simultaneous Multithreading (SMT) architectures are attractive targets for side-channel enabled attackers, with their inherently broader attack surface that exposes more per physical core microarchitecture components than cross-core attacks. In this work, we explore SMT execution engine sharing as a side-channel leakage source. We target ports to stacks of execution units to create a high-resolution timing side-channel due to port contention, inherently stealthy since it does not depend on the memory subsystem like other cache or TLB based attacks. Implementing said channel on Intel Skylake and Kaby Lake architectures featuring Hyper-Threading, we mount an end-to-end attack that recovers a P-384 private key from an OpenSSL-powered TLS server using a small number of repeated TLS handshake attempts. Furthermore, we show that traces targeting shared libraries, static builds, and SGX enclaves are essentially identical, hence our channel has wide target application.

    How does this intersect with the Mill?

    • Ivan Godard
      Keymaster
      Post count: 689

      We don’t do SMT. Our numbers suggest that available chip area and power resources are better put into more cores instead of into SMT sharing of a single core. It may seem wasteful to have a core idling in stall when a load misses in cache, but for this architecture the overhead of trying to use the core for something else is greater than any saving from sharing. And, for free, having no SMT means a large number of security attacks are impossible on a Mill.

      • Dave
        Participant
        Post count: 10

        Speaking of which, does Mill Computing have an idea yet about when the time will be right to talk about multi-core Mill CPUs? My recollection is that the matter was touched on, in the threading talk I think, but not in any great detail.

    • Witold Baryluk
      Participant
      Post count: 33

      Hi, Ghadi. That particular exploit was actually exploiting a bug in OpenSSL: an actual bug (a non-constant-time operation in crypto code) that got patched. SMT made it easier to exploit, but it could likely have been exploited even without SMT.

      As for SMT: SMT does make a lot of things easier, indeed. It is very unlikely that any kind of Mill will ever be SMT-enabled, because it would be really hard to do. SMT is beneficial on OOO machines because it is pretty easy to implement once you already have OOO; adding SMT to an in-order core with no OOO is relatively more expensive, and at that point it is often better to simply add more cores for an even better performance gain. SMT might still be beneficial even on a Mill-like design if it were wide (i.e. 8-way) and the code being executed were limited by memory latency. Time will tell; maybe you are better off then using a GPU or a Xeon Phi-like CPU, or doing software SMT and trying to load as much as possible in parallel by rewriting the code, or developing an SMT-enabled Mill. But I am sure this is really not worth discussing at the current stage of development.

  • goldbug
    Participant
    Post count: 53

    They recently discovered SplitSpectre, which is a Spectre variant with a much simpler gadget.

    With regular Spectre, this was the gadget needed in the victim space:

    
    if (x < array1_size)
      y = array2[array1[x] * 4096];
    

    Which is not that common.

    With SplitSpectre, this is the gadget needed:

    
    if (x < array1_size)
      y = array1[x];
    

    Which happens practically everywhere.

    Access to array2 can be in the villain’s space if y is returned.
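
    Put as code, the two halves of the gadget sit on opposite sides of the trust boundary, roughly like this (a sketch; the function and array names are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t array1[];
    extern size_t  array1_size;
    extern uint8_t probe[];           /* attacker-owned array */

    /* Victim half: an utterly ordinary bounds-checked accessor.
     * Under speculation an OOO core may run the load with an
     * out-of-range x and hand back a secret byte. */
    uint8_t victim_get(size_t x) {
        return (x < array1_size) ? array1[x] : 0;
    }

    /* Attacker half, in the attacker's own code: the returned value
     * is encoded into the cache via the attacker's probe array. */
    void attacker(size_t x) {
        uint8_t y = victim_get(x);    /* may be a speculative secret */
        uint8_t t = probe[y * 4096];  /* cache line encodes y        */
        (void)t;
    }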

    From your talk, I reckon the Mill is still not affected.

  • Witold Baryluk
    Participant
    Post count: 33

    Hi,

    I have been following Mill development for a few years, and I just watched this video about Spectre vs the Mill today, and I still have an issue. Yes, the Mill is not speculative in hardware, but as you said, the compiler will do speculation in software, reordering loads to get the best performance and hide stalls.

    So C code like:

    
    int foo(int x) {
      if (x < 100) {
        return 0;
      }
      return A2[A1[x] * 64];
    }
    

    will produce opcodes like this:

    
    compare x, 100
    return if false
    load A1[x] to k
    shift k
    load A2[shifted k] to r
    return r
    

    any optimizing compiler will convert this into:

    
    load A1[x] to k, delay=2
    compare x, 100
    return if false
    shift k
    load A2[shifted k] to r
    return r
    

    or even more aggressively into

    
    load A1[x] to k, delay=0
    shift k
    load A2[shifted k] to r, delay=2
    compare x, 100
    return if false
    return r
    

    The semantics are the same, even if a load accesses invalid memory and produces a NaR, and ultimately a fault in ‘return r’.

    So the loads will still pollute the cache, and a timing attack can be performed. I do not see how the Mill is any safer, because the code will still speculate, thanks to the compiler.

    To prevent speculation by the compiler, you would need either to tell the compiler not to speculate at all (losing some performance), or to annotate specific fragments of code as critical and not to be speculated, which is as error-prone as the mitigations and barriers you would place manually on OoO speculative machines.
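
    For comparison, the kind of manual mitigation currently applied on OoO hardware is branchless index masking (the idea behind the Linux kernel's array_index_nospec): clamp the index on the data path, so that even a mispredicted branch cannot form an out-of-range address. A sketch:

    #include <stddef.h>
    #include <stdint.h>

    /* Branchless clamp: mask is all-ones when x < size and zero
     * otherwise, so a speculatively executed load can never see an
     * attacker-chosen out-of-range index. Relies on arithmetic right
     * shift of a negative value, which is implementation-defined in
     * ISO C but universal in practice. */
    static size_t clamp_index(size_t x, size_t size) {
        size_t mask = (size_t)((intptr_t)(x - size) >>
                               (sizeof(size_t) * 8 - 1));
        return x & mask;
    }

    /* Usage: k = A1[clamp_index(x, A1_size)]; */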

    • Dave
      Participant
      Post count: 10

      “So, the loads will still pollute the cache, and a timing attack can be performed.”

      I read somewhere that one way around the issue is to have some amount of “speculative” cache: data loaded speculatively doesn’t get stored in the normal cache and can only be accessed by the speculatively executing code, and speculative code accessing data already in the cache doesn’t affect whether that cached data gets evicted. If the CPU runs out of “speculative” cache, the speculative execution would stop pending resolution of the branch.
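
      A toy model of that idea (my sketch of the concept, not any shipping design): speculative fills land in a small side buffer invisible to the normal cache, and the whole buffer is either committed when the branch resolves correctly or dropped on a squash.

    #include <stdbool.h>
    #include <stdint.h>

    #define SPEC_LINES 8

    typedef struct { uint64_t tag; bool valid; } SpecLine;

    static SpecLine spec_buf[SPEC_LINES];

    extern void cache_insert(uint64_t tag); /* hypothetical real-cache hook */

    /* Fill under speculation; returns false when the buffer is full,
     * at which point speculative execution must stall. */
    bool spec_fill(uint64_t tag) {
        for (int i = 0; i < SPEC_LINES; i++)
            if (!spec_buf[i].valid) {
                spec_buf[i] = (SpecLine){ tag, true };
                return true;
            }
        return false;
    }

    /* On branch resolution: commit the lines into the real cache if
     * the speculation was correct, otherwise just forget them. The
     * normal cache never observed the speculative accesses. */
    void spec_resolve(bool correct) {
        for (int i = 0; i < SPEC_LINES; i++)
            if (spec_buf[i].valid) {
                if (correct) cache_insert(spec_buf[i].tag);
                spec_buf[i].valid = false;
            }
    }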

      I think you could also fix it by making all* of the cache be one physical level which the CPU could then dynamically partition into per-process private caches and multiple logical shared caches. Processes could then tell the CPU whether they want a larger private cache or request access to one (or more, I suppose) of the logical shared caches if it’s one app that needs to share data across multiple concurrent threads. Since cache wouldn’t be shared between unrelated processes, malicious code wouldn’t be able to observe any changes to other processes’ cache, and non-malicious code couldn’t accidentally leak data by changing malicious code’s timing. Dunno what that’d cost in transistors, though, nor do I know what the performance implications would be for only having one physical cache level.

      *Except maybe some amount per-core, small enough that it could be flushed along with the belt every context switch.

      • Witold Baryluk
        Participant
        Post count: 33

        Dave, that sounds like a solution that could be implemented on any architecture; it adds additional complexity in terms of cache coherency, and probably adds latency, since both caches have to be checked.

        I am all for QoS or minimum-maximum ranges of allocated cache space (in the LLC) per turf, so that memory-intensive, cache-thrashing processes do not completely kill other applications that want low latency and some small amount of cache (mostly for code, and a bit of data) that is not constantly evicted on context switches. But I do not think L1 has enough space to do the same for data.

        • Dave
          Participant
          Post count: 10

          It could be implemented on any arch, yes. The Mill already has turfs for main memory, though, so a key part of the mechanism is already there. IIRC, they also do something different with caching that lets a Mill check whether a process has read/write permissions on cached locations more quickly than at least x86 does (don’t quote me on that… I need to watch the talk on caching again).

          It’s not that this would all have to fit in L1 as opposed to L2 or L3, it’s that physically everything is essentially L1. I’d imagine that this would make the cache slower (and possibly smaller overall)… With regard to performance, the question is whether the extra control over what stays in cache makes up for those possibly quite significant downsides. I wouldn’t know how to even begin answering that question.

          • Witold Baryluk
            Participant
            Post count: 33

            That is completely off topic.

            My comment was about the claim that the Mill is immune to Spectre because it does not speculate the way OoO machines do. During the talk there was a mention that fixing OoO requires labor-intensive and error-prone annotation of the code, and a claim that the Mill doesn’t require that.

            How is that true, in the context of my code example and the compiler hoisting loads before branches, which WILL happen in 95% of open code and loops?

          • goldbug
            Participant
            Post count: 53

            I asked the same thing a while ago.

            They published a paper with the answer
            https://millcomputing.com/blog/wp-content/uploads/2018/01/Spectre.03.pdf

            see the section called “Software and compiler speculation”

            Their answer seems to be a loadtr operation that performs the load only if the predicate is met, avoiding branching.

            When I asked, they said they were still trying to decide if there was a better way.
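
            In C terms the effect is roughly a predicated load: one operation in which the already-computed predicate gates the memory access itself, leaving no branch for a predictor to speculate past. A rough model (the NaR-carrying struct is my shorthand; loadtr's real semantics are in the paper):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { bool nar; uint8_t val; } LoadResult;

    /* Model of a predicated load. The 'if' below stands in for the
     * hardware predicate; at the ISA level this is a single operation,
     * so there is no mispredictable branch between check and access.
     * A false predicate suppresses the access and yields a NaR. */
    LoadResult loadtr_model(bool pred, const uint8_t *base, size_t idx) {
        if (!pred)
            return (LoadResult){ .nar = true };
        return (LoadResult){ false, base[idx] };
    }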

          • Witold Baryluk
            Participant
            Post count: 33

            Thanks, goldbug. I see they do admit in the paper that the compiler hoisting loads into speculative loads before checks can indeed make the Mill vulnerable to Spectre: “In our analysis we found and fixed one such bug in the Mill compiler tool chain.” And the solution was to improve the compiler. I am not sure how it figures out where to make these speculative loads and where not to, as it is really hard to predict where untrusted data is being used without annotations or barriers in the code (and without killing performance in other workloads that benefit from the compiler doing speculative loads).

            The fact that a Meltdown-like load doesn’t pollute the cache, and puts a NaR in the result, is nice. This isn’t that much different from what AMD is doing on their CPUs, where they also do the protection/TLB check early enough and do not put stuff in the cache; Intel was doing it too late in the pipeline, and it was polluting the cache. So nothing extremely special about that here.

            For Spectre variant 2, it is interesting that the Mill actually does restore the entire CPU, including caches, to a correct state even on a misprediction in the branch predictor. I guess this is doable because the misprediction latency is low: only a few instructions would be fetched and decoded, and it is easy to restore the data cache (because if it was an L1 miss, the data would not arrive from L2 in that time anyway). Similarly, if the misprediction targeted instructions that are not in the instruction cache (which would make it a bad branch predictor anyway, and such a miss is unlikely), they will not arrive in time from L2 either, so it is easy to cancel the load and go back to the proper path.

            There are examples in the paper discussing exactly the case I was pointing at.

            It appears the solution was to not do speculative loads until all the checks that could skip the load have executed, where possible in a single instruction, i.e.

            
            lsssb(%3, %2) %5,
               loadtr(%5, %0, 0, %3, w, 3) %6;
            

            This is nice, and has very little performance penalty. In both cases the load here will produce a result on the belt, but if the condition is false it will put a NaR on the belt. So the rest of the code can still do speculative stuff, or use this value (real or NaR) as input to other operations (including as an index for other loads, for example). I really like NaRs.

            And people who write more performance-oriented stuff can probably pass a flag to get more aggressive (and Spectre-prone) scheduling.

            I guess this is not bad.
