Mill Computing, Inc. Forums The Mill Architecture Performance counters

  • Witold Baryluk
    Participant
    Post count: 33
    #3359 |

    Hi,

    I am a software developer who is obsessed with performance and benchmarks. In the past I developed specialized tools and libraries for measuring the precise timing of various pieces of code, not just for performance tuning but also for load balancing and overload protection on servers. They measured accurate timing of every server request independently (CPU time and wall-clock time, in userspace), which makes it possible to find expensive requests, even if one request is executed by multiple threads, or across multiple asynchronous callbacks executed by different threads but in the context of the same request.

    The big trick is to do performance measurements without impacting the performance of the code itself. In high-performance server code it is a big no-no to make syscalls into the kernel and do complex timekeeping all the time. This is why my software used various tricks and hardware counters, plus implicit channels between user space and kernel space to detect preemption and CPU migrations. So even in the event of a CPU core migration, preemption by another thread on the same CPU core, or a hardware interrupt, the accumulated CPU and wall-clock time would still be correct, and in the majority of cases the overhead would be effectively zero (a few nanoseconds, which is insignificant compared to the hundreds of microseconds of even the fastest server RPC handling).

    I was wondering what the plans are for performance counters on the Mill, and whether it would be possible to make them per thread (per tasklet?). That is, my code loads some “start” descriptor, then does an arbitrary amount of computation, loads, stores, calls to other tasklets, kernel calls and preemptions (which are implicit calls via interrupts, or happen because some other interrupt is being handled, e.g. new network data arriving for another thread), and at the end loads some “end” descriptor and computes the difference. That difference would tell me how much wall-clock time my tasklet alone took (including time stalled or preempted by others, even while the tasklet was runnable), how much CPU time it took (actual cycles executed while it was both runnable and actually running), and possibly other metrics (e.g. number of memory loads, number of D$1 misses, etc.), with everything else going on on the core not affecting the correctness of the result.
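
The start/end snapshot idea can be sketched with what an ordinary OS already exposes. This is only an illustration of the two quantities being asked for (wall-clock time versus CPU time of one thread), not of a Mill facility: `Snapshot`, `take_snapshot`, `elapsed` and `measure` are hypothetical names, and `time.thread_time_ns()` goes through the OS, which is exactly the overhead the post wants to avoid.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical 'descriptor': wall-clock and per-thread CPU time, in ns."""
    wall_ns: int
    cpu_ns: int


def take_snapshot() -> Snapshot:
    # perf_counter_ns: monotonic wall clock; keeps advancing while the
    # thread sleeps or is preempted.
    # thread_time_ns: CPU time consumed by the calling thread only.
    return Snapshot(time.perf_counter_ns(), time.thread_time_ns())


def elapsed(start: Snapshot, end: Snapshot) -> Snapshot:
    """Difference between two snapshots taken on the same thread."""
    return Snapshot(end.wall_ns - start.wall_ns, end.cpu_ns - start.cpu_ns)


def measure(work) -> Snapshot:
    """Run `work` between a start and an end snapshot."""
    start = take_snapshot()
    work()
    return elapsed(start, take_snapshot())


if __name__ == "__main__":
    blocked = measure(lambda: time.sleep(0.05))                  # mostly waiting
    busy = measure(lambda: sum(i * i for i in range(200_000)))   # mostly computing
    print("sleep:", blocked)
    print("busy: ", busy)
```

For the sleeping case the wall-clock difference should be much larger than the CPU-time difference; for the busy loop the two should be close unless the thread was preempted.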

  • Findecanor
    Participant
    Post count: 34

    I disagree. While measuring performance is important, I think that too fine-grained performance counters should be unavailable to unprivileged user programs for security reasons.

    There has been a lot of talk about the CPU-level vulnerabilities Spectre and Meltdown this past year. Those consist of two components: first, the use of speculation to access secrets, and second, the use of side channels to exfiltrate the secrets to a receiver. The side channels in question use the timing of memory accesses to find cache hits and misses. Now, we all know that The Mill is impervious to Spectre and Meltdown because it does not have speculative execution (except as explicit instructions put there by the compiler), which stops the access in the first place. But there are many other types of CPU-level attacks out there that are variations of the second component: side channels that depend on precise timing.
    Among these are various attacks that monitor the CPU time and memory use of other processes to determine what they do: for instance for sniffing password prompts and monitoring encryption algorithms to reduce the search space for encryption keys.
    Having fine-grained timing privileged does not make it impossible to conduct all types of side-channel attacks, but it could make some attacks significantly harder to pull off.

  • Witold Baryluk
    Participant
    Post count: 33

    I think the forum is broken. I see that user Findecanor replied in this topic, but there is no reply when you actually open the topic.

    • Findecanor
      Participant
      Post count: 34

      I think the forum is broken. I see that user Findecanor replied in this topic, but there is no reply when you actually open the topic.

      Before there were more posts, my reply used to be visible when logged out but hidden when logged in. Now it is not visible at all. Weird…
      BTW, the bug got triggered when I tried to edit my post. The edit didn’t take.

      What my post was about:
      I would like The Mill to make it possible for an operating system to make access to performance counters privileged to the operating system, and/or for care to be taken about what exactly user-mode performance counters measure.

      The concern is about security. CPU cycle counters are often used for side-channel attacks to find out what another process does: measuring the attacker’s own portion of total CPU usage to infer the target process’ CPU usage (“timing attack”), or measuring the time of memory accesses to find out which addresses the other process has loaded into cache (“cache attack”). Cache side channels are maybe best known as a major part of Spectre and Meltdown (the second phase: the “exfiltration” part). While the first phase of Spectre and Meltdown is not possible on the Mill because the CPU does not execute instructions speculatively, there are many other types of side-channel attacks that don’t rely on speculative execution. Some attacks target password prompts. Others target encryption algorithms, reducing the search space for cracking encryption keys.
      Reading the CPU cycle counter is a privileged operation on ARM but is allowed in user mode on x86. Therefore many attacks are easier to conduct on x86 and harder or even impossible on ARM.

      I do realise that the issue is not easy.
      I know first hand from working with video compression that there are many cases where the performance of your code depends on the data cache, and where you therefore really want to be able to measure how changes to the memory layout affect caching performance.

      • Witold Baryluk
        Participant
        Post count: 33

        I had the same issues with post editing. Sometimes I edit a post a minute after posting, and after saving it disappears.

        The ability to make some operations privileged (not just by the OS, but by any process) seems nice on the surface (e.g. in some virtualization scenarios it would be nice to disable wide vector FPU ops, not just on the specializer side but actually trapping in the CPU when they are used; or to disable explicit instruction/data cache flushing). I understand the concern behind hiding performance counters. But even with performance counters privileged and inaccessible to normal programs, there are ways to get accurate timings from user space and execute a side-channel attack successfully. Example: two threads, one spinning and updating a counter in memory or L3 cache, calibrated using normal timers or a real clock (even one with just millisecond or second accuracy!), and another thread reading this counter at the start and end. You have just recreated a very precise timer (probably accurate to within a few cycles). It gets even more accurate if you give your threads affinity so that they sit close together on the chip, share L2 or L3 in a specific way, and do not migrate often. This is very easy to accomplish.
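
The structure of such a counting-thread timer can be sketched as follows. `CounterClock` and its methods are hypothetical names. In CPython the global interpreter lock stops the two threads from truly running in parallel, so this only shows the shape of the technique (spin, calibrate once against a coarse clock, then time with counter reads alone) and gets nowhere near the few-cycle accuracy described above. A native version would use two pinned threads and a shared cache line.

```python
import threading
import time


class CounterClock:
    """Software timer built from a spinning thread (illustrative sketch)."""

    def __init__(self):
        self.ticks = 0                 # shared counter, written only by the spinner
        self.ticks_per_second = None   # filled in by calibrate()
        self._running = False
        self._thread = threading.Thread(target=self._spin, daemon=True)

    def _spin(self):
        # The "clock": increment the shared counter as fast as possible.
        while self._running:
            self.ticks += 1

    def start(self):
        self._running = True
        self._thread.start()

    def stop(self):
        self._running = False
        self._thread.join()

    def calibrate(self, seconds=0.05):
        # Use a coarse reference clock once to learn the tick rate.
        t0, c0 = time.monotonic(), self.ticks
        time.sleep(seconds)
        t1, c1 = time.monotonic(), self.ticks
        self.ticks_per_second = (c1 - c0) / (t1 - t0)

    def measure(self, work):
        # Time `work` using only reads of the shared counter.
        c0 = self.ticks
        work()
        c1 = self.ticks
        return (c1 - c0) / self.ticks_per_second


if __name__ == "__main__":
    clock = CounterClock()
    clock.start()
    clock.calibrate()
    print("estimated seconds:", clock.measure(lambda: time.sleep(0.05)))
    clock.stop()
```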

        Trying to hide timers is just a workaround, not a real solution. Making access to very accurate hardware counters privileged will just cause more trouble in normal use.

        Also, notice that I do not ask for an “absolute” timer. I am specifically asking for a timer that only progresses during the execution of a specific thread (turf), or for the facilities to build one. This way it is to a large extent immune to what other threads are doing, even if the turf was context-switched to something else and then back. It will of course still see L1/L2 cache latencies if its data was touched by another thread during its own execution; I see absolutely no way to prevent that in general. Meltdown is easy to fix by other means. Spectre can be addressed by compiler and hardware too, and the Mill does a few tricks to make that work well without any impact on performance.

  • Veedrac
    Participant
    Post count: 25

    I don’t know the details, but there are some trivial counters listed at http://millcomputing.com/wiki/Registers.

    There are so many cool things a Mill is uniquely capable of in this space relative to an OoO machine; it would be a shame not to hear that they have something interesting planned.

    • Witold Baryluk
      Participant
      Post count: 33

      Oh, I was on that page, but I did not see the information about the cycle counter before. My bad.

      Indeed, there is a cycleCounter register that can be read using the rd operation. It is spilled on task switch (turf?), so that would mean it is per thread: if I read this register, it automatically deals with task switches and core migrations, and it should count only cycles spent in the specific task/thread/turf, which is exactly what I need (and what x86/Intel/AMD performance counters DO NOT provide).

      So it is possible that this is exactly what is needed to provide truly low-overhead per-thread CPU cycle / CPU time accounting from user space.
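
A toy model of that reading of the wiki page: a counter register whose value is saved when a task is switched out and restored when it is switched back in advances only while its own task runs. `ToyCore` and its methods are invented for illustration, and nothing here claims to describe how the Mill actually implements cycleCounter.

```python
class ToyCore:
    """Toy model: one live cycle-counter register, saved and restored per task."""

    def __init__(self):
        self.cycle_counter = 0   # the live register
        self.saved = {}          # task id -> spilled counter value
        self.current = None      # task currently running on this core

    def switch_to(self, task):
        # Spill the outgoing task's counter, then fill the incoming one.
        if self.current is not None:
            self.saved[self.current] = self.cycle_counter
        self.cycle_counter = self.saved.get(task, 0)
        self.current = task

    def run(self, cycles):
        # Only the running task's counter advances.
        self.cycle_counter += cycles

    def read(self):
        return self.cycle_counter


if __name__ == "__main__":
    core = ToyCore()
    core.switch_to("A")
    start = core.read()
    core.run(100)
    core.switch_to("B")   # A is preempted
    core.run(1000)
    core.switch_to("A")   # A resumes
    core.run(50)
    print(core.read() - start)   # cycles charged to A only
```

In this model the 1000 cycles spent in task B never show up in A's difference, which is the property the post is after.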

      The documentation doesn’t specify whether these are issue cycles or actual cycles, i.e. whether the counter increments while the core and pipelines are stalled (e.g. during cache misses, or because of poor instruction scheduling / parallelism).

      An operations counter (per thread, from the decoders for both instruction stream sides) would also be extremely helpful, to compute IPC.

      Per-thread stats for cache loads/hits/misses would also be extremely helpful.
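
The arithmetic such counters would enable is simple. The sketch below assumes a hypothetical set of per-thread readings; `CounterSample` and its field names are invented and do not correspond to any documented Mill register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Hypothetical per-thread counter readings; field names are invented."""
    cycles: int
    ops: int
    cache_loads: int
    cache_misses: int


def delta(start: CounterSample, end: CounterSample) -> CounterSample:
    """Counts accumulated between two readings taken on the same thread."""
    return CounterSample(
        end.cycles - start.cycles,
        end.ops - start.ops,
        end.cache_loads - start.cache_loads,
        end.cache_misses - start.cache_misses,
    )


def ops_per_cycle(d: CounterSample) -> float:
    # The IPC-style figure: operations completed per cycle.
    return d.ops / d.cycles if d.cycles else 0.0


def miss_rate(d: CounterSample) -> float:
    # Fraction of cache loads that missed.
    return d.cache_misses / d.cache_loads if d.cache_loads else 0.0


if __name__ == "__main__":
    start = CounterSample(cycles=1000, ops=4000, cache_loads=200, cache_misses=10)
    end = CounterSample(cycles=3000, ops=12000, cache_loads=700, cache_misses=60)
    d = delta(start, end)
    print(ops_per_cycle(d), miss_rate(d))
```

Whether the cycle count includes stall cycles changes what the ops-per-cycle figure means, which is why the issue-versus-actual-cycles question above matters.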

      • Veedrac
        Participant
        Post count: 25

        ignore this

        • This reply was modified 5 years, 10 months ago by  Veedrac.
