Oh, I was on that page, but I had not noticed the information about the cycle counter before. My bad.
Indeed, there is a cycleCounter register that can be read using the rd operation. It is spilled on task switch (turf?), which would mean it is per thread. So reading this register should automatically account for task switches and core migrations, and it should count only the cycles spent in the specific task/thread/turf. That is exactly what I need (and what x86/Intel/AMD performance counters DO NOT provide).
So it is possible that this is exactly what is needed for really low-overhead per-thread CPU cycle / CPU time accounting from user space.
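To make the intended usage concrete, here is a minimal sketch of the read-before / read-after pattern I have in mind. Everything here is an assumption for illustration: `measure_region` and `read_counter` are hypothetical names, and Python's `time.thread_time_ns()` is only a stand-in for a per-thread counter (it goes through the OS and returns nanoseconds of thread CPU time, not cycles). On the Mill the read would presumably be a single rd of cycleCounter instead.

```python
import time


def measure_region(work, read_counter=time.thread_time_ns):
    """Run work() and return (result, delta).

    delta is how much the per-thread counter advanced while work() ran.
    read_counter is a hypothetical hook: any zero-argument callable that
    returns a monotonically increasing per-thread count.
    """
    start = read_counter()   # counter value before the region
    result = work()          # the code being measured
    end = read_counter()     # counter value after the region
    return result, end - start


if __name__ == "__main__":
    # Stand-in measurement: thread CPU time in nanoseconds, not cycles.
    total, spent = measure_region(lambda: sum(range(100_000)))
    print(f"result={total}, thread CPU time spent={spent} ns")
```

Because the counter is per thread, the delta needs no correction for time spent in other threads, which is the whole point of the pattern.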
The documentation doesn’t specify whether these are issue cycles or actual cycles, i.e. does the counter increment while the core and pipelines are stalled (e.g. during cache misses, or because of poor instruction scheduling / parallelism)?
An operations counter (per thread, from the decoders for both instruction stream sides) would also be extremely helpful for computing IPC.
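For clarity, this is the calculation I mean, as a small sketch. The function name `ipc` and its parameters are hypothetical. It assumes two per-thread counters (operations and cycles) sampled at the start and end of a region, and that neither counter wraps around in between.

```python
def ipc(ops_start, ops_end, cycles_start, cycles_end):
    """Operations per cycle over a measured region.

    All four arguments are raw counter samples; only the deltas matter.
    Assumes the counters did not wrap between the two samples.
    """
    ops = ops_end - ops_start
    cycles = cycles_end - cycles_start
    if cycles <= 0:
        raise ValueError("cycle counter did not advance")
    return ops / cycles


if __name__ == "__main__":
    # Made-up sample values, purely to show the arithmetic.
    print(ipc(ops_start=1000, ops_end=5000, cycles_start=200, cycles_end=1200))
```

Whether the result is meaningful depends on the open question about the cycle counter: if it counts only issue cycles, stalls would not show up in the ratio.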
Per-thread stats for cache loads/hits/misses would also be extremely helpful.