This too is mostly the domain of language and OS. The usual instrumentation methods will work on a Mill of course, but transparent instrumentation has really nasty issues with security that can only be addressed by the OS security model, not the architecture. Do you really want to give this turf the ability to dynamically change (i.e. instrument) the code being run by that turf?
My question was, have you thought about adding tools that give the benefits of valgrind-style instrumentation without actually intercepting instructions?
Say my program is running much more slowly than I expected, and I’m trying to figure out why. Right now my options are:
– Run the program with perf, and get a coarse-grained profile of my program. I can get a flamegraph of the function calls (though it will be an averaged approximation) and a rough summary of various indicators (instructions retired, cache misses, branch mispredicts, etc).
– Run the program with cachegrind, and get a fine-grained profile of my program. The emulator will tell me exactly how many cache misses each line of code will trigger, with the caveats that they’re misses in an emulated architecture, not on my actual computer; some things like stride prefetching and reorder buffers aren’t actually emulated. Also, obviously the execution is about 10x to 50x slower.
– Use a domain-specific profiler (e.g. if I’m using a database and it comes with one).
– Implement my own homemade solution to measure the metrics I care about.
Of the choices above, only perf actually uses CPU instrumentation. The problem is that it’s extremely coarse; the CPU will only have counters for some types of events, to give you a summary at the end of execution. (Also, I think it needs some privileges to run.)
What I’d want is something giving me a journal of the entire execution, from which I would get the metrics I need.
For instance, say what I care about is cache misses. I would want to tell Mill “execute this program, and every time you run a load instruction, log the cache level plus the instruction pointer in a journal, and save the journal to disk every 10MB”. This is a super-simplified example (if nothing else, the resulting journal would be huge), but it’s the kind of information I would want. From that journal, I could get a summary that tells me “this function has a lot of L1 misses, this function has some L2 misses, this specific line of code has lots of L3 misses and keeps stalling the entire program”, etc.
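To make the idea concrete, here is a rough sketch of the post-processing step I have in mind. The journal format is entirely made up (one record per load, holding the instruction pointer and the cache level that served it); nothing here reflects an actual Mill facility, and the addresses and the numbering of levels are just assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical journal: one (instruction_pointer, cache_level) record per load.
# Assumed encoding: 1 = served by L1, 2 = L2, 3 = L3, 4 = went to main memory.
journal = [
    (0x4000, 1), (0x4000, 1), (0x4008, 2),
    (0x4010, 4), (0x4010, 4), (0x4010, 3), (0x4000, 1),
]

def summarize(records):
    """Count, for each instruction pointer, how many loads each level served."""
    per_ip = defaultdict(Counter)
    for ip, level in records:
        per_ip[ip][level] += 1
    return per_ip

def worst_offenders(per_ip, min_level=3):
    """Rank instruction pointers by loads served at min_level or deeper."""
    deep = {
        ip: sum(n for level, n in counts.items() if level >= min_level)
        for ip, counts in per_ip.items()
    }
    hot = [(ip, n) for ip, n in deep.items() if n > 0]
    return sorted(hot, key=lambda item: -item[1])

summary = summarize(journal)
ranking = worst_offenders(summary)
for ip, n in ranking:
    print(f"{ip:#x}: {n} loads served by L3 or memory")
```

Mapping instruction pointers back to functions and source lines would be the debugger's or profiler's job, using the usual debug info.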
The Mill has the advantage of being fully deterministic except for a very small number of instructions; so, from a small journal, you might be able to reconstruct the entire execution trace, and find better insights than you’d get from other CPUs.
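Here is a toy sketch of what I mean by reconstructing the trace from a small journal. It is not a model of the Mill: the three-operation "program" and the `rand` operation standing in for a nondeterministic instruction are invented for the example. The point is only that if you record the results of the nondeterministic operations, re-running everything else deterministically gives you back the full trace.

```python
import random

# Toy program: each step is (op, operand). Only "rand" is nondeterministic;
# it stands in for whatever instructions are not deterministic on real hardware.
PROGRAM = [("add", 5), ("rand", None), ("mul", 3), ("rand", None), ("add", 1)]

def run(program, nondet_source):
    """Execute the program; return the full trace and the journal of
    values produced by nondeterministic steps."""
    acc, trace, journal = 0, [], []
    for op, arg in program:
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        elif op == "rand":
            value = nondet_source()
            journal.append(value)  # only these values need to be logged
            acc += value
        trace.append((op, acc))
    return trace, journal

# Live run: nondeterministic values come from the outside world.
live_trace, journal = run(PROGRAM, lambda: random.randint(0, 99))

# Replay: feed the journaled values back in instead of fresh ones.
replay_values = iter(journal)
replayed_trace, _ = run(PROGRAM, lambda: next(replay_values))

print(len(journal), "journal entries reconstruct", len(replayed_trace), "trace steps")
```

In this toy the journal holds two values while the trace has five steps; on a real program the ratio would presumably be far more favorable, but that depends on how often the nondeterministic instructions actually occur.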