Forum Replies Created

  • NarrateurDuChaos
    Participant
    Post count: 23

    – 7) Benchmarking support

    Are there plans to support stable benchmarking on the Mill? Benchmarks often have to go through all sorts of contortions to filter out performance noise: warming up caches, running multiple times to smooth out outliers, maybe even recompiling multiple times to test performance with different code layouts, etc.

    The Mill has some natural advantages here: the Exit Table can be preloaded, and the absence of runtime reordering means more stable timings. Do you have plans to include features specifically for stable execution? Things like purging caches, avoiding interrupts, pinning cores, etc.?
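
    For context, the kind of harness I mean looks something like this (a minimal sketch in plain C++, nothing Mill-specific; the kernel is made up):

    ```cpp
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Hypothetical kernel under test.
    static long kernel(const std::vector<long>& data) {
        long sum = 0;
        for (long v : data) sum += v;
        return sum;
    }

    int main() {
        std::vector<long> data(1 << 20, 1);

        // Contortion 1: warm up caches and predictors before measuring.
        for (int i = 0; i < 10; ++i) (void)kernel(data);

        // Contortion 2: run many times, keep the median to smooth out outliers.
        std::vector<double> samples;
        for (int i = 0; i < 101; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            volatile long sink = kernel(data);  // volatile keeps the call alive
            auto t1 = std::chrono::steady_clock::now();
            (void)sink;
            samples.push_back(std::chrono::duration<double>(t1 - t0).count());
        }
        std::nth_element(samples.begin(), samples.begin() + samples.size() / 2,
                         samples.end());
        std::printf("median: %g s\n", samples[samples.size() / 2]);
    }
    ```

    Hardware help with cache purging or interrupt masking around the timed region would make most of that scaffolding unnecessary.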

    – 8) Profiling support

    Similar to the previous question: do you have plans to support fine-grained profiling? Profilers often only give coarse information like “this function was executed roughly X times”, “this function took roughly Y cycles on average”, etc.

    You can use emulation tools like Valgrind to get more fine-grained info, such as the number of L2 misses caused by a specific instruction, at the cost of a massive performance loss. Could the Mill provide tools to help get fine-grained data without that overhead?
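
    Today, the closest non-emulation workaround is hand-instrumenting the interesting instruction, which perturbs the measurement and still only yields time, not cache-level detail. A sketch of that workaround (x86-only, using the __rdtsc intrinsic from GCC/Clang; nothing Mill-specific):

    ```cpp
    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>  // __rdtsc (GCC/Clang, x86 only)

    int main() {
        static int table[1 << 16];
        std::uint64_t total = 0;
        for (int i = 0; i < (1 << 16); ++i) {
            std::uint64_t t0 = __rdtsc();
            volatile int v = table[(i * 9973) & 0xffff];  // the load we care about
            std::uint64_t t1 = __rdtsc();
            (void)v;
            total += t1 - t0;  // includes rdtsc overhead and pipeline effects
        }
        std::printf("avg cycles around the load: %llu\n",
                    (unsigned long long)(total >> 16));
    }
    ```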

    – 9) Profile-guided optimization

    How does Mill plan to benefit from PGO?

    Traditional PGO is mostly centered on layout: the profiling phase gathers coarse-grained info about which functions are called most often and in which order, and from that the compiler knows both which functions to optimize for size vs. speed (in other words, which loops to unroll and vectorize) and which functions should be laid out together in memory.
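
    For reference, the traditional instrumentation-based flow, shown here with Clang/LLVM's real tooling (the source itself is a made-up illustration):

    ```cpp
    // Instrumentation-based PGO with Clang:
    //
    //   clang++ -O2 -fprofile-instr-generate app.cpp -o app
    //   ./app < training-input          # writes default.profraw
    //   llvm-profdata merge default.profraw -o app.profdata
    //   clang++ -O2 -fprofile-instr-use=app.profdata app.cpp -o app
    //
    // With the profile, the compiler learns that hot_path() dominates, so it
    // can unroll/vectorize it and place it next to its callers, while
    // cold_path() is optimized for size and moved out of the way.
    #include <cstdio>

    [[gnu::noinline]] long hot_path(long x)  { return x * 2 + 1; }
    [[gnu::noinline]] long cold_path(long x) { return x - 42; }

    long dispatch(long x) {
        return (x % 1000 != 0) ? hot_path(x)    // taken ~99.9% of the time
                               : cold_path(x);  // rare
    }

    int main() {
        long s = 0;
        for (long i = 1; i <= 1000000; ++i) s += dispatch(i);
        std::printf("%ld\n", s);
    }
    ```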

    It feels like, since the Mill is a lot more structured than traditional ISAs, it could get a lot more mileage from PGO. It already gets benefits through the Exit Table. Are there plans to facilitate other optimizations, besides general code layout?

    – 10) Specializer hints

    Because of the whole Specializer workflow, the genASM format is essentially a compiler IR. As such, have you thought about including hints to help the Specializer produce better local code?

    Some hints might include (straight from LLVM): aliasing info (saying that a given load and store will never alias), purity info for function calls (saying a function will never load, store, or modify FP flags), saying that a loop is safe to unroll, etc.
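
    Source-level analogues of those hints already exist in GCC/Clang, which might be a useful model for genASM-level equivalents (illustrative sketch):

    ```cpp
    #include <cstddef>

    // Aliasing info: promise that dst and src never overlap.
    void scale(float* __restrict__ dst, const float* __restrict__ src,
               float k, std::size_t n) {
        // Unrolling hint: promise the loop is safe/profitable to unroll.
        #pragma unroll 4  // Clang spelling; GCC uses "#pragma GCC unroll 4"
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i] * k;
    }

    // Purity info: the result depends only on the argument; no loads,
    // stores, or FP-flag side effects the caller has to preserve.
    __attribute__((const)) int square(int x) { return x * x; }
    ```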

  • NarrateurDuChaos
    Participant
    Post count: 23

    – 4) Tail calls

    Does the Mill support tail-call instructions? I’m told they’re important for some functional programming languages.

    You might be able to save instruction entropy by encoding tail calls as “a return ganged with a function call”.
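
    For illustration, here is the pattern that needs the guarantee, with Clang's [[clang::musttail]] attribute as the closest existing analogue (Clang-specific; the function is made up):

    ```cpp
    // Tail-recursive accumulation: with a guaranteed tail call, this runs
    // in constant stack space, like a loop would.
    long sum_to(long n, long acc) {
        if (n == 0) return acc;
        // "A return ganged with a call": the current frame is reused.
        [[clang::musttail]] return sum_to(n - 1, acc + n);
    }

    int main() { return sum_to(10000, 0) > 0 ? 0 : 1; }
    ```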

    – 5) Exception handling

    Do you have first-class C++ style exception handling?

    By “first-class”, I mean “zero-cost if no exception is thrown”: no branch after every function return or anything like that.

    Also, can you handle setjmp/longjmp?
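
    For reference, the setjmp/longjmp pattern in question (standard C, shown here via <csetjmp>):

    ```cpp
    #include <csetjmp>
    #include <cstdio>

    static std::jmp_buf env;

    static void fail() {
        // Non-local exit: control returns straight to the setjmp point,
        // discarding every stack frame in between.
        std::longjmp(env, 1);
    }

    int main() {
        if (setjmp(env) == 0) {
            std::puts("trying...");
            fail();
            std::puts("never reached");
        } else {
            std::puts("recovered after longjmp");
        }
    }
    ```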

    – 6) Coroutines

    Do you support stackless coroutines in hardware? By coroutines, I mean functions you can jump into, yield from, and then jump back into at the yield point.

    By stackless, I mean “unlike a thread, they don’t need their own dedicated stack; they add their frames directly to the caller’s stack”. I guess the big difficulty would be spilling directly to userspace.
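
    In other words, something like this switch-based sketch, where all of the coroutine's state lives in a small struct rather than on a separate stack (plain C++, protothread-style; the names are made up):

    ```cpp
    #include <cstdio>

    // All persistent state lives here -- no dedicated stack.
    struct Counter {
        int state = 0;  // where to resume
        int i = 0;      // a "local" that survives across yields
    };

    // Returns the next value, or -1 once finished.
    int counter_next(Counter& c, int limit) {
        switch (c.state) {
        case 0:
            for (c.i = 0; c.i < limit; ++c.i) {
                c.state = 1;
                return c.i;   // yield
        case 1:;              // resume point (Duff's-device style)
            }
        }
        return -1;            // done
    }

    int main() {
        Counter c;
        for (int v; (v = counter_next(c, 3)) != -1; )
            std::printf("yielded %d\n", v);
    }
    ```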

  • NarrateurDuChaos
    Participant
    Post count: 23

    – 1) Exit prediction with dynamic addresses

    What’s the current story for exit prediction with dynamic function pointers, vtables, and such? Obviously, if you have a loop iterating over an array of virtual objects, a single predicted address for the call instruction isn’t going to be too helpful. But it’s probably suboptimal to throw up your hands and say “it can’t be predicted”, since sometimes these addresses are available tens of cycles ahead of time.
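
    Concretely, the pattern I have in mind (plain C++):

    ```cpp
    #include <memory>
    #include <vector>

    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;
    };
    struct Circle : Shape {
        double r = 1;
        double area() const override { return 3.14159 * r * r; }
    };
    struct Square : Shape {
        double s = 2;
        double area() const override { return s * s; }
    };

    double total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
        double total = 0;
        // One call site, many targets: each iteration may dispatch through a
        // different vtable, yet every target address is already sitting in
        // memory well before the corresponding call executes.
        for (const auto& s : shapes)
            total += s->area();
        return total;
    }

    int main() {
        std::vector<std::unique_ptr<Shape>> v;
        v.push_back(std::make_unique<Circle>());
        v.push_back(std::make_unique<Square>());
        return total_area(v) > 0 ? 0 : 1;
    }
    ```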

    I think you mentioned potentially using deferred branches, but eventually gave up on the idea. If nothing else, deferred branches wouldn’t help with the “array of objects” case, since they could presumably only predict the first call.

    Ideally, you would want predictions available for each vtable in the array, so you can call the next method as soon as you return from the last one. Maybe you could have a prediction queue of some sort? Or an alternate encoding scheme for when the Exit Table predicts that a given exit will always be a dynamic branch?

    – 2) Hardware prefetching

    How does the Mill handle data-side hardware prefetching? Traditional CPUs mostly rely on detecting stride patterns, e.g. “if you asked for addresses 12037, 12047, and 12057, you’re probably going to ask for 12067”. Do you expect Mill cores to do stride detection?

    Deferred loads can help hide the latency of an L1 miss, but obviously don’t help with an L3 miss. And there are some common patterns (e.g. tree traversals, arrays of pointers) where stride detection and deferred loads don’t help at all, yet the hardware would still have enough information to prefetch lots of data ahead of time. For instance, a foreach loop iterating depth-first over a binary tree could take the left branch immediately and prefetch the right branch, thus hiding half the L3 misses. Does the Mill provide anything to facilitate this?
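
    A sketch of the software version of that idea, using GCC/Clang's __builtin_prefetch as a stand-in for whatever the Mill might offer:

    ```cpp
    struct Node {
        int value;
        Node* left;
        Node* right;
    };

    long sum_tree(const Node* n) {
        if (!n) return 0;
        // Kick off the right subtree's cache line before descending left;
        // by the time we come back and go right, it may already be in cache
        // instead of costing a full miss.
        if (n->right) __builtin_prefetch(n->right);
        long total = n->value + sum_tree(n->left);
        return total + sum_tree(n->right);
    }

    int main() {
        Node d{4, nullptr, nullptr}, c{3, nullptr, nullptr};
        Node b{2, &c, &d}, a{1, &b, nullptr};
        return sum_tree(&a) == 10 ? 0 : 1;
    }
    ```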

    – 3) Inner-loop specific optimizations

    The “inner” instruction seems like it would open up a lot of potential for optimization. Have you explored loop-specific optimizations?

    I’m thinking about things like exact exit prediction (if you’re looping from “i = 0” to “i < array.size”, you know exactly how many iterations the loop will run), smart prefetching (in a loop with no branches, you know exactly which loads the next iterations will issue, so you can prefetch them ahead of time), etc.

    I know that a lot of software prefetching is wasted because it prefetches past the end of a loop and things like that, but the hardware would have enough info to be precise.
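
    For example, the usual software pattern, with exactly the waste I mean (GCC/Clang builtin; the prefetch distance of 16 is arbitrary):

    ```cpp
    #include <cstddef>

    double sum(const double* a, std::size_t n) {
        double total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            // Prefetch a fixed distance ahead. For the last 16 iterations
            // this fetches past the end of the array: wasted bandwidth that
            // trip-count-aware hardware could avoid.
            __builtin_prefetch(&a[i + 16]);
            total += a[i];
        }
        return total;
    }

    int main() {
        double a[64];
        for (int i = 0; i < 64; ++i) a[i] = 1.0;
        return sum(a, 64) == 64.0 ? 0 : 1;
    }
    ```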

  • NarrateurDuChaos
    Participant
    Post count: 23

    Hello!

    It’s been exactly a year and a week since I started this thread. Has there been any externally visible progress since? E.g. progress on switching the management structure, on the FPGA implementation, or anything of the sort?

    Like… I don’t mean to be rude, but… have you guys done anything at all in the past year with observable results?

    Because if not, I don’t see how anyone can take the statement “Mill is not stalled” seriously.
