Mill Computing, Inc. Forums › The Mill Architecture › Layout can dramatically affect performance

  • Validark
    Participant
    Post count: 20
    #3892

    I just watched an awesome talk by Emery Berger called “Performance Matters”.

    From 9:41-15:37, he describes how seemingly arbitrary circumstances of layout affect a program’s performance through a number of factors: cache effects in the heap and stack, branch addresses differing and therefore colliding differently in the branch predictor, and potential differences in prefetching. Since the environment variables change how much is on the stack, they too can affect performance! There are also apparently TLB issues when something spans multiple pages, but the memory talk said that the Mill moves the TLB off the critical path, so maybe the Mill won’t suffer from TLB issues to the same extent that conventional machines do?
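    The environment-variable effect above can be seen on a conventional Linux machine with a tiny C probe (a sketch; `stack_probe` is just an illustrative name). With address-space randomization disabled (e.g. via `setarch -R`), the address of a local variable shifts as the environment grows, because the environment strings are copied onto the stack above `main`’s frame:

    ```c
    #include <stdio.h>
    #include <stdint.h>

    /* Return the current stack position by taking the address of a local.
     * Returning the address as an integer is fine; it is never dereferenced. */
    static uintptr_t stack_probe(void) {
        int local = 0;
        return (uintptr_t)&local;
    }

    int main(void) {
        /* Run twice with different environment sizes, e.g.:
         *   setarch -R ./probe
         *   setarch -R env BIG=$(printf 'x%.0s' {1..4096}) ./probe
         * and compare: the larger environment pushes the stack down. */
        printf("stack local at %#lx\n", (unsigned long)stack_probe());
        return 0;
    }
    ```

    On a conventional machine that shift changes stack alignment and cache behavior; the question is what the Mill does about it.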

    My question is: what does the Mill bring to this party? Can the Mill have the specializer choose a layout that avoids some of these issues? I heard the Mill intends to store exit predictions from previous runs on disk and to prefill the branch predictor with that data when starting an application. Can that, or something like it, help with any of these kinds of issues?

    Here’s the paper the talk is referencing:

    https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf

  • NXTangl
    Participant
    Post count: 15

    My guess would be that the Mill is significantly less sensitive to this kind of thing just by accident.

    Firstly, laying things out to avoid I$1 conflicts is certainly something that could be implemented, but it would require knowing, at specialization time, which functions are usually called together. However, exit-prediction prefetch means that a much higher percentage of transfers end up as cache hits anyway, which mitigates the cost of not doing that.

    Secondly, the Mill will inherently store much less data on the application-visible stack because register saving is handled by the hardware and the Mill can store huge amounts of data in the Scratchpad. This makes stack data more compact, inherently giving better locality.

    Thirdly, the Mill is smart about stack frames. A stack frame that is evicted doesn’t necessarily incur a cache miss when it is used again, since the Mill tracks stack frames across function call boundaries. In particular, operations on fresh stack frames should just work.

    Fourthly, as a SAS-with-fork()-hack processor, the Mill will only hit the TLB if it’s going to memory anyway. Permission checks have to be done on each load, store, branch, and call, but the PLB is effectively fully associative and only stores one entry per range of bytes with a given permission, making it completely insensitive to both code and data layout changes that don’t cross protection boundaries.

    • Ivan Godard
      Keymaster
      Post count: 689

      Remarkably clear exposition. Thank you!
