L2 cache control in software??

Author
Posts
NXTangl
Participant
July 22, 2024 at 2:29 pm
Post count: 21
#3994
Take a look at this paper. What are your thoughts on integration with the Mill? It seems like it fits pretty well with how the Mill likes to handle cache stuff…
Art
Participant
July 24, 2024 at 6:29 am
Post count: 11
#3997
The approaches differ in ways that leave the LLC (last-level cache) described in that paper with no advantage over the cache we had envisioned for the Mill architecture.
The main justification for this cache methodology rests on the assumption that DRAM access times will remain a high multiple of the processor clock period. Consider that Dennard scaling has largely broken down, leaving processor core clock rates not much higher than they were in 2004, while DRAM clock rates have continued to climb since then, reducing LLC miss penalties, especially with on-chip memory controllers.
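To make the dependence on miss penalty concrete, here is a minimal sketch using the standard average memory access time model (AMAT = hit time + miss rate × miss penalty). The function names and every number below are hypothetical and chosen only for illustration; none of them come from the paper or from any Mill design, and the break-even definition here is just one plausible reading, not necessarily the one the paper uses.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles, for one cache level."""
    return hit_time + miss_rate * miss_penalty


def break_even_penalty(extra_hit_time, miss_rate_reduction):
    """Miss penalty at which a slower-but-smarter cache ties the baseline.

    A design that adds `extra_hit_time` cycles to every access but lowers
    the miss rate by `miss_rate_reduction` only pays off when the miss
    penalty is above the value returned here.
    """
    return extra_hit_time / miss_rate_reduction


# Hypothetical figures: the baseline LLC takes 40 cycles and misses 20% of
# the time; the alternative takes 44 cycles but misses 18% of the time.
BASE_HIT, BASE_MISS = 40, 0.20
ALT_HIT, ALT_MISS = 44, 0.18

threshold = break_even_penalty(ALT_HIT - BASE_HIT, BASE_MISS - ALT_MISS)

for penalty in (100, 300):  # assumed DRAM miss penalties, in cycles
    base = amat(BASE_HIT, BASE_MISS, penalty)
    alt = amat(ALT_HIT, ALT_MISS, penalty)
    print(f"penalty={penalty}: baseline={base:.1f} alternative={alt:.1f}")
```

With these made-up numbers the alternative only wins when the miss penalty is above roughly 200 cycles, which is why a falling DRAM penalty weakens the case for spending extra lookup latency to shave the miss rate.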
The paper describes a transparent, physically addressed cache. For a variety of reasons the Mill uses virtually addressed caches, and the virtual-to-physical translation is done by TLB logic in the DRAM memory controller. Leaving the translation in the memory controller does at least a couple of good things: it reduces the number of TLB accesses, and it lets the Mill avoid needless DRAM writes of stack-allocated data belonging to functions that have already exited.
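The point about TLB accesses can be illustrated with a toy count. This is only a sketch under loud assumptions: a single direct-mapped cache, one TLB lookup per translated access, and no modelling of writebacks or evictions (which would also need translation in a real design). The function name `tlb_lookups` and the trace are hypothetical, not anything from the Mill or the paper.

```python
def tlb_lookups(trace, num_lines, translate_before_cache):
    """Count TLB lookups for a trace of cache-line addresses.

    translate_before_cache=True models a physically addressed cache, where
    every access is translated before (or alongside) the cache lookup.
    translate_before_cache=False models a virtually addressed cache with
    the TLB at the memory controller, so only misses are translated.
    """
    cache = [None] * num_lines  # toy direct-mapped cache of line addresses
    lookups = 0
    for line in trace:
        idx = line % num_lines
        hit = cache[idx] == line
        if translate_before_cache or not hit:
            lookups += 1
        if not hit:
            cache[idx] = line  # fill on miss
    return lookups


# A small loop that keeps touching the same four lines.
trace = [0, 1, 2, 3] * 10

print(tlb_lookups(trace, 8, translate_before_cache=True))   # 40
print(tlb_lookups(trace, 8, translate_before_cache=False))  # 4
```

In this toy trace the four lines miss once each and hit afterwards, so translating only on the way to DRAM needs 4 lookups instead of 40.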
The paper also describes the tag structure as an initial 4-way hashed tag lookup, with the tags having N-way associativity to the data memory they map to, followed by a hash chain to accommodate hash collisions. The presumption is that the improvement in cache miss rate is large enough to make the hash chain traversal worth the added latency, but only a simple timing model is cited as evidence. Using trace analysis, this design is then compared against a 4-way associative cache with LRU replacement. While we could use a 4-way associative LRU cache as a Mill’s LLC, we would more likely use an 8- or 16-way associative LLC with pseudo-random replacement, which has been shown to cause less replacement thrashing than LRU at 8- or 16-way associativity. We would also recommend on-chip DRAM controllers for Mill processors, with DRAM access times well under the “Break-even miss latency” given in the last column of Table 3 in the paper.
I could go on, but I have spent enough time looking at this to conclude that it is an interesting path, but not a likely one to either higher performance or lower power than other, more conventional LLC approaches.
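As an illustration of the replacement thrashing mentioned above, here is a small set-associative cache model comparing LRU with random replacement on a deliberately pathological pattern: cycling through one more line than a set has ways. The class `SetAssocCache`, the helper `run`, the seed, and the sizes are all hypothetical. This shows the mechanism only; it is not evidence about real workloads and has nothing to do with the paper's traces.

```python
import random
from collections import OrderedDict


class SetAssocCache:
    """Toy set-associative cache that tracks only hits and misses."""

    def __init__(self, num_sets, ways, policy="lru", seed=0):
        self.num_sets = num_sets
        self.ways = ways
        self.policy = policy            # "lru" or "random"
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, line_addr):
        lines = self.sets[line_addr % self.num_sets]
        tag = line_addr // self.num_sets
        if tag in lines:
            self.hits += 1
            if self.policy == "lru":
                lines.move_to_end(tag)  # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            if self.policy == "lru":
                lines.popitem(last=False)  # evict least recently used
            else:
                del lines[self.rng.choice(list(lines))]  # evict at random
        lines[tag] = True
        return False


def run(policy, ways=8, num_sets=64, rounds=100):
    cache = SetAssocCache(num_sets, ways, policy)
    # ways + 1 distinct lines that all map to set 0, touched cyclically
    pattern = [i * num_sets for i in range(ways + 1)]
    for _ in range(rounds):
        for addr in pattern:
            cache.access(addr)
    return cache.hits, cache.misses


for policy in ("lru", "random"):
    hits, misses = run(policy)
    print(f"{policy}: hits={hits} misses={misses}")
```

On this pattern LRU always evicts exactly the line that is needed next, so it never hits, while random replacement keeps some of the working set resident.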