There are several differences in approach, and none of them give the LLC (Last Level Cache) described in that paper an advantage over the cache we had envisioned for the Mill Architecture.
The biggest justification for this cache methodology rests on the assumption that DRAM access times will remain at a high multiple of the processor clock. But Dennard scaling has slowed dramatically, leaving processor core clock rates not much higher than they were in 2004, while DRAM clock rates have continued to climb since then, shrinking LLC miss penalties measured in core clocks – especially with on-chip memory controllers.
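To make that concrete, here is some back-of-the-envelope arithmetic; the latency and clock figures are my own illustrative assumptions, not measurements:

```python
# Illustrative arithmetic only: an assumed ~60 ns round-trip DRAM access
# (on-chip controller) against an assumed ~3.5 GHz core clock.
dram_latency_ns = 60.0   # assumed DRAM round-trip latency
core_clock_ghz = 3.5     # assumed core clock rate

miss_penalty_cycles = dram_latency_ns * core_clock_ghz
print(f"LLC miss penalty: ~{miss_penalty_cycles:.0f} core clocks")
# With core clocks roughly flat since ~2004 and DRAM latency falling,
# this multiple shrinks, weakening the paper's central assumption.
```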
The paper describes a transparent, physically addressed cache. For a variety of reasons the Mill uses virtually addressed caches, with the virtual-to-physical translation done by TLB logic in the DRAM memory controller. Putting the translation there does at least two good things: it reduces the number of TLB accesses, since hits anywhere in the cache hierarchy need no translation at all, and it lets the Mill suppress needless DRAM writebacks of stack-allocated data belonging to functions that have already exited.
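For illustration only, here is a minimal sketch of that idea; every name in it is hypothetical and it resembles nothing in real Mill hardware. It just shows why a virtually addressed cache with translation at the memory controller saves TLB lookups and can drop writebacks of dead stack data:

```python
# Hypothetical sketch: hits are served purely on virtual addresses, and a
# writeback of a line in a dead (exited) stack frame is dropped before it
# ever reaches the TLB or DRAM.

LINE = 64  # assumed cache line size in bytes

class MemoryController:
    """Owns the TLB; translation happens only when a line reaches DRAM."""
    def __init__(self):
        self.tlb_lookups = 0
        self.dram_writes = 0

    def writeback(self, vaddr, dead_stack):
        if dead_stack(vaddr):
            return              # dead stack data: no translation, no DRAM write
        self.tlb_lookups += 1   # translate virtual -> physical here, once per line
        self.dram_writes += 1

class VirtualCache:
    """Virtually addressed: hits never consult the TLB."""
    def __init__(self, mc):
        self.lines = {}         # virtual line number -> dirty flag
        self.mc = mc

    def store(self, vaddr):
        self.lines[vaddr // LINE] = True   # hit or fill; no TLB involved

    def evict(self, vaddr, dead_stack):
        if self.lines.pop(vaddr // LINE, False):
            self.mc.writeback(vaddr, dead_stack)

# Usage: a frame at [0x1000, 0x2000) has exited, so its dirty lines vanish.
mc = MemoryController()
c = VirtualCache(mc)
in_dead_frame = lambda va: 0x1000 <= va < 0x2000
c.store(0x1040)                          # dirty line in the dead frame
c.evict(0x1040, in_dead_frame)
print(mc.tlb_lookups, mc.dram_writes)    # 0 0 -> no translation, no DRAM write
c.store(0x9000)                          # dirty line in live data
c.evict(0x9000, in_dead_frame)
print(mc.tlb_lookups, mc.dram_writes)    # 1 1 -> one translation for the line
```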
The paper also describes the tag structure as an initial 4-way hashed tag lookup, with the tags N-way associative to the data memory they map to, followed by a hash chain to accommodate hash collisions. The presumption is that the improvement in miss rate makes the hash-chain traversal worth the added latency, but only a simple timing model is cited as evidence. This design is then compared, using trace analysis, against a 4-way associative cache with LRU replacement. While we could use a 4-way associative LRU cache as a Mill’s LLC, more likely would be an 8- or 16-way associative LLC with pseudo-random replacement; pseudo-random replacement has been shown to suffer less replacement thrashing than LRU at those associativities, as sketched below. We also would recommend on-chip DRAM controllers for Mill processors, with DRAM access times well under the “Break-even miss latency” given in the last column of Table 3 in the paper.
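That thrashing difference is easy to demonstrate with a toy trace-driven model; this is my own sketch, not the paper’s methodology, and the trace and parameters are invented. A cyclic working set one line wider than an 8-way set defeats LRU completely, while pseudo-random replacement breaks the pathological eviction pattern:

```python
# Toy model: count misses for LRU vs. pseudo-random replacement in a
# set-associative cache, driven by an invented cyclic trace.
import random
from collections import OrderedDict

def misses(trace, sets, ways, policy):
    cache = [OrderedDict() for _ in range(sets)]  # per-set lines, in LRU order
    rng = random.Random(42)
    miss = 0
    for line in trace:
        s = cache[line % sets]
        if line in s:
            if policy == "lru":
                s.move_to_end(line)            # refresh recency on a hit
        else:
            miss += 1
            if len(s) >= ways:
                if policy == "lru":
                    s.popitem(last=False)      # evict least recently used
                else:
                    del s[rng.choice(list(s))] # evict a pseudo-random way
            s[line] = None
    return miss

# 9 hot lines cycling through one 8-way set: LRU misses on every access,
# while random replacement hits most of the time.
trace = list(range(9)) * 1000
for policy in ("lru", "random"):
    print(policy, misses(trace, sets=1, ways=8, policy=policy))
```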
I could go on, but I have spent enough time looking at this to conclude that, while interesting, it is not a likely path to either higher performance or lower power than other, more conventional LLC approaches.