Mill Computing, Inc. › Forums › The Mill › Architecture › New cache design speeds up processing time by 15 percent › Reply To: New cache design speeds up processing time by 15 percent
Sorry for the delay in responding to this: last week was the Security presentation a Google, and prep for presentations takes priority. The paper also takes some thought.
The paper and its proposals mostly address the problem of locality distribution in caches: some lines, once references, will be hit over and over again, while others will be one-shots. When the one-shots crowd out the heavily used (which can occur when the working set exceeds the top level cache) the result is cache thrashing and poor performance. Thrashing is especially a problem in distributed caches. The paper proposes a predictor mechanism that guides cache placement and lifetime decisions of individual lines.
There are several Mill aspects that address this problem; none of them involve a predictor mechanism as proposed.
The simplest Mill mechanism is that the Mill does not automatically load a missing line to the TLC. Instead, a miss hoists a line a single level from wherever it is found. Recall that a Mill load may be satisfied piecemeal from several different levels of the hierarchy, using the valid bits attached to each byte in the caches. (See the Memory talk). Each copy of the line that supplies at least one byte to the result of a load is automatically hoisted; if the destination level already contains a copy of the line (necessarily with disjoint validity) then the existing and incoming lines are merged.
The delivery of loaded bytes to the retire station is completely separate from this hoisting process, and uses distinct data paths. The hoist is purely an optimization, so if there is contention at the destination level then the hoist can simply be abandoned without error.
Mill hoisting thus provides much of what the authors set out to achieve: a difference in handling between high- and low-locality lines. In a three-level system (L1, L2, DRAM), a first load will hoist a line from DRAM to L2, and it takes a second load to hoist the line to L1. A line loaded twice during its L2 residency lifetime is a good candidate for high-localityaccess.
The drawback of the Mill approach is that if the second load immediately follows the first then the second must take an L2 latency, whereas if the line were brought in to the TLC by the first load then a second load would have only a L1 latency to pay. The details of the timing are implementation- and member-dependent, but in the lazy hoisting used by the Mill the total latency of moving a line up twice is quite close to the latency of moving the line to the top at once; the latencies are dominated by wire times, and the total distance is the same.
The description above is couched in terms of a monocore, whereas the paper is concerned about distributed caches, both on- and off-chip. In distributed systems, the cost differential between high- and low-locality lines is greater than for a non-distributed system, but the benefit of the Mill approach remains. In effect, the Mill sees a remote cache fragment as just another level. For example, if the L2 is distributed, then the first load that hits in a remote L2 will hoist the line to the local L2, just as a hit in DRAM will be hoisted to the local L2. It takes a second load hit (now in the local L2) to get the line into the (local and private) L1.
A second aspect to the Mill that addresses the paper is the cache-byte valid bits. A large amount of the inter-cache traffic in systems such as those modeled in the paper is due to false sharing among threads. False sharing arises from the line-granularity of conventional coherence protocols. The Mill uses byte-granularity (supported by those valid bits) and so is immune to false sharing: threads in different cores can use disjoint parts of a single line without the ping-ponging that is the bane of conventional shared memory. As a result, there are substantially fewer loads of new lines for which residency decisions must be made: if a line ping-pongs ten times then that’s ten look-ups in the paper’s predictor, whereas in the Mill it is just two hoists and no further.
The paper makes much of avoiding full line moves for low-locality data; they return a single word. This “benefit” is obviated in modern machines with vector units working on data that approaches or equals cache line size: a cache-line datapath is required, so there is no gain (other than power, if that) in fetching narrower data. Line-width is also needed for code. Of course, the paper’s predictor avoids the power consumption of Mill hoisting on predicted low locality lines, but in exchange has the power consumption of the predictor itself; that’s a tradeoff that would require measurement to evaluate, but I doubt the predictor is a power win against the simple Mill hoist mechanism.
One aspect of the paper that confounds this analysis is that the paper assumes a directory-based distribution. I suspect that the choice was motivated more by the existence of a prior simulator implementation than for design reasons, but no matter. Directories are frequently used in large-scale (multi-chip and frequently multi-rack or multi-room) systems where fabric is a major cost, but they are a less obvious choice for on-chip fabrics such as those contemplated for the Mill. While we have not put a lot of though into multi-core Mills yet, I expect that we will use a broadcast fabric with snoop filters rather than a directory; the coice will depend on the chip fab technology available when we do a multi-core, which is several years out and need not be decided now. However, the costs and benefits of the proposal clearly depend heavily on the cache addressing and coherence mechanism used; the relative merit of the proposal in its application to the Mill cannot yet be fully determined.
Nevertheless, in light of the above (and a bit more NYF), I don’t think that the author’s proposals would be of value on a Mill; we’re there already, and mostly we simply don’t have the problem anyway and so have no need for a fix. (That’s a tune I find myself singing frequently 🙂 ).
Going a bit further than the specific application to the Mill though, I am somewhat doubtful about the paper’s proposals even when applied to a conventional machine. This is little more than a gut feeling, and I have no numbers to back it up. However, I think they underestimate the cost of their proposal, in power, area and latency. In addition, I rather strongly suspect that the simulation they use has the effect of hiding internal dependencies that would be visible in hardware; the proposal looks to invite race conditions, and alleviating those might eat up all the apparent gain.