- Ivan Godard (Keymaster) | May 5, 2016 at 8:16 pm | Post count: 607
The notion of “volatile” is not restricted to C, and it has different semantics in different languages. In addition, the hardware implementation accessed through “volatile” varies across architectures, and even between machines within an architecture. See https://en.wikipedia.org/wiki/Volatile_%28computer_programming%29 for a good informal introduction.
As we do hardware, we must support facilities that can be used for what the various languages specify, and if possible what some important other architectures actually do. There need not be a one-to-one language-to-hardware mapping (and in fact there is not), but the hardware must provide somehow what the languages define.
There are three aspects that need addressing: atomicity; active data; and ordering.
The “volatile” keyword does not guarantee atomicity; that role has been taken over by sig_atomic_t (and, more recently, the standard atomic types). The Mill hardware uses optimistic concurrency with a set of transactional operations. We haven’t yet looked at how those map to the compilers, but our facilities can model LLSC, and architectures that provide LLSC (such as PowerPC) have standard-conforming implementations of atomicity, so we know that we can have one too. And more, because we also model DLDSC and richer transactional semantics as well. Even though “volatile” is not guaranteed to be atomic by the languages, it is atomic on many implementations, and many programs depend on that atomicity. We currently guarantee atomicity for simple loads and stores, except when an unaligned access happens to cross a cache line boundary; such accesses are double-pumped, and suitable code can see half an update. It is not clear whether we should add a hardware guarantee for this case too; that question has been deferred.
The “volatile” keyword is traditionally used for active data, in which the hardware act of access has side effects. Because of the side effects, an access to active data cannot be speculated nor omitted (optimized away) nor duplicated. Speculation and omission might occur in the compiler tool chain, and speculation and duplication might occur in the hardware. The altered semantics of a volatile access is indicated to the hardware via “attributed” load and store operations, which carry a set of flag bits in the instruction that collectively indicate the desired kind of special handling. One of those flags is the “volatileDataCode” flag; others are “noCacheCode” and “atomicCode”.
In the Mill, the hardware can speculate or duplicate regular loads but not stores. A particular regular load may be reissued several times. The loads may be arbitrarily intermixed with stores, and we define that the observed result is the value as of the retire of the load, rather than as of its issue as on other machines. This makes regular loads unsuitable for active data. The Mill hardware is also write-back cached, so there is no guarantee that regular loads or stores will be visible at any level of the cache hierarchy other than the top-level cache. This too rules out regular loads and stores for active data, which generally sits beyond the caches and beyond the controller, if not off chip entirely.
The use of any attributed load or store precludes the deferred forms of load; there’s no encoding that is both deferred and attributed. However, a non-deferred load is not instantaneous, and there will be a member-dependent number of cycles between the load’s issue and its retire. During that time the program might issue a call operation, which can take an arbitrary amount of time to execute; on return from the call the load is reissued.
It would be possible, with suitable compiler cooperation, to prevent execution of explicit calls while waiting for an attributed load to retire. However, interrupts and traps act like involuntary calls on a Mill, so to guarantee against reissue the hardware would have to block interrupts and traps for the duration of the load. This could be difficult if the load itself caused a trap, such as a translation trap in the TLB or a protection trap in the PLB. There are two ways to handle this: either prevent traps/interrupts/calls for the duration of the load, or do not support active-data loads. Currently we do not support active loads, and a device that expects an active load (such as a take-a-number register) must sit behind a hardware shim that reads the active location to a buffer register when the shim is (actively) written to, followed by a normal (inactive, possibly reissued) load of the buffer register.
Stores, both regular and attributed, are never reordered or reissued. Normal stores only go to the top level cache, whereas active data stores must go to the controllers. Consequently active stores must use the “noCacheCode” attribute to ensure that the request actually gets to the active location. These active stores are subject to normal protection rules and may trap on a PLB miss or TLB miss, but are held over a trap and not reissued. Consequently it is guaranteed that exactly one request will reach the active location for each noCache store issued.
The “volatile” keyword in Java defines a strict global ordering among accesses to data declared “volatile”. This ordering must be enforced both within and between threads, and both intra- and inter-core. Such accesses thus act as if they were protected by a mutex, and indeed could be so implemented, with the mutex itself built from the Mill’s primitive optimistic-concurrency operations. However, Mill load and store operations are already strongly ordered within a core, so the added Java requirement of a global ordering can be met by ensuring an ordering between cores; the Mill’s default inter-core model orders accesses to any single location but not across locations, which is weaker than the global ordering Java requires.
To ensure a global order, the hardware must not only preserve the order as seen by a single core, but must also ensure that the inter-core visibility provided by cache coherence also preserves that order. In the hardware, there exists one level of the hierarchy which has strict inter-core ordering with respect to both accesses to a single location and between locations; this coherence level is typically the L2. The whole hierarchy of any one core is ordered with respect to accesses to any one location, but is not ordered with respect to multiple locations in the levels down to the coherence level.
As an illustration, consider core C1 and locations X and Y, where the coherence level is the L2. Any sequence of accesses to X within C1 is strictly ordered (sequential consistency), as is any sequence of accesses to Y. However, none of these accesses are visible to core C2 so long as X and Y remain in the D1 cache. Only when X or Y is evicted from the D1 to the L2 does its new value become visible to C2.
Meanwhile C2 may have its own copy of X and Y in its D1, and be updating them; C2 will see a sequentially consistent sequence of values of its own copy, and C1 will only see C2’s values when C2’s X and Y get evicted to the L2. If C1 and C2 both evict their X and Y to the L2 then the hardware defines an ordering between the two cores. The effect of this protocol is to move the cost of cache coherence to the L2 level (or whichever level has been chosen to be the coherence level), avoiding the overhead of coherence in the much-more-frequently accessed higher level(s).
However, because the evict actions are load-dependent, the actual ordering of evicts is essentially random. There is no guarantee that C2 will see a new value of X from C1 before it sees a new value of Y, nor vice versa. To ensure a program-desired order between the visible updates of X and Y, the program must ensure that those updates reach the coherence level in the desired order. That is the job of the “volatileDataCode” flag: an attributed store with this flag set causes, in effect, a normal store followed by an immediate chain of evicts down to the coherence level, whence other cores will duly receive an order-preserving notice. Two stores with “volatileDataCode” set are guaranteed to reach the L2 in issue order.
Hardware vs. language:
As described above, an access from C or C++ to active data must use the “noCacheCode” flag, and an access from Java to ordering data must use the “volatileDataCode” flag. Java does not support active data and would use JNI or intrinsics if it had to. C++ disparages use of volatile for ordering; what we will do with sig_atomic_t is still unclear, but using “volatileDataCode” is plausible.
The C use of volatile is a problem. If a particular access is intended to be for active data then we must use “noCacheCode” so the access will go to the controllers and likely off-chip. If it is for ordering then we must use “volatileDataCode” and the access only needs to go as far as the L2. If we don’t know the intent, we must use both “noCacheCode” and “volatileDataCode” to be safe, and the program doing ordering will pay the cost of an unnecessary DRAM trip for every access. There doesn’t seem to be a language-standard way to distinguish the uses; perhaps we should introduce a pragma.