I have no more expertise in timing attacks than you have, and have put no thought into it. However, after reading the paper (thank you for the citation) I think that well-written software on the Mill will be less subject to such attacks than software on conventional out-of-order CPUs.
Thinking about data dependent timing variations (the crux of the attack), there seem to be three sorts: data-dependent control flow, principally branches; data-dependent data accesses, principally loads from S-box arrays; and data-dependent execution speeds, principally execution issue order.
Clearly the last (issue ordering) does not apply to an in-order exposed-pipeline architecture such as a Mill. Absent external stalls (such as cache misses), Mill code will run in program order lock-step and there should be no variations at all.
The Mill branch prediction and instruction cache is subject to data-dependent timing variation. However, branches can be removed on a Mill and replaced by the pick operation, which is constant time. When fully unrolled, even the outer loop branch of AES goes away. The AES is small enough to fit in the i$1, so if the branch predictor predicts entry into the AES (or if it doesn’t, the mispredict recovery logic) then the prefetcher will bring the whole algorithm into cache as a single EBB. This prefetch may be subject to variations of timing depending on where the individual code lines must be obtained from, but this variation is not data dependent.
That leaves the S-box array lookup timing. These arrays are small enough to fit in a modern d$1 cache, so one obvious approach is simply to have a pre-loop that loads from each cache line of the array, bringing them all to cache and thereby removing the timing variation in the actual data-dependent lookups. However, future crypto algorithms may need larger tables that don’t fit in cache and so cannot be preloaded.
Another possibility is to avoid use of the top-level cache entirely, and keep the tables in the D$2, which is certainly big enough to hold them. Data can be forced to reside at the join-point of the cache hierarchy (which is the D$2 on our current configurations, but on any configuration will be big enough). Mill memory-reference operations carry a small set of special-handling flag bits to control per-request cache behavior. One of those flags is “volatileData”, which forces the request to bypass private caches and go to the top shared cache.
A volatileData request will be slower than a normal load, because it will suffer a d$2 latency (~10 cycles, member dependent) vs. a d$1 latency (~3 cycles, member dependent). However, I suspect that most or all of this latency can be hidden by use of the Mill deferred loads, overlapped with the rest of computation.
The above are software solutions not requiring hardware configuration. In hardware, we have long expected that hard-real-time applications would want to have constant-time access memory (SRAM) on the chip. On other RT-oriented chips this is provided by either a NUMA segment or by pinning either a fixed portion of cache or a set of lines within the cache. Either solution could be, and quite possibly will be, configured in Mill chips for the RT market. However, AES will also be run on non-RT Mill chips, so this is not a general solution. Likewise, though it would be possible to configure a dedicated AES functional unit or co-processor, I doubt that would be present on every Mill.
So after thinking about the issue, I conclude: for ultra-high-speed crypto one should select a Mill with a hardware AES box with its tables in dedicated fixed-latency SRAM, and for general-purpose use one should declare the tables as “volatile” and write branchless code for the algorithm.
Thank you for bringing this use-case to our attention. We had not considered it before, and it is gratifying to see that a vanilla Mill appears to offer a practical solution to the troubles that the case gives a conventional CPU.