Mill Computing, Inc. › Forums › The Mill › Architecture › Side channel timing attacks
- squizzleParticipantMarch 9, 2014 at 4:31 amPost count: 5#731 |
I’m wondering if there’s any consideration put in to protecting against side-channel timing attacks in the mill, either through dedicated instructions for things such as AES or some other magic/advanced tech.
With a statically scheduled instruction set it’s easier to write constant-time code, but things like a cache miss could still possibly give you a bad time. As I understand it, the abstract MI also hides exact cycle timing (or does it?), so matching cycle counts on different code paths is more complicated.
djb has some advice for CPU designers in his paper, which inspired the question (http://cr.yp.to/antiforgery/cachetiming-20050414.pdf)
I’m hardly an expert in CPU design, cryptography, or any other fun stuff like that, so I don’t fully know if the questions are even relevant to the mill, but thought I’d ask as side-channel questions might come up in the security talk.
- Ivan GodardKeymasterMarch 9, 2014 at 12:30 pmPost count: 679
I have no more expertise in timing attacks than you have, and have put no thought into it. However, after reading the paper (thank you for the citation) I think that well-written software on the Mill will be less subject to such attacks than software on conventional out-of-order CPUs.
Thinking about data dependent timing variations (the crux of the attack), there seem to be three sorts: data-dependent control flow, principally branches; data-dependent data accesses, principally loads from S-box arrays; and data-dependent execution speeds, principally execution issue order.
Clearly the last (issue ordering) does not apply to an in-order exposed-pipeline architecture such as a Mill. Absent external stalls (such as cache misses), Mill code will run in program order lock-step and there should be no variations at all.
The Mill branch prediction and instruction cache is subject to data-dependent timing variation. However, branches can be removed on a Mill and replaced by the pick operation, which is constant time. When fully unrolled, even the outer loop branch of AES goes away. The AES is small enough to fit in the i$1, so if the branch predictor predicts entry into the AES (or if it doesn’t, the mispredict recovery logic) then the prefetcher will bring the whole algorithm into cache as a single EBB. This prefetch may be subject to variations of timing depending on where the individual code lines must be obtained from, but this variation is not data dependent.
That leaves the S-box array lookup timing. These arrays are small enough to fit in a modern d$1 cache, so one obvious approach is simply to have a pre-loop that loads from each cache line of the array, bringing them all to cache and thereby removing the timing variation in the actual data-dependent lookups. However, future crypto algorithms may need larger tables that don’t fit in cache and so cannot be preloaded.
Another possibility is to avoid use of the top-level cache entirely, and keep the tables in the D$2, which is certainly big enough to hold them. Data can be forced to reside at the join-point of the cache hierarchy (which is the D$2 on our current configurations, but on any configuration will be big enough). Mill memory-reference operations carry a small set of special-handling flag bits to control per-request cache behavior. One of those flags is “volatileData”, which forces the request to bypass private caches and go to the top shared cache.
A volatileData request will be slower than a normal load, because it will suffer a d$2 latency (~10 cycles, member dependent) vs. a d$1 latency (~3 cycles, member dependent). However, I suspect that most or all of this latency can be hidden by use of the Mill deferred loads, overlapped with the rest of computation.
The above are software solutions not requiring hardware configuration. In hardware, we have long expected that hard-real-time applications would want to have constant-time access memory (SRAM) on the chip. On other RT-oriented chips this is provided by either a NUMA segment or by pinning either a fixed portion of cache or a set of lines within the cache. Either solution could be, and quite possibly will be, configured in Mill chips for the RT market. However, AES will also be run on non-RT Mill chips, so this is not a general solution. Likewise, though it would be possible to configure a dedicated AES functional unit or co-processor, I doubt that would be present on every Mill.
So after thinking about the issue, I conclude: for ultra-high-speed crypto one should select a Mill with a hardware AES box with its tables in dedicated fixed-latency SRAM, and for general-purpose use one should declare the tables as “volatile” and write branchless code for the algorithm.
Thank you for bringing this use-case to our attention. We had not considered it before, and it is gratifying to see that a vanilla Mill appears to offer a practical solution to the troubles that the case gives a conventional CPU.
- ArtParticipantMarch 9, 2014 at 1:24 pmPost count: 8
Yes, this is good stuff indeed! I also read the paper, and the conclusions on standard CPUs are scary. I also note that there is specific AES hardware on current Intel E3, E5, and E7 series server processors – likely for both performance and vulnerability concerns, although I have no specific knowledge of an Intel claim that their AES hardware reduces vulnerability to timing encryption cracking.
I also noted in the paper that the tables could be eliminated through the direct use of the operations the tables are meant to replace, which on a Mill could actually be faster than a table lookup (or sufficiently fast), and certainly could be fixed latency. I have not looked at the complexity of AES in detail to see if that is indeed the case, it might not be. The AES “performance contest” was held on the CPUs of the day, the Mill has characteristics that may lead to different implementations being optimal.
As for a hardware AES box for the Mill, I suspect that dedicated (dynamically configurable) hardware to compute the algorithm may well be a better implementation than using tables in fixed-latency SRAM. At least such an implementation should be investigated, rather than assuming the implementation for old CPU ISA’s will also be optimal for direct hardware.
You must be logged in to reply to this topic.