Forum Replies Created
Sorry for the delay in responding to this: last week was the Security presentation at Google, and prep for presentations takes priority. The paper also takes some thought.
The paper and its proposals mostly address the problem of locality distribution in caches: some lines, once referenced, will be hit over and over again, while others will be one-shots. When the one-shots crowd out the heavily used (which can occur when the working set exceeds the top-level cache) the result is cache thrashing and poor performance. Thrashing is especially a problem in distributed caches. The paper proposes a predictor mechanism that guides cache placement and lifetime decisions for individual lines.
There are several Mill aspects that address this problem; none of them involve a predictor mechanism as proposed.
The simplest Mill mechanism is that the Mill does not automatically load a missing line to the TLC. Instead, a miss hoists a line a single level from wherever it is found. Recall that a Mill load may be satisfied piecemeal from several different levels of the hierarchy, using the valid bits attached to each byte in the caches. (See the Memory talk). Each copy of the line that supplies at least one byte to the result of a load is automatically hoisted; if the destination level already contains a copy of the line (necessarily with disjoint validity) then the existing and incoming lines are merged.
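To make the per-byte bookkeeping concrete, here is a small C model (purely illustrative, not the hardware implementation; the 64-byte line size is an assumption) of merging a hoisted, partially valid copy of a line into a copy already resident at the destination level:

```c
/* Conceptual model only: merging a hoisted, partially valid copy of a line
 * into a copy already resident at the destination level.  Per-byte valid
 * bits guarantee the two copies are disjoint.  The 64-byte line size is an
 * assumption, not a Mill parameter. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64

typedef struct {
    uint8_t data[LINE_BYTES];
    uint8_t valid[LINE_BYTES];   /* 1 if this byte holds real data */
} line_copy;

static void merge_line(line_copy *resident, const line_copy *incoming)
{
    for (int i = 0; i < LINE_BYTES; i++) {
        if (incoming->valid[i]) {
            assert(!resident->valid[i]);   /* validity must be disjoint */
            resident->data[i]  = incoming->data[i];
            resident->valid[i] = 1;
        }
    }
}

int main(void)
{
    line_copy l2 = {0}, hoisted = {0};
    l2.data[0]      = 0xAA; l2.valid[0]      = 1;   /* L2 already had byte 0 */
    hoisted.data[8] = 0xBB; hoisted.valid[8] = 1;   /* DRAM supplied byte 8  */
    merge_line(&l2, &hoisted);
    printf("byte0=%02x byte8=%02x\n", l2.data[0], l2.data[8]);
    return 0;
}
```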
The delivery of loaded bytes to the retire station is completely separate from this hoisting process, and uses distinct data paths. The hoist is purely an optimization, so if there is contention at the destination level then the hoist can simply be abandoned without error.
Mill hoisting thus provides much of what the authors set out to achieve: a difference in handling between high- and low-locality lines. In a three-level system (L1, L2, DRAM), a first load will hoist a line from DRAM to L2, and it takes a second load to hoist the line to L1. A line loaded twice during its L2 residency lifetime is a good candidate for high-locality access.
The drawback of the Mill approach is that if the second load immediately follows the first then the second must take an L2 latency, whereas if the line were brought into the TLC by the first load then a second load would have only an L1 latency to pay. The details of the timing are implementation- and member-dependent, but with the lazy hoisting used by the Mill the total latency of moving a line up twice is quite close to the latency of moving the line to the top at once; the latencies are dominated by wire times, and the total distance is the same.
The description above is couched in terms of a monocore, whereas the paper is concerned about distributed caches, both on- and off-chip. In distributed systems, the cost differential between high- and low-locality lines is greater than for a non-distributed system, but the benefit of the Mill approach remains. In effect, the Mill sees a remote cache fragment as just another level. For example, if the L2 is distributed, then the first load that hits in a remote L2 will hoist the line to the local L2, just as a hit in DRAM will be hoisted to the local L2. It takes a second load hit (now in the local L2) to get the line into the (local and private) L1.
A second aspect to the Mill that addresses the paper is the cache-byte valid bits. A large amount of the inter-cache traffic in systems such as those modeled in the paper is due to false sharing among threads. False sharing arises from the line-granularity of conventional coherence protocols. The Mill uses byte-granularity (supported by those valid bits) and so is immune to false sharing: threads in different cores can use disjoint parts of a single line without the ping-ponging that is the bane of conventional shared memory. As a result, there are substantially fewer loads of new lines for which residency decisions must be made: if a line ping-pongs ten times then that’s ten look-ups in the paper’s predictor, whereas in the Mill it is just two hoists and no further.
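For readers unfamiliar with the pattern, here is the classic false-sharing layout in plain C11 threads; on a line-granular protocol the two counters share a (typically 64-byte) line and ping-pong it between cores, while byte-granular validity lets the disjoint writes proceed independently. This is an illustration of the problem, not Mill code:

```c
/* The classic false-sharing pattern, in C11 threads.  The two counters are
 * written by different threads but (on a typical 64-byte line) live in the
 * same cache line, so a line-granular coherence protocol ping-pongs the line;
 * byte-granular validity would let the disjoint writes proceed independently.
 * Illustration only, not Mill code. */
#include <stdatomic.h>
#include <threads.h>

struct counters {
    _Atomic long a;   /* written only by thread A */
    _Atomic long b;   /* written only by thread B; same line as a */
};

static struct counters s;

static int worker_a(void *arg) { (void)arg; for (long i = 0; i < 1000000; i++) s.a++; return 0; }
static int worker_b(void *arg) { (void)arg; for (long i = 0; i < 1000000; i++) s.b++; return 0; }

int main(void)
{
    thrd_t ta, tb;
    thrd_create(&ta, worker_a, NULL);
    thrd_create(&tb, worker_b, NULL);
    thrd_join(ta, NULL);
    thrd_join(tb, NULL);
    return 0;
}
```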
The paper makes much of avoiding full line moves for low-locality data; they return a single word. This “benefit” is obviated in modern machines with vector units working on data that approaches or equals cache line size: a cache-line datapath is required, so there is no gain (other than power, if that) in fetching narrower data. Line-width is also needed for code. Of course, the paper’s predictor avoids the power consumption of Mill hoisting on predicted low locality lines, but in exchange has the power consumption of the predictor itself; that’s a tradeoff that would require measurement to evaluate, but I doubt the predictor is a power win against the simple Mill hoist mechanism.
One aspect of the paper that confounds this analysis is that the paper assumes a directory-based distribution. I suspect that the choice was motivated more by the existence of a prior simulator implementation than by design reasons, but no matter. Directories are frequently used in large-scale (multi-chip and frequently multi-rack or multi-room) systems where fabric is a major cost, but they are a less obvious choice for on-chip fabrics such as those contemplated for the Mill. While we have not put a lot of thought into multi-core Mills yet, I expect that we will use a broadcast fabric with snoop filters rather than a directory; the choice will depend on the chip fab technology available when we do a multi-core, which is several years out and need not be decided now. However, the costs and benefits of the proposal clearly depend heavily on the cache addressing and coherence mechanism used; the relative merit of the proposal in its application to the Mill cannot yet be fully determined.
Nevertheless, in light of the above (and a bit more NYF), I don't think that the authors' proposals would be of value on a Mill; we're there already, and mostly we simply don't have the problem anyway and so have no need for a fix. (That's a tune I find myself singing frequently :-) )
Going a bit further than the specific application to the Mill though, I am somewhat doubtful about the paper’s proposals even when applied to a conventional machine. This is little more than a gut feeling, and I have no numbers to back it up. However, I think they underestimate the cost of their proposal, in power, area and latency. In addition, I rather strongly suspect that the simulation they use has the effect of hiding internal dependencies that would be visible in hardware; the proposal looks to invite race conditions, and alleviating those might eat up all the apparent gain.
- in reply to: Side channel timing attacks #733
I have no more expertise in timing attacks than you have, and have put no thought into it. However, after reading the paper (thank you for the citation) I think that well-written software on the Mill will be less subject to such attacks than software on conventional out-of-order CPUs.
Thinking about data dependent timing variations (the crux of the attack), there seem to be three sorts: data-dependent control flow, principally branches; data-dependent data accesses, principally loads from S-box arrays; and data-dependent execution speeds, principally execution issue order.
Clearly the last (issue ordering) does not apply to an in-order exposed-pipeline architecture such as a Mill. Absent external stalls (such as cache misses), Mill code will run in program order lock-step and there should be no variations at all.
The Mill branch prediction and instruction cache are subject to data-dependent timing variation. However, branches can be removed on a Mill and replaced by the pick operation, which is constant time. When fully unrolled, even the outer loop branch of AES goes away. AES is small enough to fit in the i$1, so when the branch predictor predicts entry into the AES (or, on a mispredict, the recovery logic kicks in), the prefetcher will bring the whole algorithm into cache as a single EBB. This prefetch may be subject to timing variation depending on where the individual code lines must be obtained from, but that variation is not data dependent.
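As an aside, the effect of pick can be approximated in portable C with a constant-time select in place of a data-dependent branch; this sketch is an illustration of the idea, not Mill code:

```c
/* Branch removal via a constant-time select, the same idea as the Mill pick
 * operation but written in portable C for illustration. */
#include <stdint.h>
#include <stdio.h>

/* Return (cond ? a : b) without a data-dependent branch. */
static uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);   /* all-ones or zero */
    return (a & mask) | (b & ~mask);
}

int main(void)
{
    printf("%u %u\n", ct_select(1, 10, 20), ct_select(0, 10, 20));   /* 10 20 */
    return 0;
}
```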
That leaves the S-box array lookup timing. These arrays are small enough to fit in a modern d$1 cache, so one obvious approach is simply to have a pre-loop that loads from each cache line of the array, bringing them all to cache and thereby removing the timing variation in the actual data-dependent lookups. However, future crypto algorithms may need larger tables that don’t fit in cache and so cannot be preloaded.
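A minimal sketch of such a pre-loop, in plain C; the 64-byte line size and the zero-filled stand-in table are assumptions for illustration:

```c
/* Sketch of the pre-loop: touch one byte per cache line of the table before
 * the data-dependent lookups, so those lookups all hit in the top-level
 * cache.  The 64-byte line size and the zero-filled stand-in table are
 * assumptions for illustration. */
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64

static const uint8_t sbox[256] = {0};   /* stand-in for the real S-box */

static void preload_table(const uint8_t *table, size_t bytes)
{
    volatile uint8_t sink = 0;
    for (size_t off = 0; off < bytes; off += LINE_BYTES)
        sink = table[off];              /* one touch per cache line */
    (void)sink;
}

int main(void)
{
    preload_table(sbox, sizeof sbox);
    /* ... data-dependent S-box lookups follow, now constant-latency ... */
    return 0;
}
```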
Another possibility is to avoid use of the top-level cache entirely, and keep the tables in the D$2, which is certainly big enough to hold them. Data can be forced to reside at the join-point of the cache hierarchy (which is the D$2 on our current configurations, but on any configuration will be big enough). Mill memory-reference operations carry a small set of special-handling flag bits to control per-request cache behavior. One of those flags is “volatileData”, which forces the request to bypass private caches and go to the top shared cache.
A volatileData request will be slower than a normal load, because it will suffer a d$2 latency (~10 cycles, member dependent) vs. a d$1 latency (~3 cycles, member dependent). However, I suspect that most or all of this latency can be hidden by use of the Mill deferred loads, overlapped with the rest of computation.
The above are software solutions not requiring hardware configuration. In hardware, we have long expected that hard-real-time applications would want constant-access-time memory (SRAM) on the chip. On other RT-oriented chips this is provided either by a NUMA segment or by pinning a fixed portion of the cache, or a set of lines within it. Either solution could be, and quite possibly will be, configured in Mill chips for the RT market. However, AES will also be run on non-RT Mill chips, so this is not a general solution. Likewise, though it would be possible to configure a dedicated AES functional unit or co-processor, I doubt that would be present on every Mill.
So after thinking about the issue, I conclude: for ultra-high-speed crypto one should select a Mill with a hardware AES box with its tables in dedicated fixed-latency SRAM, and for general-purpose use one should declare the tables as “volatile” and write branchless code for the algorithm.
Thank you for bringing this use-case to our attention. We had not considered it before, and it is gratifying to see that a vanilla Mill appears to offer a practical solution to the troubles that the case gives a conventional CPU.
- in reply to: Prediction #822
Predictions are loaded only when we have no prediction for the call, and the load runs in parallel with the miss processing, so it will necessarily result in faster execution than simply missing and building up experience without loading.
The prediction table holds the target address of a prediction as an offset from the entry address of the caller EBB (the same addressing that branch/call uses). The offsets are within a single load module, which has bounded size much smaller than the total address space; that way we need only 20-30 bits per address in the table, rather than 60 bits for an arbitrary address.
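A quick back-of-the-envelope check of the bit budget, with an assumed (not Mill-defined) module bound of 256 MB:

```c
/* Back-of-the-envelope bit budget: offsets within a bounded load module need
 * far fewer bits than an arbitrary 60-bit address.  The 256 MB module bound
 * is an assumed example, not a Mill limit. */
#include <stdio.h>

static unsigned offset_bits(unsigned long long module_bytes)
{
    unsigned bits = 0;
    while ((1ULL << bits) < module_bytes)
        bits++;
    return bits;
}

int main(void)
{
    printf("offset bits for a 256 MB module: %u\n", offset_bits(256ULL << 20)); /* 28 */
    printf("bits for an arbitrary address:   60\n");
    return 0;
}
```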
In contrast, portals may be set up anywhere in the global address space, and so would need more bits in the prediction. We expect portal calls to be rare compared to ordinary calls, so we economize on table space and wires and don’t try to predict portal calls.
The Mill does well with a lot of things :-)
Seriously, while we know that we will do a reference microkernel, the decision of which extant system to build on must wait until the new compiler is up enough to take the load, which won’t be for a while yet. It also depends on who we find as partners in the effort; if a grad student at University A gets fired up about porting system X to the Mill then we will arrange access to our tools and simulators and, eventually, hardware. We’ll also support ports by B of Y, C of Z and so on; may the best and fastest win. The Mill is intended to be policy-agnostic, so the more different systems that get ported the more confident we are in the hardware model provided. As research ports come out, we will be working to make one or another be commercial grade, or driving a merger of the most successful ideas of several.
But all that waits on the tool chain for now.
Neither threads nor turfs are processes in the Unix sense, so pid_t can remain unchanged. A Unix process can be seen as an initial thread residing in a new turf, but you can have turfs without threads on the Mill, which is not possible on Unix because there is no way to have an isolated load module that is a pure service.
Stack segments can be arbitrarily sized, subject to OS-enforced quanta as usual. The info block describes the top segment of the segmented stack. A new service call (not a callback) will always use the reserved stacklet, but it initially has no data frame (the spiller has a frame, of course). Creating a frame is a distinct Mill operation, stackf, which carries a size in bytes. If that exceeds the current limit then the hardware traps to a handler (which would ordinarily be a portal to a service) that goes out and allocates enough space for another segment, fixes up the links and the frame and stack pointer registers, and updates the info block. The handler enforces whatever quantum policy is desired. The alloca operation can also trigger stack overflow, with similar handling.
Subsequent callbacks will use the new segment. An exit followed by a new call can be optimized if the handler keeps a previous, now empty, segment around to avoid allocate/deallocate overhead when call/returns are bouncing over a segment boundary.
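For concreteness, here is a rough C sketch of the handler's bookkeeping. All of the types, field names, and the 4 KB minimum segment size are invented for illustration; the real handler is a Mill service reached through a portal and also fixes up the frame and stack pointer registers:

```c
/* Rough sketch of the stack-overflow handler's bookkeeping.  Types, field
 * names, and the 4 KB minimum segment size are invented for illustration;
 * the real handler is a Mill service reached through a portal. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct stack_segment {
    struct stack_segment *prev;   /* link to the previous segment */
    size_t size;                  /* usable bytes in this segment */
} stack_segment;

typedef struct {
    stack_segment *top;           /* top segment described by the info block */
    size_t quantum_left;          /* OS-enforced stack quantum remaining */
} stack_info;

/* Called when stackf (or alloca) asks for more than the current limit. */
static bool grow_stack(stack_info *info, size_t needed)
{
    size_t seg_size = needed < 4096 ? 4096 : needed;   /* policy choice */
    if (seg_size > info->quantum_left)
        return false;                                  /* quantum exceeded */

    stack_segment *seg = malloc(sizeof *seg + seg_size);
    if (!seg)
        return false;

    seg->prev = info->top;         /* fix up the links ...           */
    seg->size = seg_size;
    info->top = seg;               /* ... and update the info block  */
    info->quantum_left -= seg_size;
    return true;
}

int main(void)
{
    stack_info info = { .top = NULL, .quantum_left = 1 << 20 };
    return grow_stack(&info, 8192) ? 0 : 1;
}
```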
We are working on a reference port of Linux/L4 (https://en.wikipedia.org/wiki/L4_microkernel). Other implementers will have their own approach; Windows on a Mill is an interesting idea :-)
Turfs, threads, and portals (other than the first, created by the hardware) are built by trusted services, principally the loader. The critical steps include allocating turf and thread ids and blessing a hunk of address space with the Portal permission. Family members vary in the hardware help for these steps, but all unsafe parts of the work are done via MMIO, so only those with rights to the relevant MMIO regions can do it.
The software that does these steps will presumably have some notion of quantum, to prevent “portal bombs” such as you suggest. However, that’s all policy, and so not defined by the Mill architecture.
We have had no problem mapping Mill security to the C API. The general strategy is to use a user-space shim that is linked right into the app. The app calls the shim, and the shim does the portal call, any necessary passing of arguments, and so on. However, we are far from done with the work and may find a gotcha as we get further. If so we’ll deal with it then.
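As a sketch of the shim idea, under the assumption that the portal behaves like an ordinary indirect call from C's point of view (the names and signatures here are hypothetical, not a real Mill or OS API):

```c
/* Sketch of the shim idea: the app calls an ordinary C function, and the
 * shim performs the portal call.  Here the "portal" is just an indirect
 * call to a stub; names and signatures are hypothetical. */
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the service entry the loader would expose through a portal. */
static long service_write(int fd, const void *buf, size_t len)
{
    (void)fd;
    return (long)fwrite(buf, 1, len, stdout);   /* stub for illustration */
}

/* The user-space shim linked into the app: familiar C API in, portal call out. */
static long shim_write(int fd, const void *buf, size_t len)
{
    /* any argument marshalling would happen here */
    return service_write(fd, buf, len);
}

int main(void)
{
    shim_write(1, "hello from the shim\n", 20);
    return 0;
}
```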
Callbacks are handled much like the linking of dynamic libraries (.so files). When you link to a dynlib on a conventional system, the app must itself export its callback entry points so the linker can fix up the calls in the library just as it fixes up the calls in the app to point to the entry points in the library; this is a fairly sophisticated use of the linker, but well known and quite routine.
On a Mill you do the same thing, only the exported entries are made into portals by the loader, rather than rewriting the code addresses. However, if you are a JIT, so the entry address isn't known at load time, then the entry portals (in either direction) must point to trampoline code at fixed addresses, where the trampoline indirects to the movable entry point. It costs one more cycle in and out, so no big deal.
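The trampoline idea looks roughly like this in plain C; the names are illustrative and the indirection would really be through a portal, but the cost is the same one extra indirection per call in each direction:

```c
/* The trampoline idea in plain C: the portal targets a fixed trampoline, and
 * the trampoline indirects through a pointer the JIT updates when it moves
 * the real entry point.  Names are illustrative. */
#include <stdio.h>

static void real_entry_v1(void) { puts("entry v1"); }
static void real_entry_v2(void) { puts("entry v2"); }

/* The JIT rewrites this pointer when it relocates the entry point. */
static void (*volatile jit_entry)(void) = real_entry_v1;

/* Fixed-address trampoline that the portal (or callback) targets. */
static void trampoline(void) { jit_entry(); }

int main(void)
{
    trampoline();                /* reaches v1 */
    jit_entry = real_entry_v2;   /* JIT moves the code */
    trampoline();                /* reaches v2 */
    return 0;
}
```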
There's currently no way to grant something unless you have the same access yourself. It's doable (another permission bit), but we have no use case that needs it.
Ivan
A bit more – while you can use your convention-of-choice to pass arguments in memory, passing them in the belt will be much faster, with a performance difference roughly equal to the difference between passing in registers vs. memory on a legacy architecture. So I’d expect performance-oriented Lisp/JS implementations to use the belt when they can, and restrict the always-varargs approach to blind calls when they can’t optimize.
> Could a thread manipulate its own spReg to allow it to make calls without overflowing the stack, BUT overflowing the state in the spiller?
Not unless it had rights to the MMIO register that shadows spReg, and if it has rights like that then it's part of the OS and had better know what it is doing. spReg (and nearly all the specRegs) are not user-assignable except via MMIO. Of course, they change all the time as a side effect of executed operations like call, but not directly.
> Could a thread do a grant in a loop, generating regions until something breaks? It sounds like the OS won't get a chance to stop it until PLB eviction occurs. Also, if it's on a family member which does it in hardware, that removes the OS's ability to regulate a rogue process.
The grant quantum, while set by policy, is managed by hardware and will throw a fault if exceeded. We didn’t want to track it at evict time in case we have members that do evicts (with PLB entry creation) in hardware.
> It sounds like if a thread fills the region table with regions overlapping one address, the augmented interval tree will handle it happily (ignoring the previous question). However, what if the PLB is (mostly) full of regions that all overlap an address, which a thread then accesses? Would it need to hit an excessive number of regions in the cache to resolve the final permission set?
The region table is searched until the first entry that would give permission is found. Entries are not concatenated during the search; permission for the whole access must be provided by a single entry. That keeps the search cost manageable; if you really need to have an access that is part permitted in different entries then you must use a service that will push a new grant covering the union of your existing entries.
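A simplified sketch of that search rule; the structures and the exact-match test are invented for illustration, and the real matching is more elaborate:

```c
/* Simplified sketch of the search rule: the first single entry that covers
 * the whole access with the needed rights (and matching thread and turf)
 * satisfies it; partial covers are never combined.  Structures and the
 * exact-match test are invented for illustration. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { RIGHT_READ = 1, RIGHT_WRITE = 2 };

typedef struct {
    uint64_t lo, hi;       /* region [lo, hi) */
    unsigned rights;       /* RIGHT_* bits granted */
    unsigned turf, thread; /* who the grant applies to */
} region_entry;

static bool access_permitted(const region_entry *tbl, size_t n,
                             uint64_t addr, uint64_t len, unsigned need,
                             unsigned turf, unsigned thread)
{
    for (size_t i = 0; i < n; i++) {
        const region_entry *e = &tbl[i];
        bool covers = addr >= e->lo && addr + len <= e->hi;  /* whole access */
        bool rights = (e->rights & need) == need;
        bool party  = e->turf == turf && e->thread == thread;
        if (covers && rights && party)
            return true;       /* first satisfying entry ends the search */
    }
    return false;              /* no single entry covers the whole access */
}

int main(void)
{
    region_entry tbl[] = { { 0x1000, 0x2000, RIGHT_READ, 1, 7 } };
    return access_permitted(tbl, 1, 0x1800, 16, RIGHT_READ, 1, 7) ? 0 : 1;
}
```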
> On the topic of overlapping regions, I assume it's an OR of the permissions, not an AND?
An access is satisfied by a single entry if the region, thread, turf, and rights all match.
> Does the iPLB ignore read permissions? (i.e., can code be execute, non-read?)
The iPLB is used only for execute access, and its entries need only distinguish eXecute from Portal. Similarly the dPLB is used only for load and store, and need only have Read and Write permission bits. Table entries have all the bits so as to reduce the number of entries.
- in reply to: How can I get involved? #768
Sorry for the delay in response; I’ve been busy with the Security talk at Google, just completed.
There is a lot of material, in various degrees of up-to-date-ness – 500+ pages of filings and 450+ slides, plus the talks so far and a bit more. Dumping the lot on a newcomer would be too much support for us; our time is better spent in other ways than educating people one at a time about the Mill.
However, if you are interested in a particular sub-topic from one of the presentations, and can realistically commit the time, and can show that you already have the skill, then the relevant material from the trove could be peeled out for you.
So if this fits you, send me a suggested sub-topic and a sample of your writing and we’ll work something out.
Ivan
If you work for Oracle, your employer is going to be very unhappy about you spilling the beans on their revolutionary new CPU.
Otherwise, you might look a bit more closely at the design of the Sparc you mention. To be able to execute 30 operations of the slide-26 kind (a constant, an add, and a store) in one cycle, you would need ten load-store units and twenty ALUs (actually 30 ALUs, because the Sparc takes two instructions to build a 32-bit constant). I'm not up on the most recent Sparc offerings, but I think the biggest they have has two load-store units and two ALUs, which is a little short of what you need. :-)
So why not build a Sparc with more functional units? Well, knowing why not is why hardware engineers get paid, but the short answer is that the units have to be able to talk with each other, and the cost of the connections increases as the square of the number of units. Rather quickly you reach a point at which the power required will melt the chip. You might Google "dark silicon problem" for more.
The Mill avoids this barrier using a method long used in the embedded world: static scheduling with exposed pipeline. That solves the melting problem, but unfortunately such designs give very bad performance on general purpose programs. The issues are run-time (cache misses and the like), so compiler improvements don’t help. The Mill has solved those issues, and is able to bring DSP power-performance numbers to general purpose code.
I wish it were as easy as you believe; I could have spent the last decade on a beach. :-)
To expand a bit on what Will said: the instruction (for that Gold member) may carry up to 30 operations. The compiler does not have to find 30 for every instruction; you would never see that level of ILP in open code except in contrived examples. However, in loops, with pipelining and/or unrolling, there is no limit to the amount of available ILP. One-instruction loops are common on a Mill, with the instruction containing large numbers of ops, approaching the encoding limit of the particular family member.
- in reply to: Threads/Coroutines #742
Everything is lazy and background, so performance is very good unless you do a deeply nested run of calls with lots of arguments and no work except the next nested call. “Very good” here means the usual one-cycle cost of a call.
There is support for coroutines (though NYF) and non-resumable continuations. Scheme-style repeatable continuations require a garbage collector, which is not what hardware does :-)
> What happens if you have an FMA instruction but fail to also issue an Args? Or is it that the existence of an Args in an instruction turns the multiply operation into an FMA operation?
> Am I right in guessing that which operations are connected to which in ganging is a result of which slots in the instruction the ops occupy?
Decode never needs to look at an adjacent slot to disambiguate an operation; needing to do so would complicate and slow down decode. Consequently FMA is a distinct opcode and not a mul-with-baggage.
The sanity check for ganging (at least on the exu side where these ops live) is done during execution. The FMA functional unit knows that it needs arguments from its buddy, but it has the execution cycle to work out whether or not it has a buddy with the right arguments. No buddy causes an invalidOperation fault, even though the decoder proper didn't complain.
Gangs are always encoded in adjacent slots. Otherwise there are encoding issues and added data paths to get everything to the FU involved.
> This wouldn't be a Mill architecture, but I wonder if there would be a problem with adding more phases and having different, otherwise identical, functional units do their work in different phases. For example, maybe read, pick1, op1, call, op2, pick2, write. I suppose that would induce limitations as to how quickly you could schedule instructions back to back, but it would make decoding easier for a given width. Were there other important benefits and limitations I'm not thinking of? How much of the selection of the current phasing setup was a result of the way the hardware works, and how much profiling actual code to see what would be useful in practice?
The phase structure is largely dictated by the encoding and the decode process. It takes three cycles to fully decode the extremely wide Mill instructions, but ops in some blocks are completed in each of those cycles and we issue them right away as soon as available; that gives us three phases. The pick op has no cycle, and the call op uses the cycle of the called function, which gives us five phases in all.
We could do as you suggest and have added phases with dedicated FUs. However, the ops for those phases would already have been decoded and would just be waiting for their phase, so why not issue them at once? We could of course go to a five-block encoding, which would take four cycles to decode, requiring another phase. However, three seems to be all that’s useful.
Assignment of ops to blocks (and hence phase) is ad hoc but guided by hardware constraints. For example, it is not possible to get the belt bypass set up in parallel with decoding the first (reader) block, so ops in reader phases cannot have belt arguments. Consequently the reader phase can only have pure-source ops or administration ops that do not take a belt argument. Reader phase also cannot have an op taking a large constant (such as the stackf frame allocation op) because the constant would wreck the entropy of the encoding of the other reader ops which are quite small.
- in reply to: new CALL opcode variants (suggestion) #713
> So thank you. It is a pleasure seeing your work. I would prefer to be an investor! :-)
Actually you can be, if you are either a non-US national, or are an accredited investor per Regulation D of the SEC. If either applies, sign up for the Investor's Mailing List at http://oootbcomp.com (the list sign-up explains the regulations).
> How long do you think it will take before hackers can have /their/ $10 Mill uC?
Longer than anyone would like :-) Don't hold your breath, but we will get there eventually.