Forum Replies Created
- AuthorPosts
- in reply to: Specification #1083
No deep relationship. The sim accepts specializer output (or manual asm).
- in reply to: Specification #1064
The video and slides from the Specification talk will be on the site as soon as post-production is done; we hope this week.
- in reply to: Hard/soft realtime #1049
Yes.
But a glossary/terminology note: we call them “frame ids” or “frame depth numbers”, not “belt ids”, because they identify more than just belt operands, including the frame’s scratchpad, frame-saved specRegs, static and dynamic down-stack linkage, return addresses, etc. And a “belt id” might be confused with a “belt position number”.
- in reply to: CPU socket compatable #1028
That’s quite a ways away. There are license issues, socket issues, target market issues, … For later 🙂
- in reply to: CPU socket compatable #1024
That’s actually the plan, for at least initial chips, although we are far from picking a socket yet – there will be new ones by the time we have something to put in one.
Being hardware, the Mill is not language-specific, but must support all common languages, common assembler idioms, and our best guess as to where those will evolve.
The best that we, or any design team, can do is to try to isolate primitive notions from the mass of semantic bundles and provide clean support for those primitives, letting each language compose its own semantic atoms from the primitives we supply.
The “volatile” notion from C is used for two purposes, with contradictory hardware requirements for any machine other than the native PDP-11 that is C. One purpose is to access active memory such as MMIO registers, for which the idempotency of usual memory access is violated and cache must not be used, nor can accesses be elided because of value reuse.
The other common purpose is to ensure that the actions of an asynchronous mutator (such as another core) are visible to the program. Here too it is important that accesses are not elided, but there is no need for an access to go to physical memory; they must only go to the level at which potentially concurrent accesses are visible. Where that level is located depends on whether there are any asynchronous mutators (there may be only one core, so checking for changes would be pointless) and the cache coherency structure (if any) among the mutators.
Currently the Mill distinguishes these two cases using different behavior flags in the load and store operations. One of those is the volatileData flag mentioned in the previous post. This flag ensures that the access is to the top coherent level, and bypasses any caching above that level. On a monocore all caches are coherent by definition, so on a monocore this flag is a no-op. It is also safe for the compiler to optimize out redundant access, because the data cannot change underneath the single core.
On a coherent multicore the flag is also a no-op to the hardware but is not a no-op to the compiler: the compiler must still encode all access operations, and must not elide any under the assumption that the value has not changed, because it might have.
On an incoherent multicore (common in embedded work), a volatileData access must proceed (without caching) to the highest shared level, which may be DRAM or may be a cache that is shared by all cores. Again, the compiler must not elide accesses.
For the other usage of C volatile keyword, namely MMIO, the access is to an active object and must reach whatever level is the residence of such objects. On the Mill that level is the aperture level, below all caches and immediately above the actual memory devices, and the flag noCache ensures that an access will reach that level. Again, the compiler must not elide noCache accesses.
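The cases above can be summarized as a small decision table. This is only a sketch of the rules as stated in this post: the function name, the topology enum, and the level labels are made up for illustration, and the treatment of a plain (unflagged) access as elidable is an assumption based on ordinary memory being idempotent.

```python
from enum import Enum


class Topology(Enum):
    MONOCORE = "monocore"
    COHERENT_MULTICORE = "coherent multicore"
    INCOHERENT_MULTICORE = "incoherent multicore"


def access_policy(flag, topology):
    """Return (level_reached, compiler_may_elide) for one load or store.

    flag is "volatileData", "noCache", or None for a plain access.
    The level names are descriptive labels, not hardware identifiers.
    """
    if flag == "noCache":
        # MMIO case: the access must reach the aperture level, below all caches.
        return ("aperture", False)
    if flag == "volatileData":
        if topology is Topology.MONOCORE:
            # All caches are coherent by definition, so the flag is a no-op
            # and redundant accesses may be optimized out.
            return ("any cache", True)
        if topology is Topology.COHERENT_MULTICORE:
            # No-op to the hardware, but the value may have changed,
            # so the compiler must keep every access.
            return ("any cache", False)
        # Incoherent multicore: go uncached to the highest shared level.
        return ("highest shared level", False)
    if flag is None:
        # Assumption: an ordinary access is idempotent and may be elided.
        return ("any cache", True)
    raise ValueError(f"unknown flag: {flag!r}")
```

For example, `access_policy("volatileData", Topology.COHERENT_MULTICORE)` says the access can be satisfied from cache but may not be elided, which is the "no-op to the hardware but not to the compiler" case.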
Besides the noCache flag, the Mill also maps all of the physical address space to a fixed position (address zero) within the (much larger) shared virtual space. An access within that address range is still cached (unless it is noCache) but bypasses the TLB and its virtual address translation mechanism. NoCache addresses outside the physAddr region (and volatileData accesses if they get past the caches) are translated normally.
There are other access control flags, but that’s enough for now 🙂
- in reply to: Prediction #1058
Rather than manual annotation, why not just run the program a few times and let it train?
Modifying the binary, such as by adding bits to the branch ops, hits the problem that the decoders cannot know they have a branch until after they have already committed to taking or not taking the transfer. This is a consequence of the pipelining of the decode logic; we have to know to issue the line fetch at least a cache-latency ahead of getting the line, and it’s a couple of cycles more until decode has figured out that a transfer is to be executed. There are two possible resolutions of this: predictive fetch, or retarded execution. We use the former, and have generalized predictive fetch to achieve arbitrary run-ahead. Retarded execution is used by OOO machines, by delaying operations that depend on control flow until the control flow is resolved, using the OOO hardware to keep the execute units busy once the pipe has started up.
The two approaches can approximate each other in the steady state, but predictive fetch has much less startup cost than retarded, so we have a five cycle mispredict penalty rather than the 15 or so typical of retarded OOO. Predictive does require more table state than retarded for equivalent accuracy, but the Mill avoids that issue through our trainable prediction, which in effect gives us arbitrarily large tables in fixed and small hardware.
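To put the penalty numbers in perspective, here is a back-of-envelope sketch. The 5 and 15 cycle penalties come from the paragraph above; the miss rate is an arbitrary assumption picked for illustration, not a measured figure for any machine.

```python
def expected_mispredict_cost(miss_rate, penalty_cycles):
    """Average cycles lost per transfer to mispredicts."""
    return miss_rate * penalty_cycles


MILL_PENALTY = 5          # cycles, from the text above
OOO_PENALTY = 15          # cycles, "15 or so" from the text above
ASSUMED_MISS_RATE = 0.05  # hypothetical: 1 transfer in 20 mispredicted

mill_cost = expected_mispredict_cost(ASSUMED_MISS_RATE, MILL_PENALTY)
ooo_cost = expected_mispredict_cost(ASSUMED_MISS_RATE, OOO_PENALTY)

print(f"Mill: {mill_cost:.2f} cycles/transfer, OOO: {ooo_cost:.2f} cycles/transfer")
```

At equal accuracy the average cost scales directly with the penalty, so whatever miss rate you plug in, the ratio between the two stays at three to one.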
There may be a third approach to deal with pipelined decoder timing beyond predictive and retarded, but I don’t know of one.
- in reply to: Hard/soft realtime #1053
The spiller is lazy and, being a hunk of hardware, is buffer-granularity and not frame (or other language construct) granularity. For a horrible example of the difference, consider a hypothetical Mill with a 14-cycle hardware divide operation, whose first instruction contains:
div(..), calltr(...)
That is, it starts a divide and recursively calls itself conditionally.
By the time the very first divide result comes out of the function unit, there are already thirteen other divides in flight in the FU, and we are 14 deep in nested function calls. The spiller is going to spill that result (the FU output latch will be getting a new value next cycle), but the frame (and belt) that the value belongs to is far away, and the retires for that frame may be temporally interleaved with the retires of many other frames. Remember: a call on the Mill looks (on the belt) like any other operation, like an add for example, and has (or rather appears to have) zero latency.
Consequently, as in most things, spill/fill on the Mill doesn’t work like a conventional machine. We don’t spill frames, we spill values, and only when we need the space for something else. And this is true between the belt latches and the spiller buffers, between the buffers and the SRAM, and between the SRAM and the memory hierarchy.
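A toy model may help with the "spill values, not frames, and only when space is needed" idea. This is a sketch only: the class name, the capacities, and the oldest-first eviction order are assumptions for illustration, and the real spiller is hardware whose policies are not described here.

```python
from collections import deque


class LazyLevel:
    """One storage level that pushes values downward only when it is full."""

    def __init__(self, capacity, next_level=None):
        self.capacity = capacity
        self.next_level = next_level
        self.values = deque()
        self.spilled = 0  # how many values this level has pushed downward

    def put(self, value):
        if len(self.values) == self.capacity:
            # Space is needed for something else: move one value down.
            # Nothing here knows or cares which frame the value belongs to.
            oldest = self.values.popleft()
            self.spilled += 1
            if self.next_level is not None:
                self.next_level.put(oldest)
        self.values.append(value)


# Hypothetical capacities; the last level stands in for the memory hierarchy.
memory = LazyLevel(capacity=1000)
sram = LazyLevel(capacity=8, next_level=memory)
buffers = LazyLevel(capacity=4, next_level=sram)

for v in range(6):
    buffers.put(v)
```

After the six puts, only the two oldest values have moved from the buffers to the SRAM level, and nothing has reached memory; the other four never left the buffers.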
Also, “frame” is an overloaded term, meaning on the one hand the stack-allocated locals of a function activation, and on the other hand the activation itself; this can be confusing. The spiller has nothing to do with the former sense; the program-declared local data is explicitly written to memory by the program, just as in a conventional. The spiller is concerned with internal and transient state. On a conventional this too is explicitly written to memory by compiler-inserted preamble/postamble code, or equivalent asm code for those writing in assembler. Not so on a Mill; the internal state is in general not explicitly accessible by the program, and save/restore is done by the spiller.
Consequently, spiller performance is dissociated from programming language constructs, and is constrained only by bandwidth. Certain program actions, if sustained for long enough, can generate internal and transient values at a rate large enough to overwhelm the spiller bandwidth capacity; you will stall. Granted, you will have to work to make it happen, but you can do it.
However, the stalls induced by the spiller will still leave your program running faster than on a conventional machine (which of course has to save the same information too) using explicit code to save and restore the general registers with every one of those recursive calls.
- in reply to: ASLR (security) #1047
Closing implicit side channels is interesting intellectual play but not very real-world IMO. In principle, if you have access to the box and unlimited prepared-text attack ability then you can learn a ton by measuring the power drain at the wall socket. Or you can etch the lid off a chip and do RF sniffing at the nanometer level. And I’m sure there are 3-letter agencies that do exactly that sort of thing. But I doubt that we are looking at customer sales impact from whatever can be extracted from the global pattern of mmaps, even if you had an exact list of all such calls without having to infer anything.
I feel that the automatic sloppy randomization that will come from the shared address space will in fact help the Mill get and maintain a reputation for solidity. I don’t think it’s anything that we should trumpet or make marketing muchness out of, but it will make an attacker’s job harder even when the user turns off ASLR by oversight or misguided “tuning”, and that has to be a good thing.
YMMV.
- in reply to: Hard/soft realtime #1039
A personal note first – it’s funny for me to be addressed as Doctor Godard. I come from the generation before ubiquitous computers – during my brief and inglorious college career, my well-respected college did not own a computer. Though I have taught computer science at the graduate and post-doc levels, I have never taken a course in the subject and have no degree of any kind. We learned the subject because some co-worker was willing to sit the green kid down in front of a whiteboard and explain how it really worked. I try to do the same today.
Speaking of which, in answer to your question: there is no constraint on the spiller or core speed other than spiller bandwidth. This is true for spilling scratchpad and internal state as well as for spilling belt operands.
About special registers:
There is a suite of specRegs, some always present and some only present in some member configurations. These are internal registers contained in the logic of the core proper and located where they are needed; they are not contained in a register file. Some of these are frame-save, some thread-save, some not saved at all; the spiller does what saving is needed. Some are visible to the rd() and/or wr() operations, and most are visible in the MMIO map. At some point we will have doc on all of them, but we’re a little busy 🙂
- in reply to: Prediction #1037
The Mill doesn’t put “expects” in the code, because the entropy cost of the field in the encoding then would have to be paid on every use of such instructions. Instead “expects” go in the loadable prediction table in the load module. The effect as far as initial prediction quality is the same, but the predictor table is promptly overwritten by actual experience history.
Also, “expects” in the code can supply only taken/not taken information, which arrives much too late to be useful in guiding decode. Instead the Mill predictor carries “where to next” information so that it can run ahead of fetch and thereby avoid code fetch stalls.
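A rough sketch of why "where to next" permits run-ahead while taken/not-taken does not: each entry names the next target, so the chain can be followed without decoding anything. The dict keyed by address, the function names, and the addresses are all hypothetical; the actual table format is not described in this post.

```python
def run_ahead(table, start, depth):
    """Follow "where to next" entries from start, up to depth predictions."""
    chain = []
    addr = start
    for _ in range(depth):
        if addr not in table:
            break  # no prediction: cannot run ahead past this point
        addr = table[addr]
        chain.append(addr)
    return chain


def record_outcome(table, addr, actual_next):
    """Overwrite the prediction for addr with what actually happened."""
    table[addr] = actual_next


# Hypothetical addresses standing in for a table preloaded from the load module.
table = {0x100: 0x140, 0x140: 0x200, 0x200: 0x100}

before = run_ahead(table, 0x100, 4)
record_outcome(table, 0x140, 0x180)  # experience overwrites the preloaded guess
after = run_ahead(table, 0x100, 4)
```

Here `before` follows the preloaded loop for four steps, while `after` stops after two steps because the table has no entry yet for the newly observed target.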
Will is right that anything the program can do to help guide the process is useful and will be incorporated in the program for better prediction. The difference is in how that added information is represented, and where.
- in reply to: Hard/soft realtime #1032
There’s a bit of confusion here. The belt is not addressed by an ID drawn from a pool; it is addressed by position number. The most recent operand to drop to the belt is by definition b0. It does not have a permanent ID as it is moved along the belt by subsequent drops, but becomes b1, b2, .. bN.
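Positional addressing is easy to model in a few lines. This is a minimal sketch, assuming a belt length of 8 picked for illustration; the class and method names are made up, and it models only the naming, not timing or multiple drops per cycle.

```python
from collections import deque


class Belt:
    def __init__(self, length):
        # maxlen makes the oldest operand fall off the far end automatically.
        self._slots = deque(maxlen=length)

    def drop(self, value):
        """A new result always arrives at position 0."""
        self._slots.appendleft(value)

    def b(self, position):
        """Read the operand currently at belt position b<position>."""
        return self._slots[position]


belt = Belt(length=8)  # hypothetical length
belt.drop("x")
belt.drop("y")
```

After the second drop, `belt.b(0)` is `"y"` and `"x"` has become `belt.b(1)`: the value did not change, only its position number did, and no ID was ever allocated for it.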
Similarly, stack frames do not really have an ID, they are indicated by the ordinal stack depth. Thus the very first frame in a stack is frame zero, and so on. These too do not get recycled, although you can think of the automatic reuse (as return operations cut back the stack and later calls add new frames) as being a form of recycling.
The only pool IDs on a Mill are the thread and turf IDs. Because creating a new thread or turf is an infrequent event, allocation and recovery can be a more expensive process. The actual mechanism involves hardware allocation from a pool of IDs which is kept full by a lazy background maintenance thread, but details are NYF.
Consequently neither belt nor call frame IDs have any HRT impact at all, nor does thread or turf allocation, although thread and turf recovery does require some (very small) spare background execution outside the HRT itself.
The spiller that Will described is concerned with the data, not with the data’s name (ID). It is possible in a deeply nested sequence of prompt calls with lots of belt arguments (or rapid exit from such deep frames) to overrun the spiller’s capacity to move data. However, the constraints are the same memory bandwidth constraints that are faced for any other use of bandwidth; the spiller does not add any demand that would not have been needed by equivalent loads and stores on a register machine saving its registers. So if an HRT program hits the spiller bandwidth limit then it would also have hit the load/store bandwidth limit on a conventional. The Mill is pretty good, but it cannot magically create memory bandwidth that the DRAM chips don’t supply. If you hit the spiller bandwidth limit, your only solution is to buy a Mill family member with more memory pins, just as you would have had to do on an x86, ARM, or whatever.
- in reply to: Wafer feature efficiency #1025
The SEC would be all over us 🙁
Ivan
- in reply to: Wafer feature efficiency #1020
I wish! 🙂
We are not comfortable with simply asking for donations. If the Mill gets anywhere it will be big, and those of us who are or become shareholders will do well. Meanwhile the donors get nothing. That seems unfair; we don’t want to be an Oculus Rift.
On a Kickstarter you offer rewards in exchange for donations, i.e. make pre-sales. That works fairly well for a CD, even a movie, but we are still a long way and much money from having a product we can ship as a reward. Also, how many donors would actually want a chip, without board and box and …?
So about all we have to offer is stock or other interest in our future. However, there are rather severe rules around offering interests, rules that on the whole I’m glad are there, even though they do bite. Among those rules is that we can’t solicit people to invest; people who want to invest have to come to us (and at the top left you’ll see a sign-up place so we can let them know when and if the opportunity arises).
We appreciate your enthusiasm – do go tell your friends, one and all, especially those that are bazillionaires – but heavy semi, which the Mill is, is neither fast nor cheap.
- This reply was modified 10 years, 8 months ago by Ivan Godard.