Forum Replies Created
- in reply to: Fundamentals & References #1285
People who are interested in CPU design in general might want to hang out on the comp.arch newsgroup. It is quite active, and covers many topics that are outside the specific Mill domain that you see in these forums.
Incidentally, would you consider bringing that article up-to-date by including the Mill design and contrasting it with the other approaches?
- in reply to: Pipelining #1265
(Thanks Ivan for describing latency ordering of drops; I don’t think you’d previously covered that publicly.)
It wasn’t – we just said the order didn’t matter so long as the hardware and scheduler agree. The actual order, and why it matters, was in the 7/13 provisional filing in advance of the Pipelining talk, so INF (It’s Now Filed).
- in reply to: Creating an LLVM backend for the Mill #1190
The current LLVM port effort does not use the LLVM back end as such, nor does it accept the LLVM middle-to-backend IR, which discards information that the specializer needs. The replacement for that IR is a serialization of the middle-end dataflow graph structure, or at least we think so; work is still too early to be sure.
The intent is that the input to the specializer is of a form that permits a very fast and simple specialize step. Operation selection has been done in the compiler, using an abstract Mill with all operations supported.
We also will be adding a few passes to the middle-end, primarily to do operation selection. It’s good news if you have done some of these passes already. Type regularizing is certainly applicable. It’s not clear to me whether VARARGS can be handled in the compiler for us, because until we know what the belt length is (at specialize time) we don’t know how many can be passed in the belt. Of course, we might ignore that and pass all VARARGS on the stack; it’s not like performance matters much for that.
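To make that trade-off concrete, here is a minimal sketch (my own illustration, not the Mill ABI; all names and numbers are hypothetical) of the decision the specializer would face: once the member’s belt length is known, split the variadic arguments between belt and stack, or conservatively send them all to the stack.

```cpp
#include <algorithm>
#include <cstdio>

struct ArgSplit { int on_belt; int on_stack; };

// Decide, at specialize time, how many variadic arguments fit on the belt;
// the remainder would go to the stack. Purely illustrative.
ArgSplit split_varargs(int n_varargs, int belt_length, int belt_in_use) {
    int budget = std::max(0, belt_length - belt_in_use);
    int on_belt = std::min(n_varargs, budget);
    return {on_belt, n_varargs - on_belt};
}

int main() {
    ArgSplit s = split_varargs(10, 16, 12); // e.g. a 16-long belt with 12 slots live
    std::printf("belt: %d, stack: %d\n", s.on_belt, s.on_stack); // belt: 4, stack: 6
}
```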
Large arguments passed by value are an interesting problem for us because a Mill call may cross a protection boundary. It is necessary to have such arguments addressable by both the caller (to pass them) and the callee (to use them), and there are semantic and security issues involved. For example, can the caller pass a struct, put a pointer to the passed struct in a global variable, make the call, and then have another thread of the same process modify the passed struct while the callee is working on it? Such things are hard to get right; it’s much easier to ignore such problems and say that exploits are an application problem, but that’s not Mill.
It sounds like your expandStruct converts structs into tuples. That too is something we want to do, in part because we have to use tuples to be able to support Mill multi-result call operations. We are also very tempted to add pass-by-copy to C/C++ as an extension as part of our work; IN, OUT, and INOUT parameters are a much more natural way to express multiple results than tuples, IMO.
If you and your team are Bay Area, I’d be happy to buy you a coffee and gain what we can from your experience with LLVM. Likewise any other Forum participant with LLVM internals experience who would like to help. You can reach me at ivan@millComputing.com.
You are right that exposing the belt length by admitting temporal addressing does make the binary target-dependent, but the Mill is an exposed-pipeline machine whose binaries are target-dependent for many other reasons, so temporal addressing adds no complication that is not already there. Our solution is the specializer, which lets us have target-independent load modules even though the binary is not. As no one but hardware verifiers (and the code of the specializer library) ever looks at the bits in memory, the nominal binary dependence doesn’t matter; you get the same portability you get on any other machine family.
It is my impression (please correct me) that the ring-buffer-based queue machine forces a uniform and universal opset in each executing position. This is trivially supplied if there is only one execution pipe, but if there are more than one then non-uniformity has advantages. Thus if there are three uniform pipes then the QM can issue three ops if their arguments are available, but if there are, say, two pipes with ALUs and one pipe with a multiplier, then the arguments for the adds and the multiply must be in relative order to each other so as to align with the intended pipes. Alternatively, the adds and the mul could use your positional-addressing extension to get the correct arguments, but the result is no longer a QM.
The conform op was explained in the talks – the Memory talk, if I recall. Conform normalizes the belt, arbitrarily reordering the belt contents. It is used to force all control flow paths to a join point to have the same belt structure, at least w/r/t data that is live across the join. The implementation is just a rename, because actual data movement would be much too expensive. As with the rest of the Mill, the design takes into account the need for a commercially viable product in silicon.
Rescue is a similar but more compact encoding of conform, in which the belt order is preserved but dead belt operands are squeezed out, whereas conform admits arbitrary reorder. Rescue is used, as the name suggests, to prevent live operands from falling off the end of the belt.
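For intuition only, here is a minimal model (my own sketch, not the hardware) of the two ops, treating the belt as a list of operand IDs where conform is a pure permutation and rescue is a compaction:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

using Belt = std::vector<int>; // element = operand id; index = belt position

// conform: arbitrary reorder. New position i takes the operand currently at
// position perm[i]. In hardware this is just a rename; nothing moves.
Belt conform(const Belt& b, const std::vector<int>& perm) {
    Belt out;
    for (int src : perm) out.push_back(b[src]);
    return out;
}

// rescue: squeeze out dead operands, preserving the order of the live ones,
// so they don't fall off the end of the belt.
Belt rescue(const Belt& b, const std::vector<bool>& live) {
    Belt out;
    for (std::size_t i = 0; i < b.size(); ++i)
        if (live[i]) out.push_back(b[i]);
    return out;
}

int main() {
    Belt b = {7, 8, 9, 10};                         // b0=7 ... b3=10
    Belt c = conform(b, {2, 0, 3, 1});              // now b0=9, b1=7, b2=10, b3=8
    Belt r = rescue(b, {true, false, true, false}); // keeps 7 and 9, in order
    std::printf("%d %d\n", c[0], r[1]);             // 9 9
}
```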
Mill encoding is wide-issue like a VLIW, but otherwise very different; see the Encoding talk. The need to have not only live belt operands in the belt but also multi-cycle computation in the FUs across control flow forces us to use a very different scheduler than the DAG approach which is inherently restricted to basic blocks. We use a time-reversed tableau scheduler, similar to what some VLIWs use.
- in reply to: Pipelining #1283
Remember, this is hardware, not compiler internals; there is no such thing as a taint.
In hardware terms, your suggestion would turn into predication on the store, where the value of the predicate results from the exit test. There have been predicated machines, most recently Itanium and older ARMs; I did a compiler for the Data General Nova1200 once, which had predicated operations implemented as skip predicates; it was an interesting machine.
The problems arise in the implementation. Predication adds another operand to all the operations (or at least the non-speculable ones, on a Mill) which adds another data path between the Belt and the functional units, which is an expensive proposition. One also needs a way to encode the extra argument, which would completely change the encoding; store in particular fully saturates the argument space available. Then there’s the problem of keeping the predicate value live on the belt; it increases belt and scratchpad pressure.
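For the semantics only (the hardware cost lies in the routing, not the logic), a trivial sketch of mine of what a predicated store means:

```cpp
#include <cstdio>

// Predicated store: the side effect happens only if the predicate is true.
// On real hardware the expense is carrying the extra predicate operand from
// the belt to every such functional unit, not this trivial gating.
void store_if(bool pred, int* addr, int value) {
    if (pred) *addr = value;
}

int main() {
    int x = 0;
    store_if(false, &x, 42); // squashed: the exit test said this iteration is dead
    store_if(true, &x, 7);   // performed
    std::printf("%d\n", x);  // 7
}
```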
This is not to say that it couldn’t be done; the Mill ISA already has conditional branches and conditional returns, which can be thought of as a form of predication. When the tool chain gets further and we start implementing pipelining in the specializer and try it on real code then we may rethink the design. For now, we don’t think we need anything beyond what we have now.
- in reply to: Pipelining #1273
What to do when you run out of scratchpad:
The next and final level is “extended scratchpad”, in memory. The extended space is not accessible to load/store (at present; there are arguments about that), but uses specific spill/fill ops, its own allocator, and a stack-like allocation regime. As with regular scratchpad, the extended spill/fill ops deal in operands, not bytes, and unlike memory but like scratchpad the operands retain their metadata. As a consequence, an extended spill/fill must double-pump the cache datapaths, once for the operand data and once for the metadata. Extended scratchpad is cached normally.
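A schematic model of that double pump, purely my own sketch with made-up types:

```cpp
#include <cstdint>
#include <vector>

struct Meta { std::uint8_t width; bool scalar; }; // schematic metadata tag
struct Operand { std::uint64_t data; Meta meta; };

// Extended scratchpad with stack-like allocation. A spill is two transfers
// down the cache datapath (the double pump): one for data, one for metadata.
struct ExtScratch {
    std::vector<std::uint64_t> data_words;
    std::vector<Meta>          meta_words;
    void spill(const Operand& op) {
        data_words.push_back(op.data);  // pump 1: operand data
        meta_words.push_back(op.meta);  // pump 2: metadata
    }
    Operand fill() {                    // LIFO, matching the allocation regime
        Operand op{data_words.back(), meta_words.back()};
        data_words.pop_back();
        meta_words.pop_back();
        return op;
    }
};

int main() {
    ExtScratch xs;
    xs.spill({42, {8, true}});
    Operand op = xs.fill(); // round trip preserves both value and tag
    return op.data == 42 ? 0 : 1;
}
```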
Epilogue:
There are loop bodies where loop work still needs to be done after a while-predicate has determined that exit should occur. If you want to think about it, the general solution is to cause any data sources in the pipeline after the predicate to produce Nones. However, how to do that depends on the kind of source, some of which are NYF, and all really need pictures to explain. We’re also not sure we have found all the cases yet.
Wiki:
We are working to get the ISA Wiki doc mechanically generated from the specs.
- in reply to: Pipelining #1269
Sorry, NYF.
We hope to have some of the hardware team give one or more talks on the gate-level hardware this fall. Don’t hold your breath for Verilog, though.
- in reply to: Pipelining #1263
Yes, latencies are member dependent, so order is too. Yes, same latency drops in slot order, and ops with multiple results drop in result-number order. Yes, the specializer deals with all that in the scheduling.
As for the level of presentation, it’s really hard to balance the needs of the newbies against the impatience of the old hands, and fit it all into a talk of reasonable length. I try to cut things into reasonably complete topics with adequate background, mindful of the physical audience; at Facebook only half the audience had prior exposure to notions like the Belt or Nones, and I don’t want to blow off the other half.
In many talks I can speak faster, but in this one the animations largely paced the talk; they can’t be too fast for the audience either. I was forced to leave out quite a few niceties; the drop-order issue for example, and things like extended scratchpad – I’m a little surprised that no one here has asked about what you do when you run out of scratchpad. And the whole question of epilogues.
The new Wiki will be up on our site soon, initially very light on content but I’m hoping you folks will help populate it quickly with searchable details. Those talk-viewers who need more can always find it here on the Forum, unless it is still NYF. There are at least six more talks to go.
Queue machines and belt machines are different, and both are different from stack machines. I was aware of some of the prior work on queue machines that Canedo cites in his thesis, but not all, and his compiler work was completely new to me.
The key insight in a queue machine is that if the dependencies form a complete tree then a level-based schedule permits access at only two points, the front and the back of the queue, and maximal parallelism falls out automatically. I suspect there is a relationship between what a queue machine can do vs. what a general-register machine can do that is mathematically related to the difference between what a FSM can do and what a CFG can do, but am not sure. Perhaps those of you with CS skills might find a thesis topic in there.
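To see the insight in miniature, here is a toy queue-machine evaluation (my own sketch following the classic model, not Canedo’s compiler) of a complete tree, where every op pops its arguments from the front of the queue and pushes its result at the back:

```cpp
#include <cstdio>
#include <deque>

int main() {
    std::deque<int> q;
    // Evaluate (1 + 2) * (3 + 4) as a level-scheduled queue program.
    // Level 0: enqueue the four leaves.
    for (int leaf : {1, 2, 3, 4}) q.push_back(leaf);
    // Level 1: two adds, each consuming two operands from the front. Because
    // the schedule is level-based, both adds could issue in parallel.
    for (int i = 0; i < 2; ++i) {
        int a = q.front(); q.pop_front();
        int b = q.front(); q.pop_front();
        q.push_back(a + b);
    }
    // Level 2: the multiply consumes the two sums.
    int a = q.front(); q.pop_front();
    int b = q.front(); q.pop_front();
    q.push_back(a * b);
    std::printf("%d\n", q.front()); // 21
}
```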
Canedo’s work relaxes the pure-tree requirement of a classic queue machine toward more realistic code, and his thesis addresses how to compile for that model. I haven’t finished the thesis yet, but it’s clear that his scheduler takes dataflow graphs as input just like ours does, although it works very differently from ours. Of course, we have to deal with additional real-world issues that he doesn’t: an exposed pipeline with varying latency, for example. But even without those complications, I’m not sure that his approach to turning the graph into sequential code will work even theoretically for a belt machine like the Mill, any more than it would produce reverse Polish for a stack machine. Still reading.
- in reply to: Pipelining #1253
Other constraints limit the practical length of the belt to avoid impact on clock or power. We know 32-long works, and we think 64-long may be practical for high-end throughput-oriented members with slightly lowered clock rates, but 128 seems unlikely.
For your blur, if those 60 are really 20 RGB vectors (as PeterH suggests) then they will fit in the 32-long belt of the mid- and upper-end members. I can imagine a member with say 16 FMA units and a 32-long belt, although we are starting to get into GPU-land with that. If there really is a distance of 60 then code even on high-end general-purpose members with 64-long belts is going to use the scratchpad and its rotators as a belt extension, and the spills and fills of the overflow that doesn’t fit on the belt will wind up being pipelined with the rest of the code. The good news is that the scheduler in the specializer takes care of getting the spill-fill right; you don’t have to do it yourself.
- in reply to: Pipelining #1249
The spiller has unlimited size, because under the buffering is all of memory. The constraining factor the spiller presents is bandwidth – high-end Mills with big belts have a lot of state to save and restore. The top of the spiller, the part that connects directly to the belt and the rest of the core, has enough bandwidth to eat everything the core can throw at it into buffer registers. The next level down is an SRAM, and there’s not so much bandwidth between the buffers and the SRAM. The final part of the spiller, between the SRAM and the bottom-level cache and thence to DRAM, has the lowest bandwidth of all.
Any one of these steps can be overwhelmed if the core really works at it, just as the regular load-store bandwidth can be overwhelmed, on a Mill or any machine. Both spiller and the memory hierarchy are able to issue-stall the core if they need to catch up. The individual Mill members have the bandwidths sized so that stall is rare, part of the resource balancing that goes into the detail work on each family member.
- in reply to: Pipelining #1251
Actually, it’s even simpler than Will says; I wish I could have gone into the mechanics in more detail, but the talks have time constraints.
It comes down to the fact that drops are always at the front of the belt, not at some slot index position as if the belt were an array rather than the FIFO that it is. Consequently the hardware must use a canonical ordering of drops so the software can ask for the values in the right place when it wants them later. That ordering is latency order: if a multiply (latency 3) and an add (latency 1) drop in the same cycle, the add will be at a higher position number (say b1) than the multiply (say b0).
The Nones that are dropped by retire are stand-ins for values that are in flight and haven’t dropped yet, and they are in flight because they are the results of long-latency operations that haven’t reached steady state yet. Because drops are in latency order, the values that haven’t dropped yet would have been at low belt numbers had the loop been in steady state. Consequently all the Nones from retire, no matter how many there are, will be in a block at the front of the belt, followed by any real retires, followed by the rest of the pre-drop belt.
Getting that to work without a special op to say where things should go is what makes Dave’s Device so clever IMO.
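A toy model of the canonical drop order (my own sketch, not the real hardware) for the multiply/add example above:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

struct Drop { std::string name; int latency; };

int main() {
    // An add (latency 1) and a multiply (latency 3) dropping in the same cycle.
    std::vector<Drop> retiring = {{"add", 1}, {"mul", 3}};
    // Canonical order: longer latency drops nearer the front of the belt.
    std::sort(retiring.begin(), retiring.end(),
              [](const Drop& a, const Drop& b) { return a.latency > b.latency; });
    std::deque<std::string> belt; // belt.front() is b0
    for (const Drop& d : retiring) belt.push_back(d.name);
    for (std::size_t i = 0; i < belt.size(); ++i)
        std::printf("b%zu: %s\n", i, belt[i].c_str()); // b0: mul, b1: add
}
```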
Will is right that there are two forms of retire. Recall that Mill operands have metadata tags telling the operand’s width and scalarity, which guide the functional units when the operands later become arguments to ops. Nones, including those dropped by retire, need width tags too. There is one form of retire that gives the desired widths explicitly. There is also another form that gives only a drop count, as shown in the talk, and that form drops Nones with a tag with an Undef width.
When an Undef width is later eaten by an op, the width is defaulted to a width that makes sense to that op, following defined rules, and the op’s result (always a None of course) carries the width that would have resulted from that default. However, for some ops the defaulting rules can result in a wrong width, different from what the real non-None would have done, that can screw up the belt for other operations. The scheduler in the specializer knows when that happens, because it knows what the desired width is, and when it detects that the default would be wrong it uses the explicit-width form of retire.
That would have been another ten slides and an explanation of width tagging, and there wasn’t time for it.
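Still, for the curious, a schematic sketch of the two retire forms (my own made-up types; the real encoding differs):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

enum class Width { Undef, B8, B16, B32, B64 };
struct None { Width w; };

// Explicit-width form: the scheduler spells out each dropped None's width.
std::vector<None> retire_explicit(const std::vector<Width>& widths) {
    std::vector<None> drops;
    for (Width w : widths) drops.push_back({w});
    return drops;
}

// Count-only form: cheaper to encode; every None carries Undef width, and a
// later consumer must default it, which can occasionally be wrong.
std::vector<None> retire_count(int n) {
    return std::vector<None>(static_cast<std::size_t>(n), None{Width::Undef});
}

int main() {
    auto a = retire_explicit({Width::B32, Width::B64});
    auto b = retire_count(2);
    std::printf("%zu explicit, %zu undef\n", a.size(), b.size());
}
```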
I was unfamiliar with your work and greatly appreciate you bringing it to my attention. Your thesis will take a bit for me to digest; please email me at ivan@millcomputing.com and I’ll comment when I have gotten through it. Unfortunately, research in Japan is not as well known here in the US as it should be; for that matter, the same is true of research from anywhere outside the US, due to some mix of language barrier, NIH, old boy networks, and perhaps arrogance.
As for your Queue Machines, it’s clear that stack machines, queue machines, and belt machines like the Mill are all members of a super-family of architectures in which transient results are single-assignment; the differences among stack, queue, and belt lie in the permitted/encouraged access pattern to those transients. Perhaps systolic array machines may be thought of as belonging to this group too. The other great super-family, relaxing the single-assignment constraint, includes the accumulator and general-register machines.
Thank you.
- in reply to: Pipelining #1220
Hmm. Looks like we needed another slide.
The point about the retire op is that with it there is no prologue code any more. The whole thing – prologue, steady-state body, and epilogue – is the same physical piece of code in the cache. If there were a conventional prologue, its ops would be in instructions of their own, occupying cache space for the prologue in addition to the steady-state body that we need anyway.
Yes, the ops that are logically “before the beginning” get executed, but they have no visible consequences and occupy no space that we wouldn’t use anyway. Yes, these ops do cost power – but they do not cost code space, which is a much more significant constraint in the big loops with low iteration counts typical of general-purpose code.
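To see why the “before the beginning” ops are harmless, here is a toy analogy (my own sketch, using std::optional to play the role of None): ops with None inputs produce Nones, and a store of a None is suppressed, so the steady-state body can double as its own prologue.

```cpp
#include <cstdio>
#include <optional>
#include <vector>

using Val = std::optional<int>; // nullopt plays the role of a Mill None

Val add1(Val v) { return v ? Val(*v + 1) : std::nullopt; } // None in, None out

int main() {
    std::vector<int> out;
    Val in_flight = std::nullopt;                 // a None pre-dropped by retire
    for (int i = 0; i < 5; ++i) {
        if (in_flight) out.push_back(*in_flight); // "store": suppressed on None
        in_flight = add1(Val(i));                 // next result enters the pipeline
    }
    if (in_flight) out.push_back(*in_flight);     // drain the last in-flight value
    for (int v : out) std::printf("%d ", v);      // 1 2 3 4 5
}
```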
As for assembler, we agree with you that Mill concrete assembler (in which the programmer explicitly specifies the belt positions of op arguments) is inhumane both to read and to write. If you have a program that you have written in concrete assembler then you can run it in the present simulator and see the belt contents and track them, instruction by instruction; the Specification talk shows this being done, live. We expect that a debugger on real hardware will provide a similar interface.
If you are writing in abstract assembler (specializer input, without belt assignments, as opposed to specializer output concrete asm) then there is no “current status of the belt” to view, because the code may be specialized for, and run on, different family members with different belt lengths and operation/instruction schedules. Only when these member-dependent aspects are blended into your abstract code – by the specializer – can a tool, or the hardware, know what is happening on the belt.
Unfortunately, while we are hard at work on the abstract assembler and specializer, they won’t be ready for some months yet. When they are, we are considering making them and the simulator available running on our servers from our web site, for anyone to play with. We’ll see. There are also some more distant plans – hopes, really – to provide a visual UI for the sim or debugger so the programmer can see the movement of data and events inside the running program, visually reminiscent of the animations in the talks. Not soon, though.
- in reply to: Inlining functions vs. call/return #1207
Addendum to Will’s comment:
By far the greatest value of inlining on the Mill is that it exposes opportunities for hoisting loads. As you point out, it is common that the first thing a method does is to load a data member. If the function is inlined then the load’s retire point can be left at the inline position, but the load itself can be turned into a deferred load, hoisted enough to hide cache latency, with the deferred retire preserving proper retire ordering. Without the inline, the very first thing that the function will do is stall on a cache miss.
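In source-level terms (a hypothetical C++ example of mine, not Mill asm), the transformation looks like this:

```cpp
#include <cstdio>

struct Obj { int member; };

int work(); // some independent computation

// The method's first action is a load; behind a call boundary, that load
// cannot start until the call, so it may stall on a cache miss immediately.
int method(const Obj* o) { return o->member * 2; }

int caller_no_inline(const Obj* o) {
    int w = work();
    return w + method(o);
}

// After inlining, the load can be hoisted above the independent work. On the
// Mill the specializer would issue it as a deferred load whose retire stays
// at the original position, hiding the cache latency under work().
int caller_inlined(const Obj* o) {
    int m = o->member; // hoisted load; latency overlaps with work()
    int w = work();
    return w + m * 2;
}

int work() { return 42; }

int main() {
    Obj o{10};
    std::printf("%d %d\n", caller_no_inline(&o), caller_inlined(&o)); // 62 62
}
```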