Mill Computing, Inc. › Forums › The Mill › Architecture › The Belt
Tagged: belt
- AuthorPosts
- #250 |
Talk by Ivan Godard – 2013-07-11 at Google
Slides: 2013-07-11_mill_cpu_belt (.pptx)
Belt Machines
Data interchange without general registers
A large fraction of the power budget of modern superscalar CPUs is devoted to renaming registers: the CPU must track the dataflow of the executing program, assign physical registers and map them to the logical registers of the program, schedule operations when arguments are available, restore visible state in the event of an exception—all while avoiding register update hazards.
Not all CPU architectures are subject to hazards that require register renaming. Unfortunately, earlier hazard-free designs either require one-at-a-time instruction execution (stack and accumulator machines) or push hazard avoidance off onto the compiler or programmer (VLIWs). The Mill is a new machine architecture that eliminates these problems by adopting a new machine model, the “belt”.
The belt machine model is inherently free of update hazards because all operation results go onto the belt by Single Assignment; in other words, once created they never change their value. Belt machines have no general registers and thus no rename registers that physically embody them. Result addressing is implicit, which produces compact code and easily accommodates operations like integer divide that logically produce multiple results. The machine model integrates naturally with function call, eliminating caller/callee save conventions and complex call preamble and postamble code.
A belt machine has short pipelines because it lacks the extra pipe stages associated with rename; typical misprediction penalty is five cycles (if decode is also fast). Area and power consumption in a belt core is a third that of an equivalent superscalar in large part because a belt lacks the large number of physical rename registers and the interconnect needed to supply register values to the functional units.
The talk explains the belt model as seen by the programmer and the physical internals of a typical implementation.
- This topic was modified 10 years, 8 months ago by staff.
A call pushes parameters onto the callee’s belt and creates a new frame id etc, which seems straightforward; but what happens to the belt when you branch rather than call?
The loop execution has advanced the belt some number of positions, so the passed-in parameters are no longer at the top of the belt. There must be some kind of rewind mechanism? Which then makes me wonder what happens to internal values you don’t want to lose?
Hi Will,
Unless I’ve misunderstood something meaningful because the videos weren’t too clear, how is it really any different from how a typical load/store CPU works? That is: when you do a branch in a load/store CPU, you know (as the compiler writer) what’s stored where for registers as of the time of the branch: it’s no different with the Mill Belt, and unless I’m mistaken, the Belt doesn’t move merely because of the branch.
That’s not to say that there aren’t instructions in-flight that need to be accounted for: but that’s not really any different from no branch, because the compiler knows exactly how much is in-flight as well, as it’s 100% deterministic, even if there are pipeline stalls from cache misses.
Hi Jonathan,
I’m beginning to form my own (uninformed) mental model of the belt in loops.
The loop body must be in an EBB, and the parameters to an EBB are passed at the front of the belt. The
call
instruction has a list of belt positions and order to make the new belt for the callee.If you jump back to the start of the EBB, you have to put the parameters for the next iteration back at the front of the belt.
My intuition would be that the
branch
ops can all specify a list of belt positions and order to put these back at the front of the belt for the next iteration, just like acall
can.I might have missed if this has been explained in any of the talks so far.
The operation that renumbers the belt is called “conform” and is a separate op from branches. It takes a belt number list, and those are moved to the front of the belt, in order, and the rest of the belt is invalidated. There is also a “rescue” operation that takes a similar list, but the rescued values are left on the belt in original belt order, not in list order. Obviously you can use conform to get the effect of rescue; rescue exists because it has a more compact encoding, for use when you don’t care what the order is so long as they don’t fall off.
Any time you have a control-flow join point in which the in-bound arcs do not have congruent belts (such as at the top of a loop) then you pick one of the arcs, define its belt as canonical, and use conform on the other arcs to make everything congruent. For loops you generally want to define the most frequently taken backwards branch as the canonical one; that eliminates the conform that would be most frequently executed.
- This reply was modified 10 years, 8 months ago by Ivan Godard.
- This reply was modified 10 years, 8 months ago by staff.
Hi.
Some questions about the belt.
When doing branches, loops etc. The belt needs to use the “conform” instruction to rearrange the belt.
How efficient is the “conform” instruction?
A CPU with normal registers would never need a “conform” instruction, so isn’t the belt causing a processing overhead here?
I guess the “conform” could be a hardware shuffle of the belt providing very little overhead.
Another problem aspect of the belt might be micro parallelism.
From a single group of instructions to be executed, one might be able to execute some in parallel if they do not conflict. But the belt adds a conflict/race when the result is saved on the belt.
A possible solution to this might be: Multiple belts. The instructions that work in parallel could be saving to different belts, thus removing the conflict.
One would then have to implement some stall/syncronisation mechanism whereby the next instruction stalls, until the previous parallel instruction finishes that the stalled instruction depends on.
How many bytes is the belt? How much needs to be copied on a context switch?There isn’t any conform op any more; the functionality was moved to branch ops so a taken branch can rearrange the belt to match the destination. However, the same question applies: the rearranging takes zero cycles and has essentially no cost. Nothing is actually moved; all that happens is that the mapping from belt number (to decode) to operand location (dynamic) is shuffled.
There’s no real conflict/race. The “drop to the belt” doesn’t involve any real movement. The result stays in the producer’s output latches until the space is needed for something else (when a real move will happen). “Drop” just enters the new value’s physical location into the mapping from belt number to location. It tells the rest of the machine “when you want b3 it’s over there”. You can have multiple drops from different sources; each goes into its own output location, and each gets its own mapping. Yes, there are necessarily multiple updates to that mapping happening, but the values are 3-bit belt numbers, not 128-bit data, and the hardware can handle it.
The belt has no byte size; it’s a logical list of locations. The size is the number of locations it tracks; current configs exist with 8, 16, and 32 positions. It doesn’t have to be saved, because each operand self-identifies with a timestamp which is its modulo index in an infinite sequence of drops; the leading numbers of that are the current belt.
It’s a bit to get your head around, I know 🙂
We talk about the belt as if it were a shift register, and dropping as if they were copies, but that’s all “as if” metaphor.
There isn’t any conform op any more; the functionality was moved to branch ops so a taken branch can rearrange the belt to match the destination.
Recently, I have been comparing different other ISA’s calling conventions and how they assign function arguments to registers. Arguments are not necessarily specified in the order (or reverse order) of urgency, so for register-based ISA’s it will not necessarily be the arguments that best need to be in registers that get assigned to them.
And in the case of the Mill’s belt, it will also not necessarily be arguments with longer life-times that get assigned to the first belt positions, furthest away from the precipice.That got me thinking… Before removing the conform op, had you considered putting the conform op at the top of a function body to reduce the probability of having to spill/fill arguments further down?
I’m thinking that something equivalent to that could also be achieved by reordering arguments during the link-time optimisation pass right before specialisation, but of course for module-private/non-referenced functions only.A belt needs to have congruence maintained wherever two or more control flows join. The includes call and return sites, and also branch-like control flow when a value dropped in one ebb is passed to and consumed by ops in a different ebb. The argument lists of call and return provide a ready-made way to enforce congruence: the arguments must be in the same order at each site, which is easy to enforce.
The erstwhile “conforms” op rearranged the belt for branch join points. It was a separate op so that it could be omitted if by happenstance (or good scheduling) the live values were already congruent on the in-bound control flow arcs. It turns out that such natural congruence almost never happens, so we replaced the conforms op with a call-like argument list on the branches.
You suggestion to re-order call arguments to get them into belt lifetime order is valid, although the current specializer does not do it. In fact, it does no partial-parameterization yet at all – not just ordering, but such things as folding common constant arguments into the body of the called functions. Some of that sort of thing are done by LLVM’s Link Time Optimization facility, and so will also be done for Mill targets. However, argument reordering is not done by LLVM, and we have no plans to add it to LLVM any time soon.
By the way, what are the maximum number of arguments on the belt in function calls? The whole belt? Or is it the same on all members?
I could not find that info in the Wiki. The Wiki also used to have the bitwise format of instructions but that eludes me too.One less than the whole belt. The extra position is needed for the predicate, if there is one.
The ISA details in the Wiki had developed bitrot after Jan took a leave of absence. We now have a new team member who’s picking it back up again, but not ready yet.
> However, argument reordering is not done by LLVM, and we have no plans to add it to LLVM any time soon.
But doesn’t conform loop quite a bit like the block arguments that MLIR uses, as a different way to represent PHI nodes? It seems that conform is a sensible way to model branches that have multiple target PHI nodes. (I have programmed LLVM with such PHI nodes, but I understand why MLIR uses argument instead).
The separate conform op is no more; branches can now list the things they want preserved on the belt, in the order they want them. So branches now look like function calls, with belt arguments. The phi ops in the genAsm input to the specializer are removed during input, and replaced by arg lists on the exit arcs of blocks. Each arc is a go-to, while phis are come-from.
Whilst the conform instruction is an extra instruction, its arguably doing what a conventional register machine has to do anyway, but spread out across the instructions of the loop. Ideally, such as when there is only a single place in the code that can branch to a block, you don’t need it as can setup the receiving block based on the incoming belt layout. But when you do need it you are just bringing back the overhead that a conventional register machine has in the encoding of the output registers of instructions.
As for the idea of a conflict, I’m not sure where you got that from. It might be worth going back to the execution talk. The mill is a statically scheduled VLIW arcitecture. So microparallism is handled by an explicit ordering of instructions and statically defined latency (for a specific family member, see the specification talk for how this is handled) which allows the specializer to dispatch multiple opcodes per instruction, and know exactly when each will drop its result onto the belt. Thus very few instructions can actually stall. pickup being one of the few, and there’s very little you can do if you are stalled on a pickup.
And for context switches the amount of data would depend on family member, but is handled in the background by the spiller. And anyway the register file/belt on a modern processor is a tiny part of the cost of a context switch as opposed to cache flushing/TLB flushing and these days flushing the speculative execution pipeline. So the differences in the mills memory system will hide any impact the belt has on context switches.
You are right that the congruence arguments to branch instructions provide a replacement for the result specifiers of legacy ISAs. You are wrong when you assume that the costs are equivalent.
Legacy result specifiers are needed on any instruction that has a result. A congruence argument is needed only for values that are still live across a branch at the end of a basic block, and then only if the branch is going to a control flow join point. The counts of specifiers and of congruence arguments can and do diverge dramatically in real code. I see codes like sha1 where there are hundreds of results dropped in a block, and only two are passed to the next block. I’ve never actually instrumented the tools to measure, but by eyeball I’d guess that there is one congruence argument for every eight drops on average.
Just to give you a better guess I took a look at the code I happened to have been working on before writing this note: it’s five functions from a buddy-allocator package and I had been chasing an case of less than optimal asm coming from the specializer. In the whole package there were exactly three congruence arguments, one each on three branches; as it happened all three were control variables of loops. The code of the five functions dropped 52 times total.
This ratio is so high for a mix of three reasons. First is that many values die before they get to a branch; only the live ones need congruence. Second is that the specializer doesn’t use congruence for values that are live across loops but not used within the loop: instead it spills at point of creation and fills at point of use. That allocator code has five spills and six fills on the Silver config. A legacy machine with lots of registers could have just left the values in register. Of course those would have to be spilled across calls; as Heinlein said, TANSTAAFL. Third, the specializer goes to a lot of work to eliminate branches. That’s to reduce the cost of prediction and mispredicts, but as a by product increases the benefif of congruence vs. specifiers.
Thanks for that. I was speculatively hoping that conform would be less costly for the liveness reasons, but without actual code examples in front of me, I assumed the worst case of legacy blocks containing only enough instructions to equal the cost of conform. I didn’t know how much the specializer avoids branches though.
In reply to Jonathan #397: Yes, regular branches have to belt consequences. Of course, the predicate for a conditional branch must be on the belt, but that’s an argument, not the branch itself.
- This reply was modified 10 years, 8 months ago by Ivan Godard.
Check out the scratchpad, described in the Belt talk. It’s where you put stuff that you don’t want to lose.
Is it possible to issue an op even if an instance of that same op is already in flight in that same pipeline?
(This is hinted at in this talk but I want to confirm my understanding – or misunderstanding – of this)
In general all ops are fully pipelined, so each slot can issue one op per cycle. Consecutive ops in a pipeline can be the same or different, it doesn’t matter. Ops do not overtake one another in the pipeline, but they do differ in latency so a shorter latency op can retire before a longer-latency op that was issued earlier.
There are a few cases of intra-pipeline hazards that code gen must be aware of and statically avoid. An example is the FMA (fused multiply-add, sometimes called MAC for multiply-accumulate). If, for example, a regular multiply takes four cycles and a regular add takes two for some operand width, it is possible to do a FMA in five, not six, and with better precision than with separate operations. Hence, as the intermediate result is coming out of the multiplier in cycle 4 it needs to be fed directly into the adder. However, the code could be issuing an add operation in cycle 4 that is also goung to try to use the adder, and they can’t both use it.
As a result, while the FPU is pipelined and you can issue one FMA per cycle, or one add per cycle, with no problem, the right combination of FMA and add causes a pipeline hazard. The compiler should avoid this, and the hardware will throw a fault if it detects a hazard violation.
Some implementations of FMA have all their own hardware and don’t share the regular adder. These have no hazard, but do have more costly hardware; it’s all in the specification.
Wait. If you issue a FMA like that and immediately call, the code on the other side of the call knows nothing about the in-flight op. How do you handle hazards like that? Do you stall? And if you do stall, when do you decide to stall?
Calls (including the body of the called function) have zero latency. The FMA drops after the call returns.
The Mill spiller not only saves the furrent belt and scratchpad, but also everything that is in-flight. The in-flights are replayed when control returns to the caller, timed and belted as if the call hadn’t haoppened.
That how we can have traps and interrupts be just involuntary calls. The trapped/interrupted code is none the wiser.
How do operations that have multiple (always 2?) return values interact with the output registers for the pipelines? Does the extra get bumped to a higher latency result register? Do you use the extra buffer registers that were mentioned? Or are there special pipelines for dual-result operaitons that have pairs of registers?
There’s no restriction to two results, although we have none more than two except the function return op itself, which can completely fill the belt if it wants.
FUs that have ops that can return more than one result have an output result register for each result. These bump up the latency chain normally if there are the same number of registers at the next higher latency. If there are not (which is common) it’s up to the hardware guys whether to add dummy result registers at the higher latencies to bump into (simpler, but more space and more belt sources) or bump the excess direct into the spiller buffer pool (more spiller sources). The software can’t tell.
Out of curiosity, does anyone (Ivan?) happen to have the title of the Yale Patt paper/work referenced in this presentation? Seems like an interesting read.
Thanks in advance 🙂
@tino: I’m not certain what source Ivan was recalling, but this PhD thesis from one of Professor Patt’s students reports some of the value fanout data, e.g., 4% of results are produced but never used (p. 9).
https://www.lib.utexas.edu/etd/d/2007/tsengf71786/tsengf71786.pdf
I hope this helps.
Say you have two values on the belt, a and b, and you conditionally add one to a. Then either your belt looks like (a b) or (a+1 a b). How does the compiler then address value b in the code that follows? It is either b1 or b2, but how can the compiler know?
edit: I see this may have been answered in an earlier comment. If I understand right, at the end of the conditionally executed branch you would have a conform operation which reshuffles the belt. Is that a single-cycle operation?
- This reply was modified 10 years, 4 months ago by ksrm.
Yes you can rearrange the belt with the conform op.
You can also speculatively execute operations and pick whether to use the result later. The Mill has a lot of pipelines and there’s often slots available, and speculation doesn’t fault, so its routine to unbranch cheap code in this way. This is described in the Metadata talk.
KSRM, I believe the way the operation would usually be done on a Mill, first you have an unconditional increment of a, then a conditional select, the select being a “zero time” operation.
(a b)
(a+1 a b)
(a a+1 a b) or (a+1 a+1 a b)
A given operation always adds the same count of results to the front of the belt. If you use branching you need to make sure the belt is always set up correctly before your branches merge.
This is a great initiative, the best of luck to the founders.
Has anyone made the connection yet between belt machines and queue machines?
From 2004 to 2008 I was part of Prof. Sowa’s lab where we developed the Parallel Queue Processor. We had a similar idea (FIFOs holding operands) with similar objectives: high parallelism, no register renaming, and short instruction set. http://dl.acm.org/citation.cfm?id=1052359
For those interested in the compilation aspects: http://ir.lib.uec.ac.jp/infolib/user_contents/9000000330/9000000330.pdf
Some additional material about queue machines has been compiled by Prof. Abderazek:
http://web-ext.u-aizu.ac.jp/~benab/research/projects/QueueCore/
I was unfamiliar with your work and greatly appreciate you bringing it to my attention. Your thesis will take a bit for me to digest; please email me at ivan@millcomputing.com and I’ll comment when I have gotten through it. Unfortunately, research in Japan is not as well known here in the US as it should be; for that matter, the same is true of research from anywhere outside the US, due to some mix of language barrier, NIH, old boy networks, and perhaps arrogance.
As for your Queue Machines, it’s clear that stack machines, queue machines, and belt machines like the Mill are all members of a super-family of architectures in which transient results are single-assignment; the differences among stack, queue, and belt lies in the permitted/encouraged access pattern to use those transients. Perhaps systolic array machines may be thought of as belonging to this group too. The other great super-family, relaxing the single-assignment constraint, includes accumulator and general-register machines.
Thank you.
Canedo wrote:
Has anyone made the connection yet between belt machines and queue machines?
From 2004 to 2008 I was part of Prof. Sowa’s lab where we developed the Parallel Queue Processor. We had a similar idea (FIFOs holding operands) with similar objectives: high parallelism, no register renaming, and short instruction set. http://dl.acm.org/citation.cfm?id=1052359
…That’s very interesting; another architecture using a form of FIFO for operands! Thanks for sharing those links.
One thing I admire about the Mill architecture, as described, is that the behavior of the Belt is prescribed in terms of what it does (programmer’s model) independent of how it’s implemented (in general, or on any given family member.) If I’ve understood from a quick skim, the above-mentioned Parallel Queue Processor’s queues are tied to a particular hardware implementation, namely a ring buffer. A ring buffer would not appear to lend itself to Belt operations like conform() — which permits arbitrary reordering of the belt’s contents.
Queue machines and belt machines are different, and both are different from stack machines. I was aware of some of the prior work on queue machines that canedo cites in his thesis, but not all, and his compiler work was completely new to me.
The key insight in a queue machine is that if the dependencies form a complete tree then a level-based schedule permits access to be at only two points, the front and the back of the queue, and maximal parallelism falls out automatically. I suspect there is a relationship between what a queue machine can do vs. what a general-register machine can do that is mathematically related to the difference between what a FSM can do and what a CFG can do, but am not sure. Perhaps those of you with CS skills might find a thesis topic in there
Canedo’s work relaxes the pure-tree requirement of a classic queue machine toward more realistic code, and his thesis addresses how to compile for that model. I haven’t finished the thesis yet, but it’s clear that his scheduler takes dataflow graphs as input just like ours does, but the scheduler works very differently than ours. Of course, we have to deal with additional real-world issues that he doesn’t: an exposed pipeline with varying latency, for example. But even without those complications. I’m not sure that his approach to turning the graph into sequential code will work even theoretically for a belt machine like the Mill, no more than it would produce reverse Polish for a stack machine. Still reading
I finally saw the talk on the belt to the end. I agree that the hardware implementation of queue machines and belt machines is probably different. However, their programming model is similar. I also think that belt, queue, and stack machines are very good at decoupling the programming model from their hardware implementation.
As Ivan points out, “pure” queue machines need perfect trees. Real code with DAGs needs a relaxation of the queue model. We thought about two variants: allowing random access to operands in the queue with an offset relative to the head of the queue (producer-order model), or relative to the tail (consumer-order model). The queue machine’s offsets are a cousin of the belt’s temporal references to operands.
An interesting observation about offsets or temporal references is that once they are introduced, they make the generated code dependent to a particular hardware architecture. This is because the queues and belts have a fixed length. Like VLIWs, the code needs to be recompiled every time is migrated to a new hardware implementation. A ring buffer creates the illusion of an infinite queue and, theoretically, it allows the same code to run on different Hw. This was one of the motivations of having the ring buffer as our main implementation, but we also architectural variants (e.g. a “multi-dimensional” queue processor where functional units were fed from different queues – the code is not pretty).
The only explanation on the conform operation I found is in this discussion thread. I am not sure what it really does. In the queue processor’s programming model we had a “rotation” and a “duplicate” instruction that were used to reorder operands in the queue. The compiler can find these locations in a straightforward manner. Is this what the conform and rescue operations are used for?
Queue machines can’t execute RPN code. We relied on the level-order scheduling to expose the ILP. The relied on the hardware to pick up the stream of instructions and issue multiple at a time. I have not yet seen the videos on the belt’s instruction scheduling, but I suspect you pack the parallel instructions in bundles as in VLIW’s? At the time we retired from the queue machines this was in my todo list.
The discussion on varying latency of the exposed pipeline is interesting. That makes me wonder how would the queue compiler’s scheduling look like? Probably similar to what you propose with the multipliers that span across cycle levels. I think playing with the vertical (depth) position of these operations in the DAG could solve the problem.
You are right that exposing the belt length by admitting temporal addressing does make the binary target-dependent, but the Mill is an exposed-pipeline machine and so the binaries are target-dependent for many other reasons, so temporal addressing does not add any complication that is not already there. Our solution is the specializer, which let’s us have target-independent load modules even though the binary is not. As no one but hardware verifiers (and the code of the specializer library) ever looks at the bits in memory the nominal binary dependence doesn’t matter; you get the same portability you get on any other machine family.
It is my impression (please correct me) that the ring-buffer-based queue machine forces a uniform and universal opset in each executing position. This is trivially supplied if there is only one execution pipe, but if there are more than one then non-uniformity has advantages. Thus if there are three uniform pipes the the QM can issue three ops if there are arguments available, but if say there are two pipes with ALUs and one pipe with a multiplier then the arguments for the adds and the multiply must be in relative order to each other to align with the intended pipes.. Alternatively, the adds and the mul could use your positional-addressing extension to get the correct arguments, but the result is no longer a QM.
The conform op was explained in the talks – Memory if I recall. Conform normalizes the belt, arbitrarily reordering the belt contents. It is used to force all control flow paths to a join point to have the same belt structure, at least w/r/t data that is live across the join. The implementation is just a rename, because actual data movement would be much too expensive. As with the rest of the Mill, the design takes into account the need for a commercially viable product in silicon.
Rescue is a similar but more compact encoding of conform, in which the belt order is preserved but dead belt operands are squeezed out, whereas conform admits arbitrary reorder. Rescue is used, as the name suggests, to prevent live operands from falling off the end of the belt.
Mill encoding is wide-issue like a VLIW, but otherwise very different; see the Encoding talk. The need to have not only live belt operands in the belt but also multi-cycle computation in the FUs across control flow forces us to use a very different scheduler than the DAG approach which is inherently restricted to basic blocks. We use a time-reversed tableau scheduler, similar to what some VLIWs use.
I did not think about queue machines for years; this is a great discussion.
Memories started to come back. We had the Parallel Queue Processor (PQP) – the implementation with the ring buffer cited above. We also had the Queue Register Processor (QRP) that was, in spirit, similar to the Belt + Scratchpad. In the QRP, the scratchpad was analogous to a random access register bank where we stored loop induction variables, and other frequently accessed operands. The instruction set was a hybrid where operands from both the queue (with relative offsets) and the register bank (absolute addresses).
In a queue/belt processor without a scratchpad, the spill-fill pairs can be replaced by single rotation or duplicate instructions. These can also be used to recycle the operands that fall from the tail.
Ivan, I am wondering how large of a belt do you guys need for heavy sw pipelined code? Unfortunately, my queue compiler was constrained to the dataflow information that GCC 4.X would give. I wish I would have had access to a more aggressive compiler at that time.
An idea that I never had the time to explore was the predicated execution in a queue model. I’ll try to see your branch prediction videos over the weekend to see how you handle it in the belt machines.
Rotate/swap ops are not sufficient to replace spill/fill because the number of long-lived live operands in the queue/belt is unbounded. The length of the belt in individual members is determined by the scratch-pad spill-to-fill latency (three cycles on most members) and the rate at which the core can produce results, which is roughly equal to the number of execution pipelines, running from ~5 to over 30 depending on family member. As a rule of thumb we set the length to hold three cycles of production, and tune from there.
Ah, I remember some related work dating back to the year 2000.
Bernd Paysan developed a 4stack processor back then and discussed it on comp.arch .
There have been other symmetrically-split machines; for example, the TI 8-wide VLIWs are actually dual 4-wide VLIWs with the ability to communicate between the halves. The main advantage of Paysan’s is that it does not need to encode result locations, and the transients are single-assignment and so are hazard-free; the Mill has these same advantages with the Belt. The major difficulties with Paysan’s design in a modern context (which is unfair; Paysan’s work was done 15 years ago when a three-stage pipe and a 100MHz clock were cutting edge) are the communicating crossbar, and that bane of all wide-issue machines: cache miss stalls. The Mill solutions for both have been covered in the talks.
I also wouldn’t want to have to do an instruction scheduler for Paysan’s machine
Since you’re mentioning 100MHz:
I was pondering around together with a colleague recently about the forwarding network.
It seemed to us that you face exactly the same issues than anybody else implementing a forwarding network with a large number of FUs.
Did you do a sniff check on the wiring issues you’ll get with the larger members, especially GOLD?On a related note I’m curious if I’m right that the size of the belt scales linearly with the number of FUs. IOW, is the number of belt entries per FU roughly constant across members?
I hope none of these answers fall in the NYF department
You’re safe on the NYF department, so far
You are right that the Belt is essentially a code-visible forwarding network and faces the same problems that such networks have on other machines. The Mill partitions the network to get both speed (for some critical results) and volume (overall), in a way that we have filed for.
At the end of the belt talk there are a couple of slides on how the cascaded crossbar works in hardware. Essentially there is a fast path, that gets a single one-cycle result from each slot through in time linear in number of slots (MIMD width), while everything else first goes through a slow path that is (roughly) linear in number of FU results that can be produced in a cycle; remember a single FU can have ops of different latencies retire together.
Without the cascaded crossbar, the network would have been clock-critical on the larger machines. With cascading, it appears (based on very preliminary hardware work) to be out of the critical path.
How accurate is the belt abstraction when it comes to the actual HW? That is, is it really the case, on a machine with 32 belt entries and roughly the same number of ALUs, that in each cycle, each ALU can take its operands from any belt location? And push a result to the belt?
I’m not a uArchitect (but I work with some), but it sounds to me as though either:
– There are some restrictions in place to keep the HW cost reasonable, so the belt abstraction isn’t the whole story. That would impact a lot of the described architecture in deep ways, though.
– There is the mother of all crossbars which can, in one cycle, route full operands from any of 32 sources to any of 32 destinations, and take 32 results back into (almost) arbitrary positions – which, I am told by my uArchitect friends, is “impractical” (they’d probably use a stronger word there
– There is some very clever uArchitecture trickiness going on which wasn’t described in the videos (including the belt video). There are tantalizing hints in the videos about a tagged structure but not enough to explain how this circle is squared.
Perhaps this is “too low level” to be covered in an architectural video/forum, but I’m burning with curiosity here
The abstraction is accurate. For a belt with nominal 32 entries, any data sink can take its data from any position. There are actually twice as many physical entries (64) as nominal, to allow for phasing, but each phase can only see a nominal window into the physical belt space. Without that, we would have to renumber the belt at each phase boundary, which is not practical.
There is a crossbar, but it is split with a narrow fastpath and a wide slowpath. One-cycle latency results (a maximum of one per slot) go direct to the fastpath, while two values (for each consumer) are selected from the slow path and passed to the fast path. This is explained in greater detail in the Belt talk.
However, although there are many possible sources for the crossbar, there are many fewer sinks than you assume. In particular, slots accept at most two input arguments from the belt, and quite a few (all the Reader block) accept none at all. The cost of the crossbar is determined almost entirely by the number of consumers, which is one reason why Gold has only 8 exu slots.
Similarly, while a slot can concurrently drop results from differing-latency ops issued in that slot, the long-term steady-state peak throughput is ~one result per slot, which determines how we size the spiller.
We describe the belt implementation as in effect a tagged distributed CAM, because that is easy to explain, and for many people a concrete realization is more understandable than a more abstract description. However, the implementation of the belt is invisible to the program model, and the hardware guys are free to come up with bright ideas if they can, which will likely differ in different family members.
There is expected to be a talk this Fall giving the present most-recent final word about the gate-level hardware of the first Belt implementation.
So Gold has only 8 exu slots, but can still issue 33 operations? The Belt talk mentioned only 8 of the ops could be additions, that doesn’t leave room for anything else…
You mention the “first Belt implementation”, I guess this means you have different ones (for different family members?). This makes sense as the wider you get, you probably need to do things differently, while keeping the same abstract model for the SW.
I guess things will be clearer after the Fall talk about this; I’m very happy to hear this would be the topic.
Gold, as currently specified (the specification tends to migrate a little), has:
exu:
(for the interested, Art described an earlier Gold’s pipes: http://millcomputing.com/topic/introduction-to-the-mill-cpu-programming-model-2/#post-610 )
Greetings all,
Where did pick() operations– and the slots that implement then — go when Gold’s spec changed?The an earlier description of Mill slots and pilelines:
http://millcomputing.com/topic/introduction-to-the-mill-cpu-programming-model-2/#post-610
(Will, thanks for citing that!) Art mentions,
“4 pipes that can do pick operations”In Gold in particular — since Gold’s a specific, evolving member for which we have some info revealed —
Which slots on the current Gold Mill can pick operations now be encoded and executed?My guess is that pick Ops remain in the Mill ISA, but have been subsumed into other slots (perhaps exu-side?) I note that in Ivan’s latest slot-tally for Gold, he doesn’t break the slot functions down to as fine a level. In the earlier description, there were examples of further breakdown by function, for example how many exu-side slots could do multiply(), vs. how many others that could do simpler ALU ops, but not multiply().
If possible after considering NYF and and Guru time/effort, I’d like to see a bit more detail about the FU and slot population on at least one member, probably Gold, since some detail on it has been revealed — and the design is evolving.
M evolution, and motivation for the changes will IMHO give readers some insight about Mill performance and the sorts of trade-off that have been made — and can be made — using MillComputing’s specification and simulation tools.
One thing I’d like to see in the Wiki is a description of the instructions and any constraints they impose. (Ivan has mentioned some issues with potential FMA hazards, that the scheduler has to avoid.) Ideally, such wiki entries would automatically generated.) But if not, than user-supplied explanations, eventually corrected by the MillComuting team, would give users a better understanding of what the Mill can do, especially at the abstract Mill level.
Theoretically, the Mill could have a very few operations as its core ISA, with a larger subset emulated with series of the core operation. While this could be taken to extreme (how many operations does the classic Turing machine have?) I expect that for performance reasons — and the Mill is a performance-based sale — the Mill will define a sizeable core subset of Ops that are implemented very efficiently on all members (in hardware, most likely), instead of as graph substitution of (longer sequences of) core ops. I’ve read hints that full divide (or was it just floating-point divides) that were probably not going to be Ops in their own right, but emulated via substitution of divide-helper functions, that were expected to be in the non-emulated core. I know the divide-helper instructions are not yet filed. So I’m hoping the Wiki will soon (hope!) give us some insight into the general core vs. emulated operations, if not divide() itself.
If possible after considering NYF, I’d like to see a bit more detail about the FU and slot population on at least one member, probably Gold
I am careful not to announce things; I am not up to speed on the intricate details of filings (fortunately for me). But in the bigger scheme of things, given this next bit, perhaps tracking what Gold looks like today will only hint at tomorrow’s mix.
My hope is that info on this evolution, and motivation for the changes will IMHO give readers some insight about Mill performance and the sorts of trade-off that have been made — and can continue to be made — using MillComputing’s specification and simulation tools.
Yes, although what I really look forward to is you being able to run programs on our simulators for real. I hope we can get that going, hand in hand with public disclosure of the ISA details etc. It obviously must not be allowed to take time from the critical path such as producing hardware, but if we can somehow gain benefit e.g. tool chain testing and defect finding from it, there may be a good cost benefit balance to the exercise.
One thing I’d like to see in the Wiki is a description of the instructions and any constraints they impose
We are working on exactly this. This can be auto-generated from our specification system, and we have someone working towards documenting this. When it can be published on our new public wiki may be held back by NYF considerations, but we are always very keen to get information out sooner rather than later now we’ve started being able to talk about things at all.
Theoretically, the Mill could have a very few operations as its core ISA, with a larger subset emulated with series of the core operation.
Yes! Absolutely.
And the mix of FUs, latencies and dimensions on any particular core can be determined in simulation benchmarking for representative workloads long before HW gets built. We have to be data-driven, not hunch-driven.
Where did pick() operations– and the slots that implement then — go when Gold’s spec changed?
The pick block was folded into the writer block for encoding purposes, although the pick ops could have been put in any of the exu-side blocks, and might be moved again for entropy balancing.
Which slots on the current Gold Mill can pick operations now be encoded and executed?
The only block whose content is dictated by machine timing is reader block, which decodes a cycle before the others. Reader phase ops, which execute a cycle (or two) before the others, must be in reader block to get time for dispatch. Block assignment of the other ops is essentially arbitrary as far as execution, and the only consideration is code compactness and simplicity of the decoder matrices. As currently organized, any writer block slot on any member supports pick.
I’d like to see a bit more detail about the FU and slot population on at least one member, probably Gold, since some detail on it has been revealed — and the design is evolving.
Well, you asked for it From the file members/src/specGold.cc:
c->exuSlots = newRow
(8);
c->exuSlots[0] < < aluFU << countFU << mulFU << shiftFU <<
shuffleFU;
c->exuSlots[1] < < aluFU << mulFU << shiftFU;
c->exuSlots[2] < < aluFU << mulFU;
c->exuSlots[3] < < aluFU << mulFU;
c->exuSlots[4] < < aluFU;
c->exuSlots[5] < < aluFU;
c->exuSlots[6] < < aluFU;
c->exuSlots[7] < < aluFU;
c->exuSlots[0] < < bfpFU << bfpmFU;
c->exuSlots[1] < < bfpFU << bfpmFU;
c->exuSlots[2] < < bfpFU << bfpmFU;
c->exuSlots[3] < < bfpFU << bfpmFU;
c->flowSlots = newRow(8);
c->flowSlots[0] < < cacheFU << conFU << conformFU <<
controlFU << lsFU << lsbFU << miscFU;
c->flowSlots[1] < < conFU << conformFU << controlFU <<
lsFU << lsbFU << miscFU;
c->flowSlots[2] < < conFU << conformFU << controlFU <<
lsFU << lsbFU ;
c->flowSlots[3] < < conFU << conformFU << controlFU <<
lsFU << lsbFU ;
c->flowSlots[4] < < conFU << lsFU << lsbFU ;
c->flowSlots[5] < < conFU << lsFU << lsbFU ;
c->flowSlots[6] < < conFU << lsFU << lsbFU ;
c->flowSlots[7] < < conFU << lsFU << lsbFU ;Remember: this is an untuned dummy specification.
I know the divide-helper instructions are not yet filed. So I’m hoping the Wiki will soon (hope!) give us some insight into the general core ops vs. emulated operations, if not the details on divide() itself.
The divide helper is rdiv, the reciprocal approximation op. Before you ask, there is also rroot, the square root approximation helper, too. Emulation sequences for both div and sqrt are being worked on, along with the FP and quad integer emulation sequences.
Ivan,
Thank you for revealing the above info about the FU population in the current Mill Gold spec. I understand that it’s subject to change.
I take it that an lsFU is a load/store functional unit. An lsbFU sounds like a close relative of load/store, but what does the b in lsbFU stand for?
—-
With the exception of floating point on the exu side, it looks like slot zero on both sides can perform any operation that any higher slot on the same side can do. Is there an underlying architectural reason for that? (Other than sim indicating that doing so helps performance?)
lsbFU is a load/store for binary floating point, which converts to and from internal representation. There is also a lsdFU, but Gold doesn’t support decimal in native.
Most machines have a distinct internal representation for FP and convert on load and store. We had one that kept the number denormalized, but it turned out not to work. The current simplified form does (apparently) work, but it’s not clear that the gain is worth the complication, and lsbFU may go away later.
Packing slot zero avoids no-ops when you want an op that exists only in later slots. Of course then, if you need two slot-zero-only ops then you’ll need two instructions. Subject to (lots of) tuning.
More than two retires of latency-N ops per cycle, how do the crossbars handle this case?
In the belt lecture, in the section on data forwarding (slides 50 and 51, around 42 minutes into the talk), the latency-N crossbar is shown with only two outputs, both feeding into the latency-1 crossbar. In the accompanying description, Ivan says words to the effect that:
everybody else [results from lat-N ops] goes through a separate crossbar…. that winnows all those inputs down to two inputs , which are in turn routed to the latency-one crossbar….
If the (only?) outputs of the lat-N crossbar are the two winnowed values that feed from the lat-N crossbar into the Lat-1 crossbar, how can the results of more than two lat-N operations retiring in cycle N be forwarded for use by ops in cycle N+1? (Or even make it to the belt itself?) Since family members such as gold have FU populations that can execute many more than two latency-N operations per instruction, I suspect there must be a way for the rest of those lat-N results to be forwarded, although exactly how isn’t clear to me.
I’ll suppress my urge to speculate on how this is/could be done, and hope that the answer doesn’t require stalls.
A full crossbar connects every source to every sink simultaneously. The Mill contains a full crossbar, and indeed each sing can obtain an operand from any source, even all sinks from a single source. On larger Mill the crossbar is whopping big, although smaller than similar structure on OOO due to having fewer sources because there are no rename registers. There is no stall.
The time to transit a crossbar depends on the fanout, i.e. the number of possible sources that can feed a sink; this is a natural consequence of the mux tree implementation. This latency has a significant impact on cycle time, which must include a crossbar transit. Uniquely, the Mill splits the crossbar into a small crossbar (the fastpath) which contains only a small fraction of the total sources, and a big one (the slow path) that contains the rest of the sources. The fastpath crossbar is sized so as to have no cycle-time impact. The slowpath does have cycle-time impact, or would have if the slowpath had to fit in a cycle. However, the Mill is organized so that everything using the slowpath is already multi-cycle, and we simply accept the added latency of slowpath for those already-slower operations.
Of course, many multi-cycle operations do not fill their last cycle, and can use slowpath without causing another cycle latency. And many don’t quite fit and an op that would be three cycles if it could use fastpath becomes four cycles when it has to use slowpath. However, over the entire workload, letting popular ops use fastpath is a winner.
I just watched this video, and it was very interesting. Thank you.
As mentioned in other threads, a big issue is proper tail calls. It’s not correct to think of proper tail calls as an “optimization” that can be performed by compilers: in general, removing all tail calls from a program is a global transformation, breaking separate compilation. “Tail call optimization” really is a misnomer, it should be called “tail recursion optimization”, because proper tail calls are not necessarily recursive, and tail recursion optimization and proper tail calls can be orthogonal concerns.
Proper tail calls can be compiled onto most instruction sets just fine; the real issue is that popular compilers do not do this. So, compiling languages such as Scheme, ML, Haskell, F#, or .NET bytecode down to C (in order to avoid implementing a native code generator) or the JVM or JavaScript (because you have no choice given your deployment targets) involves some pretty monstrous kludges and/or some amount of interpretation remaining in the generated code. But compiling them down to machine code really isn’t an issue.
(As an aside, unlike some remarks in other threads, I wouldn’t worry one bit about closures. We know how to deal with closures just fine, thank you. I haven’t been excited by various efforts to implement them in hardware, unlike the Mill.)
Ideally, the Mill CPU would offer a “callret” instruction, that would be just like the current “call” instruction except that the current belt and scratch pad would be discarded before the call. When (if) the called function returns, it’s results would be dumped onto whatever belt the calling function would have returned to.
As I understand it, the Mill can almost do tail calls: simply shuffle the belt and then jump. The two problems with this, though, that the called function would not necessarily have enough scratchpad space allocated to perform its work, and that the called function would be able to read non-parameter values left in the caller’s belt and scratchpad.
The second issue certainly does raise the specter of security issues via information leakage; I don’t know enough about the Mill CPU’s security model and potential applications to offer an educated guess about the practical relevance of this issue.
The first issue, however, would certainly be a pretty fundamental stumbling block for proper tail calls if scratchpad space cannot be re-allocated, and a practical stumbling block if the initial scratchpad allocation instruction doesn’t also serve as a scratchpad reallocation instruction. After all, the called function should have no idea if it was called normally or called via a tail call.
Practical applications of tail calls include loops that span compilation boundaries, big Gordian knots of mutual recursion (as often found in lexers, parsers, and state machines), preserving object oriented abstractions, and supporting the commercially-relevant .NET bytecode instruction set.
It would indeed be nice for the Mill to solve another old problem (proper tail-calls).
In 2000, there was a discussion about adding tail-calls to gcc.At the time, the issues seemed to be (i) the mismatch of stack frame sizes in the general case and (ii) different context registers for calls across compilation unit boundaries.
Would these issues still be a problem on the Mill? IIRC in some contexts there can be a callee-accessible part of the scratchpad provided by the original caller. Would that still be accessible for the new callee? And in the case of portal calls, return appears to revoke all grants. I suspect this behaviour would be inappropriate for tail-calls, wouldn’t it?
As pointed out by lpsmith in #1552 above, the Mill does simplify tail-call considerably. At the same time it introduces other issues that other machines don’t have. Here’s a rough-out of what comes to mind; we won’t actually do any of this until we have a complete tool chain to practice on.
1) Mill makes no difference between calls within and calls across compilation unit boundaries.
2) Mill does distinguish between calls within turf and portal calls across turfs. The latter must be secure in both directions, and argument passing of memory-resident arguments is complex. It will be impossible to support portal tail calls because the spiller needs a spill frame to hold the info to return to the calling context. It might be possible to skip an intermediate spill frame if there were immediately nested portal calls, but this case is too specialized to be worth doing anything about. However, the calling code will not have to know whether it is making a portal call or not; the only consequence will be that the calling frame will not be cut back until the portal call returns.
3) Operands on the local belt are identified by a nesting number, similar to a frame number except Mill functions do not necessarily have a frame. A call creates a new nesting number, which is how the belt is emptied in one cycle but is still there after the return. A tail call cannot reuse the nesting number of its caller because that would give it the caller’s belt rather than an empty belt, so a tail cal must also get a new nesting number. However, if tail calls are statically identified by an op distinct from normal calls, the hardware can save the caller’s nest number and return to that rather than to one less than the callee’s number. The nest number is small and wraps around, so there would have to be some arm-waving in the spiller to do the cut-back, but there seems nothing insurmountable.
4) The scratchpad and stack frame (if present) can be cut back as part of the special op. The called function would have to do a normal scratch/frame allocation if it needed to, but it would always be doing that anyway.
5) I don’t see any way to support in-memory arguments (VARARGS or large objects passed by value). In a true tail call these would be on top of the caller’s frame, and hence likely on top of the actual argument that is trying to be passed. A compiler could in principal move everything out of the way, cut the frame, and copy back the actual arguments, but I don’t see any actual compiler ever going to the trouble. Nor am I aware of any tail-call implementation that handles this case; please reply with citation if you know of one.
5) All this addresses true calls. Compilers can turn tail calls into loops in the generated code as well on the Mill as any machine.
6) All the implementation issues appear to be in the spiller in the Mill; it has to know that there is a missing spill frame. In turn, that means that it has to be told, and on a Mill that means a distinct operation. Because the internals of the spiller are per-member and not visible to the application, the implementation of tail calls is likely also to be per-member. In particular the spiller in a multicore Mill is quite different from that of a monocore, to avoid race conditions and coherence overhead.
2) My understanding is that .NET has a similar limitation. This limitation may not be strictly necessary, the paper A tail-recursive machine with stack inspection (non-paywalled link) may be relevant here.
3) It’s not strictly necessary to get rid of the execution context immediately. The important thing is that the retained context for all tail calls is bounded by a (hopefully small) constant. Perhaps it would be simpler to mark a sequence number as a tail call, and then have the spiller throw away the belt when it would have been otherwise spilled? This might not be an ideal implementation, as it would increase the pressure on the on-chip storage, but it should still be quite usable, and would probably be a more than adequate first implementation.
5a) Scheme supports varargs and mandates proper tail calls. Chez Scheme, Ikarus, Racket, Stalin, and/or Larceny might do what you describe, but I don’t know enough about any of these implementations on that particular count. In any case, it’s probably not important.
5b) Yup, even with a callret instruction, there would still likely be some fairly obvious reasons to perform tail recursion optimization on the Mill. Even some Scheme compilers have TRO, though they all also implement proper tail calls. If TRO was all there really was to proper tail calls, they wouldn’t be that interesting or significantly increase the expressive power of the language. But there’s a lot more to proper tail calls than TRO.
2) Well, I waded through the paper you cite. Once the obfuscation is removed, their method appears to be to accumulate the active set of permissions at the top frame, adding to that set with each tailcall as necessary. This is different than keeping each added permission in the stack at the level of its grant, and then searching down the stack for a relevant permission when needed. That having a mutable current set at the top removes permission-stack-overflow is obvious, but it is necessary to wrap the observation in a few lambdas to achieve publication 🙂
The approach is not really relevant to the Mill, because it assumes that permissions can be granted piecemeal locally, as in an access control list (ACL) threaded through the stack. Grants are piecemeal on the Mill, but are global and heap-like. No stack search is required; instead the hardware maintains a persistent table accessed through the PLB. This table does not itself change as a result of call, and so does not inhibit TCO.
However, on the Mill packages of permissions are grouped into named turfs as an implementation optimization, and a program has the rights of a single turf at any one time, in addition to any global grants it may have acquired. The current turf is local to a frame, and hardware admits changing the current turf as a consequence of call (and restoration of the prior turf on return). Because turf is a local right, it potentially interferes with TCO, and in practice the hardware (the spiller) maintains information in each spiller frame sufficient to restore the turf context of the caller. It is this data structure that would inhibit TCO across portal calls.
To admit TCO across portals, it is necessary to overwrite the current top spiller frame with new information reflecting the new execution context, without overwriting the restoration information. This is trivial to do in the abstract, but hardware is “interesting” in the concrete. For example, what happens if the spiller has the structure half-overwritten when it gets an ECC error on a memory access? These issues are non-trivial, and in software would be addressed by Read/Copy/Update methods, but hardware doesn’t do RCU very well.
So at this point the most I can say is that I think I understand the issue; we can TCO anywhere that Java and .net can; and we might be able to TCO across portals but I don’t guarantee it.
3) Belt semantics is more than belt frames. For example, nothing stops a program from starting a MULF and then doing a tail call while the operation is in flight in the multiplier. Yes, it will be thrown away, but there’s no way to reach inside the multiplier and stop the result from coming out. We don’t want that in-flight value dropping on the called function’s belt, and the frame it should be dropping to is gone, courtesy TCO.
The problem with the belt and TCO is that we don’t want to complicate the engine in a way that everyone pays for, just to provide something of theoretical interest only (TCO over portals) or lacking a significant customer base (TCO in general). I think the spiller frame-ID problem will fall out of the spiller implementation if we are careful, which will give us regular TCO. I’m less confident about portal TCO.
5a) VARARGs etc will have to be passed on the heap or copied in such a way that they don’t overwrite themselves. Either way it’s the province of the compiler, and nothing the hardware can do.
Say frame 17 calls frame 18, which does a MULF but returns to 17 before the MULF finishes. When the result pops out of the multiplier it is labeled as belonging to 18, known to be exited, and is discarded.
If 18 does a tail-call instead of a return, the called function just replaces frame 18, it doesn’t add a new frame 19 the way a regular call would. That way it can use the return linkages of 18 when it finally returns back to 17. But now the MULF comes out looking like it belongs to the current 18 (the tail-called function), but it doesn’t. Oops.
The structures to solve this are obvious, but add cost to everything, not just tail calls.
Seems like that particular problem could be simply dealt with by the compiler. Since the call is in the tail position it doesn’t make sense to have any operations in flight. Well… it may be different for multiple conditional calls with some speculation. Still compiler could take care of that. Some might be unsatisfied with this solution – you could break normal functions by doing MULF and tailcalling them. What are security implications of this?
What are security implications of this?
Exactly. Nothing “enforced by the compiler” can be called “security”; it is at most a sanity check and aid to debugging. Compilers have bugs and Bad Guys have assemblers.
Ivan,
Would you kindly clarify some vocabulary in this discussion, specifically between frames and nesting numbers? You wrote:
Say frame 17 calls frame 18, which does a MULF but returns to 17 before the MULF finishes. When the result pops out of the multiplier it is labeled as belonging to 18, known to be exited, and is discarded.
In the above, are those (stack) frame numbers or (belt) nesting numbers? My understanding was that true calls on the Mill (not inlined or TCO’ed away) *always* get a new nesting number, and that calls may create a new stack frame, but don’t have to. So would you please clarify which you’re talking about in this context?
Thanks!
Sorry for the confusion. In fact those are *both* belt nesting numbers and stack frame numbers!
The problem is that internally (where we who work on the Mill live) each thread actually has two stacks that are disjoint in the address space, whereas a conventional has only one. One of the two Mill stacks is the data stack, rooted in the initial stacklet, and containing content that is addressable by the application. Logically each call adds a new data stack frame, but a frame may be empty and have no content.
The second stack is the spiller stack, rooted in the initial spillet. It’s content is not application-addressable (debuggers etc. go through a special library to get at it). Each call adds a new spiller stack frame, which is never empty.
So those numbers are both belt nesting numbers and spiller stack frame numbers, which are in fact the same thing. And they are also data stack frame numbers, but some data stack frames are invisibly thin.
After reading the recent posts, I have three questions:
- Say aaa (frame 17) calls bbb (frame 18) which launches a MUL and returns back to aaa (without any results values), then aaa immediately calls ccc (frame xx). As far as I understand, xx will not be 18 again because otherwise the MUL result would erroneously end up on ccc’s belt. Ideally a tail-call from bbb to ccc would try to closely mirror the above situation, wouldn’t it?
- Apropos security: is there anything to prevent the Bad Guy from creating a fake additional function entry point which launches a MUL and then simply *jump*s to the initial EBB of the target function, with the MUL still in flight?
- Would it simplify the handling of the situation if any unexpected in-flight values are not silently discarded but instead cause an processor fault? The underlying reasoning would be this:
- When bbb issues the MUL, it promises to handle the result N cycles later.
- When bbb tail-calls ccc, it basically asks for substituting its parent aaa for itself when it comes to handling any future values appearing on the belt (“continuation passing”).
- The return values of ccc will thus end up on aaa’s belt, provided their count is correct. If the return value count is incorrect, this will cause a fault.
- When the MUL result appears, it ends up on aaa’s belt as well, following the above reasoning. As this violates aaa’s expectations, it should cause a fault as well.
As some programming style relies on infinitely passing continuations around, the full power of tail-calls is realized only if the machine does not run out of any resources from an unlimited number of tail-calls. If a spiller frame is never empty and a new spiller frame is required even for a tail-call, then this ideal target probably cannot be achieved, I fear.
1. Yes we make sure that results don’t end up in the wrong frame and that results that are in-flight when a function returns are discarded. How this happens is implementation-specific; there are several approaches that could be used.
2. If the Bad Guy can create ‘fake’ functions etc then its already game-over 🙁 The Mill contains novel hardware protection mechanism to prevent vulnerabilities being exploitable. More in the Security Talk.
3. I am hopeful we can have real tail-calls on the Mill 🙂
” is there anything to prevent the Bad Guy from creating a fake additional function entry point which launches a MUL and then simply *jump*s to the initial EBB of the target function, with the MUL still in flight?”
If the target function is a properly set up service, I understand Mill security would only allow it to be called through a portal as a subroutine, not jumped to. If the function could be jumped to the code would still run in the attacker’s security domain, unable to access anything the attacker couldn’t get to by normal means. A program messing up stuff in it’s own security domain can’t realisticaly be prevented by operating system and hardware.
Spot on Peter 🙂
We go to new lengths to prevent vulnerabilities in your programs turning into exploits too. With the hardware-managed call stack and such it would be very difficult for an attacker to make your program unwittingly set up functions to call others at runtime (e.g. return-oriented-programming) too. More in the Security Talk.
It was indicated in the compiler talk (which was really interesting to me), the belt is primarily defined by the number of available spill registers. As indicated in your previous talks, most data are used exactly once.
That leads me to think that conceptually, it would make sense to, like a stack-machine, remove the operands used in an instruction from the belt.
For the example of the compiler talk, (a – (b + c)) & ((b + c) * d)), the operations look like (from the video, 52:35)
(belt) a b c d
add b2 b1
(belt) a b c d +
mul b0 b1
nop
sup b4 b0
(belt) a b c d + – *
and b1 b0
(belt) a b c d + – * &
retn b0In case of operations that consume its operands, it seems you would wind up with:
(belt) a b c d
add b2 b1
(belt) a d +
and b0 b0 mult b0 b1
(belt) a +
nop
sub b0 b1
(retire mul)
(belt) – *
and b0 b1
(belt) &
retn b0Conceptually, this seems like a low cost for a large gain, avoiding most spill/fill pairs, possibly saving quite a bit of scratch space. I could imagine this would complicate the spiller though, as “pushing the belt forwards” seems simpler than keeping track of which positions have to be dropped and adjusting belt positions for that.
Has such a scenario ever been considered?
Often what makes sense from a software point of view makes no sense from the hardware. And vice versa; the Mill design reflects years of work balancing those forces. Here conceptually one could indeed squeeze out belt positions, and thereby cut belt pressure and the need for longer belts and/or more scratchpad traffic.
We have in fact considered self-deleting belt operands, and even more bizarre ideas than that 🙂 Here the problem is hardware: removing the holes requires rather complex circuitry; the complexity increases super-linearly with longer belts; and it has to be done in the D2 logical-to-physical belt mapping stage, which is already clock-critical. The Mill now simply increments all the logical numbers by the number of new drops. The cost of that grows roughly as the log of the length of the belt (fanning out a signal costs more with increasing destinations, even though the increment itself is constant), and we expect that cost will be the major limiter for larger Mill members than Gold.
One might think that the Mill already has logic to reorder the belt (for conform and call), and that logic could be used to remove the holes. However, those ops contain the new mapping to use as a literal, precomputed by the compiler, which can be simply dropped in place in the D2 mapper; they don’t have to figure anything out, as is required by a hole compressor.
So your idea is sound from an program-model view, but runs foul of circuit realities. I’m a software guy, and a great many of my ideas have hit the wastebasket for the same reason. It’s only fair though; I have shot down my share of the ideas the hardware guys have come up with 🙂
The Belt vs Multi-Accumulator
What is the advantage of the belt abstraction over a a multi-accumulator abstraction? In particular, why not treat each functional unit as an independent accumulator machine.
As noted in some of the talks, only 14% of results are used more than once.
The belt ISA (register) abstraction provides a bit-wise efficient encoding, since it eliminates the destination register. However, a multi-accumulator model requires (typically) even less – just one argument per operation, which can source values from other functional units. Basically the compressed instruction is a bitmap of active functional units, and the 2nd (+?) value for each operation is explicitly provided in the variable-length instruction trailer.
This architectural design seems orthogonal to other aspects – memory management, load/store stations, scratchpad, spill mechanism, etc.
Admittedly the belt abstraction provides a neat fn call abstraction. On the other hand, a callee-save instruction could efficiently push accumulators towards the spiller, and a ‘call’ instruction could efficiently marshal parameters from the accumulators.
A natural and maybe pragmatic extension is to provide multiple accumulators per functional pipeline – somewhere between a full register bank and a true multi-accumulator model.
Per the comment in talk #3 about extending the backless memory model to the heap: this might be critical for JVM middleware – typically the stack only contains pointers into the heap, and thread-local eden space is treated effectively like the traditional stack in C(++).
- This reply was modified 7 years ago by rolandpj.
A multi-accumulator (MA), like the belt, has the advantage that the encoding does not need to specify a destination. It can also often avoid specifying one source, but pays for that with move operations to load the accumulator. You suggest using each FU as an accumulator machine, which is fine for ALUs but some FUs have no result to accumulate. In addition encoding would like a power-of-two number of addresses, while FUs are rarely provisioned in power-of-two numbers. Lastly, even a use-once may still need to be saved: (a OP1 b) OP2 (c OP3 d) where all ops use the same FU. All this can be worked out, but which approach would have the better entropy in real codes would depend on details of the encodings and the workload.
A larger issue is hazards: the belt is SSA and there are no W-W, R-W, or W-R hazards, while an accumulator is updatable and so someone must deal with hazards.
Because the Mill transiently hold results at the FU outputs one can see it as a kind of MA in terms of routing. It is not MA encoded; we are willing to pay the cost of the address remapping to avoid hazards and have efficient spill.
My concern was not so much entropy of encoding, as hardware efficiency, but both are interesting.
The belt on high-end (30+ ops/cycle) operates as a massively multi-port register file – particularly if all operations are allowed to access all belt positions. This is true (I think) no matter whether it’s really implemented as a CAM, as shadow register files, etc. The talks allude to some combination of the above but I am reading between a whole lot of lines, no pun intended. From the talks, for example, you do actually intend to have FU-local shift register banks, and the mapping to the belt abstraction is a hardware problem(!).
The belt abstraction is useful, indeed, for steady-state implementation of software pipelining, and you have extended the concept of intrinsically circular rotating buffers into your ‘scratch-pad’ – which can be seen as a local/remote register file concept (or internal/external register). In short, the belt abstraction is awesome for compiler writers, which is a nice reaction to RISC, and incorporates a lot of the advantages of a stack abstraction – encoding entropy in particular.
I don’t really know what I’m talking about, but I there are so many interesting aspects of the design, most of which are old ideas, maybe yours (tagged typing of local registers, entropy encoding of intent, virtual memory translation behind the cache, blah blah). I am not aware of a hardware abstraction that is a use-it-or-lose-it register file (the belt) – it’s certainly a standard software device. The other aspect that I haven’t seen before, with little conviction, is single-instruction phasing – i.e. instruction encoding that pragmatically crosses hardware pipeline boundaries – however, I’m not sure how generally useful that is, beyond efficient encoding of short subroutines (which the compiler should inline, no?).
Regarding a general-purpose belt, vs. local specialisation. Most floating-point computations are really logically distinct from integer computations. Most branch predicate results are independent from real integer results (even though they are often flag-setting variants of the same silicon area). Most vector operations are very different from integer operations, particularly when you get ridiculously wide – 512 bits. Why would you carry them on the same belt (particularly unusually bit-wide operations)? The answer, I guess, is that the belt is an abstraction, but I think there is entropy opportunity there too.
I am fascinated. When do we see silicon?
- This reply was modified 7 years ago by rolandpj. Reason: More blah
I don’t seem to be able to edit, but more blah.
My suspicion, as an ex-compiler guy and now middle-ware corporate, and occasional wannabe hardware dude, is that single-threaded performance requires a few things. Obviously massive op/instruction. But more importantly, collapsing of local branches to enable that. Without that, ILP of general-purpose code is branch-limited. So, why not have predicated execution of operations in an instruction, as a bit-mask, as an extension of single-instruction predication (per 32-bit ARM)? The masked-out instructions produce None, or are NOPs. For wider ops/instruction, allow multiple masks.
I share the same cynicism that some of the VLIW audience did at one of your talks – namely, there just isn’t enough apparent ILP available. We have to waste speculative energy to change that.
For JVM middleware, most compiler speculation is around virtual method calling versus concrete run-time types. You can unwind that, just like C++ in the JIT, to predict virtual call targets, and/or you can specialize for known concrete run-time types. It kinda looks like branches in the JIT code. In addition, most commercial (managed environment) software is now strictly pointer-bound. Unlike carefully-crafted C(++), there is no language support for stack-local structures. Apparent load/store activity is huge. Garbage collection is important, and it’s unclear exactly what hardware facilities would be useful. Local allocation areas (Eden space) are organised as stacks, independent of the conceptual call-stack, but promotion out of Eden needs to be handled.
I guess I’m saying that SpecInt is a thing, g++ performance is a thing, but what we really need is support for what is running most big corporate workloads. It’s managed environments (JVM/Microsoft). I have written some JIT’s in my distant past, but I am rusty as all hell.
I suspect, though, that the ideal target architecture is different from what might be expected from SpecInt. (And yes, efficient multi-threading is part of that, but doesn’t capture the dynamic CPU demand).
You wander a bit, but I’ll try to address some of your points.
The belt is not equivalent to a multi-port register file because there’s no update; everything is SSA, so if anything it acts like a RF with only read ports and no write ports. There is an all-source-to-all-sink distribution problem, conceptually akin to a crossbar matrix, but one of the patents explains how we can do 30X40 and have it time like 8×20. However, the distribution is still the scaling limit for the belt.
Phasing is valuable in open code because of the shape of the typical dataflow graph: a few long chains, with local short business at each node of the long chain. We’re not often executing two long chains simultaneously, but we are often doing several bush pieces at once and feeding that to the long chain. The bush fragments tend to be short (1-3 ops) and to fall naturally into the three cycles given by phasing, so we can overlap the later bush phases of this long-chain node with the earlier phases of the bush of the next node. It sounds a bit like breadth-first search, but the presence of the long chains means that the placement algorithm is different.
We get long open code because we do a lot of speculative execution. It’s a little like trace scheduling.
Some years ago we had vectors ultra-wide operands (UWOs). UWOs make sense on a machine with a dedicated UW RF, but not on the belt where scalars are mixed in, so we replaced UWOs with our skyline vectors.
About seeing silicon: go to https://millcomputing.com/investor-list/
Predication: we have predicated forms of (nearly) all the operations that cannot be speculated, notably all control flow and stores. There’s no point in wasting entropy and compiler monkey-puzzle on the rest.
Managed systems: we have some gc-centric facilities, We also expect that a Mill JIT would take advantage of the regularity and simplicity of the ISA to generate to the machine rather than to a memory-to-memory legacy abstraction. The big Mill wins in a managed environment will not come from the execution core however; it will come from the reduced memory traffic, cheap coherence, and ultra-fast task switch.
A couple of questions/thoughts.
What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?
Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn’t you just have a FU with multiple output registers feeding into the latency-1 tree? I’m not able to visualize what makes it difficult.
For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me like either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping, because that’s the only thing I can think of that the second-level mux tree would have over the one-cycle tree.
Try and use one question per post. It’s easier for the reply and the readers.
What are the advantages of using the belt for addressing over direct access to the output registers? Is this purely an instruction density thing?
What’s an output register?
Why does the split mux tree design prevent you from specifying latency-1 instructions with multiple drops? Couldn’t you just have a FU with multiple output registers feeding into the latency-1 tree? I’m not able to visualize what makes it difficult.
Hardware fanout and clock constraints. Lat-1 is clock critical and the number of sources (and the muxes for them) add latency to the lat-1 FUs. Lettng lat-1s drop two results would double the number of sources for the belt crossbar, and it’s not worth it. Lat-2 and up have much more clock room.
For that matter, how does the second-level mux tree know how to route if the one-cycle mux tree only knows a cycle in advance? It seemed to me like either both mux trees would be able to see enough to route the info, or neither would. Does this have to do with the logical-to-physical belt mapping, because that’s the only thing I can think of that the second-level mux tree would have over the one-cycle tree.
There’s no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we’d need another pipe cycle in there to keep the clock rate up.
Sorry. Anyway, I meant like instead of an entire belt, the code specifies the sources based on the latency 1/2/3/4 results of each FU, plus the actual spiller registers. Since a specializer can know statically where the spiller puts overwritten results, that’s not a problem in conASM. Is belt renaming that important?
The specializer doesn’t know where the spiller puts things, because it cannot predict the dynamic control flow and hence the belt demand. As soon as there is a trap/interrupt or the main program goes through a function or label pointer, or even a conditional branch, well, that’s all she wrote.
There’s no problem with the routing itself; everything is known for all latencies when the L2P mapping is done. The problem is in the physical realization of that routing in minimal time. A one-level tree would have so many sources that we’d need another pipe cycle in there to keep the clock rate up.
So, to clarify: the problem is not that the routing information isn’t known far enough in advance, but that the results for the one-cycle latency ops don’t exist that far in advance? And anything specified as latency 2 or more will exist a full cycle in advance?
Not quite. Everything is known well in advance, and the time for the actual computation may be the same for one and two res ops. However, after a result is computed it must be routed from the producer to all the consumers it will have in the next cycle. That route time is independent of which op produced the result. The route time is largely dependent on the number of sources, i.e. the number of distinct operands that can be dropped in each cycle, which is the fan-in to the routing crossbar.
We want to keep the routing time low, because every operation pays that cost. So we want few belt sources, and need to move low-value ops out of the way. We do that with the two-stage crossbar, and relegate all multi-result ops and all non-lat-1 ops to the second stage. There may be a two-res op that is fast enough that its result is available as fast as a simple add. But putting those two results on the fastpath would add a cost to every other fastpath op, including critical things like add.
None we know. If we find one, or any in which the two-result op is on the program’s critical path even if there are other one-res ops, then it’s easy for us to put the second into the fastpath too. And pay the price.
- AuthorPosts
You must be logged in to reply to this topic.