Forum Replies Created
- in reply to: new CALL opcode variants (suggestion) #711
I very much agree that there is a real need for better support for call-unwinding on exception conditions. We actually had a facility several years ago that was somewhat like what you suggest. However, there were problems.
One problem was that such a convention requires the callee to know details about the caller’s expected protocol, breaking function isolation. Alternatively, we could define a kind of universal protocol, but that would have costs for calls that didn’t need the full generality. There were encoding issues too – remember that a single Mill instruction can contain several calls, and what with needing to represent the target address and the argument list, adding another whole address did not fit well. However, what finally caused us to drop the idea was the realization that we didn’t need a call variant anyway.
The chosen resolution takes advantage of the ability of Mill calls to return more than one result, cheaply. That is, your myFunction is directly expressible in Mill hardware. So the cost of the semantics is the branch to test the error condition. With phasing (you did see the Execution talk?) the branch can be in the same instruction as the call and still see the result of the call as its predicate. The branch operation naturally carries the error-handling address that would have to be present in the fancy call operation anyway, but does not require any encoding contortions or special call semantics. So making the error handling explicit puts it in the caller (where it belongs on isolation grounds) and has zero latency cost and no power or encoding cost beyond what having the callee do the branch would need anyway.
This also gets us out of the protocol business; protocols should be app- or language-level, not dictated by the hardware. For example, a caller/callee could agree to have several possible post-call continuations reflecting various detected conditions and represented by a returned enum that the caller switches on.
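As a rough illustration of the returned-enum convention described above, here is a minimal sketch in Python rather than Mill code. The names `Status`, `lookup`, and `caller` are hypothetical and stand in for whatever protocol a caller and callee agree on; the point is only that the callee returns two results and the caller owns the dispatch.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical conditions a callee might report alongside its result."""
    OK = 0
    NOT_FOUND = 1
    DENIED = 2


def lookup(table, key, allowed):
    """Callee: returns (status, value) and knows nothing about the caller's handlers."""
    if not allowed:
        return Status.DENIED, None
    if key not in table:
        return Status.NOT_FOUND, None
    return Status.OK, table[key]


def caller(table, key, allowed=True):
    """Caller: tests the returned status and picks its own continuation."""
    status, value = lookup(table, key, allowed)
    if status is Status.OK:
        return f"value={value}"
    if status is Status.NOT_FOUND:
        return "fallback: missing key"
    return "fallback: access denied"
```

The error-handling targets live entirely in `caller`, which is the isolation property argued for above.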
You separately mention the possibility of passing arguments as addresses after the call operation. Besides such an approach not being encodable on a Mill, it also does not work on any modern Harvard-architecture CPU (Harvard has separate i- and d-caches with unique datapaths for each). In general data accesses to code would have to be satisfied from below the join point of the hierarchy, which is at least a painful 10+ cycles away from the CPU. In contrast, having the caller use a LEA operation to drop a pointer onto the belt is a one-cycle operation in the caller, and will almost certainly be overlapped with other operations in the Mill’s wide issue. When the callee dereferences the pointer (if it needs to) then it is quite likely to find the data in the d$1 cache, the latency of which can be hidden using the Mill deferred load facility.
There’s another issue that will be touched on in the upcoming talk on Security and Reliability: the caller and callee may not be in the same security domain (called a Turf in Mill-speak), so the callee may not have rights to access the code of the caller in the first place, even if the address is to a data location that it does have rights to.
We are a long way from the Z80 🙂
There’s not much reason for hardware divide on a machine as wide as the Mill. As your question shows, conventional divide is implemented as a microcode loop that blocks the pipeline, and pipe blockage gets more painful with wider machines.
While divides are defined in the abstract Mill, we do not expect them to be native hardware in any member. Instead, the specializer will substitute an emulation using Newton-Raphson. The emulation will make use of several “divide-helper” ops that will be native. Thus the emulation, while software visible, is doing essentially the same job as the microcode on a conventional machine. The result has the latency of native hardware, but the individual operations can be scheduled intermixed with other ops, including those from other divides.
We expect to use the same strategy for sqrt and transcendentals.
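For readers who haven't seen Newton-Raphson division, here is a minimal floating-point sketch in Python. It is not the Mill's emulation and does not model the divide-helper ops; the function names, the iteration count, and the choice of a commonly used linear initial estimate are all assumptions made for illustration. Each pass through the loop is a couple of multiplies and a subtract, which is the kind of work that can be interleaved with other ops.

```python
import math


def nr_reciprocal(d, iterations=4):
    """Approximate 1/d using the iteration x <- x * (2 - m * x)."""
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if d < 0 else 1.0
    # Scale |d| to m in [0.5, 1) so one initial estimate works for every input:
    # |d| == m * 2**e
    m, e = math.frexp(abs(d))
    # Linear first guess for 1/m on [0.5, 1); its relative error is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    # Each step roughly squares the relative error, so a few steps suffice.
    for _ in range(iterations):
        x = x * (2.0 - m * x)
    # Undo the scaling: 1/|d| == (1/m) * 2**(-e)
    return sign * math.ldexp(x, -e)


def nr_divide(n, d):
    """Divide by multiplying with the approximated reciprocal."""
    return n * nr_reciprocal(d)
```

Note that this gives a close approximation, not a correctly rounded quotient; a real emulation needs a final correction step, which is omitted here.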
- in reply to: Integer vs Pointer in the Mailine LLVM compiler #700
Many thanks. I’ve followed up on that site.
- in reply to: Mill for Storage #650
The Mill is intended to be general purpose, and that certainly includes I/O- and device-heavy applications. The talks to date haven’t said much about such applications, but there are several important features that would be very applicable to them.
That said, we would be the chips, not the solutions vendor. There’s a whole lot of expertise that goes into making those boxes, expertise that we don’t have. But we hope to win a few design-ins in the area.
Congratulations! A remarkably complete summary.
The short answer is that there are no issue constraints in your example. However, because you are issuing loads and expecting immediate retire, you will get several cycles of stall between the load and con instructions; that’s why deferred loads exist (see the Memory talk for details). The data cache cannot get you data instantaneously.
More generally however, the operation set and hardware have almost no issue hazards. “almost” being the operative word here. An example of an issue hazard that does exist is FMA. While you can issue an FMA every cycle in all slots that support it, if you issue an FMA now and then try to issue a plain FP add in the cycle when the multiply part of the FMA is done then both the add and the add part of the FMA will be trying to use the adder hardware in the same cycle; no dice. The hardware throws a fault if you try.
And currently that’s the only case.
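The FMA hazard can be pictured with a toy model in Python. Everything here is an assumption made for illustration: the three-cycle multiply latency, the idea of one adder per slot, and the names `MUL_LATENCY` and `adder_conflicts`. It only shows why back-to-back FMAs are fine while an add issued at the wrong distance after an FMA collides.

```python
MUL_LATENCY = 3  # hypothetical: cycles from FMA issue until its add stage needs the adder


def adder_conflicts(schedule, mul_latency=MUL_LATENCY):
    """Return sorted (cycle, slot) pairs where two ops claim the same adder.

    schedule is a list of (issue_cycle, slot, op), with op "fma" or "fadd".
    """
    claims = {}
    for cycle, slot, op in schedule:
        if op == "fma":
            use_cycle = cycle + mul_latency  # add part runs after the multiply
        elif op == "fadd":
            use_cycle = cycle  # a plain add needs the adder immediately
        else:
            raise ValueError(f"unknown op: {op}")
        claims.setdefault((use_cycle, slot), []).append(op)
    return sorted(key for key, ops in claims.items() if len(ops) > 1)
```

With these assumed numbers, an FMA issued every cycle never conflicts with itself, but an FMA at cycle 0 followed by a plain add at cycle 3 in the same slot does.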
- in reply to: Prediction #697
There are two pairs of counts in a prediction, one pair for each side.
I think you meant:
if (((a==b) && (c==D)) || (d==a))
then that becomes:
brtr(<d==a>, <then>), brfl(<a==b>, <else>), brtr(<c==D>, <then>), br(<else>);
i.e. one instruction if you have four branch units. Note that this code assumes that reordering the arguments of a conjunction or disjunction is permitted, which is true in some languages. If not permitted (as in C), you would use:
brfl(<a==b>, <disj>), brtr(<c==D>, <then>), br(<disj>); disj: brtr(<d==a>, <then>), br(<else>);
This code ensures correct C semantics if one of the predicates is a NaR.
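To check that both branch sequences compute the same result as the source expression, here is a small Python simulation. It assumes that when several branches in one instruction are taken, the first one in order wins; the `run` function, the dictionary layout, and the predicate keys `ab`, `cD`, and `da` are hypothetical. It models only plain true/false predicates, so it says nothing about NaR behavior or evaluation order.

```python
from itertools import product

# Reordered form: one instruction holding four branches.
REORDERED = {
    "start": [("brtr", "da", "then"), ("brfl", "ab", "else"),
              ("brtr", "cD", "then"), ("br", None, "else")],
}

# C-ordered form: two instructions, the second labelled disj.
C_ORDER = {
    "start": [("brfl", "ab", "disj"), ("brtr", "cD", "then"), ("br", None, "disj")],
    "disj": [("brtr", "da", "then"), ("br", None, "else")],
}


def run(program, preds, label="start"):
    """Follow branches until control reaches 'then' or 'else'."""
    while label not in ("then", "else"):
        for op, pred, target in program[label]:
            taken = (op == "br"
                     or (op == "brtr" and preds[pred])
                     or (op == "brfl" and not preds[pred]))
            if taken:  # assumed rule: first taken branch in the instruction wins
                label = target
                break
        else:
            raise RuntimeError("instruction fell through with no branch taken")
    return label


def expected(preds):
    """The source-level condition ((a==b) && (c==D)) || (d==a)."""
    return "then" if (preds["ab"] and preds["cD"]) or preds["da"] else "else"


def all_cases():
    """Every combination of the three predicate values."""
    for ab, cD, da in product([False, True], repeat=3):
        yield {"ab": ab, "cD": cD, "da": da}
```

Running both programs over `all_cases()` and comparing against `expected` covers all eight combinations.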
- This reply was modified 10 years, 11 months ago by Ivan Godard.
The two-argument constraint is on the exu encoding side and the computational operations in general. The encoding optimizes the two-argument case for the slots that encode those operations. The flow side has a completely different encoding because it needs to support large constants (for offsets and the like) and long argument lists.
Call, return, conform, con, rescue, and some NYF ops use the flow big-constant mechanism for arguments, and call, branch, load, and store use it for address offsets.
Yes, except the number 16 is member dependent. One slot is used for the call proper, including its address offset (or function pointer position on the belt), a count of the expected number of results if more than one, and so on. There can be three belt numbers per slot, so any not needed for this are available for arguments. If there are still arguments then they are spread over additional slots, each of which gets the same three belt numbers plus 32 bits of manifest that is also used for belt numbers.
If a belt number is 4 bits (common in members), and the first slot is completely used for administrivia, then the ganged slots add 11 more arguments each (three morsels plus 8×4 bits). However, it is not meaningful to pass more arguments than the callee has belt. A 4-bit belt number means the belt is 16 long, so at worst a call needs two slots for arguments and three in all for the maximal argument list.
Similar reasoning determines the number of slots needed for other belt lengths. We currently have belt lengths of 8 and 32 defined as well as 16; it’s not clear whether a belt length of 64 is ever needed in a practical chip.
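The slot arithmetic can be written out as a short Python sketch. The function names are hypothetical, and the sketch assumes the worst case from the text (the first slot carries no arguments) and that manifest bits which don't make up a whole belt number go unused, which matters only for belt-number widths that don't divide 32 evenly.

```python
MORSELS_PER_SLOT = 3  # belt numbers carried directly in each slot
MANIFEST_BITS = 32    # extra bits per ganged slot, also used for belt numbers


def args_per_ganged_slot(belt_bits):
    """Arguments one ganged slot can name: its morsels plus packed manifest."""
    return MORSELS_PER_SLOT + MANIFEST_BITS // belt_bits


def worst_case_call_slots(belt_bits):
    """Slots for a call that passes as many arguments as the belt is long."""
    belt_length = 1 << belt_bits
    per_slot = args_per_ganged_slot(belt_bits)
    arg_slots = -(-belt_length // per_slot)  # ceiling division
    return 1 + arg_slots  # one slot for the call proper, used up by administrivia
```

For a 4-bit belt number this reproduces the figures above: 11 arguments per ganged slot and three slots in all.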
Ganging is supported on both sides. On the flow side, ganging is used for bulky arguments that don’t fit in the encoding of one slot. A Quad-precision pi constant, for example, needs three slots to supply 128 bits. This is also used for long argument lists for operations that use them, including call.
Yes, pick can select within the result of a shuffle in the same instruction.
And yes, there are vector literals, but restricted ones. The general mechanism is to use the con operation (reader phase) to get a scalar literal, and then the splat operation to make a vector out of it. However, certain popular constants, scalar or vector, are available from a form of the rd operation (reader phase). These are called popCons, short for “popular constants”. The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.
So for zero and None, assuming you have a bool mask to drive the selection, the code would be (for word element size):
rd(wv(0)), shuffle(<data>, <indices>), pick(<mask>, <shuffled>, <rd>);
in one instruction.
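The data flow of that instruction can be mimicked with Python lists standing in for vectors. This is only a sketch of the semantics as described here; the helper names `rd_splat`, `shuffle`, and `pick` are stand-ins, not Mill operations, and phasing and element widths are not modeled.

```python
def rd_splat(value, width):
    """Stand-in for rd of a popular vector constant: every element is value."""
    return [value] * width


def shuffle(data, indices):
    """Element i of the result is data[indices[i]]."""
    return [data[i] for i in indices]


def pick(mask, if_true, if_false):
    """Per-element select: take if_true where the mask is set, else if_false."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def shuffle_then_zero(data, indices, mask):
    """Shuffle data, then replace the elements the mask rejects with zero."""
    zeros = rd_splat(0, len(data))
    shuffled = shuffle(data, indices)
    return pick(mask, shuffled, zeros)
```

For example, `shuffle_then_zero([10, 20, 30, 40], [3, 2, 1, 0], [True, False, True, False])` reverses the vector and then zeroes the second and fourth elements.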