Execution

staff

Talk by Ivan Godard - 2014-02-05
at Stanford University’s Computer Systems Colloquium (EE380).

Slides: execution.02 (.pptx)

Instruction execution on the Mill CPU:

Working at Mach 3

A perennial objection to wide-issue CPU architectures such as VLIWs and the Mill is that there is insufficient instruction level parallelism (ILP) in programs to make effective use of the available functional width. While software pipelining can reveal large quantities of ILP in loops, in open (non-loop) code studies have calculated maximum ILP in the order of two instructions per cycle (IPC), well below the capacity of even conventional VLIWs never mind super-wide architectures such as high-end Mills. The problem is that the program instructions tend to form chains connected by data dependencies, precluding executing them in parallel.

This talk addresses the ILP issue, describing how the Mill is able to achieve much higher IPC even when the nominal ILP is relatively low. The Mill is able to execute as many as six chained dependent operations in a single cycle; open code IPC numbers typically exceed nominal ILP by a factor of three. The talk will show in detail how this is achieved, and why we chose “Mach 3” as the name of the mechanism. In the course of the explanation, the talk will also introduce other operations for which the semantics of Mill execution differs from that of conventional CPUs.

allpowerful32

In the talk you describe how the Mill does tail calls "in hardware, for free". You use the example of F(G(a+b)), such that the two calls happen in one instruction, and that the call to G actually returns directly to F, not to the calling frame.

I'm having a hard time fitting these with the model of tail calls that I'm familiar with. This is where calls in "tail position" (basically, being the last instruction before a return that's either void, or returns the result of that call) are optimized such that the stack frame of the function is destroyed and execution jumps (rather than calling) to the function in tail call position.

This is most visible in recursive functions, where tail call optimization can often turn a function that will quickly hit a stack overflow (i.e. one that uses stack space proportional to some non-constant function of the input) into a function that completes fine, having used a constant amount of stack space.

As always, wikipedia explains it better: http://en.wikipedia.org/wiki/Tail_call.

Could someone help me reconcile these two models? How does the Mill's "tail calls" (forgive the quotes) relate (if at all) to tail call optimization, which is often vitally important in functional languages?

Will_Edwards

True, the analogy doesn't stand up to such scrutiny :)

Your compiler is still responsible for Tail-Call-Optimisation to turn recursion into loops. The CPU does what it's told.

jrfeenst

Does the mill support (arbitrary) vector element swizzling? I'm just wondering if the same functionality that enables free pick might also allow free swizzles. I could see how it might be machine dependent due to different vector sizes.

ivan

"permute" is an operation that supports general vector reorder, including element duplication. There are some specialized forms that do specific permutations at much smaller encoding cost.

ivan

Yes, Mill cascade is not the same as recursive tail-call, although the benefits are somewhat similar.

Will_Edwards

Does the mill support (arbitrary) vector element swizzling?

Yes. There is a shuffle op for arbitrary rearrangements of a vector.

I’m just wondering if the same functionality that enables free pick might also allow free swizzles.

I believe its in the op phase.

I could see how it might be machine dependent due to different vector sizes.

Well, you can always use the remainder op to create a mask that you then pick against with 0s or Nones to create a partially filled vector? This was covered in the strcpy example in the Metadata talk and the Introduction to the Mill CPU Programming Model post.

infogulch

So shuffle can duplicate elements, can it replace elements with 0?

ivan

Nope, nor with None, which would be even more useful. Use "pick" rather than shuffle to insert specific values where needed.

infogulch

Since shuffle is op phase and the pick phase is after the op phase (iirc), could shuffle+pick on a vector happen in a single instruction and on literal 0 or None? (That is, no explicit 0 or None vector needs to take up a belt position.).

I guess that would get what I was asking for, except you'd need a way to generate and park the vector to pick on, unless you can pick on a vector literal, if such a thing exists.

imbecile

Just a quick question about multiple inputs: Are the argument lists of the call operation implemented in terms of the args operation used in ganging, or are they completely different?

My guess would be different, because I suspect the call operation is on the branch/pointer side of the instruction streams and ganging on the computation side. If both sides had both it would be kind of redundant.

baking

Function arguments are specified the same way as for operations except you can have up to 16 of them. They also return any reasonable number of results just like operations. See the first talk on instruction decoding.

The function gets a brand-new belt with all of its arguments loaded in order. I think that was described in more detail in the second talk.

Edit: I'm not sure if I actually answered your question about the mechanics of passing those arguments to functions in the assembly language. I don't know if that has been specifically mentioned.

imbecile

> except you can have up to 16 of them

Yes, those previous talks are the reason I have this question. If you already can supply an instruction with an arbitrary number of inputs up to 16, why do you need a special mechnaism for other instructions to provide more then 2 inputs?

Could just be an encoding issue. The 2 operand case is by far the most common for all instruction except call. The new belt for the callee is invisible to the caller anyway, so from the callers belt perspective it should be irrelevant whether arguments are provided in an argument list or with additional arg-operations. The caller doesn't care that in one case the arguments are combined in one case in a new belt and in the other case in internal data paths. It doesn't see either. All it sees are the eventual results.

ivan

Yes, pick can select within the result of a shuffle in the same instruction.

And yes, there are vector literals, but restricted ones. The general mechanism is to use the con operation (reader phase) to get a scalar literal, and then the splat operation to make a vector out of it. However, certain popular constants, scalar or vector, are available from a form of the rd operation (reader phase). These are called popCons, short for "popular constants". The selection of popCons varies by member, but all include 0, 1, -1, and None of all widths.

So for zero and None, assuming you have a bool mask to drive the selection, the code would be (for word element size):
rd(wv(0)), shuffle(<data>, <indices>), pick(<mask, <shuffled>, <rd>);
in one instruction.

ivan

Ganging is supported on both sides. On the flow side, ganging is used for bulky arguments that don't fit in the encoding of one slot. A Quad-precision pi constant, for example, needs three slots to supply 128 bits. This is also used for long argument lists for operations that use them, including call.

ivan

Yes, except the number 16 is member dependent. One slot is used for the call proper, including its address offset (or function pointer position on the belt), a count of the expected number of results if more than one, and so on. There can be three belt numbers per slot, so any not needed for this are available for arguments. If there are still arguments then they are spread over additional slots, each of which get the same three belt numbers plus 32 bits of manifest that is also used for belt numbers.

If a belt number is 4 bits (common in members), and the first slot is completely used for administrivia, then the ganged slots add 11 more arguments each (three morsels plus 8x4 bits). However, it is not meaningfull to pass more atguments than the callee has belt. A 4-bit belt number means the belt is 16 long, so at worst needs two slots for arguments and three in all for the maximal argument list.

Similar reasoning determines the number of slots needed for other belt lengths. We currently have belt lengths of 8 and 32 defined as well as 16; it's not clear whether a belt length of 64 is ever needed in a practical chip..

ivan

The two-argument constraint is on the exu encoding side and the computational operations in general. The encoding optimizes the two-argum,ent case for the slots that encode those operations. The flow side has a completely different encoding because it needs to support large constants (for offsets and the like) and long argument lists.

Call, return, conform, con, rescue, and some NYF ops uses the flow big-constant mechanism for arguments, and call, branch, load, and store use it for address offsets.

imbecile

Exactly the answer I was looking for :)

tldr; arg-ganging is used to extend the argument list of any instruction that may need it, including call.

JonathanThompson

Having now watched the slideshow after watching the original live broadcast, I'm wondering if my theory is correct: with a 32-long belt, it appears for predicated branches (first true wins) that it's possible to have a 16-way speculative branch, as there's 1 load and 1 branch specifier for each case.

Is this correct?

ivan

The value of N for N-way branches us determined by the number of branch functional units, not the number of operands on the belt. It is quite common for two branches to reference the same condition value, but with opposite sense:
btrt(b0, <then>), brfl(b0, <else>);