Execution

JonathanThompson

Ok, I'm not surprised: I was very hopeful, that's true ;)

A great example where such multiway branches can be very useful can be found in the recursive-descent lexer/parser code generated by ANTLR, as it generates (depending on generation options) lots of switch/case situations, recursively.

Now, I'm wondering: is there a way in a single instruction to encode a rather complex if/then/else where you have multiple values meet some conditions (say if(a==b&&c==D|d=a)? I'd expect it'd take 3 CPU cycles, due to phasing, and the practicality of data to load, etc. but can it be encoded and set off in a single instruction?

Using the ganged operations in instructions, I'm suspecting that could be synthesized, but I'm not sure: that could really make code fly.

ivan

I think you meant:
if((a==b) && (c==D)) ||(d==a))
then that becomes:
brtr(<d==a>, <then>), brfl(<a==b>, <else>), brtr(<c==D>, <then>), br(<else>);
i.e. one instruction if you have four branch units.

Note that this code assumes that reordering the arguments of a conjunction or disjunction is permitted, which is true in some languages. If not permitted (as in C), you would use:

    brfl(<a==b>, <disj>), brtr(<c==D>, <then>), br(<disj>);

disj:

    brtr(<d==a>, <then>), br(<else>);

This code ensures correct C semantics if one of the predicates is a NaR.

baking

Found this useful explanation in the split-stream encoding white paper:

if operations with similar encoding
requirements are assigned to the same stream,
then the bundle or operation formats can be
tailored to each stream, saving instruction space
and likely reducing decoding difficulty. For
example, all operations with an effective-address
parameter (including loads, stores, and branches)
can be assigned to one of the streams, so the logic
to decode an address is not needed in the other
decoder.

PeterH

Is there a summary of opcodes posted somewhere we could reference?

baking

The first talk on decoding instructions explains how instructions are split into two streams (Exucode and Flowcode) with each half-instruction (bundle) composed of a header and three blocks, for a total of six blocks per instruction. The most recent (sixth) talk on execution explains how these blocks are decoded and executed in phases. Together, they create a more complete picture of the Mill instruction set.

In the exucode stream, the three blocks correlate directly to the functional pipelines. (Numbers correspond to the “Gold” member of the Mill family.)

Block 1x: 8 read operations (fill from scratchpad*)
Block 2x: 4 integer or binary floating point (including multiply) operations
and 4 integer only (no multiply) operations
Block 3x: 4 pick** operations
and 5 write operations (spill to scratchpad*)

* The only exucode side read/write operations that we currently know of are the fill and spill scratchpad ops.
** The pickPhase is describe as being between phase 2 and phase 3 but there are numerous reasons to include pick ops in the third block. For one, the pick logic resides in the crossbar that the writePhase ops use to select their inputs. Also, the maximum number of operations in one block is said to be nine.

The flowcode stream is quite different with only 8 dedicated pipelines compared to 25 for the exucode stream. The reason for this is that flowcode operations are much larger due to their in-line address and constant operands so the code size (and cache size) for both streams balances out. Another difference is that the pipelines include functional hardware for multiple phases:

4 pipes that can do either immediate constant, load/store or control transfer (branch, call) operations
4 pipes that can do either immediate constant or load/store operations

Suppose you have an instruction with 8 stores (writePhase), followed by an instruction with 8 loads (opPhase), followed by an instruction with 8 constants (readPhase). Because of the phasing, all 24 operations would hit the 8 pipelines during the last cycle. Would the the separate functional units handle it easily, or are there constraints on the mix of flowcode ops to balance out the phasing?

ivan

Congratulations! A remarkably complete summary.

The short answer is that there are no issue constraints in your example. However, because you are issuing loads and expecting immediate retire then you will get several cycles of stall between the load and con instructions; that's why deferred loads exist, see the Memory talk for details. The data cache cannot get you data instantaneously.

More generally however, the operation set and hardware have almost no issue hazards. "almost" being the operative word here. An example of an issue hazard that does exist is FMA. While you can issue an FMA every cycle in all slots that support it, if you issue an FMA now and then try to issue a plain FP add in the cycle when the multiply part of the FMA is done then both the add and the add part of the FMA will be trying to use the adder hardware in the same cycle; no dice. The hardware throws a fault if you try.

And currently that's the only case.

ivan

Not yet; there are still a few ops that are NYF.

rmmh

The integer/floating point dividers are fully pipelined?

ivan

There's not much reason for hardware divide on a machine as wide as the Mill. As your question shows, conventional divide is implemented as a microcode loop that blocks the pipeline, and pipe blockage gets more painful with wider machines.

While divides are defined in the abstract Ml, we do not expect them to be native hardware in any member. Instead,the specializer will substitute an emulation using Newton-Rapheson. The emulation will make use of several "divide-helper" ops that will be native. Thus the emulation, while software visible, is doing essentially the same job as the microcode on a conventional. The result has the latency of native hardware, but the individual operations can be scheduled intermixed with other ops, including those from other divides.

We expect to use the same strategy for sqrt and transcendentals.

Symmetry

Some questions:

This wouldn't be a Mill architecture, but I wonder if there would be a problem with adding more phases and having different otherwise identical functional units do their work in different phases. For example, maybe read, pick1, op1, call, op2, pick2, write. I suppose that would induce limitations as to how quickly you could schedule instructions back to back, but it would make decoding easier for a given width. Were there other important benefits and limitations I'm not thinking of? Wow much of the selection of the current phasing setup was a result of the way the hardware works, and how much profiling actual code to see what would be useful in practice?

With regard to calls, how many cycles do they normally take in overhead?

What happens if you have an FMA instruction but fail to also issue an Args? Or is it that the existence of an Args in an instruction turns the multiply operation into a FMA operation?

Am I right in guessing that which operations are connected to which in ganging is a result of which slot in the instruction the ops occupy?

Will_Edwards

Answering particular parts of your question where I can:

> Wow much of the selection of the current phasing setup was a result of the way the hardware works, and how much profiling actual code to see what would be useful in practice?

Well, definitely informed by code analysis. There's an upcoming talk on Configuration, and its been described in the talks and in the Hackaday interview.

A central tenet of the Mill is that people building products can prototype custom processor specifications in the simulator very quickly and choose a configuration with the best power performance tradeoff informed by benchmarking representative code.

> With regard to calls, how many cycles do they normally take in overhead?

One. Calls and branches (that are correctly predicted) transfer already in the next cycle. In the case of mis-predictions where the destination is in the instruction cache, again the Mill is unusually fast; it has a penalty of just five or so cycles.

Additionally, there is is none of the conventional pre- and post-ambles.

ivan

This wouldn’t be a Mill architecture, but I wonder if there would be a problem with adding more phases and having different otherwise identical functional units do their work in different phases. For example, maybe read, pick1, op1, call, op2, pick2, write. I suppose that would induce limitations as to how quickly you could schedule instructions back to back, but it would make decoding easier for a given width. Were there other important benefits and limitations I’m not thinking of? Wow much of the selection of the current phasing setup was a result of the way the hardware works, and how much profiling actual code to see what would be useful in practice?

The phase structure is largely dictated by the encoding and the decode process. It takes three cycles to fully decode the extremely wide Mill instructions, but ops in some blocks are completed in each of those cycles and we issue them right way as soon as available; that gives us three phases. The pick op has no cycle, and the call op uses the cycle of the called function, which gives us five phases in all.

We could do as you suggest and have added phases with dedicated FUs. However, the ops for those phases would already have been decoded and would just be waiting for their phase, so why not issue them at once? We could of course go to a five-block encoding, which would take four cycles to decode, requiring another phase. However, three seems to be all that's useful.

Assignment of ops to blocks (and hence phase) is ad hoc but guided by hardware constraints. For example, it is not possible to get the belt bypass set up in parallel with decoding the first (reader) block, so ops in reader phases cannot have belt arguments. Consequently the reader phase can only have pure-source ops or administration ops that do not take a belt argument. Reader phase also cannot have an op taking a large constant (such as the stackf frame allocation op) because the constant would wreck the entropy of the encoding of the other reader ops which are quite small.

ivan

What happens if you have an FMA instruction but fail to also issue an Args? Or is it that the existence of an Args in an instruction turns the multiply operation into a FMA operation?

Am I right in guessing that which operations are connected to which in ganging is a result of which slot in the instruction the ops occupy?

Decode never needs to look at an adjacent slot to disambiguate an operations; needing to do so would complicate and slow down decode. Consequently FMA is a distinct opcode and not a mul-with-baggage.

The sanity check for ganging (at least on the exu side where these ops live) is done during execution. The FMA functional unit knows that it needs arguments from its buddy, but it has the execution cycle to work out whether or know it has a buddy with the right arguments. No buddy causes an invalidOperation fault, even though the decoder proper didn't complain.

Gangs are always encoded in adjacent slots. Otherwise there are encoding issues and added data paths to get everything to the FU involved.

Symmetry

>A central tenet of the Mill is that people building products can prototype custom processor specifications in the simulator very quickly and choose a configuration with the best power performance tradeoff informed by benchmarking representative code.

Yes, I'm really looking forward to seeing what tech stack you use to do that. I gues I'l have to wait for the specialization talk to come out to find out how.

davidm

I was wondering about what happens to FUs that are occupied at the time of a call. Do they complete while the call is executing and simply output their value to an output buffer for the slot (potentially to get saved and restored by the spiller) or is their internal execution state saved and restored?

Suppose you have a mul and a call in the same instruction and that the mul take 3 cycles. With phasing the mul will have completed 1 cycle's worth of work prior to the call. Will the mul complete its work (in two more cycles) during the called function's cycles or would its work be stopped and its internal state be saved (again, to the spiller?) and then restored on return? If the mul completes during the called function then it could show up in the caller's belt prematurely (unless that's handled by a naming scheme?). If the state is saved and restored, then potentially more information would need to be spilled and filled and the mechanism for doing such a thing must be costly in complexity (wires, area, power, etc.).

Will_Edwards

Good question. Its the former, and its done using naming, as you guessed.

Internally, each belt item has a frame id; this is how the CPU ensures the caller's and callee's belts are separate.

Operations mark their output belt items with the frame id their instruction was issued in. The spiller takes care of saving and restoring this in the background.

This was described in the Belt talk, about 30 minutes in.

ivan

To expand a bit on Will's answer, and introduce some terminology for the glossary (you there, Mill-fan-registered-as-"imbecile"?).

CPUs that do not expose the pipeline go to great lengths to support the illusion that instructions are executed one-by-one to completion. They must be able to unwind absolutely any action, completed or not, in the event of an interrupt or other event that lets the program see state. The unwind goes back to just before the first incomplete (unretired) operation; everything after that, done or not, is discarded and has to be done over. This approach is called execution replay (or sometimes issue replay.

No modern machine actually has one-at-a-time execution internally, and an exposed pipeline machine like the Mill does not even pretend that it does. IBM 7094 semantics are so 1960, after all. Instead, we let all operations run to completion, save the results, and on restart we inject the saved results with the same timing and semantics that would have occurred had the interrupt not happened. This approach is called result replay.

Result replay has real advantages over issue replay: you never have to recompute things. Moreover, while you have to save the in-flight state if an interrupt happens, you don't have to save all the state all the time in case an interrupt might happen. And the state is a lot smaller, too.

davidm

Thanks for the explanations. That makes a lot of sense. Just to make sure I have this right: if you have a "mul-call" instruction followed by an "add" then the mul will complete during the called function (presuming it's more than a couple of cycles) before the add is issued. Even though the mul ends up being computed before the add (because the mul's result will show up in an output buffer during the cycles of the called function) it will show up in the callers belt at the appropriate time (via naming) *after* the add. The mul's result might get saved and restored by the spiller if the called function is of sufficient length (or itself calls functions that are of sufficient, length, etc.) but that is all transparent to the software. An interrupt is handled essentially the same (via result replay). Honestly, that sound pretty straight forward, given the alternatives. Let me know if that's not correct.

I'm curious to see how exceptions play into all of this. I know I'll have to wait a while, but that must be an interesting story to tell.

In any event, thanks, again!

davidm

In the talk you mention that if you have multiple calls in an instruction that the return from one call returns directly to the subsequent called function. You give the example of F(G(a+b)). The instruction would be "add-call-call" where the add is in op phase and the calls are in call phase. A couple of questions:

1. Is this true if the code were G(a+b); F(c+d); (i.e. does the retn in G go directly to F)?
2. How are the belt and stack setup for F? Presumably G's stack must be cut back and F's stack established (although with IZ, this should be a simpler process on the Mill) but G's belt must be established as well. Is this also information stored by the spiller (in addition to return addresses, as described in the security talk)? Does the FU that processes the retn have access to that saved state (something like a belt list you'd pass with a call) to setup the next call (i.e. the call to F)?

Thanks in advance for your answers.

Will_Edwards

Yes this is how it works. The compiler statically schedules everything, and everything happens in-order. The spiller transparently takes care of storing results that where inflight during calls and makes them available when their retire cycle comes, if necessary.

Exception is an overloaded term but I guess you mean c++ -like exceptions?

These are in the language and runtime, and to the core knows nothing about them. They are just rarely-taken branches and returns.

The stack trace, however, may be recovered using a special system service which can talk to the hardware; or a language runtime may record the call stack itself using its own metadata.

« Previous Page Next Page »