Renesac
decode 0
decode 1
readerPhase execute/opPhase issue
opPhase execute
callPhase/first called instruction
… repeated
writerPhase (which is also opPhase of the next instruction and readerPhase of the one after that)
Let's see if I understand it right. The first part could well be in the encoder talk topic:
Decode 0 don't produces anything? Are the readerPhase ops already ready, but waiting for the opPhase ones to be decoded so they can issue while reader phase executes? Or do the reader side ops on the flow side take one extra cycle to decode compared to the simple encoding in the exu side, and that is why there is this delay?
The decode 1 decodes the opPhase operations (and maybe the pick and writer phase too, finishing the effective decode?) so that the next cycle can be a issue phase (Mill does had an issue stage after all, only masked by phasing) while the reader operations that were already ready execute.
Is there a decode 2 stage?
Parallel to the opPhase execute, is the readerPhase of the called function/loop executing, assuming that the branch/exit predictor did it's job? Or there is a bubble there? That is the only place where I see the readerPhase increasing the work done in a cycle when assuming a infinitely powerful decoder.
So, is the "callPhase/first called instruction" already at the opPhase of the called instruction?
And in the cycle the function returns (already at transfer phase, that occurs at the same time as the writer phase?), is the calling site already at opPhase? Or there is a bubble? Again, assuming perfect exit prediction.
ivan
The execution timing is the same so long as decoded ops are available, regardless of whether phasing is used or not: each execution pipe is fully pipelined and so can start an op each cycle, and the dataflow constrains the earliest that an op can be started. If you could decode it and had enough pipes, an OOO x86 would pipe dataflows into the pipes the same way that a Mill does. The barrel is no different.
The gain to phasing occurs at the boundary conditions, where the sequence of instruction decodes detaches from the sequence of issue and execution. This occurs at control flow points. For example, when you call a functions, unless you are OOO you must complete all issues in the caller, transfer, decode in the callee, and refill the pipelines in the callee. The same occurs at return, branches, and, importantly) at mispredicts.
Phasing is "really" a way to do short-range OOO without the OOO hardware. If the code is: 3-cycle dataflow -> branch -> 3 cycle dataflow -> branch -> 3 cycle dataflow, then an OOO machine will get it all done in three issue cycles and five cycles overall - and so will the Mill. A strict in-order will take 11 cycles, or nine if it has a predictor.
So phasing is just poor-man's OOO, taking advantage of the fact that real code contains a lot of short independent dataflows that, if you have the MIMD width, you can overlap.
Of course, once you have phasing then other possibilities are enabled: multi-call, First Winner Rule, 0-cycle pick, ...
ivan
1)
Yes, pick takes no cycle at all, nor does conform, because both are in the belt-mapping/crossbar that is logically at the cycle boundary. It actually takes a cycle, or more, to set up a conform. The timing is different depending on whether the belt is CAM (logical) or mapping (physical) addressed, but for physical the hardware is tracking issue and drops a cycle ahead of when they happen and so it actually remaps the belt naming in what is the opPhase cycle to do a conform. However, that mapping is only effective for ops in the following cycle, so we can think of it as taking place at the cycle boundary, even though the hardware guys think of it as being in the X0 clock.
Transfer phase doesn't really have a hardware equivalent; it is an artifact of the way the sim is implemented as it steps through the state machine that does phasing. The Mill is running at least three cycles ahead down the predicted path, so we have already transferred long before we actually execute the branch; possibly several times in fact. So transfer phase is really the point at which we recognize and stall a mispredict. A modern machine cannot stop on a dime, when the core can be several cycles across at speed-of-light. Stall is one of the most difficult parts of hardware design.
Call phase doesn't take a clock if there are no calls, but if there are then in effect it preempts and uses the clock that writer phase would have used.
2)
Unused phases just don't issue anything from the instruction. A sequence of instruction with nothing but compute-phase ops will execute them one per cycle, but there will not be any reader/writer/ phase ops overlapped with them because the instructions don't have any of those kinds of ops. Adjacent instructions run the same whether they have anything to do in a particular phase or not. There's no cost to an unused phase, just a lost opportunity.
ivan
Neither decode0 nor decode 1 "produce" anything except whole or party decoded operations.
Internally to the team, the phase between decode 1 and execute 0 is variously called decode 2, issue 0, or execute -1 depending on which aspect of it you are interested in. For external documentation we try to call it the issue stage, but there is still the potential for confusion because there are three issue stages because of phasing.
The phase sequence continues, with overlap, across all control flow boundaries, following the predicted path. It is entirely possible for a cycle's reader, compute, and writer ops to belong to three different functions. A mispredict is a pain getting the bogus state discarded and things restarted. Big OOO machines have a separate retire stage at the end, with a reorder buffer, to make the update to the architected state be in the proper sequence; if they mispredict they throw away everything that is in flight and re-execute the instructions; this is issue replay. The Mill takes a different approach, marking each in-flight with the cycle that it issued in. At a mispredict we let the in-flights drain to the spiller buffers, the same way we do for in-flights over a call or interrupt, discarding those marked as being from the cycles down the wrong path.
Meanwhile we are restarting the decoder, and as the new ops start execution we replay the former in-flights just as we do after a return. This is result replay.
In summary: phasing proceeds in the presence of control flow just as it would have had all the control flow been inlined.
Renesac
Right, I was confounded by your slide 28. It shows the con being moved to the brtr cycle. Apparently this is wrong. The add that should be moved up to the brtr cycle, and the con should be moved even higher to the cycle before, and both will not appear to have executed if the branch wasn't taken. And the store is in the same cycle as the following branch (writterPhase) not moved lower.
Now I see that the reader phase is really advantageous in all situations (except some times for exit mispredictions, as any other pipeline lengthening, but that is ok).
Renesac
Neither decode0 nor decode 1 “produce” anything except whole or party decoded operations.
I was referring exactly to those decoded or partially decoded instructions, of course. So here the break down of what happens with one instruction by cycle after it was fetched, as I understand:
1 - In decode 0 stage readerPhase operations are at least partially decoded, right?
2 - Then on decode 1 the rest of the instruction is at least partially decoded, and the readerPhase operations must have ended decoding. Are some long latency registers already probed at this stage as an advanced readerPhase? Or all readPhase operations patiently waits the next cycle because the opPhase needs a stage to issue and/or the flow side const encoding only finishes decoding this cycle?
3 - Reader phase operations are executed. It is also the issue stage for opPhase (also called decode 2). By now every operation is probably fully decoded already, only waiting for their phase, right?
ivan
I see where we led you astray. The slide intends to show the "brtr" as the principle cycle (i.e. compute phase) of the instruction containing the brtr; the branch would happen in the following cycle, along with the add, or more correctly at the boundary leading to the add. That is, the slide describes instructions, and tries to show how they are spread out over the phases so as to overlap.
It's very hard to come up with lucid and unambiguous pictures for this, even with the animations to help. I don't know how the slide could be improved; just putting it in terms of phases from the start and showing how those turn into instructions, as you interpreted it, would be just as confusing but in the opposite way.
Perhaps a color coding, with one color to represent the op as an instruction and another to represent it in phase as executed? That would work for the con, but now there would be two "brtr"s and the other ops, and the clutter would be confusing too :-(
If any of you readers have access to PowerPoint I would welcome sample slides that you feel explain this better. Don't worry about the color scheme, but use whatever you think looks nice.
ivan
At the end of decode0 (D0) all the exu-side reader block has been decoded. This is important because we can start the scratchpad fill operations in D1, with the data available at the end of D2 (I0, X-1) for use by the adds in X0. Two cycles to read the scratchpad SRAM is plenty.
We also have the flow-side flow block decoded at the end of D0, but that doesn't help because we need the flow-side extension and manifest blocks (which are really just data arrays) to make sense out of those decodes. The extension and manifest blocks are interleaved into the long (very long) constant in D1 and the selectors set up to extract the bit fields belonging to each op (based on the extCount/conCount fields in the already-decoded flow block) at the end of D1. The selectors extract the constants of any con ops and have them available at the end of D2, ready for the adds in X0 again.
The clock-critical part of all this is the priority-encodes of the extcount/conCount fields that set up the selectors. Priority encode is a linear parse, and with as many as eight flow slots that's a lot to figure out. The priority encode and the actual selection can blur over the two cycles D1/D2, but this may constrain clock rate at high clock on very wide Mills, forcing us to add another decode cycle. A Gold at 4GHz is not for the faint of heart! It's all heavily process dependent of course. Best guess is that the width vs clock constraints will pinch in the fastpath belt crossbar before they pinch in the con operation decode, but we don't know.
LarryP
Changes in the divide instructions -- now native on Silver?
Greetings all,
According to the wiki:
http://millcomputing.com/wiki/Instruction_Set/divu
Divide operations are no longer all emulated via loops around (less expensive in hardware) helper operations. This is an interesting departure from the strategy of emulating divide by looping around a helper function (that's less hardware costly.)
It seemed to me that this might happen at some point, for some Mill core(s), but I expected that it wasn't on the critical path, so am interested in why change it now?
In some ways, this reminds me of what seemed to happen with the ARM7 series of chips. The base ARM7 lacked hardware divide, and although there were software emulations, I think ARM licensed far more ARM7's with hardware divide than without. I think system designers were worried about worst-case soft-divide performance, should the software guys intend on writing divide-heavy code. (Shame on them!)
Thoughts?
ivan
You happened to catch work in progress. The Wiki shows a periodically-updated snapshot of the actual specs for the members, and so may be internally inconsistent with reality at times as various things are being worked on. We still have no plans to have hardware div, rem, or divRem operations, but they are part of the abstract Mill and so the tool chain and sim must support them for eventual family members. Of course, to test the software we need a member spec that does define these ops, and Silver seemed to have gotten elected. I expect it will go away again.
LarryP
Effective IPC during divide emulation?
What happens to a Mill's effective IPC while it's emulating a divide operation?
An iterated-in-software approach to divide (and relatives) would appear to require in the general case both multiple cycles and an unknown-at-compile-time number of cycles. If a divide emulation takes on average N cycles and there's no way to get anything else done -- because the number of iterations is unknown at compile time -- that situation seems to imply that the effective ops/clock would drop to something proportional to 1/N, unless there's a way to make use of the rest of Mill's width (or at least a good portion thereof), during the emulated divide. If there's no way to schedule other ops during the emulation, that sounds like it would be a substantial performance hit!
I won't speculate publicly (I hate that constraint, but understand the need!),
but I'm very curious about:
a) How can the Mill CPU and toolchain get anything else useful done during an emulated divide?
b) What does the Mill's approach to divide do to its effective IPC in codes that need to do a substantial number of division or remainder ops?
Apologies if this is a sensitive issue, but eventually people are going to want to know. (On the off chance that you're not satisfied with your solution, I'd be happy to share my speculations via email.)
As always, thanks in advance for anything you can share publicly.
ivan
Frankly, nobody has written the divide emulation code yet, so we have no numbers; it's in the pipeline, but the emulation team has mostly been worrying about FP. If you'd like to help then feel free to post candidate code for div, rem, and divRem for the various widths. Remember that either or both inputs can be NaR/None. I do not expect there to be a loop; simply unroll the whole thing.
That said, the emulation substitution occurs at the start of the specializer where we are still working with dataflows, so if there is any code that is part of a different flow than the divide then the code will overlap.
LarryP
Ivan et all at Millcomputing,
No promises, but I can take a stab at emulating integer division, probably unsigned to start out. In order to do that, I'll need a clear definition of exactly what the rdivu operation calculates, including width(s) and any constant of proportionality.
I find examples very helpful to understand key details. For simplicity, let's start with uint8 as the size of the input numerator and denominator. Would you kindly post what rdivu(x) returns for, respectively: 0, 1, 63, 64, 127, 128, and 255?
(I've got an idea what I'd like rdiv to return, but I don't want to mis-assume -- nor do I know ALUish hardware well enough to guess what's fast and cheap to implement as a native op.) Far better to start with what rdivu really does.
Also, what should integer division (OK divrem) on the Mill return for quotient and remainder when given a zero divisor? NaRs?
Thanks,
ivan
Like I said, we haven't done anything about it yet. As for the correct values, use the C/C++ standard for everything. For divide by zero, return NaR(undefinedValueNaR).
Things are tricky when the domain is signed and either or both arguments are negative; see https://en.wikipedia.org/wiki/Modulo_operation. Currently we do not have a mode argument in the spec for these ops, but we probably should; the natural candidate is the roundingCode attribute that FP ops use. The integral result would then be the value produced by an infinitely precise FP divide operation followed by the FP integerize operation, plus the Euclidean rule for the remainder. As far as I can see the available rounding codes provide all the sign rules demanded by the various languages listed in the Wikipedia article, plus some (nearest-ties-to-even for example) that don't seem to be used.
I suggest that you see how far you get in unsigned, and when that gets through the specializer (you'll write in genAsm) then we can revisit signed and decide whether to use roundingCode, a new attribute, or roundingCode with some enumerates faulting as undefined.
It's OK to post your code here, although I'd suggest starting anew topic. Other readers/posters may chime in on your efforts; please be polite. When you have something that you believe works then either we will have published access to the tool chain for all or (if that's still not done) we can NDA and let you use our servers.
LarryP
Ivan,
If you folks haven't defined rdivu, may I please have permission to suggest (namely speculate publicly) on what rdivu might calculate? Or would you prefer I email that to you first, for your approval.
I really doubt that the rudiments of numerical analysis, Taylor series and the Newton-Raphson algorithm{1} would have anything patentable, but I'd like to help -- and thus try my best not to cause Millcomputing any IP headaches.
Unless I'm mis-remembering, the convergence of any Newton-Raphson-like algorithm will depend strongly on the accuracy of what rdivu produces. So there'll probably be a trade-off between the hardware/speed cost of implementing rdivu and the precision -- and thus how many iterations are needed to converge (for a given width of input args.)
Is there an existing forum category you'd like to see this under?
Actually, my instinct is to put it on the Wiki (once I have something coherent), under community-contributed software. That way we have history and multiple people can try their hand at emulating division.
----
{1} According to a lecture I heard some years ago at NIST (wish I could recall the speaker's name), Newton's original approximation method, in his Principia Mathematica is rather different from the modern Newton-Raphson root-finding algorithm. (But Raphson still got second billing. So it goes....)
ivan
You have permission, and can conduct your business here or on the Wiki as seems best to you.
In addition, you may propose novel helper ops beyond rdevu if you find the need, but please send the proposal for such an op to me directly first for vetting before putting it in the public space.
LarryP
Greetings all,
I've started a Wiki page for user-contributed division suggestions, algorithms, pseudocode -- and eventually genAsm code, at:
http://millcomputing.com/wiki/Division:_Algorithms_and_user-submitted_genAsm_code
I don't see a "code" formatting button on the Wiki, as I do here in the forums.
Would somebody who knows the Wiki look into adding a "code" formatting button on the Wiki?
Veedrac
It seems you should use <pre> tags for code on the Wiki. I've edited it in.
LarryP
Veedrac,
Thank you much!
Larry
mkbosmans
It's curious to see the rdiv instruction specified with only a vague idea about it's specification. May be it is just the public information that is unclear about the specifics, but from Ivan's comments it looks like this really is uncharted territory for the Mill team.
On the wiki, at LarryP's page on division I posted a small comment about an alternative algorithm, Goldschmidt division. It could very well be that when that is used as a software div implementation, the instruction providing the initial guess should be another function or could be omitted altogether.
Anyway, it's probably time for a separate topic on integer division.