Forum Replies Created
- in reply to: Loop pipelining and aliasing #1175
Oops.
Yes, you are entirely right. It should be A[i] = A[i] + A[i+1] to match the animations. And none of us caught that in reviewing the slides. No, nothing about deferred loads, just a goof.
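In C, the corrected statement (the loop bounds and types here are just illustrative, not taken from the slide):

    /* bounds and types illustrative; the point is the corrected statement */
    void step(int *A, int N) {
        for (int i = 0; i < N - 1; i++)
            A[i] = A[i] + A[i + 1];
    }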
Back to the PowerPoint 🙂 Sorry. I think we can keep the video we shot and sync it with corrected slides in post-production.
Ivan
- in reply to: HP's next RISC jump #1145
We can hope that they have it working by the time we’re ready 🙂
- in reply to: Loop pipelining and aliasing #1186
How will the compiler and specializer jointly handle this case?
A. Quite well indeed.
(To this reader, it sounds like there's already been lots of sim results and tests that have exercised the existing compiler, specializer and simulator on a whole bunch of loop-heavy code. So there's a body of NYF/NY-disclosable evidence that the tools and software pipelining of loops work well and will continue to do so. No hand waving involved.)
Well, I could have said "quite well", but don't like to brag.
However, I'm sorry to say that we really don't have the kind of exercise-based evidence you suggest. The prototype compiler never did dataflow analysis (which is necessary for pipelining), and the new one isn't to that point yet either. So the architecture, and the discussion, is from first principles and not from experimentation.
It is certainly possible that subsequent experimentation will expose flaws in our first-principles analysis, and if so we'll go back and rethink the analysis. One of our reasons for taking the Mill public as fast as we can get the filings in is so we can get a large number of skeptical eyeballs onto the design; eyeballs that may not be as comprehensive as a test suite but are likely to be already familiar with troublesome corner cases.
We thank you, and your legion of friends who ask us hard questions.
- in reply to: Loop pipelining and aliasing #1181
Most of this is addressed in post http://millcomputing.com/topic/loop-pipelining-and-aliasing/#post-1180.
We won't have to shoot new video for the pipelining talk; only the slides have to be changed, so it should not add much delay (famous last words, I know).
Yes, only the specializer knows how many in-flight deferred loads are possible on the target. That constraint may force the specializer to produce a longer (more cycles) schedule for the piped loop than would be necessary without the constraint. As a simple example, if specializer policy is to always use a deferral large enough to hide the d$1 cache latency (three cycles) and there is only one load in the loop and only one retire station, then the schedule must use at least three cycles per iteration, whether the other slots have anything useful to do or not. Pipelining is still quite possible, and will be done, although the constraint may force no-ops that would not be necessary if loads were true one-cycle ops. This consideration was ignored in the talk, which was already over-long.
There are fewer retire stations on smaller members, so a random loop is more likely to hit this constraint than on a bigger member. However, other constraints (such as shortage of ALU units for the loop work) will also pinch on smaller members, causing the pipeline to need more cycles per iteration anyway. These effects offset, and the members are configured so that over the range of loops in average code no particular one of the constraints is exceptionally pinchy compared to the others. The result is that pipelines will be slower on smaller and faster on bigger members, as one would expect.
The compiler does not emit alternates for pipes (we’re not even sure that it is worth having the compiler emit alternates in any cases, but the possibility is there). The compiler also does not do scheduling: compiler output is a pure dataflow graph, with no temporal notions at all. The algorithm that the specializer uses to pipe loads is described in the cited post.
Ask again if the explanation isn’t clear.
- in reply to: Loop pipelining and aliasing #1180
Exactly.
In more detail: a retire station is a dedicated, blocking, not-hardware-pipelined device, much like a classic divide unit, which can have only one divide in flight at a time; contrast with a hardware-pipelined device like a multiplier, where you can start a new multiply every cycle and have several in-flight at different stages of execution.
Now apply that information to the scheduler in the specializer. A software pipeline is constrained by the resources it needs for the code it executes, so if (for example) the loop body contains six adds and the target member has four ALUs then the software pipeline must be at least two cycles long. The scheduler determines this minimum by considering blocking ranges as it schedules. For an ordinary, hardware-piped operation like an add the blocking range is the one cycle for issue, regardless of the latency of the operation; you can issue an add every cycle. In contrast, a 14-cycle blocking divide has a 14-cycle blocking range, so if the loop body has a divide and the hardware has one divide unit then the software pipeline must be at least 14 cycles long, even if it has to fill those cycles with no-ops.
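A minimal sketch of that resource arithmetic in C (my own illustration, not the specializer's actual code): the minimum pipeline length is the worst case, over all resources, of uses times blocking range divided by the number of units, rounded up.

    #include <stdio.h>

    /* One resource class on a hypothetical member. */
    struct resource {
        const char *name;
        int uses;           /* uses per loop iteration */
        int blocking_range; /* cycles each use occupies a unit (1 for piped ops) */
        int units;          /* units available on the target member */
    };

    /* Minimum cycles per iteration: worst case of
       ceil(uses * blocking_range / units) over all resources. */
    static int min_pipeline_cycles(const struct resource *r, int n) {
        int cycles = 1;
        for (int i = 0; i < n; i++) {
            int need = (r[i].uses * r[i].blocking_range + r[i].units - 1)
                       / r[i].units;
            if (need > cycles)
                cycles = need;
        }
        return cycles;
    }

    int main(void) {
        struct resource loop[] = {
            { "ALU",    6,  1, 4 },   /* six adds on four ALUs -> 2 cycles  */
            { "divide", 1, 14, 1 },   /* one blocking divide   -> 14 cycles */
        };
        printf("min cycles per iteration: %d\n",
               min_pipeline_cycles(loop, 2));   /* prints 14 */
        return 0;
    }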
Now to apply that to loads. The amount of deferral of a load is up to the code, up to a maximum number of cycles determined by the representation (for count-deferred loads) or with no upper limit (for tag-deferred loads). In particular, the deferral can be zero so that a load issued in this cycle can retire in the next issue cycle, albeit not in the next clock cycle. A load uses the retire station as a blocking resource, but the blocking range is only for the period from issue to retire of that load. Consequently, if loads use zero deferral, the limit of parallelism is the lesser of the number of load units (i.e. the number of loads that can be decoded) and the number of retire stations. All Mill configurations will have more retire stations than load units (otherwise deferral would be pointless) so in practice the constraint for non-deferred loads is the number of load units. Thus, non-deferred loads can be pipelined just like adds, because they have a one-cycle blocking range just like an add.
Now introduce deferral. If there are eight retire stations, then we can have eight loads in flight at any one time. If there are four loads in the loop body, then each load could have a blocking range of two, which means that loop can be software pipelined with each of the four loads specifying a one-cycle deferral (i.e. a two cycle flight time). This constraint is static and the blocking range can be figured as soon as you know what the loop does and have the specs for the target, i.e. at specialize time.
Now it happens that stalls are expensive because modern hardware doesn’t stop on a dime, nor start immediately either. A two-cycle deferral will always cause a stall, even if the data is in the top level cache, because the cache access latency is longer than that. So it is better to have a deferral that is at least the d$1 latency, assumed to be three cycles for initial Mills.
So for performance the specializer (when scheduling a pipelined loop) wants to use a deferral of at least three. That means that it cannot issue all four loads every cycle, because that would require 12 loads in-flight and there are only eight retire stations. Consequently the schedule must add additional cycles to the software-pipeline, just like it does with a blocking divide. The resulting code will contain no-ops.
Of course, if the schedule does not run out of retire stations even with a deferral of three, it can use a larger deferral for all or some of the loads until the cumulative blocking region exceeds the available stations. The same algorithm is used for all ops with a blocking region greater than one; the only thing that is different about loads is that the size of the blocking region is determined by the code rather than by the hardware spec.
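The load case, with the numbers from above, reduces to the same arithmetic (again just my own toy sketch, not the real scheduler): a load holds a retire station for its whole blocking range, so loads-per-iteration times blocking range must fit into retire stations times cycles-per-iteration.

    #include <stdio.h>

    /* Toy arithmetic only: station-cycles needed vs. station-cycles available. */
    static int cycles_needed(int loads, int range, int stations) {
        return (loads * range + stations - 1) / stations;   /* round up */
    }

    static int max_range(int loads, int cycles, int stations) {
        return (stations * cycles) / loads;
    }

    int main(void) {
        int stations = 8, loads = 4;

        /* Blocking range 2 (the one-cycle-deferral case): 4*2 = 8 fits
           the 8 stations, so a one-cycle pipeline works. */
        printf("%d\n", cycles_needed(loads, 2, stations));   /* 1 */

        /* Range 3 (enough to hide the d$1): 4*3 = 12 > 8, so the
           pipeline grows to 2 cycles per iteration. */
        int cyc = cycles_needed(loads, 3, stations);
        printf("%d\n", cyc);                                 /* 2 */

        /* Having paid for 2 cycles, a range of 4 comes for free. */
        printf("%d\n", max_range(loads, cyc, stations));     /* 4 */
        return 0;
    }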
Incidentally, loads are uncommon in Mill software-pipelines, for reasons that are NYF 🙂
- in reply to: ensuring bounds of arrays #1167
Any bounding scheme would have to support the language standards, which explicitly permit one-past-the-end addresses, though not actual access at that address.
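For example, in standard C (nothing Mill-specific here):

    #include <stddef.h>

    void clear(int *a, size_t n) {
        int *end = a + n;        /* one past the end: legal to form and compare */
        for (int *p = a; p != end; ++p)
            *p = 0;
        /* *end = 0; would be undefined: the address may be formed, not accessed */
    }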
- in reply to: Specification #1158
I hope so 🙂
The LLVM middle-to-back representation is more a linearization than a serialization. We plan to back up into the middle end just a little and serialize the internal (graph) representation after all the regular passes and a few Mill-specific passes are done. The compiler-to-specializer representation knows it’s a Mill but doesn’t know which one.
- in reply to: Specification #1156
Exactly 🙂
- in reply to: Specification #1155
In effect yes, though not really. The abstract form has no belt; it is a dataflow graph overlain with a control-flow graph. That is, it’s not a P-code or Java code or any of the other serializations; there’s no reason for the specializer to have to figure out the data dependencies that the compiler already knew but threw away in the intermediate representation.
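Purely to illustrate the shape of such a representation (the names are invented for this post, not the real specializer input format):

    /* A dataflow graph overlain with a control-flow graph: each operand is
       an edge to its producer, not a belt position, and the EBBs carry the
       control-flow edges. No scheduling, no belt, no temporal order. */
    struct op {
        int        opcode;     /* abstract Mill operation */
        struct op *args[4];    /* dataflow edges to producing ops */
        int        nargs;
    };

    struct ebb {
        struct op  **ops;        /* this EBB's operations, unscheduled */
        int          nops;
        struct ebb **successors; /* control-flow edges to other EBB heads */
        int          nsucc;
    };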
Of course, it is trivial to go from the graphs to a serialized instruction stream with infinite belt, and we might even offer a command option to produce that, for those who find such a representation to be an easier way to understand what is going on. But it would only be documentation.
- in reply to: Specification #1152
A final reply: as you can see, it’s much easier to reply if you split your posting on the Forum into multiple postings with one topic each 🙂
- in reply to: Specification #1151
Also, as a Lisp programmer and heavy DSL creator, I must say that when you were showing C++ metaprogramming and especially the enum issue, I could only think of Greenspun's Tenth Rule. 🙂
Had to go look that one up.
The counter-argument embeds Greenspun’s Seventh Rule.
- in reply to: Specification #1150
Do you anticipate any temporal quantization or aliasing problems with tracking time as (presumably) integer picoseconds in simulation for systems with multiple clock rates? It seems like there could be edge cases where sim would consider activities simultaneous which would be ordered in hardware, depending on the usefulness of that distinction at the sub-picosecond scale.
The sim models multiple clock domains, multiple clock sources (the xtal component), and multiple clock converters (the PLL component). It allows for the cost of the handshaking required to cross domain boundaries.
The sim does not model skewed clock strategies such as appeared in the P4 and some other barn-burner chips. It could, but the hardware guys are not expecting to use such techniques any time soon, and so we wouldn’t know what to specify.
- in reply to: Specification #1149
When bin-filling the binary instruction encoding, does your software take expected instruction popularity into account, giving common instructions the tighter encodings and letting uncommon instructions take an otherwise needlessly larger encoding to make room for the common ones?
Not really. What you describe is a simplified Gray coding, and we don’t use frequency in determining the encoding of the ops proper. We just determine the natural minimal encoding of each op, and pack for minimal total length. If there’s one op with so many arguments or information to represent that it takes several bits more than any other op then the algorithm will likely use half of the overall entropy just on that one op, and it will encode as the natural length plus one bit to distinguish between that one and all the rest. And so on, for minimal total length.
That said, there are cases in the encoding in which we are in effect encoding by frequency. For example, excess entropy in the exu block (the computational 2-in-on-out ops like ADD) is soaked up by immediate forms, such as add immediate and equal immediate. The range of immediate values supported in any given slot is determined by how much entropy is left over; we add immediate values until we almost but don’t actually go over a bit boundary. The added supported values are small integers, in order, so one slot may support an add-immediate with values 1-17 and the next slot with values 1-243, depending on the rest of the encoding.
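As a toy illustration of soaking up the leftover entropy (all numbers invented; the real per-slot encoding is far more involved):

    #include <stdio.h>

    /* Toy only: if a slot's plain ops need `needed` code points but its bit
       width provides 2^bits, the surplus points can name small immediates
       (add-immediate, equal-immediate, ...) at no cost in slot width. */
    int main(void) {
        int bits = 9;         /* invented slot width */
        int needed = 300;     /* invented count of plain-op code points */
        int surplus = (1 << bits) - needed;
        printf("add-immediate supported for values 1-%d in this slot\n", surplus);
        return 0;
    }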
This algorithm is in a sense a Gray code, because immediate value usage in actual programs drops in frequency with increasing value – add(3) is more common than add(27). So the algorithm is choosing more frequent operations to encode specially, even though the encoded bit length in any given slot is fixed.
There are other places where the encoding is frequency-sensitive, but in all of them the situation is like with the immediates: we don’t pack single operations by frequency, but we do pack multi-operation sequences by frequency. If there is not an immediate form with the right immediate value, then an expression such as X+243 must be scheduled and encoded as two ops, a con to get the constant on the belt and then the add. Doing it in one op has better entropy, and also improves schedule latency and slot and belt pressure, so there really is frequency consideration in the algorithm, just not in the encoding of single ops.
- in reply to: Specification #1148
(Related to JITs, but slightly off-topic, but I don't recall which talk covered it: With the way EBBs and function semantics work, is it possible for a subroutine to have multiple entry points? Just considering optimization possibilities for always-vararg languages.)
It is possible for a function to have multiple entry points, but each entry point must be the entry point of an EBB. All transfers must be to the head (i.e. middle) of an EBB.
- in reply to: Specification #1147
So just for extreme clarity: The simulator examples given here, and the assembly language shown, are only ever for member-specific code, and never for abstract family code? JITs would generate an LLVM-ish intermediate serialized representation?
The notation used in assembler is common across all members. What you can say in that notation is member specific. Thus
addu(b3, b6), addu(b4, 3);
is syntactically valid on all members, but will only produce binary on members with two or more adders. JITs (we expect) will produce specializer input notation, or rather a binary form of that, and the same library that the specializer itself uses will turn the JIT code into target binary.