Pipelining
Tagged: GenAsm, pipelining, specializer
Talk by Ivan Godard – 2014-07-14 at Facebook
Slides: pipelining.06 (.pptx)
Software pipelining on the Mill CPU:
Instant pipeline: add loop, no stirring needed
The Mill CPU architecture is very wide, able to issue and execute 30+ independent MIMD operations per cycle. Non-looping open code often cannot use this raw compute capacity, but fortunately >80% of cycles are in loops. Loops potentially have unbounded instruction-level parallelism and can absorb all the capacity available – if the loop can be pipelined.
This talk addresses how loops are pipelined on the Mill architecture. On a conventional machine, pipelining requires lengthy prelude and postlude instruction sequences to get the pipeline started and wound down, frequently destroying the benefit of pipelining the main body; conventional pipelining can be of negative benefit on short loops, especially “while” type loops whose length is unknown and data dependent. Not so on a Mill: Mill pipelines have neither prelude nor postlude, and early conditional exit has no added cost.
Pipelines on conventional machines also have problems with loop-carried data, values produced by one iteration but consumed by another. Conventional code must resort to bucket-brigade register copies, or fail to pipeline altogether. Even architectures like the Itanium, which have special hardware support for this, provide it only for the innermost loop. In contrast, the Mill needs no copies and can pipeline outer as well as inner loops.
Familiarity with prior talks in this series, especially the Belt and Metadata talks, will be helpful but not essential.
Hello, I have a question about the prologue of the software pipeline. Specifically, in the example you gave with the retire operation, where a conventional prologue would normally have nothing, you instead have operations that return "none" because they are calculating something from before the loop.
What exactly is the point of having those operations which do not calculate anything? Would that not waste cache space compared to just having nothing there? I don't know much at all about compilers and I mainly program in assembly for different architectures, so perhaps this is a compiler optimization I don't know about.
Secondly, you have stated (and as I found out by experimenting a little) that the belt is terrible to program in assembly. I know it would be rather silly to optimize for assembly, but will there be a tool (perhaps in the debugger) that lets you look at the current status of the belt while writing the program? It would make the assembly much nicer to program in.
Hmm. Looks like we needed another slide.
The point about the retire op is that with it there is no prologue code any more. The whole thing – prologue, steady-state body, and epilog – is the same physical piece of code in the cache. If there were a conventional prologue, these ops would be in instructions of their own, occupying cache space for the prologue in addition to the steady-state body that we need anyway.
Yes, the ops that are logically “before the beginning” get executed, but they have no visible consequences and occupy no space that we wouldn’t use anyway. Yes, these ops do cost power – but they do not cost code space, which is a much more significant constraint in the big loops with low iteration counts typical of general-purpose code.
As for assembler, we agree with you that Mill concrete assembler (in which the programmer explicitly specifies the belt positions of op arguments) is inhumane both to read and to write. If you have a program that you have written in concrete assembler, you can run it in the present simulator and see the belt contents and track them, instruction by instruction; the Specification talk shows this being done, live. We expect that a debugger on real hardware will provide a similar interface.
If you are writing in abstract assembler (specializer input, without belt assignments, as opposed to specializer output concrete asm) then there is no “current status of the belt” to view, because the code may be specialized for, and run on, different family members with different belt lengths and operation/instruction schedules. Only when these member-dependent aspects are blended into your abstract code – by the specializer – can a tool, or the hardware, know what is happening on the belt.
Unfortunately, while we are hard at work on the abstract assembler and specializer, they won’t be ready for some months yet. When they are, we are considering making them and the simulator available running on our servers from our web site, for anyone to play with. We’ll see. There are also some more distant plans – hopes, really – to provide a visual UI for the sim or debugger so the programmer can see the movement of data and events inside the running program, visually reminiscent of the animations in the talks. Not soon, though.
Thanks again for another great presentation. The brief recap was nice, and the extended explanations of Mill nuances were even better.
I also very much appreciated the discussion of your business strategy on the Hackaday “Mill CPU for Humans” thread on YouTube, particularly Part 4:
I mostly program in Forth — and the FIFO Belt is as far away from a LIFO stack as you can get — but I see what you folks have achieved here — bravo, and well done!
Dave’s Device makes complete and perfect sense to me.
cheers, – vic
The retire() operation is brilliant!
To me, it shows the hallmark of a great invention — that it seems "simple" — but only after I fully absorbed the flash of insight.
rotate(), inner(), leave()…
To quasi-quote President Carter, I lust in my heart for this CPU architecture!
Not just because it’ll be fast, but because it’s fast yet clean.
(I suppose it's irrational, but even when I don't have to look at the assembly code, I'd much rather work on a clean machine than on something I consider ugly.)
I just asked on Reddit because the password recovery email took its sweet time, but how large is the spiller in a typical Mill family member? Gold, Copper, Tin.
I am interested in how large a loop can be pipelined for something like a box blur, which has 3 colors times blur length of info that it would need to save across loop iterations. I would be interested in blur lengths of ~20 pixels, making it need 60 saved belt entries. Is this feasible?
Given that the Mill will be capable of operating on vectors, I’d expect the RGB pixel to be represented, at least on large enough members, by a single vector. I’m not clear on what operations might be available for vector shift and element access, that might be useful for filtering a stream of data.
I just realized that the blur in my algorithm only happens on a single channel array, so ignore all mentions of three colors. ::sigh::
That was used in another context of the same program, and it was actually faster than separate arrays for each color, at least on x86. It is a weird algorithm indeed.
My original question still stands, though.
The spiller has unlimited size, because under the buffering is all of memory. The constraining factor the spiller presents is bandwidth – high-end Mills with big belts have a lot of state to save and restore. The top of the spiller, the part that connects directly to the belt and the rest of the core, has enough bandwidth to eat everything the core can throw at it into buffer registers. The next level down is an SRAM, and there's not so much bandwidth between the buffers and the SRAM. The final part of the spiller, between the SRAM and the bottom-level cache and thence to DRAM, has the lowest bandwidth of all.
Any one of these steps can be overwhelmed if the core really works at it, just as the regular load-store bandwidth can be overwhelmed, on a Mill or any machine. Both spiller and the memory hierarchy are able to issue-stall the core if they need to catch up. The individual Mill members have the bandwidths sized so that stall is rare, part of the resource balancing that goes into the detail work on each family member.
Other constraints limit the practical length of the belt to avoid impact on clock or power. We know 32-long works, and we think 64-long may be practical for high-end throughput-oriented members with slightly lowered clock rates, but 128 seems unlikely.
For your blur, if those 60 are really 20 RGB vectors (as PeterH suggests) then they will fit in the 32-belt of the mid- and upper-end members. I can imagine a member with say 16 FMA units and a 32-long belt, although we are starting to get into GPU-land with that. If there really is a distance of 60, then code even on the high-end general-purpose 64-belt members is going to use the scratchpad and its rotators as a belt extension, and the spills and fills of the belt-overflow part that doesn't fit will wind up needing to be pipelined with the rest of the code. The good news is that the scheduler in the specializer takes care of getting the spill-fill right; you don't have to do it.
How does the retire operation handle the order of the results? In a steady state, the arguments on the belt should always be in the same order. But the retire operation only knows about the number of expected results, not which FUs should actually produce results in which order. So what if it has to fill in a None somewhere in between? How does it do that?
(If I explained the question badly, maybe I can try to come up with some sample code).
In practice, the retire op has variants with arguments to cope with this and other implementation details.
Actually, it's simpler even than Will says; I wish I could have gone into the mechanics in more detail, but the talks have time constraints.
It comes down to the fact that drops are always at the front of the belt, not at some slot index position as if the belt were an array rather than the FIFO that it is. Consequently the hardware must use a canonical ordering of drops so the software can ask for the values in the right place when it wants them later. That ordering is latency order: if a multiply (latency 3) and an add (latency 1) drop in the same cycle, the add will be at a higher position number (say b1) than the multiply (say b0).
The Nones that are dropped by retire are stand-ins for values that are in flight and haven't dropped yet, and the reason they are in flight is that they are the results of long-latency operations that haven't reached steady state yet. The latency order of drops means that the ones which haven't dropped yet would have been at low belt numbers if they were in steady state. Consequently all the Nones from retire, no matter how many there are, will be in a block at the front of the belt, followed by any real retires, followed by the rest of the pre-drop belt.
Getting that to work without a special op to say where things should go is what makes Dave’s Device so clever IMO.
Will is right that there are two forms of retire. Recall that Mill operands have metadata tags telling the operand’s width and scalarity, which guide the functional units when the operands later become arguments to ops. Nones, including those dropped by retire, need width tags too. There is one form of retire that gives the desired widths explicitly. There is also another form that gives only a drop count, as shown in the talk, and that form drops Nones with a tag with an Undef width.
When an Undef width is later eaten by an op, the width is defaulted to a width that makes sense to that op, following defined rules, and the op’s result (always a None of course) carries the width that would have resulted from that default. However, for some ops the defaulting rules can result in a wrong width, different from what the real non-None would have done, that can screw up the belt for other operations. The scheduler in the specializer knows when that happens, because it knows what the desired width is, and when it detects that the default would be wrong it uses the explicit-width form of retire.
That would have been another ten slides and an explanation of width tagging, and there wasn’t time for it.
Ah, I see. I thought it was slot order (for those slots that produce results, so not like in an array), not latency order. Maybe I missed that in the earlier talks. So I assume it’s by slot order for ops with the same latency? Is the latency order guaranteed to be the same across different mills, or does it not matter because this scheduling is done during the “specialization” phase anyway?
That would have been another ten slides
To be honest, I fast forwarded through the first 3/4 or so of the talk. (Amazingly enough, your voice is so deep it’s still quite easy to understand at 2x speed, at 4x it gets a bit difficult :-)) I know you probably try to cater to an audience of varying background, but a few quick asides don’t hurt. Those people interested can usually figure them out without slides.
Is the latency order guaranteed to be the same across different mills, or does it not matter because this scheduling is done during the “specialization” phase anyway?
The latency of ops varies by member, and can vary by operand width too. Some ops may even be emulated on some members.
It doesn't matter to the compiler because it's the specializer which schedules ops.
Yes, latencies are member dependent, so order is too. Yes, same latency drops in slot order, and ops with multiple results drop in result-number order. Yes, the specializer deals with all that in the scheduling.
As for level of presentation, it's real hard to balance the needs of the newbies against the impatience of the old hands, and fit it all into a talk of reasonable length. I try to cut things into reasonably complete topics with adequate background, mindful of the physical audience; at Facebook only half the audience had prior exposure to notions like the Belt or Nones, and I don't want to blow off the other half.
In many talks I can speak faster, but in this one the animations largely paced the talk; they can’t be too fast for the audience either. I was forced to leave out quite a few niceties; the drop-order issue for example, and things like extended scratchpad – I’m a little surprised that no one here has asked about what you do when you run out of scratchpad. And the whole question of epilogues.
The new Wiki will be up on our site soon, initially very light on content but I’m hoping you folks will help populate it quickly with searchable details. Those talk-viewers who need more can always find it here on the Forum, unless it is still NYF. There are at least six more talks to go.
(Thanks Ivan for describing latency ordering of drops; I don’t think you’d previously covered that publicly.)
It wasn’t – we just said the order didn’t matter so long as the hardware and scheduler agree. The actual order, and why it matters, was in the 7/13 provisional filing in advance of the Pipelining talk, so INF (It’s Now Filed)
You know, just when I think I want to share some insight, I once again realize we're still chipping flakes from the periphery of The Mill's Big Picture. It's like every "Ah-ha!" reveals just how much more of this there is to disclose.
As Ivan repeatedly promised, the lectures are a “gross oversimplification” of The Mill reality.
The last processors I worked on that were a “compiler writer’s dream” were the CISC VAX and 68K. When Ivan mentioned the B6500, another easy compiler machine, I was immediately thrown back into the reason the B6500 hardware was so expensive: Stack machines are DOA without fast memory, and the fast B6500 memory was terribly expensive. (You could buy slow memory, but why?)
Then I worked on a CDC machine, and I saw the light: speed is king. Then you have to do the software. Seymour Cray was my hero. Don't know Seymour? Check this out:
I appreciate Ivan’s emphasis on The Mill being a software-driven hardware architecture, but I’m beginning to hunger for some examples (however trivial) of how the hardware implementation will be accomplished. Block diagrams or schematics would suffice, but I’d kill for some VHDL/Verilog to play with.
Sorry, NYF
We hope to have some of the hardware team give one or more talks on the gate-level hardware this fall. Don’t hold your breath for Verilog though
Ok, so what do you do if you run out of scratchpad?
I’d assumed that the answer is “the magical spiller”. And yes, I’d still really like to see a talk about the spiller. It’s not so hard to imagine how it should work, but I guess the devil is in the details. From what we’ve seen so far, it seems to be complex enough.
If there are no operations with side effects in the “in-flight” portion of the loop after it finishes, you don’t need an epilogue, as you said in the talk. If there are, I guess you’d unroll it once for each “block” with side-effect ops in it, and then do a usual epilogue, probably within another inner/leave. Or am I missing something?
Looking forward to the rest of the talks and the wiki. A description of the instruction set on the wiki would be nice.
What to do when you run out of scratchpad:
The next and final level is “extended scratchpad”, in memory. The extended space is not (at present; some arguments) accessible to load/store, but uses specific spill/fill ops and its own allocator and a stack-like allocation regime. As with regular scratchpad, the extended spill/fill deal with operands, not bytes, and unlike memory but like scratchpad the operands retain their metadata. As a consequence, an extended spill/fill must double-pump the cache datapaths, once to get the operand data and once to get the metadata. Extended is cached normally.
Epilogue:
There are loop bodies where you have loop work that still needs to be done after a while-predicate has determined that the exit should occur. If you want to think about it, the general solution is to cause any data sources in the pipeline after the predicate to produce Nones. However, how to do that depends on the kind of source, some of which are NYF, and all of them really need pictures to explain what to do. We're also not sure we have found all the cases yet.
Wiki:
Working to get the ISA Wiki doc mechanically generated from the specs.
If you want to think about it, the general solution is to cause any data sources in the pipeline after the predicate to produce Nones. […] We’re also not sure we have found all the cases yet.
This looks really hard to get right in every case, and I probably wouldn't trust the compiler to always get it right.
Why not introduce an artificial dependency in every store (a sort of phantom argument, maybe just add 0 directly before the store) and let the compiler taint these extra dependencies after the loop has completed instead of trying to taint a source? KISS. One can taint a source because normally you know when it’s a bad value. Similarly, one should dually taint the destination if it becomes invalid.
Remember, this is hardware, not compiler internals; there is no such thing as taints
In hardware terms, your suggestion would turn into predication on the store, where the value of the predicate results from the exit test. There have been predicated machines, most recently Itanium and older ARMs; I did a compiler for the Data General Nova1200 once, which had predicated operations implemented as skip predicates; it was an interesting machine.
The problems arise in the implementation. Predication adds another operand to all the operations (or at least the non-speculable ones, on a Mill) which adds another data path between the Belt and the functional units, which is an expensive proposition. One also needs a way to encode the extra argument, which would completely change the encoding; store in particular fully saturates the argument space available. Then there’s the problem of keeping the predicate value live on the belt; it increases belt and scratchpad pressure.
This is not to say that it couldn’t be done; the Mill ISA already has conditional branches and conditional returns, which can be thought of as a form of predication. When the tool chain gets further and we start implementing pipelining in the specializer and try it on real code then we may rethink the design. For now, we don’t think we need anything beyond what we have now.
Constraints on Mill software pipelining:
Based on the pipelining talk and the discussion above, I’ve been thinking about how far Mill pipelining can be pushed vs. where it cannot go. Specifically, what properties would make a loop un-pipeline-able (or unprofitably pipelinable) on the Mill? — and how to recognize those cases. So far, I can see two situations where the Mill approach to pipelining will likely have problems. This isn’t meant as a criticism. I suspect that universal loop pipelining is probably infeasible with finite hardware. IMHO, the key is to recognize when pipelining either won’t work, or would give poorer performance than leaving the loop unpipelined.
1. Loops where there are several different pieces of loop-carried data, for example, coming from different arrays. Since the Mill’s rotate() operation acts only on the most recent scratchpad allocation, I don’t see an obvious way to software pipeline a loop that has loop-carried data from multiple arrays, especially if the loop-carried data have differing loop distances. I suspect this is a fairly rare case.
2. Loops with a single chunk of loop-carried data, but where the loop distance (times the size of each loop-carried datum) is greater than the largest-guaranteed scratchpad allocation. Again, I suspect this situation will happen occasionally, though most loops won’t run into this constraint.
The second of these situations suggests that there are some parameters that the compiler will need to know about, in order to emit intermediate code that the specializer can cause to run correctly on all Mills, even the little ones. In my second situation, the maximum scratchpad allocation permitted in a single scratchpad allocation operation (scratchf()?) on the smallest-scratchpad Mill, is probably one such parameter.
Ivan, if it won’t run into NYF or related issues, can you give us an idea of how many belt operands the scratchpad can hold, before scratchpad needs help from the spiller? Even a rough Tin — Gold range would be nice to have a rough idea about. (I have a rough guess, but I won’t speculate here, so as not to cause IP/patent problems.)
Similarly, I’m curious about what other parameters are “baked” into the Mill architecture. If I recall correctly, the bit-width of the address space is another such. (64 bit pointers, with 60 bits of true address + four reserved bits for garbage collection and forking.) Since pointers must fit in (single) belt positions, it sounds like this requires a minimum height of 8 bytes for all Mill family members. The shortest belt length mentioned to date is 8 for the Tin. I suspect that a shorter belt length (e.g. 4) would effectively rule out wide issue (since results need to live on the belt for at least the spill/fill latency.)
Similarly, two streams of half-bundles (at least until we get that hypercube memory ) and three blocks to decode within each half bundle. Other baked-in Mill-architecture parameters?
For all posters, I recommend that you put one question in each posting, or at least collect questions on a single topic together so that replies can be coupled to that topic.
This reply addresses LarryP’s scratch-related questions.
1) several different pieces of loop-carried data
This makes no difference at all. The rotate operator is invoked once per iteration, and all the scratchpad data local to the iteration rotates, no matter where it came from. The scratch space must cover a loop’s distance worth of data, and individual LCVs may have shorter distances, for example,
A[i] = B[i] + B[i-6] + C[i] + C[i-10];
Here the loop distance is 10, but B has a distance of only six. It is true that the values from B will reside in the rotating scratch for four more iterations after they are dead before they are overwritten with newer values, but that’s harmless.
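To make the distance bookkeeping concrete, here is a plain C++ model of that example; this is ordinary code, not Mill scratchpad or specializer output, and the ring buffers and zero-guards are invented purely for illustration. A rotating window sized for the longest distance (10) also serves the shorter distance (6); the stale B slots are simply ignored until they are overwritten.

#include <array>

// Ordinary C++ model of A[i] = B[i] + B[i-6] + C[i] + C[i-10];
// the ring buffers stand in for the rotating scratchpad.
// (The guards substitute 0 for the out-of-range prefix iterations.)
void kernel(const int* B, const int* C, int* A, int n) {
    std::array<int, 10> ringB{};   // depth = loop distance 10, which also covers distance 6
    std::array<int, 10> ringC{};
    for (int i = 0; i < n; ++i) {
        int slot = i % 10;                                  // one "rotate" per iteration
        int b6  = (i >= 6)  ? ringB[(i - 6) % 10] : 0;      // value saved 6 iterations ago
        int c10 = (i >= 10) ? ringC[slot]         : 0;      // value saved 10 iterations ago
        A[i] = B[i] + b6 + C[i] + c10;
        ringB[slot] = B[i];                                 // save for later iterations
        ringC[slot] = C[i];
    }
}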
2) where the loop distance (times the size of each loop-carried datum) is greater than the largest-guaranteed scratchpad allocation.
This is our old friend the running-out-of-scratch problem. The Mill solution is “extended scratchpad” in spiller-owned memory, with scratch-like spill/fill semantics. There wasn’t time for extended in the talk, but it was described elsewhere here.
As for compilation, we believe (we're not far enough along yet to say "we have done") that it is sufficient for the compiler to identify LCVs and their distances, and let the specializer deal with all spill/fill. Compiler output is a dataflow graph, so if the scheduler finds that a live datum will fall off the belt it must allocate scratch and insert the necessary spill and fill. Because the datum is marked with its distance, the specializer knows that a rotate op is needed and how many copies of the LCV will be live in the scratchpad. The rest is reasonably straightforward.
This reply addresses LarryP’s questions on configuration.
1) size of scratch in different members:
Copper: 128
Decimal16: 256
Decimal8: 256
Gold: 512
Maxi: 2048
Mini: 128
Silver: 256
Tin: 128
Sizes are in bytes. All numbers are placeholders awaiting tuning with real code.
Scratch is byte addressable and packed, so less is needed than for a register file that needs a whole register to hold anything. Ignoring vector data, we project an average width of spilled data ~3 bytes, so a Tin scratch can hold ~40 separate operands. We project a peak non-pathological scratch load in open code to be perhaps 10 operands, so there’s plenty of extra space to buffer the spiller. In a piped loop, LCVs may demand many more than 10, but we expect to stay in such a loop for a while so the spiller won’t be active during the loop and won’t need to buffer, so the loop can use the whole of scratch without causing spiller stalls.
Ivan wrote:
size of scratch in different members:
Copper: 128
Decimal16: 256
Decimal8: 256
Gold: 512
Maxi: 2048
Mini: 128
Silver: 256
Tin: 128
Ivan,
Thanks for that info!
In the bargain, we learn the names of four, previously unmentioned (so far as I’m aware) Mills. From two of those names, I’ll assume that there’s been some real interest in decimal floating-point capability. I wonder who/what market sector?
I find it interesting that Maxi has so much more scratchpad (4x) than a “mere” Gold. Is Maxi associated with the request (from Lawrence Livermore Labs, if I recall) about a supercomputing version? (The one that would have needed over a thousand contacts bonded out?)
The Mill architecture does support IEEE Decimal, although only some members will have it in hardware; the rest will use specialize-time emulation substitution. Decimal8 and Decimal16 are test configurations (topping out at 8- and 16-byte decimal representations respectively) for markets that want it: mainframe, DB, and Big Data mostly. As test vehicles they won't be products, but derivatives may be, when and if we decide to tackle those markets; don't hold your breath.
Mini is a straw man to see how small we can get the specification before execution becomes completely impractical. It will never be a product.
Maxi is a straw man at the other end, to see how muscle-bound we can configure a Mill before it hits diminishing returns with the clock rate. Also not a product per se, although there may be products between Gold and Maxi.
The reason Maxi needs so big a scratchpad is because it has a huge vector size, taking whole cache lines as single operands (SIMD) and lots of them (MIMD). It won’t have that many more operands in scratch than say a Gold, but they will be much bigger on average (no one would use a Maxi for anything but vector-busting numeric work) so the scratch size, in bytes, must be bigger.
Ivan,
Thanks for explaining the basic intent of mini and maxi. I appreciate how much of the Mill architecture so far revealed has so few “baked in” parameters. The ability to specify the length, width, height, scratchpad size, etc. parameters per member is a fascinating capability and — I suspect — will supply great leverage in product/market agility.
IMHO, changing the number of blocks per half bundle would be changing the recipe significantly, if an outsider who’s barely peeked under the hood can be allowed such.
This reply addresses LarryP’s question about “baked-in” (as opposed to member-dependent) aspects of the Mill.
Similarly, I’m curious about what other parameters are “baked” into the Mill architecture. If I recall correctly, the bit-width of the address space is another such. (64 bit pointers, with 60 bits of true address + four reserved bits for garbage collection and forking.) Since pointers must fit in (single) belt positions, it sounds like this requires a minimum height of 8 bytes for all Mill family members. The shortest belt length mentioned to date is 8 for the Tin. I suspect that a shorter belt length (e.g. 4) would effectively rule out wide issue (since results need to live on the belt for at least the spill/fill latency.)
Very little is baked-in. Cache-in-virtual is, so the address space must be big enough to hold the space of all processes without overlap, which implies 64-bit pointers for general-purpose work. Extending shared addresses off-chip might require 128-bit pointers in some future supercomputer, but that’s not GP. Likewise, single-process apps (some embedded for example) might fit in 32 bits or smaller, but that’s not GP either.
Belt size is not baked; it is set so that spill of single-use operands is rare. Eight might be too small even for a Tin, but we might leave Tin somewhat hamstrung with a too-small belt for market segmentation reasons anyway. All this tuning waits on real data.
Similarly, two streams of half-bundles (at least until we get that hypercube memory ) and three blocks to decode within each half bundle. Other baked-in Mill-architecture parameters?
This is not really baked in; you could add a clock to either side’s decode and get two more blocks if you had a use for them. Earlier Mills had four blocks on the exu side, but we merged two of them to save the clock. If, on some member, we find that a decode stage is clock-limiting the rest of the machine then we might split and rearrange the existing blocks, costing us another clock in decode but speeding up the overall clock rate. Whatever sim says works best overall.
The PIPELINE presentation is really interesting but the question of the epilog is not treated at all.
Ran out of time; sorry. We hope to get a white paper out at some point.
It’s hard to explain without the animations, but the general solution recognizes that if the exit condition depends on (or can be scheduled to depend on) all non-speculable operations in the body then the leave operation (conditional on the exit condition) replaces the epilog.
Answering one of the last questions, you explain a proposal for saturated arithmetic that requires adding a keyword and treating it specially. What would be wrong with using a C++ class with overloaded arithmetic operators that encapsulates the intrinsics you need? That would solve issues such as integer + saturated, since the C++ language already has rules for that (that you might not like, but at least they exist).
That’s a perfectly reasonable implementation path for C++.
The compiler will still have to recognise it for what it is in order to know to use the Mill’s saturating opcodes in the emitted binary.
More generally, many uses of saturating arithmetic are in DSP and graphics programs, which are often written in C.
The compiler will still have to recognise it for what it is in order to know to use the Mill’s saturating opcodes in the emitted binary.
My suggestion was to use intrinsics for that part. Something like:
#include <cstdint>                 // assuming u64/i64 are fixed-width typedefs
using u64 = std::uint64_t;
using i64 = std::int64_t;

struct sat64 {
    sat64(u64 x) : value(x) {}
    sat64(i64 x) : value(x) {}
    sat64 &operator+=(sat64 o) {
        value = __mill_sat64_add(value, o.value);   // hypothetical saturating-add intrinsic
        return *this;                               // return was missing in the original sketch
    }
    ...                                             // other operators elided
private:
    u64 value;
};
No need of anything special for the compiler.
Regarding the comment that programs are written in C: are they written in C that is so bad that it won’t compile with a C++ compiler?
Moreover, one would like a notation for literals, something like “17usl” for an unsigned saturating long, which cannot be done in either language even with overloading, and casts would be necessary.
First, just curious why you’d need literals? How is the binary representation of a 64 saturating int different from a 64 int?
Second, C++ operator overloading lets you deal correctly with X+Y, X+1 and 1+X. The only case missing is 1+1, and at least for saturation, the cases where it matters are few and can be spelled out explicitly, e.g. with sat64(1)+sat64(1).
Finally, C++11 introduced user-defined literals, which are I believe almost exactly what you ask for; see http://en.cppreference.com/w/cpp/language/user_literal.
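For what it's worth, here is a minimal sketch of such a literal for the hypothetical sat64 wrapper above; the _sat suffix is my invention and nothing Mill-specific.

// C++11 user-defined literal for the sat64 sketch above (hypothetical).
sat64 operator"" _sat(unsigned long long x) {
    return sat64(static_cast<u64>(x));
}
// usage: s += 17_sat;   // instead of s += sat64(u64(17)), and avoids the
//                       // ambiguity between the u64 and i64 constructors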
Ah yes – I recall reading that, but never used it and promptly forgot
I’m allergic to intrinsics, especially when they don’t overload. With five widths, signed and unsigned, that’s 10 functions per operation and 20 or so operations. Squared if you want mixed-mode support. Yuck.
If I understand it correctly, this conversion of literals is baked into Go itself and not extensible.
But it’s interesting that newer languages seem to converge on the “no implicit conversion between numeric types, but conversion of literals” point of view.
Just FYI, Haskell has a nice solution for extensible literals without suffixes: all numeric literals are implicitly applied (http://www.haskell.org/onlinereport/basic.html#numeric-literals) to an overloaded function fromInteger, which will be optimized away by the compiler when a concrete numeric type is known. So implementing all the saturated/signalling/etc. numeric types in Haskell together with overloaded arithmetic operations (the typeclasses already exist) would be really easy.
Of course, lazy evaluation is not everyone's cup of tea. Sometimes I wish there was a C variant with a Haskell-style type system.
So, Rust? I mean, I think traits are a little less powerful than full typeclasses with all the bells and whistles enabled, but they’re pretty good for most work. Shame that the Rust guys decided to keep the angle-bracket syntax, though.
My point is really that providing a library that does that seems like a solution past the “committee hell” that Ivan was talking about. You first create a C++ library that implements the semantics you want. And once this gets adopted and used widely, the C++ standard committee takes it and integrates it into the language. Probably with slight and subtle incompatibilities
When we get working chips with the hardware support, I suspect the language enthusiasts will have a field day incorporating Mill-isms into their languages. We’d be happy to support their efforts, with hardware and development assistance. For commercial reasons, our own work will focus on those languages that we ourselves or critical customers need.
Seeing how much work needs to go into the compiler to use the hardware well is relatively daunting for somebody looking at distinctly Not-C-family compilers for the Mill. However, I then remember LLVM. Since it has loop IR classes, I hope this means all the pipelining analysis & transformation you’re writing lives in the LLVM middle-end for any front-end to use?
Maintaining a language myself, I completely understand and empathize with the years of battling unclean semantics and finally having that eureka moment that makes it all so simple. We’re very slow to make language changes even for obviously good ideas for precisely that reason.
It was nice to see mentioned that my Lisp & JavaScript ABI questions are still points of consideration, even if they’re well outside the initial assumptions of the Mill. In the meantime, a lot of consideration on my end has taken a JVM style approach of tracking common usages and issuing runtime recompiles to optimize specific parameter layouts.
It would be great to see some “ah hah!” allowing dynamic languages (specifically flexible calling conventions & multiple stacks) to slot into direct Mill support. But if not, it may not matter too much. These last few decades have witnessed compiler advances delivering effectively native speeds for dynamic languages on existing architectures so far.
Again, a great talk, and it’s wonderful to see all these ideas be able to coalesce in our minds, filling in all the previously NYF gaps.
The Mill pipelining code in the middle-end will be contributed. However, we don't expect there to be much. As I understand it, LLVM already does pipeline analysis, and we will use that. Our job will be to remove/bypass the LLVM code that actually does pipelining transformations and simply pass on the fact that pipelining is applicable, plus detected parameters such as the iteration interval, to the specializer where the needed transformations will be applied.
The specializer is coming along nicely. LLVM is still too unfamiliar to project a schedule.
I thank you for your comments on dynamic language issues. They prompted internal design work that has led to a new approach for VMs on the Mill. NYF though
Unbounded ILP in loops vs. average iterations in loops?
In several of the talks, Ivan has said words to the effect that,
“loops have unbounded ILP.”
Unbounded ILP would seemingly be true only for unbounded (e.g. while(1) {}) loops; for all others (whether simple counting loops or some form of while loop), the ILP is roughly proportional to the loop’s number of iterations — and thus bounded. Have I missed something, or am I being too pedantic?
I’ll freely concede that loops, including loops repeated enough to benefit from pipelining, are frequent enough in general-purpose code that an architecture that can pipeline virtually all loops should be a genuine boon. So, unbounded ILP or not, I’m really looking forward to more info on epilogs and LLVM progress for the Mill.
Well, now that you ask, I’ll go with “pedantic”
Yes, the total number of operations performed by a loop is the product of the number of operations per iteration and the number of iterations, which is finite if the loop ever terminates. Termination is in general undecidable but is assured for many loops of interest.
I prefer to avoid a 20-minute digression into the halting problem in a public general-interest talk. The point of the “unbounded” remark is to refute the widely bandied-about claim that MIMD (and width in general) is pointless because there is only 2X ILP in code. The data of the studies showing 2x is reasonable, but the interpretation of that data is misleading and the design conclusions are flat-out wrong.
Yes, there are loops with iteration counts so low (like, one) that the total operations count cannot fill a MIMD design. But by definition such loops are not where the program spends its time, and are irrelevant. The loops that matter all last long enough to be piped. More, they last long enough that no practical MIMD design can run out of iterations, so from the designer’s point of view, yes, they are unbounded.
The 2x claim has been used to hamstring designs in open, non-loop code too. To go from saying “there are on average only two operations that are independent of each other and hence can issue together” to saying “more than two pipes are wasted” is to assume that issue must be sequential. Open code on the Mill gets 6 operations in an instruction from 2x-ILP code because phasing lets us have three steps of a dependency chain going at once. The 2x data is not wrong, but the conclusions drawn from that data show a woeful lack of imagination.
Ivan,
Thanks for clarifying. “Effectively unbounded ILP in the loops that matter” may be more accurate, but admittedly doesn’t have the punch that “unbounded ILP” has. IMHO, Intel’s increase to eight micro-ops peak per cycle (e.g. in Haswell) shows that they think there’s enough detectable parallelism in x86 code to warrant the increased power and chip area required.
I have tinkered with the Software Pipelining approach for the last few months and I think it’s a good idea to do it the Mill way instead of using the usual modulo technique.
I have a few open questions though and I’ll start with the most high-level one:
Which part of the SWP transformation happens in the compiler vs. in the specializer?
I’m currently thinking of these transformation steps:
* determine if SWP is applicable to this loop
* determine steady state
* prime values with NaRs
* add SPILL/FILL for long term dependencies
* add computation of condition vector
* add PICKs
* … whatever I forgot
One must distinguish pipelining from vectorization; the two are often conflated but present very different issues.
Essentially any closed loop can be scalar pipelined on a Mill, although it may not be worthwhile. Because the compiler does not have all the code available (due to specialize-time emulation substitution and inlining of external functions and templates) we must determine the applicability of SPL in the specializer; this is done after all the control-flow folding has been done and the code is ready to schedule. If-conversion (including pick) is done at this stage, as well as conventional control-flow simplification like jump-to-jump elimination. We also flop the sense of branches so that the fall-through case is the likely one (if profile info is available) or the longest path (which gives better schedules). We also recognize loops at this point, and replace exiting branches with leave ops and put an inner op in the loop head (the dominator that is the entry to the loop).
The heuristic for SPL is straightforward: we count the latency of one full iteration (following the dataflow of execution) as if it were straight-line code, and count the number of ops in that code in the various resource categories. Because we know how many resources (of each category) are present on the target, we know the minimal number of issue cycles that would be needed if every op were completely packed into the available width. If the linear latency (following the dataflow) is twice or more the packed latency (maximal entropy) then we SPL. The “twice” is a heuristic; we could pack three linear into two packed, for example, but we don’t have enough experience to know if it’s worth it.
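As a rough illustration of that heuristic, here is a sketch in ordinary C++; the names and data structures are invented, and this is not the actual specializer code.

#include <algorithm>
#include <map>

// linearLatency: latency of one full iteration scheduled as straight-line
//                code, following the dataflow (computed elsewhere).
// opsPerCategory / slotsPerCategory: op counts in the loop body and issue
//                slots on the target member, keyed by resource category.
bool shouldPipeline(int linearLatency,
                    const std::map<int, int>& opsPerCategory,
                    const std::map<int, int>& slotsPerCategory) {
    // Packed latency: minimal number of issue cycles if every op were packed
    // into the available width of its resource category (maximal entropy).
    int packedLatency = 1;
    for (const auto& kv : opsPerCategory) {
        int slots  = slotsPerCategory.at(kv.first);
        int cycles = (kv.second + slots - 1) / slots;   // ceiling division
        packedLatency = std::max(packedLatency, cycles);
    }
    // The heuristic: pipeline when the linear schedule is at least twice the packed one.
    return linearLatency >= 2 * packedLatency;
}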
Scheduling is pretty much the same as for open code, except that open code uses an extensible linear tableau to allocate slots to ops, whereas SPL uses a slice of the tableau that is packed-latency high and schedules it as a torus, with wraparound as it looks for free slots. A feasible schedule is guaranteed to exist by the definition of the packed latency, but the placement is greedy and does not necessarily find a feasible placement (the bin-packing problem is NP). If it fails then the packed latency is increased by one and the whole mess is rescheduled. This is guaranteed to lead to a feasible placement, because eventually the increase in packed latency will be enough for an open-code placement, which is always feasible.
Once placement is done, we calculate belt lifetimes of intermediates. This is the same as for open code, although we must distinguish the intermediates from different iterations that are simultaneously live. As with open code, those lifetimes are measured in terms of relative position on an infinite belt, rather than in terms of cycles, and if the relative position between producer and consumer exceeds the actual belt size then we insert spill and fill ops at the producer and the first consumer that needs a lost operand. There can be several of these of course. If this is the first spill/fill we add a rotate op at the top of the loop and put a scratchf op (scratchpad allocation) in the loop head; these get fixed up later when we know how much scratchpad will be used. If we added any spill/fill then we throw away the schedule and redo everything from the top with the modified code. This too is guaranteed to reach a feasible schedule, because the limit is for every produced value to be spilled and the belt reduced to a single position. In practice, one reschedule is needed ~20% of the time, two in single-digit percents, and we’ve not found any code that needs more.
The torus schedule is a normal schedule and may embed control flow, and represents the steady state. Any dataflow on the back arc to the entry point represents either a recurrence (for example the loop control variable) or a loop-carried temporary. The recurrences are those with a dataflow from the loop head (or earlier) into the loop; those with flows in from the head but not on a back arc are initialization, not recurrences. Initialization flows are handled like any other dataflow across a join point, and may require injection of a conforms op to get all the arcs into the top of the loop to have a congruent belt. Back-arc-only flows require creation of a None value in the loop head, which acts as an initializer just like explicit initialization. And true recurrences need both the initial value and the injection of a recurs op at the join. We could be smarter about recurs injection; it’s only needed if a None can reach the recurrence. Truth be told, there’s probably other cases where we could be smarter, but by and large the resulting code looks decent.
While SPL needs very little from the compiler and is mostly specializer only, vectorization needs a lot from the compiler. We need the compiler to tell us the iteration interval, the expected iteration count (if it can tell), and the strides of all recurrences. These are hard for the specializer to figure out, so we don’t try to recognize vector loops in the specializer, so far anyway. From the target description we know what the vector size will be for each reference, and from the iteration interval we can tell if that size is feasible or if we have intra-vector recurrences. There are ways to handle such recurrences in the lit, but don’t hold your breath. If it’s feasible we transform the loop to one that is still a scalar loop but where each “scalar” is a vector of the appropriate size; this involves scaling the control variable calculations. The smear operations for while loops are inserted at this point, and the trailing reductions when the loop yields a vector that must be reduced to a scalar. The result is just a normal loop, but with some funny operations in the code, and is scheduled (including SPL) normally.
Only some of all this is working as I write this; in particular none of what the specializer needs from the compiler is yet available. If you have compiler experience (especially LLVM) and are willing and able to put in significant time on a sweat-equity basis then send me a resume.
Thank you for your very detailed answer.
You flatter me with your offer to contribute on the LLVM side, but I’m really not that deep into compilers; I’m a hardware designer with just a basic course in compilers back at university.
During the last few weeks I was peeking into the LLVM optimization stages to find out what is there already and what’s missing from my POV. I have a completely different microarchitecture in mind, but it shares some of the problems of producing optimal code with the Mill (in fact only the loop portion; I handle open code differently).
As to your comments:
I understand that most SWP code generation steps must happen in the specializer. Here’s a bunch of comments on your answer:
Why do you need the heuristic? If you compare the latency of code scheduled following the dataflow with the latency the SWP schedule gives you, you can select the better option every time. Is it to save compile time?
If I were to generate the packed schedule, I’d formulate it as a MIP (Mixed Integer Program) and let the solver do the hard part for me. IMO, using a greedy algorithm at the prototype stage is fine, but it has to be replaced by an optimized algorithm in most cases anyway.
In your algorithm to determine spills and fills you only guarantee a feasible schedule in the degenerate case. If there are too many inter-loop dependencies it might get spiller-bandwidth limited. In these cases it might be worthwhile to additionally use the caches on top of the scratchpad. Of course this introduces the well-known problems with caches.
The None-insertion matches what I figured very closely. I was thinking about using a proof-engine to check reachability before inserting the recurs statements.
Perfect scheduling for any machine (or more generally for any set of resources and consumers) is known to be NP. You can detect when you have found a perfect schedule, but unless the best possible schedule is also a perfect schedule you cannot tell whether you have the best schedule without exploring all other schedules. A solver changes the representation of the problem but not the difficulty. Good heuristics are the goal, and in practice they work fine.
Feasibility is guaranteed because it is evident in the degenerate case and the algorithm will reduce to the degenerate case if it does not find a feasible solution earlier. This does not mean that it always produces the degenerate case; in fact it never does. One of the problems with CS results is that they are good at determining the limits of an algorithm, but determining and proving average behavior is much harder 🙂
We don’t cache the scratchpad because deterministic latency is critical to its function. If varying latency were useful then we would use the memory cache and eliminate the scratchpad.
How will the compiler convey pipelining and/or vectorizing info to the specializer?
Ivan or other Millcomputing folks,
Can you tell us how the compiler will convey information about pipelining and vectorizing loops to the specializer? From the GenAsm EBNF grammar on the Wiki, it looks like there are several ways you could do this, but I’ll simply ask (rather than speculate, per forum rules.)
Sorry, NYF. There will be a talk on compilation as soon as we can get to it, expected to be in Seattle.
I re-watched this talk, and there was one thing that was not answered:
1. What would a “break” (“leave” instruction?) out of a nested loop look like in Mill code? And:
2. How would a Python-style for-break-else be done in Mill code?
In C/C++, either would be written using a “goto”. I think the two above are the most common uses of “goto” and therefore more often permitted in coding standards than any other use.
Java has labelled loops, where you would “break <label>” to break out of a nested loop. You could also do (2) in Java with nested loop blocks where the outer loop has only one iteration.
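(For readers less familiar with the C idiom mentioned above, here is what those two goto patterns look like in plain C/C++. This is ordinary source code, not Mill asm, and says nothing about how the specializer would map it onto leave.)

#include <cstdio>

// 1. break out of a nested loop, and
// 2. Python-style for-break-else: the "else" code runs only if no break occurred.
void find(const int m[4][4], int want) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            if (m[i][j] == want)
                goto found;               // break out of both loops
    std::printf("not found\n");           // the for-else ("no break") part
    return;
found:
    std::printf("found\n");
}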
In case you haven’t found this yet: Intel proposes a Software Pipelining mechanism very similar to the one described here for its AVX512 extension (AVX512 SPL): “AVX512 based Software Pipelining method”.