Forum Replies Created
- AuthorPosts
- in reply to: Loop pipelining and aliasing #1178
Greetings all,
Regarding pipelining loops, deferred loads and loop-carried data:
1. I’m glad a confusing issue was caught by an alert audience member, and I’ll look forward to the corrected slides/video as soon as posted. With the noticed mistake and the resulting need to redo slides, plus extra video editing, I suspect that we won’t have the video + animated slides online for at least a couple weeks. 🙁
IMHO, better to be later and correct, than quick and confusing. (Though I’m still waiting with imperfect patience.)2. If I understand correctly, the number of *simultaneously-executing* deferred loads varies considerably among family members, because family members differ in the number of traffic-monitoring load buffers they have. Since the Mill intermediate representation apparently compiles for an abstract Mill — and thus doesn’t now how many (aliasing-immune) load buffers the specific target has, then how does the combination of the compiler and specializer handle the case where the compiler’s implementation of a pilelined loop requires more *simultaneous* in-flight deferred loads than the target hardware has to do such loads?
It seems to me that the compiler will consider most loops pipeline-able for the *abstract* Mill and thus emit its intermediate representation for a fully-software-pipelined loop — only to have the specializer potentially need to (partially, or worse fully) “de-pipeline” it, due to the target’s constraint on the number of simultaneous in-flight loads it can do.
How will the compiler and specializer jointly handle this case? Is this one of the cases where the compiler has to emit a series of alternative sequences for the specializer to choose from? Or can the specializer always implement any software-pipelined loop sequence, though it may need more instructions to do the pipeline-required loads using its limited load hardware?
This issue about whether/how-to software pipeline loops for a given target member sounds like (a) rather a common case, especially given the Mill’s key goal to be the CPU that performs dramatically better on loop-heavy code and (b) one with great variability (that the specializer will have to handle well), given the range of family member’s and their functional-unit populations.
IMHO, the details will need to wait for talks on the compiler and specializer, but I’m hoping for some early/high-level clarification that the Mill’s software pipelining of loops will work as close to optimally as possible on the full range of Mill targets — without additional/major surgery on the compiler and specializer.
Any comments would be most welcome.
Passing more than a belt’s worth of arguments into/out of functions?
How do functions that require more than a belt’s worth of arguments get those input args that don’t fit on the callee’s belt? Conventional machines typically pass these via the stack. But stacks for Mill functions, so far as I understand them, are function local and always initially zeroed. So passing the extras via the stack sounds like it can’t work for this case. One of the belt-passed input parameters could be a pointer the rest of the arguments, somewhere else in memory. Is that how this case is handled, or is there some other mechanism? I know that huge argument lists are considered poor form (so this situation should be rare), but the machine + compiler will have to to handle this case. Are output parameters beyond what fits on the belt handled the same as excess input args?
Does the pick operation get its third operand via ganging?
The pick operation takes three operands:
pick(<mask, <shuffled>, <rd>)
Since functional units (and the crossbars) are set up for two inputs per F.U., does the pick operation use ganging, or is there some other way for pick to get its third operand? If pick uses ganging to get its 3rd operand, what types of FUs can do Args, and thus gang with pick?
- in reply to: Specification #1093
Ivan,
Thanks much for your detailed answer, and on a holiday no less! If I understand you correctly, having a single address space (of whatever width) is fundamental to being a Mill (versus being some other flavor of belt machine.) However, the number of bits required for a Mill’s address space is not intrinsically (and forever) 64 with four reserved.
> So I guess the answer to your final question is: eventually possible and very low priority.
Thank you; I will try to cultivate patience.
- This reply was modified 10 years, 6 months ago by LarryP. Reason: clarity
- in reply to: Instruction Encoding #1088
In the specification talk, Ivan mentioned that the encoding was optimal-entropy encoding, with otherwise-unused bit patterns used to encode common/useful constants. On the Mill, there are separate encodings for the two instruction streams. Does optimal-entropy encoding therefore imply that there are constant-supplying operations in both instruction streams?
- in reply to: Specification #1087
After watching the specification talk, I got the impression that a key parameter of the Mill’s architecture, namely the bit-width of pointers (64, with 4 bits reserved) was not a free parameter, but is a “hardwired” part of the Arch. Are pointer width (and the number of reserved address bits) hardwired, or are those both parameters that I managed not to see in the specification?
64-bit wide pointers makes sense for general-purpose computing and for competing on the high end. However in the higher end of the embedded CPU world (e.g. 32-bit ARM and MIPS based chips), which has large and increasing volume, 64-bit wide pointers would hurt code density, especially for processors with all their memory on chip. Even with 4 bits of a pointer reserved, as in the Mill as shown in the presentations, 2^28 bytes of address space would dwarf the memory on such chips, and will for quite some time. (Right now, on-chip memories in such chips are typically a megabyte or less.) Is a 32-bit Mill completely out of the question, or is it eventually possible, but simply a very low priority?
I’m curious:
How does/will the Mill implement the semantics of C’s volatile qualifier?
From what I’ve gathered from watching the talks, the Mill always loads data from the cache hierarchy, not directly from main memory. However, the data in volatile variables (for example memory-mapped peripheral registers), can change independently of CPU activity, so cached values of volatile variables may well not be correct.Is there a mechanism, such as designating certain memory regions as non-cacheable, to ensure that the Mill implements the semantics of volatile variables? A forum search turned up a single forum entry that contains the word volatile. In Mr. Godard’s reply #733 in topic “Side channel timing attacks” he writes:
One of those flags is “volatileData”, which forces the request to bypass private caches and go to the top shared cache.
However, going to the top shared cache (instead of to main memory itself) doesn’t appear to implement the semantics of C-language volatile variables. Since the Mill is targeted at existing code bases, which include C and C++, I assume that there is (or will be) a way to preserve the semantics of volatile. If it can be revealed at this point, I’d very much like a confirmation that the Mill will preserve the semantics of the volatile qualifier — and, if possible, how the Mill handles volatile.
Thanks in advance for any reply.
Windmills and waterwheels are not apt.
The name comes from Babbage, who separated his design into two parts he called the mill and the store.
I’m not expert at etymology or Babbage’s work, however I strongly suspect that Babbage’s use of the word mill for part of his invention was due to its strong connotations as the ur-machine (where “ur-” means original/earliest.) And the origins of the word mill trace back to mills for grain. Thus, I’m not sure that a logo based on a windmill or a water-wheel is such a bad idea. Since wind has connotations of speed, a windmill-derived logo might be a good way to go.
- in reply to: Loop pipelining and aliasing #1184
Ivan,
Thanks for your detailed explanations about how the compiler and scheduler work together to implement pipelined loops! I found your detailed explanation very helpful in understanding more about how the specializer schedules code and handles resource allocation/tracking as part of its scheduling. The explanation about blocking ranges (and the impact of non-pipelined functional units) on the number of instructions needed to implement a loop body makes clear the downside of having a dedicated, long-latency functional unit (like some implementations of monolythic divide) to a Mill machine’s overall performance.
From what you wrote above, the answer to my previous questions appear to be:
How will the compiler and specializer jointly handle this case?
A. Quite well indeed.
(To this reader, it sounds like there’s already been lots of sim results and tests that have exercised the existing compiler, specializer and simulator on a whole bunch of loop-heavy code. So there’s a body of NYF/NY-disclosable evidence that the tools and software pipelining of looks works well and will continue to do so. No hand waving involved.)
Is this one of the cases where the compiler has to emit a series of alternative sequences for the specializer to choose from? Or can the specializer always implement any software-pipelined loop sequence, though it may need more instructions to do the pipeline-required loads using its limited load hardware?
A. The compiler could issue alternative sequences, but at present there appears to be no need to do so for loops.
A. The specializer can schedule any intermediate-level “code” the compiler gives it — for any Mill target — including for all known and foreseen pipelined loops. On lower-end Mills, with fewer slots/functional units (including fewer load units and retire stations), the pipelined loop body may, and probably will, take more instructions to do the loop body, as should be expected.
—-
Incidentally, loads are uncommon in Mill software-pipelines, for reasons that are NYF
Especially given the example, that iterates through an array, I’ll be very interested in hearing about how/why loads are uncommon in Mill software pipelines! Of course, I’m already eagerly awaiting the availability of the pipelining talk, and I’ll hope that the post-production work goes smoothly and you don’t have the hassle of re-shooting any video.
Thanks again for your cogent and detailed responses,
Larry
- in reply to: Hard/soft realtime #1052
Greetings all,
I found Mr. Godard’s post in another thread,
http://millcomputing.com/topic/memory/#post-623about the spiller quite informative.
My impression is that when the spiller’s SRAM is full and and a call executes, the spiller must spill either enough SRAM to the L2 cache to save the caller’s entire state (Is frame the correct term in this context?) or enough of the caller’s frame to make room for the callee’s frame. So, the worst-case latency for a call should depend on the the largest permitted frame size and the time to write the corresponding number of cache lines to L2. Since the belt contents are a key part of a function’s state, a mill with a long belt may have a longer worst-case call latency than a model with a shorter belt.
Of course, if the (first cache line of) callee’s code isn’t already in cache, there will also be a time hit to bring that in. In the case of an insufficiently frequent interrupt or exception, the latency to bring in the first cache line of the corresponding handler may dominate worst-case interrupt latency.
- in reply to: Hard/soft realtime #1038
Mr. (Dr.?) Godard,
First off, thanks much for your reply to my question! Indeed, clarification helps; please continue with that, if you would.
In one of your talks describing how the Mill does function calls, you said words to the effect that, “The Mill provides each function call with the (illusion of) having a completely new belt, pre-populated with that function’s arguments.” I seem to recall that in the same talk, you mentioned that this behavior was accomplished by simply changing (incrementing?) the “BeltID,” and voila, new belt.
What I’m trying to learn is whether this behavior of the Mill (that all functions start with a new, pre-populated-with-args belt), however mechanized, implies some finite on-chip resource (e.g. counters of finite width, or a finite number of hardware-realized belt structures, etc.) that could be exhausted by, for instance, a deeply-nested sequence of short function calls, such that the Mill would eventually have to pause to spill some “state of multiple belts” to memory, in order to maintain the (illusion of) each new function starting with its own pre-populated belt. It is the HRT consequences of this sort of “save/restore” (if it exists in the Mill) that I’d like to better understand.
My thoughts on this may be contaminated by assumptions from other architectures (hardware stack machines, or SPARC’s large-but-finite number of windowing registers.) If there truly no resource limitation on the Mill’s “depth of stack-able belts”, other than the spiller’s memory bandwidth and memory size, that’s wonderful! In which case, I’ll be eagerly awaiting more detail on how the spiller works.
Since I haven’t yet seen anything in the Mill architecture that corresponds to a conventional stack pointer, may I request that you (or somebody else knowledgeable) add a bit more info to the glossary thread about what a Mill stack is and how Mill stacks, frames and BeltIDs work, as soon as possible, modulo all the patent filings to be done.
I’ll guess that there must be (more-persistent-than-belt-entry) special registers or their functional equivalents lurking under the hood, even if the programmer’s model cannot see (all of) them. The instruction/bundle pointer seems like one, and a stack pointer (or its functional equivalent) would seem to be another. My apologies if I’ve misinterpreted what I’ve read and watched.
Thanks much for your time, attention and reply!
- in reply to: Hard/soft realtime #1031
Prior to finding this topic, I asked a question (under topic execution) about the Mill’s behavior when the maximum number of belt IDs is exhausted, e.g. by a deeply nested sequence of calls. I assumed that some sort of internal-state spill/restore is needed, and Will_Edwards replied that this sort of internal-state spill is handled by the spiller and is both automatic and lazy.
For HRT, it’s important to know the time cost for such digressions, especially if there’s limited ability to monitor and/or control when they happen. What can you tell us about the behavior for “recycling” the finite number of Belt IDs and how this will effect the Mill’s real-time performance? Does the architecture permit the programmer to check how empty/full this resource is?
Please forgive me if these questions have been answered elsewhere. But since this topic mentions both logical/physical belt positions as well as frame IDs, I hope it’s OK to pose here.
However the Mill implements the belt and new-belt-per-function-call, there’s a finite number of belt IDs in any given implementation.
Q0: Correct?I assume that a sufficiently deep series of function calls will need more belt IDs than the (particular family member’s) hardware supports.
Q1: In which case, what is the Mill’s behavior?I’ll guess that some mechanism (the spiller or a close relative) has to offload a sizable chunk of internal state, so there will be available Belt IDs for the rest of those function calls. (And the spilled Belts/Belt-ID-state must eventually be refilled to complete the deeply-nested series of calls.)
Q2: If the Mill occasionally has to perform such Belt-ID spill/fill operations, does the compiler have any choice about the granularity of the spill/fill (e.g. a full spill, half, quarter, exactly Z beltIDs), or is the entire belt-ID internal state always spilled? The reason I ask is that infrequent, expensive digressions can cause problems for workloads with tight timing constraints. (And I’m hoping the Mill can be applied to such workloads, in a reasonably clean way.)
Thanks in advance for any explanation.
- AuthorPosts