Hard/soft realtime
The static scheduling on the Mill and the vastly cheaper context switches make it look like a very promising target for (at least for soft) realtime applications. Some questions about the static scheduling…
Given an EBB specialized for a specific family member, is it possible to statically determine the worst-case number of cycles taken by that EBB? Everything except loads has fixed latency, and we can just assume a DRAM access to get the worst-case load time.
Possibly related question: can phasing cross EBBs? On x86 the performance of a basic block can vary wildly based on what basic block was executed just before, and if phasing occurs across EBBs it seems like you could get similar effects on the Mill.
> Given an EBB specialized for a specific family member, is it possible to statically determine the worst-case number of cycles taken by that EBB? Everything except loads has fixed latency, and we can just assume a DRAM access to get the worst-case load time.
Yes 🙂
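A rough sketch of what such a bound might look like: the schedule length of an EBB is fixed at specialization time, so the only variable part is load stalls. All numbers below are invented placeholders, not figures for any Mill member.

    #include <stdio.h>

    /* Invented placeholder figure -- not a number for any real Mill member. */
    #define DRAM_WORST_CYCLES 200   /* assumed worst-case load round trip */

    /* The EBB's schedule length is known at specialization time; a crude
     * bound charges every load a full DRAM access on top of its scheduled
     * delay. */
    static unsigned ebb_worst_case_cycles(unsigned schedule_len,
                                          unsigned n_loads,
                                          unsigned scheduled_load_delay)
    {
        unsigned stall = DRAM_WORST_CYCLES > scheduled_load_delay
                             ? DRAM_WORST_CYCLES - scheduled_load_delay
                             : 0;
        return schedule_len + n_loads * stall;
    }

    int main(void)
    {
        /* e.g. a 9-cycle EBB with 3 loads scheduled 5 cycles before use */
        printf("bound: %u cycles\n", ebb_worst_case_cycles(9, 3, 5));
        return 0;
    }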
> Possibly related question: can phasing cross EBBs? On x86 the performance of a basic block can vary wildly based on what basic block was executed just before, and if phasing occurs across EBBs it seems like you could get similar effects on the Mill.
Well, it doesn’t impact the timing of the called or branched-to EBB. It’s all fixed latency and deterministic.
Well, not quite. There is also the possibility of I$ misses. As a general rule you can’t do Hard Real Time if anything is cached.
The standard way to deal with the issue is to pin the HRT on-chip, either by pinning lines in the cache; by turning all or part of the cache into scratchpad; or by using NUMA with the HRT in an uncached but on-chip level, typically SRAM or 1T on-chip DRAM. All these approaches work as well on the Mill as for any other architecture. And perhaps somewhat better (especially if the chip is being used for both HRT and background regular apps), due to the low cost of interrupts, task switch, and inter-process security.
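For the on-chip placement approaches, a generic embedded-style sketch (not Mill-specific) is to put the hard-real-time working set in a dedicated linker section that the board’s linker script maps onto uncached on-chip SRAM; the section name and the script are assumptions here.

    #include <stdint.h>

    /* Hypothetical section name; the linker script is assumed to map
     * ".hrt_sram" onto uncached on-chip SRAM, so the HRT working set
     * never depends on cache behaviour. */
    #define HRT_SRAM __attribute__((section(".hrt_sram")))

    static HRT_SRAM volatile uint32_t sensor_samples[256];
    static HRT_SRAM volatile uint32_t control_output;

    void hrt_tick(uint32_t sample, unsigned idx)
    {
        sensor_samples[idx & 255u] = sample;
        /* ...deterministic-latency computation... */
        control_output = sample >> 1;
    }

Pinning cache lines or carving part of the cache into scratchpad would go through member- or SoC-specific mechanisms not shown here.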
Prior to finding this topic, I asked a question (under topic execution) about the Mill’s behavior when the maximum number of belt IDs is exhausted, e.g. by a deeply nested sequence of calls. I assumed that some sort of internal-state spill/restore is needed, and Will_Edwards replied that this sort of internal-state spill is handled by the spiller and is both automatic and lazy.
For HRT, it’s important to know the time cost for such digressions, especially if there’s limited ability to monitor and/or control when they happen. What can you tell us about the behavior for “recycling” the finite number of Belt IDs, and how will this affect the Mill’s real-time performance? Does the architecture permit the programmer to check how empty/full this resource is?
There’s a bit of confusion here. The belt is not addressed by an ID drawn from a pool; it is addressed by position number. The most recent operand to drop to the belt is by definition b0. It does not have a permanent ID; as subsequent drops move it along the belt it becomes b1, b2, …, bN.
Similarly, stack frames do not really have an ID, they are indicated by the ordinal stack depth. Thus the very first frame in a stack is frame zero, and so on. These too do not get recycled, although you can think of the automatic reuse (as return operations cut back the stack and later calls add new frames) as being a form of recycling.
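A toy software model of that naming scheme, purely as illustration: a drop makes the new operand b0 and renumbers everything behind it, just as a frame is identified only by its ordinal depth in the stack.

    #include <stdio.h>

    #define BELT_LEN 8           /* illustrative length, not a real member's */

    /* slot[0] models b0 (most recent drop), slot[1] models b1, and so on. */
    struct belt { int slot[BELT_LEN]; };

    /* A drop renumbers every existing operand; the oldest falls off the end. */
    static void drop(struct belt *b, int value)
    {
        for (int i = BELT_LEN - 1; i > 0; i--)
            b->slot[i] = b->slot[i - 1];
        b->slot[0] = value;      /* the new operand is b0 by definition */
    }

    int main(void)
    {
        struct belt b = {{0}};
        drop(&b, 10);            /* 10 is b0                     */
        drop(&b, 20);            /* 20 is b0, 10 is now b1       */
        drop(&b, 30);            /* 30 is b0, 20 is b1, 10 is b2 */
        printf("b0=%d b1=%d b2=%d\n", b.slot[0], b.slot[1], b.slot[2]);
        return 0;
    }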
The only pool IDs on a Mill are the thread and turf IDs. Because creating a new thread or turf is an infrequent event, allocation and recovery can be a more expensive process. The actual mechanism involves hardware allocation from a pool of IDs which is kept full by a lazy background maintenance thread, but details are NYF.
Consequently neither belt nor call frame IDs have any HRT impact at all, nor does thread or turf allocation, although thread and turf recovery does require some (very small) spare background execution outside the HRT itself.
The spiller that Will described is concerned with the data, not with the data’s name (ID). It is possible in a deeply nested sequence of prompt calls with lots of belt arguments (or rapid exit from such deep frames) to overrun the spiller’s capacity to move data. However, the constraints are the same memory bandwidth constraints that are faced for any other use of bandwidth; the spiller does not add any demand that would not have been needed by equivalent loads and stores on a register machine saving its registers. So if a HRT program hits the spiller bandwidth limit then it would also have hit the load/store bandwidth limit on a conventional. The Mill is pretty good, but it cannot magically create memory bandwidth that the DRAM chips don’t supply. If you hit the spiller bandwidth limit, your only solution is to buy a Mill family member with more memory pins, just as you would have had to do on an x86, ARM, or whatever.
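A back-of-the-envelope check of that limit might look like the following; every figure is an assumption chosen only to show the shape of the calculation, not a Mill or DRAM spec.

    #include <stdio.h>

    int main(void)
    {
        /* All assumed figures, for illustration only. */
        double bytes_per_call   = 64.0;    /* belt operands + linkage spilled */
        double calls_per_second = 50e6;    /* pathologically call-heavy code  */
        double dram_bytes_per_s = 25.6e9;  /* assumed DRAM bandwidth          */

        double spill_bw = bytes_per_call * calls_per_second;
        printf("spill traffic: %.2f GB/s of %.2f GB/s DRAM (%.1f%%)\n",
               spill_bw / 1e9, dram_bytes_per_s / 1e9,
               100.0 * spill_bw / dram_bytes_per_s);
        /* A register machine saving and restoring its registers around the
         * same calls would need comparable load/store bandwidth. */
        return 0;
    }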
Mr. (Dr.?) Godard,
First off, thanks much for your reply to my question! Indeed, clarification helps; please continue with that, if you would.
In one of your talks describing how the Mill does function calls, you said words to the effect that, “The Mill provides each function call with the (illusion of) having a completely new belt, pre-populated with that function’s arguments.” I seem to recall that in the same talk, you mentioned that this behavior was accomplished by simply changing (incrementing?) the “BeltID,” and voila, new belt.
What I’m trying to learn is whether this behavior of the Mill (that all functions start with a new, pre-populated-with-args belt), however mechanized, implies some finite on-chip resource (e.g. counters of finite width, or a finite number of hardware-realized belt structures, etc.) that could be exhausted by, for instance, a deeply-nested sequence of short function calls, such that the Mill would eventually have to pause to spill some “state of multiple belts” to memory, in order to maintain the (illusion of) each new function starting with its own pre-populated belt. It is the HRT consequences of this sort of “save/restore” (if it exists in the Mill) that I’d like to better understand.
My thoughts on this may be contaminated by assumptions from other architectures (hardware stack machines, or SPARC’s large-but-finite set of register windows). If there is truly no resource limitation on the Mill’s “depth of stack-able belts”, other than the spiller’s memory bandwidth and memory size, that’s wonderful! In which case, I’ll be eagerly awaiting more detail on how the spiller works.
Since I haven’t yet seen anything in the Mill architecture that corresponds to a conventional stack pointer, may I request that you (or somebody else knowledgeable) add a bit more info to the glossary thread about what a Mill stack is and how Mill stacks, frames and BeltIDs work, as soon as possible, modulo all the patent filings to be done.
I’ll guess that there must be (more-persistent-than-belt-entry) special registers or their functional equivalents lurking under the hood, even if the programmer’s model cannot see (all of) them. The instruction/bundle pointer seems like one, and a stack pointer (or its functional equivalent) would seem to be another. My apologies if I’ve misinterpreted what I’ve read and watched.
Thanks much for your time, attention and reply!
A personal note first – it’s funny for me to be addressed as Doctor Godard. I come from the generation before ubiquitous computers – during my brief and inglorious college career, my well-respected college did not own a computer. Though I have taught computer science at the graduate and post-doc levels, I have never taken a course in the subject and have no degree of any kind. We learned the subject because some co-worker was willing to sit the green kid down in front of a whiteboard and explain how it really worked. I try to do the same today.
Speaking of which, in answer to your question: there is no constraint on the spiller or core speed other than spiller bandwidth. This is true for spilling scratchpad and internal state as well as for spilling belt operands.
About special registers:
There is a suite of specRegs, some always present and some only present in some member configurations. These are internal registers contained in the logic of the core proper and located where they are needed; they are not contained in a register file. Some of these are frame-save, some thread-save, some not saved at all; the spiller does what saving is needed. Some are visible to the rd() and/or wr() operations, and most are visible in the MMIO map. At some point we will have doc on all of them, but we’re a little busy 🙂
I get the impression that memory for stack will run out before possible belt IDs, so that combined with the spiller running asynchronously is what leads to “no constraint on the spiller or core speed other than spiller bandwidth.”
Greetings all,
I found Mr. Godard’s post in another thread, http://millcomputing.com/topic/memory/#post-623, about the spiller quite informative.
My impression is that when the spiller’s SRAM is full and a call executes, the spiller must spill either enough SRAM to the L2 cache to save the caller’s entire state (is “frame” the correct term in this context?) or enough of the caller’s frame to make room for the callee’s frame. So, the worst-case latency for a call should depend on the largest permitted frame size and the time to write the corresponding number of cache lines to L2. Since the belt contents are a key part of a function’s state, a Mill with a long belt may have a longer worst-case call latency than a model with a shorter belt.
Of course, if the callee’s code (or at least its first cache line) isn’t already in cache, there will also be a time hit to bring that in. For an infrequent interrupt or exception, the latency to bring in the first cache line of the corresponding handler may dominate the worst-case interrupt latency.
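Written out as arithmetic, that estimate is just the number of cache lines in the largest frame times the cost of writing a line to L2, plus one instruction-cache miss; the reply below explains why the real spiller does not actually work at frame granularity. All figures here are invented.

    #include <stdio.h>

    int main(void)
    {
        /* All assumed figures, for illustration only. */
        unsigned frame_bytes     = 512;  /* assumed largest spilled frame        */
        unsigned line_bytes      = 64;   /* assumed cache-line size              */
        unsigned l2_write_cycles = 20;   /* assumed cycles to write a line to L2 */
        unsigned icache_miss     = 40;   /* assumed cycles to fetch the callee's */
                                         /* first line                           */

        unsigned lines = (frame_bytes + line_bytes - 1) / line_bytes;
        printf("rough worst-case call latency: %u cycles\n",
               lines * l2_write_cycles + icache_miss);
        return 0;
    }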
The spiller is lazy and, being a hunk of hardware, is buffer-granularity and not frame (or other language construct) granularity. For a horrible example of the difference, consider a hypothetical Mill with a 14-cycle hardware divide operation, whose first instruction contains:
div(..), calltr(...)
That is, it starts a divide and recursively calls itself conditionally. By the time the very first divide result comes out of the function unit, there are already thirteen other divides in flight in the FU, and we are 14 deep in nested function calls. The spiller is going to spill that result (the FU output latch will be getting a new value next cycle), but the frame (and belt) that the value belongs to is far away, and the retires for that frame may be temporally interleaved with the retires of many other frames. Remember: a call on the Mill looks (on the belt) like any other operation, like an add for example, and has (or rather appears to have) zero latency.
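A rough C analogue of the pattern being described (ordinary source code, not Mill asm): each activation starts a long-latency divide whose result is not needed until after the recursive call, so a machine that can overlap the two ends up with many divides in flight across many nested frames.

    #include <stdio.h>

    static long deep(long x, int depth)
    {
        long q = x / 7;                   /* long-latency divide starts here   */
        if (depth > 0)
            q += deep(x - 1, depth - 1);  /* recurse before q is actually used */
        return q;                         /* divide results are consumed on the
                                             way back out of the recursion     */
    }

    int main(void)
    {
        printf("%ld\n", deep(1000, 14));  /* 14 nested frames, as in the example */
        return 0;
    }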
Consequently, as in most things, spill/fill on the Mill doesn’t work like a conventional machine. We don’t spill frames, we spill values, and only when we need the space for something else. And this is true between the belt latches and the spiller buffers, between the buffers and the SRAM, and between the SRAM and the memory hierarchy.
Also, “frame” is an overloaded term, meaning on the one hand the stack-allocated locals of a function activation, and on the other hand the activation itself; this can be confusing. The spiller has nothing to do with the former sense; the program-declared local data is explicitly written to memory by the program, just as in a conventional. The spiller is concerned with internal and transient state. On a conventional this too is explicitly written to memory by compiler-inserted preamble/postamble code, or equivalent asm code for those writing in assembler. Not so on a Mill; the internal state is in general not explicitly accessible by the program, and save/restore is done by the spiller.
Consequently, spiller performance is dissociated from programming language constructs, and is constrained only by bandwidth. Certain program actions, if sustained for long enough, can generate internal and transient values at a rate large enough to overwhelm the spiller bandwidth capacity; you will stall. Granted, you will have to work to make it happen, but you can do it.
However, the stalls induced by the spiller will still leave your program running faster than on a conventional machine (which of course has to save the same information too) using explicit code to save and restore the general registers with every one of those recursive calls.
Yes.
But a glossary/terminology note: we call them “frame ids” or “frame depth numbers”, not “belt ids”, because a frame id identifies more than just belt operands: the frame’s scratchpad, frame-saved specRegs, static and dynamic down-stack linkage, return addresses, and so on. And “belt id” might be confused with “belt position number”.