Forum Replies Created
- in reply to: spiller work optimisation #2113
Recall that there is no physical belt as such; the belt is a semantic part of the programmer’s model, and family members are free to implement anything that provides that model. Also recall that the physical realization of the spiller has essentially three different parts: one that is tightly integrated with the functional units and deals with stalls; an intermediate layer that is essentially an SRAM buffer pool; and a back end that connects the buffers to the memory hierarchy and eventually DRAM for when spiller traffic exceeds the buffer capacity in one direction or the other. Each of these three parts has its own job and organization.
The back end is simplest: it collects pool entries into whole cache lines and injects them into the hierarchy, or vice versa. It has no knowledge of what the entries mean, only that entries must be moved to keep the pool roughly half full.
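To make the back end's job concrete, here is a toy Python sketch. The line size, pool capacity, and class names are all invented for exposition; real members would differ, and the point is only the "keep the pool roughly half full" policy.

```python
# Toy model of the spiller back end (illustrative only; LINE, POOL_CAP,
# and the data structures are assumptions, not the Mill's actual design).
from collections import deque

LINE = 8        # operand entries per cache line (assumed)
POOL_CAP = 64   # middle-end buffer-pool capacity (assumed)

class BackEnd:
    def __init__(self):
        self.pool = deque()       # middle-end buffer pool entries
        self.hierarchy = deque()  # stands in for the cache/DRAM side

    def balance(self):
        # Collect pool entries into whole "cache lines" and inject them
        # into the hierarchy, or vice versa, to keep the pool ~half full.
        while len(self.pool) > POOL_CAP // 2 + LINE:
            line = [self.pool.popleft() for _ in range(LINE)]
            self.hierarchy.append(line)       # spill a full line out
        while len(self.pool) < POOL_CAP // 2 - LINE and self.hierarchy:
            self.pool.extendleft(reversed(self.hierarchy.pop()))  # refill
```

The back end never inspects entry contents; `balance` moves whole lines based purely on occupancy.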
The front end, connected with the FUs, has the hardest job because timing and power are at their most critical. The problem is stalls. A modern CPU cannot stop on a dime; it can take several cycles to get a stall signal from one side of the core to the other – across just the core, mind, not across the whole chip; that takes much longer still. Consequently, when we issue an operation to an FU there will be a result coming out X cycles later, whether the core is stalled or not. And maybe a couple more after that, which were in flight inside the FU. The Mill uses result replay after a stall, not the reissue replay that is commonly used, so those results have to be caught and made available when the core starts up again. The spiller front end does that catch and replay. Each of those results winds up as an operand entry, initially in the FU output latch but soon (if it stays live) moved to one of the entries in the spiller in front of the buffer pool that the replay mechanism has access to.
But moving out of a latch happens only if the operand is still live. There are several different ways to handle the mapping from logical belt numbers to physical operand numbers/locations; each has its own advantages. However, all have some mechanism by which an operand can tell if it is still live. Note that there may be operands in a latch or buffer that are not yet “on the belt”, having been produced by an FU that is working through a stall. Liveness is updated every cycle, and loss of liveness is what “falling off the belt” means in the hardware.
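One possible logical-to-physical mapping can be sketched in a few lines of Python. This is purely illustrative: the belt length, the dictionary-based physical store, and the renaming scheme are assumptions for exposition, and real members may implement the mapping very differently.

```python
# Illustrative sketch of one logical-to-physical belt mapping (invented
# for exposition; liveness here is just "still has a physical slot").
BELT_LEN = 8  # assumed belt length

class Belt:
    def __init__(self):
        self.phys = {}       # physical slot id -> operand value
        self.map = []        # logical belt: map[0] is b0, the newest drop
        self.next_slot = 0

    def drop(self, value):
        slot = self.next_slot; self.next_slot += 1
        self.phys[slot] = value
        self.map.insert(0, slot)       # newest result becomes b0
        if len(self.map) > BELT_LEN:   # oldest position falls off...
            dead = self.map.pop()
            del self.phys[dead]        # ...which is loss of liveness

    def live(self, slot):
        return slot in self.phys
```

Here "falling off the belt" is nothing but the liveness update: the physical slot is reclaimed the cycle its logical name is pushed past the end.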
A call leaves the caller’s live operands still live, and the execution of the callee creates new live operands, and promptly the hardware runs out of latches/skid buffers for them all, so operands belonging to frames other than the issue frame are moved into the buffer pool in the middle end of the spiller. Notice that this is not just a matter of taking all the operands that exist at the point of call and moving them; it is possible for an FU to produce a result operand that does not belong to the current issue frame. For example, if I issue a long-running operation like a FP multiply and then immediately do a call, the program sees the multiply result as coming out after the call returns, but in reality it comes out as soon as the multiplier is done, in the middle of the callee even though the operand belongs to the caller. Likewise, if you start a multiply and then do a return op, there is no way to stop that multiply and you will get a result, which belongs to a callee that no longer exists and is dead on creation.
So much for background. You ask whether it could be of benefit to know that an operand, still on the belt, will not be referenced in the future; this could be indicated in the machine code in several ways, including those you suggest. However, all the ways refer to the operand by its logical belt number. So the first task would be to find the named operand and tell it that it is no longer live. The hardware that maps logical belt numbers to physical positions exists, and there’s a way to indicate that an arbitrary existing location is no longer live (used by return and branches to renumber the belt). So the hardware to do what you suggest exists. Would another way to use it pay?
The benefit boils down to the cost of having a redundant operand around, and how long before the operand would die anyway. The cost is pressure on the front end storage capacity and the buffer pool; by the time we get to the back end traffic has smoothed out enough that a few extra pool entries are largely irrelevant. The lifetime ends when there are enough later drops to push the redundant operand off the belt, or we transfer through a branch which discards the entry early. In general a redundant operand will only make it to the middle end pool if there is a call (or interrupt), so there is a saving in pool pressure iff the operand goes dead before a call without an intervening branch. That happens of course, but its frequency is heavily dependent on how call-heavy the code is.
Pressure in the front end is potentially a larger issue, but how big an issue will depend on the details of the hardware implementation, which will be member dependent across the line and will no doubt evolve with time. The hardware is not far enough along to get a good sense of whether front-end pressure will be a constraining factor for some, or any, Mills. The software sim is no use for this; we’d need Verilog sim of the hardware to know, and that’s not there yet.
What would the cost be? Primarily entropy in the encoding. If we take the common two-argument operations and add a “last use” bit, that would add two bits to each operation’s encoding. If we take the marginal cost of an exu op as 15 bits then the feature would increase code size by 13%, with impact in the caches and memory traffic. On the flow side, adding a last-use bit to things like function argument lists would cost one bit per morsel; with a typical 4-5 bit morsel size then the cost is over 20%.
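The arithmetic behind those percentages is worth making explicit; this back-of-envelope check uses only the figures from the paragraph above (the 15-bit exu cost and 4-5 bit morsel size).

```python
# Back-of-envelope check of the encoding-entropy figures in the post.
exu_bits    = 15          # marginal cost of an exu op (from the text)
last_use    = 2           # one "last use" bit per input of a 2-arg op
growth_exu  = last_use / exu_bits     # ~0.133, i.e. ~13% code growth

morsel_bits = 5           # typical morsel size (4-5 bits per the text)
growth_flow = 1 / morsel_bits         # 0.20, i.e. 20% (25% at 4 bits)
```

So even one bit per operand reference is a nontrivial tax: code size feeds straight into cache footprint and memory traffic.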
The alternative would be to use an op that specifies a live or dead list. There is such an op, called rescue, already; it is intended to keep live operands from having to be spilled because they are about to be pushed off the belt, but could be used for a kill operation too. It’s a flow op, and I’d guess might have a typical entropy cost around 30 bits or so. In its intended use it saves at least one spill/fill pair, which together will have roughly the same entropy, while avoiding the scratchpad pressure and latency that a spill involves.
However, if there is nothing to be rescued and rescue is just used as an operand killer then the cost is those 30 bits and the flow slot that the op would occupy, against the lowered spiller pressure. This is certainly not an obvious win, so we have deferred any decision until we can see how big an issue spiller pressure turns out to be in hardware practice.
I suspect this is a bit more of an answer than you expected 🙂
- in reply to: When should we expect the next talk? #2108
We want the next talk to be actual numbers, albeit from sim. We expect it will be a live demo, similar to the one we did demonstrating the configuration software we use. Of course, to make such a talk it has to work well enough to demo, and we are hard at work on that. It’s taken longer than expected (surprise!). A compiler, RTS, and the guts of an OS kernel is a lot of work. So, when? Soon, I think. For some value of soon 🙂
There are also some pending subjects for architecture talks, notably multicore and DMA/streaming. However we feel that those should wait on more implementation.
- in reply to: Linux port? #2088
A port is begun, but barely. We are still beating the compiler/specializer tool chain into shape, and have started implementing syscalls as outcalls to the host so that apps (such as benchmarks) have a file system to use. Some of those syscalls are native rather than outcalls; an example is the timing/clock facility. The next step down that path is to fill in libc and libc++. We don’t see any issues with that, just a lot of bug-catching.
Moving the OS from host to native is somewhat separate from this effort to get apps to work. There are three major parts involved: a BIOS, a microkernel, and Linux. However, when you look at it as an implementation, the task partitions differently: hardware-integrated system support (primarily memory management – PLB, TLB and its tables, but some spiller) that must be native; I/O (enough native to let drivers work, but the actual devices have to be host-side or a whole new piece of simulator); and the load end of the tool chain (linkers, loader and debuggers).
The expectation is that by the time the system can run an open-source BIOS and microkernel then a microkernel-based version of Linux (we still lean toward L4) will slide in on top relatively easily.
As for the timing of all this: as you know, we are a sweat-equity company and most of us are part-time and irregular, so planned schedules are impossible. We are actively seeking engineers who would like to help and earn a piece of the Mill, and have many openings. Interested?
- in reply to: Interrupt handling and SoEMT #2078
Caution: work in progress. The RTS is coming up, but is not supporting multithreading yet.
Currently the interrupt mechanism is factored out from the multithreading mechanism; while an interrupt can be delivered to an explicit core (and a trap or fault is always delivered to the core that encountered the event), the handler always runs in the thread that was interrupted in that core. That handler may in turn activate other threads of course.
When we get further in the kernel implementation, and have sim numbers for realistic code, then the idea of decorating a handler dispatch with a thread id to dispatch to may prove advantageous. We simply don’t know yet. One factor that might force such a design is if there are quanta problems in servicing interrupts in app threads. However, currently we are thinking that quanta will be associated with turfs and not with processes, so an interrupt, which typically will portal to a different turf, won’t run using the interruptee’s quanta.
We are reasonably confident that the current sim mechanism for external interrupts is incomplete, because there isn’t one 🙂 We do model internal events though, and I/O that is instigated by the core. The architecture admits interrupting of interrupts arbitrarily deeply, so in principle there is no need for priority in accepting the interrupt itself; priority dispersion would occur when the handler enabled a thread and yielded the core. Of course, that assumes that the handler will be relatively short-lived and well behaved.
We do have RPC well defined, albeit not yet working in the RTS. That is intra-core RPC though; intercore RPC is as yet as ill-defined as other forms of interrupt.
Sorry I can’t be more informative; we’ll have much better answers when the guts of the kernel are done.
- in reply to: Any comments on Imagine Stream Architecture? #2075
Imagine seems to be a straightforward implementation of a stream processor design, heavily indebted to early Cray designs. If your problem is stream-shaped then it should suit your needs. The market for it is graphics and network data-planes, and the competition is GPUs and custom proprietary network chips. It isn’t (and doesn’t claim to be) suitable for general-purpose applications.
The Mill is a general-purpose architecture. While we can do a better job on graphics or stream loads than other GP architectures, we are not suitable for the sort of dedicated heavy loads that one might want a GPU (or Imagine) for. The two architectures are complementary – I can see a network product with a Mill for the control plane and an Imagine for the data plane.
Thank you 🙂
About UTF-8: in general we do hardware, and hardware doesn’t deal with character sets. The hardware deals with bytes, and has no knowledge of what those bytes hold. We deal with, or don’t deal with, character sets only in our software. There will not be any “native” character type in the architecture itself, UTF-8 or otherwise.
For llvm you will get whatever llvm gives us; it’s the same with other possible third-party software for the Mill, such as gcc or Linux itself. In the diagnostics and listing of our own house-developed software I’m afraid that we have been very lax in worrying about localization; we use the standard ASCII that C++ gives us. That will have to change, we know, but we have more pressing matters right now.
As for reading UTF-8, what you need in the architecture is a funnel shifter attached to a streamer rather than a specialized load. We have some ideas in that direction, but all NYF.
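For readers unfamiliar with why a funnel shifter is the right primitive here: decoding UTF-8 is a variable-length bit-extraction job, which is exactly what a shifter fed from a byte stream does well. The following is a plain-software illustration of that extraction (no validation, and nothing Mill-specific about it); the function name is invented.

```python
def utf8_decode_one(buf, i):
    """Decode one UTF-8 code point starting at buf[i]; return (cp, next_i).
    Pure-software sketch of the variable-length extraction a funnel
    shifter plus streamer would perform in hardware. No error checking."""
    b = buf[i]
    if b < 0x80:                      # 1-byte form: 0xxxxxxx
        return b, i + 1
    elif b < 0xE0:                    # 2-byte form: 110xxxxx 10xxxxxx
        n, cp = 2, b & 0x1F
    elif b < 0xF0:                    # 3-byte form: 1110xxxx 10xxxxxx...
        n, cp = 3, b & 0x0F
    else:                             # 4-byte form: 11110xxx 10xxxxxx...
        n, cp = 4, b & 0x07
    for j in range(1, n):             # funnel in 6 payload bits per byte
        cp = (cp << 6) | (buf[i + j] & 0x3F)
    return cp, i + n
```

Note that nothing above requires a character type in the ISA: it is all byte loads, masks, and shifts, which is the point of the post.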
- in reply to: reducing data cache latency #2067
Several comments. First, the 3-cycle latency is a reasonable approximation for use in sims when we cannot yet measure the Verilog timing. The actual number will be heavily member dependent, and on process and clock rate too. Don’t count on that number, except for ball-park figuring.
Second, code using the vector registers is usually loop code, and the Mill software-pipelining of loops hides the actual latency if the data is in the cache.
Third, there is in fact a D$0 capability, although it is not a cache in the conventional sense the way that the D$1 is. The filing is in, but it needs pictures to explain. Sorry.
Fourth, on a well configured Mill member the load bandwidth is enough to keep up with the compute capacity, so if you are doing scalar compute then doing scalar load doesn’t slow you down. The only real advantage to doing a vector load followed by an explode and scalar compute is that it might save a little power compared to doing scalar load and scalar compute without the explode. Of course, if you can do vector compute then you certainly should do vector loads, and we support that.
- in reply to: spiller work optimisation #2128
Things like Fibonacci are simple enough that the compilers recognize tail recursion and turn them into loops on any machine – try it and see. What causes trouble is two-way recursion, such as depth-first search in binary trees; tail conversion gets rid of one of the calls, but the second remains. Research compilers replace the recursion with a parallel state stack, but I don’t know of any commercial compilers that do, and it’s not clear that the state stack is any better than the regular stack.
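The two-way case and the state-stack rewrite can be seen in a few lines. This Python sketch (tuple trees and the function names are my own invention for illustration) shows a binary-tree count done recursively and then with the parallel explicit stack; note the stack version still keeps per-node state, just in software rather than in call frames.

```python
# Two-way recursion vs. the explicit state-stack rewrite.
def count_rec(node):
    """node is (left, right) or None; counts nodes recursively."""
    if node is None:
        return 0
    left, right = node
    return 1 + count_rec(left) + count_rec(right)

def count_stack(node):
    """Same traversal with a parallel state stack instead of recursion."""
    total, stack = 0, [node]
    while stack:
        n = stack.pop()
        if n is not None:
            total += 1
            stack.extend(n)   # push both children; one call eliminated,
                              # the other becomes an explicit stack entry
    return total
```

Whether the explicit stack beats the call stack depends on how much state each frame carries, which is exactly the question for the spiller.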
Whether the Mill is better or worse than other architectures depends on the amount of state saved in a recursive call and what the save bandwidth is. On a genreg machine the state will be at least the PC and frame registers plus any live data state. The same is true on a Mill, plus any additional belt positions that are zombie; the Mill situation is equivalent to a callee-saves protocol in a genreg machine.
The number of zombies depends on the amount of work between the calls. The Mill call instruction marks all belt positions not occupied by arguments as invalid, so they won’t zombie in an immediately following call. The same occurs with branches. As other operations execute the invalids get pushed along the belt and eventually fall off, so the belt becomes populated with operands that might be zombie. Hence zombie spill increases with distance from the last control-flow transfer, either branch or the entry of the called function.
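A toy model makes the invalidation behavior easy to see. The belt length and list representation here are assumptions for illustration only; the point is that call resets the belt to arguments-plus-invalids, and each later drop pushes one position off the far end.

```python
# Toy model of call marking non-argument positions invalid, and of later
# drops pushing those invalids along and off (8-position belt assumed).
BELT_LEN = 8
INVALID = object()   # stands for a position marked invalid

def call(args):
    """Callee's belt starts with its arguments; all other positions
    are marked invalid, so they cannot zombie in a following call."""
    return (list(args) + [INVALID] * BELT_LEN)[:BELT_LEN]

def drop(belt, value):
    """A new result lands at b0; the last position falls off the belt."""
    return ([value] + belt)[:BELT_LEN]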
While it is easy to invent cases in which the whole belt is zombie, in practice recursive calls usually seem to be quite close to other control flow and the actual zombie count is ignorable.
- in reply to: spiller work optimisation #2126
The expectation is that for “reasonable” nesting depths the spiller will keep spilled operands in its internal SRAM space, so spills will never go into the memory hierarchy. The spiller does use high-water marks to anticipate the need to restore values from the hierarchy, but an abrupt sequence of a lot of function returns can overwhelm that prefetching and produce a stall, just as an abrupt sequence of calls can overwhelm the spiller’s bandwidth flushing operands to the hierarchy.
The software does have the ability to discard values from the belt; every branch transfer can do so. I agree that it is tempting to add that capability to the call operation, but we have not done so because of problems with encoding such an operation. The call can specify a belt-full of arguments, and having it also specify a potential belt-full of preserved operands would overwhelm the space available in the encoding.
There are alternative encodings, but these would significantly complicate the decoders. So we are leaving things as they are until we have enough gate-level sim of large programs to see whether there really is a problem with spiller bandwidth or with the size of the spiller SRAM.
- in reply to: spiller work optimisation #2117
Exactly. However, while we describe the belt and the spiller as two different physical units for clarity, in reality they are just two different ways of looking at the same collection of hardware. The rarely-mentioned guts is actually the skid buffers and their replay mechanism. It’s not possible to stall a function unit in mid-pipe, so either you throw away what it does and re-issue, or you catch what it does and replay the result, in effect re-retire.
Replay is hard if an operation can fault or if there are ordering hazards. In a conventional machine this is handled by OOO renaming, at some cost, but without OOO the hazards pretty much dictate use of re-issue, and even OOO uses re-issue, I suppose as much out of tradition as anything. The Mill’s NaRs and SSA belt eliminate the fault and hazard issues, and we have static ordering, so we can use replay. Replay is much cheaper than re-issue, especially with long pipes. However, replay means we have to catch results produced during stall, and for that we need skid buffers, and the ability to feed buffer contents to function units as we come back up from stalls.
Given that machinery, it’s only a small step to providing the paths by which the skid buffer contents can be moved to SRAM, in addition to the paths to the FUs. That’s the top of the spiller. The bottom of the spiller, which moves content from SRAM to the memory hierarchy, is largely independent.
Skid buffer replay must work at core-clock and full machine width, which is a huge amount of data with very tight timing. Consequently it doesn’t have time to decide if something is really needed as it comes out the back of the FU; a buffer has to catch it whether it is needed or not. This leads to holes in the buffers of course. Those holes get squeezed out later in the bucket-brigade that leads eventually to DRAM. Where “later” happens is up to the implementation.
- in reply to: Co-Exist and Merge: Not Supplant. #2110
There are a great many degrees of design freedom in configuring inter-core and inter-CPU relations; each has advantages, and drawbacks, which may be more or less important depending on the purpose. We already support one you mention, where the alien gets a system-wide window (we follow legacy practice and call it an “aperture”). Each component, active or passive, that has a notion of address space gets an aperture which is a hardware remapping window into the global shared address space. These apertures are mostly used for passive components such as RAMs and ROMs, all of which believe that they address from zero and must be placed in the global space by the aperture. However apertures are also meaningful for active components, possibly including alien CPUs.
The drawback to apertures is that, being hardware, each component interface only gets one of them, and while they have byte granularity, the space they describe must be contiguous. In contrast the turf system, backed in hardware by the PLB, permits arbitrary numbers of possibly overlapping regions. Of course, the alien device could be given its own turf and could attach in front of the PLB. Then the device would appear to be just another thread to the rest of the system, and cache sharing would fall out naturally. Lots of possibilities 🙂
I agree that putting a full-blown alien CPU on a Mill chip is unlikely to make much sense; a reasonable Mill should be able to emulate the alien as fast as the alien can run natively. Still, if a paying customer wanted it …
- in reply to: Tooling update #2102
We can work something out 🙂 Take a look at our recruitment page: http://millcomputing.com/join-us/
- in reply to: Tooling update #2100
Barely usable by us, and certainly not ready for anyone else. Of course, you could convert to the “us” category 🙂
- in reply to: The Compiler #2085
Generally close to the consumer is best because it reduces belt pressure; you compute a value early and it may have fallen off the belt before it gets used, forcing spill/fill ops to retain it. The exception is loads, which should be issued as early as possible while still retiring as late as possible; maximizing load deferral gives the memory hierarchy the most time to come up with a value.
Because the inputs are themselves computed as late as possible, usually you don’t have the inputs waiting around on the belt either. There are exceptions: when the inputs are call arguments, or when an input is used more than once, with an early consumer and a late consumer. However, all of these cases are examples of a general pattern: inputs must be produced early for outside reasons, and outputs must be consumed late also for outside reasons, so there is a range between production of input and consumption of output in which the op could be placed. It turns out that where you place the op makes no difference in this pattern; if you place it early then its outputs may have to be spilled, but if you place it late then its inputs may have to be spilled; the impact on belt pressure is roughly the same, assuming that the number of inputs and outputs are roughly the same. Because we are placing as late as possible in general, we place this pattern late too, just for simplicity even though it could have been placed earlier without affecting overall latency.
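The placement-indifference claim can be checked with a one-line occupancy model (the function and timeline are invented for illustration): count operand-cycles of belt occupancy, with inputs live from their production until the op, and outputs live from the op until their consumption.

```python
# Tiny model of belt pressure vs. op placement inside its free range.
def pressure(n_in, n_out, t_op, t_in=0, t_out=10):
    """Operand-cycles of belt occupancy: n_in inputs live on [t_in, t_op],
    n_out outputs live on [t_op, t_out]."""
    return n_in * (t_op - t_in) + n_out * (t_out - t_op)
```

With equal input and output counts, `n*(t_op) + n*(t_out - t_op)` is constant in `t_op`, so any placement in the range costs the same; with unequal counts, placement nearer the larger side wins.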
Incidentally, operation placement is equivalent to the NP-hard bin-packing problem, so it’s a game of heuristics because optimal placement is impractical.
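As a flavor of the kind of heuristic involved, here is first-fit decreasing, the textbook greedy approach to bin packing (this is a generic illustration, not the specializer's actual algorithm):

```python
# First-fit decreasing: sort items largest-first, put each into the first
# bin with room, opening a new bin when none fits. A classic bin-packing
# heuristic; near-optimal in practice, cheap to run.
def first_fit_decreasing(items, cap):
    bins = []
    for it in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + it <= cap:
                b.append(it)
                break
        else:
            bins.append([it])
    return bins
```

Real instruction placement juggles slots, latencies, and belt pressure simultaneously, but the shape of the solution is the same: a greedy pass with good ordering rather than a search for the optimum.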
Power is a different issue that our hardware team will address.
- in reply to: Equity crowdfunding? #2071
$1M will definitely not be enough when we go seriously into the hardware. We still have $1M available on the convertible Notes that you can buy today. However, Title II of the JOBS Act allows up to $5M, which will be enough for the FPGA. The next round after that will be around $25M; heavy semi is not cheap 🙂
Right now we are focused more on the software side, to get the tool chain up enough that we can run major benchmarks. The design is public enough that someone expert in CPUs can see it works, but big money tends not to be CPU-expert. Benchmarks will be more persuasive 🙂
- This reply was modified 9 years ago by Ivan Godard.